PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
| Authors | |
|---|---|
| Publication date | 2024 |
| Book title | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition |
| Book subtitle | CVPR 2024 : Seattle, Washington, USA, 16-22 June 2024 : proceedings |
| ISBN | |
| ISBN (electronic) | |
| Event | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition |
| Pages (from-to) | 13548-13558 |
| Publisher | Los Alamitos, California: IEEE Computer Society |
| Organisations | |
| Abstract | Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multi-modal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt containing a minimal set of parameters that is slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performance on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons. |
| Document type | Conference contribution |
| Note | With supplemental materials |
| Language | English |
| Published at | https://doi.org/10.48550/arXiv.2402.08657 https://doi.org/10.1109/CVPR52733.2024.01286 |
| Published at | https://openaccess.thecvf.com/content/CVPR2024/html/Dorkenwald_PIN_Positional_Insert_Unlocks_Object_Localisation_Abilities_in_VLMs_CVPR_2024_paper.html |
| Other links | https://quva-lab.github.io/PIN/ https://www.proceedings.com/76082.html |
| Downloads | Dorkenwald_PIN_Positional_Insert_Unlocks_Object_Localisation_Abilities_in_VLMs_CVPR_2024_paper (Accepted author manuscript); PIN_Positional_Insert_Unlocks_Object_Localisation_Abilities_in_VLMs (Final published version) |
| Supplementary materials | |
| Permalink to this page | |
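The abstract describes the core mechanism: a small, input-agnostic learnable tensor added to the frozen vision features, with only that tensor trained via a next-token-style objective. The following is a minimal, hypothetical sketch of that idea in NumPy; all names, shapes, and the toy regression loss are assumptions made for illustration, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch of the PIN idea: an input-agnostic, learnable spatial
# tensor is added to the frozen vision features before they reach the
# language model, and only that tensor is optimized.

rng = np.random.default_rng(0)

H = W = 4   # spatial grid of vision tokens (illustrative size)
D = 8       # feature dimension (illustrative size)

frozen_features = rng.normal(size=(H * W, D))  # frozen vision-encoder output
pin = np.zeros((H * W, D))                     # the only trainable parameters

def forward(features, pin):
    # The PIN is simply added to the frozen features ("slid inside" the VLM);
    # downstream, the frozen LLM would decode location tokens autoregressively.
    return features + pin

# Toy stand-in for the next-token prediction objective: pull the combined
# representation toward a target that encodes the object's position.
target = rng.normal(size=(H * W, D))

lr = 0.1
for _ in range(200):
    residual = forward(frozen_features, pin) - target
    grad = 2 * residual   # gradient of the sum-of-squares loss w.r.t. pin
    pin -= lr * grad      # frozen_features are never updated

loss = np.sum((forward(frozen_features, pin) - target) ** 2)
```

The key design point mirrored here is that the pretrained weights (`frozen_features` standing in for the whole frozen VLM) receive no gradient updates; only the tiny PIN tensor changes, which is what keeps the approach lightweight and scalable compared to supervised detection pipelines.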
