Leveraging Query Expansion and Reformulation for Image Retrieval With Large Language and Vision-Language Models

Open Access
Publication date 2024
Book title 21st International Conference on Content-Based Multimedia Indexing
Book subtitle CBMI 2024: September 18-20, 2024, Reykjavik, Iceland: conference proceedings
ISBN
  • 9798350378450
ISBN (electronic)
  • 9798350378443
Event 21st International Conference on Content-based Multimedia Indexing
Pages (from-to) 23-29
Number of pages 7
Publisher Piscataway, NJ: IEEE
Organisations
  • Faculty of Science (FNWI)
  • Faculty of Economics and Business (FEB) - Amsterdam Business School Research Institute (ABS-RI)
  • Faculty of Economics and Business (FEB)
  • Faculty of Economics and Business (FEB) - Amsterdam School of Economics Research Institute (ASE-RI)
Abstract
This research builds on novel text-based image retrieval (IR) methods that leverage vision-language models (VLMs) and large language models (LLMs). The study highlights the need for an image retrieval evaluation strategy that reflects how conversational IR systems are used in the real world, and introduces a novel evaluation framework for interactive ad-hoc text-based IR. Unimodal IR models that retrieve based on automatically generated image captions are compared against popular cross-modal IR models, and the latter are found to remain superior in performance. Several strategies for automated query expansion (QE) and query reformulation (QR) are explored: a generative LLM is prompted to generate keywords related to the original query, or to rephrase it, based on artificial user relevance feedback (RF) deployed in the evaluation framework. In particular, the captions of the relevant images retrieved in the first IR round are provided as context for the generative LLM. Our main observation is that retrieval models based on a large VLM, such as BLIP-2, benefit more from QE, and that QE strategies based on keyword extraction outperform QR alternatives based on summarization. However, the approach requires further investigation to determine how the results are influenced by the type and quality of the image annotations.
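The QE strategy described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the prompt wording, the function names (`build_qe_prompt`, `expand_query`), and the caption examples are all assumptions; `call_llm` stands in for any generative LLM API.

```python
def build_qe_prompt(query, feedback_captions, n_keywords=5):
    """Assemble a keyword-generation prompt, using the captions of images
    judged relevant in the first retrieval round as context (pseudo/artificial
    relevance feedback, as in the abstract)."""
    context = "\n".join(f"- {c}" for c in feedback_captions)
    return (
        f'The user searched for images with the query: "{query}".\n'
        f"Captions of images the user marked as relevant:\n{context}\n"
        f"List {n_keywords} short keywords related to the query and captions, "
        f"comma-separated."
    )


def expand_query(query, keywords):
    """QE by concatenation: append the generated keywords to the original query."""
    return query + " " + " ".join(keywords)


# Illustrative usage with made-up captions and keywords:
prompt = build_qe_prompt(
    "dog on a beach",
    ["a golden retriever running on sand", "a puppy playing near the sea"],
)
# keywords = call_llm(prompt)  # placeholder for any chat-completion endpoint
expanded = expand_query("dog on a beach", ["golden retriever", "sand", "sea"])
```

The QR (reformulation) variant would instead ask the LLM to rephrase the query as a single sentence summarizing the relevant captions, rather than to list keywords.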
Document type Conference contribution
Language English
Published at https://doi.org/10.1109/CBMI62980.2024.10859227
Other links https://www.proceedings.com/78720.html