Leveraging Query Expansion and Reformulation for Image Retrieval With Large Language and Vision-Language Models

Open Access
Publication date 2024
Book title 21st International Conference on Content-Based Multimedia Indexing
Book subtitle CBMI 2024: September 18-20, 2024, Reykjavik, Iceland: conference proceedings
ISBN
  • 9798350378450
ISBN (electronic)
  • 9798350378443
Event 21st International Conference on Content-based Multimedia Indexing
Pages (from-to) 23-29
Number of pages 7
Publisher Piscataway, NJ: IEEE
Organisations
  • Faculty of Science (FNWI)
  • Faculty of Economics and Business (FEB) - Amsterdam Business School Research Institute (ABS-RI)
  • Faculty of Economics and Business (FEB)
  • Faculty of Economics and Business (FEB) - Amsterdam School of Economics Research Institute (ASE-RI)
Abstract
This research builds on novel text-based image retrieval (IR) methods that leverage vision-language models (VLMs) and large language models (LLMs). The study highlights the need for an image retrieval evaluation strategy that reflects how conversational IR systems are used in the real world, and introduces a novel evaluation framework for interactive ad-hoc text-based IR. Unimodal IR models that retrieve based on automatically generated image captions are compared against popular cross-modal IR models, and the latter are found to remain superior in performance. Several strategies for automated query expansion (QE) and query reformulation (QR) are explored: a generative LLM is prompted to generate keywords related to the original query, or to rephrase it, based on artificial user relevance feedback (RF) deployed in the evaluation framework. In particular, the captions of the relevant images retrieved in the first IR round are provided as context for the generative LLM. Our main observation is that retrieval models based on a large VLM, such as BLIP-2, benefit more from QE, and that QE strategies based on keyword extraction outperform QR alternatives based on summarization. However, the approach requires further investigation to determine how the results are influenced by the type and quality of the image annotations.
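The QE strategy described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the prompt wording, the function names (`build_qe_prompt`, `expand_query`), and the caption examples are all assumptions; `call_llm` stands in for any generative LLM API.

```python
def build_qe_prompt(query, feedback_captions, n_keywords=5):
    """Assemble a keyword-generation prompt, using the captions of images
    judged relevant in the first retrieval round as context (pseudo/artificial
    relevance feedback, as in the abstract)."""
    context = "\n".join(f"- {c}" for c in feedback_captions)
    return (
        f'The user searched for images with the query: "{query}".\n'
        f"Captions of images the user marked as relevant:\n{context}\n"
        f"List {n_keywords} short keywords related to the query and captions, "
        f"comma-separated."
    )


def expand_query(query, keywords):
    """QE by concatenation: append the generated keywords to the original query."""
    return query + " " + " ".join(keywords)


# Illustrative usage with made-up captions and keywords:
prompt = build_qe_prompt(
    "dog on a beach",
    ["a golden retriever running on sand", "a puppy playing near the sea"],
)
# keywords = call_llm(prompt)  # placeholder for any chat-completion endpoint
expanded = expand_query("dog on a beach", ["golden retriever", "sand", "sea"])
```

The QR (reformulation) variant would instead ask the LLM to rephrase the query as a single sentence summarizing the relevant captions, rather than to list keywords.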
Document type Conference contribution
Language English
Published at https://doi.org/10.1109/CBMI62980.2024.10859227
Other links https://www.proceedings.com/78720.html