Beyond Coarse-Grained Matching in Video-Text Retrieval

doi:https://doi.org/10.1007/978-981-96-0908-6_2

Authors	Aozhu Chen Hazel Doughty Xirong Li Cees G.M. Snoek
Publication date	2025
Host editors	Minsu Cho Ivan Laptev Du Tran Angela Yao Hongbin Zha
Book title	Computer Vision – ACCV 2024
Book subtitle	17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings
ISBN	9789819609079
ISBN (electronic)	9789819609086
Series	Lecture Notes in Computer Science
Event	17th Asian Conference on Computer Vision, ACCV 2024
Volume | Issue number	III
Pages (from-to)	25-43
Publisher	Singapore: Springer Nature Singapore
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model’s ability to perceive subtle single-word differences, 2) our fine-grained evaluation highlights the difficulty models face in distinguishing such subtle variations. To enhance fine-grained understanding, we propose a new baseline that can be easily combined with current methods. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model’s ability to understand fine-grained differences.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1007/978-981-96-0908-6_2
Other links	https://www.scopus.com/pages/publications/85213042000
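
The evaluation idea in the abstract, hard negative captions that differ from the original in a single word, can be illustrated with a minimal sketch. This record does not describe the paper's actual generation procedure, so everything here is an assumption for illustration: the `SWAPS` table, the function names `hard_negatives` and `fine_grained_accuracy`, and the toy scoring function are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical substitution table: word -> (part of speech, replacement).
# These pairs are illustrative assumptions, not taken from the paper.
SWAPS: Dict[str, Tuple[str, str]] = {
    "dog": ("noun", "cat"),
    "opens": ("verb", "closes"),
    "red": ("adjective", "blue"),
    "slowly": ("adverb", "quickly"),
    "on": ("preposition", "under"),
}


def hard_negatives(caption: str) -> List[Tuple[str, str]]:
    """Return (part_of_speech, negative_caption) pairs, each differing
    from the input caption in exactly one word."""
    words = caption.split()
    negatives = []
    for i, word in enumerate(words):
        swap = SWAPS.get(word.lower())
        if swap is None:
            continue
        pos, replacement = swap
        changed = words[:i] + [replacement] + words[i + 1:]
        negatives.append((pos, " ".join(changed)))
    return negatives


def fine_grained_accuracy(caption: str, score: Callable[[str], float]) -> float:
    """Fraction of hard negatives that `score` ranks strictly below the
    original caption. `score` stands in for a video-text similarity
    model evaluated against one fixed video (an assumption)."""
    negatives = hard_negatives(caption)
    if not negatives:
        return 0.0
    original = score(caption)
    wins = sum(1 for _, negative in negatives if score(negative) < original)
    return wins / len(negatives)


caption = "a red dog slowly opens the box on the table"
reference_words = set(caption.split())


def toy_score(text: str) -> float:
    # Toy stand-in for a model: word overlap with the true caption.
    return float(len(set(text.split()) & reference_words))


negatives = hard_negatives(caption)
accuracy = fine_grained_accuracy(caption, toy_score)
```

A real evaluation would replace `toy_score` with a video-text retrieval model's similarity score and would need a part-of-speech tagger and a vetted replacement vocabulary in place of the hand-written table.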
