VTC: Improving Video-Text Retrieval with User Comments

Authors
  • L. Hanu
  • J. Thewlis
  • Y.M. Asano
  • C. Rupprecht
Publication date 2022
Host editors
  • S. Avidan
  • G. Brostow
  • M. Cissé
  • G.M. Farinella
  • T. Hassner
Book title Computer Vision – ECCV 2022
Book subtitle 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings
ISBN
  • 9783031198328
ISBN (electronic)
  • 9783031198335
Series Lecture Notes in Computer Science
Event European Conference on Computer Vision (ECCV), 2022
Volume XXXV
Pages (from-to) 616–633
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are well-correlated with the content. Thus, the current video-text retrieval literature largely focuses on video titles or audio transcripts, while ignoring user comments, since users often tend to discuss topics only vaguely related to the video. Despite the ubiquity of user comments online, there are currently no multi-modal representation learning datasets that include comments. In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; c) show that by using comments, our method is able to learn better, more contextualised representations for images, videos and audio. Project page: https://unitaryai.github.io/vtc-paper.
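For intuition only: the attention-based mechanism mentioned in the abstract can be read as the video embedding attending over the set of comment embeddings, so that vaguely related comments receive low weight before fusion. The PyTorch sketch below illustrates that general idea under our own assumptions; the function name aggregate_comments, the single-head scaled dot-product form, the temperature parameter, and the residual fusion are illustrative choices, not the authors' implementation (see the paper at the DOI below for the actual method).

```python
import torch
import torch.nn.functional as F


def aggregate_comments(video_emb: torch.Tensor,
                       comment_embs: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    """Fuse a video embedding with a set of comment embeddings via
    dot-product attention, so loosely related comments get low weight.

    video_emb:    (batch, dim)
    comment_embs: (batch, n_comments, dim)
    """
    dim = video_emb.shape[-1]
    # Relevance score of each comment for its video (scaled dot product).
    scores = torch.einsum("bd,bnd->bn", video_emb, comment_embs) / dim ** 0.5
    # Attention weights over the comment set; off-topic comments -> near zero.
    weights = F.softmax(scores / temperature, dim=-1)
    # Weighted sum of comment embeddings: a "comment context" vector.
    context = torch.einsum("bn,bnd->bd", weights, comment_embs)
    # Residual fusion, L2-normalised for cosine-similarity retrieval.
    return F.normalize(video_emb + context, dim=-1)


# Toy usage: 2 videos, 8 comments each, 512-d embeddings.
video = torch.randn(2, 512)
comments = torch.randn(2, 8, 512)
fused = aggregate_comments(video, comments)
print(fused.shape)  # torch.Size([2, 512])
```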
Document type Conference contribution
Note With supplementary material. Correction published online 10 January 2023.
Language English
Published at https://doi.org/10.1007/978-3-031-19833-5_36
Other links
  • https://doi.org/10.1007/978-3-031-19833-5_43
  • https://unitaryai.github.io/vtc-paper