Predicting Visual Features from Text for Image and Video Caption Retrieval

J. Dong; X. Li; C.G.M. Snoek

doi:https://doi.org/10.1109/TMM.2018.2832602

Authors	J. Dong; X. Li; C.G.M. Snoek
Publication date	12-2018
Journal	IEEE Transactions on Multimedia
Volume | Issue number	20 | 12
Pages (from-to)	3377-3388
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multiscale sentence vectorization and further transferred into a deep visual feature of choice via a simple multilayer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both three-dimensional convolutional neural network features and a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset, and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition, and its state-of-the-art results.
Document type	Article
Language	English
Published at	https://doi.org/10.1109/TMM.2018.2832602 (Final published version)
Other links	https://ivi.fnwi.uva.nl/isis/publications/2018/DongTMM2018
Downloads	Predicting Visual Features from Text for Image and Video Caption Retrieval (Final published version)
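
The pipeline outlined in the abstract (encode a sentence, map it into a visual feature space with a multilayer perceptron, then pick the caption closest to the image or video feature) could be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the vocabulary, the layer sizes, the bag-of-words encoder standing in for the multiscale sentence vectorization, the untrained random weights, and the use of cosine similarity for ranking are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and layer sizes; none of these come from the paper.
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "grass", "sofa"]
HIDDEN_DIM, VISUAL_DIM = 16, 32


def encode_sentence(sentence):
    """Bag-of-words counts, a stand-in for the multiscale sentence vectorization."""
    vec = np.zeros(len(VOCAB))
    for word in sentence.lower().split():
        if word in VOCAB:
            vec[VOCAB.index(word)] += 1.0
    return vec


# Randomly initialised MLP weights; training is omitted in this sketch.
W1 = rng.normal(size=(len(VOCAB), HIDDEN_DIM))
W2 = rng.normal(size=(HIDDEN_DIM, VISUAL_DIM))


def predict_visual_feature(sentence):
    """Map a sentence to a vector in the (assumed) visual feature space."""
    hidden = np.maximum(0.0, encode_sentence(sentence) @ W1)  # ReLU layer
    return hidden @ W2


def rank_captions(visual_feature, captions):
    """Sort captions by cosine similarity to the visual feature, best first."""
    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom > 0 else 0.0

    scored = [(cosine(predict_visual_feature(c), visual_feature), c) for c in captions]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

In actual use, `visual_feature` would come from a pretrained visual network applied to the query image or video, and the MLP weights would be learned from caption and feature pairs; both steps are left out here.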