Evaluating the Representational Hub of Language and Vision Models

Open Access
Authors
Publication date 2019
Host editors
  • S. Dobnik
  • S. Chatzikyriakidis
  • V. Demberg
Book title Proceedings of the 13th International Conference on Computational Semantics - Long Papers
Book subtitle IWCS 2019 : 23-27 May, 2019, University of Gothenburg, Gothenburg, Sweden
ISBN (electronic)
  • 9781950737192
Event 13th International Conference on Computational Semantics
Pages (from-to) 211-222
Publisher Stroudsburg, PA: The Association for Computational Linguistics
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
  • Faculty of Science (FNWI)
Abstract
The multimodal models used in the emerging field at the intersection of computational linguistics and computer vision implement the bottom-up processing of the “Hub and Spoke” architecture proposed in cognitive science to represent how the brain processes and combines multi-sensory inputs. In particular, the Hub is implemented as a neural network encoder. We investigate the effect on this encoder of various vision-and-language tasks proposed in the literature: visual question answering, visual reference resolution, and visually grounded dialogue. To measure the quality of the representations learned by the encoder, we use two kinds of analyses. First, we evaluate the encoder pre-trained on the different vision-and-language tasks on an existing “diagnostic task” designed to assess multimodal semantic understanding. Second, we carry out a battery of analyses aimed at studying how the encoder merges and exploits the two modalities.
Document type Conference contribution
Language English
Published at https://doi.org/10.18653/v1/W19-0418