Be Different to Be Better!

S. Pezzelle; C. Greco; G. Gandolfi; E. Gualdoni; R. Bernardi

doi:https://doi.org/10.18653/v1/2020.findings-emnlp.248

Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision

Authors	S. Pezzelle; C. Greco; G. Gandolfi; E. Gualdoni; R. Bernardi
Publication date	2020
Host editors	T. Cohn; Y. He; Y. Liu
Book title	Findings of the Association for Computational Linguistics: Findings of ACL: EMNLP 2020
Book subtitle	16-20 November, 2020
ISBN (electronic)	9781952148903
Event	2020 Conference on Empirical Methods in Natural Language Processing
Pages (from-to)	2751-2767
Number of pages	17
Publisher	Stroudsburg, PA: The Association for Computational Linguistics
Organisations	Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made in developing universal multimodal encoders suitable for virtually any language and vision task. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where each modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against human speakers. We show that, while the task is relatively easy for humans, the best-performing models struggle to achieve similar results.
Document type	Conference contribution
Language	English
Related dataset	Be Different to Be Better (BD2BB)
Published at	https://doi.org/10.18653/v1/2020.findings-emnlp.248
Downloads	2020.findings-emnlp.248-1 (Final published version)