Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze

Open Access
Authors
  • Ece Takmaz
  • Sandro Pezzelle
  • Lisa Beinborn
  • Raquel Fernández
Publication date 2020
Host editors
  • B. Webber
  • T. Cohn
  • Y. He
  • Y. Liu
Book title 2020 Conference on Empirical Methods in Natural Language Processing
Book subtitle EMNLP 2020: proceedings of the conference: November 16–20, 2020
ISBN (electronic)
  • 9781952148606
Event 2020 Conference on Empirical Methods in Natural Language Processing
Pages (from-to) 4664–4677
Publisher Stroudsburg, PA: The Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI)
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural—particularly when gaze is encoded with a dedicated recurrent component.
Document type Conference contribution
Language English
Published at https://doi.org/10.18653/v1/2020.emnlp-main.377
Downloads
2020.emnlp-main.377 (Final published version)