Describing Images Fast and Slow: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes

E. Takmaz; S. Pezzelle; R. Fernández

DOI https://doi.org/10.18653/v1/2024.eacl-long.126

Publication date 2024

Host editors Y. Graham; M. Purver

Book title The 18th Conference of the European Chapter of the Association for Computational Linguistics : Proceedings of the Conference

Book subtitle EACL 2024 : March 17-22, 2024

ISBN (electronic) 9798891760882

Event 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024

Volume | Issue number 1

Pages (from-to) 2072-2087

Number of pages 16

Publisher Kerrville, TX: Association for Computational Linguistics

Organisations Interfacultary Research - Institute for Logic, Language and Computation (ILLC)

Abstract
There is an intricate relation between the properties of an image and how humans behave while describing the image. This behavior shows ample variation, as manifested in human signals such as eye movements and when humans start to describe the image. Despite the value of such signals of visuo-linguistic variation, they are virtually disregarded in the training of current pretrained models, which motivates further investigation. Using a corpus of Dutch image descriptions with concurrently collected eye-tracking data, we explore the nature of the variation in visuo-linguistic signals, and find that they correlate with each other. Given this result, we hypothesize that variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation. Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.

Document type Conference contribution

Note With supplementary video

Language English

Published at
https://doi.org/10.18653/v1/2024.eacl-long.126 (Final published version)

Other links
https://www.scopus.com/pages/publications/85189942080

Downloads
2024.eacl-long.126 (Final published version)

Supplementary materials
2024.eacl-long.126


UvA-DARE

Digital Academic Repository
