I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue

Open Access
Authors
Publication date 2025
Host editors
  • W. Che
  • J. Nabende
  • E. Shutova
  • M.T. Pilehvar
Book title The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025): Findings of the Association for Computational Linguistics: ACL 2025
Book subtitle ACL 2025: July 27–August 1, 2025
ISBN (electronic)
  • 9798891762565
Event 63rd Annual Meeting of the Association for Computational Linguistics
Pages (from-to) 13191–13206
Publisher Kerrville, TX: Association for Computational Linguistics
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
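
The abstract describes a self-supervised pre-training approach that grounds body movements in spoken language. The sketch below illustrates one common way such grounding can be set up, assuming a CLIP-style symmetric contrastive (InfoNCE) objective between a pose-sequence encoder and a speech-feature encoder; the encoders, feature dimensions, and objective shown here are illustrative assumptions, not the paper's actual architecture or training recipe.

```python
# Minimal sketch of speech-grounded gesture pre-training (assumed setup, not the
# authors' exact method): a gesture encoder over pose keypoints and a speech
# encoder are aligned with a symmetric InfoNCE (contrastive) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureEncoder(nn.Module):
    """Encodes a sequence of pose keypoints (B, T, in_dim) into one embedding."""
    def __init__(self, in_dim: int, hidden: int = 256, out_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(poses)                      # h: (1, B, hidden)
        return F.normalize(self.proj(h[-1]), dim=-1)

class SpeechEncoder(nn.Module):
    """Placeholder speech encoder: mean-pools precomputed speech features."""
    def __init__(self, in_dim: int, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(feats.mean(dim=1)), dim=-1)

def contrastive_loss(g: torch.Tensor, s: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: the matching gesture-speech pair is the positive,
    all other pairs in the batch act as negatives."""
    logits = g @ s.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for paired gesture/speech segments.
gest_enc, speech_enc = GestureEncoder(in_dim=54), SpeechEncoder(in_dim=768)
poses = torch.randn(8, 120, 54)    # 8 gesture clips, 120 frames of keypoints
speech = torch.randn(8, 50, 768)   # paired speech features (e.g. wav2vec-style)
loss = contrastive_loss(gest_enc(poses), speech_enc(speech))
loss.backward()
```

Once pre-trained this way, the gesture encoder can be used on its own, which is consistent with the abstract's observation that multimodal gesture representations help even when speech is unavailable at inference time.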
Document type Conference contribution
Language English
Published at https://doi.org/10.18653/v1/2025.findings-acl.682