GROOViST: A Metric for Grounding Objects in Visual Storytelling

A. Surikuchi; S. Pezzelle; R. Fernández

doi:10.18653/v1/2023.emnlp-main.202

Authors	A. Surikuchi; S. Pezzelle; R. Fernández
Publication date	2023
Host editors	H. Bouamor; J. Pino; K. Bali
Book title	The 2023 Conference on Empirical Methods in Natural Language Processing
Book subtitle	EMNLP 2023 : Proceedings of the Conference : December 6-10, 2023
ISBN (electronic)	9798891760608
Event	2023 Conference on Empirical Methods in Natural Language Processing
Pages (from-to)	3331-3339
Publisher	Stroudsburg, PA: Association for Computational Linguistics
Organisations	Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	A proper evaluation of stories generated for a sequence of images—the task commonly referred to as visual storytelling—must consider multiple aspects, such as coherence, grammatical correctness, and visual grounding. In this work, we focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images. We analyze current metrics, both designed for this purpose and for general vision-text alignment. Given their observed shortcomings, we propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the fact that the order in which entities appear in the story and the image sequence may not match), and human intuitions on visual grounding. An additional advantage of GROOViST is its modular design, where the contribution of each component can be assessed and interpreted individually.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.18653/v1/2023.emnlp-main.202 (Final published version)
Other links	https://aclanthology.org/2023.emnlp-main.202.mp4
Downloads	2023.emnlp-main.202 (Final published version)