A Novel Evaluation Framework for Image2Text Generation

Open Access
Authors
Publication date 2024
Host editors
  • C. Siro
  • M. Aliannejadi
  • H.A. Rahmani
  • N. Craswell
  • C.L.A. Clarke
  • G. Faggioli
  • B. Mitra
  • P. Thomas
  • E. Yilmaz
Book title Proceedings of The First Workshop on Large Language Models for Evaluation in Information Retrieval (LLM4Eval 2024)
Book subtitle co-located with the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024) : Washington D.C., USA, July 18, 2024
Series CEUR Workshop Proceedings
Event 1st Workshop on Large Language Models for Evaluation in Information Retrieval, LLM4Eval 2024
Article number 4
Pages (from-to) 51-65
Number of pages 15
Publisher Aachen: CEUR-WS
Organisations
  • Faculty of Law (FdR) - Amsterdam Center for Law & Economics (ACLE)
  • Faculty of Economics and Business (FEB) - Amsterdam Business School Research Institute (ABS-RI)
  • Faculty of Law (FdR)
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Evaluating the quality of automatically generated image descriptions is challenging: it requires metrics that capture diverse aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable insights, its cost and time requirements limit its use. Existing automated metrics such as BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often correlate weakly with human judgment. We address this challenge by introducing a novel evaluation framework built on a modern large language model (LLM) capable of image generation, such as GPT-4 or Gemini. In the proposed framework, an input image is first fed into the image captioning model under evaluation, which generates a textual description. From this description, the LLM then creates a new image. We extract features from both the original and the LLM-created image and measure their similarity with a designated similarity metric. A high similarity score suggests that the captioning model has described the image accurately, while a low score reveals discrepancies and potential shortcomings in the model's performance. The framework requires no human-annotated reference captions, making it a practical tool for evaluating the effectiveness of image captioning models, and its efficacy is confirmed through human evaluation.
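The abstract outlines a three-step pipeline: caption the input image, regenerate an image from that caption, and score the similarity of the two images. The following is a minimal sketch of that pipeline, not the authors' implementation: it assumes BLIP as the captioning model under evaluation, Stable Diffusion as a stand-in for the paper's image-generating LLM (the paper names GPT-4 and Gemini), and CLIP image embeddings with cosine similarity as the feature extractor and similarity metric. All model checkpoints and the input path are illustrative.

```python
# Minimal sketch of the evaluation loop (assumptions noted in the text above).
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)
from diffusers import StableDiffusionPipeline


def describe(image: Image.Image) -> str:
    """Step 1: the captioning model under evaluation (here BLIP, as an
    illustrative choice) produces a textual description of the image."""
    proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base")
    inputs = proc(images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=30)
    return proc.decode(ids[0], skip_special_tokens=True)


def regenerate(description: str) -> Image.Image:
    """Step 2: re-create an image from the description. Stable Diffusion
    stands in here for the paper's image-generating LLM (GPT-4 / Gemini)."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1")
    return pipe(description).images[0]


def similarity(original: Image.Image, recreated: Image.Image) -> float:
    """Step 3: cosine similarity between CLIP image features of both images;
    CLIP is one possible choice of feature extractor / similarity metric."""
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    with torch.no_grad():
        feats = model.get_image_features(
            **proc(images=[original, recreated], return_tensors="pt"))
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return float(feats[0] @ feats[1])  # cosine similarity


original = Image.open("example.jpg")  # illustrative input path
score = similarity(original, regenerate(describe(original)))
print(f"similarity score: {score:.3f}")  # high = faithful description
```

In this setup only the captioning model in step 1 varies; the image generator and the similarity metric stay fixed across the captioning models being compared, so scores are comparable without any reference captions.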

Document type Conference contribution
Language English
Other links
  • https://ceur-ws.org/Vol-3752/
  • https://www.scopus.com/pages/publications/85203836646
Downloads
paper4-4 (Final published version)