Preference-based Evaluation Metrics for Web Image Search

doi:10.1145/3397271.3401146

Authors	X. Xie, J. Mao, Y. Liu, M. de Rijke, H. Chen, M. Zhang, S. Ma
Publication date	2020
Book title	SIGIR '20
Book subtitle	proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval : July 25-30, 2020, virtual event, China
ISBN (electronic)	9781450380164
Event	43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020
Pages (from-to)	369-378
Publisher	New York, NY: Association for Computing Machinery
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Following the success of Cranfield-like evaluation approaches in web search, web image search has also been evaluated with absolute judgments of (graded) relevance. However, recent research has found that collecting absolute relevance judgments may be difficult in image search scenarios due to the multi-dimensional nature of relevance for image results. Moreover, existing evaluation metrics based on absolute relevance judgments do not correlate well with search users' satisfaction perceptions in web image search. Unlike absolute relevance judgments, preference judgments do not require that relevance grades be pre-defined, i.e., how many levels to use and what those levels mean. Instead of considering each document in isolation, preference judgments consider a pair of documents and require judges to state their relative preference. Such preference judgments are usually more reliable than absolute judgments since the presence of (at least) two items establishes a certain context. While preference judgments have been studied extensively for general web search, there exists no thorough investigation of how preference judgments and preference-based evaluation metrics can be used to evaluate web image search systems. Compared to general web search, web image search may be an even better fit for preference-based evaluation because of its grid-based presentation style. The limited need for fresh results in web image search also makes preference judgments more reusable than for general web search. In this paper, we provide a thorough comparison of variants of preference judgments for web image search. We find that compared to strict preference judgments, weak preference judgments require less time and have better inter-assessor agreement. We also study how absolute relevance levels of two given images affect preference judgments between them. Furthermore, we propose a preference-based evaluation metric named Preference-Winning-Penalty (PWP) to compare two image search systems. The proposed PWP metric outperforms existing evaluation metrics based on absolute relevance judgments in terms of agreement with the system-level preferences of actual users.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1145/3397271.3401146
Downloads	xie-2020-preference-based (Accepted author manuscript)
UvA-DARE

Digital Academic Repository