A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues

Y. Zhang; P. Ren; M. de Rijke

doi:https://doi.org/10.18653/v1/2021.acl-long.436

Authors	Y. Zhang, P. Ren, M. de Rijke
Publication date	2021
Host editors	C. Zong, F. Xia, W. Li, R. Navigli
Book title	The 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing
Book subtitle	ACL-IJCNLP 2021 : proceedings of the conference : August 1-6, 2021
ISBN (electronic)	9781954085527
Event	The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
Volume | Issue number	1
Pages (from-to)	5612–5623
Publisher	Stroudsburg, PA: The Association for Computational Linguistics
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Conversational dialogue systems (CDSs) are hard to evaluate due to the complexity of natural language. Automatic evaluation of dialogues often shows insufficient correlation with human judgements. Human evaluation is reliable but labor-intensive. We introduce a human-machine collaborative framework, HMCEval, that can guarantee reliability of the evaluation outcomes with reduced human effort. HMCEval casts dialogue evaluation as a sample assignment problem, where we need to decide to assign a sample to a human or a machine for evaluation. HMCEval includes a model confidence estimation module to estimate the confidence of the predicted sample assignment, and a human effort estimation module to estimate the human effort should the sample be assigned to human evaluation, as well as a sample assignment execution module that finds the optimum assignment solution based on the estimated confidence and effort. We assess the performance of HMCEval on the task of evaluating malevolence in dialogues. The experimental results show that HMCEval achieves around 99% evaluation accuracy with half of the human effort spared, showing that HMCEval provides reliable evaluation outcomes while reducing human effort by a large amount.
Document type	Conference contribution
Note	With supplementary video
Language	English
Published at	https://doi.org/10.18653/v1/2021.acl-long.436 (Final published version)
Downloads	2021.acl-long.436 (Final published version)
Supplementary materials	2021.acl-long.436
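
The sample assignment idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the abstract says only that an optimum assignment is found from estimated confidence and effort, so the greedy heuristic below is a swapped-in simplification. The names `Sample`, `assign_to_humans`, and `effort_budget`, and all numeric values, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Sample:
    sample_id: int
    confidence: float  # hypothetical: estimated probability the machine label is correct
    effort: float      # hypothetical: estimated human cost to label it (must be > 0)


def assign_to_humans(samples: List[Sample], effort_budget: float) -> Set[int]:
    """Greedy heuristic (an assumption, not the paper's optimizer):
    route samples to humans in order of expected machine error removed
    per unit of human effort, until the effort budget is exhausted."""
    chosen: Set[int] = set()
    remaining = effort_budget
    ranked = sorted(
        samples,
        key=lambda s: (1.0 - s.confidence) / s.effort,
        reverse=True,
    )
    for s in ranked:
        # Skip any sample that no longer fits in the remaining budget.
        if s.effort <= remaining:
            chosen.add(s.sample_id)
            remaining -= s.effort
    return chosen


# Made-up inputs: four dialogue samples with illustrative estimates.
samples = [
    Sample(0, confidence=0.99, effort=5.0),
    Sample(1, confidence=0.55, effort=8.0),
    Sample(2, confidence=0.70, effort=4.0),
    Sample(3, confidence=0.95, effort=6.0),
]
human_ids = assign_to_humans(samples, effort_budget=12.0)
machine_ids = {s.sample_id for s in samples} - human_ids
print("human:", sorted(human_ids), "machine:", sorted(machine_ids))
```

With these made-up inputs, the two least-confident samples go to human evaluation and the rest stay with the machine. A greedy pass like this is not guaranteed to be optimal; it only shows how confidence and effort estimates can jointly drive the assignment decision.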
