ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
| Authors | |
|---|---|
| Publication date | 2025 |
| Host editors | |
| Book title | The 2025 Conference on Empirical Methods in Natural Language Processing : Proceedings of the Conference |
| Book subtitle | EMNLP 2025 : November 4-9, 2025 |
| ISBN (electronic) | |
| Event | 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 |
| Pages (from-to) | 4370-4387 |
| Publisher | Kerrville, TX: Association for Computational Linguistics |
| Organisations | |
| Abstract | A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations. |
| Document type | Conference contribution |
| Note | With checklist |
| Language | English |
| Published at | https://doi.org/10.18653/v1/2025.emnlp-main.217 |
| Other links | https://github.com/Smu-Tan/Remedy |
| Downloads | 2025.emnlp-main.217 (Final published version) |
| Supplementary materials | |
| Permalink to this page | |
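
The abstract describes reformulating MT evaluation as reward modeling over pairwise preference data rather than regression on noisy absolute ratings. As a minimal sketch of that idea, the snippet below shows a standard Bradley–Terry pairwise preference loss, which is the usual reward-modeling objective; ReMedy's actual architecture and training loss are defined in the paper, and the `reward_model` referenced in the usage comments is a hypothetical placeholder, not the authors' code.

```python
# Illustrative sketch only: a Bradley-Terry style pairwise preference loss,
# the standard objective for reward modeling. This is an assumption about
# the general technique, not ReMedy's exact implementation.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_chosen: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the human-preferred translation above the reward
    of the dispreferred one for the same source segment.

    score_chosen / score_rejected: shape (batch,) scalar rewards produced
    by a scoring model for the two candidate translations.
    """
    # -log sigmoid(r_chosen - r_rejected): minimized when the preferred
    # translation receives a strictly higher reward.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical usage, assuming a `reward_model` that scores
# (source, translation) pairs:
#   scores_better = reward_model(sources, better_translations)
#   scores_worse = reward_model(sources, worse_translations)
#   loss = pairwise_preference_loss(scores_better, scores_worse)
#   loss.backward()
```

Training on relative judgments this way only requires annotators to rank two translations rather than assign consistent absolute scores, which is how pairwise preference learning sidesteps the rating noise the abstract highlights.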