ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling

Open Access
Authors
Publication date 2025
Host editors
  • C. Christodoulopoulos
  • T. Chakraborty
  • C. Rose
  • V. Peng
Book title The 2025 Conference on Empirical Methods in Natural Language Processing: Proceedings of the Conference
Book subtitle EMNLP 2025: November 4-9, 2025
ISBN (electronic)
  • 9798891763326
Event 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Pages (from-to) 4370-4387
Publisher Kerrville, TX: Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at the segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality from pairwise preference data, resulting in more reliable evaluation. In extensive experiments across the WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
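The pairwise reward-modeling idea in the abstract can be illustrated with a Bradley-Terry-style preference loss: instead of fitting noisy absolute scores, the model is trained so that the preferred translation receives a higher scalar reward than the rejected one. The sketch below is illustrative only; the function names and scalar setup are assumptions, not the paper's actual implementation.

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for a single preference pair.

    P(chosen > rejected) = sigmoid(score_chosen - score_rejected);
    the loss is -log of that probability, so it shrinks as the reward
    margin between the preferred and rejected translation grows.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative scores a reward model might assign to two candidate
# translations of the same source segment (hypothetical values).
good_translation_score = 2.0
bad_translation_score = 0.5

# Correct ranking gives a small loss; inverted ranking gives a large one.
correct = preference_loss(good_translation_score, bad_translation_score)
inverted = preference_loss(bad_translation_score, good_translation_score)
print(f"correct ranking loss:  {correct:.4f}")
print(f"inverted ranking loss: {inverted:.4f}")
```

Training on such pairs only requires the model to order translations correctly, which is more robust to rater noise than regressing on raw quality scores, since two raters who disagree on absolute values often still agree on which translation is better.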
Document type Conference contribution
Note With checklist
Language English
Published at https://doi.org/10.18653/v1/2025.emnlp-main.217
Other links https://github.com/Smu-Tan/Remedy