ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
| Authors | |
|---|---|
| Publication date | 2025 |
| Host editors | |
| Book title | The 2025 Conference on Empirical Methods in Natural Language Processing : Proceedings of the Conference |
| Book subtitle | EMNLP 2025 : November 4-9, 2025 |
| ISBN (electronic) | |
| Event | 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 |
| Pages (from-to) | 4370-4387 |
| Publisher | Kerrville, TX: Association for Computational Linguistics |
| Organisations | |
| Abstract | A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations. |
| Document type | Conference contribution |
| Note | With checklist |
| Language | English |
| Published at | https://doi.org/10.18653/v1/2025.emnlp-main.217 |
| Other links | https://github.com/Smu-Tan/Remedy |
| Downloads | 2025.emnlp-main.217 (Final published version) |
| Supplementary materials | |
| Permalink to this page | |
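
The abstract describes reformulating MT evaluation as reward modeling over pairwise preference data rather than regression on noisy absolute ratings. As a minimal sketch of that idea, the snippet below shows a standard Bradley–Terry pairwise preference loss, which is the usual reward-modeling objective; ReMedy's actual architecture and training loss are defined in the paper, and the `reward_model` referenced in the usage comments is a hypothetical placeholder, not the authors' code.

```python
# Illustrative sketch only: a Bradley-Terry style pairwise preference loss,
# the standard objective for reward modeling. This is an assumption about
# the general technique, not ReMedy's exact implementation.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_chosen: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the human-preferred translation above the reward
    of the dispreferred one for the same source segment.

    score_chosen / score_rejected: shape (batch,) scalar rewards produced
    by a scoring model for the two candidate translations.
    """
    # -log sigmoid(r_chosen - r_rejected): minimized when the preferred
    # translation receives a strictly higher reward.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical usage, assuming a `reward_model` that scores
# (source, translation) pairs:
#   scores_better = reward_model(sources, better_translations)
#   scores_worse = reward_model(sources, worse_translations)
#   loss = pairwise_preference_loss(scores_better, scores_worse)
#   loss.backward()
```

Training on relative judgments this way only requires annotators to rank two translations rather than assign consistent absolute scores, which is how pairwise preference learning sidesteps the rating noise the abstract highlights.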