An Empirical Analysis of Machine Translation for Expanding Multilingual Benchmarks
| Authors | |
|---|---|
| Publication date | 2025 |
| Host editors | |
| Book title | Tenth Conference on Machine Translation : Proceedings of the Conference |
| Book subtitle | WMT 2025 : November 8-9, 2025 |
| ISBN (electronic) | |
| Event | 10th Conference on Machine Translation, WMT 2025 |
| Pages (from-to) | 1-30 |
| Publisher | Kerrville, TX: Association for Computational Linguistics |
| Organisations | |
| Abstract | The rapid advancement of large language models (LLMs) has introduced new challenges in their evaluation, particularly for multilingual settings. The shortage of evaluation data is especially pronounced in low-resource languages, where professional annotators are scarce, hindering fair progress across languages. In this work, we systematically investigate the viability of using machine translation (MT) as a proxy for evaluation in scenarios where human-annotated test sets are unavailable. Leveraging a state-of-the-art translation model, we translate datasets from four tasks into 198 languages and employ these translations to assess the quality and robustness of MT-based multilingual evaluation under different setups. We analyze task-specific error patterns, identifying when MT-based evaluation is reliable and when it produces misleading results. Our translated benchmark reveals that current language selections in multilingual datasets tend to overestimate LLM performance on low-resource languages. We conclude that although machine translation is not yet a fully reliable method for evaluating multilingual models, overlooking its potential means missing a valuable opportunity to track progress in non-English languages. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.18653/v1/2025.wmt-1.1 |
| Downloads | 2025.wmt-1.1 (Final published version) |
| Permalink to this page | |
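
The abstract above describes translating existing test sets into 198 languages with a state-of-the-art MT model before evaluating LLMs on the translated benchmarks. The sketch below is a rough illustration of that translation step only, assuming the openly available `facebook/nllb-200-distilled-600M` checkpoint via Hugging Face `transformers`; the model choice, function names, and example items are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of translating benchmark items with an off-the-shelf
# many-to-many MT model (assumption: NLLB-200 distilled checkpoint; the
# record above does not name the paper's translation model).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"  # covers ~200 FLORES-200 language codes


def translate_examples(texts, src_lang="eng_Latn", tgt_lang="swh_Latn", max_length=512):
    """Translate a list of benchmark items from src_lang into tgt_lang."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Force the decoder to start generating in the target language.
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    # Hypothetical benchmark item; a real expansion would iterate over each
    # task's test set and over every target language of interest.
    items = ["Which planet is known as the Red Planet?"]
    print(translate_examples(items, tgt_lang="fra_Latn"))
```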