Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval

Open Access
Authors
Publication date 2025
Host editors
  • Wanxiang Che
  • Joyce Nabende
  • Ekaterina Shutova
  • Mohammad Taher Pilehvar
Book title Findings of the Association for Computational Linguistics: ACL 2025
Book subtitle ACL 2025 : July 27-August 1, 2025
ISBN (electronic)
  • 9798891762565
Event 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Pages (from-to) 10428-10445
Number of pages 18
Publisher Kerrville, TX: Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13× smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models.
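The abstract reports retrieval quality via MRR@10 and Recall@10. As a point of reference, these standard IR metrics can be computed per query as in the minimal sketch below; the function names and inputs are illustrative and are not taken from the paper's released codebase.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents retrieved within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)
```

Corpus-level MRR@10 and Recall@10 are then the means of these per-query values over the query set.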

Document type Conference contribution
Language English
Published at https://doi.org/10.18653/v1/2025.findings-acl.543
Other links https://www.scopus.com/pages/publications/105028574934