Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval

Open Access
Authors
Publication date 2025
Host editors
  • Wanxiang Che
  • Joyce Nabende
  • Ekaterina Shutova
  • Mohammad Taher Pilehvar
Book title Findings of the Association for Computational Linguistics: ACL 2025
Book subtitle ACL 2025 : July 27-August 1, 2025
ISBN (electronic)
  • 9798891762565
Event 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Pages (from-to) 10428-10445
Number of pages 18
Publisher Kerrville, TX: Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13× smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models.
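The abstract reports retrieval quality via MRR@10 and Recall@10. As a point of reference, these standard IR metrics can be computed per query as in the minimal sketch below; the function names and inputs are illustrative and are not taken from the paper's released codebase.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents retrieved within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)
```

Corpus-level MRR@10 and Recall@10 are then the means of these per-query values over the query set.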

Document type Conference contribution
Language English
Published at https://doi.org/10.18653/v1/2025.findings-acl.543
Other links https://www.scopus.com/pages/publications/105028574934