A linguistically-informed comparison between multilingual BERT and language-specific BERT models: The case of differential object marking in Romanian

Open Access
Authors
Publication date 2025
Host editors
  • Galia Angelova
  • Maria Kunilovskaya
  • Marie Escribe
  • Ruslan Mitkov
Book title International Conference on Recent Advances in Natural Language Processing : RANLP 2025
Book subtitle Natural Language Processing in the Generative AI era: proceedings
ISBN (electronic)
  • 9789544520984
Event 15th International Conference on Recent Advances in Natural Language Processing
Pages (from-to) 1271-1281
Number of pages 11
Publisher Shoumen: INCOMA Ltd.
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
Current linguistic challenge datasets for language models focus on phenomena that exist in English. This may lead to a lack of attention to typological features beyond English. It is a particular issue for multilingual models, which may be biased towards English by their training data; this bias may be amplified if benchmarks are also English-centered. We present the syntactically and semantically complex language phenomenon of Differential Object Marking (DOM) in Romanian as a challenging Masked Language Modelling task and compare the performance of monolingual and multilingual models. Results indicate that Romanian-specific BERT models outperform equivalent multilingual models in representing this phenomenon.
Document type Conference contribution
Language English
Published at https://doi.org/10.26615/978-954-452-098-4-147
Published at https://acl-bg.org/proceedings/2025/RANLP%202025/pdf/2025.ranlp-1.147.pdf
Other links https://acl-bg.org/proceedings/2025/RANLP%202025/index.html