Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition

Open Access
Authors
Publication date 2020
Host editors
  • F. Karsdorp
  • M. McGillivray
  • A. Nerghes
  • M. Wevers
Book title Proceedings of the Workshop on Computational Humanities Research (CHR 2020)
Book subtitle Amsterdam, the Netherlands, November 18-20, 2020
Series CEUR Workshop Proceedings
Event 1st Workshop on Computational Humanities Research, CHR 2020
Pages (from-to) 310-339
Publisher Aachen: CEUR-WS
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
Transfer learning in Natural Language Processing, mainly in the form of pre-trained language models, has recently delivered substantial gains across a range of tasks. Scholars and practitioners working with OCRed historical corpora are thus increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by historical documents, including OCR quality and linguistic change, call for a critical assessment of pre-trained language models in this setting. We consider two shared tasks, ICDAR2019 (post-OCR correction) and CLEF-HIPE-2020 (Named Entity Recognition, NER), and systematically assess the use of pre-trained language models with data in French, German and English. We find that pre-trained language models help with NER but less so with post-OCR correction. Pre-trained language models should therefore be used critically when working with OCRed historical corpora. We release our code base to allow others to replicate our results and to test other pre-trained representations.
Document type Conference contribution
Language English
Published at http://ceur-ws.org/Vol-2723/long32.pdf
Other links http://ceur-ws.org/Vol-2723