Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition

Open Access
Authors
Publication date 2020
Host editors
  • F. Karsdorp
  • M. McGillivray
  • A. Nerghes
  • M. Wevers
Book title Proceedings of the Workshop on Computational Humanities Research (CHR 2020)
Book subtitle Amsterdam, the Netherlands, November 18-20, 2020
Series CEUR Workshop Proceedings
Event 1st Workshop on Computational Humanities Research, CHR 2020
Pages (from-to) 310-339
Publisher Aachen: CEUR-WS
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
Transfer learning in Natural Language Processing, mainly in the form of pre-trained language models, has recently delivered substantial gains across a range of tasks. Scholars and practitioners working with OCRed historical corpora are thus increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by historical documents, including OCR quality and linguistic change, call for a critical assessment of pre-trained language models in this setting. We consider two shared tasks, ICDAR2019 (post-OCR correction) and CLEF-HIPE-2020 (Named Entity Recognition, NER), and systematically assess the use of pre-trained language models with data in French, German and English. We find that pre-trained language models help with NER but less so with post-OCR correction. Pre-trained language models should therefore be used critically when working with OCRed historical corpora. We release our code base to allow others to replicate our results and to test other pre-trained representations.
Document type Conference contribution
Language English
Published at http://ceur-ws.org/Vol-2723/long32.pdf
Other links http://ceur-ws.org/Vol-2723