Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition
| Authors | |
|---|---|
| Publication date | 2020 |
| Host editors | |
| Book title | Proceedings of the Workshop on Computational Humanities Research (CHR 2020) |
| Book subtitle | Amsterdam, the Netherlands, November 18-20, 2020 |
| Series | CEUR Workshop Proceedings |
| Event | 1st Workshop on Computational Humanities Research, CHR 2020 |
| Pages (from-to) | 310-339 |
| Publisher | Aachen: CEUR-WS |
| Organisations | |
| Abstract | Transfer learning in Natural Language Processing, mainly in the form of pre-trained language models, has recently delivered substantial gains across a range of tasks. Scholars and practitioners working with OCRed historical corpora are thus increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by historical documents, including OCR quality and linguistic change, call for a critical assessment of pre-trained language models in this setting. We consider two shared tasks, ICDAR2019 (post-OCR correction) and CLEF-HIPE-2020 (Named Entity Recognition, NER), and systematically assess the use of pre-trained language models on data in French, German and English. We find that pre-trained language models help with NER but less so with post-OCR correction. Pre-trained language models should therefore be used critically when working with OCRed historical corpora. We release our code base to allow others to replicate our results and to test other pre-trained representations. |
| Document type | Conference contribution |
| Language | English |
| Published at | http://ceur-ws.org/Vol-2723/long32.pdf |
| Other links | http://ceur-ws.org/Vol-2723 |
| Downloads | long32 (Final published version) |
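As a rough illustration of what the abstract's "using pre-trained language models" for NER involves in practice, the sketch below fine-tunes a BERT-style checkpoint for token classification with the Hugging Face `transformers` library. This is not the authors' released code base: the model name, the simplified tag set, and the toy OCRed sentence are all placeholder assumptions.

```python
# Minimal sketch: fine-tuning a pre-trained language model for NER,
# in the spirit of the CLEF-HIPE-2020 setting described in the abstract.
# Model name, labels, and the example sentence are invented for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any BERT-style checkpoint
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # simplified tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

# Toy OCRed sentence with word-level gold labels (invented for illustration).
words = ["M.", "Dupont", "arriva", "a", "Paris"]
word_labels = ["O", "B-PER", "O", "O", "B-LOC"]

# Tokenize with word alignment so each word's first subword piece carries
# its label; special tokens and continuation pieces get -100, which the
# token-classification loss ignores.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
label_ids, previous = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None or word_id == previous:
        label_ids.append(-100)
    else:
        label_ids.append(LABELS.index(word_labels[word_id]))
    previous = word_id
labels = torch.tensor([label_ids])

# One fine-tuning step: the forward pass computes the classification loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs = model(**enc, labels=labels)
outputs.loss.backward()
optimizer.step()
```

In a real experiment this loop would run over the shared-task training splits with batching and evaluation; the point here is only the overall pattern of adapting a pre-trained representation to a downstream NER task.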
