Making PDFs Accessible for Visually Impaired Users (and Findable for Everybody Else)

Open Access
Authors
Publication date 2023
Host editors
  • O. Alonso
  • H. Cousijn
  • G. Silvello
  • M. Marrero
  • C. Teixeira Lopes
  • S. Marchesin
Book title Linking Theory and Practice of Digital Libraries
Book subtitle 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, September 26–29, 2023 : proceedings
ISBN
  • 9783031438486
ISBN (electronic)
  • 9783031438493
Series Lecture Notes in Computer Science
Event 27th International Conference on Theory and Practice of Digital Libraries
Pages (from-to) 239-245
Number of pages 6
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
We treat documents released under the Dutch Freedom of Information Act as FAIR scientific data and find that they are neither findable nor accessible, due to text malformations caused by redaction software. Our aim is to repair these documents. We propose a simple but strong heuristic for detecting wrongly OCRed text segments, and we then repair only these OCR mistakes by prompting a large language model. This makes the documents easier to find through full-text search, but the repaired PDFs still do not adhere to accessibility standards. Converting them into HTML documents, keeping all essential layout and markup, not only makes them accessible to the visually impaired but also reduces their size by up to two orders of magnitude. The cost of repairing the documents this way is roughly one dollar for the 17K pages in our corpus, which is very little compared to the large gains in information quality.
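The abstract does not spell out the detection heuristic, so the following is only a rough sketch of the general "flag, then repair only what was flagged" idea, not the authors' method. It assumes a known-word ratio as a stand-in heuristic; the names `KNOWN_WORDS`, `known_word_ratio`, `flag_suspect_segments`, `repair_segments`, the 0.5 threshold, and the `repair_fn` callback (where a language-model call would go) are all hypothetical.

```python
import re

# Hypothetical toy vocabulary; a real run would load a full word list.
KNOWN_WORDS = {"the", "minister", "has", "decided", "to",
               "release", "this", "document"}


def known_word_ratio(segment: str, vocabulary: set[str]) -> float:
    """Return the fraction of alphabetic tokens found in the vocabulary."""
    tokens = re.findall(r"[^\W\d_]+", segment.lower())
    if not tokens:
        return 1.0  # nothing to judge, so do not flag the segment
    return sum(token in vocabulary for token in tokens) / len(tokens)


def flag_suspect_segments(segments, vocabulary, threshold=0.5):
    """Return indices of segments whose known-word ratio is below threshold.

    Assumption: a low ratio signals likely OCR damage. The threshold is
    illustrative, not a value taken from the paper.
    """
    return [i for i, segment in enumerate(segments)
            if known_word_ratio(segment, vocabulary) < threshold]


def repair_segments(segments, suspect_indices, repair_fn):
    """Apply repair_fn only to flagged segments; leave the rest untouched.

    repair_fn is a placeholder for whatever performs the repair, such as
    a function that prompts a large language model.
    """
    repaired = list(segments)
    for i in suspect_indices:
        repaired[i] = repair_fn(segments[i])
    return repaired


segments = [
    "The minister has decided to release this document",
    "Tbe rninister bas decidcd",
]
suspects = flag_suspect_segments(segments, KNOWN_WORDS)
```

Restricting the repair step to flagged segments is what would keep the number of model calls, and therefore the cost, low: clean text is never sent for correction.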
Document type Conference contribution
Language English
Related dataset Increasing Accessibility of Government Documents Dataset
Published at https://doi.org/10.1007/978-3-031-43849-3_21
Other links https://github.com/irlabamsterdam/accessibilifier
Downloads
978-3-031-43849-3_21 (Final published version)