Detection of Redacted Text in Legal Documents

Open Access
Authors
Publication date 2023
Host editors
  • O. Alonso
  • H. Cousijn
  • G. Silvello
  • M. Marrero
  • C. Teixeira Lopes
  • S. Marchesin
Book title Linking Theory and Practice of Digital Libraries
Book subtitle 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, September 26–29, 2023 : proceedings
ISBN
  • 9783031438486
ISBN (electronic)
  • 9783031438493
Series Lecture Notes in Computer Science
Event 27th International Conference on Theory and Practice of Digital Libraries
Pages (from-to) 310-316
Number of pages 6
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
We present a technique for automatically detecting redacted text in legal documents. It combines Optical Character Recognition (OCR) with morphological operations from the Computer Vision domain, which allows it to detect a wide variety of redaction block types with little to no training data. As this is a segmentation task, we evaluate the technique using the Panoptic Quality methodology: the algorithm obtains F1 scores of 0.79, 0.86 and 0.76 on black, colored and outlined redaction blocks respectively, and an F1 score of 0.62 on gray blocks. The total running time of the algorithm is two seconds on average, measured on a thousand pages from a government supplier; part of this time is used by Tesseract and the conversion from PDF to PNG, and the remainder by the detection algorithm. Detecting text redaction at scale is thus feasible, allowing a more or less objective measurement of this practice. The redacted text detection code and the manually labelled dataset created for evaluation are released via GitHub.
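The Panoptic Quality evaluation named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes redaction blocks are axis-aligned rectangles (the paper treats the task as segmentation, so real masks may be arbitrary shapes), uses the standard Panoptic Quality convention that a prediction matches a labelled block when their IoU exceeds 0.5, and assumes blocks within one set do not overlap each other. The names `Box`, `iou` and `panoptic_quality` are hypothetical. In this formulation the recognition-quality term equals an F1 score, which is presumably what the reported F1 values correspond to.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (x0, y0, x1, y1) with exclusive upper bounds.
Box = Tuple[int, int, int, int]


def area(b: Box) -> int:
    """Area of a rectangle; zero if the rectangle is empty or inverted."""
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two rectangles."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def panoptic_quality(pred: List[Box], gold: List[Box]) -> Dict[str, float]:
    """Return segmentation quality (sq), recognition quality (rq) and pq."""
    matched_ious: List[float] = []
    matched_gold = set()
    for p in pred:
        for j, g in enumerate(gold):
            if j in matched_gold:
                continue
            score = iou(p, g)
            if score > 0.5:  # standard PQ matching threshold
                matched_ious.append(score)
                matched_gold.add(j)
                break
    tp = len(matched_ious)
    fp = len(pred) - tp   # predictions without a matching labelled block
    fn = len(gold) - tp   # labelled blocks that were never matched
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return {"sq": 0.0, "rq": 0.0, "pq": 0.0}
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU over matches
    rq = tp / denom                             # identical to the F1 score
    return {"sq": sq, "rq": rq, "pq": sq * rq}
```

Scores would typically be computed separately for each redaction type (black, colored, outlined, gray) by passing only the blocks of that type.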
Document type Conference contribution
Language English
Related dataset Automatic Text Redaction Dataset
Published at https://doi.org/10.1007/978-3-031-43849-3_28
Other links https://github.com/irlabamsterdam/TPDLTextRedaction https://lakdetector.wooverheid.nl
Downloads
978-3-031-43849-3_28 (Final published version)