Low-resource OCR error detection and correction in French Clinical Texts

Abstract : In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. Traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training material, but these are not always available given the constraints placed on using medical corpora. We therefore propose a novel method that needs only a representative corpus of acceptable OCR quality in order to train models. Our method uses recurrent neural networks (RNNs) to model sequential information at the character level for a given medical text corpus. By inserting noise during the training process, we simultaneously learn the underlying character-level language model and learn to detect and eliminate random noise in the textual input. The resulting models are robust to variability in OCR quality and do not require additional external information such as lexicons. We compare two different ways of injecting noise into the training process and evaluate our models on a manually corrected data set. The best-performing system achieves 73% accuracy.
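The noise-injection idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the two noise schemes the authors compare, so everything below is an assumption made for illustration: the function names `inject_noise` and `make_training_pairs`, the `rate` and `alphabet` parameters, and the choice of random substitution, deletion and insertion as the noise operations. The sketch only shows how (noisy input, clean target) pairs could be built from a corpus of acceptable OCR quality; it does not include the RNN itself.

```python
import random

# Hypothetical character inventory; a real one would be derived from the corpus.
ALPHABET = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü '-"

def inject_noise(text, rate, rng, alphabet=ALPHABET):
    """Corrupt each character with probability `rate` (illustrative scheme only)."""
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)          # keep the character unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))
        # "delete": append nothing
    return "".join(out)

def make_training_pairs(lines, rate=0.1, seed=0):
    """Return (noisy_input, clean_target) pairs for a denoising model."""
    rng = random.Random(seed)
    return [(inject_noise(line, rate, rng), line) for line in lines]

if __name__ == "__main__":
    corpus = ["examen du placenta", "poids du foetus"]
    for noisy, clean in make_training_pairs(corpus, rate=0.2, seed=42):
        print(repr(noisy), "->", repr(clean))
```

A character-level model trained on such pairs would see corrupted text as input and the clean text as target, which is one way to combine language modelling with denoising without relying on a lexicon.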


Contributor : Limsi Publications
Submitted on : Wednesday, September 11, 2019 - 11:00:19 AM
Last modification on : Thursday, January 2, 2020 - 2:58:03 PM




  • HAL Id : hal-01831225, version 1


Eva d'Hondt, Cyril Grouin, Brigitte Grau. Low-resource OCR error detection and correction in French Clinical Texts. International Workshop on Health Text Mining and Information Analysis, ACL, Nov 2016, Austin, United States. ⟨hal-01831225⟩


