Conference paper. Year: 2024

Learning Reading Order via Document Layout with Layout2Pos

Abstract

Due to their remarkable performance, general-purpose multimodal pre-trained language models have gained widespread adoption for Document Understanding tasks. The majority of pre-trained language models rely on serialized text, extracted using either Optical Character Recognition (OCR) or PDF parsing. However, accurately determining the reading order of visually-rich documents (VrDs) is challenging, potentially affecting the accuracy of the extracted text and leading to sub-optimal performance in downstream tasks. For information extraction tasks, where entity recognition is commonly framed as a sequence-labeling task, incorrect reading order can hinder entity labeling. In this work, we avoid reading order issues by discarding sequential position information. Based on the intuition that layout contains the information for correct reading order, we present Layout2Pos – a shallow Transformer designed to generate position embeddings from layout. Incorporated into a BART architecture, our approach demonstrates competitiveness with models dependent on reading order across three benchmark datasets for information extraction. We also show that evaluating models using a reading order different from the one seen during training can result in substantial performance drops, thereby highlighting the importance of not relying on the reading order of documents.
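The idea of deriving position embeddings from layout alone can be illustrated with a small sketch. The abstract only states that Layout2Pos is a shallow Transformer that generates position embeddings from layout, so everything below is an assumption made for illustration: the single self-attention layer, the `(x0, y0, x1, y1)` box format, the embedding size, the random weights, and the names `layout_to_pos` and `init_params` are hypothetical and are not taken from the paper. The sketch shows one property implied by discarding sequential position information: shuffling the input order only shuffles the output rows, so each token receives the same embedding whatever reading order the OCR engine produced.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(d=8, seed=0):
    """Random weights standing in for trained parameters (hypothetical)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W_in": mat(4, d), "W_q": mat(d, d), "W_k": mat(d, d), "W_v": mat(d, d)}


def layout_to_pos(boxes, params):
    """Map n boxes (x0, y0, x1, y1) to n position embeddings of size d.

    One self-attention layer with a residual connection. No token index
    is fed in, so each box's embedding is independent of the input order.
    """
    h = matmul(boxes, params["W_in"])            # (n, d) projection of the boxes
    q = matmul(h, params["W_q"])
    k = matmul(h, params["W_k"])
    v = matmul(h, params["W_v"])
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]         # (d, n) transpose of the keys
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    attn = [softmax(row) for row in scores]      # (n, n) attention weights
    mixed = matmul(attn, v)                      # (n, d) attended values
    return [[a + b for a, b in zip(hr, mr)] for hr, mr in zip(h, mixed)]


# Three boxes with coordinates normalized to [0, 1] (made-up values).
boxes = [
    [0.1, 0.1, 0.4, 0.2],
    [0.5, 0.1, 0.9, 0.2],
    [0.1, 0.3, 0.9, 0.4],
]
params = init_params()
pos = layout_to_pos(boxes, params)

# Feed the same boxes in a different order, as a different OCR engine might.
perm = [2, 0, 1]
pos_shuffled = layout_to_pos([boxes[i] for i in perm], params)
```

Comparing `pos_shuffled[i]` with `pos[perm[i]]` shows that the two agree up to floating-point rounding. A model that adds a learned embedding of the token index would not have this property, which is one way to read the abstract's observation that changing the reading order at evaluation time can degrade performance.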

Main file
TPDL_2024___Layout2Pos.pdf (1.26 MB)
Source: files produced by the author(s)

Dates and versions

hal-04718874, version 1 (18-11-2024)


Cite

Laura Nguyen, Benjamin Piwowarski, Julio Laborde, Gilles Moyse. Learning Reading Order via Document Layout with Layout2Pos. Linking Theory and Practice of Digital Libraries, Sep 2024, Ljubljana, Slovenia. pp.3-19, ⟨10.1007/978-3-031-72437-4_1⟩. ⟨hal-04718874⟩