CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond

The surge in digitisation initiatives by Cultural Heritage institutions has made numerous historical manuscripts accessible online. However, a substantial portion of these documents exists solely as images, lacking machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into machine-readable formats, enabling researchers and scholars to analyse vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks remains challenging, particularly for complex and heterogeneous historical sources such as medieval manuscripts in Latin scripts (8th-15th century CE). We introduce the Consistent Approaches to Transcribing Manuscripts (CATMuS) dataset for medieval manuscripts, which offers (1) a uniform framework for annotation practices for medieval manuscripts, and a benchmarking environment (2) for evaluating automatic text recognition models across multiple dimensions thanks to rich metadata (century of production, language, genre, script, etc.), (3) for other tasks (such as script classification or dating approaches), and (4) for exploratory work in computer vision and digital paleography around line-based tasks, such as generative approaches. Developed through collaboration among various institutions and projects, CATMuS provides an inter-compatible dataset of more than 200 manuscripts and incunabula in 10 different languages, comprising over 160,000 lines of text and 5 million characters, and spanning the 8th to the 16th century. By keeping transcription approaches consistent, the dataset aims to mitigate the challenges arising from the diversity of standards for medieval manuscript transcription, and provides a comprehensive benchmark for evaluating HTR models on historical sources.

Keywords

Historical sources; medieval manuscripts; Latin scripts; benchmarking dataset; multilingual; handwritten text recognition

Domains

Computer Vision and Pattern Recognition [cs.CV]; Humanities and Social Sciences

Main file

ICDAR24___CATMUS_Medieval-1.pdf (2.41 MB)

Origin	Files produced by the author(s)
Licence	Attribution

Contributor: Thibault Clérice

https://inria.hal.science/hal-04453952

Submitted on: Monday, 12 February 2024, 20:10:53

Last modified on: Tuesday, 2 July 2024, 11:01:55

Long-term archiving on: Monday, 13 May 2024, 20:16:42

Dates and versions

hal-04453952, version 1 (12-02-2024)

Licence

Attribution

Identifiers

HAL Id: hal-04453952, version 1

Cite

Thibault Clérice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagué, Jean-Baptiste Camps, et al. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece. ⟨hal-04453952⟩

Collections

ENS-LYON UNIV-LYON3 UNIV-AVIGNON EPHE UNIV-PARIS8 CNRS INRIA UNIV-LYON2 UNIV-CERGY EHESS IRHT CIHAM UPEC INRIA2 PSL CHART ENC CAMPUS-CONDORCET CJM UNIV-PARIS-LUMIERES UDL UNIV-PARIS-NANTERRE UNIV-PARIS8-OA
