Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains

Vincent Segonne; Aidan Mannion; Laura Cristina Alonzo Canul; Alexandre Audibert; Xingyu Liu; Cécile Macaire; Adrien Pupier; Yongxin Zhou; Mathilde Aguiar; Felix Herron; Magali Norré; Massih-Reza Amini; Pierrette Bouillon; Iris Eshkol-Taravella; Emmanuelle Esperança-Rodier; Thomas François; Lorraine Goeuriot; Jérôme Goulian; Mathieu Lafourcade; Benjamin Lecouteux; François Portet; Fabien Ringeval; Vincent Vandeghinste; Maximin Coavoux; Marco Dinarelli; Didier Schwab

Conference paper. Year: 2024

Vincent Segonne

Role: Author
PersonId : 1372931

Institut de Recherche en Informatique et Systèmes Aléatoires

Aidan Mannion

Role: Author
PersonId : 749373
IdHAL : aidan-mannion

Modélisation et Recherche d’Information Multimédia [Grenoble]

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Laura Cristina Alonzo Canul

Role: Author

Laboratoire d'Informatique de Grenoble

Université Grenoble Alpes

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Alexandre Audibert

Role: Author

Laboratoire d'Informatique de Grenoble

Université Grenoble Alpes

Algorithms, Principles and TheorIes for collaborative Knowledge acquisition And Learning

Xingyu Liu

Role: Author
PersonId : 1144132

Laboratoire d'Informatique de Grenoble

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Cécile Macaire

Role: Author
PersonId : 1120002
IdHAL : cecile-macaire
ORCID : 0000-0003-1407-8880

Laboratoire d'Informatique de Grenoble

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Adrien Pupier

Role: Author
PersonId : 1254663
IdHAL : adrien-pupier
ORCID : 0009-0007-9458-341X

Laboratoire d'Informatique de Grenoble

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Yongxin Zhou

Role: Author
PersonId : 1372932

Laboratoire d'Informatique de Grenoble

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Mathilde Aguiar

Role: Author
PersonId : 1371958
IdHAL : mathilde-aguiar
ORCID : 0009-0000-9380-2624

Laboratoire Interdisciplinaire des Sciences du Numérique

Felix Herron

Role: Author
PersonId : 1395830
IdHAL : herronf

Laboratoire d'Informatique de Grenoble

Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision

Magali Norré

Role: Author
PersonId : 1262376

Université Catholique de Louvain = Catholic University of Louvain

Université de Genève = University of Geneva

Massih-Reza Amini

Role: Author
PersonId : 747054
IdHAL : massih-reza-amini
ORCID : 0000-0001-9032-4233
IdRef : 132277042

Laboratoire d'Informatique de Grenoble

Algorithms, Principles and TheorIes for collaborative Knowledge acquisition And Learning

Pierrette Bouillon

Role: Author
PersonId : 840929

Université de Genève = University of Geneva

Iris Eshkol-Taravella

Role: Author
PersonId : 18520
IdHAL : iris-eshkol-taravella
ORCID : 0000-0003-0814-3623
IdRef : 074195158

Modèles, Dynamiques, Corpus

Emmanuelle Esperança-Rodier

Role: Author
PersonId : 745181
IdHAL : emmanuelle-esperanca-rodier

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Laboratoire d'Informatique de Grenoble

Université Grenoble Alpes

Thomas François

Role: Author
PersonId : 1025015

Université Catholique de Louvain = Catholic University of Louvain

Lorraine Goeuriot

Role: Author
PersonId : 169704
IdHAL : lorraine-goeuriot
ORCID : 0000-0001-7491-1980
IdRef : 143794957

Laboratoire d'Informatique de Grenoble

Modélisation et Recherche d’Information Multimédia [Grenoble]

Jérôme Goulian

Role: Author
PersonId : 1200019

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Laboratoire d'Informatique de Grenoble

Mathieu Lafourcade

Role: Author
PersonId : 172381
IdHAL : mathieu-lafourcade
ORCID : 0000-0003-2832-2143

Exploration et exploitation de données textuelles

Benjamin Lecouteux

Role: Author
PersonId : 7847
IdHAL : benjamin-lecouteux
ORCID : 0000-0003-3000-6190
IdRef : 135355060

Laboratoire d'Informatique de Grenoble

Université Grenoble Alpes

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

François Portet

Role: Author
PersonId : 1069
IdHAL : francois-portet
ORCID : 0000-0003-2542-0661
IdRef : 098179160

Laboratoire d'Informatique de Grenoble

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Fabien Ringeval

Role: Author
PersonId : 13134
IdHAL : fabien-ringeval
ORCID : 0000-0002-9213-4529
IdRef : 154573078

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Laboratoire d'Informatique de Grenoble

Vincent Vandeghinste

Role: Author
PersonId : 1371961

Catholic University of Leuven = Katholieke Universiteit Leuven

Maximin Coavoux

Role: Author
PersonId : 13643
IdHAL : maximin-coavoux
ORCID : 0000-0003-4089-4558

Laboratoire d'Informatique de Grenoble

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Marco Dinarelli

Role: Author
PersonId : 12699
IdHAL : marco-dinarelli
IdRef : 22461939X

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Laboratoire d'Informatique de Grenoble

Didier Schwab

Role: Author
PersonId : 4261
IdHAL : didier-schwab
ORCID : 0000-0002-2462-8148
IdRef : 069192359

Laboratoire d'Informatique de Grenoble

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Abstract

Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. Because these domains often involve processing long documents, we use a transformer architecture based on efficient methods (LinFormer) to maximise the models' utility. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, and single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pretraining with the approximate LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as the contributed datasets.

Keywords

Self-supervised learning; Pretrained language models; Evaluation benchmark; Biomedical document processing; Legal document processing; Speech transcription

Domains

Computer science; Artificial intelligence [cs.AI]; Text and document processing

Main file

FB2_domaines_specialises_LREC_COLING24.pdf (156.2 KB)

Origin: Files produced by the author(s)

Contributor: Didier Schwab

https://hal.science/hal-04535557

Submitted on: Saturday, April 6, 2024, 17:41:03

Last modified on: Wednesday, December 18, 2024, 09:46:10

Long-term archiving on: Sunday, July 7, 2024, 18:16:11

Dates and versions

hal-04535557, version 1 (06-04-2024)

Identifiers

HAL Id: hal-04535557, version 1

Cite

Vincent Segonne, Aidan Mannion, Laura Cristina Alonzo Canul, Alexandre Audibert, Xingyu Liu, et al. Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation, May 2024, Turin, Italy. ⟨hal-04535557⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA UNIV-DAUPHINE INSA-RENNES IRISA MODYCO LIG LIG_TDCGE_GETALP TEXTE LIRMM CENTRALESUPELEC LAMSADE-DAUPHINE GENCI PSL UR1-MATH-STIC UNIV-PARIS-SACLAY UR1-UFR-ISTIC UNIV-MONTPELLIER UNIV-RENNES INSA-GROUPE UNIV-PARIS-LUMIERES MIAI ANR UR1-MATH-NUM LISN UNIV-PARIS-NANTERRE GS-COMPUTER-SCIENCE LIG_SIDCH LIG_SIDCH_APTIKAL

379 views

309 downloads
