The Rare Word Issue in Natural Language Generation: A Character-Based Solution - Sorbonne Université
Journal article, Informatics, Year: 2021

Abstract

In this paper, we analyze the problem of generating fluent English utterances from tabular data, focusing on the development of a sequence-to-sequence neural model with two major features: the ability to read and generate text character by character, and the ability to switch between generating characters and copying them from the input, an essential feature when inputs contain rare words such as proper names, telephone numbers, or foreign words. Working with characters instead of words is challenging: it can make training more difficult and increase the probability of errors during inference. Nevertheless, our work shows that these issues can be solved, and the effort is repaid by a fully end-to-end system whose inputs and outputs are not constrained to a predefined vocabulary, as they are in word-based models. Furthermore, our copying technique is integrated with an innovative shift mechanism, which enhances the ability to produce outputs directly from inputs. We assess performance on the E2E dataset, the benchmark used for the E2E NLG challenge, and on a modified version of it, created to highlight the rare-word copying capabilities of our model. The results demonstrate clear improvements over the baseline and promising performance compared to recent techniques in the literature.
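The switch between generating and copying characters described in the abstract can be pictured with a small sketch. The code below is not the authors' model: it uses a generic pointer-generator-style mixture as a stand-in, and it does not reproduce the shift mechanism. The function name `mix_generate_and_copy`, the gate `p_gen`, and all scores are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_generate_and_copy(gen_scores, attn_scores, input_chars, alphabet, p_gen):
    """Blend a generation distribution with a copy distribution.

    Assumption (not from the paper): the final probability of a character is
    p_gen * P_generate(char) + (1 - p_gen) * (attention mass on the input
    positions holding that character).
    """
    gen_probs = softmax(gen_scores)    # one score per alphabet character
    attn_probs = softmax(attn_scores)  # one score per input position
    final = {ch: p_gen * p for ch, p in zip(alphabet, gen_probs)}
    for ch, a in zip(input_chars, attn_probs):
        # Copy mass flows to the character found at each attended position.
        final[ch] = final.get(ch, 0.0) + (1.0 - p_gen) * a
    return final


# Toy step: the generator is undecided, attention points at input position 0.
alphabet = list("abc")
input_chars = list("cab")
dist = mix_generate_and_copy(
    gen_scores=[0.0, 0.0, 0.0],
    attn_scores=[5.0, 0.0, 0.0],
    input_chars=input_chars,
    alphabet=alphabet,
    p_gen=0.2,
)
best = max(dist, key=dist.get)
```

With a low `p_gen`, most of the probability mass comes from the attended input character, which is how a copy path can reproduce a rare name or phone number that the generator alone would be unlikely to spell out.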
Main file
informatics-08-00020.pdf (770.68 KB)
Origin: Publication funded by an institution

Dates and versions

hal-03184301, version 1 (29-03-2021)

Cite

Giovanni Bonetta, Marco Roberti, Rossella Cancelliere, Patrick Gallinari. The Rare Word Issue in Natural Language Generation: A Character-Based Solution. Informatics, 2021, 8 (1), pp.20. ⟨10.3390/informatics8010020⟩. ⟨hal-03184301⟩