Agettivu, aggitivu o aghjettivu? POS Tagging Corsican Dialects
Abstract
In this paper we present a series of experiments towards POS tagging Corsican, a less-resourced language spoken
in Corsica and linguistically related to Italian. The first contribution is Corsican-POS, the first gold standard
POS-tagged corpus for Corsican, composed of 500 sentences manually annotated with the Universal POS tagset.
Our second contribution is a set of experiments evaluating POS tagging models, starting from a
baseline model for Italian and aimed at finding the best training configuration in terms of the size and
combination strategy of the existing raw and annotated resources. These experiments result in (i) the first POS
tagger for Corsican, reaching an accuracy of 93.38%, and (ii) a quantification of the gain provided by the use of each
available resource. We find that the optimal configuration uses Italian word embeddings further specialized with
Corsican embeddings and is trained on the largest gold corpus for Corsican available so far.
Origin: Files produced by the author(s)