POET: A French Extended Part-of-Speech Tagger

People Involved

Affiliations

  1. LIA, NLP team, Avignon University, Avignon, France.
  2. LS2N, TALN team, Nantes University, Nantes, France.

Demo: How to use in Flair

Requires Flair: pip install flair

from flair.data import Sentence
from flair.models import SequenceTagger

# Load the model
model = SequenceTagger.load("qanastek/pos-french")

sentence = Sentence("George Washington est allé à Washington")

# predict tags
model.predict(sentence)

# print predicted pos tags
print(sentence.to_tagged_string())

Output:

Preview Output

Training data

ANTILLES is a part-of-speech tagging corpora based on UD_French-GSD which was originally created in 2015 and is based on the universal dependency treebank v2.0.

Originally, the corpora consists of 400,399 words (16,341 sentences) and had 17 different classes. Now, after applying our tags augmentation we obtain 60 different classes which add linguistic and semantic information such as the gender, number, mood, person, tense or verb form given in the different CoNLL-03 fields from the original corpora.

We based our tags on the level of details given by the LIA_TAGG statistical POS tagger written by Frédéric Béchet in 2001.

The corpora used for this model is available on Github at the CoNLL-U format.

Training data are fed to the model as free language and doesn't pass a normalization phase. Thus, it's made the model case and punctuation sensitive.

Original Tags

PRON VERB SCONJ ADP CCONJ DET NOUN ADJ AUX ADV PUNCT PROPN NUM SYM PART X INTJ

New additional POS tags

Abbreviation Description Examples
PREP Preposition de
AUX Auxiliary Verb est
ADV Adverb toujours
COSUB Subordinating conjunction que
COCO Coordinating Conjunction et
PART Demonstrative particle -t
PRON Pronoun qui ce quoi
PDEMMS Demonstrative Pronoun - Singular Masculine ce
PDEMMP Demonstrative Pronoun - Plural Masculine ceux
PDEMFS Demonstrative Pronoun - Singular Feminine cette
PDEMFP Demonstrative Pronoun - Plural Feminine celles
PINDMS Indefinite Pronoun - Singular Masculine tout
PINDMP Indefinite Pronoun - Plural Masculine autres
PINDFS Indefinite Pronoun - Singular Feminine chacune
PINDFP Indefinite Pronoun - Plural Feminine certaines
PROPN Proper noun Houston
XFAMIL Last name Levy
NUM Numerical Adjective trentaine vingtaine
DINTMS Masculine Numerical Adjective un
DINTFS Feminine Numerical Adjective une
PPOBJMS Pronoun complements of objects - Singular Masculine le lui
PPOBJMP Pronoun complements of objects - Plural Masculine eux y
PPOBJFS Pronoun complements of objects - Singular Feminine moi la
PPOBJFP Pronoun complements of objects - Plural Feminine en y
PPER1S Personal Pronoun First-Person - Singular je
PPER2S Personal Pronoun Second-Person - Singular tu
PPER3MS Personal Pronoun Third-Person - Singular Masculine il
PPER3MP Personal Pronoun Third-Person - Plural Masculine ils
PPER3FS Personal Pronoun Third-Person - Singular Feminine elle
PPER3FP Personal Pronoun Third-Person - Plural Feminine elles
PREFS Reflexive Pronoun First-Person - Singular me m'
PREF Reflexive Pronoun Third-Person - Singular se s'
PREFP Reflexive Pronoun First / Second-Person - Plural nous vous
VERB Verb obtient
VPPMS Past Participle - Singular Masculine formulé
VPPMP Past Participle - Plural Masculine classés
VPPFS Past Participle - Singular Feminine appelée
VPPFP Past Participle - Plural Feminine sanctionnées
DET Determinant les l'
DETMS Determinant - Singular Masculine les
DETFS Determinant - Singular Feminine la
ADJ Adjective capable sérieux
ADJMS Adjective - Singular Masculine grand important
ADJMP Adjective - Plural Masculine grands petits
ADJFS Adjective - Singular Feminine française petite
ADJFP Adjective - Plural Feminine légères petites
NOUN Noun temps
NMS Noun - Singular Masculine drapeau
NMP Noun - Plural Masculine journalistes
NFS Noun - Singular Feminine tête
NFP Noun - Plural Feminine ondes
PREL Relative Pronoun qui dont
PRELMS Relative Pronoun - Singular Masculine lequel
PRELMP Relative Pronoun - Plural Masculine lesquels
PRELFS Relative Pronoun - Singular Feminine laquelle
PRELFP Relative Pronoun - Plural Feminine lesquelles
INTJ Interjection merci bref
CHIF Numbers 1979 10
SYM Symbol € %
YPFOR Endpoint .
PUNCT Ponctuation : ,
MOTINC Unknown words Technology Lady
X Typos & others sfeir 3D statu

Evaluation results

The test corpora used for this evaluation is available on Github.

Results:
- F-score (micro) 0.9797
- F-score (macro) 0.9178
- Accuracy 0.9797

By class:
              precision    recall  f1-score   support

        PREP     0.9966    0.9987    0.9976      1483
       PUNCT     1.0000    1.0000    1.0000       833
         NMS     0.9634    0.9801    0.9717       753
         DET     0.9923    0.9984    0.9954       645
        VERB     0.9913    0.9811    0.9862       583
         NFS     0.9667    0.9839    0.9752       560
         ADV     0.9940    0.9821    0.9880       504
       PROPN     0.9541    0.8937    0.9229       395
       DETMS     1.0000    1.0000    1.0000       362
         AUX     0.9860    0.9915    0.9888       355
       YPFOR     1.0000    1.0000    1.0000       353
         NMP     0.9666    0.9475    0.9570       305
        COCO     0.9959    1.0000    0.9980       245
       ADJMS     0.9463    0.9385    0.9424       244
       DETFS     1.0000    1.0000    1.0000       240
        CHIF     0.9648    0.9865    0.9755       222
         NFP     0.9515    0.9849    0.9679       199
       ADJFS     0.9657    0.9286    0.9468       182
       VPPMS     0.9387    0.9745    0.9563       157
       COSUB     1.0000    0.9844    0.9921       128
      DINTMS     0.9918    0.9918    0.9918       122
      XFAMIL     0.9298    0.9217    0.9258       115
     PPER3MS     1.0000    1.0000    1.0000        87
       ADJMP     0.9294    0.9634    0.9461        82
      PDEMMS     1.0000    1.0000    1.0000        75
       ADJFP     0.9861    0.9342    0.9595        76
        PREL     0.9859    1.0000    0.9929        70
      DINTFS     0.9839    1.0000    0.9919        61
        PREF     1.0000    1.0000    1.0000        52
     PPOBJMS     0.9565    0.9362    0.9462        47
       PREFP     0.9778    1.0000    0.9888        44
      PINDMS     1.0000    0.9773    0.9885        44
       VPPFS     0.8298    0.9750    0.8966        40
      PPER1S     1.0000    1.0000    1.0000        42
         SYM     1.0000    0.9474    0.9730        38
        NOUN     0.8824    0.7692    0.8219        39
        PRON     1.0000    0.9677    0.9836        31
      PDEMFS     1.0000    1.0000    1.0000        29
       VPPMP     0.9286    1.0000    0.9630        26
         ADJ     0.9524    0.9091    0.9302        22
     PPER3MP     1.0000    1.0000    1.0000        20
       VPPFP     1.0000    1.0000    1.0000        19
     PPER3FS     1.0000    1.0000    1.0000        18
      MOTINC     0.3333    0.4000    0.3636        15
       PREFS     1.0000    1.0000    1.0000        10
     PPOBJMP     1.0000    0.8000    0.8889        10
     PPOBJFS     0.6250    0.8333    0.7143         6
        INTJ     0.5000    0.6667    0.5714         6
        PART     1.0000    1.0000    1.0000         4
      PDEMMP     1.0000    1.0000    1.0000         3
      PDEMFP     1.0000    1.0000    1.0000         3
     PPER3FP     1.0000    1.0000    1.0000         2
         NUM     1.0000    0.3333    0.5000         3
      PPER2S     1.0000    1.0000    1.0000         2
     PPOBJFP     0.5000    0.5000    0.5000         2
      PRELMS     1.0000    1.0000    1.0000         2
      PINDFS     0.5000    1.0000    0.6667         1
      PINDMP     1.0000    1.0000    1.0000         1
           X     0.0000    0.0000    0.0000         1
      PINDFP     1.0000    1.0000    1.0000         1

   micro avg     0.9797    0.9797    0.9797     10019
   macro avg     0.9228    0.9230    0.9178     10019
weighted avg     0.9802    0.9797    0.9798     10019
 samples avg     0.9797    0.9797    0.9797     10019

BibTeX Citations

Please cite the following paper when using this model.

ANTILLES corpus and POET taggers:

@inproceedings{labrak:hal-03696042,
  TITLE = {{ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus}},
  AUTHOR = {Labrak, Yanis and Dufour, Richard},
  URL = {https://hal.archives-ouvertes.fr/hal-03696042},
  BOOKTITLE = {{25th International Conference on Text, Speech and Dialogue (TSD)}},
  ADDRESS = {Brno, Czech Republic},
  PUBLISHER = {{Springer}},
  YEAR = {2022},
  MONTH = Sep,
  KEYWORDS = {Part-of-speech corpus ; POS tagging ; Open tools ; Word embeddings ; Bi-LSTM ; CRF ; Transformers},
  PDF = {https://hal.archives-ouvertes.fr/hal-03696042/file/ANTILLES_A_freNch_linguisTIcaLLy_Enriched_part_of_Speech_corpus.pdf},
  HAL_ID = {hal-03696042},
  HAL_VERSION = {v1},
}

UD_French-GSD corpora:

@misc{
    universaldependencies,
    title={UniversalDependencies/UD_French-GSD},
    url={https://github.com/UniversalDependencies/UD_French-GSD}, journal={GitHub},
    author={UniversalDependencies}
}

LIA TAGG:

@techreport{LIA_TAGG,
  author = {Frédéric Béchet},
  title = {LIA_TAGG: a statistical POS tagger + syntactic bracketer},
  institution = {Aix-Marseille University & CNRS},
  year = {2001}
}

Flair Embeddings:

@inproceedings{akbik2018coling,
  title={Contextual String Embeddings for Sequence Labeling},
  author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
  booktitle = {{COLING} 2018, 27th International Conference on Computational Linguistics},
  pages     = {1638--1649},
  year      = {2018}
}

Acknowledgment

This work was financially supported by Zenidoc

Downloads last month
9
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train qanastek/pos-french-camembert-flair

Space using qanastek/pos-french-camembert-flair 1