Canine for Language Identification

Canine model trained on WiLI-2018 dataset to identify the language of a text.

Preprocessing

  • 10% of train data stratified sampled as validation set
  • max sequence length: 512

Hyperparameters

  • epochs: 4
  • learning-rate: 3e-5
  • batch size: 16
  • gradient_accumulation: 4
  • optimizer: AdamW with default settings

Test Results

  • Accuracy: 94,92%
  • Macro F1-score: 94,91%

Inference

Dictionary to return English names for a label id:

import datasets
import pycountry
def int_to_lang():
    dataset = datasets.load_dataset('wili_2018')
    # names for languages not in iso-639-3 from wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
    # create dictionary from data set labels to language names
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang

Credit to

@article{clark-etal-2022-canine,
    title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
    author = "Clark, Jonathan H.  and
      Garrette, Dan  and
      Turc, Iulia  and
      Wieting, John",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.5",
    doi = "10.1162/tacl_a_00448",
    pages = "73--91",
    abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}
@dataset{thoma_martin_2018_841984,
  author       = {Thoma, Martin},
  title        = {{WiLI-2018 - Wikipedia Language Identification 
                   database}},
  month        = jan,
  year         = 2018,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.841984},
  url          = {https://doi.org/10.5281/zenodo.841984}
}
Downloads last month
7
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train SebOchs/canine-c-lang-id