|
---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
---
|
|
|
# XLM-V (Base-sized model) |
|
|
|
XLM-V is a multilingual language model with a one-million-token vocabulary, trained on 2.5TB of data from Common Crawl (the same data as XLM-R).
|
It was introduced in the [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) |
|
paper by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer and Madian Khabsa. |
|
|
|
**Disclaimer**: The team releasing XLM-V did not write a model card for this model, so this model card has been written by the Hugging Face team. [This repository](https://github.com/stefan-it/xlm-v-experiments) documents all necessary integration steps.
|
|
|
## Model description |
|
|
|
From the abstract of the XLM-V paper: |
|
|
|
> Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. |
|
> As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. |
|
> This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. |
|
> In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by |
|
> de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity |
|
> to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically |
|
> more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, |
|
> a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we |
|
> tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and |
|
> named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER). |
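
To get a feel for the vocabulary improvements described above, you can compare how XLM-V and XLM-R tokenize the same text. The snippet below is a minimal sketch (the example sentence and the comparison are illustrative, not taken from the paper):

```python
>>> from transformers import AutoTokenizer

>>> # Load both tokenizers from the Hugging Face Hub.
>>> xlmv_tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-v-base")
>>> xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

>>> sentence = "Paris is the capital of France."
>>> # With its one-million-token vocabulary, XLM-V generally needs fewer,
>>> # more word-like subword tokens than XLM-R for the same sentence.
>>> print(xlmv_tokenizer.tokenize(sentence))
>>> print(xlmr_tokenizer.tokenize(sentence))
```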
|
|
|
## Usage |
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='facebook/xlm-v-base')
>>> unmasker("Paris is the <mask> of France.")

[{'score': 0.9286897778511047,
  'token': 133852,
  'token_str': 'capital',
  'sequence': 'Paris is the capital of France.'},
 {'score': 0.018073994666337967,
  'token': 46562,
  'token_str': 'Capital',
  'sequence': 'Paris is the Capital of France.'},
 {'score': 0.013238662853837013,
  'token': 8696,
  'token_str': 'centre',
  'sequence': 'Paris is the centre of France.'},
 {'score': 0.010450296103954315,
  'token': 550136,
  'token_str': 'heart',
  'sequence': 'Paris is the heart of France.'},
 {'score': 0.005028395913541317,
  'token': 60041,
  'token_str': 'center',
  'sequence': 'Paris is the center of France.'}]
```
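
You can also load the model directly to obtain contextual token representations. The following is a minimal PyTorch sketch using the standard `AutoTokenizer`/`AutoModel` classes (the example sentence is illustrative):

```python
>>> from transformers import AutoTokenizer, AutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-v-base")
>>> model = AutoModel.from_pretrained("facebook/xlm-v-base")

>>> # Tokenize a sentence and run it through the encoder.
>>> inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")
>>> outputs = model(**inputs)

>>> # Last hidden states, shaped (batch_size, sequence_length, hidden_size).
>>> outputs.last_hidden_state.shape
```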
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), as XLM-V has a similar architecture and was trained on similar data.
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@ARTICLE{2023arXiv230110472L, |
|
author = {{Liang}, Davis and {Gonen}, Hila and {Mao}, Yuning and {Hou}, Rui and {Goyal}, Naman and {Ghazvininejad}, Marjan and {Zettlemoyer}, Luke and {Khabsa}, Madian}, |
|
title = "{XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models}", |
|
journal = {arXiv e-prints}, |
|
keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning}, |
|
year = 2023, |
|
month = jan, |
|
eid = {arXiv:2301.10472}, |
|
pages = {arXiv:2301.10472}, |
|
doi = {10.48550/arXiv.2301.10472}, |
|
archivePrefix = {arXiv}, |
|
eprint = {2301.10472}, |
|
primaryClass = {cs.CL}, |
|
adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv230110472L}, |
|
adsnote = {Provided by the SAO/NASA Astrophysics Data System} |
|
} |
|
``` |