---
language:
- multilingual
- af
- am
- ar
- ast
- az
- ba
- be
- bg
- bn
- br
- bs
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- ilo
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- lb
- lg
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- ns
- oc
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- th
- tl
- tn
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
license: mit
tags:
- small100
- translation
datasets:
- flores101
- gsarti/flores_101
- tico19
- gmnlp/tico19
- tatoeba
---

# SMALL-100 Model

SMaLL-100 is a compact, fast, massively multilingual machine translation model covering more than 10K language pairs. It achieves results competitive with M2M-100 while being much smaller and faster. It was introduced in [this paper](https://arxiv.org/abs/2210.11621) and initially released in [this repository](https://github.com/alirezamshi/small100).

The model architecture and config are the same as in the [M2M-100](https://huggingface.co/facebook/m2m100_418M/tree/main) implementation, but the tokenizer is modified to adjust the language codes. For now, load the tokenizer locally from the `tokenization_small100.py` file.

```
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")

# translate Hindi to French
tokenizer.tgt_lang = "fr"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.tgt_lang = "en"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."
```
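
Both translations above follow the same steps: set `tokenizer.tgt_lang`, encode the text, generate, and decode. A minimal sketch of a reusable wrapper is given below. The helper name `translate` and its parameters are hypothetical and not part of the SMaLL-100 repository, and `padding=True` assumes the tokenizer accepts the standard `transformers` tokenizer call arguments for batched input.

```python
def translate(texts, tgt_lang, model, tokenizer, **generate_kwargs):
    """Translate one string or a list of strings into `tgt_lang`.

    Hypothetical helper: `model` and `tokenizer` are expected to be the
    objects loaded in the snippet above; extra keyword arguments are
    forwarded unchanged to `model.generate`.
    """
    # Accept a single string for convenience.
    if isinstance(texts, str):
        texts = [texts]

    # The target language is set on the tokenizer, as in the example above.
    tokenizer.tgt_lang = tgt_lang

    # Assumption: padding=True pads the batch to a common length.
    encoded = tokenizer(texts, return_tensors="pt", padding=True)

    generated_tokens = model.generate(**encoded, **generate_kwargs)
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
```

With `model` and `tokenizer` loaded as shown earlier, a call such as `translate([hi_text, chinese_text], "en", model, tokenizer)` would return a list with one translation per input. Because the target language is stored on the tokenizer, each call overwrites the previous `tgt_lang` setting.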

Please refer to the [original repository](https://github.com/alirezamshi/small100) for further details.