---
license: apache-2.0
datasets:
- akoksal/muri-it
language:
- afr
- amh
- ara
- aze
- bel
- ben
- bul
- cat
- ceb
- ces
- cos
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fas
- fin
- fra
- fry
- gla
- gle
- glg
- guj
- hat
- hau
- haw
- hbs
- heb
- hin
- hun
- hye
- ibo
- isl
- ita
- jav
- jpn
- kan
- kat
- kaz
- khm
- kir
- kor
- kur
- lao
- lat
- lav
- lit
- ltz
- mal
- mar
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nld
- nor
- nya
- pan
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- sun
- swa
- swe
- tam
- tel
- tgk
- tha
- tur
- ukr
- urd
- uzb
- vie
- xho
- yid
- yor
- zho
- zul
base_model:
- google/mt5-xxl
pipeline_tag: text2text-generation
---

# MURI-101: Multilingual Instruction-Following Model for 101 Languages (mT5-XXL)

MURI-101 is a multilingual instruction-following model fine-tuned on a subset of the [**MURI-IT**](https://huggingface.co/datasets/akoksal/muri-it) dataset. It supports **101 languages** and outperforms most multilingual models on both **Natural Language Understanding (NLU)** and **Natural Language Generation (NLG)** tasks, especially in low-resource settings.

The training data is built with multilingual reverse instructions, an approach intended to keep outputs culturally and linguistically appropriate for the target language and to reduce translation artifacts.

[Paper](https://arxiv.org/abs/2409.12958)

### Model Architecture

- **Base Model**: mT5-XXL
- **Training Data**: Subset of MURI-IT
- **Training Setup**: Trained with [t5x](https://github.com/google-research/t5x) on 32 TPU v4-32. Batch size of 64 with data packing enabled, a learning rate of 3e-4 without a scheduler, and 5 epochs.

## Results

We compare **MURI-101** against state-of-the-art models for multilingual instruction following. MURI-101 outperforms most multilingual models, except for Aya, across both NLU and NLG datasets.

| Language | Okapi | mT0 | mT0x | Aya-101 | MURI-101 |
|----------|-------|------|------|---------|----------|
| arb | 27.7 | 31.5 | 31.6 | 38.2 | 36.5 |
| ben | 26.8 | 31.6 | 30.2 | 35.8 | 33.0 |
| cat | 30.5 | 32.8 | 32.6 | 39.6 | 38.8 |
| dan | 31.8 | 33.0 | 32.0 | 39.7 | 38.4 |
| deu | 31.7 | 32.7 | 32.5 | 39.7 | 38.9 |
| ... | ... | ... | ... | ... | ... |
| vie | 27.5 | 30.9 | 31.1 | 34.8 | 36.8 |
| zho | 28.2 | 32.5 | 31.6 | 38.3 | 36.9 |
| Avg. | 28.8 | 31.5 | 30.8 | 37.3 | 36.0 |

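
As a quick way to read the excerpt above, the following sketch computes the per-language gap between MURI-101 and Aya-101. It uses only the rows shown in this card (the elided rows are not included, so it says nothing about the full table), and the variable names are illustrative.

```python
# Scores copied from the rows shown in the table above (elided rows omitted).
# Each entry maps a language code to (Aya-101 score, MURI-101 score).
scores = {
    "arb": (38.2, 36.5),
    "ben": (35.8, 33.0),
    "cat": (39.6, 38.8),
    "dan": (39.7, 38.4),
    "deu": (39.7, 38.9),
    "vie": (34.8, 36.8),
    "zho": (38.3, 36.9),
}

# A positive gap means MURI-101 scores higher than Aya-101.
gaps = {lang: round(muri - aya, 1) for lang, (aya, muri) in scores.items()}

# Print languages from the most to the least favourable gap for MURI-101.
for lang, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {gap:+.1f}")

# Languages (among the rows shown) where MURI-101 is ahead of Aya-101.
muri_ahead = [lang for lang, gap in gaps.items() if gap > 0]
print("MURI-101 ahead on:", muri_ahead)
```

Among the rows shown, `vie` is the only language where MURI-101 scores higher than Aya-101; the remaining gaps range from -0.8 to -2.8 points.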

Additionally, our model complements Aya, especially in low-resource settings: combining the two improves on Aya alone for nine of the ten languages below and raises the average from 35.1 to 37.2.

| Language | mT5 | Aya_1 | Aya_1 + MURI_1 |
|----------|------|-------|----------------|
| aze | 20.4 | 37.0 | 39.5 |
| bel | 22.4 | 32.1 | 33.7 |
| bul | 20.7 | 34.4 | 38.1 |
| cym | 18.4 | 33.0 | 35.5 |
| gla | 19.3 | 28.7 | 35.2 |
| kaz | 19.8 | 44.7 | 46.7 |
| khm | 16.5 | 30.0 | 31.3 |
| lao | 21.3 | 32.7 | 33.0 |
| slk | 19.2 | 38.1 | 39.1 |
| slv | 18.9 | 40.3 | 39.6 |
| Avg. | 19.7 | 35.1 | **37.2** |

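
The `Avg.` row and the per-language effect of adding MURI can be recomputed directly from the table. The sketch below does this with plain Python; the numbers are copied from the table above, and the helper name `column_mean` is illustrative.

```python
# Scores copied from the table above.
# Each entry maps a language code to (mT5, Aya_1, Aya_1 + MURI_1).
rows = {
    "aze": (20.4, 37.0, 39.5),
    "bel": (22.4, 32.1, 33.7),
    "bul": (20.7, 34.4, 38.1),
    "cym": (18.4, 33.0, 35.5),
    "gla": (19.3, 28.7, 35.2),
    "kaz": (19.8, 44.7, 46.7),
    "khm": (16.5, 30.0, 31.3),
    "lao": (21.3, 32.7, 33.0),
    "slk": (19.2, 38.1, 39.1),
    "slv": (18.9, 40.3, 39.6),
}


def column_mean(index):
    """Average one column over all languages, rounded to one decimal."""
    return round(sum(r[index] for r in rows.values()) / len(rows), 1)


# Recompute the "Avg." row for the three columns.
averages = [column_mean(i) for i in range(3)]
print("Avg.:", averages)

# Gain from adding MURI_1 on top of Aya_1, per language.
gains = {lang: round(r[2] - r[1], 1) for lang, r in rows.items()}
best = max(gains, key=gains.get)
print("largest gain:", best, gains[best])
```

The recomputed averages match the `Avg.` row (19.7, 35.1, 37.2). The largest gain is for `gla` (+6.5), and `slv` is the one language where the combined model scores lower (-0.7).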

## Use

The model can be loaded and run in either of the following ways:

### AutoModelForSeq2SeqLM

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use a GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

muri = AutoModelForSeq2SeqLM.from_pretrained("akoksal/muri-101").to(device)
tokenizer = AutoTokenizer.from_pretrained("akoksal/muri-101")

instruction = "Verilen cümlenin pozitif mi negatif mi olduğunu tahmin edin: Hayatta kesinlikle izlenmemesi gereken filmler kategorisindeki listemin en başına bu filmi koyarım."
# English translation of the Turkish instruction: Guess whether the given sentence is positive or negative: I would put this movie at the very top of the list of movies that absolutely should not be watched in life.

# Tokenize the instruction and move the tensors to the same device as the model.
inputs = tokenizer(instruction, return_tensors="pt").to(device)
outputs = muri.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# > negatif
# (negative)
```

### Pipeline

```python
from transformers import pipeline

muri = pipeline("text2text-generation",
                model="akoksal/muri-101")

# The prompt is in Persian; the article body is elided here ("...").
result = muri("""این مقاله را خلاصه کنید
...تیم دانشآموزی کاوش باستانی یک بطری حاوی پیغام ۲۰۰ ساله در شمال فرانسه پیدا کردند""",
              max_new_tokens=150,
              do_sample=True,
              temperature=0.9,
              top_p=0.8)
# English translation of the prompt:
# Summarize this article
# A student team of archeologists found a bottle containing a 200-year-old message in northern France ... [300 words]

# Sampling is enabled, so the output varies between runs. One example:
print(result[0]["generated_text"])
# > در طول سالیان متمادی باستان شناسان فرانسوی تلاش زیادی برای پیدا کردن آثار و اشیای باستانی انجام داده اند اما این بار پیدا شدن بطری حاوی پیغامی به بیش از دو قرن پیش از آن تاریخ نشان می دهد.
# English translation of the output:
# > Over the years, French archaeologists have made great efforts to find ancient works and objects, but this time, the discovery of a bottle containing a message shows that date more than two centuries ago.
```

Thanks to [Google's TRC program](https://sites.research.google/trc/about/) for supporting the training of this model.

Check out [the paper](https://arxiv.org/abs/2409.12958) for more detailed information on the experiments and results.

## Citation

```
@misc{koksal2024muri,
      title={MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions},
      author={Abdullatif Köksal and Marion Thaler and Ayyoob Imani and Ahmet Üstün and Anna Korhonen and Hinrich Schütze},
      year={2024},
      eprint={2409.12958},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.12958},
}
```