language:
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi
license: mit
pipeline_tag: translation
MITRE 913M
Description
MITRE (multilingual translation with registers) is a multilingual decoder-only model trained for many-to-many translation.
The underlying technique, registering, is introduced in our paper.
This repository allows you to employ our pre-trained model for inference. If you want to reproduce the data mining and training, please refer to this repository.
The model can translate directly in all 552 directions among the 24 supported languages (every ordered pair, i.e., 24 × 23 = 552), which span more than 5 language families.
You can use our models directly through the transformers library.
MITRE has another version with 466M parameters, which can be found in this repository.
Usage
Before loading the tokenizer, you should first run pip install sentencepiece.
You can then simply load the tokenizer and the model:
from transformers import AutoModel, AutoTokenizer
# you can switch the name to "naist-nlp/mitre_466m" to use the smaller model
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
To use this model locally and inspect the code, you can clone this repository, then
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration
tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
After obtaining the model and tokenizer objects, we can perform translation.
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.eval()
model.cuda()  # the model must be on GPU because src_tokens is moved with .cuda() below
# Translating from one or several sentences to a sole language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translating from one or several sentences to corresponding languages
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])
generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# results
# zh: 我有一个红苹果。
# with the commented-out multi-target call above, you would also get:
# de: Ich habe einen roten Apfel.
# For training
# 1. The difference between tgt_tokens and labels is that the eos_tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
# because 'tokenizer.encode_target_tokens_to_input_ids' includes padding.
# 3. You can refer to our code for implementation details.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
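If you want to compute a training loss yourself, the snippet below is a minimal sketch of a single training step. It assumes that MitreForConditionalGeneration follows the usual Hugging Face seq2seq forward convention (input_ids, decoder_input_ids, labels) and returns an output with a loss attribute; please check modeling_mitre.py for the exact signature before relying on it.

# Minimal training-step sketch.
# Assumption: the forward signature below follows the common Hugging Face
# seq2seq convention; verify it against modeling_mitre.py in this repository.
model.train()
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
outputs = model(input_ids=src_tokens, decoder_input_ids=tgt_tokens, labels=labels)  # assumed signature
loss = outputs.loss
loss.backward()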
Languages covered
Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)
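Any of the ISO codes above can be used as a target-language tag. As an illustrative sketch reusing the calls shown in the Usage section (the sentence and target codes here are arbitrary examples), one source sentence can be fanned out to several target languages in a single batch:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
model.eval()

english_text = "I have a red apple."
targets = ["de", "fr", "ja", "ko"]  # any codes from the list above

# repeat the source sentence once per target language and pair each copy with its own tag
src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags(
    [english_text] * len(targets), target_languages_list=targets
)
generated_tokens = model.generate(src_tokens)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))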
BibTeX entry and citation info
place holder