language:
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi
license: mit
pipeline_tag: translation
MITRE 913M
Description
MITRE (multilingual translation with registers) is a multilingual decoder-only model trained for many-to-many translation.
The underlying technique, registering, is introduced in our paper.
This repository allows you to employ our pre-trained model for inference. If you want to reproduce the data mining and training, please refer to this repository.
The model can translate directly in all 552 directions among the 24 supported languages (every ordered pair, i.e., 24 × 23 = 552), which span more than 5 language families.
You can use our models directly through the transformers library.
MITRE has another version with 466M parameters, which can be found in this repository.
Usage
Before loading the tokenizer, you should first run pip install sentencepiece.
You can then simply load the tokenizer and the model:
from transformers import AutoModel, AutoTokenizer
# you can switch the name to "naist-nlp/mitre_466m" to use the smaller model
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
To use this model locally and inspect the code, you can clone this repository, then
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration
tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
After obtaining the model and tokenizer objects, we can perform translation.
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.eval()
model.cuda()  # the model must be on GPU because src_tokens is moved with .cuda() below
# Translating from one or several sentences to a sole language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translating from one or several sentences to corresponding languages
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])
generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# results
# zh: 我有一个红苹果。
# with the commented-out multi-target call above, you would also get:
# de: Ich habe einen roten Apfel.
# For training
# 1. The difference between tgt_tokens and labels is that the eos_tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
# because 'tokenizer.encode_target_tokens_to_input_ids' includes padding.
# 3. You can refer to our code for implementation details.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
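If you want to compute a training loss yourself, the snippet below is a minimal sketch of a single training step. It assumes that MitreForConditionalGeneration follows the usual Hugging Face seq2seq forward convention (input_ids, decoder_input_ids, labels) and returns an output with a loss attribute; please check modeling_mitre.py for the exact signature before relying on it.

# Minimal training-step sketch.
# Assumption: the forward signature below follows the common Hugging Face
# seq2seq convention; verify it against modeling_mitre.py in this repository.
model.train()
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
outputs = model(input_ids=src_tokens, decoder_input_ids=tgt_tokens, labels=labels)  # assumed signature
loss = outputs.loss
loss.backward()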
Languages covered
Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)
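Any of the ISO codes above can be used as a target-language tag. As an illustrative sketch reusing the calls shown in the Usage section (the sentence and target codes here are arbitrary examples), one source sentence can be fanned out to several target languages in a single batch:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
model.eval()

english_text = "I have a red apple."
targets = ["de", "fr", "ja", "ko"]  # any codes from the list above

# repeat the source sentence once per target language and pair each copy with its own tag
src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags(
    [english_text] * len(targets), target_languages_list=targets
)
generated_tokens = model.generate(src_tokens)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))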
BibTeX entry and citation info
place holder