---

language: 
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi

license: mit
pipeline_tag: translation
---

# MITRE 913M

## Description
MITRE (multilingual translation with registers) is a multilingual decoder-only model trained for many-to-many translation.  
The underlying technique, registering, is introduced in our [paper](url_placeholder).  
This repository lets you use our pre-trained model for inference. To reproduce the data mining and training, please refer to this [repository](url_placeholder).

The model translates directly between any pair of its 24 languages (552 directions), which span more than 5 language families.
You can use the model directly through the `transformers` library.  
MITRE is also available in a 466M-parameter version, which can be found in this [repository](https://huggingface.co/naist-nlp/mitre_466m).


## Usage
Before loading the tokenizer, install its dependency with `pip install sentencepiece`.  
You can then load the tokenizer and the model with:  
```python
from transformers import AutoModel, AutoTokenizer

# To use the smaller model, switch the name to "naist-nlp/mitre_466m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
```

To use this model locally and inspect the code, clone this repository and then run:
```python
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
```

Once the model and the tokenizer are loaded, you can translate:
```python
import torch

english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"

# Use the GPU when one is available; the model and its inputs must be on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Translate one or several sentences into a single target language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translate one or several sentences, each into its own target language
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])

generated_tokens = model.generate(src_tokens.to(device))
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# Translations by target language:
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。

# For training
# 1. tgt_tokens and labels differ only in that the eos tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because the output of 'tokenizer.encode_target_tokens_to_input_ids' contains pads.
# 3. Refer to our code for the implementation details.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
```
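
For longer inputs, the translation call above can be wrapped in a small batching helper. The following is only a sketch: `batched` and `translate_all` are hypothetical names, not part of this repository, and the default `batch_size` is an arbitrary assumption. It relies solely on the tokenizer and model calls shown above, and assumes the model is a PyTorch module so that its device can be read from its parameters.

```python
from typing import Iterable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(texts: Iterable[str], target_language: str,
                  tokenizer, model, batch_size: int = 8) -> List[str]:
    """Translate `texts` into one target language, one batch at a time.

    Hypothetical helper: it only chains the calls from the example above.
    """
    # Assumption: the model is a torch.nn.Module, so its parameters carry a device.
    device = next(model.parameters()).device
    outputs: List[str] = []
    for batch in batched(list(texts), batch_size):
        src_tokens = tokenizer.encode_source_tokens_to_input_ids(
            batch, target_language=target_language
        )
        generated_tokens = model.generate(src_tokens.to(device))
        outputs.extend(
            tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        )
    return outputs
```

With the objects loaded earlier, a call such as `translate_all([english_text], "zh", tokenizer, model)` would return a list with one translation per input sentence, in input order.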

## Languages covered
Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)  
Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)  
Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)  
Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)  
Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)  
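
The list above can also be written down as data, which makes it easy to confirm the 24 languages and 552 directions stated in the description and to validate a language pair before calling the model. This is a minimal sketch; `LANGUAGES_BY_FAMILY` and `is_supported_direction` are illustrative names, not part of the tokenizer's API.

```python
from itertools import permutations

# Language codes grouped as in the list above (illustrative constant, not a library object).
LANGUAGES_BY_FAMILY = {
    "Germanic": ["en", "de", "nl", "sv", "da", "af"],
    "Romance": ["fr", "es", "it", "pt", "ro"],
    "Slavic": ["ru", "cs", "pl", "bg", "uk"],
    "Malayo-Polynesian": ["id", "ms", "jv", "tl"],
    "Asian*": ["zh", "ja", "ko", "vi"],
}

SUPPORTED = [code for codes in LANGUAGES_BY_FAMILY.values() for code in codes]

# Every ordered pair of distinct languages is one translation direction.
DIRECTIONS = set(permutations(SUPPORTED, 2))


def is_supported_direction(src: str, tgt: str) -> bool:
    """Return True if (src, tgt) is a pair of two distinct covered languages."""
    return (src, tgt) in DIRECTIONS


print(len(SUPPORTED), len(DIRECTIONS))  # 24 552
```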



## BibTeX entry and citation info

```

place holder

```