---
language:
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi
license: mit
pipeline_tag: translation
---

# MITRE 913M

## Description

MITRE (multilingual translation with registers) is a multilingual, decoder-only model trained for many-to-many translation.
The underlying technique, registering, is introduced in our [paper](url_placeholder).
This repository lets you use our pre-trained model for inference. If you want to reproduce the data mining and training, please refer to this [repository](url_placeholder).

The model can directly translate across all 552 directions of the 24 covered languages (24 × 23 = 552), which span more than 5 language families.
You can use our models directly through the `transformers` library.
MITRE also has a smaller version with 466M parameters, which can be found in this [repository](https://huggingface.co/naist-nlp/mitre_466m).

## Usage

Before loading the tokenizer, you need to run `pip install sentencepiece`.
You can load the tokenizer and the model with:
```python
from transformers import AutoModel, AutoTokenizer

# you can switch the name to "naist-nlp/mitre_466m" to load the smaller model
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
```

To use this model locally and inspect the code, you can clone this hub, then
```python
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
```
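
Note that these imports assume the parent directory of your `mitre_913m` clone is on Python's module search path. If you run Python from somewhere else, a minimal sketch of the workaround (the path below is hypothetical; adjust it to wherever you cloned the hub):

```python
import sys

# hypothetical path to the directory that contains the mitre_913m clone
sys.path.append("/path/to/parent_of_mitre_913m")

from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration
```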

Once you have the tokenizer and model objects, you can run translation.
```python
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.eval()
model.cuda()  # the inputs below are moved to GPU, so the model must be as well

# Translate one or several sentences into a single target language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translate one or several sentences into their corresponding target languages
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])

generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# results (the de line corresponds to the commented-out multi-target call)
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。

# For training:
# 1. The difference between tgt_tokens and labels is that the eos tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because 'tokenizer.encode_target_tokens_to_input_ids' includes padding.
# 3. You can refer to our code for the implementation details.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
```
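
To make the training comments above concrete, here is a rough sketch of a single training step. It is only an illustration: the keyword names `input_ids`, `decoder_input_ids`, and `labels` are assumptions borrowed from the usual `*ForConditionalGeneration` convention, so verify the actual forward signature in `modeling_mitre.py` before adapting this.

```python
model.train()
model.cuda()

# source side: the sentence plus its target-language tag
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# target side: decoder inputs and labels differ only in where the eos token sits
tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
labels = tokenizer.encode_target_tokens_to_labels(chinese_text)

# hypothetical forward call; check modeling_mitre.py for the real signature
outputs = model(input_ids=src_tokens.cuda(), decoder_input_ids=tgt_tokens.cuda(), labels=labels.cuda())
outputs.loss.backward()  # assumes the forward returns an object carrying the loss
```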

## Languages covered

Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)

Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)

Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)

Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)

Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)

## BibTeX entry and citation info

```
place holder
```