zhiqu22 committed
Commit 6ff51cd · 1 Parent(s): 56af8f8

update readme

Files changed (1): README.md +98 -0

README.md CHANGED
---
language:
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi

license: mit
pipeline_tag: translation
---
# MITRE 913M

## Description
MITRE (multilingual translation with registers) is a multilingual decoder-only model trained for many-to-many translation.
The core technique, registering, is introduced in our [paper](url_placeholder).
This repository lets you use our pre-trained model for inference. If you want to reproduce the data mining and training, please refer to this [repository](url_placeholder).

The model translates directly across 552 directions of 24 languages, i.e., 24 × 23 ordered language pairs, spanning more than 5 language families.
You can use our models directly through the `transformers` library.
MITRE also has a version with 466M parameters, which can be found in this [repository](https://huggingface.co/naist-nlp/mitre_466m).

## Usage
Before loading the tokenizer, run `pip install sentencepiece`.
You can load the tokenizer and the model as follows:
```python
from transformers import AutoModel, AutoTokenizer

# to use the smaller variant instead, switch the name to "naist-nlp/mitre_466m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
```

To use this model locally and inspect the code, clone this repository and run the following from the clone's parent directory:
```python
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
```

Once you have the model and tokenizer objects, you can run translation:
```python
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.eval()
model.cuda()  # the inputs below are moved to GPU, so the model must be too

# Translate one or more sentences into a single target language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translate one or more sentences into per-sentence target languages
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])

generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# expected results
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。

# For training:
# 1. The difference between tgt_tokens and labels is that the eos tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because 'tokenizer.encode_target_tokens_to_input_ids' contains pads.
# 3. See our code for the implementation details.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
```

## Languages covered
- Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
- Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
- Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
- Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
- Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)

## BibTeX entry and citation info
```
place holder
```