zhiqu22 committed
Commit 6ff51cd · 1 Parent(s): 56af8f8

update readme

Files changed (1): README.md +98 -0

README.md CHANGED
---
language:
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi

license: mit
pipeline_tag: translation
---
# MITRE 913M

## Description
MITRE (multilingual translation with registers) is a multilingual decoder-only model trained for many-to-many translation.
The core technique, registering, is introduced in our [paper](url_placeholder).
This repository lets you use our pre-trained model for inference. If you want to reproduce the data mining and training, please refer to this [repository](url_placeholder).

The model translates directly across 552 directions of 24 languages, i.e., 24 × 23 ordered language pairs, spanning more than 5 language families.
You can use our models directly through the `transformers` library.
MITRE also has a version with 466M parameters, which can be found in this [repository](https://huggingface.co/naist-nlp/mitre_466m).

## Usage
Before loading the tokenizer, run `pip install sentencepiece`.
You can load the tokenizer and the model as follows:
```python
from transformers import AutoModel, AutoTokenizer

# to use the smaller variant instead, switch the name to "naist-nlp/mitre_466m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
```

To use this model locally and inspect the code, clone this repository and run the following from the clone's parent directory:
```python
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
```

Once you have the model and tokenizer objects, you can run translation:
```python
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.eval()
model.cuda()  # the inputs below are moved to GPU, so the model must be too

# Translate one or more sentences into a single target language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translate one or more sentences into per-sentence target languages
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])

generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# expected results
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。

# For training:
# 1. The difference between tgt_tokens and labels is that the eos tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because 'tokenizer.encode_target_tokens_to_input_ids' contains pads.
# 3. See our code for the implementation details.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
```

## Languages covered
- Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
- Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
- Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
- Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
- Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)

## BibTeX entry and citation info
```
place holder
```