anzorq committed on
Commit
fa00b57
·
1 Parent(s): eb55f23

Update README.md

  1. README.md +70 -1
README.md CHANGED
@@ -81,4 +81,73 @@ The following hyperparameters were used during training:
81
  - Transformers 4.21.0
82
  - Pytorch 1.10.0+cu113
83
  - Datasets 2.4.0
84
- - Tokenizers 0.12.1
+ - Tokenizers 0.12.1
85
+
86
+ ---
87
+
88
+ # Model inference
89
+ ### 1. Install dependencies
90
+ ```bash
91
+ pip install transformers sentencepiece torch ctranslate2
92
+ ```
93
+
94
+ ### 2. Inference
95
+ #### Vanilla model
96
+ ```Python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ model_path = "anzorq/m2m100_418M_ft_ru-kbd_44K"
+ # M2M100 has no Kabardian code; "zu" is the target tag this example uses (presumably reused during fine-tuning).
+ tgt_lang = "zu"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
+
+ def translate(text, num_beams=4, num_return_sequences=4):
+     inputs = tokenizer(text, return_tensors="pt")
+     # generate() cannot return more sequences than there are beams
+     num_return_sequences = min(num_return_sequences, num_beams)
+
+     translated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
+         num_beams=num_beams,
+         num_return_sequences=num_return_sequences,
+     )
+
+     translations = [tokenizer.decode(translation, skip_special_tokens=True) for translation in translated_tokens]
+     return text, translations
+
+ # Test the translation
+ text = "Текст для перевода"
+ print(translate(text))
+ ```
120
+
121
+ #### CTranslate2 model (quantized model, much faster inference)
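The snippet in this section loads a CTranslate2 model directory named `ctranslate`. If that directory is not already available locally, one way to produce it is to convert the checkpoint with CTranslate2's `TransformersConverter`. This is a minimal sketch, not necessarily how the author built the model: the helper name `convert_to_ctranslate2` is hypothetical, and `int8` quantization is an assumption, since the README only says the model is quantized.

```python
def convert_to_ctranslate2(
    model_name="anzorq/m2m100_418M_ft_ru-kbd_44K",
    output_dir="ctranslate",   # matches the path passed to ctranslate2.Translator
    quantization="int8",       # assumption: the README does not name the quantization type
):
    """Convert a Transformers checkpoint into a CTranslate2 model directory."""
    # Imported lazily so that defining this helper does not require ctranslate2.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(model_name)
    # force=True overwrites output_dir if it already exists.
    return converter.convert(output_dir, quantization=quantization, force=True)

# Usage (downloads the checkpoint, so it needs network access):
# convert_to_ctranslate2()
```

Calling `convert_to_ctranslate2()` downloads the checkpoint and writes the converted model to `./ctranslate`, the path the code below passes to `ctranslate2.Translator`.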
122
+ ```Python
+ import ctranslate2
+ import transformers
+
+ translator = ctranslate2.Translator("ctranslate")  # Path to the CTranslate2 model directory
+ tokenizer = transformers.AutoTokenizer.from_pretrained("anzorq/m2m100_418M_ft_ru-kbd_44K")
+ # M2M100 has no Kabardian code; "zu" is the target tag this example uses (presumably reused during fine-tuning).
+ tgt_lang = "zu"
+
+ def translate(text, num_beams=4, num_return_sequences=4):
+     # CTranslate2 cannot return more hypotheses than the beam size
+     num_return_sequences = min(num_return_sequences, num_beams)
+
+     # CTranslate2 takes token strings rather than token IDs
+     source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
+     target_prefix = [tokenizer.lang_code_to_token[tgt_lang]]
+     results = translator.translate_batch(
+         [source],
+         target_prefix=[target_prefix],
+         beam_size=num_beams,
+         num_hypotheses=num_return_sequences,
+     )
+
+     translations = []
+     for hypothesis in results[0].hypotheses:
+         target = hypothesis[1:]  # Drop the leading language token from the prefix
+         decoded_sentence = tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
+         translations.append(decoded_sentence)
+
+     return text, translations
+
+ # Test the translation
+ text = "Текст для перевода"
+ print(translate(text))
+ ```
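How much faster the CTranslate2 path is depends on hardware, beam size, and quantization settings, so it is worth measuring on your own machine. The following is a minimal timing sketch. The helper `average_seconds` and the names `translate_vanilla` and `translate_ct2` are hypothetical; they stand for the two `translate` functions defined above, renamed so that both can live in one script.

```python
import time

def average_seconds(fn, text, repeats=5):
    """Return the mean wall-clock time in seconds of fn(text) over `repeats` calls."""
    fn(text)  # warm-up call so one-time setup cost is not measured
    start = time.perf_counter()
    for _ in range(repeats):
        fn(text)
    return (time.perf_counter() - start) / repeats

# Usage (hypothetical names for the two functions defined in this README):
# print(average_seconds(translate_vanilla, "Текст для перевода"))
# print(average_seconds(translate_ct2, "Текст для перевода"))
```

Comparing the two averages on the same input and beam settings gives a like-for-like estimate of the speed-up.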