metadata
license: mit
language:
- ru
- kbd
datasets:
- anzorq/kbd-ru
widget:
- text: Я иду домой.
  example_title: Я иду домой.
- text: Дети играют во дворе.
  example_title: Дети играют во дворе.
- text: Сколько тебе лет?
  example_title: Сколько тебе лет?
- text: На следующий день мы отправились в путь.
  example_title: На следующий день мы отправились в путь.
tags:
- translation
m2m100_ru_kbd_44K
This model is a fine-tuned version of facebook/m2m100_418M on a Russian-Kabardian (ru-kbd) dataset containing 44K sentences from books, textbooks, dictionaries, and similar sources. It achieves the following results on the evaluation set:
- Loss: 0.9399
- Bleu: 22.389
- Gen Len: 16.562
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
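To make `lr_scheduler_type: linear` concrete, the following minimal sketch computes the learning rate at a given optimizer step under a linear decay from 5e-05 to zero. It assumes no warmup (none is listed above) and a total of roughly 16,842 steps, which is only an estimate derived from the results table (16,000 steps at epoch 2.85); neither value is stated in this card, and `linear_lr` is a hypothetical helper name.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-05,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step`: optional linear warmup, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumption: about 16,842 optimizer steps over 3 epochs (estimated, not reported).
TOTAL_STEPS = 16_842

for step in (0, 8_421, 16_000, TOTAL_STEPS):
    print(step, f"{linear_lr(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the rate starts at the configured 5e-05, is halved at the midpoint, and reaches zero on the last step.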
Training results
Training Loss | Epoch | Step | Validation Loss | Bleu | Gen Len |
---|---|---|---|---|---|
2.2391 | 0.18 | 1000 | 1.9921 | 7.4066 | 16.377 |
1.8436 | 0.36 | 2000 | 1.6756 | 9.3443 | 18.428 |
1.63 | 0.53 | 3000 | 1.5361 | 10.9057 | 17.134 |
1.5205 | 0.71 | 4000 | 1.3994 | 12.6061 | 17.471 |
1.4471 | 0.89 | 5000 | 1.3107 | 14.4452 | 16.985 |
1.1915 | 1.07 | 6000 | 1.2462 | 15.1903 | 16.544 |
1.1165 | 1.25 | 7000 | 1.1917 | 16.3859 | 17.044 |
1.0654 | 1.43 | 8000 | 1.1351 | 17.617 | 16.481 |
1.0464 | 1.6 | 9000 | 1.0939 | 18.649 | 16.517 |
1.0376 | 1.78 | 10000 | 1.0603 | 18.2567 | 17.152 |
1.0027 | 1.96 | 11000 | 1.0184 | 20.6011 | 16.875 |
0.7741 | 2.14 | 12000 | 1.0159 | 20.4801 | 16.488 |
0.7566 | 2.32 | 13000 | 0.9899 | 21.6967 | 16.681 |
0.7346 | 2.49 | 14000 | 0.9738 | 21.8249 | 16.679 |
0.7397 | 2.67 | 15000 | 0.9555 | 21.569 | 16.608 |
0.6919 | 2.85 | 16000 | 0.9441 | 22.4658 | 16.493 |
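The table can also be inspected programmatically. The sketch below copies the step, epoch, validation loss, and BLEU columns from the table above and reports the best checkpoint by BLEU, the evaluations where BLEU dropped relative to the previous one, and an approximate number of steps per epoch. The helper names are hypothetical and not part of the training code.

```python
# (step, epoch, validation_loss, bleu), copied from the training results table
ROWS = [
    (1000, 0.18, 1.9921, 7.4066),
    (2000, 0.36, 1.6756, 9.3443),
    (3000, 0.53, 1.5361, 10.9057),
    (4000, 0.71, 1.3994, 12.6061),
    (5000, 0.89, 1.3107, 14.4452),
    (6000, 1.07, 1.2462, 15.1903),
    (7000, 1.25, 1.1917, 16.3859),
    (8000, 1.43, 1.1351, 17.617),
    (9000, 1.6, 1.0939, 18.649),
    (10000, 1.78, 1.0603, 18.2567),
    (11000, 1.96, 1.0184, 20.6011),
    (12000, 2.14, 1.0159, 20.4801),
    (13000, 2.32, 0.9899, 21.6967),
    (14000, 2.49, 0.9738, 21.8249),
    (15000, 2.67, 0.9555, 21.569),
    (16000, 2.85, 0.9441, 22.4658),
]


def best_by_bleu(rows):
    """Row with the highest BLEU score."""
    return max(rows, key=lambda row: row[3])


def bleu_regressions(rows):
    """Steps at which BLEU fell compared with the previous evaluation."""
    return [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[3] < prev[3]]


def steps_per_epoch(rows):
    """Approximate steps per epoch, from the last logged row."""
    step, epoch, _, _ = rows[-1]
    return step / epoch


print("best:", best_by_bleu(ROWS))
print("BLEU regressions at steps:", bleu_regressions(ROWS))
print("steps per epoch (approx.):", round(steps_per_epoch(ROWS)))
```

With the listed batch size of 8, and assuming a single device with no gradient accumulation (neither is stated), about 5,600 steps per epoch corresponds to roughly 45K training examples, which is in line with the 44K figure in the description.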
Framework versions
- Transformers 4.21.0
- Pytorch 1.10.0+cu113
- Datasets 2.4.0
- Tokenizers 0.12.1
Model inference
1. Install dependencies
pip install transformers sentencepiece torch ctranslate2
2. Inference
CTranslate2 model (quantized model, much faster inference)
First, download the files for the model in ctranslate2 format:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='anzorq/m2m100_418M_ft_ru-kbd_44K', subfolder='ctranslate2', filename='config.json', local_dir='./')
hf_hub_download(repo_id='anzorq/m2m100_418M_ft_ru-kbd_44K', subfolder='ctranslate2', filename='model.bin', local_dir='./')
hf_hub_download(repo_id='anzorq/m2m100_418M_ft_ru-kbd_44K', subfolder='ctranslate2', filename='sentencepiece.bpe.model', local_dir='./')
hf_hub_download(repo_id='anzorq/m2m100_418M_ft_ru-kbd_44K', subfolder='ctranslate2', filename='shared_vocabulary.json', local_dir='./')
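Before loading the model, it can help to confirm that all four files ended up in the `ctranslate2` directory. The check below is a small sketch using only the standard library; the file names come from the download calls above, while `missing_model_files` is a hypothetical helper and the default directory assumes the downloads were made with `local_dir='./'`.

```python
from pathlib import Path

# File names taken from the download calls above.
REQUIRED_FILES = (
    "config.json",
    "model.bin",
    "sentencepiece.bpe.model",
    "shared_vocabulary.json",
)


def missing_model_files(model_dir="ctranslate2"):
    """Return the required file names that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_model_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All CTranslate2 model files are present.")
```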
Run inference:
import ctranslate2
import transformers

# Path to the directory that holds the downloaded CTranslate2 files
translator = ctranslate2.Translator("ctranslate2")
tokenizer = transformers.AutoTokenizer.from_pretrained("anzorq/m2m100_418M_ft_ru-kbd_44K")
# Target language code; "zu" appears to stand in for Kabardian, which M2M100 has no code for
tgt_lang = "zu"

def translate(text, num_beams=4, num_return_sequences=4):
    # Beam search cannot return more hypotheses than it has beams
    num_return_sequences = min(num_return_sequences, num_beams)
    # CTranslate2 works on token strings rather than token ids
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    target_prefix = [tokenizer.lang_code_to_token[tgt_lang]]
    results = translator.translate_batch(
        [source],
        target_prefix=[target_prefix],
        beam_size=num_beams,
        num_hypotheses=num_return_sequences,
    )
    translations = []
    for hypothesis in results[0].hypotheses:
        target = hypothesis[1:]  # drop the leading target-language token
        decoded_sentence = tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
        translations.append(decoded_sentence)
    return text, translations

# Test the translation
text = "Текст для перевода"  # "Text to translate"
print(translate(text))
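The function above translates one sentence per call. Since `translate_batch` accepts a list of tokenized sources, several sentences can be sent in a single call. The following is a sketch of that variant, not part of the original card: `translate_many` is a hypothetical name, it returns only the best hypothesis for each input, and it takes the translator and tokenizer as arguments so it does not rely on global state.

```python
def translate_many(texts, translator, tokenizer, tgt_lang="zu", num_beams=4):
    """Translate several sentences with one translate_batch call.

    `translator` is expected to be a ctranslate2.Translator and `tokenizer`
    the Hugging Face tokenizer for this model; both are passed in explicitly.
    """
    # Tokenize every input into token strings, as CTranslate2 expects.
    sources = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    # Every sentence gets the same target-language prefix.
    prefix = [tokenizer.lang_code_to_token[tgt_lang]]
    results = translator.translate_batch(
        sources,
        target_prefix=[prefix] * len(sources),
        beam_size=num_beams,
    )
    translations = []
    for result in results:
        tokens = result.hypotheses[0][1:]  # best hypothesis, minus the language token
        translations.append(tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens)))
    return translations


# Example call, assuming `translator` and `tokenizer` were created as shown earlier:
# print(translate_many(["Я иду домой.", "Сколько тебе лет?"], translator, tokenizer))
```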
Vanilla model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "anzorq/m2m100_418M_ft_ru-kbd_44K"
# Target language code; "zu" appears to stand in for Kabardian, which M2M100 has no code for
tgt_lang = "zu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

def translate(text, num_beams=4, num_return_sequences=4):
    inputs = tokenizer(text, return_tensors="pt")
    # Beam search cannot return more sequences than it has beams
    num_return_sequences = min(num_return_sequences, num_beams)
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        num_beams=num_beams,
        num_return_sequences=num_return_sequences,
    )
    translations = [
        tokenizer.decode(translation, skip_special_tokens=True)
        for translation in translated_tokens
    ]
    return text, translations

# Test the translation
text = "Текст для перевода"  # "Text to translate"
print(translate(text))
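The card describes the CTranslate2 variant as much faster but gives no figures. One way to check this on your own hardware is a small timing harness such as the sketch below. It uses only the standard library; `time_translate` is a hypothetical helper, and the results will depend on the device, the beam size, and the input length.

```python
import time


def time_translate(translate_fn, texts, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` passes through `texts`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            translate_fn(text)
        best = min(best, time.perf_counter() - start)
    return best


# Example, assuming each `translate` function was bound to its own name first
# (translate_ct2 and translate_hf are hypothetical names):
# samples = ["Я иду домой.", "Дети играют во дворе.", "Сколько тебе лет?"]
# print("CTranslate2:", time_translate(translate_ct2, samples))
# print("Transformers:", time_translate(translate_hf, samples))
```

Both snippets above define a function called `translate`, so the second definition replaces the first if they are run in the same session; assign each to a separate name before comparing them.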