metadata
license: mit
language:
  - ru
  - kbd
datasets:
  - anzorq/kbd-ru
widget:
  - text: Я иду домой.
    example_title: Я иду домой.
  - text: Дети играют во дворе.
    example_title: Дети играют во дворе.
  - text: Сколько тебе лет?
    example_title: Сколько тебе лет?
  - text: На следующий день мы отправились в путь.
    example_title: На следующий день мы отправились в путь.
tags:
  - translation

m2m100_ru_kbd_44K

This model is a fine-tuned version of facebook/m2m100_418M on a Russian-Kabardian (ru-kbd) dataset containing 44K sentences from books, textbooks, dictionaries, etc. It achieves the following results on the evaluation set:

  • Loss: 0.9399
  • Bleu: 22.389
  • Gen Len: 16.562

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3.0
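To make the schedule concrete, here is a minimal sketch of how a linear scheduler would decay the base learning rate listed above. It assumes no warmup steps (none are listed) and a hypothetical total of 16,800 optimizer steps, which is roughly what 3 epochs come to judging by the training results table. `linear_lr` is an illustrative helper, not a Transformers function.

```python
BASE_LR = 5e-05        # learning_rate from the hyperparameter list
TOTAL_STEPS = 16_800   # assumption: approximate number of optimizer steps in 3 epochs
WARMUP_STEPS = 0       # assumption: no warmup is listed in this card


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS):
    """Learning rate at `step`: linear warmup (if any), then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the learning rate at the start, quarter points, and end of training
for step in (0, 4_200, 8_400, 12_600, 16_800):
    print(f"step {step:>6}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the learning rate falls in a straight line from 5e-05 at step 0 to zero at the final step.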

Training results

| Training Loss | Epoch | Step | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|---|
| 2.2391 | 0.18 | 1000 | 1.9921 | 7.4066 | 16.377 |
| 1.8436 | 0.36 | 2000 | 1.6756 | 9.3443 | 18.428 |
| 1.63 | 0.53 | 3000 | 1.5361 | 10.9057 | 17.134 |
| 1.5205 | 0.71 | 4000 | 1.3994 | 12.6061 | 17.471 |
| 1.4471 | 0.89 | 5000 | 1.3107 | 14.4452 | 16.985 |
| 1.1915 | 1.07 | 6000 | 1.2462 | 15.1903 | 16.544 |
| 1.1165 | 1.25 | 7000 | 1.1917 | 16.3859 | 17.044 |
| 1.0654 | 1.43 | 8000 | 1.1351 | 17.617 | 16.481 |
| 1.0464 | 1.6 | 9000 | 1.0939 | 18.649 | 16.517 |
| 1.0376 | 1.78 | 10000 | 1.0603 | 18.2567 | 17.152 |
| 1.0027 | 1.96 | 11000 | 1.0184 | 20.6011 | 16.875 |
| 0.7741 | 2.14 | 12000 | 1.0159 | 20.4801 | 16.488 |
| 0.7566 | 2.32 | 13000 | 0.9899 | 21.6967 | 16.681 |
| 0.7346 | 2.49 | 14000 | 0.9738 | 21.8249 | 16.679 |
| 0.7397 | 2.67 | 15000 | 0.9555 | 21.569 | 16.608 |
| 0.6919 | 2.85 | 16000 | 0.9441 | 22.4658 | 16.493 |
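The Epoch and Step columns also give a rough sanity check on the dataset size. The sketch below divides step by epoch to estimate the number of steps per epoch, then multiplies by the train batch size of 8. It assumes a single device and no gradient accumulation (neither is stated in this card), and the helper name is illustrative.

```python
TRAIN_BATCH_SIZE = 8  # train_batch_size from the hyperparameter list


def estimate_epoch_size(epoch, step, batch_size=TRAIN_BATCH_SIZE):
    """Return (steps per epoch, training examples per epoch) implied by one table row."""
    steps_per_epoch = step / epoch
    return steps_per_epoch, steps_per_epoch * batch_size


# Last row of the table: epoch 2.85 at step 16000.
# Epoch values are rounded to two decimals, so later rows give tighter estimates.
steps, examples = estimate_epoch_size(2.85, 16000)
print(f"about {steps:.0f} steps per epoch, about {examples:.0f} training examples")
```

The estimate lands in the mid-40-thousands, which is in line with the 44K sentences mentioned at the top of the card.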

Framework versions

  • Transformers 4.21.0
  • Pytorch 1.10.0+cu113
  • Datasets 2.4.0
  • Tokenizers 0.12.1

Model inference

1. Install dependencies

pip install transformers sentencepiece torch ctranslate2

2. Inference

CTranslate2 model (quantized model, much faster inference)

First, download the files for the model in ctranslate2 format:

from huggingface_hub import hf_hub_download

# Download each CTranslate2 model file into ./ctranslate2/
repo_id = 'anzorq/m2m100_418M_ft_ru-kbd_44K'
for filename in ['config.json', 'model.bin', 'sentencepiece.bpe.model', 'shared_vocabulary.json']:
    hf_hub_download(repo_id=repo_id, subfolder='ctranslate2', filename=filename, local_dir='./')

Run inference:

import ctranslate2
import transformers

# Path to the directory that holds the downloaded CTranslate2 model files
translator = ctranslate2.Translator("ctranslate2")
tokenizer = transformers.AutoTokenizer.from_pretrained("anzorq/m2m100_418M_ft_ru-kbd_44K")
# Target language code used by this model. M2M100 has no "kbd" code,
# so "zu" presumably stands in for Kabardian.
tgt_lang = "zu"

def translate(text, num_beams=4, num_return_sequences=4):
    # CTranslate2 cannot return more hypotheses than the beam size
    num_return_sequences = min(num_return_sequences, num_beams)

    # CTranslate2 expects token strings rather than token ids
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    # Force decoding to start with the target language token
    target_prefix = [tokenizer.lang_code_to_token[tgt_lang]]
    results = translator.translate_batch(
        [source],
        target_prefix=[target_prefix],
        beam_size=num_beams,
        num_hypotheses=num_return_sequences,
    )

    translations = []
    for hypothesis in results[0].hypotheses:
        # Drop the leading language token before decoding
        target = hypothesis[1:]
        decoded_sentence = tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
        translations.append(decoded_sentence)

    return text, translations

# Test the translation
text = "Текст для перевода"
print(translate(text))
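Since `translate_batch` takes a list of tokenized sources, several sentences can be sent in one call. The following is a minimal sketch of that idea, not part of the original card: `translate_many` is a hypothetical helper that takes the `translator` and `tokenizer` objects created above as arguments and returns only the top hypothesis for each input.

```python
def translate_many(texts, translator, tokenizer, tgt_lang="zu", beam_size=4):
    """Translate several sentences in one translate_batch call (best hypothesis only)."""
    if not texts:
        return []

    # CTranslate2 expects token strings, one list per input sentence
    sources = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    # Every sentence gets the same target language token as its decoding prefix
    prefix = [tokenizer.lang_code_to_token[tgt_lang]]
    results = translator.translate_batch(
        sources,
        target_prefix=[prefix] * len(sources),
        beam_size=beam_size,
    )

    translations = []
    for result in results:
        tokens = result.hypotheses[0][1:]  # drop the leading language token
        translations.append(tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens)))
    return translations
```

With the objects from the snippet above, a call would look like `translate_many(["Я иду домой.", "Сколько тебе лет?"], translator, tokenizer)`.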

Vanilla model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "anzorq/m2m100_418M_ft_ru-kbd_44K"
# Target language code used by this model. M2M100 has no "kbd" code,
# so "zu" presumably stands in for Kabardian.
tgt_lang = "zu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

def translate(text, num_beams=4, num_return_sequences=4):
    inputs = tokenizer(text, return_tensors="pt")
    # generate() cannot return more sequences than the number of beams
    num_return_sequences = min(num_return_sequences, num_beams)

    translated_tokens = model.generate(
        **inputs,
        # Force the first generated token to be the target language token
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        num_beams=num_beams,
        num_return_sequences=num_return_sequences,
    )

    translations = [tokenizer.decode(translation, skip_special_tokens=True) for translation in translated_tokens]
    return text, translations

# Test the translation
text = "Текст для перевода"
print(translate(text))