---
library_name: transformers
datasets:
- AigizK/bashkir-russian-parallel-corpora
language:
- ru
- ba
metrics:
- bleu
pipeline_tag: translation
widget:
- text: "башкирский-русский: Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап."
  example_title: "Bashkir-Russian translation"
---

### Model Description

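[t5-small](https://huggingface.co/google-t5/t5-small) from the Google T5 repository, fine-tuned on the [Russian-Bashkir parallel corpora](https://huggingface.co/datasets/AigizK/bashkir-russian-parallel-corpora). The translation direction is selected with a text prefix: `башкирский-русский: ` for Bashkir to Russian, `русский-башкирский: ` for Russian to Bashkir.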

#### Metrics

BLEU: 0.3018

chrF: 0.5478


#### Run inference

Use the example below*:

```python
from typing import List, Union

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5TokenizerFast


@torch.inference_mode()
def infer(
        model: T5ForConditionalGeneration,
        tokenizer: Union[T5TokenizerFast, T5Tokenizer],
        device: str,
        texts: List[str],
        target_language: str,
        max_length: int = 256
    ) -> List[str]:
    assert target_language in ("русский", "башкирский"), "target language must be in (русский, башкирский)"
    # The task prefix names the source and the target language.
    if target_language == "русский":
        prefix = "башкирский-русский: "
    else:
        prefix = "русский-башкирский: "
    # Capitalize the first letter and make sure every text ends with a period.
    # Slicing with [:1] keeps empty strings from raising an IndexError.
    text_with_prefix = [
        prefix + text[:1].upper() + text[1:] + ("" if text.endswith(".") else ".")
        for text in texts
        ]
    inputs = tokenizer(
                text_with_prefix,
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_tensors="pt"
                )
    # The model and the inputs must live on the same device.
    model.to(device)
    model.eval()
    outputs = model.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device))
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    tokenizer = T5Tokenizer.from_pretrained("zhursvlevy/t5-small-bashkir-russian")
    model = T5ForConditionalGeneration.from_pretrained("zhursvlevy/t5-small-bashkir-russian")

    input_text = "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап"
    # Reference translation, kept for comparison with the model output.
    output_text = "Ответ на главный вопрос жизни, Вселенной и всего такого"

    print(infer(model, tokenizer, "cpu", [input_text], "русский"))
```

*The widget may not work correctly due to the default pipeline.
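
The input format can be checked without downloading the model. The sketch below mirrors the prefixing and normalization done in `infer` above. `build_prompt` and `PREFIXES` are hypothetical names introduced here for illustration and are not part of the repository.

```python
from typing import Dict

# Task prefixes keyed by target language, copied from the inference example.
PREFIXES: Dict[str, str] = {
    "русский": "башкирский-русский: ",
    "башкирский": "русский-башкирский: ",
}


def build_prompt(text: str, target_language: str) -> str:
    """Hypothetical helper: return the prefixed, normalized model input."""
    if target_language not in PREFIXES:
        raise ValueError(f"target language must be one of {tuple(PREFIXES)}")
    # Capitalize the first character and ensure a trailing period.
    normalized = text[:1].upper() + text[1:]
    if not normalized.endswith("."):
        normalized += "."
    return PREFIXES[target_language] + normalized


if __name__ == "__main__":
    print(build_prompt("тормоштоң төп һорауына яуап", "русский"))
```

Printing the prompt this way may help when debugging the widget or a custom pipeline, since it shows the exact string the tokenizer receives.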