|
--- |
|
language: |
|
- ru |
|
- myv |
|
tags: |
|
- erzya |
|
- mordovian |
|
- fill-mask |
|
- pretraining |
|
- embeddings |
|
- masked-lm |
|
- feature-extraction |
|
- sentence-similarity |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- slone/myv_ru_2022 |
|
--- |
|
|
|
This is an Erzya (`myv`, Cyrillic script) sentence encoder from the paper [The first neural machine translation system for the Erzya language](https://arxiv.org/abs/2209.09368).
|
|
|
It is based on [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) ([license here](https://tfhub.dev/google/LaBSE/2)), but with an updated vocabulary and a further-trained checkpoint:
|
- Removed all tokens except the most frequent English and Russian ones;

- Added new tokens for the Erzya language;
|
- Fine-tuned on the [slone/myv_ru_2022](https://huggingface.co/slone/myv_ru_2022) corpus using a mixture of tasks: |
|
  - Cross-lingual distillation of sentence embeddings from the original LaBSE model, using the parallel `ru-myv` corpus (a sketch of this objective follows the list);
|
- Masked language modelling on `myv` monolingual data; |
|
- Sentence pair classification to distinguish correct `ru-myv` translations from random pairs. |
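
As a rough illustration of the first objective above, the sketch below shows what cross-lingual distillation could look like. This is a hypothetical minimal example, not the authors' training script: the batch contents, learning rate, and plain MSE loss are all assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical sketch of the distillation objective; names and
# hyperparameters are illustrative, not the actual training code.
teacher_tok = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
teacher = AutoModel.from_pretrained("sentence-transformers/LaBSE").eval()
student_tok = AutoTokenizer.from_pretrained("slone/LaBSE-en-ru-myv-v1")
student = AutoModel.from_pretrained("slone/LaBSE-en-ru-myv-v1")
optimizer = torch.optim.Adam(student.parameters(), lr=1e-5)

def embed(model, tok, texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return torch.nn.functional.normalize(model(**batch).pooler_output)

# One parallel ru-myv pair from the corpus (illustrative)
ru_batch = ["Привет Мир"]
myv_batch = ["Шумбратадо Мастор"]

optimizer.zero_grad()
with torch.no_grad():
    target = embed(teacher, teacher_tok, ru_batch)  # teacher embeds Russian
pred = embed(student, student_tok, myv_batch)       # student embeds Erzya
loss = torch.nn.functional.mse_loss(pred, target)   # pull the two together
loss.backward()
optimizer.step()
```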
|
|
|
The model can be used as a sentence encoder or a masked language modelling predictor for Erzya, or fine-tuned for any downstream NLU task.
|
|
|
Sentence embeddings can be produced with the code below: |
|
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("slone/LaBSE-en-ru-myv-v1")
model = AutoModel.from_pretrained("slone/LaBSE-en-ru-myv-v1")

sentences = ["Hello World", "Привет Мир", "Шумбратадо Мастор"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
# Use the [CLS] pooler output as the sentence embedding and L2-normalize it,
# as in the original LaBSE
embeddings = model_output.pooler_output
embeddings = torch.nn.functional.normalize(embeddings)
print(embeddings.shape)  # torch.Size([3, 768])
```
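
Because the embeddings are L2-normalized, cosine similarities between sentences reduce to plain dot products. Continuing from the snippet above:

```python
# Pairwise cosine similarities between the 3 sentences (a 3 x 3 matrix);
# the parallel "Hello World" translations should score close to each other.
similarity = embeddings @ embeddings.T
print(similarity)
```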
|
|
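For masked language modelling, the standard `fill-mask` pipeline can be used. A minimal sketch, assuming the checkpoint ships with its pretrained MLM head (otherwise `transformers` will warn that the head is newly initialized); the example sentence is illustrative:

```python
from transformers import pipeline

# LaBSE uses a BERT-style tokenizer, so the mask token is [MASK]
fill = pipeline("fill-mask", model="slone/LaBSE-en-ru-myv-v1")
for prediction in fill("Шумбратадо [MASK]"):
    print(prediction["token_str"], prediction["score"])
```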
|
|
|
|