This is an Erzya (myv, Cyrillic script) sentence encoder from the paper "The first neural machine translation system for the Erzya language".

It is based on sentence-transformers/LaBSE (license here), but with an updated vocabulary and checkpoint:

  • Removed all tokens except the most popular ones for English or Russian;
  • Added extra tokens for the Erzya language;
  • Fine-tuned on the slone/myv_ru_2022 corpus using a mixture of tasks:
    • Cross-lingual distillation of sentence embeddings from the original LaBSE model, using the parallel ru-myv corpus;
    • Masked language modelling on myv monolingual data;
    • Sentence pair classification to distinguish correct ru-myv translations from random pairs.
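
To illustrate the first task in the list above, here is a minimal sketch of a cross-lingual distillation objective. It assumes the student's embedding of an Erzya sentence is pulled towards the frozen teacher's embedding of its Russian translation. The exact loss used for this model is not stated here, so the squared distance on L2-normalized embeddings, the function name `distillation_loss`, and the random stand-in tensors are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_myv: torch.Tensor, teacher_ru: torch.Tensor) -> torch.Tensor:
    """Hypothetical objective: mean squared distance between normalized embeddings."""
    student = F.normalize(student_myv, dim=-1)
    # The teacher (original LaBSE) is treated as frozen, so no gradient flows into it.
    teacher = F.normalize(teacher_ru, dim=-1).detach()
    return ((student - teacher) ** 2).sum(dim=-1).mean()


# Stand-in tensors in place of real model outputs (batch of 4, 768 dimensions).
torch.manual_seed(0)
teacher_ru = torch.randn(4, 768)                       # teacher embeddings of Russian sentences
student_myv = torch.randn(4, 768, requires_grad=True)  # student embeddings of Erzya translations

loss = distillation_loss(student_myv, teacher_ru)
loss.backward()  # gradients reach only the student side
print(float(loss))
```

In an actual training loop the stand-in tensors would be replaced by the pooled outputs of the two models on a batch of parallel ru-myv sentences, and this term would be combined with the other two task losses.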

The model can be used as a sentence encoder or a masked language modelling predictor for Erzya, or fine-tuned for any downstream NLU task.

Sentence embeddings can be produced with the code below:

import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("slone/LaBSE-en-ru-myv-v1")
model = AutoModel.from_pretrained("slone/LaBSE-en-ru-myv-v1")

sentences = ["Hello World", "Привет Мир", "Шумбратадо Мастор"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradient tracking is switched off
with torch.no_grad():
    model_output = model(**encoded_input)

# Take the pooled output as the sentence embedding and L2-normalize each row
embeddings = model_output.pooler_output
embeddings = torch.nn.functional.normalize(embeddings)
print(embeddings.shape)  # torch.Size([3, 768])
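
Because the embeddings are L2-normalized, the cosine similarity between two sentences is just the dot product of their rows. The sketch below shows that step in isolation. To stay runnable without downloading the model, it uses a random stand-in tensor of the same shape. The helper name `cosine_similarity_matrix` is hypothetical and not part of any library.

```python
import torch


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between the rows of `embeddings`."""
    # Normalizing again is harmless if the rows are already unit length.
    normalized = torch.nn.functional.normalize(embeddings, dim=-1)
    return normalized @ normalized.T


# Stand-in for the [3, 768] tensor produced by the model.
torch.manual_seed(0)
embeddings = torch.randn(3, 768)

similarities = cosine_similarity_matrix(embeddings)
print(similarities.shape)  # torch.Size([3, 3])
```

With real model output, entry `[i, j]` of the matrix scores how close sentence `i` is to sentence `j`, which gives a simple way to check whether an Erzya sentence and a candidate translation land near each other.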