This is a text pair classifier, trained to predict whether a Bashkir sentence and a Russian sentence have the same meaning.

It can be used for filtering parallel corpora or evaluating machine translation quality.

To score sentence pairs, load the model and its tokenizer and take the softmax probability of the second class, like this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

clf_name = 'slone/bert-base-multilingual-cased-bak-rus-similarity'
clf = AutoModelForSequenceClassification.from_pretrained(clf_name)
clf_tokenizer = AutoTokenizer.from_pretrained(clf_name)

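def classify(texts_ba, texts_ru):
    """Score each (Bashkir, Russian) sentence pair; returns a NumPy array of values in [0, 1]."""
    with torch.inference_mode():
        # Tokenize each pair jointly and move the batch to the model's device.
        batch = clf_tokenizer(
            texts_ba, texts_ru,
            padding=True, truncation=True, max_length=512, return_tensors='pt',
        ).to(clf.device)
        # Softmax over the two output classes; column 1 is the score printed below.
        return torch.softmax(clf(**batch).logits.view(-1, 2), -1)[:, 1].cpu().numpy()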

print(classify(['Сәләм, ғаләм!', 'Хәйерле көн, тыныслыҡ.'], ['Привет, мир!', 'Мама мыла раму.']))
# [0.96345973 0.02213471]

For most "good" sentence pairs, these scores are above 0.5.
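
As a rough illustration of the corpus-filtering use case, the sketch below keeps only the pairs whose score reaches a threshold. The helper name `filter_parallel_corpus`, its `score_fn` parameter, and the default `batch_size` of 32 are assumptions made for this example and are not part of the model's API; the 0.5 default threshold follows the rule of thumb above and may need tuning for a particular corpus.

```python
from typing import Callable, List, Sequence, Tuple


def filter_parallel_corpus(
    pairs: Sequence[Tuple[str, str]],
    score_fn: Callable[[List[str], List[str]], Sequence[float]],
    threshold: float = 0.5,
    batch_size: int = 32,
) -> List[Tuple[str, str, float]]:
    """Keep (Bashkir, Russian) pairs whose score is at or above `threshold`.

    `score_fn` is any function with the same call shape as `classify`:
    it takes two equally long lists of sentences and returns one score per pair.
    """
    kept = []
    # Score the corpus in batches so that long corpora do not exhaust memory.
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        texts_ba = [ba for ba, _ in batch]
        texts_ru = [ru for _, ru in batch]
        scores = score_fn(texts_ba, texts_ru)
        for (ba, ru), score in zip(batch, scores):
            if score >= threshold:
                kept.append((ba, ru, float(score)))
    return kept
```

With the model loaded as in the example above, a call such as `filter_parallel_corpus(pairs, classify)` would return the retained pairs together with their scores, where `pairs` is a list of `(bashkir_sentence, russian_sentence)` tuples.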

Model size: 178M params (Safetensors; tensor types: I64, F32)

Dataset used to train slone/bert-base-multilingual-cased-bak-rus-similarity