|
--- |
|
language: |
|
- ru |
|
- myv |
|
tags: |
|
- erzya |
|
- mordovian |
|
- fill-mask |
|
- pretraining |
|
- embeddings |
|
- masked-lm |
|
- feature-extraction |
|
- sentence-similarity |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- slone/myv_ru_2022 |
|
--- |
|
|
|
This is an Erzya (`myv`, Cyrillic script) sentence encoder from the paper [The first neural machine translation system for the Erzya language](https://arxiv.org/abs/2209.09368).
|
|
|
It is based on [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) ([license here](https://tfhub.dev/google/LaBSE/2)), but with an updated vocabulary and a further-trained checkpoint:
|
- Removed all tokens except the most frequent English and Russian ones;

- Added new tokens for the Erzya language;
|
- Fine-tuned on the [slone/myv_ru_2022](https://huggingface.co/slone/myv_ru_2022) corpus using a mixture of tasks: |
|
  - Cross-lingual distillation of sentence embeddings from the original LaBSE model, using the parallel `ru-myv` corpus (a sketch of this objective follows the list);
|
- Masked language modelling on `myv` monolingual data; |
|
- Sentence pair classification to distinguish correct `ru-myv` translations from random pairs. |
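
As a rough illustration of the first objective above, the sketch below shows what cross-lingual distillation could look like. This is a hypothetical minimal example, not the authors' training script: the batch contents, learning rate, and plain MSE loss are all assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical sketch of the distillation objective; names and
# hyperparameters are illustrative, not the actual training code.
teacher_tok = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
teacher = AutoModel.from_pretrained("sentence-transformers/LaBSE").eval()
student_tok = AutoTokenizer.from_pretrained("slone/LaBSE-en-ru-myv-v1")
student = AutoModel.from_pretrained("slone/LaBSE-en-ru-myv-v1")
optimizer = torch.optim.Adam(student.parameters(), lr=1e-5)

def embed(model, tok, texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return torch.nn.functional.normalize(model(**batch).pooler_output)

# One parallel ru-myv pair from the corpus (illustrative)
ru_batch = ["Привет Мир"]
myv_batch = ["Шумбратадо Мастор"]

optimizer.zero_grad()
with torch.no_grad():
    target = embed(teacher, teacher_tok, ru_batch)  # teacher embeds Russian
pred = embed(student, student_tok, myv_batch)       # student embeds Erzya
loss = torch.nn.functional.mse_loss(pred, target)   # pull the two together
loss.backward()
optimizer.step()
```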
|
|
|
The model can be used as a sentence encoder or a masked language modelling predictor for Erzya, or fine-tuned for any downstream NLU task.
|
|
|
Sentence embeddings can be produced with the code below: |
|
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("slone/LaBSE-en-ru-myv-v1")
model = AutoModel.from_pretrained("slone/LaBSE-en-ru-myv-v1")

sentences = ["Hello World", "Привет Мир", "Шумбратадо Мастор"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
# Use the [CLS] pooler output as the sentence embedding and L2-normalize it,
# as in the original LaBSE
embeddings = model_output.pooler_output
embeddings = torch.nn.functional.normalize(embeddings)
print(embeddings.shape)  # torch.Size([3, 768])
```
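
Because the embeddings are L2-normalized, cosine similarities between sentences reduce to plain dot products. Continuing from the snippet above:

```python
# Pairwise cosine similarities between the 3 sentences (a 3 x 3 matrix);
# the parallel "Hello World" translations should score close to each other.
similarity = embeddings @ embeddings.T
print(similarity)
```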
|
|
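For masked language modelling, the standard `fill-mask` pipeline can be used. A minimal sketch, assuming the checkpoint ships with its pretrained MLM head (otherwise `transformers` will warn that the head is newly initialized); the example sentence is illustrative:

```python
from transformers import pipeline

# LaBSE uses a BERT-style tokenizer, so the mask token is [MASK]
fill = pipeline("fill-mask", model="slone/LaBSE-en-ru-myv-v1")
for prediction in fill("Шумбратадо [MASK]"):
    print(prediction["token_str"], prediction["score"])
```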
|
|
|
|