slone
/

LaBSE-en-ru-myv-v1

cointegrated committed on Sep 20, 2022

Commit

d34f966

·

1 Parent(s): 680da70

Create README.md

Files changed (1)

README.md +43 -0

README.md ADDED

	@@ -0,0 +1,43 @@

+---
+language:
+- ru
+- myv
+tags:
+- erzya
+- mordovian
+- fill-mask
+- pretraining
+- embeddings
+- masked-lm
+- feature-extraction
+- sentence-similarity
+license: cc-by-sa-4.0
+datasets:
+  - slone/myv_ru_2022
+---
+This is an Erzya (`myv`, Cyrillic script) sentence encoder from the paper "The first neural machine translation system for the Erzya language".
+It is based on [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) ([license here](https://tfhub.dev/google/LaBSE/2)), but with an updated vocabulary and checkpoint:
+- Removed all tokens except the most popular ones for English or Russian;
+- Added extra tokens for the Erzya language;
+- Fine-tuned on the [slone/myv_ru_2022](https://huggingface.co/slone/myv_ru_2022) corpus using a mixture of tasks:
+  - Cross-lingual distillation of sentence embeddings from the original LaBSE model, using the parallel `ru-myv` corpus;
+  - Masked language modelling on `myv` monolingual data;
+  - Sentence pair classification to distinguish correct `ru-myv` translations from random pairs.
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+
+# Load the tokenizer and encoder of this model (with the Erzya-extended vocabulary)
+tokenizer = AutoTokenizer.from_pretrained("slone/LaBSE-en-ru-myv-v1")
+model = AutoModel.from_pretrained("slone/LaBSE-en-ru-myv-v1")
+
+sentences = ["Hello World", "Привет Мир", "Шумбратадо Мастор"]
+encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
+with torch.no_grad():
+    model_output = model(**encoded_input)
+# Use the pooled output as the sentence embedding and L2-normalize it
+embeddings = model_output.pooler_output
+embeddings = torch.nn.functional.normalize(embeddings)
+print(embeddings.shape)  # torch.Size([3, 768])
+```
+The model can be used as a sentence encoder or fine-tuned for any downstream NLU task.
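
One way to use the model as a sentence encoder is to compare embeddings by cosine similarity; for L2-normalized vectors this reduces to a plain dot product. The following is a minimal sketch and is not part of the committed README: the helper names (`l2_normalize`, `cosine_similarity`, `most_similar`) and the toy 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional embeddings the model returns.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of unit vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def most_similar(query, candidates):
    """Return the index of the candidate closest to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors, NOT real model outputs: they only illustrate the ranking step.
toy_query = [0.9, 0.1, 0.0]
toy_candidates = [
    [0.0, 1.0, 0.0],  # points in a different direction
    [1.0, 0.2, 0.0],  # points in nearly the same direction
]
best = most_similar(toy_query, toy_candidates)
print(best, round(cosine_similarity(toy_query, toy_candidates[best]), 3))
```

With real embeddings, the same ranking can be applied to each row of the `embeddings` tensor, for example to retrieve the closest Russian sentence for an Erzya query.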