---
license: apache-2.0
language:
- ru
tags:
- distill
- fill-mask
- embeddings
- masked-lm
- tiny
- feature-extraction
- sentence-similarity
datasets:
- GEM/wiki_lingua
- xnli
- RussianNLP/wikiomnia
- mlsum
- IlyaGusev/gazeta
widget:
- text: Москва - <mask> России.
- text: Если б море было пивом, я бы <mask>
- text: Столица России - <mask>.
---

# ruRoberta-distilled

This model was distilled from [ai-forever/ruRoberta-large](https://huggingface.co/ai-forever/ruRoberta-large) with ❤️ by me, over 120 hours on 4 Nvidia V100 GPUs.

## Usage

```python
from transformers import pipeline

# Token-level embeddings via the feature-extraction pipeline
pipe = pipeline('feature-extraction', model='d0rj/ruRoberta-distilled')
tokens_embeddings = pipe('Привет, мир!')
```
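
Since the model was distilled with a masked-LM objective (see the `fill-mask` and `masked-lm` tags above), it can also fill in `<mask>` tokens directly. A minimal sketch, using one of the widget prompts:

```python
from transformers import pipeline

# Fill-mask sketch; the prompt is one of the widget examples above
fill = pipeline('fill-mask', model='d0rj/ruRoberta-distilled')
print(fill('Столица России - <mask>.'))  # ranked candidate completions with scores
```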

```python
import torch
from transformers import AutoTokenizer, AutoModel


tokenizer = AutoTokenizer.from_pretrained('d0rj/ruRoberta-distilled')
model = AutoModel.from_pretrained('d0rj/ruRoberta-distilled')


def embed_bert_cls(text: str) -> torch.Tensor:
    # Tokenize and move the inputs to the model's device
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(model.device)
    with torch.no_grad():
        model_output = model(**t)
    # Use the hidden state of the first (CLS-position) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu()


embedding = embed_bert_cls('Привет, мир!')
```
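
Because `embed_bert_cls` returns L2-normalized vectors, the dot product of two embeddings is their cosine similarity. A small usage sketch with made-up example sentences:

```python
# Hypothetical sentence-similarity check using the helper above
a = embed_bert_cls('Привет, мир!')
b = embed_bert_cls('Здравствуй, мир!')
print(torch.dot(a, b).item())  # cosine similarity, in [-1, 1]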

## Logs

See all training logs at [WandB](https://wandb.ai/d0rj/distill-ruroberta/runs/lehtr3bk/workspace).

## Configuration

Changes relative to the teacher model (they can be verified as shown after this list):

- Activation: GELU -> GELUFast
- Attention heads: 16 -> 8
- Hidden layers: 24 -> 6
- Weights size: 1.42 GB -> 464 MB
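
A minimal sketch for checking the layer and head counts from the published config, using the standard `transformers` API:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('d0rj/ruRoberta-distilled')
print(config.num_hidden_layers)    # expected: 6
print(config.num_attention_heads)  # expected: 8
```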

## Data

Overall: 9.4 GB of raw texts, 5.1 GB of binarized texts.

Used data (loadable as shown below the list):

- [GEM/wiki_lingua](https://huggingface.co/datasets/GEM/wiki_lingua)
- [xnli](https://huggingface.co/datasets/xnli)
- [RussianNLP/wikiomnia](https://huggingface.co/datasets/RussianNLP/wikiomnia)
- [mlsum](https://huggingface.co/datasets/mlsum)
- [IlyaGusev/gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta)
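
All of these are available on the Hugging Face Hub. A sketch of loading the Russian portions of two of them with 🤗 `datasets` (the `ru` config names are assumptions based on the standard `xnli`/`mlsum` dataset layouts):

```python
from datasets import load_dataset

# Russian configs of two of the corpora listed above
xnli_ru = load_dataset('xnli', 'ru')
mlsum_ru = load_dataset('mlsum', 'ru')
```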