File size: 3,050 Bytes
7b25245
4582162
 
7b25245
68d882b
7b25245
4582162
 
 
076004f
 
 
 
 
4bd9604
f5e5a7b
f366492
076004f
 
 
 
b4f3d09
 
 
 
 
 
 
 
 
 
 
 
076004f
 
4582162
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d568520
4582162
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
---
language:
- zh
license: mit
pipeline_tag: sentence-similarity
---

# SimCSE(sup)



## Model List
The evaluation dataset is in Chinese, and we used the same language model **RoBERTa large** on different methods.
|          Model          | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----------------------:|:------------:|:-----------:|:----------|:----------|:----------:|:----------:|
|  [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh)  |  78.61| -| -| -| -| -|
|  [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5)  |  79.07| -| -| -| -| -|
|  [hellonlp/simcse-large-zh](https://huggingface.co/hellonlp/simcse-roberta-large-zh)  |  81.32| -| -| -| -| -|



## Data List
The following data are all in Chinese.
|          Data          | Link | size(train) | size(valid) | size(test) |
|:-----------------------:|:------------:|:------------:|:------------:|:------------:|
|  STS-B  |  [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y)|  5231|  1458|  1361|
|  ATEC   |  [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c)| 62477|  20000|  20000|
|  BQ   |  [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9)|  100000|  10000|  10000|
|  LCQMC   |  [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w )|  238766| 8802| 12500|
|  PAWSX   |  [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn)| 49401| 2000| 2000|
|  SNLI   |  [SNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v)| 146828| 2699| 2618|
|  MNLI   |  [MNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte)| 122547| 2932| 2397|



## Uses
You can use our model for encoding sentences into embeddings
```python
import torch
from transformers import BertTokenizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity

# model
simcse_sup_path = "hellonlp/simcse-roberta-large-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)

def get_vector_simcse(sentence):
    """
    预测simcse的语义向量。
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    output = MODEL(input_ids)
    return output.last_hidden_state[:, 0].squeeze(0)

embeddings = get_vector_simcse("武汉是一个美丽的城市。")
print(embeddings.shape)
#torch.Size([1024])
```

You can also compute the cosine similarities between two sentences
```python
def get_similarity_two(sentence1, sentence2):
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    similarity_list = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity_list

sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1,sentence2)
print(result)
#0.848331
```