---
language:
- zh
license: mit
pipeline_tag: sentence-similarity
---

# SimCSE(sup)


## Data List
The following datasets are all in Chinese.

|          Data          | Size (train) | Size (valid) | Size (test) |
|:----------------------:|:----------:|:----------:|:----------:|
|   [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c)   |  62477|  20000|  20000|
|   [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9)     | 100000|  10000|  10000|
|   [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w)                                       | 238766|   8802|  12500|
|   [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn)  |  49401|   2000|   2000|
|   [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y)  |   5231|   1458|   1361|
|   [*SNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v)   | 146828|   2699|   2618|
|   [*MNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte)   | 122547|   2932|   2397|


## Model List
The evaluation datasets are in Chinese, and every method is built on the same language model, **RoBERTa base**. Because the test sets of some datasets are small, which can cause large variance in the measured accuracy, evaluation here uses the train, valid, and test splits together, and the final result is reported as a **weighted average (w-avg)**.

|          Model          | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----------------------:|:------------:|:-----------:|:----------:|:-------------:|:------------:|:----------:|
|  BERT-Whitening  |  65.27| -| -| -| -| -|
|  SimBERT   |  70.01| -| -| -| -| -|
|  SBERT-Whitening  |  71.75| -| -| -| -| -|
|  [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh)  |  78.61| -| -| -| -| -|
|  [hellonlp/simcse-base-zh(sup)](https://huggingface.co/hellonlp/simcse-roberta-base-zh)  |  **80.96**| -| -| -| -| -|
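
The weighted average mentioned above can be sketched as follows. This is a minimal illustration that assumes each split's score is weighted by its number of examples; the exact weighting behind the table is not stated here. The helper `weighted_average` and the per-split scores are hypothetical, and only the STS-B split sizes come from the Data List.

```python
def weighted_average(scores, sizes):
    """Average per-split scores, weighting each by its number of examples."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("sizes must not sum to zero")
    return sum(s * n for s, n in zip(scores, sizes)) / total

# STS-B split sizes from the Data List table (train, valid, test).
stsb_sizes = [5231, 1458, 1361]
# Hypothetical per-split scores, for illustration only (not reported results).
example_scores = [80.0, 82.0, 81.0]

w_avg = weighted_average(example_scores, stsb_sizes)
print(round(w_avg, 2))
```

Weighting by split size keeps a small split from swinging the final number as much as a plain mean of the three scores would.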


## Uses
You can use our model to encode sentences into embeddings:
```python
import torch
from transformers import BertTokenizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity

# model
simcse_sup_path = "hellonlp/simcse-roberta-base-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)

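def get_vector_simcse(sentence):
    """
    Compute the SimCSE sentence embedding (the [CLS] token vector).
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    with torch.no_grad():  # inference only, so skip gradient tracking
        output = MODEL(input_ids)
    return output.last_hidden_state[:, 0].squeeze(0)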

embeddings = get_vector_simcse("武汉是一个美丽的城市。")
print(embeddings.shape)
#torch.Size([768])
```

You can also compute the cosine similarity between two sentences:
```python
def get_similarity_two(sentence1, sentence2):
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    # cosine_similarity returns a 1x1 matrix here; extract the scalar
    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity

sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1, sentence2)
print(result) #0.7996
# Example similarity scores against '你好吗':
#(1.0, '你好吗')
#(0.8247, '你好不好')
#(0.8217, '你现在好吗')
#(0.7976, '你还好吗')
#(0.7918, '你好不好呢')
#(0.712, '你过的好吗')
#(0.6986, '你怎么样')
#(0.6693, '你')
#(0.5442, '你好个鬼')
#(0.4516, '你吃饭了吗')
#(0.4, '我好开心啊')
#(0.29, '我不开心')
#(0.2782, '我吃了一个苹果')
```
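
A ranked list like the one in the comments above could be produced with a small helper along these lines. This is a sketch, not the author's script: `rank_by_similarity`, `cosine`, and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in encoder (character counts over a tiny vocabulary) so that the example runs without downloading the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates, encode):
    """Return (score, sentence) pairs sorted from most to least similar.

    `encode` is any callable mapping a sentence to a list of floats.
    """
    query_vec = encode(query)
    scored = [(round(cosine(query_vec, encode(c)), 4), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy stand-in encoder (hypothetical): character counts over a tiny vocabulary.
def toy_encode(sentence, vocab="你好吗我"):
    return [float(sentence.count(ch)) for ch in vocab]

ranked = rank_by_similarity("你好吗", ["我", "你好", "你好吗"], toy_encode)
for pair in ranked:
    print(pair)
```

To rank with the actual model, pass `lambda s: get_vector_simcse(s).tolist()` as `encode` in place of `toy_encode`.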