---
language:
- zh
license: mit
pipeline_tag: sentence-similarity
---

# SimCSE(sup)

## Data List

The following datasets are all in Chinese.

| Dataset | Size (train) | Size (valid) | Size (test) |
|:----------------------:|:----------:|:----------:|:----------:|
| [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c) | 62477 | 20000 | 20000 |
| [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9) | 100000 | 10000 | 10000 |
| [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w) | 238766 | 8802 | 12500 |
| [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn) | 49401 | 2000 | 2000 |
| [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y) | 5231 | 1458 | 1361 |
| [*SNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v) | 146828 | 2699 | 2618 |
| [*MNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte) | 122547 | 2932 | 2397 |

## Model List

All evaluation datasets are in Chinese, and every method uses the same language model, **RoBERTa base**. Because the test sets of some datasets are small, which can cause large variance in the evaluation scores, the evaluation here uses the train, valid, and test splits together, and the final result is reported as a **weighted average (w-avg)**.

| Model | STS-B (w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----------------------:|:------------:|:-----------:|:----------|:-------------|:------------:|:----------:|
| BERT-Whitening | 65.27 | - | - | - | - | - |
| SimBERT | 70.01 | - | - | - | - | - |
| SBERT-Whitening | 71.75 | - | - | - | - | - |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | 78.61 | - | - | - | - | - |
| [hellonlp/simcse-base-zh(sup)](https://huggingface.co/hellonlp/simcse-roberta-base-zh) | **80.96** | - | - | - | - | - |

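As a rough illustration of the weighted average (w-avg) described above, the sketch below combines per-split scores into one number. It assumes each split is weighted by its number of examples, which the card does not state explicitly. The split sizes are the STS-B sizes from the Data List table; the per-split scores and the `weighted_average` helper are hypothetical and are not reported results.

```python
def weighted_average(scores, sizes):
    """Average per-split scores, weighting each split by its number of examples.

    Assumption: weights are the split sizes; the model card does not spell this out.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total


# STS-B split sizes (train, valid, test) taken from the Data List table
sizes = [5231, 1458, 1361]
# Hypothetical per-split scores, for illustration only (NOT reported results)
scores = [80.0, 82.0, 78.0]

w_avg = weighted_average(scores, sizes)
print(round(w_avg, 2))
```

With size-based weights, the large train split dominates the result, so a noisy score on a small test split moves the final number much less than it would in an unweighted mean.
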

## Uses

You can use the model to encode sentences into embeddings:

```python
import torch
from transformers import BertTokenizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load the tokenizer and the model
simcse_sup_path = "hellonlp/simcse-roberta-base-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)

def get_vector_simcse(sentence):
    """
    Return the SimCSE semantic embedding of a sentence.
    """
    # Encode the sentence to token ids and add a batch dimension
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    with torch.no_grad():  # gradients are not needed for inference
        output = MODEL(input_ids)
    # Use the last hidden state of the first ([CLS]) token as the sentence embedding
    return output.last_hidden_state[:, 0].squeeze(0)

embeddings = get_vector_simcse("武汉是一个美丽的城市。")
print(embeddings.shape)
# torch.Size([768])
```

You can also compute the cosine similarity between two sentences:

```python
def get_similarity_two(sentence1, sentence2):
    """Return the cosine similarity between the embeddings of two sentences."""
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    # cosine_similarity expects 2-D inputs and returns a 1x1 matrix here
    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity

sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1, sentence2)
print(result)  # 0.7996

# Example (similarity, sentence) pairs relative to '你好吗':
# (1.0, '你好吗')
# (0.8247, '你好不好')
# (0.8217, '你现在好吗')
# (0.7976, '你还好吗')
# (0.7918, '你好不好呢')
# (0.712, '你过的好吗')
# (0.6986, '你怎么样')
# (0.6693, '你')
# (0.5442, '你好个鬼')
# (0.4516, '你吃饭了吗')
# (0.4, '我好开心啊')
# (0.29, '我不开心')
# (0.2782, '我吃了一个苹果')
```
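
The commented output above lists (similarity, sentence) pairs in descending order, but the card does not show the code that produced it. One possible way to build such a ranking is sketched below. The `rank_by_similarity` helper and its `embed` parameter are hypothetical names introduced here; `embed` stands for any callable that maps a sentence to a 1-D vector. The demo uses toy 2-D vectors instead of the real model so that it runs without downloading anything.

```python
import numpy as np


def rank_by_similarity(query, candidates, embed):
    """Return (similarity, sentence) pairs sorted from most to least similar to `query`.

    `embed` is a hypothetical callable mapping a sentence to a 1-D vector.
    """
    q = np.asarray(embed(query), dtype=float)
    scored = []
    for sentence in candidates:
        v = np.asarray(embed(sentence), dtype=float)
        # Cosine similarity: dot product divided by the product of the norms
        sim = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((round(sim, 4), sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy embeddings standing in for real model output (illustration only)
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
ranking = rank_by_similarity("a", ["c", "b", "a"], toy_vectors.get)
for pair in ranking:
    print(pair)
```

To rank real sentences, pass `embed=lambda s: get_vector_simcse(s).tolist()` so that each embedding is converted to a plain list before the similarity is computed.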