General Text Embeddings (GTE) model. [Towards General Text Embeddings with Multi-stage Contrastive Learning](https://arxiv.org/abs/2308.03281)

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently offered in several sizes for both Chinese and English. The models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which enables them to be applied to various downstream text embedding tasks, including **information retrieval**, **semantic textual similarity**, and **text reranking**.

## Model List

| Models | Language | Max Sequence Length | Dimension | Model Size |
| :-----: | :-----: | :-----: | :-----: | :-----: |
| [GTE-large-zh](https://huggingface.co/thenlper/gte-large-zh) | Chinese | 512 | 1024 | 0.67GB |
| [GTE-base-zh](https://huggingface.co/thenlper/gte-base-zh) | Chinese | 512 | 768 | 0.21GB |
| [GTE-small-zh](https://huggingface.co/thenlper/gte-small-zh) | Chinese | 512 | 512 | 0.10GB |
| [GTE-large](https://huggingface.co/thenlper/gte-large) | English | 512 | 1024 | 0.67GB |
| [GTE-base](https://huggingface.co/thenlper/gte-base) | English | 512 | 768 | 0.21GB |
| [GTE-small](https://huggingface.co/thenlper/gte-small) | English | 512 | 384 | 0.10GB |
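
Any model in the table can be loaded by its Hugging Face ID. As a minimal sketch (the choice of GTE-small here is purely illustrative), the width of the returned embeddings should match the Dimension column:

```python
from sentence_transformers import SentenceTransformer

# Illustrative: any model ID from the table above can be substituted here.
model = SentenceTransformer("thenlper/gte-small")
embeddings = model.encode(["hello world"])
print(embeddings.shape)  # (1, 384), the Dimension listed for GTE-small
```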

## Metrics

We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark (CMTEB for Chinese). For more detailed comparison results, please refer to the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

- Evaluation results on CMTEB

| Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (35 datasets) | Classification (9 datasets) | Clustering (4 datasets) | Pair Classification (2 datasets) | Reranking (4 datasets) | Retrieval (8 datasets) | STS (8 datasets) |
| ------------------- | --------------- | -------------------- | --------------- | --------------------- | --------------------------- | ----------------------- | -------------------------------- | ---------------------- | ---------------------- | ---------------- |
| **gte-large-zh** | 0.65 | 1024 | 512 | **66.72** | 71.34 | 53.07 | 81.14 | 67.42 | 72.49 | 57.82 |
| gte-base-zh | 0.20 | 768 | 512 | 65.92 | 71.26 | 53.86 | 80.44 | 67.00 | 71.71 | 55.96 |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| bge-large-zh-v1.5 | 1.3 | 1024 | 512 | 64.53 | 69.13 | 48.99 | 81.6 | 65.84 | 70.46 | 56.25 |
| stella-base-zh-v2 | 0.21 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.96 | 66.1 | 70.08 | 56.92 |
| stella-base-zh | 0.21 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
| piccolo-large-zh | 0.65 | 1024 | 512 | 64.11 | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
| piccolo-base-zh | 0.2 | 768 | 512 | 63.66 | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
| gte-small-zh | 0.1 | 512 | 512 | 60.04 | 64.35 | 48.95 | 69.99 | 66.21 | 65.50 | 49.72 |
| bge-small-zh-v1.5 | 0.1 | 512 | 512 | 57.82 | 63.96 | 44.18 | 70.4 | 60.92 | 61.77 | 49.1 |
| m3e-base | 0.41 | 768 | 512 | 57.79 | 67.52 | 47.68 | 63.99 | 59.54 | 56.91 | 50.47 |
| text-embedding-ada-002 (OpenAI) | - | 1536 | 8192 | 53.02 | 64.31 | 45.68 | 69.56 | 54.28 | 52.0 | 43.35 |

## Usage

Code example:

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

input_texts = [
    "中国的首都是哪里",
    "你喜欢去哪里旅游",
    "北京",
    "今天中午吃什么"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")

# Tokenize the input texts (inputs longer than 512 tokens are truncated)
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
# The gte-*-zh models take the [CLS] token representation as the text embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings so that dot products equal cosine similarities
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
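
Because the embeddings are normalized, the printed scores are the cosine similarities between the first text (the query) and each of the remaining texts, scaled by 100; among the three candidates, "北京" should score highest for the query "中国的首都是哪里".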

Use with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('thenlper/gte-large-zh')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```

### Limitation

This model exclusively caters to Chinese texts, and any lengthy texts will be truncated to a maximum of 512 tokens.
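
If an input may exceed this limit, you can check its token count up front. A minimal sketch using the tokenizer (the 512 threshold mirrors the model's maximum sequence length; the text here is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")

text = "这是一段很长的中文文档。" * 300  # illustrative long input
n_tokens = len(tokenizer(text)["input_ids"])
if n_tokens > 512:
    print(f"Input is {n_tokens} tokens; only the first 512 contribute to the embedding.")
```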

### Citation
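
If you find our paper or models helpful, please consider citing:

```
@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
```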