---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
widget: []
pipeline_tag: sentence-similarity
datasets:
- hpprc/emb
- hpprc/mqa-ja
- google-research-datasets/paws-x
base_model: pkshatech/GLuCoSE-base-ja
license: apache-2.0
---
# GLuCoSE v2

This model is a general Japanese text embedding model, excelling in retrieval tasks. It can run on CPU and is designed to measure semantic similarity between sentences, as well as to function as a retrieval system for searching passages based on queries.

Key features:
- Specialized for retrieval tasks, it demonstrates the highest performance among models of similar size on MIRACL and other retrieval benchmarks.
- Optimized for Japanese text processing
- Can run on CPU

During inference, the prefix "query: " or "passage: " is required. Please check the Usage section for details.

## Model Description

The model is based on [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja) and fine-tuned through distillation from several large-scale embedding models, followed by multi-stage contrastive learning.

- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
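
These properties can be confirmed directly from the loaded model (an illustrative check using the Sentence Transformers API):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

# Both values should match the specification above.
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768
```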

## Usage

### Direct Usage (Sentence Transformers)

You can perform inference using SentenceTransformer with the following code:

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Download from the 🤗 Hub
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]

```
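
For retrieval, you typically encode queries and passages separately and rank passages by cosine similarity. Below is a minimal, illustrative sketch using `sentence_transformers.util.semantic_search`; the passages are placeholder examples.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

# Remember the prefixes: "query: " for queries, "passage: " for documents.
query_embeddings = model.encode(["query: 日本で一番高い山は?"], convert_to_tensor=True)
passage_embeddings = model.encode(
    [
        "passage: 富士山は標高3776 mの日本最高峰である。",
        "passage: 琵琶湖は日本最大の面積を持つ湖である。",
    ],
    convert_to_tensor=True,
)

# Rank passages for each query by cosine similarity (the default score function).
hits = util.semantic_search(query_embeddings, passage_embeddings, top_k=2)
print(hits)
# e.g. [[{'corpus_id': 0, 'score': ...}, {'corpus_id': 1, 'score': ...}]]
```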

### Direct Usage (Transformers)

You can perform inference using Transformers with the following code:

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb

# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]

```
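
When scoring against many passages with plain PyTorch, an equivalent formulation is to L2-normalize the embeddings and use a single matrix product, since the cosine similarity of normalized vectors is just their dot product. A minimal sketch, continuing from the embeddings computed above:

```python
# Continues from the Transformers example above (F and embeddings already defined).
norm_embeddings = F.normalize(embeddings, p=2, dim=1)

# Full pairwise cosine-similarity matrix via one matmul.
similarities = norm_embeddings @ norm_embeddings.T
print(similarities.shape)
# [4, 4]
```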

## Training Details

The fine-tuning of GLuCoSE v2 is carried out through the following steps:

**Step 1: Ensemble distillation**

- The embedding representations were distilled using [E5-mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct), [gte-Qwen2](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct), and [mE5-large](https://huggingface.co/intfloat/multilingual-e5-large) as teacher models (an illustrative sketch of such an objective is shown below).
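
As a purely illustrative, hypothetical formulation of this kind of embedding distillation (the actual objective may differ), one option that works even when teacher and student output dimensions differ is to match in-batch similarity matrices:

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(
    student_emb: torch.Tensor,  # (batch, d_student)
    teacher_emb: torch.Tensor,  # (batch, d_teacher); dimensions may differ
) -> torch.Tensor:
    """Hypothetical objective: make the student's in-batch similarity matrix
    match the teacher's, so no shared embedding dimension is required."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

# With several teachers, the per-teacher losses can simply be averaged.
```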

**Step 2: Contrastive learning**

- Triplets were created from [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), [MNLI](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7), [PAWS-X](https://huggingface.co/datasets/paws-x), [JSeM](https://github.com/DaisukeBekki/JSeM) and [Mr.TyDi](https://huggingface.co/datasets/castorini/mr-tydi) and used for training.
- This training aimed to improve overall performance as a general sentence embedding model (a sketch of a triplet-style objective is shown below).
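
As a hedged, illustrative sketch of such a triplet-based contrastive objective (not the actual training code; the margin is a placeholder):

```python
import torch
import torch.nn.functional as F

def triplet_contrastive_loss(
    anchor: torch.Tensor,
    positive: torch.Tensor,
    negative: torch.Tensor,
    margin: float = 0.2,  # placeholder value
) -> torch.Tensor:
    """Illustrative triplet objective on cosine similarity: the anchor should be
    closer to the positive than to the negative by at least `margin`."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```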

**Step 3: Search-specific contrastive learning**

- To make the model more robust for retrieval, additional two-stage training on QA and retrieval tasks was conducted.
- In the first stage, the synthetic dataset [auto-wiki-qa](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa) was used for training, while in the second stage, [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA), [MQA](https://huggingface.co/datasets/hpprc/mqa-ja), and [Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, Quiz Works and Quiz No Mori](https://huggingface.co/datasets/hpprc/emb) were used. A sketch of a retrieval-style contrastive objective is shown below.
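
As a hedged illustration of such retrieval-style contrastive training with query/passage pairs and in-batch negatives (not the actual training code; the temperature is a placeholder):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(
    query_emb: torch.Tensor,    # (batch, dim), from "query: "-prefixed texts
    passage_emb: torch.Tensor,  # (batch, dim), from "passage: "-prefixed texts
    temperature: float = 0.05,  # placeholder value
) -> torch.Tensor:
    """Illustrative InfoNCE objective: each query should score its paired
    passage higher than every other passage in the batch."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```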

## Benchmarks

### Retrieval

Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA), [JaCWIR](https://huggingface.co/datasets/hotchpotch/JaCWIR), and [MLDR-ja](https://huggingface.co/datasets/Shitao/MLDR).

| Model | Size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | JaCWIR<br>MAP@10 | MLDR<br>nDCG@10 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.6B | 89.2 | 55.4 | **87.6** | 29.8 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.3B | 78.7 | 62.4 | 85.0 | **37.5** |
|  |  |  |  |  |  |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | 84.2 | 47.2 | **85.3** | 25.4 |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 74.3 | 58.1 | 84.6 | **35.3** |
| [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
| **GLuCoSE v2** | 0.1B | **85.5** | **60.6** | **85.3** | 33.8 |

Note: Results for OpenAI small embeddings on JQaRA and JaCWIR are quoted from the [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [JaCWIR](https://huggingface.co/datasets/hotchpotch/JaCWIR) dataset pages.

### JMTEB

Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB). The average score is a macro-average over the task categories.

| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| OpenAI/text-embedding-3-small | - | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
|  |  |  |  |  |  |  |  |  |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.6B | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
|  |  |  |  |  |  |  |  |  |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | 68.61 | 68.21 | 79.84 | 69.30 | **92.85** | 48.26 | 62.26 |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | **54.16** | 62.38 |
| [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 67.29 | 59.02 | 78.71 | **76.82** | 91.90 | 49.78 | **66.39** |
| **GLuCoSE v2** | 0.1B | **72.23** | **73.36** | **82.96** | 74.21 | 93.01 | 48.65 | 62.37 |

Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the [JMTEB leaderboard](https://github.com/sbintuitions/JMTEB/blob/main/leaderboard.md). Results for ruri are quoted from the [cl-nagoya/ruri-base model card](https://huggingface.co/cl-nagoya/ruri-base/blob/main/README.md).

## Authors

Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe

## License

This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).