---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
widget: []
pipeline_tag: sentence-similarity
datasets:
- hpprc/emb
- hpprc/mqa-ja
- google-research-datasets/paws-x
base_model: pkshatech/GLuCoSE-base-ja
license: apache-2.0
---
# GLuCoSE v2
This is a general-purpose Japanese text embedding model that excels at retrieval tasks. It can run on a CPU and is designed to measure semantic similarity between sentences as well as to serve as a retrieval system that searches passages by query.
Key features:
- Specialized for retrieval tasks, achieving the highest performance among similarly sized models on MIRACL and other benchmarks.
- Optimized for Japanese text processing
- Can run on CPU
During inference, each input text must be prefixed with "query: " or "passage: ". See the Usage section for details.
## Model Description
The model is based on [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja) and fine-tuned through distillation using several large-scale embedding models and multi-stage contrastive learning.
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
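You can confirm these properties at runtime through standard sentence-transformers attributes; a minimal sketch (the expected values are the ones listed above):
```python
from sentence_transformers import SentenceTransformer

# Quick check of the properties listed above, using standard
# sentence-transformers attributes.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768
print(model.similarity_fn_name)                  # expected: "cosine" (attribute available in sentence-transformers >= 3.0)
```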
## Usage
### Direct Usage (Sentence Transformers)
You can perform inference using SentenceTransformer with the following code:
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F
# Download from the 🤗 Hub
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
```
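With sentence-transformers v3.0 or later, the built-in `similarity` helper can replace the manual cosine computation above; a minimal sketch reusing the same model and sentences:
```python
# Assumes sentence-transformers >= 3.0. `model.similarity` applies the model's
# configured similarity function, which is cosine similarity for this model.
embeddings = model.encode(sentences, convert_to_tensor=True)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# Same 4x4 similarity matrix as above.
```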
### Direct Usage (Transformers)
You can perform inference using Transformers with the following code:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Average the token embeddings, masking out padding tokens.
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb
# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
```
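The cosine computation above is equivalent to L2-normalizing the embeddings and taking a matrix product, and for inference-only use you can also disable gradient tracking; a minimal sketch reusing `model`, `batch_dict`, and `mean_pooling` from the example above:
```python
import torch

# Forward pass without gradient tracking (inference only).
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict["attention_mask"])

# Cosine similarity via L2 normalization + matrix product,
# equivalent to the F.cosine_similarity call above.
normalized = F.normalize(embeddings, p=2, dim=1)
similarities = normalized @ normalized.T
print(similarities)
# Same 4x4 similarity matrix as above.
```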
## Training Details
The fine-tuning of GLuCoSE v2 was carried out in the following steps:
**Step 1: Ensemble distillation**
- Embedding representations were distilled using [E5-mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct), [gte-Qwen2](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct), and [mE5-large](https://huggingface.co/intfloat/multilingual-e5-large) as teacher models.
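The exact distillation objective is not specified here; purely as an illustration, embedding distillation is commonly implemented by minimizing the distance between the student embedding and a (projected) teacher embedding, for example with a cosine objective:
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch only, not the actual GLuCoSE v2 training code.

    Assumes the teacher embeddings (from E5-mistral, gte-Qwen2, mE5-large) have
    already been projected to the student's 768 dimensions.
    """
    student = F.normalize(student_emb, dim=-1)
    teacher = F.normalize(teacher_emb, dim=-1)
    # 1 - cosine similarity, averaged over the batch.
    return (1.0 - (student * teacher).sum(dim=-1)).mean()
```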
**Step 2: Contrastive learning**
- Triplets were created from [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), [MNLI](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7), [PAWS-X](https://huggingface.co/datasets/paws-x), [JSeM](https://github.com/DaisukeBekki/JSeM) and [Mr.TyDi](https://huggingface.co/datasets/castorini/mr-tydi) and used for training.
- This training aimed to improve the overall performance as a sentence embedding model.
**Step 3: Search-specific contrastive learning**
- To make the model more robust for retrieval tasks, two additional training stages using QA and retrieval datasets were conducted.
- In the first stage, the synthetic dataset [auto-wiki-qa](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa) was used for training, while in the second stage, [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA), [MQA](https://huggingface.co/datasets/hpprc/mqa-ja), and [Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, Quiz Works and Quiz No Mori](https://huggingface.co/datasets/hpprc/emb) were used.
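The loss functions and hyperparameters for these contrastive stages are not detailed in this card; purely as an illustration, triplet-style contrastive fine-tuning can be sketched with the classic sentence-transformers training API (the example triplet below is made up):
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Illustrative sketch only, not the actual GLuCoSE v2 training setup.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")  # base model being fine-tuned
train_examples = [
    # (anchor, positive, hard negative); with larger batches,
    # MultipleNegativesRankingLoss also uses in-batch negatives.
    InputExample(texts=[
        "query: 日本で一番高い山は?",
        "passage: 富士山は日本最高峰の山である。",
        "passage: 琵琶湖は日本最大の湖である。",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```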
<!--
### Downstream Usage (Sentence Transformers)
You can finetune this model on your own dataset.
<details><summary>Click to expand</summary>
</details>
-->
<!--
### Out-of-Scope Use
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->
<!--
## Bias, Risks and Limitations
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->
<!--
### Recommendations
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
## Benchmarks
### Retrieval
Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA), [JaCWIR](https://huggingface.co/datasets/hotchpotch/JaCWIR), and [MLDR-ja](https://huggingface.co/datasets/Shitao/MLDR).
| Model | Size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | JaCWIR<br>MAP@10 | MLDR<br>nDCG@10 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.6B | 89.2 | 55.4 | **87.6** | 29.8 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.3B | 78.7 | 62.4 | 85.0 | **37.5** |
| | | | | | |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | 84.2 | 47.2 | **85.3** | 25.4 |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 74.3 | 58.1 | 84.6 | **35.3** |
| [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
| **GLuCoSE v2** | 0.1B | **85.5** | **60.6** | **85.3** | 33.8 |
Note: Results for OpenAI text-embedding-3-small on JQaRA and JaCWIR are quoted from the [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [JaCWIR](https://huggingface.co/datasets/hotchpotch/JaCWIR) dataset pages.
### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
The average score is a macro-average.
| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| OpenAI/text-embedding-3-small | - | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| | | | | | | | | |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.6B | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| | | | | | | | | |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | 68.61 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | **54.16** | 62.38 |
| [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 67.29 | 59.02 | 78.71 | **76.82** | 91.90 | 49.78 | **66.39** |
| **GLuCoSE v2** | 0.1B | **72.23** | **73.36** | **82.96** | 74.21 | **93.01** | 48.65 | 62.37 |
Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the [JMTEB leaderboard](https://github.com/sbintuitions/JMTEB/blob/main/leaderboard.md). Results for ruri are quoted from the [cl-nagoya/ruri-base model card](https://huggingface.co/cl-nagoya/ruri-base/blob/main/README.md).
## Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).