Update README.md
README.md
CHANGED
@@ -35,51 +35,20 @@ This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps
The model is based on [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja) and additionally fine-tuned.
Fine-tuning consists of the following steps.

-### Step 1: Ensemble distillation
-
-### Step 2: Contrastive learning
-
-We conducted contrastive learning on NLI, paraphrasing, and retrieval tasks:
-
-- **Objective**: Further improve the model's performance as a comprehensive sentence embedding model
-- **Method**: Contrastive learning loss with triplets, similar to supervised [SimCSE](https://arxiv.org/abs/2104.08821)
-- **Training Data**: Triplets created from the following datasets:
-  - [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
-  - [MNLI](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7)
-  - [PAWS-X](https://huggingface.co/datasets/paws-x)
-  - [JSeM](https://github.com/DaisukeBekki/JSeM)
-  - [Mr.TyDi](https://huggingface.co/datasets/castorini/mr-tydi)
-
-### Step 3: Search-specific contrastive learning
-
-We performed additional training on retrieval tasks:
-
-- **Objective**: Make the model more powerful and robust for retrieval tasks
-- **Method**:
-  - Two-stage training with QA and question-answer data
-  - Utilized 7 hard negatives during training, following the [SFR-embedding blog](https://blog.salesforceairesearch.com/sfr-embedded-mistral/)
-- **Training Data**:
-  - First stage: [auto-wiki-qa](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa) (synthetic dataset)
-  - Second stage:
-    - [Japanese Wikipedia Human Retrieval](https://huggingface.co/datasets/hpprc/emb)
-    - [Mr.TyDi](https://huggingface.co/datasets/hpprc/emb)
-    - [MIRACL](https://huggingface.co/datasets/hpprc/emb)
-    - [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA)
-    - [MQA](https://huggingface.co/datasets/hpprc/mqa-ja)
-    - [Quiz Works](https://huggingface.co/datasets/hpprc/emb)
-    - [Quiz No Mori](https://huggingface.co/datasets/hpprc/emb)
+**Step 1: Ensemble distillation**
+
+- The embedded representations were distilled using E5-mistral, gte-Qwen2, and mE5-large as teacher models.
+
+**Step 2: Contrastive learning**
+
+- Triplets were created from JSNLI, MNLI, PAWS-X, JSeM, and Mr.TyDi and used for training.
+- This training aimed to improve the overall performance as a sentence embedding model.
+
+**Step 3: Search-specific contrastive learning**
+
+- To make the model more robust for retrieval tasks, additional two-stage training with QA and question-answer data was conducted.
+- The first stage used the synthetic dataset auto-wiki-qa; the second stage used Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, JQaRA, MQA, Quiz Works, and Quiz No Mori.

### Model Description
- **Model Type:** Sentence Transformer
@@ -217,7 +186,6 @@ Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQAR
|[cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 74.3 | 58.1 | 84.6 | **35.3** |
|[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
|**GLuCoSE v2**| 0.1B | **85.5** | **60.6** | **85.3** | 33.8 |
-
Note: Results for OpenAI small embeddings on JQaRA and JaCWIR are quoted from the [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [JaCWIR](https://huggingface.co/datasets/hotchpotch/JCWIR) dataset pages.

| Model | Size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | JaCWIR<br>MAP@10 | MLDR<br>nDCG@10 |
@@ -251,7 +219,6 @@ Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
|[cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) |0.1B|71.91|69.82|82.87|75.58|92.91|**54.16**|62.38|
|[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|0.1B|70.44|59.02|78.71|**76.82**|91.90|49.78|**66.39**|
|**GLuCoSE v2**|0.1B|**72.22**|**73.36**|**82.96**|74.21|93.01|48.65|62.37|
-
Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the [JMTEB leaderboard](https://github.com/sbintuitions/JMTEB/blob/main/leaderboard.md). Results for ruri are quoted from the [cl-nagoya/ruri-base model card](https://huggingface.co/cl-nagoya/ruri-base/blob/main/README.md).
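The card does not spell out the distillation loss used in Step 1, so the following is a minimal sketch of one plausible ensemble-distillation objective: matching the student's pairwise cosine-similarity matrix to the average of the teachers' matrices. Working in similarity space sidesteps the dimension mismatch between a 768-d student and much larger teachers such as E5-mistral; the MSE loss, uniform teacher weighting, and toy tensors are illustrative assumptions, not the authors' recipe.

```python
import torch
import torch.nn.functional as F

def ensemble_similarity_distillation(student_emb, teacher_embs):
    """Match the student's pairwise cosine-similarity matrix to the
    mean of the teachers' matrices (assumed objective, for illustration)."""
    s = F.normalize(student_emb, dim=-1)
    student_sim = s @ s.T                           # (B, B) student similarities
    teacher_sims = []
    for t in teacher_embs:                          # teachers may differ in dim
        t = F.normalize(t, dim=-1)
        teacher_sims.append(t @ t.T)
    target = torch.stack(teacher_sims).mean(dim=0)  # uniform ensemble average
    return F.mse_loss(student_sim, target)

# Toy batch: 4 sentences; 768-d student vs. 4096-d and 1024-d teachers.
student = torch.randn(4, 768, requires_grad=True)
teachers = [torch.randn(4, 4096), torch.randn(4, 1024)]
loss = ensemble_similarity_distillation(student, teachers)
loss.backward()
```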
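A supervised-SimCSE-style triplet objective is available in sentence-transformers as `MultipleNegativesRankingLoss`: each anchor is pulled toward its positive and pushed away from its explicit negative plus all in-batch negatives. A minimal sketch of Step 2 along those lines, with toy triplets standing in for the JSNLI/MNLI/PAWS-X/JSeM/Mr.TyDi data and placeholder hyperparameters:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Toy (anchor, positive, negative) triplets; the real data came from
# JSNLI, MNLI, PAWS-X, JSeM, and Mr.TyDi.
train_examples = [
    InputExample(texts=[
        "猫がソファで寝ている。",        # anchor:   "A cat is sleeping on the sofa."
        "ソファの上で猫が眠っている。",  # positive: paraphrase
        "犬が公園を走っている。",        # negative: unrelated sentence
    ]),
    InputExample(texts=[
        "会議は10時に始まる。",
        "ミーティングの開始は10時だ。",
        "会議は中止になった。",
    ]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")  # in practice, the Step 1 checkpoint
loss = losses.MultipleNegativesRankingLoss(model)         # InfoNCE over triplets + in-batch negatives
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
```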
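Following the SFR-embedding recipe means pairing each query with hard negatives retrieved by the model itself rather than sampled at random. A minimal mining sketch (the helper, its data layout, and `k=7` taken from the 7-hard-negatives setting are illustrative assumptions):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")

def mine_hard_negatives(query, gold_idx, corpus, k=7):
    """Return the k highest-scoring passages that are not the gold one."""
    q = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
    d = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
    scores = (q @ d.T).squeeze(0)                    # cosine score per passage
    ranked = torch.argsort(scores, descending=True).tolist()
    keep = [i for i in ranked if i != gold_idx][:k]  # top ranks minus the gold
    return [corpus[i] for i in keep]
```

Each mined row `[query, gold_passage, *negatives]` can then be wrapped in an `InputExample` and trained with the same `MultipleNegativesRankingLoss` as above, which treats every column after the positive as an extra hard negative.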
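Since the result is a standard Sentence Transformer, inference needs nothing beyond the `sentence-transformers` library. A minimal usage sketch (the repository id below is assumed to be the one this card is published under):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")  # assumed repo id

sentences = [
    "明日は晴れるでしょう。",  # "It will be sunny tomorrow."
    "今日はいい天気だ。",      # "The weather is nice today."
]
embeddings = model.encode(sentences)  # shape (2, 768) for a base-sized model
print(util.cos_sim(embeddings, embeddings))
```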
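The retrieval metrics in the tables above (Recall@5, nDCG@10, MAP@10) follow their standard definitions; for reference, a minimal implementation of two of them under binary relevance:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-gain nDCG: discounted gain of top-k hits over the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0

# Gold documents {"d1", "d4"} against one system ranking:
ranking = ["d3", "d1", "d7", "d4", "d2"]
print(recall_at_k(ranking, {"d1", "d4"}))  # 1.0
print(ndcg_at_k(ranking, {"d1", "d4"}))    # ≈ 0.65
```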
## Authors