yano0 committed
Commit 10756d5 · verified · 1 Parent(s): 97dd64e

Update README.md

Files changed (1)
  1. README.md +14 -47

README.md CHANGED
@@ -35,51 +35,20 @@ This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps
  The model is based on [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja) and additionally fine-tuned.
  Fine-tuning consists of the following steps.
 
- ### Step 1: Ensemble distillation
-
- We conducted Contrastive Knowledge Distillation following [DistilCSE](https://arxiv.org/abs/2112.05638):
-
- - **Objective**: Distill knowledge from multiple teacher models to a student model
- - **Method**:
-   - Passed GLuCoSE's output through a separate linear layer for each teacher model
-   - Minimized the distance between the projected student output and the teacher embeddings
-   - Objective function: sum of the losses from all teacher models
- - **Models**:
-   - Teacher Models: [E5-mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct), [gte-Qwen2](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct), and [mE5-large](https://huggingface.co/intfloat/multilingual-e5-large)
-   - Student Model: [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja)
- - **Training Data**: Japanese Wikipedia (We used [jawiki](https://huggingface.co/datasets/hpprc/jawiki).)
-
- ### Step 2: Contrastive learning
-
- We conducted contrastive learning on NLI, paraphrase, and retrieval tasks:
-
- - **Objective**: Further improve the model's performance as a comprehensive sentence embedding model
- - **Method**: Contrastive learning loss with triplets, similar to supervised [SimCSE](https://arxiv.org/abs/2104.08821)
- - **Training Data**: Triplets created from the following datasets:
-   - [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
-   - [MNLI](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7)
-   - [PAWS-X](https://huggingface.co/datasets/paws-x)
-   - [JSeM](https://github.com/DaisukeBekki/JSeM)
-   - [Mr.TyDi](https://huggingface.co/datasets/castorini/mr-tydi)
-
- ### Step 3: Search-specific contrastive learning
-
- We performed additional training on retrieval tasks:
-
- - **Objective**: Make the model more powerful and robust for retrieval tasks
- - **Method**:
-   - Two-stage training with QA (question-answer) data
-   - Utilized 7 hard negatives during training, following the [SFR-embedding blog](https://blog.salesforceairesearch.com/sfr-embedded-mistral/)
- - **Training Data**:
-   - First stage: [auto-wiki-qa](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa) (synthetic dataset)
-   - Second stage:
-     - [Japanese Wikipedia Human Retrieval](https://huggingface.co/datasets/hpprc/emb)
-     - [Mr.TyDi](https://huggingface.co/datasets/hpprc/emb)
-     - [MIRACL](https://huggingface.co/datasets/hpprc/emb)
-     - [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA)
-     - [MQA](https://huggingface.co/datasets/hpprc/mqa-ja)
-     - [Quiz Works](https://huggingface.co/datasets/hpprc/emb)
-     - [Quiz No Mori](https://huggingface.co/datasets/hpprc/emb)
+ **Step 1: Ensemble distillation**
+
+ - The embedding representations were distilled using E5-mistral, gte-Qwen2, and mE5-large as teacher models.
+
+ **Step 2: Contrastive learning**
+
+ - Triplets were created from JSNLI, MNLI, PAWS-X, JSeM, and Mr.TyDi and used for training.
+ - This training aimed to improve the model's overall performance as a sentence embedding model.
+
+ **Step 3: Search-specific contrastive learning**
+
+ - To make the model more robust for retrieval tasks, additional two-stage training with QA (question-answer) data was conducted.
+ - In the first stage, the synthetic dataset auto-wiki-qa was used for training, while in the second stage, Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, JQaRA, MQA, Quiz Works, and Quiz No Mori were used.
+
 
  ### Model Description
  - **Model Type:** Sentence Transformer
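The three sketches below correspond to Steps 1-3 described in the hunk above. The removed Step 1 text describes projecting the student (GLuCoSE) embedding through a separate linear layer per teacher and minimizing the distance to each teacher's embedding, summing the losses over teachers. The following is only an illustration of that setup, not the training code used for this model; the embedding dimensions, the use of MSE as the distance, and all names are assumptions.

```python
# Illustrative sketch of the Step 1 objective (ensemble distillation with per-teacher
# linear projections and a summed distance loss). Not the released training recipe.
import torch
import torch.nn as nn

class EnsembleDistillationHead(nn.Module):
    def __init__(self, student_dim: int, teacher_dims: dict[str, int]):
        super().__init__()
        # One linear projection per teacher model, as in the Step 1 description.
        self.projections = nn.ModuleDict(
            {name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()}
        )
        self.distance = nn.MSELoss()  # assumed distance; the card only says "minimized distance"

    def forward(self, student_emb, teacher_embs):
        # Objective function: sum of the per-teacher losses.
        return sum(
            self.distance(self.projections[name](student_emb), teacher_embs[name])
            for name in teacher_embs
        )

# Toy usage with random tensors standing in for GLuCoSE and teacher outputs
# (768 / 4096 / 3584 / 1024 are assumed dimensions).
head = EnsembleDistillationHead(768, {"e5_mistral": 4096, "gte_qwen2": 3584, "me5_large": 1024})
student = torch.randn(8, 768)
teachers = {
    "e5_mistral": torch.randn(8, 4096),
    "gte_qwen2": torch.randn(8, 3584),
    "me5_large": torch.randn(8, 1024),
}
loss = head(student, teachers)
loss.backward()
```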
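Step 2 (contrastive learning over triplets, in the spirit of supervised SimCSE) maps naturally onto the sentence-transformers training loop. A minimal sketch with made-up triplets follows; the actual run used triplets built from JSNLI, MNLI, PAWS-X, JSeM, and Mr.TyDi, and the hyperparameters here are placeholders.

```python
# Sketch of Step 2: triplet-based contrastive learning with sentence-transformers.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")  # in practice, the Step 1 checkpoint

train_examples = [
    # (anchor, positive/entailment, negative/contradiction)
    InputExample(texts=["猫がソファで寝ている。", "猫が眠っている。", "犬が走っている。"]),
    InputExample(texts=["今日は雨が降っている。", "外は雨だ。", "今日は快晴だ。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# InfoNCE-style loss over (anchor, positive, negative) triplets with in-batch negatives,
# similar to the supervised SimCSE objective referenced above.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```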
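Step 3 adds retrieval-specific training with hard negatives (7 per example, per the removed text). One way to express that with the same API is to put the query, the positive passage, and the seven mined negatives into a single InputExample; MultipleNegativesRankingLoss then treats the extra texts as hard negatives on top of the in-batch negatives. Again a sketch only: the texts, the mining step, and the hyperparameters are invented.

```python
# Sketch of Step 3: retrieval-focused contrastive learning with 7 hard negatives per example.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# In practice this stage would start from the Step 2 checkpoint; the base model is a stand-in.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")

def retrieval_example(query: str, positive: str, hard_negatives: list[str]) -> InputExample:
    # texts[0] = query, texts[1] = relevant passage, texts[2:] = mined hard negatives.
    assert len(hard_negatives) == 7, "the card reports 7 hard negatives per example"
    return InputExample(texts=[query, positive, *hard_negatives])

examples = [
    retrieval_example(
        "日本の首都はどこですか?",
        "東京は日本の首都であり、最大の都市である。",
        [f"質問とは無関係な文書その{i}" for i in range(7)],  # placeholders for mined negatives
    ),
    retrieval_example(
        "富士山の高さは?",
        "富士山の標高は3776メートルである。",
        [f"別の無関係な文書その{i}" for i in range(7)],
    ),
]

loader = DataLoader(examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives plus the explicit hard negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```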
@@ -217,7 +186,6 @@ Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQAR
  |[cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 74.3 | 58.1 | 84.6 | **35.3** |
  |[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
  |**GLuCoSE v2**| 0.1B | **85.5** | **60.6** | **85.3** | 33.8 |
-
  Note: Results for OpenAI small embeddings on JQaRA and JaCWIR are quoted from the [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [JaCWIR](https://huggingface.co/datasets/hotchpotch/JCWIR) project pages.
 
  | Model | Size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | JaCWIR<br>MAP@10 | MLDR<br>nDCG@10 |
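The tables in this hunk report retrieval metrics such as Recall@5, nDCG@10, and MAP@10. As a rough illustration of how a Recall@k number is produced with a sentence-transformers model and cosine similarity, here is a toy computation; the official MIRACL/JQaRA/JaCWIR/MLDR evaluations use their own datasets and tooling, and the corpus, queries, and relevance labels below are made up.

```python
# Toy Recall@k computation for a dense retriever; not the benchmarks' official evaluation code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")  # any of the tabulated models could be used

corpus = [
    "東京は日本の首都である。",
    "富士山は日本で最も高い山である。",
    "琵琶湖は日本最大の湖である。",
]
queries = ["日本の首都は?", "日本一高い山は?"]
relevant = {0: {0}, 1: {1}}  # query index -> indices of relevant corpus entries

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_emb, corpus_emb)  # shape: (num_queries, num_docs)

k = 5
hits = 0
for qi in range(len(queries)):
    top_k = scores[qi].topk(min(k, len(corpus))).indices.tolist()
    if relevant[qi] & set(top_k):  # with one relevant doc per query, hit rate equals Recall@k
        hits += 1
print(f"Recall@{k}: {hits / len(queries):.2f}")
```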
@@ -251,7 +219,6 @@ Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
  |[cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) |0.1B|71.91|69.82|82.87|75.58|92.91|**54.16**|62.38|
  |[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|0.1B|70.44|59.02|78.71|**76.82**|91.90|49.78|**66.39**|
  |**GLuCoSE v2**|0.1B|**72.22**|**73.36**|**82.96**|74.21|93.01|48.65|62.37|
-
  Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the [JMTEB leaderboard](https://github.com/sbintuitions/JMTEB/blob/main/leaderboard.md). Results for ruri are quoted from the [cl-nagoya/ruri-base model card](https://huggingface.co/cl-nagoya/ruri-base/blob/main/README.md).
 
  ## Authors
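Finally, since the first hunk header quotes the README's opening line ("This is a sentence-transformers model ... It maps ..."), here is the kind of basic usage snippet that line implies. The v2 repository id below is an assumption (this commit lives in the GLuCoSE v2 model repo); check the model card for any query/passage prefix conventions before using the model for retrieval.

```python
# Minimal usage sketch for the sentence-transformers model this README describes.
from sentence_transformers import SentenceTransformer, util

# Assumed repository id for GLuCoSE v2; adjust if the actual repo name differs.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

sentences = [
    "GLuCoSE v2は日本語の文埋め込みモデルである。",
    "東京は日本の首都である。",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, embedding_dim) dense vectors

# Cosine similarity between the two sentences.
print(util.cos_sim(embeddings[0], embeddings[1]))
```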
 