yano0 committed
Commit 10756d5 · verified · 1 Parent(s): 97dd64e

Update README.md

Files changed (1)
  1. README.md +14 -47

README.md CHANGED
@@ -35,51 +35,20 @@ This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps
  The model is based on [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja) and additionally fine-tuned.
  Fine-tuning consists of the following steps.
 
- ### Step 1: Ensemble distillation
-
- We conducted Contrastive Knowledge Distillation following [DistilCSE](https://arxiv.org/abs/2112.05638):
-
- - **Objective**: Distill knowledge from multiple teacher models to a student model
- - **Method**:
-   - Passed GLuCoSE's output through a separate linear layer for each teacher model
-   - Minimized the distance between the projected student output and the teacher embeddings
-   - Objective function: sum of the losses from all teacher models
- - **Models**:
-   - Teacher Models: [E5-mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct), [gte-Qwen2](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct), and [mE5-large](https://huggingface.co/intfloat/multilingual-e5-large)
-   - Student Model: [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja)
- - **Training Data**: Japanese Wikipedia (We used [jawiki](https://huggingface.co/datasets/hpprc/jawiki).)
-
- ### Step 2: Contrastive learning
-
- We conducted contrastive learning on NLI, paraphrase, and retrieval tasks:
-
- - **Objective**: Further improve the model's performance as a comprehensive sentence embedding model
- - **Method**: Contrastive learning loss with triplets, similar to supervised [SimCSE](https://arxiv.org/abs/2104.08821)
- - **Training Data**: Triplets created from the following datasets:
-   - [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
-   - [MNLI](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7)
-   - [PAWS-X](https://huggingface.co/datasets/paws-x)
-   - [JSeM](https://github.com/DaisukeBekki/JSeM)
-   - [Mr.TyDi](https://huggingface.co/datasets/castorini/mr-tydi)
-
- ### Step 3: Search-specific contrastive learning
-
- We performed additional training on retrieval tasks:
-
- - **Objective**: Make the model more powerful and robust for retrieval tasks
- - **Method**:
-   - Two-stage training with QA (question-answer) data
-   - Utilized 7 hard negatives during training, following the [SFR-embedding blog](https://blog.salesforceairesearch.com/sfr-embedded-mistral/)
- - **Training Data**:
-   - First stage: [auto-wiki-qa](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa) (synthetic dataset)
-   - Second stage:
-     - [Japanese Wikipedia Human Retrieval](https://huggingface.co/datasets/hpprc/emb)
-     - [Mr.TyDi](https://huggingface.co/datasets/hpprc/emb)
-     - [MIRACL](https://huggingface.co/datasets/hpprc/emb)
-     - [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA)
-     - [MQA](https://huggingface.co/datasets/hpprc/mqa-ja)
-     - [Quiz Works](https://huggingface.co/datasets/hpprc/emb)
-     - [Quiz No Mori](https://huggingface.co/datasets/hpprc/emb)
+ **Step 1: Ensemble distillation**
+
+ - The embedding representations were distilled using E5-mistral, gte-Qwen2, and mE5-large as teacher models.
+
+ **Step 2: Contrastive learning**
+
+ - Triplets were created from JSNLI, MNLI, PAWS-X, JSeM, and Mr.TyDi and used for training.
+ - This training aimed to improve the model's overall performance as a sentence embedding model.
+
+ **Step 3: Search-specific contrastive learning**
+
+ - To make the model more robust for retrieval tasks, additional two-stage training with QA (question-answer) data was conducted.
+ - In the first stage, the synthetic dataset auto-wiki-qa was used for training, while in the second stage, Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, JQaRA, MQA, Quiz Works, and Quiz No Mori were used.
+
 
  ### Model Description
  - **Model Type:** Sentence Transformer
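The three sketches below correspond to Steps 1-3 described in the hunk above. The removed Step 1 text describes projecting the student (GLuCoSE) embedding through a separate linear layer per teacher and minimizing the distance to each teacher's embedding, summing the losses over teachers. The following is only an illustration of that setup, not the training code used for this model; the embedding dimensions, the use of MSE as the distance, and all names are assumptions.

```python
# Illustrative sketch of the Step 1 objective (ensemble distillation with per-teacher
# linear projections and a summed distance loss). Not the released training recipe.
import torch
import torch.nn as nn

class EnsembleDistillationHead(nn.Module):
    def __init__(self, student_dim: int, teacher_dims: dict[str, int]):
        super().__init__()
        # One linear projection per teacher model, as in the Step 1 description.
        self.projections = nn.ModuleDict(
            {name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()}
        )
        self.distance = nn.MSELoss()  # assumed distance; the card only says "minimized distance"

    def forward(self, student_emb, teacher_embs):
        # Objective function: sum of the per-teacher losses.
        return sum(
            self.distance(self.projections[name](student_emb), teacher_embs[name])
            for name in teacher_embs
        )

# Toy usage with random tensors standing in for GLuCoSE and teacher outputs
# (768 / 4096 / 3584 / 1024 are assumed dimensions).
head = EnsembleDistillationHead(768, {"e5_mistral": 4096, "gte_qwen2": 3584, "me5_large": 1024})
student = torch.randn(8, 768)
teachers = {
    "e5_mistral": torch.randn(8, 4096),
    "gte_qwen2": torch.randn(8, 3584),
    "me5_large": torch.randn(8, 1024),
}
loss = head(student, teachers)
loss.backward()
```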
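Step 2 (contrastive learning over triplets, in the spirit of supervised SimCSE) maps naturally onto the sentence-transformers training loop. A minimal sketch with made-up triplets follows; the actual run used triplets built from JSNLI, MNLI, PAWS-X, JSeM, and Mr.TyDi, and the hyperparameters here are placeholders.

```python
# Sketch of Step 2: triplet-based contrastive learning with sentence-transformers.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")  # in practice, the Step 1 checkpoint

train_examples = [
    # (anchor, positive/entailment, negative/contradiction)
    InputExample(texts=["猫がソファで寝ている。", "猫が眠っている。", "犬が走っている。"]),
    InputExample(texts=["今日は雨が降っている。", "外は雨だ。", "今日は快晴だ。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# InfoNCE-style loss over (anchor, positive, negative) triplets with in-batch negatives,
# similar to the supervised SimCSE objective referenced above.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```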
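Step 3 adds retrieval-specific training with hard negatives (7 per example, per the removed text). One way to express that with the same API is to put the query, the positive passage, and the seven mined negatives into a single InputExample; MultipleNegativesRankingLoss then treats the extra texts as hard negatives on top of the in-batch negatives. Again a sketch only: the texts, the mining step, and the hyperparameters are invented.

```python
# Sketch of Step 3: retrieval-focused contrastive learning with 7 hard negatives per example.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# In practice this stage would start from the Step 2 checkpoint; the base model is a stand-in.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")

def retrieval_example(query: str, positive: str, hard_negatives: list[str]) -> InputExample:
    # texts[0] = query, texts[1] = relevant passage, texts[2:] = mined hard negatives.
    assert len(hard_negatives) == 7, "the card reports 7 hard negatives per example"
    return InputExample(texts=[query, positive, *hard_negatives])

examples = [
    retrieval_example(
        "日本の首都はどこですか?",
        "東京は日本の首都であり、最大の都市である。",
        [f"質問とは無関係な文書その{i}" for i in range(7)],  # placeholders for mined negatives
    ),
    retrieval_example(
        "富士山の高さは?",
        "富士山の標高は3776メートルである。",
        [f"別の無関係な文書その{i}" for i in range(7)],
    ),
]

loader = DataLoader(examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives plus the explicit hard negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```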
@@ -217,7 +186,6 @@ Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQAR
  |[cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 74.3 | 58.1 | 84.6 | **35.3** |
  |[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
  |**GLuCoSE v2**| 0.1B | **85.5** | **60.6** | **85.3** | 33.8 |
-
  Note: Results for OpenAI small embeddings on JQaRA and JaCWIR are quoted from the [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [JaCWIR](https://huggingface.co/datasets/hotchpotch/JCWIR) project pages.
 
  | Model | Size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | JaCWIR<br>MAP@10 | MLDR<br>nDCG@10 |
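The tables in this hunk report retrieval metrics such as Recall@5, nDCG@10, and MAP@10. As a rough illustration of how a Recall@k number is produced with a sentence-transformers model and cosine similarity, here is a toy computation; the official MIRACL/JQaRA/JaCWIR/MLDR evaluations use their own datasets and tooling, and the corpus, queries, and relevance labels below are made up.

```python
# Toy Recall@k computation for a dense retriever; not the benchmarks' official evaluation code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")  # any of the tabulated models could be used

corpus = [
    "東京は日本の首都である。",
    "富士山は日本で最も高い山である。",
    "琵琶湖は日本最大の湖である。",
]
queries = ["日本の首都は?", "日本一高い山は?"]
relevant = {0: {0}, 1: {1}}  # query index -> indices of relevant corpus entries

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_emb, corpus_emb)  # shape: (num_queries, num_docs)

k = 5
hits = 0
for qi in range(len(queries)):
    top_k = scores[qi].topk(min(k, len(corpus))).indices.tolist()
    if relevant[qi] & set(top_k):  # with one relevant doc per query, hit rate equals Recall@k
        hits += 1
print(f"Recall@{k}: {hits / len(queries):.2f}")
```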
@@ -251,7 +219,6 @@ Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
  |[cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) |0.1B|71.91|69.82|82.87|75.58|92.91|**54.16**|62.38|
  |[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|0.1B|70.44|59.02|78.71|**76.82**|91.90|49.78|**66.39**|
  |**GLuCoSE v2**|0.1B|**72.22**|**73.36**|**82.96**|74.21|93.01|48.65|62.37|
-
  Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the [JMTEB leaderboard](https://github.com/sbintuitions/JMTEB/blob/main/leaderboard.md). Results for ruri are quoted from the [cl-nagoya/ruri-base model card](https://huggingface.co/cl-nagoya/ruri-base/blob/main/README.md).
 
  ## Authors
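Finally, since the first hunk header quotes the README's opening line ("This is a sentence-transformers model ... It maps ..."), here is the kind of basic usage snippet that line implies. The v2 repository id below is an assumption (this commit lives in the GLuCoSE v2 model repo); check the model card for any query/passage prefix conventions before using the model for retrieval.

```python
# Minimal usage sketch for the sentence-transformers model this README describes.
from sentence_transformers import SentenceTransformer, util

# Assumed repository id for GLuCoSE v2; adjust if the actual repo name differs.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

sentences = [
    "GLuCoSE v2は日本語の文埋め込みモデルである。",
    "東京は日本の首都である。",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, embedding_dim) dense vectors

# Cosine similarity between the two sentences.
print(util.cos_sim(embeddings[0], embeddings[1]))
```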
 