PM-AI committed on
Commit 0a1fa71 · 1 Parent(s): 41d328c

Update README.md

Files changed (1)
  1. README.md +94 -49
README.md CHANGED
@@ -32,28 +32,27 @@ The model can be easily used with [Sentence Transformer](https://github.com/UKPLab/sentence-transformers)
 ## Training
 This model is based on a training approach from 2020 by Philip May, who published the [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) model.
-We updated this approach by a new base model and some extensions to the training data.
 These changes are discussed in the next sections.

 ### Training Data
-The model is based on training with samples from STSb, SICK and Priya22 semantic textual relatedness datasets.
-It contains about 76.000 sentence pairs.
-These sentence pairs are based on German-German, English-English and German-English mixed.
-The training object is to optimize for cosine similarity loss based on a human annoted sentence similarity score.
 In terms of content, the samples consist of rather simple sentences.

 When the TSystems model was published, only the STSb dataset was used for STS training.
-It is also included in our model, but expanded to include SICK and Priya22 semantic textual relatedness:
-- SICK was already used in parts in STSb, but our independent translation (XYZ) using DeepL leads to slightly different formulations. This approach allows more examples to be included in the training.
-- The Priya22 semantic textual relatedness dataset published in 2022 was also translated into German via DeepL and added to the training data. Since it does not have a train test split, it was created independently at a ratio of 80:20.
 The rating scales of all datasets have been adjusted to match STSb, with a value range from 0 to 5.
 All training and test data (STSb, SICK, Priya22) were checked for duplicates within and across datasets, and any duplicates found were removed.
-Because the test data has a higher priority, duplicated entries between test-train are always removed from train split.
 The final datasets used can be viewed here: XYZ.

-
 ### Training
-Befor fine-tuning for STS we made the English paraphrasing model [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1) usable for German by applying Knowledge Distillation (Teacher-Student approach).
 The TSystems model used version 1, which is based on 7 different datasets and contains around 24.6 million samples.
 We are using version 2 with 12 datasets and about 83.3 million examples.
 Details on this process can be found here: XYZ
@@ -77,47 +76,93 @@ It makes no sense to say identical sentences have a score of 4.8 or 4.9.
 ### Evaluation <a name="evaluation"></a>

-todo todo todo
-todo links for the datasets, knowledge distillation etc
-
-The evaluation is based on **[germanDPR](https://arxiv.org/abs/2104.12741)**.
-The dataset developed by [deepset.ai](https://www.deepset.ai) consists of question-answer pairs, which are supplemented by three "hard negatives" per question.
-This makes it an ideal basis for benchmarking.
-The dataset is publicly available as **[deepset/germanDPR](https://huggingface.co/datasets/deepset/germandpr)**, but it does not support BEIR by default.
-Consequently, this dataset was reworked manually.
-In addition, duplicate text elements were removed and minimal text adjustments were made.
-The details of this process can be found in **[PM-AI/germandpr-beir](https://huggingface.co/datasets/PM-AI/germandpr-beir)**.

-The BEIR-compatible germanDPR dataset consists of **9275 questions** with **23993 text passages** for the **train split**.
-In order to have enough text passages for information retrieval, we use the train split and not the test split.
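For reference, such a BEIR evaluation of a bi-encoder can be sketched as follows. This is a minimal sketch, assuming the `beir` package, our model's Hub identifier and a local copy of the BEIR-formatted data; it is not the original benchmark script:

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch

# Load the BEIR-formatted germanDPR train split (local path is an assumption).
corpus, queries, qrels = GenericDataLoader("germandpr-beir/train").load(split="train")

# Wrap the bi-encoder (model name assumed) for dense retrieval with cosine similarity.
bi_encoder = models.SentenceBERT("PM-AI/bi-encoder_msmarco_bert-base_german")
retriever = EvaluateRetrieval(DenseRetrievalExactSearch(bi_encoder, batch_size=32), score_function="cos_sim")

# Retrieve passages for every query and compute NDCG@k, MAP, recall and precision.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # includes NDCG@1, NDCG@10, NDCG@100
```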
-The following table shows the evaluation results for different approaches and models:

-**model**|**NDCG@1**|**NDCG@10**|**NDCG@100**|**comment**
 :-----:|:-----:|:-----:|:-----:|:-----:
-bi-encoder_msmarco_bert-base_german (new) | 0.5300 <br /> 🏆 | 0.7196 <br /> 🏆 | 0.7360 <br /> 🏆 | "OUR model"
-[deepset/gbert-base-germandpr-X_encoder](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) | 0.4828 | 0.6970 | 0.7147 | "has two encoder models (one for queries and one for corpus), is the SOTA approach"
-[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.4561 | 0.6347 | 0.6613 | "trained on 15 languages"
-[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.4511 | 0.6328 | 0.6592 | "trained on huge corpus, support for 50+ languages"
-[distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.4350 | 0.6103 | 0.6411 | "trained on 50+ languages"
-[sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 0.4168 | 0.5931 | 0.6237 | "trained on large corpus, support for 50+ languages"
-[svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased) | 0.3818 | 0.5663 | 0.5986 | "most similar to OUR model"
-[BM25](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25) | 0.3196 | 0.5377 | 0.5740 | "lexical approach"
-
-**❗It is crucial to understand that the comparisons are also made with models based on other transformer approaches❗**
-
-A direct comparison based on the same approach can be made with [svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased).
-In this case, the model presented here outperforms this most similar model by up to 14 percentage points.
-
-Comparing with [deepset/gbert-base-germandpr-X_encoder](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) is theoretically a little unfair, since deepset's approach is based on two models at the same time!
-Queries and passages are encoded separately, which leads to better contextualization.
-Still, our newly trained model outperforms the other approach by around two percentage points.
-In addition, using two models at the same time increases the demands on memory and CPU power, which causes higher costs.
-This makes the approach presented here even more valuable.
-
-Note:
-- Texts used for evaluation are sometimes very long. All models except the BM25 approach truncate incoming texts at some point. This can decrease performance.
-- Evaluation of deepset's gbert-base-germandpr model might give an incorrect impression. The model was originally trained on (almost exactly) the data we used for evaluation.

 ### Acknowledgment
 
 ## Training
 This model is based on a training approach from 2020 by Philip May, who published the [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) model.
+We updated this approach with a new base model for fine-tuning and some extensions to the training data.
 These changes are discussed in the next sections.

 ### Training Data
+The model was trained with samples from the [STSb](https://huggingface.co/datasets/stsb_multi_mt), [SICK](https://huggingface.co/datasets/mteb/sickr-sts) and [Priya22 semantic textual relatedness](https://github.com/Priya22/semantic-textual-relatedness) datasets.
+They contain about 76,000 sentence pairs in total.
+These sentence pairs are _German-German_, _English-English_ and _German-English mixed_.
+The training objective is to optimize a _cosine similarity loss_ based on a human-annotated sentence similarity score.
 In terms of content, the samples consist of rather simple sentences.
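As a minimal sketch of this objective (not the original training script; the base model, example pairs and hyperparameters are placeholders), score-based fine-tuning with Sentence Transformers looks roughly like this:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Placeholder base model; in our setup the base is the distilled multilingual
# paraphrase model described in the "Training" section below.
model = SentenceTransformer("xlm-roberta-base")

# Human-annotated similarity scores, rescaled from the 0-5 STSb range to [0, 1].
train_samples = [
    InputExample(texts=["Ein Mann spielt Gitarre.", "A man is playing a guitar."], label=4.8 / 5.0),
    InputExample(texts=["Der Hund schläft.", "A plane is taking off."], label=0.2 / 5.0),
]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
# Cosine similarity loss: the cosine of the two sentence embeddings is
# regressed against the annotated score.
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
```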

 When the TSystems model was published, only the STSb dataset was used for STS training.
+Therefore it is also included in our model, but expanded to include SICK and Priya22 semantic textual relatedness:
+- SICK was already partly used in STSb, but our independent translation (XYZ) using [DeepL](https://www.deepl.com/) leads to slightly different phrasings. This approach allows more examples to be included in the training.
+- The Priya22 semantic textual relatedness dataset, published in 2022, was also translated into German via DeepL and added to the training data. Since it does not come with a train-test split, one was created independently at a ratio of 80:20.
 The rating scales of all datasets have been adjusted to match STSb, with a value range from 0 to 5.
 All training and test data (STSb, SICK, Priya22) were checked for duplicates within and across datasets, and any duplicates found were removed.
+Because the test data is prioritized, duplicated entries between test and train are removed from the train split only (see the sketch below).
 The final datasets used can be viewed here: XYZ.
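The preparation steps (rescaling, 80:20 split, cross-split deduplication) could look roughly like this sketch; file names, column names and the original score range are assumptions:

```python
import pandas as pd

# Assumed file with columns: sentence1, sentence2, score.
priya22 = pd.read_csv("priya22_semantic_relatedness_de_en.csv")

# Rescale the relatedness scores to the 0-5 STSb range (assuming 0-1 input).
priya22["score"] = priya22["score"] * 5.0

# Priya22 ships without a train-test split, so create one at a ratio of 80:20.
test = priya22.sample(frac=0.2, random_state=42)
train = priya22.drop(test.index)

# Deduplicate across splits: pairs that also occur in the test data are
# dropped from the train split only, because the test data is prioritized.
pair_key = lambda df: df["sentence1"].str.strip() + " ||| " + df["sentence2"].str.strip()
train = train[~pair_key(train).isin(set(pair_key(test)))]
```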

 ### Training
+Before fine-tuning for STS, we made the English paraphrasing model [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1) usable for German by applying **[Knowledge Distillation](https://arxiv.org/abs/2004.09813)** (_Teacher-Student_ approach).
 The TSystems model used version 1, which is based on 7 different datasets and contains around 24.6 million samples.
 We are using version 2 with 12 datasets and about 83.3 million examples.
 Details on this process can be found here: XYZ
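The linked paper describes this Teacher-Student setup. A minimal sketch with Sentence Transformers could look as follows; the student base model, the parallel-data file and the hyperparameters are assumptions, not our exact configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses, models
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the English paraphrase model. Student: a multilingual base model
# (XLM-R assumed here) that learns to mimic the teacher's embeddings.
teacher = SentenceTransformer("sentence-transformers/paraphrase-distilroberta-base-v1")
word_embedding = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_embedding, pooling])

# Parallel data: tab-separated English-German sentence pairs (placeholder path).
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("parallel-sentences-en-de.tsv.gz")
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)

# MSE loss pulls the student's embeddings towards the teacher's embeddings.
train_loss = losses.MSELoss(model=student)
student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=5, warmup_steps=1000)
```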
 
 ### Evaluation <a name="evaluation"></a>

+The performance is now measured cross-lingually as well as for German only and English only.
+In addition, the test samples used are evaluated individually for each dataset (STSb, SICK, Priya22), as well as on one large, combined test dataset (_all_).
+This subdivision per dataset allows for a fair overall assessment, since the external models were not trained on the same data basis as the model presented here.
+The data is not evenly distributed in either training or testing!
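Each table cell below is the Spearman rank correlation between a model's cosine similarity scores and the human labels on the respective test set. A minimal sketch of this measurement (the test pairs shown are placeholders):

```python
from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("PM-AI/sts_paraphrase_xlm-roberta-base_de-en")

# One evaluator is built per test set (STSb, SICK, Priya22, all); the pairs
# below are placeholders with labels rescaled from 0-5 to [0, 1].
test_samples = [
    InputExample(texts=["Ein Mann spielt Gitarre.", "A man is playing a guitar."], label=4.8 / 5.0),
    InputExample(texts=["Der Hund schläft.", "A plane is taking off."], label=0.2 / 5.0),
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="sts-test")
# Returns the Spearman correlation of cosine similarities vs. the gold scores.
print(evaluator(model))
```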
 
+**❗Some models are monolingual and therefore only usable for one language; they will hardly perform at all in the two tables covering the other languages.**

+The first table shows the evaluation results for **cross-lingual (German-English mixed)** performance, based on _Spearman_ correlation:
+**model**|**STSb**|**SICK**|**Priya22**|**all**
+:-----:|:-----:|:-----:|:-----:|:-----:
+[PM-AI/sts_paraphrase_xlm-roberta-base_de-en (ours)](https://huggingface.co/PM-AI/sts_paraphrase_xlm-roberta-base_de-en) | 0.8672 <br /> 🏆 | 0.8639 <br /> 🏆 | 0.8354 <br /> 🏆 | 0.8711 <br /> 🏆
+[T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) | 0.8525 | 0.7642 | 0.7998 | 0.8216
+[todo (ours, no fine-tuning)]() | 0.8225 | 0.7579 | 0.8255 | 0.8109
+[sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.8310 | 0.7529 | 0.8184 | 0.8102
+[sentence-transformers/stsb-xlm-r-multilingual](https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual) | 0.8194 | 0.7703 | 0.7566 | 0.7998
+[sentence-transformers/paraphrase-xlm-r-multilingual-v1](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1) | 0.7985 | 0.7217 | 0.7975 | 0.7838
+[sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1](https://huggingface.co/sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1) | 0.7985 | 0.7217 | 0.7975 | 0.7838
+[sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 0.7823 | 0.7090 | 0.7830 | 0.7834
+[sentence-transformers/distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.7449 | 0.6941 | 0.7607 | 0.7534
+[sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.7517 | 0.6950 | 0.7619 | 0.7496
+[sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking) | 0.7211 | 0.6650 | 0.7382 | 0.7200
+[Sahajtomar/German-semantic](https://huggingface.co/Sahajtomar/German-semantic) | 0.7170 | 0.5871 | 0.7204 | 0.6802
+[symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli](https://huggingface.co/symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli) | 0.6488 | 0.5489 | 0.6688 | 0.6303
+[sentence-transformers/sentence-t5-large](https://huggingface.co/sentence-transformers/sentence-t5-large) | 0.6849 | 0.6063 | 0.7360 | 0.5843
+[sentence-transformers/sentence-t5-base](https://huggingface.co/sentence-transformers/sentence-t5-base) | 0.6013 | 0.5213 | 0.6671 | 0.5068
+[sentence-transformers/gtr-t5-large](https://huggingface.co/sentence-transformers/gtr-t5-large) | 0.5881 | 0.5168 | 0.6674 | 0.4984
+[deepset/gbert-large-sts](https://huggingface.co/deepset/gbert-large-sts) | 0.3842 | 0.3537 | 0.4105 | 0.4362
+[sentence-transformers/gtr-t5-base](https://huggingface.co/sentence-transformers/gtr-t5-base) | 0.5204 | 0.4346 | 0.6008 | 0.4276
+[textattack/bert-base-uncased-STS-B](https://huggingface.co/textattack/bert-base-uncased-STS-B) | 0.0669 | 0.1135 | 0.0105 | 0.1514
+[symanto/xlm-roberta-base-snli-mnli-anli-xnli](https://huggingface.co/symanto/xlm-roberta-base-snli-mnli-anli-xnli) | 0.1694 | 0.0440 | 0.0521 | 0.1156
+
+The second table shows the evaluation results for **German only**, based on _Spearman_ correlation:
+**model**|**STSb**|**SICK**|**Priya22**|**all**
+:-----:|:-----:|:-----:|:-----:|:-----:
+[PM-AI/sts_paraphrase_xlm-roberta-base_de-en (ours)](https://huggingface.co/PM-AI/sts_paraphrase_xlm-roberta-base_de-en) | 0.8658 <br /> 🏆 | 0.8775 <br /> 🏆 | 0.8432 <br /> 🏆 | 0.8747 <br /> 🏆
+[T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) | 0.8547 | 0.8047 | 0.8068 | 0.8327
+[Sahajtomar/German-semantic](https://huggingface.co/Sahajtomar/German-semantic) | 0.8485 | 0.7915 | 0.8139 | 0.8280
+[sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.8360 | 0.7941 | 0.8237 | 0.8178
+[todo (ours, no fine-tuning)]() | 0.8297 | 0.7930 | 0.8341 | 0.8170
+[sentence-transformers/stsb-xlm-r-multilingual](https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual) | 0.8190 | 0.8027 | 0.7674 | 0.8072
+[sentence-transformers/paraphrase-xlm-r-multilingual-v1](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1) | 0.8079 | 0.7844 | 0.8126 | 0.8034
+[sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1](https://huggingface.co/sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1) | 0.8079 | 0.7844 | 0.8126 | 0.8034
+[sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 0.7891 | 0.7830 | 0.8010 | 0.7981
+[sentence-transformers/distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.7705 | 0.7612 | 0.7899 | 0.7780
+[sentence-transformers/sentence-t5-large](https://huggingface.co/sentence-transformers/sentence-t5-large) | 0.7771 | 0.7724 | 0.7829 | 0.7727
+[sentence-transformers/sentence-t5-base](https://huggingface.co/sentence-transformers/sentence-t5-base) | 0.7361 | 0.7613 | 0.7643 | 0.7602
+[sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.7467 | 0.7494 | 0.7684 | 0.7584
+[sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking) | 0.7419 | 0.7420 | 0.7692 | 0.7566
+[sentence-transformers/gtr-t5-large](https://huggingface.co/sentence-transformers/gtr-t5-large) | 0.7252 | 0.7201 | 0.7613 | 0.7447
+[sentence-transformers/gtr-t5-base](https://huggingface.co/sentence-transformers/gtr-t5-base) | 0.7058 | 0.6943 | 0.7462 | 0.7271
+[symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli](https://huggingface.co/symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli) | 0.7284 | 0.7136 | 0.7109 | 0.6997
+[deepset/gbert-large-sts](https://huggingface.co/deepset/gbert-large-sts) | 0.6576 | 0.7141 | 0.6769 | 0.6959
+[textattack/bert-base-uncased-STS-B](https://huggingface.co/textattack/bert-base-uncased-STS-B) | 0.4427 | 0.6023 | 0.4380 | 0.5380
+[symanto/xlm-roberta-base-snli-mnli-anli-xnli](https://huggingface.co/symanto/xlm-roberta-base-snli-mnli-anli-xnli) | 0.4154 | 0.5048 | 0.3478 | 0.4540
+
+And last but not least, the third table shows the evaluation results for **English only**, based on _Spearman_ correlation:
+**model**|**STSb**|**SICK**|**Priya22**|**all**
 :-----:|:-----:|:-----:|:-----:|:-----:
+[PM-AI/sts_paraphrase_xlm-roberta-base_de-en (ours)](https://huggingface.co/PM-AI/sts_paraphrase_xlm-roberta-base_de-en) | 0.8768 <br /> 🏆 | 0.8705 <br /> 🏆 | 0.8402 | 0.8748 <br /> 🏆
+[sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.8682 | 0.8065 | 0.8430 | 0.8378
+[todo (ours, no fine-tuning)]() | 0.8597 | 0.8105 | 0.8399 | 0.8363
+[T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) | 0.8660 | 0.7897 | 0.8097 | 0.8308
+[sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 0.8441 | 0.8059 | 0.8175 | 0.8300
+[sentence-transformers/sentence-t5-base](https://huggingface.co/sentence-transformers/sentence-t5-base) | 0.8551 | 0.8063 | 0.8434 | 0.8235
+[sentence-transformers/sentence-t5-large](https://huggingface.co/sentence-transformers/sentence-t5-large) | 0.8536 | 0.8097 | 0.8475 <br /> 🏆 | 0.8191
+[sentence-transformers/stsb-xlm-r-multilingual](https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual) | 0.8503 | 0.8009 | 0.7675 | 0.8162
+[sentence-transformers/paraphrase-xlm-r-multilingual-v1](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1) | 0.8350 | 0.7645 | 0.8211 | 0.8050
+[sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1](https://huggingface.co/sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1) | 0.8350 | 0.7645 | 0.8211 | 0.8050
+[sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.8075 | 0.7534 | 0.7908 | 0.7828
+[sentence-transformers/distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.8061 | 0.7421 | 0.7923 | 0.7784
+[Sahajtomar/German-semantic](https://huggingface.co/Sahajtomar/German-semantic) | 0.8061 | 0.7098 | 0.7709 | 0.7712
+[sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking) | 0.7866 | 0.7477 | 0.7700 | 0.7691
+[sentence-transformers/gtr-t5-large](https://huggingface.co/sentence-transformers/gtr-t5-large) | 0.7763 | 0.7258 | 0.8124 | 0.7675
+[sentence-transformers/gtr-t5-base](https://huggingface.co/sentence-transformers/gtr-t5-base) | 0.7961 | 0.7129 | 0.8147 | 0.7669
+[symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli](https://huggingface.co/symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli) | 0.7799 | 0.7415 | 0.7335 | 0.7376
+[deepset/gbert-large-sts](https://huggingface.co/deepset/gbert-large-sts) | 0.5703 | 0.6011 | 0.5673 | 0.6060
+[textattack/bert-base-uncased-STS-B](https://huggingface.co/textattack/bert-base-uncased-STS-B) | 0.4978 | 0.6099 | 0.5505 | 0.5754
+[symanto/xlm-roberta-base-snli-mnli-anli-xnli](https://huggingface.co/symanto/xlm-roberta-base-snli-mnli-anli-xnli) | 0.3830 | 0.5180 | 0.3056 | 0.4414
+
+**❗It is crucial to understand that:**
+- Only our model has seen training data from STSb, SICK and Priya22 combined, which is one reason for its better results. The model has simply been trained to be more sensitive to this type of sample.
+- The datasets are not proportionally aligned in terms of their number of examples. For example, Priya22 is significantly underrepresented.
+- The compared models are of different sizes, which affects resource consumption (CPU, RAM) and inference speed. So-called "large" models usually perform better, but also cost more (resources, money) than e.g. "base" models.
+- Multilingual models are usually made multilingual via Knowledge Distillation, starting from a monolingual state. Therefore, they usually perform somewhat better in their original language.

 ### Acknowledgment