PM-AI committed on
Commit 1fd2e02 · 1 Parent(s): accf7b6

Update README.md

Files changed (1): README.md (+134 −87)

README.md (updated):
---
language:
- de
- en
datasets:
- todo
pipeline_tag: sentence-similarity
tags:
- semantic textual similarity
- sts
- semantic search
- sentence similarity
- paraphrasing
- sentence-transformer
- feature-extraction
- transformers
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
- other
---

# Model card for PM-AI/sts_paraphrase_xlm-roberta-base_de-en

## Model summary
Transformer model for **Semantic Textual Similarity (STS)** on _German_ and _English_ sentences and texts.
It is based on XLM-RoBERTa (base) with mean pooling and produces 768-dimensional sentence embeddings (maximum sequence length: 512 tokens).
The embeddings can be used for **semantic search**, **paraphrasing** and **retrieval** with _cosine similarity_.
The model handles _mixed English-German_ input as well as _English-only_ and _German-only_ sentences and texts.

The model can be used directly with the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library.
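
A minimal usage sketch (assuming the `sentence-transformers` package is installed; the example sentences are illustrative only):

```python
from sentence_transformers import SentenceTransformer, util

# Load the model from the Hugging Face Hub
model = SentenceTransformer("PM-AI/sts_paraphrase_xlm-roberta-base_de-en")

# Mixed German/English sentences (illustrative)
sentences = [
    "Das ist ein Beispielsatz.",
    "This is an example sentence.",
    "Der Hund läuft durch den Park.",
]

embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; higher values indicate more similar meaning
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```

The similarity scores lie in [-1, 1]; if an STS-style 0-5 rating is needed, they can be rescaled accordingly.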

## Training
This model is based on a training approach from 2020 by Philip May, who published the [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) model.
We updated this approach with a new base model and some extensions to the training data.
These changes are discussed in the next sections.

### Training Data
The model is trained on samples from the STSb, SICK and Priya22 semantic textual relatedness datasets, about 76,000 sentence pairs in total.
The sentence pairs are German-German, English-English and mixed German-English.
The training objective is to minimize a cosine similarity loss against human-annotated sentence similarity scores.
In terms of content, the samples consist of rather simple sentences.

When the T-Systems model was published, only the STSb dataset was used for STS training.
STSb is included in our model as well, but extended with SICK and the Priya22 semantic textual relatedness data:
- SICK was already partly contained in STSb, but our independent translation (XYZ) using DeepL leads to slightly different formulations. This allows more examples to be included in the training.
- The Priya22 semantic textual relatedness dataset, published in 2022, was also translated into German via DeepL and added to the training data. Since it does not come with a train-test split, one was created independently at a ratio of 80:20.

The rating scale of all datasets was adjusted to the STSb range of 0 to 5.
All training and test data (STSb, SICK, Priya22) were checked for duplicates within and across the datasets, and duplicates were removed.
Because the test data has higher priority, duplicated test-train entries are always removed from the train split.
The final datasets can be viewed here: XYZ.
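
A minimal sketch of these preparation steps (rescaling all scores to the STSb 0-5 range, creating the missing 80:20 split for Priya22, and de-duplicating within and across splits). The file names and column names (`sentence1`, `sentence2`, `score`) are assumptions, as are the original score ranges noted in the comments:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def rescale_to_stsb(df: pd.DataFrame, old_min: float, old_max: float) -> pd.DataFrame:
    """Linearly map an arbitrary score range onto the STSb range 0..5."""
    df = df.copy()
    df["score"] = (df["score"] - old_min) / (old_max - old_min) * 5.0
    return df

# Hypothetical files holding the translated pairs; the final merged dataset is referenced as "XYZ" in the card
stsb = pd.read_csv("stsb_de_en.csv")                                                    # already scored 0..5
sick = rescale_to_stsb(pd.read_csv("sick_de_en.csv"), old_min=1.0, old_max=5.0)         # assumed source range 1..5
priya22 = rescale_to_stsb(pd.read_csv("priya22_de_en.csv"), old_min=0.0, old_max=1.0)   # assumed source range 0..1

# Priya22 has no official split -> create one at a ratio of 80:20
priya22_train, priya22_test = train_test_split(priya22, test_size=0.2, random_state=42)

train = pd.concat([stsb, sick, priya22_train], ignore_index=True)
test = priya22_test  # in the real setup the STSb/SICK test splits are added here as well

# Remove duplicates within the train split ...
train = train.drop_duplicates(subset=["sentence1", "sentence2"])
# ... and drop train pairs that also occur in the test data (test data has priority)
test_keys = set(zip(test["sentence1"], test["sentence2"]))
train = train[[(s1, s2) not in test_keys for s1, s2 in zip(train["sentence1"], train["sentence2"])]]
```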

### Training procedure
Before fine-tuning for STS, we made the English paraphrasing model [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1) usable for German by applying knowledge distillation (teacher-student approach).
The T-Systems model used version 1 of this paraphrasing model, which is based on 7 different datasets and contains around 24.6 million samples.
We use version 2, which is based on 12 datasets and about 83.3 million examples.
Details of this process can be found here: XYZ
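
A sketch of this distillation step, following the multilingual knowledge-distillation recipe from the sentence-transformers examples. The student initialisation and the name of the parallel EN-DE sentence file are assumptions, not taken from this card:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: monolingual English paraphrase model (version 2, as stated above) whose embedding space the student should imitate
teacher = SentenceTransformer("sentence-transformers/paraphrase-distilroberta-base-v2")

# Student: XLM-RoBERTa (base) with mean pooling (assumed initialisation)
word_emb = models.Transformer("xlm-roberta-base", max_seq_length=512)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)
student = SentenceTransformer(modules=[word_emb, pooling])

# Parallel EN<->DE sentences: the student learns to map both the English sentence and
# its German translation onto the teacher's embedding of the English sentence
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher,
                                      batch_size=32, use_embedding_cache=True)
train_data.load_data("parallel-sentences-en-de.tsv.gz")  # hypothetical tab-separated EN/DE file

train_loader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=1000)
```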

For the STS fine-tuning we use SBERT's [training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/b86eec31cf0a102ad786ba1ff31bfeb4998d3ca5/examples/training/sts/training_stsbenchmark_continue_training.py) training script.
One thing has been changed in this script: when a sentence pair consists of two identical utterances, its score is set to 5.0 (the maximum), because it makes no sense to label identical sentences with a score of 4.8 or 4.9.
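
A sketch of that modification, assuming the combined training data is available as `(sentence1, sentence2, score)` tuples on the 0-5 scale (`CosineSimilarityLoss` expects labels in [0, 1], hence the division by 5):

```python
from sentence_transformers import InputExample

def to_input_examples(rows):
    """Convert (sentence1, sentence2, score) tuples into InputExamples for CosineSimilarityLoss."""
    examples = []
    for s1, s2, score in rows:
        if s1.strip() == s2.strip():
            score = 5.0  # identical utterances always get the maximum similarity score
        examples.append(InputExample(texts=[s1, s2], label=score / 5.0))
    return examples
```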

#### Parameterization of training
- **Script: [training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/b86eec31cf0a102ad786ba1ff31bfeb4998d3ca5/examples/training/sts/training_stsbenchmark_continue_training.py)**
- **Datasets: todo**
- **GPU: NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)**
- **Batch Size: 32**
- **Base Model: todo**
- **Loss Function: Cosine Similarity**
- **Learning Rate: 2e-5**
- **Epochs: 3**
- **Evaluation Samples: 500**
- **Evaluation Steps: 1000**
- **Warmup Steps: 10%**
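
These parameters map onto a sentence-transformers training call roughly as follows (a sketch, not the exact script; `train_examples` and `dev_examples` are assumed to be `InputExample` lists built as in the snippet above):

```python
import math
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# train_examples / dev_examples: InputExample lists, e.g. from to_input_examples(...) above
model = SentenceTransformer("path/to/distilled-xlmr-student")  # hypothetical path to the distilled model

train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model=model)

# 500 held-out pairs, evaluated every 1000 steps (Spearman correlation of cosine scores)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples[:500], name="sts-dev")

num_epochs = 3
warmup_steps = math.ceil(len(train_loader) * num_epochs * 0.1)  # 10% of the training steps as warm-up

model.fit(
    train_objectives=[(train_loader, train_loss)],
    evaluator=evaluator,
    epochs=num_epochs,
    evaluation_steps=1000,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 2e-5},
    output_path="sts_paraphrase_xlm-roberta-base_de-en",
)
```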

### Evaluation <a name="evaluation"></a>

todo todo todo

The evaluation is based on **[germanDPR](https://arxiv.org/abs/2104.12741)**.
This dataset, developed by [deepset.ai](https://deepset.ai), consists of question-answer pairs, each supplemented by three "hard negatives" per question, which makes it an ideal basis for benchmarking.
The dataset is publicly available as **[deepset/germanDPR](https://huggingface.co/datasets/deepset/germandpr)**, but it does not support BEIR out of the box.
Consequently, the dataset was reworked manually; in addition, duplicate text elements were removed and minimal text adjustments were made.
The details of this process can be found in **[PM-AI/germandpr-beir](https://huggingface.co/datasets/PM-AI/germandpr-beir)**.
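
The retrieval benchmark can be reproduced with the [BEIR](https://github.com/beir-cellar/beir) framework roughly as follows (a sketch; the local folder containing the BEIR-formatted germanDPR data is an assumption):

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# BEIR-formatted germanDPR (see PM-AI/germandpr-beir); the local path is hypothetical
corpus, queries, qrels = GenericDataLoader("germandpr-beir/train").load(split="train")

# Dense retrieval with a sentence-transformers bi-encoder and cosine similarity
model_name = "PM-AI/sts_paraphrase_xlm-roberta-base_de-en"  # or any other bi-encoder from the table below
dense_model = DRES(models.SentenceBERT(model_name), batch_size=32)
retriever = EvaluateRetrieval(dense_model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # NDCG@1, NDCG@10, NDCG@100, ... as reported in the table below
```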

The BEIR-compatible germanDPR dataset consists of **9,275 questions** and **23,993 text passages** in the **train split**.
We use the train split rather than the test split in order to have enough text passages for information retrieval.
The following table shows the evaluation results for different approaches and models:

**model**|**NDCG@1**|**NDCG@10**|**NDCG@100**|**comment**
:-----:|:-----:|:-----:|:-----:|:-----:
bi-encoder_msmarco_bert-base_german (new) | 0.5300 <br /> 🏆 | 0.7196 <br /> 🏆 | 0.7360 <br /> 🏆 | "OUR model"
[deepset/gbert-base-germandpr-X_encoder](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) | 0.4828 | 0.6970 | 0.7147 | "uses two encoder models (one for queries, one for the corpus); SOTA approach"
[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.4561 | 0.6347 | 0.6613 | "trained on 15 languages"
[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.4511 | 0.6328 | 0.6592 | "trained on a huge corpus, supports 50+ languages"
[distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.4350 | 0.6103 | 0.6411 | "trained on 50+ languages"
[paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 0.4168 | 0.5931 | 0.6237 | "trained on a large corpus, supports 50+ languages"
[svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased) | 0.3818 | 0.5663 | 0.5986 | "most similar to OUR model"
[BM25](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25) | 0.3196 | 0.5377 | 0.5740 | "lexical approach"

**❗ It is crucial to understand that the comparison also includes models based on other transformer approaches!**

A direct comparison based on the same approach can be made with [svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased).
In this case, the model presented here outperforms its predecessor by up to 14 percentage points.

Comparing with [deepset/gbert-base-germandpr-X_encoder](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) is, strictly speaking, a little unfair, since deepset's approach uses two models at the same time: queries and passages are encoded separately, which leads to better contextualization.
Still, our newly trained model outperforms this approach by around two percentage points.
In addition, running two models at the same time increases memory and CPU demands, and therefore costs.
This makes the approach presented here even more valuable.

Note:
- Texts used for evaluation are sometimes very long. All models except the BM25 approach truncate incoming texts at some point, which can decrease performance.
- The evaluation of deepset's gbert-base-germandpr model might give a misleading impression: the model was originally trained on (almost exactly) the data we use for evaluation.

### Acknowledgment

This work is a collaboration between the [Technical University of Applied Sciences Wildau (TH Wildau)](https://en.th-wildau.de/) and [sense.ai.tion GmbH](https://senseaition.com/).
You can contact us via:
* [Philipp Müller (M.Eng.)](https://www.linkedin.com/in/herrphilipps); Author
* [Prof. Dr. Janett Mohnke](mailto:[email protected]); TH Wildau
* [Dr. Matthias Boldt, Jörg Oehmichen](mailto:[email protected]); sense.AI.tion GmbH

This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege" (natural-language dialogue assistants in care).

<div style="display:flex">
  <div style="padding-left:20px;">
    <a href="https://efre.brandenburg.de/efre/de/"><img src="https://huggingface.co/datasets/PM-AI/germandpr-beir/resolve/main/res/EFRE-Logo_rechts_oweb_en_rgb.jpeg" alt="Logo of European Regional Development Fund (EFRE)" width="200"/></a>
  </div>
  <div style="padding-left:20px;">
    <a href="https://www.senseaition.com"><img src="https://senseaition.com/wp-content/uploads/thegem-logos/logo_c847aaa8f42141c4055d4a8665eb208d_3x.png" alt="Logo of senseaition GmbH" width="200"/></a>
  </div>
  <div style="padding-left:20px;">
    <a href="https://www.th-wildau.de"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/TH_Wildau_Logo.png/640px-TH_Wildau_Logo.png" alt="Logo of TH Wildau" width="180"/></a>
  </div>
</div>