language:
- de
- en
pipeline_tag: feature-extraction
tags:
- semantic textual similarity
- sts
- semantic search
- sentence similarity
- paraphrasing
- documents retrieval
- passage retrieval
- information retrieval
- sentence-transformer
- feature-extraction
- transformers
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
- other
library_name: sentence-transformers
license: mit
Model card for PM-AI/paraphrase-distilroberta-base-v2_de-en
For internal purposes and for testing, we adapted a monolingual paraphrasing model from Sentence Transformers to German and English via knowledge distillation. We chose sentence-transformers/paraphrase-distilroberta-base-v2 because, to our knowledge, no publicly available multilingual version of this model exists. In addition, it was trained on significantly more data than its predecessor: 83.3 million samples instead of 24.6 million.
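A minimal usage sketch, assuming the model is loaded through the sentence-transformers library named in the metadata. The helper names `cosine_similarity` and `compare_sentences` are illustrative and not part of any library; only `SentenceTransformer` and its `encode` method come from sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_sentences(sentence_a, sentence_b,
                      model_name="PM-AI/paraphrase-distilroberta-base-v2_de-en"):
    """Embed two sentences with the model and return their cosine similarity.

    The import is local so that cosine_similarity stays usable without
    sentence-transformers installed. The model is downloaded on first use.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    emb_a, emb_b = model.encode([sentence_a, sentence_b])
    return cosine_similarity(emb_a.tolist(), emb_b.tolist())
```

Calling `compare_sentences("The weather is nice today.", "Das Wetter ist heute schön.")` would then be expected to yield a high score for this German-English paraphrase pair, although the exact value depends on the model weights.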
Training
- Download of datasets
- Execution of knowledge distillation
Training Data
Datasets used, based on the official source:
- AllNLI
- sentence-compression
- SimpleWiki
- altlex
- msmarco-triplets
- quora_duplicates
- coco_captions
- flickr30k_captions
- yahoo_answers_title_question
- S2ORC_citation_pairs
- stackexchange_duplicate_questions
- wiki-atomic-edits
Training Execution
First, we downloaded several German-English parallel datasets via get_parallel_data_*.py.
These datasets are: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary.
Then we started knowledge distillation with make_multilingual_sys.py.
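The idea behind this kind of distillation can be sketched as follows, assuming the usual multilingual setup in which the student is trained to map both an English sentence and its German translation onto the teacher's embedding of the English sentence. This is a toy illustration with hand-written vectors, not the code of make_multilingual_sys.py; the function names are hypothetical, and the actual script may average over a batch rather than sum two terms.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    assert len(u) == len(v)
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_en, student_en, student_de):
    """Loss for one parallel sentence pair.

    teacher_en: teacher embedding of the English sentence (fixed target)
    student_en: student embedding of the same English sentence
    student_de: student embedding of the German translation
    Both student embeddings are pulled toward the single teacher vector.
    """
    return mse(student_en, teacher_en) + mse(student_de, teacher_en)


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
teacher_en = [0.2, -0.1, 0.7]
student_en = [0.2, -0.1, 0.7]  # already matches the teacher
student_de = [0.5, -0.1, 0.7]  # off by 0.3 in the first dimension
loss = distillation_loss(teacher_en, student_en, student_de)  # about 0.03
```

Because the target is always the teacher's vector for the English side, the student ends up placing a sentence and its translation close together in the same embedding space.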
Parameterization of training
- Script: make_multilingual_sys.py
- Datasets: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary
- GPU: NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
- Batch Size: 64
- Max Sequence Length: 256
- Train Max Sentence Length: 600
- Max Sentences Per Train File: 1000000
- Teacher Model: sentence-transformers/paraphrase-distilroberta-base-v2
- Student Model: xlm-roberta-base
- Loss Function: MSE Loss
- Learning Rate: 2e-5
- Epochs: 20
- Evaluation Steps: 10000
- Warmup Steps: 10000
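The two data-related limits above can be illustrated with a short sketch. It assumes that Train Max Sentence Length is a character limit applied to both sides of a parallel pair and that Max Sentences Per Train File caps how many pairs are read from each dataset file; Max Sequence Length, by contrast, is the token limit of the student model and is not modeled here. The function name `select_training_pairs` is hypothetical.

```python
def select_training_pairs(pairs, max_sentence_length=600, max_sentences=1_000_000):
    """Keep parallel pairs whose longer side fits the character limit,
    stopping once the per-file cap is reached.

    pairs: iterable of (english, german) string tuples
    """
    kept = []
    for english, german in pairs:
        if max(len(english), len(german)) > max_sentence_length:
            continue  # skip the whole pair if either side is too long
        kept.append((english, german))
        if len(kept) >= max_sentences:
            break
    return kept


sample = [
    ("Hello world.", "Hallo Welt."),
    ("x" * 601, "zu lang"),  # dropped: one side exceeds 600 characters
    ("Good morning.", "Guten Morgen."),
]
selected = select_training_pairs(sample)  # keeps the first and third pair
```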
Acknowledgment
This work is a collaboration between Technical University of Applied Sciences Wildau (TH Wildau) and sense.AI.tion GmbH. You can contact us via:
- Philipp Müller (M.Eng.); Author
- Prof. Dr. Janett Mohnke; TH Wildau
- Dr. Matthias Boldt, Jörg Oehmichen; sense.AI.tion GmbH
This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege".