metadata
language:
  - de
  - en
pipeline_tag: feature-extraction
tags:
  - semantic textual similarity
  - sts
  - semantic search
  - sentence similarity
  - paraphrasing
  - documents retrieval
  - passage retrieval
  - information retrieval
  - sentence-transformer
  - feature-extraction
  - transformers
task_categories:
  - sentence-similarity
  - feature-extraction
  - text-retrieval
  - other
library_name: sentence-transformers
license: mit

Model card for PM-AI/paraphrase-distilroberta-base-v2_de-en

For internal purposes and for testing, we adapted a monolingual paraphrasing model from Sentence Transformers to German and English via knowledge distillation. We chose sentence-transformers/paraphrase-distilroberta-base-v2 because, to our knowledge, no multilingual version of it is publicly available. It was also trained on significantly more data than its predecessor: 83.3 million samples instead of 24.6 million.
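A minimal usage sketch, assuming the `sentence-transformers` package is installed and the model can be downloaded from the Hub under the name given in the title. The helper names `cosine_similarity` and `compare` are illustrative and not part of any library.

```python
import math

# Model name as given in the title of this model card.
MODEL_NAME = "PM-AI/paraphrase-distilroberta-base-v2_de-en"


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare(sentence_a, sentence_b, model_name=MODEL_NAME):
    """Embed two sentences and return their cosine similarity."""
    # Imported lazily so the helper above works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    emb_a, emb_b = model.encode([sentence_a, sentence_b])
    return cosine_similarity(emb_a.tolist(), emb_b.tolist())
```

Calling `compare("Das Wetter ist heute schön.", "The weather is nice today.")` downloads the model on first use and returns a score between -1 and 1; a German sentence and its English translation should score noticeably higher than an unrelated pair.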

Training

  1. Download of datasets
  2. Execution of knowledge distillation

Training Data

Datasets used, according to the official source:

  • AllNLI
  • sentence-compression
  • SimpleWiki
  • altlex
  • msmarco-triplets
  • quora_duplicates
  • coco_captions
  • flickr30k_captions
  • yahoo_answers_title_question
  • S2ORC_citation_pairs
  • stackexchange_duplicate_questions
  • wiki-atomic-edits

Training Execution

First, we downloaded several German-English parallel datasets using the get_parallel_data_*.py scripts.

These datasets are: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary

Then we ran knowledge distillation with make_multilingual_sys.py.
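The idea behind this kind of multilingual distillation can be sketched in a few lines: for each parallel sentence pair, the student's embeddings of both the source sentence and its translation are pulled towards the teacher's embedding of the source sentence. The snippet below is a pure-Python illustration with made-up three-dimensional vectors, not the code the script runs; the function names and numbers are assumptions for illustration only.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Loss for one parallel pair.

    Both student embeddings (source sentence and its translation)
    are compared against the teacher's embedding of the source sentence.
    """
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)


# Toy embeddings (made-up values, real embeddings have hundreds of dimensions).
teacher_en = [0.2, -0.1, 0.5]   # teacher embedding of the English sentence
student_en = [0.2, -0.1, 0.5]   # student already matches on English
student_de = [0.2, -0.1, 0.2]   # student is still off on the German translation

loss = distillation_loss(teacher_en, student_en, student_de)
```

In this toy case only the German embedding contributes to the loss, so training would move the student's German representation towards the teacher's English one.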

Parameterization of training

  • Script: make_multilingual_sys.py
  • Datasets: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary
  • GPU: NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
  • Batch Size: 64
  • Max Sequence Length: 256
  • Train Max Sentence Length: 600
  • Max Sentences Per Train File: 1000000
  • Teacher Model: sentence-transformers/paraphrase-distilroberta-base-v2
  • Student Model: xlm-roberta-base
  • Loss Function: MSE Loss
  • Learning Rate: 2e-5
  • Epochs: 20
  • Evaluation Steps: 10000
  • Warmup Steps: 10000
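The parameters above could be wired together roughly as follows. This is a hedged sketch, not the contents of make_multilingual_sys.py: it assumes a `sentence-transformers` release that still provides `ParallelSentencesDataset` and `SentenceTransformer.fit`, mean pooling for the student, and tab-separated parallel files. The names `CONFIG`, `train`, `train_files` and `output_path` are hypothetical.

```python
# Hyperparameters copied from the list above.
CONFIG = {
    "teacher_model": "sentence-transformers/paraphrase-distilroberta-base-v2",
    "student_model": "xlm-roberta-base",
    "batch_size": 64,
    "max_seq_length": 256,
    "train_max_sentence_length": 600,
    "max_sentences_per_file": 1_000_000,
    "learning_rate": 2e-5,
    "epochs": 20,
    "evaluation_steps": 10_000,
    "warmup_steps": 10_000,
}


def train(train_files, output_path, cfg=CONFIG):
    """Distill the teacher into the student on parallel sentence files."""
    # Imported lazily so CONFIG can be inspected without the libraries installed.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, losses, models
    from sentence_transformers.datasets import ParallelSentencesDataset

    teacher = SentenceTransformer(cfg["teacher_model"])

    # Student: multilingual encoder followed by mean pooling (assumption).
    encoder = models.Transformer(cfg["student_model"],
                                 max_seq_length=cfg["max_seq_length"])
    pooling = models.Pooling(encoder.get_word_embedding_dimension())
    student = SentenceTransformer(modules=[encoder, pooling])

    # Teacher embeddings of the source sentences become the regression targets.
    data = ParallelSentencesDataset(student_model=student,
                                    teacher_model=teacher,
                                    batch_size=cfg["batch_size"],
                                    use_embedding_cache=True)
    for path in train_files:
        data.load_data(path,
                       max_sentences=cfg["max_sentences_per_file"],
                       max_sentence_length=cfg["train_max_sentence_length"])

    loader = DataLoader(data, shuffle=True, batch_size=cfg["batch_size"])
    loss = losses.MSELoss(model=student)

    student.fit(train_objectives=[(loader, loss)],
                epochs=cfg["epochs"],
                warmup_steps=cfg["warmup_steps"],
                evaluation_steps=cfg["evaluation_steps"],
                output_path=output_path,
                optimizer_params={"lr": cfg["learning_rate"]})
    return student
```

Keeping the values in one dictionary makes it easy to compare a run against the list above. Note that `evaluation_steps` only has an effect once an evaluator is also passed to `fit`, which this sketch omits.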

Acknowledgment

This work is a collaboration between the Technical University of Applied Sciences Wildau (TH Wildau) and sense.ai.tion GmbH. You can contact us via:

This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project/Vorhaben: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege".
