julien-c (HF staff) committed
Commit 89ac337 · 1 Parent(s): ebb0a6c

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/oliverguhr/german-sentiment-bert/README.md

Files changed (1)
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
# German Sentiment Classification with Bert

This model was trained for sentiment classification of German-language texts. To achieve the best results, all model inputs need to be preprocessed with the same procedure that was applied during training. To simplify the usage of the model,
we provide a Python package that bundles the code needed for preprocessing and inference.

The model uses Google's BERT architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains, such as Twitter, Facebook, and movie, app, and hotel reviews.
You can find more information about the dataset and the training process in the [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf).

## Using the Python package

To get started, install the package from [PyPI](https://pypi.org/project/germansentiment/):

```bash
pip install germansentiment
```

```python
from germansentiment import SentimentModel

model = SentimentModel()

texts = [
    "Mit keinem guten Ergebniss","Das ist gar nicht mal so gut",
    "Total awesome!","nicht so schlecht wie erwartet",
    "Der Test verlief positiv.","Sie fährt ein grünes Auto."]

result = model.predict_sentiment(texts)
print(result)
```

The code above outputs the following list:

```python
["negative","negative","positive","positive","neutral", "neutral"]
```

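The returned list is easier to read when each label is shown next to its input. The following is a minimal sketch of that pairing; to keep it runnable without downloading the model, the `result` list is hard-coded from the output shown above rather than produced by `predict_sentiment`.

```python
from collections import Counter

# Inputs and labels copied from the example above; hard-coded here so the
# snippet runs without loading the model.
texts = [
    "Mit keinem guten Ergebniss", "Das ist gar nicht mal so gut",
    "Total awesome!", "nicht so schlecht wie erwartet",
    "Der Test verlief positiv.", "Sie fährt ein grünes Auto."]
result = ["negative", "negative", "positive", "positive", "neutral", "neutral"]

# The labels come back in the same order as the inputs, so zip pairs them up.
for text, label in zip(texts, result):
    print(f"{label:>8}  {text}")

# Count how often each label occurs in the batch.
label_counts = Counter(result)
print(label_counts)
```

With real predictions, replace the hard-coded `result` with the return value of `model.predict_sentiment(texts)`.
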
## A minimal working sample


```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List
import torch
import re

class SentimentModel():
    def __init__(self, model_name: str):
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Patterns used to preprocess the input texts.
        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

    def predict_sentiment(self, texts: List[str]) -> List[str]:
        texts = [self.clean_text(text) for text in texts]
        # add_special_tokens=True adds [CLS], [SEP], <s>... tokens in the right way for each model.
        encoded = self.tokenizer(texts, padding=True, truncation=True, add_special_tokens=True, return_tensors="pt")

        with torch.no_grad():
            # Pass the attention mask together with the input ids so that padding tokens are ignored.
            outputs = self.model(**encoded)

        # The first element of the model output holds the logits, one row per input text.
        label_ids = torch.argmax(outputs[0], dim=1)

        labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
        return labels

    def replace_numbers(self, text: str) -> str:
        return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")

    def clean_text(self, text: str) -> str:
        text = text.replace("\n", " ")
        text = self.clean_http_urls.sub('', text)
        text = self.clean_at_mentions.sub('', text)
        text = self.replace_numbers(text)
        text = self.clean_chars.sub('', text) # use only text chars
        text = ' '.join(text.split()) # substitute multiple whitespace with single whitespace
        text = text.strip().lower()
        return text

texts = ["Mit keinem guten Ergebniss","Das war unfair", "Das ist gar nicht mal so gut",
         "Total awesome!","nicht so schlecht wie erwartet", "Das ist gar nicht mal so schlecht",
         "Der Test verlief positiv.","Sie fährt ein grünes Auto.", "Der Fall wurde an die Polzei übergeben."]

model = SentimentModel(model_name = "oliverguhr/german-sentiment-bert")

print(model.predict_sentiment(texts))
```

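To see what the preprocessing does to a raw input, the cleaning steps can be run on their own. The sketch below copies the regular expressions and the order of steps from the `SentimentModel` class above into standalone functions, so it runs without loading the model. The module-level names (`CLEAN_CHARS`, `DIGIT_WORDS`, and so on) and the two example sentences are made up for illustration.

```python
import re

# Same patterns as in the SentimentModel class above.
CLEAN_CHARS = re.compile(r'[^A-Za-züöäÖÜÄß ]')
CLEAN_HTTP_URLS = re.compile(r'https*\S+')
CLEAN_AT_MENTIONS = re.compile(r'@\S+')

# Each digit is replaced by its German word, preceded by a space.
DIGIT_WORDS = {
    "0": " null", "1": " eins", "2": " zwei", "3": " drei", "4": " vier",
    "5": " fünf", "6": " sechs", "7": " sieben", "8": " acht", "9": " neun",
}

def replace_numbers(text: str) -> str:
    for digit, word in DIGIT_WORDS.items():
        text = text.replace(digit, word)
    return text

def clean_text(text: str) -> str:
    text = text.replace("\n", " ")
    text = CLEAN_HTTP_URLS.sub('', text)    # drop URLs
    text = CLEAN_AT_MENTIONS.sub('', text)  # drop @mentions
    text = replace_numbers(text)            # spell out digits
    text = CLEAN_CHARS.sub('', text)        # keep only letters and spaces
    text = ' '.join(text.split())           # collapse repeated whitespace
    return text.strip().lower()

print(clean_text("@user Das Hotel war super! 5 Sterne https://example.com\n"))
# das hotel war super fünf sterne
print(clean_text("Zimmer 12 war schön"))
# zimmer eins zwei war schön
```

Note that digits are replaced one at a time, so "12" becomes "eins zwei" rather than "zwölf".
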
## Model and Data

If you are interested in the code and data that were used to train this model, please have a look at [this repository](https://github.com/oliverguhr/german-sentiment) and our [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf). The table below lists the F1 scores that this model achieves on the following datasets. Since we trained this model with a newer version of the transformers library, the results are slightly better than those reported in the paper.

| Dataset                                                      | F1 micro Score |
| :----------------------------------------------------------- | -------------: |
| [holidaycheck](https://github.com/oliverguhr/german-sentiment) |         0.9568 |
| [scare](https://www.romanklinger.de/scare/)                  |         0.9418 |
| [filmstarts](https://github.com/oliverguhr/german-sentiment) |         0.9021 |
| [germeval](https://sites.google.com/view/germeval2017-absa/home) |         0.7536 |
| [PotTS](https://www.aclweb.org/anthology/L16-1181/)          |         0.6780 |
| [emotions](https://github.com/oliverguhr/german-sentiment)   |         0.9649 |
| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) |         0.7376 |
| [Leipzig Wikipedia Corpus 2016](https://wortschatz.uni-leipzig.de/de/download/german) |         0.9967 |
| all                                                          |         0.9639 |

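For readers unfamiliar with the metric, here is a minimal sketch of micro-averaged F1 under its standard definition: true positives, false positives and false negatives are pooled over all classes before the score is computed. It is not taken from the training repository, and the labels are invented toy data, not samples from the datasets above.

```python
from typing import List

def micro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # predicted this class and it was correct
            elif pred == label:
                fp += 1  # predicted this class, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this class
    return 2 * tp / (2 * tp + fp + fn)

# Toy labels (hypothetical): four of the five predictions are correct.
y_true = ["positive", "negative", "neutral", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative"]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

When every text receives exactly one label, as in this toy example, each wrong prediction counts once as a false positive and once as a false negative, so micro F1 equals plain accuracy.
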
## Cite

For feedback and questions, contact me via mail or Twitter [@oliverguhr](https://twitter.com/oliverguhr). Please cite us if you found this useful:

```
@InProceedings{guhr-EtAl:2020:LREC,
  author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  title = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month = {May},
  year = {2020},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
  pages = {1620--1625},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.201}
}
```