stefan-it committed on
Commit
aaa91c1
·
1 Parent(s): 0ce0188

readme: add initial version

Files changed (1)
  1. README.md +59 -0
README.md CHANGED
---
license: mit
datasets:
- gwlms/germeval2014
language:
- de
---

# SpanMarker for GermEval 2014 NER

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that was fine-tuned on the
[GermEval 2014 NER Dataset](https://sites.google.com/site/germeval2014ner/home).

The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following
properties: The data was sampled from German Wikipedia and News Corpora as a collection of citations. The dataset
covers over 31,000 sentences corresponding to over 590,000 tokens. The NER annotation uses the NoSta-D guidelines,
which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating
embeddings among NEs such as `[ORG FC Kickers [LOC Darmstadt]]`.

12 classes of Named Entities are annotated and must be recognized: the four main classes `PER`son, `LOC`ation,
`ORG`anisation, and `OTH`er, and their subclasses, marked by two fine-grained labels: `-deriv` marks derivations
from NEs, such as "englisch" ("English"), and `-part` marks compounds that include a NE as a subsequence, such as
"deutschlandweit" ("Germany-wide").
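
The dataset is also available on the Hugging Face Hub. Here is a minimal sketch for loading and inspecting it with
🤗 Datasets; the `tokens`/`ner_tags` column names are an assumption about the dataset layout, not verified here:

```python
from datasets import load_dataset

# Load the GermEval 2014 NER dataset referenced in this model card
dataset = load_dataset("gwlms/germeval2014")

# Peek at one training example; the column names are assumed, not verified
example = dataset["train"][0]
print(example["tokens"])
print(example["ner_tags"])
```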

# Fine-Tuning

We use the same hyper-parameters as in the ["German's Next Language Model"](https://aclanthology.org/2020.coling-main.598/)
paper, with the released [GELECTRA Large](https://huggingface.co/deepset/gelectra-large) model as the backbone.
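
For illustration, a minimal sketch of what such a fine-tuning run could look like with the SpanMarker API. Apart
from the 5e-05 learning rate shown in the results table below, the hyper-parameter values here (sequence length,
batch size, epochs) are placeholder assumptions, not the exact values used for this model:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer

dataset = load_dataset("gwlms/germeval2014")
# Assumed column layout: label names are read from the `ner_tags` feature
labels = dataset["train"].features["ner_tags"].feature.names

# Initialize SpanMarker with GELECTRA Large as the encoder backbone
model = SpanMarkerModel.from_pretrained(
    "deepset/gelectra-large",
    labels=labels,
    model_max_length=256,  # assumption
)

args = TrainingArguments(
    output_dir="span-marker-gelectra-large-germeval14",
    learning_rate=5e-05,             # from the results table below
    per_device_train_batch_size=16,  # assumption
    num_train_epochs=3,              # assumption
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # assumes a `validation` split
)
trainer.train()
```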

Evaluation is performed with SpanMarker's internal evaluation code, which uses `seqeval`. Additionally, we use
the official GermEval 2014 evaluation script to double-check the results. A backup of the `nereval.perl` script
can be found [here](https://github.com/bplank/DaNplus/blob/master/scripts/nereval.perl).
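
`seqeval` computes entity-level precision, recall, and F1 from BIO-tagged sequences. A small self-contained
example with made-up tags (not actual GermEval output):

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO tag sequences for one example sentence (made up)
y_true = [["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-ORG", "O"]]

# Entity-level scoring: only exact span + label matches count as correct
print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```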

We fine-tune 5 models and upload the model with the best F1-Score on the development set:

| Model                  | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg.  |
| ---------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| GELECTRA Large (5e-05) | 89.99 | 89.55 | 89.60 | 89.34 | 89.68 | 89.63 |
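
As a quick sanity check, the Avg. column is the plain mean of the five runs:

```python
runs = [89.99, 89.55, 89.60, 89.34, 89.68]
print(round(sum(runs) / len(runs), 2))  # 89.63
```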

The best model achieves a final test score of 89.08%:

```text
1. Strict, Combined Evaluation (official):
Accuracy: 99.26%;
Precision: 89.01%;
Recall: 89.16%;
FB1: 89.08
```

# Usage

The fine-tuned model can be used like this:

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("stefan-it/span-marker-gelectra-large-germeval14")

# Run inference
entities = model.predict("Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München.")
```
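
`predict` returns a list of entity dictionaries. A small sketch for printing them; the exact keys (e.g. `span`,
`label`, `score`) follow the SpanMarker documentation and may vary between library versions:

```python
# Each entity is a dict with the predicted span text, label, and confidence
for entity in entities:
    print(entity["span"], entity["label"], round(entity["score"], 3))
```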