Add evaluation results
README.md
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.
## Model description
This model is an XLM-RoBERTa transformer model with a classification head on top (i.e. a linear layer on top of the pooled output).
For additional information please refer to the [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) model card or to the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al.
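The snippet below is a minimal sketch for inspecting that setup. The Hub id `papluca/xlm-roberta-base-language-detection` is an assumption inferred from the model name used later in this card, not something stated here.

```python
# Minimal sketch: inspect the classification setup of the fine-tuned checkpoint.
# The Hub id is an assumption; adjust it to wherever the checkpoint actually lives.
from transformers import AutoConfig, AutoModelForSequenceClassification

model_id = "papluca/xlm-roberta-base-language-detection"  # assumed Hub id

config = AutoConfig.from_pretrained(model_id)
print(config.model_type)   # expected: "xlm-roberta"
print(config.num_labels)   # expected: 20, one class per supported language
print(config.id2label)     # class index -> language code mapping

model = AutoModelForSequenceClassification.from_pretrained(model_id)
print(type(model).__name__)  # the XLM-R encoder with a sequence-classification head
```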
## Intended uses & limitations
You can directly use this model as a language detector, i.e. for sequence classification tasks. Currently, it supports the following 20 languages: `ar`, `bg`, `de`, `el`, `en`, `es`, `fr`, `hi`, `it`, `ja`, `nl`, `pl`, `pt`, `ru`, `sw`, `th`, `tr`, `ur`, `vi`, `zh` (the same codes used in the evaluation tables below).
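A minimal usage sketch with the Transformers `text-classification` pipeline could look as follows. The Hub id is an assumption, and the exact label strings depend on how `id2label` was set in the checkpoint; they are expected to be the language codes listed above.

```python
# Minimal usage sketch: language detection with the text-classification pipeline.
# The Hub id is an assumption; the predicted labels are expected to be the
# language codes listed above (depends on the checkpoint's id2label mapping).
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",  # assumed Hub id
)

texts = [
    "Brevity is the soul of wit.",
    "Amor, ch'a nullo amato amar perdona.",
]
for result in detector(texts):
    print(result)
# e.g. {'label': 'en', 'score': 0.99...} and {'label': 'it', 'score': 0.99...}
```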
## Training and evaluation data
The model was fine-tuned on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset, which consists of text sequences in 20 languages. The training set contains 70k samples, while the validation and test sets contain 10k samples each. The average accuracy on the test set is **99.6%** (this also matches the macro- and weighted-average F1-scores, since the test set is perfectly balanced). A more detailed per-language evaluation is provided in the table below, followed by a sketch for reproducing these metrics.

| Language | Precision | Recall | F1-score | Support |
|:--------:|:---------:|:------:|:--------:|:-------:|
| ar       | 0.998     | 0.996  | 0.997    | 500     |
| bg       | 0.998     | 0.964  | 0.981    | 500     |
| de       | 0.998     | 0.996  | 0.997    | 500     |
| el       | 0.996     | 1.000  | 0.998    | 500     |
| en       | 1.000     | 1.000  | 1.000    | 500     |
| es       | 0.967     | 1.000  | 0.983    | 500     |
| fr       | 1.000     | 1.000  | 1.000    | 500     |
| hi       | 0.994     | 0.992  | 0.993    | 500     |
| it       | 1.000     | 0.992  | 0.996    | 500     |
| ja       | 0.996     | 0.996  | 0.996    | 500     |
| nl       | 1.000     | 1.000  | 1.000    | 500     |
| pl       | 1.000     | 1.000  | 1.000    | 500     |
| pt       | 0.988     | 1.000  | 0.994    | 500     |
| ru       | 1.000     | 0.994  | 0.997    | 500     |
| sw       | 1.000     | 1.000  | 1.000    | 500     |
| th       | 1.000     | 0.998  | 0.999    | 500     |
| tr       | 0.994     | 0.992  | 0.993    | 500     |
| ur       | 1.000     | 1.000  | 1.000    | 500     |
| vi       | 0.992     | 1.000  | 0.996    | 500     |
| zh       | 1.000     | 1.000  | 1.000    | 500     |
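The per-language figures above could be recomputed roughly as follows. This is a sketch, not the original evaluation script; the dataset id, split and column names (`test`, `text`, `labels`) and the model Hub id are assumptions.

```python
# Sketch for recomputing per-language precision/recall/F1 on the test split.
# Dataset id, split and column names are assumptions, as is the model Hub id.
from datasets import load_dataset
from sklearn.metrics import classification_report
from transformers import pipeline

test = load_dataset("papluca/language-identification", split="test")
detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

predictions = [p["label"] for p in detector(test["text"], truncation=True, batch_size=64)]
print(classification_report(test["labels"], predictions, digits=3))
```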
### Benchmarks
As a baseline to compare `xlm-roberta-base-language-detection` against, we used the Python [langid](https://github.com/saffsd/langid.py) library. Since it comes pre-trained on 97 languages, we used its `.set_languages()` method to constrain the language set to our 20 languages. The average accuracy of langid on the test set is **98.5%**. More details are provided in the table below, followed by a short sketch of the baseline setup.

| Language | Precision | Recall | F1-score | Support |
|:--------:|:---------:|:------:|:--------:|:-------:|
| ar       | 0.990     | 0.970  | 0.980    | 500     |
| bg       | 0.998     | 0.964  | 0.981    | 500     |
| de       | 0.992     | 0.944  | 0.967    | 500     |
| el       | 1.000     | 0.998  | 0.999    | 500     |
| en       | 1.000     | 1.000  | 1.000    | 500     |
| es       | 1.000     | 0.968  | 0.984    | 500     |
| fr       | 0.996     | 1.000  | 0.998    | 500     |
| hi       | 0.949     | 0.976  | 0.963    | 500     |
| it       | 0.990     | 0.980  | 0.985    | 500     |
| ja       | 0.927     | 0.988  | 0.956    | 500     |
| nl       | 0.980     | 1.000  | 0.990    | 500     |
| pl       | 0.986     | 0.996  | 0.991    | 500     |
| pt       | 0.950     | 0.996  | 0.973    | 500     |
| ru       | 0.996     | 0.974  | 0.985    | 500     |
| sw       | 1.000     | 1.000  | 1.000    | 500     |
| th       | 1.000     | 0.996  | 0.998    | 500     |
| tr       | 0.990     | 0.968  | 0.979    | 500     |
| ur       | 0.998     | 0.996  | 0.997    | 500     |
| vi       | 0.971     | 0.990  | 0.980    | 500     |
| zh       | 1.000     | 1.000  | 1.000    | 500     |
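A minimal sketch of this baseline setup is shown below. It relies only on the `langid.set_languages()` and `langid.classify()` calls; the dataset handling (id, split and column names) is an assumption rather than the original evaluation script.

```python
# Sketch of the langid.py baseline: restrict the pre-trained 97-language model
# to our 20 languages, then classify the test texts. Dataset id, split and
# column names are assumptions.
import langid
from datasets import load_dataset
from sklearn.metrics import accuracy_score

LANGS = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
         "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
langid.set_languages(LANGS)

test = load_dataset("papluca/language-identification", split="test")
preds = [langid.classify(text)[0] for text in test["text"]]  # classify() returns (lang, score)
print("accuracy:", accuracy_score(test["labels"], preds))
```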
## Training procedure
Fine-tuning was done via the `Trainer` API.
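A condensed sketch of what such a `Trainer` setup could look like is given below. The hyperparameter values shown are illustrative placeholders (the actual values are those listed under "Training hyperparameters"), and the dataset id, split and column names are assumptions.

```python
# Condensed Trainer sketch (illustrative; hyperparameter values are placeholders,
# and the dataset id, split and column names are assumptions).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LANGS = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
         "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
label2id = {lang: i for i, lang in enumerate(LANGS)}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(LANGS),
    id2label={i: lang for lang, i in label2id.items()},
    label2id=label2id,
)

dataset = load_dataset("papluca/language-identification")  # assumed dataset id

def preprocess(batch):
    # Tokenize the text and map language codes to class indices.
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["label"] = [label2id[lang] for lang in batch["labels"]]
    return enc

encoded = dataset.map(preprocess, batched=True,
                      remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="xlm-roberta-base-language-detection",
    num_train_epochs=2,               # matches the two epochs in the results table
    per_device_train_batch_size=64,   # illustrative placeholder
    learning_rate=2e-5,               # illustrative placeholder
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],  # assumed split name
    tokenizer=tokenizer,
)
trainer.train()
```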
### Training hyperparameters
The following hyperparameters were used during training:
### Training results
The validation results on the `valid` split of the Language Identification dataset are summarised below.

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
| 0.2492        | 1.0   | 1094 | 0.0149          | 0.9969   | 0.9969 |
| 0.0101        | 2.0   | 2188 | 0.0103          | 0.9977   | 0.9977 |
In short, it achieves the following results on the validation set:
- Loss: 0.0101
- Accuracy: 0.9977
- F1: 0.9977
### Framework versions