yeshpanovrustem committed on
Commit 05f13d5 · 1 Parent(s): 4c952a9

Update README.md

Files changed (1)
  1. README.md +35 -57
README.md CHANGED
@@ -27,61 +27,39 @@ datasets:
  # A Named Entity Recognition Model for Kazakh
  - The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
  - The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.
- ## KazNERD (cleaned)
- While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens and duplicates were removed.
- As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset changed.
- ### Statistics for training (Train), validation (Valid), and test (Test) sets
- | Unit | Train | Valid | Test | Total |
+ ## Evaluation results on the validation and test sets
+ | | Validation set | | | Test set| |
+ |:---:| :---: | :---: | :---: | :---: | :---: |
+ | **Precision** | **Recall** | **F<sub>1</sub>-score** | **Precision** | **Recall** | **F<sub>1</sub>-score** |
+ | 96.58% | 96.66% | 96.62% | 96.49% | 96.86% | 96.67% |
+ ## Model performance for the NE classes of the validation set
+ | NE Class | Precision | Recall | F<sub>1</sub>-score | Support |
  | :---: | :---: | :---: | :---: | :---: |
- | Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
- | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
- | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
- ### 80 / 10 / 10 split
- |Representation| Train | Valid | Test | Total |
- | :---: | :---: | :---: | :---: | :---: |
- | **AID** | 67,582 (79.99%) | 8,439 (9.99%) | 8,467 (10.02%)| 84,488 (100%) |
- | **BID** | 19,006 (80.11%) | 2,380 (10.03%) | 2,338 (9.85%)| 23,724 (100%) |
- | **CID** | 1,050 (78.89%) | 138 (10.37%) | 143 ( 10.74%) | 1,331 (100%) |
- | **DID** | 633 (79.22%) | 82 (10.26%) | 84 (10.51%) | 799 (100%) |
- | **EID** | 260 (81.00%) | 27 (8.41%) | 34 (10.59%)| 321 (100%) |
- | **FID** | 9 (75.00%) | 1 (8.33%)| 2 (16.67%)| 12 (100%) |
- |**Total**| **88,540 (80.00%)** | **11,067 (10.00%)** | **11,068 (10.00%)** | **110,675 (100%)** |
- ### Distribution of representations across sets
- |Representation| Train | Valid | Test | Total |
- | :---: | :---: | :---: | :---: | :---: |
- | **AID** | 67,582 (76.33%) | 8,439 (76.25%) | 8,467 (76.50%)| 84,488 (76.34%) |
- | **BID** | 19,006 (21.47%) | 2,380 (21.51%) | 2,338 (21.12%)| 23,724 (21.44%) |
- | **CID** | 1,050 (1.19%) | 138 (1.25%) | 143 ( 1.29%) | 1,331 (1.20%) |
- | **DID** | 633 (0.71%) | 82 (0.74%) | 84 (0.76%) | 799 (0.72%) |
- | **EID** | 260 (0.29%) | 27 (0.24%) | 34 (0.31%)| 321 (0.29%) |
- | **FID** | 9 (0.01%) | 1 (0.01%)| 2 (0.02%)| 12 (0.01%) |
- |**Total**| **88,540 (100.00%)** | **11,067 (10.00%)** | **11,068 (10.00%)** | **110,675 (100%)** |
- ### Distribution of NEs across sets
- | **NE Class** | **Train** | **Valid** | **Test** | **Total** |
- |:---:| :---: | :---: | :---: | :---: |
- | **ADAGE** | 153 (0.14%) | 19 (0.14%) | 17 (0.13%) | 189 (0.14%) |
- | **ART** | 1,533 (1.44%) | 155 (1.18%) | 161 (1.23%) | 1,849 (1.40%) |
- | **CARDINAL** | 23,135 (21.8%) | 2,878 (21.82%) | 2,789 (21.34%) | 28,802 (21.75%) |
- | **CONTACT** | 159 (0.15%) | 18 (0.14%) | 20 (0.15%) | 197 (0.15%) |
- | **DATE** | 20,006 (18.85%) | 2,603 (19.74%) | 2,584 (19.77%) | 25,193 (19.03%) |
- | **DISEASE** | 1,022 (0.96%) | 121 (0.92%) | 119 (0.91%) | 1,262 (0.95%) |
- | **EVENT** | 1,331 (1.25%) | 154 (1.17%) | 154 (1.18%) | 1,639 (1.24%) |
- | **FACILITY** | 1,723 (1.62%) | 178 (1.35%) | 197 (1.51%) | 2,098 (1.58%) |
- | **GPE** | 13,625 (12.84%) | 1,656 (12.56%) | 1,691 (12.94%) | 16,972 (12.82%) |
- | **LANGUAGE** | 350 (0.33%) | 47 (0.36%) | 41 (0.31%) | 438 (0.33%) |
- | **LAW** | 419 (0.39%) | 56 (0.42%) | 55 (0.42%) | 530 (0.40%) |
- | **LOCATION** | 1,736 (1.64%) | 210 (1.59%) | 208 (1.59%) | 2,154 (1.63%) |
- | **MISCELLANEOUS** | 191 (0.18%) | 26 (0.2%) | 26 (0.2%) | 243 (0.18%) |
- | **MONEY** | 3,652 (3.44%) | 455 (3.45%) | 427 (3.27%) | 4,534 (3.42%) |
- | **NON_HUMAN** | 6 (0.01%) | 1 (0.01%) | 1 (0.01%) | 8 (0.01%) |
- | **NORP** | 2,929 (2.76%) | 374 (2.84%) | 368 (2.82%) | 3,671 (2.77%) |
- | **ORDINAL** | 3,054 (2.88%) | 385 (2.92%) | 382 (2.92%) | 3,821 (2.89%) |
- | **ORGANISATION** | 5,956 (5.61%) | 753 (5.71%) | 718 (5.49%) | 7,427 (5.61%) |
- | **PERCENTAGE** | 3,357 (3.16%) | 437 (3.31%) | 462 (3.53%) | 4,256 (3.21%) |
- | **PERSON** | 9,817 (9.25%) | 1,175 (8.91%) | 1,151 (8.81%) | 12,143 (9.17%) |
- | **POSITION** | 4,844 (4.56%) | 587 (4.45%) | 597 (4.57%) | 6,028 (4.55%) |
- | **PRODUCT** | 586 (0.55%) | 73 (0.55%) | 75 (0.57%) | 734 (0.55%) |
- | **PROJECT** | 1,681 (1.58%) | 209 (1.58%) | 206 (1.58%) | 2,096 (1.58%) |
- | **QUANTITY** | 3,063 (2.89%) | 411 (3.12%) | 403 (3.08%) | 3,877 (2.93%) |
- | **TIME** | 1,820 (1.71%) | 208 (1.58%) | 220 (1.68%) | 2,248 (1.70%) |
- | **Total** | **106,148 (100%)** | **13,189 (100%)** | **13,072 (100%)** | **132,409 (100%)** |
+ | **ADAGE** | 90.00% | 47.37% | 62.07% | 19 |
+ | **ART** | 91.36% | 95.48% | 93.38% | 155 |
+ | **CARDINAL** | 98.44% | 98.37% | 98.40% | 2,878 |
+ | **CONTACT** | 100.00% | 83.33% | 90.91% | 18 |
+ | **DATE** | 97.38% | 97.27% | 97.33% | 2,603 |
+ | **DISEASE** | 96.72% | 97.52% | 97.12% | 121 |
+ | **EVENT** | 83.24% | 93.51% | 88.07% | 154 |
+ | **FACILITY** | 68.95% | 84.83% | 76.07% | 178 |
+ | **GPE** | 98.46% | 96.50% | 97.47% | 1,656 |
+ | **LANGUAGE** | 95.45% | 89.36% | 92.31% | 47 |
+ | **LAW** | 87.50% | 87.50% | 87.50% | 56 |
+ | **LOCATION** | 92.49% | 93.81% | 93.14% | 210 |
+ | **MISCELLANEOUS** | 100.00% | 76.92% | 86.96% | 26 |
+ | **MONEY** | 99.56% | 100.00% | 99.78% | 455 |
+ | **NON_HUMAN** | 0.00% | 0.00% | 0.00% | 1 |
+ | **NORP** | 95.71% | 95.45% | 95.58% | 374 |
+ | **ORDINAL** | 98.14% | 95.84% | 96.98% | 385 |
+ | **ORGANISATION** | 92.19% | 90.97% | 91.58% | 753 |
+ | **PERCENTAGE** | 99.08% | 99.08% | 99.08% | 437 |
+ | **PERSON** | 98.47% | 98.72% | 98.60% | 1,175 |
+ | **POSITION** | 96.15% | 97.79% | 96.96% | 587 |
+ | **PRODUCT** | 89.06% | 78.08% | 83.21% | 73 |
+ | **PROJECT** | 92.13% | 95.22% | 93.65% | 209 |
+ | **QUANTITY** | 97.58% | 98.30% | 97.94% | 411 |
+ | **TIME** | 94.81% | 96.63% | 95.71% | 208 |
+ | **micro avg** | **96.58%** | **96.66%** | **96.62%** | **13,189** |
+ | **macro avg** | **90.12%** | **87.51%** | **88.39%** | **13,189** |
+ | **weighted avg** | **96.67%** | **96.66%** | **96.63%** | **13,189** |
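
The averaged rows of the per-class table can be cross-checked from the per-class rows. The following is a minimal sketch, not code from the repository. It assumes the usual definitions: the F<sub>1</sub>-score is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The names `ROWS`, `f1`, `macro`, and `weighted` are illustrative only.

```python
# Per-class validation results copied from the table above:
# (NE class, precision %, recall %, support).
ROWS = [
    ("ADAGE", 90.00, 47.37, 19),
    ("ART", 91.36, 95.48, 155),
    ("CARDINAL", 98.44, 98.37, 2878),
    ("CONTACT", 100.00, 83.33, 18),
    ("DATE", 97.38, 97.27, 2603),
    ("DISEASE", 96.72, 97.52, 121),
    ("EVENT", 83.24, 93.51, 154),
    ("FACILITY", 68.95, 84.83, 178),
    ("GPE", 98.46, 96.50, 1656),
    ("LANGUAGE", 95.45, 89.36, 47),
    ("LAW", 87.50, 87.50, 56),
    ("LOCATION", 92.49, 93.81, 210),
    ("MISCELLANEOUS", 100.00, 76.92, 26),
    ("MONEY", 99.56, 100.00, 455),
    ("NON_HUMAN", 0.00, 0.00, 1),
    ("NORP", 95.71, 95.45, 374),
    ("ORDINAL", 98.14, 95.84, 385),
    ("ORGANISATION", 92.19, 90.97, 753),
    ("PERCENTAGE", 99.08, 99.08, 437),
    ("PERSON", 98.47, 98.72, 1175),
    ("POSITION", 96.15, 97.79, 587),
    ("PRODUCT", 89.06, 78.08, 73),
    ("PROJECT", 92.13, 95.22, 209),
    ("QUANTITY", 97.58, 98.30, 411),
    ("TIME", 94.81, 96.63, 208),
]


def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro(values):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(values) / len(values)


def weighted(values, supports):
    """Mean weighted by support: frequent classes dominate the result."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


precisions = [p for _, p, _, _ in ROWS]
recalls = [r for _, _, r, _ in ROWS]
supports = [s for _, _, _, s in ROWS]

macro_precision = macro(precisions)
macro_recall = macro(recalls)
weighted_recall = weighted(recalls, supports)
total_support = sum(supports)

print(f"macro precision: {macro_precision:.2f}%")
print(f"macro recall:    {macro_recall:.2f}%")
print(f"weighted recall: {weighted_recall:.2f}%")
print(f"total support:   {total_support:,}")
```

Because the inputs are already rounded to two decimals, the recomputed averages can differ from the table in the last digit. The macro averages sit well below the micro and weighted ones because rare classes such as ADAGE and NON_HUMAN count as much as CARDINAL or DATE in an unweighted mean. The support-weighted recall coincides with the micro-averaged recall, since both reduce to correctly recognised entities divided by total support.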