File size: 5,664 Bytes
ec15925
 
eec1a9a
 
 
9afc8dd
eec1a9a
 
4c952a9
eec1a9a
e6c62a9
2edba11
 
 
 
 
 
 
 
 
 
 
 
 
 
ec15925
cfe11f6
2ef6220
 
61111ae
22fb2c9
a1f011e
cc873a9
404f4ec
cc873a9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
05f13d5
 
 
 
 
cfe11f6
05f13d5
 
6687578
05f13d5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cfe11f6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
---
license: cc-by-4.0
language:
- kk
metrics:
- seqeval
pipeline_tag: token-classification
tags:
- Named Entity Recognition
- NER
widget:
- text: >-
    Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан
    мемлекет.
  example_title: Example 1
- text: Ахмет Байтұрсынұлы  қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым.
  example_title: Example 2
- text: >-
    Қазақстан мен ЕуроОдақ арасындағы тауар айналым былтыр 38% өсіп, 40 миллиард
    долларға жетті. Екі тарап серіктестікті одан әрі нығайтуға мүдделі. Атап
    айтсақ, Қазақстан Еуропаға құны 2 млрд доллардан асатын 175 тауар экспорттын
    ұлғайтуға дайын.
  example_title: Example 3
datasets:
- yeshpanovrustem/kaznerd_cleaned
---

# A Named Entity Recognition Model for Kazakh
- The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
- The model was trained for 3 epochs on [*kaznerd_cleaned*](https://huggingface.co/datasets/yeshpanovrustem/kaznerd_cleaned).
- The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.

## How to use
You can use this model with the Transformers pipeline for NER. 

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")
model = AutoModelForTokenClassification.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")

nlp = pipeline("ner", model = model, tokenizer = tokenizer)
example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

ner_results = nlp(example)
print(ner_results)
```

## Evaluation results on the validation and test sets
|  | Validation set |  |  | Test set|  |
|:---:| :---: | :---: | :---: | :---: | :---: |
| **Precision** | **Recall** | **F<sub>1</sub>-score** | **Precision** | **Recall** | **F<sub>1</sub>-score** |
| 96.58% | 96.66% | 96.62% | 96.49% | 96.86% | 96.67% |

## Model performance for the NE classes of the validation set
| NE Class | Precision | Recall | F<sub>1</sub>-score | Support |
| :---: | :---: | :---: | :---: | :---: |
| **ADAGE** | 90.00% | 47.37% | 62.07% | 19 |
| **ART** | 91.36% | 95.48% | 93.38% | 155 |
| **CARDINAL** | 98.44% | 98.37% | 98.40% | 2,878 |
| **CONTACT** | 100.00% | 83.33% | 90.91% | 18 |
| **DATE** | 97.38% | 97.27% | 97.33% | 2,603 |
| **DISEASE** | 96.72% | 97.52% | 97.12% | 121 |
| **EVENT** | 83.24% | 93.51% | 88.07% | 154 |
| **FACILITY** | 68.95% | 84.83% | 76.07% | 178 |
| **GPE** | 98.46% | 96.50% | 97.47% | 1,656 |
| **LANGUAGE** | 95.45% | 89.36% | 92.31% | 47 |
| **LAW** | 87.50% | 87.50% | 87.50% | 56 |
| **LOCATION** | 92.49% | 93.81% | 93.14% | 210 |
| **MISCELLANEOUS** | 100.00% | 76.92% | 86.96% | 26 |
| **MONEY** | 99.56% | 100.00% | 99.78% | 455 |
| **NON_HUMAN** | 0.00% | 0.00% | 0.00% | 1 |
| **NORP** | 95.71% | 95.45% | 95.58% | 374 |
| **ORDINAL** | 98.14% | 95.84% | 96.98% | 385 |
| **ORGANISATION** | 92.19% | 90.97% | 91.58% | 753 |
| **PERCENTAGE** | 99.08% | 99.08% | 99.08% | 437 |
| **PERSON** | 98.47% | 98.72% | 98.60% | 1,175 |
| **POSITION** | 96.15% | 97.79% | 96.96% | 587 |
| **PRODUCT** | 89.06% | 78.08% | 83.21% | 73 |
| **PROJECT** | 92.13% | 95.22% | 93.65% | 209 |
| **QUANTITY** | 97.58% | 98.30% | 97.94% | 411 |
| **TIME** | 94.81% | 96.63% | 95.71% | 208 |
| **micro avg** | **96.58%** | **96.66%** | **96.62%** | **13,189** |
| **macro avg** | **90.12%** | **87.51%** | **88.39%** | **13,189** |
| **weighted avg** | **96.67%** | **96.66%** | **96.63%** | **13,189** |

## Model performance for the NE classes of the test set
| NE Class | Precision | Recall | F<sub>1</sub>-score | Support |
| :---: | :---: | :---: | :---: | :---: |
| **ADAGE** | 71.43% | 29.41% | 41.67% | 17 |
| **ART** | 95.71% | 96.89% | 96.30% | 161 |
| **CARDINAL** | 98.43% | 98.60% | 98.51% | 2,789 |
| **CONTACT** | 94.44% | 85.00% | 89.47% | 20 |
| **DATE** | 96.59% | 97.60% | 97.09% | 2,584 |
| **DISEASE** | 87.69% | 95.80% | 91.57% | 119 |
| **EVENT** | 86.67% | 92.86% | 89.66% | 154 |
| **FACILITY** | 74.88% | 81.73% | 78.16% | 197 |
| **GPE** | 98.57% | 97.81% | 98.19% | 1,691 |
| **LANGUAGE** | 90.70% | 95.12% | 92.86% | 41 |
| **LAW** | 93.33% | 76.36% | 84.00% | 55 |
| **LOCATION** | 92.08% | 89.42% | 90.73% | 208 |
| **MISCELLANEOUS** | 86.21% | 96.15% | 90.91% | 26 |
| **MONEY** | 100.00% | 100.00% | 100.00% | 427 |
| **NON_HUMAN** | 0.00% | 0.00% | 0.00% | 1 |
| **NORP** | 99.46% | 99.18% | 99.32% | 368 |
| **ORDINAL** | 96.63% | 97.64% | 97.14% | 382 |
| **ORGANISATION** | 90.97% | 91.23% | 91.10% | 718 |
| **PERCENTAGE** | 98.05% | 98.05% | 98.05% | 462 |
| **PERSON** | 98.70% | 99.13% | 98.92% | 1,151 |
| **POSITION** | 96.36% | 97.65% | 97.00% | 597 |
| **PRODUCT** | 89.23% | 77.33% | 82.86% | 75 |
| **PROJECT** | 93.69% | 93.69% | 93.69% | 206 |
| **QUANTITY** | 97.26% | 97.02% | 97.14% | 403 |
| **TIME** | 94.95% | 94.09% | 94.52% | 220 |
| **micro avg** | **96.54%** | **96.85%** | **96.69%** | **13,072** |
| **macro avg** | **88.88%** | **87.11%** | **87.55%** | **13,072** |
| **weighted avg** | **96.55%** | **96.85%** | **96.67%** | **13,072** |