File size: 7,838 Bytes
ec15925
 
eec1a9a
 
 
9afc8dd
eec1a9a
 
4c952a9
eec1a9a
e6c62a9
2edba11
 
 
 
 
 
 
 
 
 
 
 
 
71f802f
ec15925
cfe11f6
2ef6220
 
baa45e0
22fb2c9
a1f011e
cc873a9
404f4ec
cc873a9
 
 
 
 
cbd8639
 
cc873a9
fef9186
 
cc873a9
 
 
fef9186
 
 
 
 
 
 
 
 
 
 
c49a73d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
fef9186
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cc873a9
 
05f13d5
 
 
 
 
cfe11f6
05f13d5
 
6687578
05f13d5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cfe11f6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
license: cc-by-4.0
language:
- kk
metrics:
- seqeval
pipeline_tag: token-classification
tags:
- Named Entity Recognition
- NER
widget:
- text: >-
    Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан
    мемлекет.
  example_title: Example 1
- text: Ахмет Байтұрсынұлы  қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым.
  example_title: Example 2
- text: >-
    Қазақстан мен ЕуроОдақ арасындағы тауар айналым былтыр 38% өсіп, 40 миллиард
    долларға жетті. Екі тарап серіктестікті одан әрі нығайтуға мүдделі. Атап
    айтсақ, Қазақстан Еуропаға құны 2 млрд доллардан асатын 175 тауар экспорттын
    ұлғайтуға дайын.
  example_title: Example 3
datasets:
- yeshpanovrustem/ner-kazakh
---

# A Named Entity Recognition Model for Kazakh
- The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
- The model was trained for 3 epochs on [*ner_kazakh*](https://huggingface.co/datasets/yeshpanovrustem/kaznerd_cleaned).
- The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.

## How to use
You can use this model with the Transformers pipeline for NER. 

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")
model = AutoModelForTokenClassification.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")

# aggregation_strategy = "none"
nlp = pipeline("ner", model = model, tokenizer = tokenizer, aggregation_strategy = "none")
example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

ner_results = nlp(example)
for result in ner_results:
    print(result)

# output:
# {'entity': 'B-GPE', 'score': 0.9995646, 'index': 1, 'word': '▁Қазақстан', 'start': 0, 'end': 9}
# {'entity': 'I-GPE', 'score': 0.9994935, 'index': 2, 'word': '▁Республикасы', 'start': 10, 'end': 22}
# {'entity': 'B-LOCATION', 'score': 0.99906737, 'index': 4, 'word': '▁Шығыс', 'start': 25, 'end': 30}
# {'entity': 'I-LOCATION', 'score': 0.999153, 'index': 5, 'word': '▁Еуропа', 'start': 31, 'end': 37}
# {'entity': 'B-LOCATION', 'score': 0.9991597, 'index': 7, 'word': '▁Орталық', 'start': 42, 'end': 49}
# {'entity': 'I-LOCATION', 'score': 0.9991725, 'index': 8, 'word': '▁Азия', 'start': 50, 'end': 54}
# {'entity': 'I-LOCATION', 'score': 0.9992299, 'index': 9, 'word': 'да', 'start': 54, 'end': 56}

token = ""
label_list = []
token_list = []

for result in ner_results:
    if result["word"].startswith("▁"):
        if token:
            token_list.append(token.replace("▁", ""))
        token = result["word"]
        label_list.append(result["entity"])
    else:
        token += result["word"]

token_list.append(token.replace("▁", ""))

for token, label in zip(token_list, label_list):
    print(f"{token}\t{label}")

# output:
# Қазақстан	B-GPE
# Республикасы	I-GPE
# Шығыс	B-LOCATION
# Еуропа	I-LOCATION
# Орталық	B-LOCATION
# Азияда	I-LOCATION

# aggregation_strategy = "simple"
nlp = pipeline("ner", model = model, tokenizer = tokenizer, aggregation_strategy = "simple")
example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

ner_results = nlp(example)
for result in ner_results:
    print(result)

# output:
# {'entity_group': 'GPE', 'score': 0.999529, 'word': 'Қазақстан Республикасы', 'start': 0, 'end': 22}
# {'entity_group': 'LOCATION', 'score': 0.9991102, 'word': 'Шығыс Еуропа', 'start': 25, 'end': 37}
# {'entity_group': 'LOCATION', 'score': 0.9991874, 'word': 'Орталық Азияда', 'start': 42, 'end': 56}

```

## Evaluation results on the validation and test sets
|  | Validation set |  |  | Test set|  |
|:---:| :---: | :---: | :---: | :---: | :---: |
| **Precision** | **Recall** | **F<sub>1</sub>-score** | **Precision** | **Recall** | **F<sub>1</sub>-score** |
| 96.58% | 96.66% | 96.62% | 96.49% | 96.86% | 96.67% |

## Model performance for the NE classes of the validation set
| NE Class | Precision | Recall | F<sub>1</sub>-score | Support |
| :---: | :---: | :---: | :---: | :---: |
| **ADAGE** | 90.00% | 47.37% | 62.07% | 19 |
| **ART** | 91.36% | 95.48% | 93.38% | 155 |
| **CARDINAL** | 98.44% | 98.37% | 98.40% | 2,878 |
| **CONTACT** | 100.00% | 83.33% | 90.91% | 18 |
| **DATE** | 97.38% | 97.27% | 97.33% | 2,603 |
| **DISEASE** | 96.72% | 97.52% | 97.12% | 121 |
| **EVENT** | 83.24% | 93.51% | 88.07% | 154 |
| **FACILITY** | 68.95% | 84.83% | 76.07% | 178 |
| **GPE** | 98.46% | 96.50% | 97.47% | 1,656 |
| **LANGUAGE** | 95.45% | 89.36% | 92.31% | 47 |
| **LAW** | 87.50% | 87.50% | 87.50% | 56 |
| **LOCATION** | 92.49% | 93.81% | 93.14% | 210 |
| **MISCELLANEOUS** | 100.00% | 76.92% | 86.96% | 26 |
| **MONEY** | 99.56% | 100.00% | 99.78% | 455 |
| **NON_HUMAN** | 0.00% | 0.00% | 0.00% | 1 |
| **NORP** | 95.71% | 95.45% | 95.58% | 374 |
| **ORDINAL** | 98.14% | 95.84% | 96.98% | 385 |
| **ORGANISATION** | 92.19% | 90.97% | 91.58% | 753 |
| **PERCENTAGE** | 99.08% | 99.08% | 99.08% | 437 |
| **PERSON** | 98.47% | 98.72% | 98.60% | 1,175 |
| **POSITION** | 96.15% | 97.79% | 96.96% | 587 |
| **PRODUCT** | 89.06% | 78.08% | 83.21% | 73 |
| **PROJECT** | 92.13% | 95.22% | 93.65% | 209 |
| **QUANTITY** | 97.58% | 98.30% | 97.94% | 411 |
| **TIME** | 94.81% | 96.63% | 95.71% | 208 |
| **micro avg** | **96.58%** | **96.66%** | **96.62%** | **13,189** |
| **macro avg** | **90.12%** | **87.51%** | **88.39%** | **13,189** |
| **weighted avg** | **96.67%** | **96.66%** | **96.63%** | **13,189** |

## Model performance for the NE classes of the test set
| NE Class | Precision | Recall | F<sub>1</sub>-score | Support |
| :---: | :---: | :---: | :---: | :---: |
| **ADAGE** | 71.43% | 29.41% | 41.67% | 17 |
| **ART** | 95.71% | 96.89% | 96.30% | 161 |
| **CARDINAL** | 98.43% | 98.60% | 98.51% | 2,789 |
| **CONTACT** | 94.44% | 85.00% | 89.47% | 20 |
| **DATE** | 96.59% | 97.60% | 97.09% | 2,584 |
| **DISEASE** | 87.69% | 95.80% | 91.57% | 119 |
| **EVENT** | 86.67% | 92.86% | 89.66% | 154 |
| **FACILITY** | 74.88% | 81.73% | 78.16% | 197 |
| **GPE** | 98.57% | 97.81% | 98.19% | 1,691 |
| **LANGUAGE** | 90.70% | 95.12% | 92.86% | 41 |
| **LAW** | 93.33% | 76.36% | 84.00% | 55 |
| **LOCATION** | 92.08% | 89.42% | 90.73% | 208 |
| **MISCELLANEOUS** | 86.21% | 96.15% | 90.91% | 26 |
| **MONEY** | 100.00% | 100.00% | 100.00% | 427 |
| **NON_HUMAN** | 0.00% | 0.00% | 0.00% | 1 |
| **NORP** | 99.46% | 99.18% | 99.32% | 368 |
| **ORDINAL** | 96.63% | 97.64% | 97.14% | 382 |
| **ORGANISATION** | 90.97% | 91.23% | 91.10% | 718 |
| **PERCENTAGE** | 98.05% | 98.05% | 98.05% | 462 |
| **PERSON** | 98.70% | 99.13% | 98.92% | 1,151 |
| **POSITION** | 96.36% | 97.65% | 97.00% | 597 |
| **PRODUCT** | 89.23% | 77.33% | 82.86% | 75 |
| **PROJECT** | 93.69% | 93.69% | 93.69% | 206 |
| **QUANTITY** | 97.26% | 97.02% | 97.14% | 403 |
| **TIME** | 94.95% | 94.09% | 94.52% | 220 |
| **micro avg** | **96.54%** | **96.85%** | **96.69%** | **13,072** |
| **macro avg** | **88.88%** | **87.11%** | **87.55%** | **13,072** |
| **weighted avg** | **96.55%** | **96.85%** | **96.67%** | **13,072** |