metadata
license: cc-by-4.0
language:
  - kk
metrics:
  - seqeval
pipeline_tag: token-classification
tags:
  - Named Entity Recognition
  - NER
widget:
  - text: >-
      Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан
      мемлекет.
    example_title: Example 1
  - text: Ахмет Байтұрсынұлы қазақ тілінің дыбыстық жүйесін алғашқы құрған ғалым.
    example_title: Example 2
  - text: >-
      Қазақстан мен ЕуроОдақ арасындағы тауар айналым былтыр 38% өсіп, 40
      миллиард долларға жетті. Екі тарап серіктестікті одан әрі нығайтуға
      мүдделі. Атап айтсақ, Қазақстан Еуропаға құны 2 млрд доллардан асатын 175
      тауар экспорттын ұлғайтуға дайын.
    example_title: Example 3
datasets:
  - yeshpanovrustem/ner-kazakh

A Named Entity Recognition Model for Kazakh

How to use

You can use this model with the Transformers pipeline for NER.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")
model = AutoModelForTokenClassification.from_pretrained("yeshpanovrustem/xlm-roberta-large-ner-kazakh")

# aggregation_strategy = "none": one prediction per subword piece; by default, pieces labelled O are not returned
nlp = pipeline("ner", model = model, tokenizer = tokenizer, aggregation_strategy = "none")
example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

ner_results = nlp(example)
for result in ner_results:
    print(result)

# output:
# {'entity': 'B-GPE', 'score': 0.9995646, 'index': 1, 'word': '▁Қазақстан', 'start': 0, 'end': 9}
# {'entity': 'I-GPE', 'score': 0.9994935, 'index': 2, 'word': '▁Республикасы', 'start': 10, 'end': 22}
# {'entity': 'B-LOCATION', 'score': 0.99906737, 'index': 4, 'word': '▁Шығыс', 'start': 25, 'end': 30}
# {'entity': 'I-LOCATION', 'score': 0.999153, 'index': 5, 'word': '▁Еуропа', 'start': 31, 'end': 37}
# {'entity': 'B-LOCATION', 'score': 0.9991597, 'index': 7, 'word': '▁Орталық', 'start': 42, 'end': 49}
# {'entity': 'I-LOCATION', 'score': 0.9991725, 'index': 8, 'word': '▁Азия', 'start': 50, 'end': 54}
# {'entity': 'I-LOCATION', 'score': 0.9992299, 'index': 9, 'word': 'да', 'start': 54, 'end': 56}

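# Merge subword pieces back into whole words. A piece starting with "▁"
# opens a new word, and the word keeps the label of its first piece.
token = ""
label_list = []
token_list = []

for result in ner_results:
    if result["word"].startswith("▁"):
        # flush the previous word before starting a new one
        if token:
            token_list.append(token.replace("▁", ""))
        token = result["word"]
        label_list.append(result["entity"])
    else:
        # continuation piece: append it to the current word
        token += result["word"]

# flush the last word
token_list.append(token.replace("▁", ""))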

for token, label in zip(token_list, label_list):
    print(f"{token}\t{label}")

# output:
# Қазақстан	B-GPE
# Республикасы	I-GPE
# Шығыс	B-LOCATION
# Еуропа	I-LOCATION
# Орталық	B-LOCATION
# Азияда	I-LOCATION

# aggregation_strategy = "simple": consecutive pieces of the same entity are merged into a single span
nlp = pipeline("ner", model = model, tokenizer = tokenizer, aggregation_strategy = "simple")
example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

ner_results = nlp(example)
for result in ner_results:
    print(result)

# output:
# {'entity_group': 'GPE', 'score': 0.999529, 'word': 'Қазақстан Республикасы', 'start': 0, 'end': 22}
# {'entity_group': 'LOCATION', 'score': 0.9991102, 'word': 'Шығыс Еуропа', 'start': 25, 'end': 37}
# {'entity_group': 'LOCATION', 'score': 0.9991874, 'word': 'Орталық Азияда', 'start': 42, 'end': 56}
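
The aggregated output above carries character offsets (`start`, `end`), so entity text can be sliced straight from the input string instead of being rebuilt from subword pieces. The following is a minimal sketch of that idea. The helper name `entities_by_label` and its `min_score` parameter are illustrative assumptions, not part of the model or of Transformers, and the results are pasted from the output above so the snippet runs without downloading the model.

```python
from collections import defaultdict


def entities_by_label(text, results, min_score=0.0):
    """Group aggregated NER results by entity type.

    `results` is expected to look like the aggregation_strategy="simple"
    output: dicts with 'entity_group', 'score', 'start' and 'end' keys.
    Entity text is sliced from `text` using the character offsets.
    """
    grouped = defaultdict(list)
    for result in results:
        if result["score"] >= min_score:  # drop low-confidence predictions
            surface = text[result["start"]:result["end"]]
            grouped[result["entity_group"]].append(surface)
    return dict(grouped)


example = "Қазақстан Республикасы — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."

# Copied from the aggregation_strategy = "simple" output shown above.
ner_results = [
    {"entity_group": "GPE", "score": 0.999529, "word": "Қазақстан Республикасы", "start": 0, "end": 22},
    {"entity_group": "LOCATION", "score": 0.9991102, "word": "Шығыс Еуропа", "start": 25, "end": 37},
    {"entity_group": "LOCATION", "score": 0.9991874, "word": "Орталық Азияда", "start": 42, "end": 56},
]

print(entities_by_label(example, ner_results, min_score=0.9))
# {'GPE': ['Қазақстан Республикасы'], 'LOCATION': ['Шығыс Еуропа', 'Орталық Азияда']}
```

In practice `ner_results` would come from `nlp(example)`; the threshold of 0.9 is arbitrary and should be tuned to the application.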

Evaluation results on the validation and test sets

| Set | Precision | Recall | F1-score |
|---|---:|---:|---:|
| Validation | 96.58% | 96.66% | 96.62% |
| Test | 96.49% | 96.86% | 96.67% |
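
The card's metadata lists seqeval as the metric, and seqeval scores whole entities rather than individual tokens: a prediction counts as correct only when both its type and its boundaries match the gold annotation. The sketch below illustrates that idea in plain Python, assuming BIO tags like those in the usage example. It is a simplified illustration, not seqeval's implementation, and the function names and toy sequences are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) entity spans in a BIO tag sequence."""
    spans = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label
        if label is not None and not continues:
            spans.add((label, start, i))  # end index is exclusive
            start, label = None, None
        # A B- tag opens a span; a stray I- tag is treated leniently as an opener.
        if prefix == "B" or (prefix == "I" and label is None):
            start, label = i, tag_label
    return spans


def entity_prf(gold_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall and F1 over all sentences."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)  # exact match on label and boundaries
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: the second entity is predicted with the wrong right boundary.
gold = [["B-GPE", "I-GPE", "O", "B-LOCATION", "I-LOCATION", "O"]]
pred = [["B-GPE", "I-GPE", "O", "B-LOCATION", "O", "O"]]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Four of the six tokens in the second entity's neighbourhood are tagged correctly, yet the truncated LOCATION span earns no credit, which is why entity-level scores are stricter than token-level accuracy.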

Model performance for the NE classes of the validation set

| NE Class | Precision | Recall | F1-score | Support |
|---|---:|---:|---:|---:|
| ADAGE | 90.00% | 47.37% | 62.07% | 19 |
| ART | 91.36% | 95.48% | 93.38% | 155 |
| CARDINAL | 98.44% | 98.37% | 98.40% | 2,878 |
| CONTACT | 100.00% | 83.33% | 90.91% | 18 |
| DATE | 97.38% | 97.27% | 97.33% | 2,603 |
| DISEASE | 96.72% | 97.52% | 97.12% | 121 |
| EVENT | 83.24% | 93.51% | 88.07% | 154 |
| FACILITY | 68.95% | 84.83% | 76.07% | 178 |
| GPE | 98.46% | 96.50% | 97.47% | 1,656 |
| LANGUAGE | 95.45% | 89.36% | 92.31% | 47 |
| LAW | 87.50% | 87.50% | 87.50% | 56 |
| LOCATION | 92.49% | 93.81% | 93.14% | 210 |
| MISCELLANEOUS | 100.00% | 76.92% | 86.96% | 26 |
| MONEY | 99.56% | 100.00% | 99.78% | 455 |
| NON_HUMAN | 0.00% | 0.00% | 0.00% | 1 |
| NORP | 95.71% | 95.45% | 95.58% | 374 |
| ORDINAL | 98.14% | 95.84% | 96.98% | 385 |
| ORGANISATION | 92.19% | 90.97% | 91.58% | 753 |
| PERCENTAGE | 99.08% | 99.08% | 99.08% | 437 |
| PERSON | 98.47% | 98.72% | 98.60% | 1,175 |
| POSITION | 96.15% | 97.79% | 96.96% | 587 |
| PRODUCT | 89.06% | 78.08% | 83.21% | 73 |
| PROJECT | 92.13% | 95.22% | 93.65% | 209 |
| QUANTITY | 97.58% | 98.30% | 97.94% | 411 |
| TIME | 94.81% | 96.63% | 95.71% | 208 |
| micro avg | 96.58% | 96.66% | 96.62% | 13,189 |
| macro avg | 90.12% | 87.51% | 88.39% | 13,189 |
| weighted avg | 96.67% | 96.66% | 96.63% | 13,189 |
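
The gap between the macro average and the other two averages in the validation table comes from how each one weights the classes. The sketch below recomputes two of the summary F1 values from the per-class rows, assuming the usual definitions (macro = unweighted mean over classes, weighted = mean weighted by support). The F1 values and supports are copied from the validation table; the variable and function names are illustrative. The micro average cannot be recovered from percentages alone, since it needs the raw true-positive and prediction counts.

```python
# (class, F1 %, support), copied from the validation-set table
rows = [
    ("ADAGE", 62.07, 19), ("ART", 93.38, 155), ("CARDINAL", 98.40, 2878),
    ("CONTACT", 90.91, 18), ("DATE", 97.33, 2603), ("DISEASE", 97.12, 121),
    ("EVENT", 88.07, 154), ("FACILITY", 76.07, 178), ("GPE", 97.47, 1656),
    ("LANGUAGE", 92.31, 47), ("LAW", 87.50, 56), ("LOCATION", 93.14, 210),
    ("MISCELLANEOUS", 86.96, 26), ("MONEY", 99.78, 455), ("NON_HUMAN", 0.00, 1),
    ("NORP", 95.58, 374), ("ORDINAL", 96.98, 385), ("ORGANISATION", 91.58, 753),
    ("PERCENTAGE", 99.08, 437), ("PERSON", 98.60, 1175), ("POSITION", 96.96, 587),
    ("PRODUCT", 83.21, 73), ("PROJECT", 93.65, 209), ("QUANTITY", 97.94, 411),
    ("TIME", 95.71, 208),
]


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


total_support = sum(support for _, _, support in rows)

# Macro: every class counts equally, so rare classes pull the mean down.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(total_support)                      # 13189
print(round(f1_from(90.00, 47.37), 2))    # 62.07 (ADAGE row)
print(round(macro_f1, 2))                 # 88.39
print(round(weighted_f1, 2))              # 96.63
```

Low-support classes such as NON_HUMAN (one instance, F1 of 0.00%) and ADAGE account for most of the distance between the macro and weighted figures.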

Model performance for the NE classes of the test set

| NE Class | Precision | Recall | F1-score | Support |
|---|---:|---:|---:|---:|
| ADAGE | 71.43% | 29.41% | 41.67% | 17 |
| ART | 95.71% | 96.89% | 96.30% | 161 |
| CARDINAL | 98.43% | 98.60% | 98.51% | 2,789 |
| CONTACT | 94.44% | 85.00% | 89.47% | 20 |
| DATE | 96.59% | 97.60% | 97.09% | 2,584 |
| DISEASE | 87.69% | 95.80% | 91.57% | 119 |
| EVENT | 86.67% | 92.86% | 89.66% | 154 |
| FACILITY | 74.88% | 81.73% | 78.16% | 197 |
| GPE | 98.57% | 97.81% | 98.19% | 1,691 |
| LANGUAGE | 90.70% | 95.12% | 92.86% | 41 |
| LAW | 93.33% | 76.36% | 84.00% | 55 |
| LOCATION | 92.08% | 89.42% | 90.73% | 208 |
| MISCELLANEOUS | 86.21% | 96.15% | 90.91% | 26 |
| MONEY | 100.00% | 100.00% | 100.00% | 427 |
| NON_HUMAN | 0.00% | 0.00% | 0.00% | 1 |
| NORP | 99.46% | 99.18% | 99.32% | 368 |
| ORDINAL | 96.63% | 97.64% | 97.14% | 382 |
| ORGANISATION | 90.97% | 91.23% | 91.10% | 718 |
| PERCENTAGE | 98.05% | 98.05% | 98.05% | 462 |
| PERSON | 98.70% | 99.13% | 98.92% | 1,151 |
| POSITION | 96.36% | 97.65% | 97.00% | 597 |
| PRODUCT | 89.23% | 77.33% | 82.86% | 75 |
| PROJECT | 93.69% | 93.69% | 93.69% | 206 |
| QUANTITY | 97.26% | 97.02% | 97.14% | 403 |
| TIME | 94.95% | 94.09% | 94.52% | 220 |
| micro avg | 96.54% | 96.85% | 96.69% | 13,072 |
| macro avg | 88.88% | 87.11% | 87.55% | 13,072 |
| weighted avg | 96.55% | 96.85% | 96.67% | 13,072 |