An easy-to-use named entity recognition (NER) model for Turkish, built in Python with BERT and transfer learning.

Citation

Please cite the following if you use this model in your study:


@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks}, 
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}



@book{yildirim2021mastering,
  title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
  author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
  year={2021},
  publisher={Packt Publishing Ltd}
}

Other details

Thanks to @stefan-it, I followed the steps below for training:

# Create the data folder if it does not exist yet
mkdir -p tr-data
cd tr-data

for file in train.txt dev.txt test.txt labels.txt
do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done

cd ..

This downloads the pre-processed dataset, with training, dev, and test splits, into the tr-data folder.
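To inspect the downloaded splits before training, a small reader can help. The sketch below assumes the files use the token-per-line layout that the legacy Transformers NER example scripts read (one token and its label per line, separated by whitespace, with a blank line between sentences). That layout is an assumption about these files, and `read_conll` is a hypothetical helper name, not part of any library.

```python
import io


def read_conll(lines):
    """Yield (tokens, labels) for each sentence in a token-per-line file.

    Assumed layout: "token label" on each line, blank line between sentences.
    """
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                yield tokens, labels
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        # Fall back to "O" when a line carries no label column.
        labels.append(parts[-1] if len(parts) > 1 else "O")
    if tokens:
        yield tokens, labels


# Hand-written sample in the assumed layout (not taken from the dataset).
sample = io.StringIO("Mustafa B-PER\nKemal I-PER\n\nSamsun B-LOC\n")
sentences = list(read_conll(sample))
print(sentences)
```

For a real file, pass an open file handle instead of the in-memory sample, for example `open("tr-data/train.txt", encoding="utf-8")`.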

Run fine-tuning

After downloading the dataset, fine-tuning can be started. First, set the following environment variables:

export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1

Then run fine-tuning:

python3 run_ner_old.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
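As a rough guide to what SAVE_STEPS means in practice, the sketch below computes how many training examples pass between two checkpoints. It assumes the script counts one step per optimizer update and saves every SAVE_STEPS steps. The `n_gpus` and `grad_accum_steps` parameters are hypothetical names added for illustration, and the single-GPU default is an assumption.

```python
def examples_per_checkpoint(save_steps, per_gpu_batch_size, n_gpus=1, grad_accum_steps=1):
    """Number of training examples consumed between two saved checkpoints."""
    # Effective batch per optimizer step = per-GPU batch * GPUs * accumulation steps.
    return save_steps * per_gpu_batch_size * max(1, n_gpus) * grad_accum_steps


# Values from the environment variables above: SAVE_STEPS=625, BATCH_SIZE=32.
print(examples_per_checkpoint(save_steps=625, per_gpu_batch_size=32))  # 20000
```

Under these assumptions, a checkpoint is written after every 20,000 training examples.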

Usage

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")

# Build a token-classification pipeline and tag a sample sentence
ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
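The pipeline returns one prediction per token, so multi-word names arrive as separate B- and I- tagged pieces. The sketch below shows one way to merge them into entity spans. It assumes each prediction is a dict with `entity`, `start`, and `end` keys, as recent Transformers versions return when a fast tokenizer is used. `group_entities` and `tok` are hypothetical helper names, and the sample tags are hand-written for illustration, not actual model output.

```python
def group_entities(text, predictions):
    """Merge B-/I- tagged token predictions into (entity_type, surface_text) spans."""
    spans = []  # each item: [entity_type, start_offset, end_offset]
    for pred in predictions:
        tag = pred["entity"]
        if tag == "O":
            continue
        prefix, _, etype = tag.partition("-")
        if spans and spans[-1][0] == etype:
            gap = pred["start"] - spans[-1][2]
            # gap == 0: word-piece continuation; I- tag after one space: next word.
            if gap == 0 or (prefix == "I" and gap <= 1):
                spans[-1][2] = pred["end"]
                continue
        spans.append([etype, pred["start"], pred["end"]])
    return [(etype, text[start:end]) for etype, start, end in spans]


text = "Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."


def tok(word, tag):
    """Build a hand-written prediction for the first occurrence of `word`."""
    start = text.index(word)
    return {"entity": tag, "start": start, "end": start + len(word)}


# Illustrative tags only; real predictions come from calling ner(text).
sample = [tok("Mustafa", "B-PER"), tok("Kemal", "I-PER"),
          tok("Atatürk", "I-PER"), tok("Samsun", "B-LOC")]
print(group_entities(text, sample))
# [('PER', 'Mustafa Kemal Atatürk'), ('LOC', 'Samsun')]
```

Recent Transformers versions can also do similar grouping inside the pipeline through the `aggregation_strategy` argument, which may be preferable when it is available.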

Some results

Data1: Results for the data downloaded above.

Eval Results:

  • precision = 0.916400580551524
  • recall = 0.9342309684101502
  • f1 = 0.9252298787412536
  • loss = 0.11335893666411284

Test Results:

  • precision = 0.9192058759362955
  • recall = 0.9303010230367262
  • f1 = 0.9247201697271198
  • loss = 0.11182546521618497

Data2: https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt

Performance on the data provided by @kemalaraz is as follows.

Eval Results (eval_results.txt):

  • precision = 0.9461980692049029
  • recall = 0.959309358847465
  • f1 = 0.9527086063783312
  • loss = 0.037054269206847804

Test Results (test_results.txt):

  • precision = 0.9458370635631155
  • recall = 0.9588201928530913
  • f1 = 0.952284378344882
  • loss = 0.035431676572445225
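
The reported f1 values can be sanity-checked against precision and recall. The sketch below assumes f1 is the harmonic mean of the two, 2PR / (P + R), and compares it with the figures listed above. `f1_score` and the `reported` table are names introduced here for illustration.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the results above.
reported = {
    "Data1 eval": (0.916400580551524, 0.9342309684101502, 0.9252298787412536),
    "Data1 test": (0.9192058759362955, 0.9303010230367262, 0.9247201697271198),
    "Data2 eval": (0.9461980692049029, 0.959309358847465, 0.9527086063783312),
    "Data2 test": (0.9458370635631155, 0.9588201928530913, 0.952284378344882),
}

for name, (p, r, f1) in reported.items():
    # Compare with a small tolerance to allow for floating-point rounding.
    print(name, math.isclose(f1_score(p, r), f1, abs_tol=1e-6))
```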