Turkish Text Classification

This model is a fine-tune model of https://github.com/stefan-it/turkish-bert by using text classification data where there are 7 categories as follows

code_to_label={
 'LABEL_0': 'dunya ',
 'LABEL_1': 'ekonomi ',
 'LABEL_2': 'kultur ',
 'LABEL_3': 'saglik ',
 'LABEL_4': 'siyaset ',
 'LABEL_5': 'spor ',
 'LABEL_6': 'teknoloji '}
 

Citation

Please cite the following papers if needed

@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks}, 
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}




@book{yildirim2021mastering,
  title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
  author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
  year={2021},
  publisher={Packt Publishing Ltd}
}

Data

The following Turkish benchmark dataset is used for fine-tuning

https://www.kaggle.com/savasy/ttc4900

Quick Start

Bewgin with installing transformers as follows

pip install transformers

# Code:
# import libraries
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer, AutoModelForSequenceClassification
tokenizer= AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification")

# build and load model, it take time depending on your internet connection
model= AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification")

# make pipeline
nlp=pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# apply model
nlp("bla bla")
# [{'label': 'LABEL_2', 'score': 0.4753005802631378}]

code_to_label={
 'LABEL_0': 'dunya ',
 'LABEL_1': 'ekonomi ',
 'LABEL_2': 'kultur ',
 'LABEL_3': 'saglik ',
 'LABEL_4': 'siyaset ',
 'LABEL_5': 'spor ',
 'LABEL_6': 'teknoloji '}
 
code_to_label[nlp("bla bla")[0]['label']]
# > 'kultur '

How the model was trained


## loading data for Turkish text classification
import pandas as pd
# https://www.kaggle.com/savasy/ttc4900
df=pd.read_csv("7allV03.csv")
df.columns=["labels","text"]
df.labels=pd.Categorical(df.labels)

traind_df=...
eval_df=...

# model
from simpletransformers.classification import ClassificationModel
import torch,sklearn

model_args = {
    "use_early_stopping": True,
    "early_stopping_delta": 0.01,
    "early_stopping_metric": "mcc",
    "early_stopping_metric_minimize": False,
    "early_stopping_patience": 5,
    "evaluate_during_training_steps": 1000,
    "fp16": False,
    "num_train_epochs":3
}

model = ClassificationModel(
    "bert", 
    "dbmdz/bert-base-turkish-cased",
     use_cuda=cuda_available, 
     args=model_args, 
     num_labels=7
)
model.train_model(train_df, acc=sklearn.metrics.accuracy_score)

For other training models please check https://simpletransformers.ai/

For the detailed usage of Turkish Text Classification please check python notebook

Downloads last month
636
Safetensors
Model size
111M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Space using savasy/bert-turkish-text-classification 1