language: tr
tags:
  - bert
  - turkish
  - text-classification
license: apache-2.0
datasets:
  - custom
metrics:
  - precision
  - recall
  - f1
  - accuracy

BERT-based Organization Detection Model for Turkish Texts

Model Description

This model is a fine-tuned version of dbmdz/bert-base-turkish-uncased that detects organization accounts on Turkish Twitter. It is part of the Politus Project's efforts to analyze organizational presence in social media data.

Model Architecture

  • Base Model: BERT (dbmdz/bert-base-turkish-uncased)
  • Training Data: Twitter data from 8,000 accounts: 4,000 sampled at random and 4,000 with high organization-related activity, defined as an m3inference organization score above 0.7. Labels were annotated with ChatGPT 4 based on each account's user name, screen name, and description.

Training Setup

  • Tokenization: Used Hugging Face's AutoTokenizer, padding sequences to a maximum length of 128 tokens.
  • Dataset Split: 80% training, 20% validation.
  • Training Parameters:
    • Epochs: 3
    • Training batch size: 8
    • Evaluation batch size: 16
    • Warmup steps: 500
    • Weight decay: 0.01
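
The setup above could be expressed roughly as follows. This is a minimal sketch, not the project's actual training script: the helper names (`split_train_validation`, `tokenize_batch`, `build_training_arguments`), the random seed, the use of truncation, and the `./results` output directory are all assumptions added for illustration. Only the numeric values come from the list above.

```python
import random

MAX_LENGTH = 128  # padding length stated in the model card

# Values copied from the "Training Parameters" list above.
TRAINING_KWARGS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 16,
    "warmup_steps": 500,
    "weight_decay": 0.01,
}


def split_train_validation(examples, validation_fraction=0.2, seed=42):
    """Shuffle and split examples 80/20 (seed and plain shuffling are assumptions)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def tokenize_batch(tokenizer, texts):
    """Pad (and, as an assumption, truncate) every text to MAX_LENGTH tokens."""
    return tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
    )


def build_training_arguments(output_dir="./results"):
    """Build TrainingArguments; transformers is imported only when this is called."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

With 8,000 annotated accounts, `split_train_validation` yields 6,400 training and 1,600 validation examples. `tokenize_batch` expects a tokenizer loaded with `AutoTokenizer.from_pretrained`.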

Hyperparameter Tuning

Hyperparameters were tuned with Optuna. The best trial used the following settings:

  • Learning rate: 1.84e-05
  • Batch size: 16
  • Epochs: 3
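
A search of this kind might be wired up along the following lines. This is a sketch only: the model card does not state the search space, the number of trials, or the optimized metric, so the ranges below, `n_trials=20`, the "maximize" direction, and the `train_and_evaluate` callback are all hypothetical.

```python
def suggest_hyperparameters(trial):
    """Sample one candidate configuration from a hypothetical search space."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [8, 16, 32]),
        "epochs": trial.suggest_int("epochs", 2, 4),
    }


def make_objective(train_and_evaluate):
    """Wrap a user-supplied training function into an Optuna objective."""

    def objective(trial):
        params = suggest_hyperparameters(trial)
        # train_and_evaluate is hypothetical: it should fine-tune the model
        # with these parameters and return the validation score to maximize.
        return train_and_evaluate(**params)

    return objective


def run_study(train_and_evaluate, n_trials=20):
    """Run the search and return the best parameters; optuna is imported lazily."""
    import optuna

    study = optuna.create_study(direction="maximize")
    study.optimize(make_objective(train_and_evaluate), n_trials=n_trials)
    return study.best_params
```

`run_study` returns a dictionary with the same keys as the settings listed above (`learning_rate`, `batch_size`, `epochs`).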

Evaluation Metrics

  • Precision on Validation Set: 0.67 (organization class)
  • Recall on Validation Set: 0.81 (organization class)
  • F1-Score on Validation Set: 0.73 (organization class)
  • Accuracy: 0.94
  • Confusion Matrix on Validation Set:
    [[1390, 60],
     [ 28, 122]]
  • Hand-coded Sample of 1000 Accounts:
    • Precision: 0.89
    • Recall: 0.89
    • F1-Score: 0.89 (organization class)
    • Confusion Matrix:
      [[935, 4],
       [ 4, 31]]
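
The reported figures can be checked against the confusion matrices with a few lines of arithmetic. The sketch below assumes that rows are true labels, columns are predictions, and index 1 is the organization class. The model card does not state this layout, but it is the one under which the matrices reproduce the reported organization-class precision and recall. The function name `binary_metrics` is illustrative.

```python
def binary_metrics(confusion_matrix):
    """Metrics for the positive (organization) class.

    Assumes rows are true labels and columns are predictions,
    with index 1 standing for the organization class.
    """
    (tn, fp), (fn, tp) = confusion_matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


validation = binary_metrics([[1390, 60], [28, 122]])
hand_coded = binary_metrics([[935, 4], [4, 31]])

for name, metrics in (("validation", validation), ("hand-coded", hand_coded)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under these assumptions the validation matrix gives an organization-class precision of about 0.67, a recall of about 0.81, and an F1 of about 0.73, with accuracy close to 0.945. The hand-coded matrix gives about 0.89 for precision, recall, and F1.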
    

How to Use

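import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "atsizelti/turkish_org_classifier"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "Örnek metin buraya girilir."  # "Sample text goes here."
# Truncate to the 128-token length used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # no gradients are needed for prediction
    outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)  # index of the highest-scoring class
# model.config.id2label maps each class index to its label name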