---
language: "tr"
tags:
- "bert"
- "turkish"
- "text-classification"
license: "apache-2.0"
datasets:
- "custom"
metrics:
- "precision"
- "recall"
- "f1"
- "accuracy"
---

# BERT-based Organization Detection Model for Turkish Texts

## Model Description

This model is fine-tuned from `dbmdz/bert-base-turkish-uncased` to detect organization accounts on Turkish Twitter. It was developed as part of the Politus Project's effort to analyze organizational presence in social media data.

## Model Architecture

- **Base Model:** BERT (`dbmdz/bert-base-turkish-uncased`)
- **Training Data:** Twitter data from 8,000 accounts in total: 4,000 sampled at random and 4,000 with high organization-related activity, as determined by m3inference scores above 0.7 (this sampling design is sketched below). Accounts were annotated from their user names, screen names, and descriptions using ChatGPT-4.
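
The selection step can be illustrated with a short sketch. Everything concrete here is an assumption for illustration: the file name, the `org_score` column (a precomputed m3inference organization score), and the seed; the model card only specifies the two 4,000-account pools.

```python
import pandas as pd

# Hypothetical input: one row per account, with a precomputed m3inference
# organization score ("org_score") and the profile fields used for annotation.
accounts = pd.read_csv("accounts.csv")

# 4,000 accounts with high organization-related activity (score > 0.7) ...
high_org = accounts[accounts["org_score"] > 0.7].sample(n=4000, random_state=42)
# ... plus 4,000 accounts drawn at random from the rest, 8,000 in total.
random_pool = accounts.drop(high_org.index).sample(n=4000, random_state=42)

training_pool = pd.concat([high_org, random_pool])
assert len(training_pool) == 8000
```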

## Training Setup

- **Tokenization:** Hugging Face's `AutoTokenizer`, with sequences padded to a maximum length of 128 tokens.
- **Dataset Split:** 80% training, 20% validation.
- **Training Parameters** (mapped onto code in the sketch after this list):
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01
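
A minimal sketch of this setup with the Hugging Face `Trainer` API follows. The toy dataset and its `text`/`label` column names are placeholders, not the project's actual data pipeline:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-uncased", num_labels=2)

# Placeholder rows; the real dataset holds the 8,000 annotated accounts.
dataset = Dataset.from_dict({
    "text": ["Örnek metin buraya girilir."] * 10,  # "Example text goes here."
    "label": [0, 1] * 5,
})

def tokenize(batch):
    # Pad (and truncate) every sequence to the 128-token maximum.
    return tokenizer(batch["text"], padding="max_length",
                     truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)
split = dataset.train_test_split(test_size=0.2)  # 80/20 train/validation

args = TrainingArguments(
    output_dir="org-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)

Trainer(model=model, args=args,
        train_dataset=split["train"],
        eval_dataset=split["test"]).train()
```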

## Hyperparameter Tuning

Hyperparameters were tuned with Optuna (a search sketch follows); the best trial used:

- **Learning rate:** 1.84e-05
- **Batch size:** 16
- **Epochs:** 3
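
For reference, a search like this can be run through `Trainer.hyperparameter_search` with the Optuna backend. This is a sketch under assumptions: the search ranges and trial count are illustrative, and `split` is the tokenized dataset from the previous sketch:

```python
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

def model_init():
    # A fresh model per trial, so trials do not share weights.
    return AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-uncased", num_labels=2)

def hp_space(trial):
    # Assumed ranges; the best trial reported above landed at
    # learning rate 1.84e-05, batch size 16, and 3 epochs.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hp-search", evaluation_strategy="epoch"),
    train_dataset=split["train"],
    eval_dataset=split["test"],
)

best = trainer.hyperparameter_search(
    hp_space=hp_space, backend="optuna",
    n_trials=20, direction="minimize")  # minimize validation loss
print(best.hyperparameters)
```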

## Evaluation Metrics

- **Precision on Validation Set:** 0.67 (organization class)
- **Recall on Validation Set:** 0.81 (organization class)
- **F1-Score:** 0.73 (organization class)
- **Accuracy:** 0.94
- **Confusion Matrix on Validation Set** (rows = true class, columns = predicted; second row/column = organization):

```
[[1390,  60],
 [  28, 122]]
```

- **Hand-coded Sample of 1,000 Accounts:**
  - **Precision:** 0.89
  - **Recall:** 0.89
  - **F1-Score:** 0.89 (organization class)
  - **Confusion Matrix** (same layout as above):

```
[[935,   4],
 [  4,  31]]
```
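
As a sanity check, the per-class figures can be reproduced from the confusion matrices with scikit-learn. The sketch below expands the validation matrix back into label lists; the class names encode the assumed label order (index 1 = organization), which is what makes the reported precision and recall come out right:

```python
import numpy as np
from sklearn.metrics import classification_report

# Validation confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[1390, 60], [28, 122]])

y_true, y_pred = [], []
for t in (0, 1):
    for p in (0, 1):
        y_true += [t] * int(cm[t, p])
        y_pred += [p] * int(cm[t, p])

# Organization class: precision 122/182 ≈ 0.67, recall 122/150 ≈ 0.81,
# F1 ≈ 0.73; overall accuracy (1390 + 122)/1600 ≈ 0.94.
print(classification_report(y_true, y_pred, target_names=["non-org", "org"]))
```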

## How to Use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("atsizelti/turkish_org_classifier")
tokenizer = AutoTokenizer.from_pretrained("atsizelti/turkish_org_classifier")

text = "Örnek metin buraya girilir."  # "Example text goes here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

prediction = outputs.logits.argmax(dim=-1).item()  # predicted class index
```
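
For quick experiments, the same checkpoint can also be loaded through the `pipeline` API; the label names it prints depend on the model's config:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="atsizelti/turkish_org_classifier")
print(classifier("Örnek metin buraya girilir."))  # [{'label': ..., 'score': ...}]
```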