---
language: "tr"
tags:
- "bert"
- "turkish"
- "text-classification"
license: "apache-2.0"
datasets:
- "custom"
metrics:
- "precision"
- "recall"
- "f1"
- "accuracy"
---
# BERT-based Organization Detection Model for Turkish Texts
## Model Description
This model is a fine-tuned version of `dbmdz/bert-base-turkish-uncased` that detects organization accounts among Turkish Twitter users. It was developed as part of the Politus Project's efforts to analyze organizational presence in social media data.
## Model Architecture
- **Base Model:** BERT (dbmdz/bert-base-turkish-uncased)
- **Training Data:** Twitter data from 8,000 accounts in total: 4,000 randomly sampled accounts and 4,000 accounts flagged as likely organization-related by m3inference scores above 0.7. Labels were assigned from each account's user name, screen name, and description using ChatGPT 4.
## Training Setup
- **Tokenization:** Used Hugging Face's AutoTokenizer, padding sequences to a maximum length of 128 tokens.
- **Dataset Split:** 80% training, 20% validation.
- **Training Parameters:**
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01
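
As a rough sanity check on these settings, the sketch below works out how many optimizer steps the run takes and how much of it is spent in warmup. It assumes the full 8,000 accounts are split 80/20 (the validation confusion matrix in this card sums to 1,600, which is consistent with that), a single device, and no gradient accumulation. The last two assumptions are not stated in the card, and the variable names are illustrative.

```python
import math

total_accounts = 8_000   # from the model card
train_percent = 80       # 80% training split
batch_size = 8           # training batch size
epochs = 3
warmup_steps = 500

# Assumption: single device, no gradient accumulation
train_examples = total_accounts * train_percent // 100
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_share = warmup_steps / total_steps

print(train_examples, steps_per_epoch, total_steps)
print(f"warmup covers {warmup_share:.1%} of training")
```

Under these assumptions the run takes 2,400 optimizer steps, so the 500 warmup steps cover roughly the first fifth of training.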
## Hyperparameter Tuning
Hyperparameters were tuned with Optuna. The best trial used:
- **Learning rate:** 1.84e-05
- **Batch size:** 16
- **Epochs:** 3
## Evaluation Metrics
- **Precision on Validation Set:** 0.67 (organization class)
- **Recall on Validation Set:** 0.81 (organization class)
- **F1-Score (organization class):** 0.73
- **Accuracy:** 0.94
- **Confusion Matrix on Validation Set:**
```
[[1390, 60],
[ 28, 122]]
```
- **Hand-coded Sample of 1000 Accounts:**
  - **Precision:** 0.89 (organization class)
  - **Recall:** 0.89 (organization class)
  - **F1-Score (organization class):** 0.89
  - **Confusion Matrix:**
    ```
    [[935, 4],
    [ 4, 31]]
    ```
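
The figures above can be re-derived from the two confusion matrices. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predictions) and that index 1 is the organization class; the card does not state either, and `binary_metrics` is an illustrative helper, not part of the released code. A macro-averaged F1 is computed alongside the organization-class F1 for comparison.

```python
def binary_metrics(cm):
    """Metrics for class 1 from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # F1 of class 0, obtained by swapping the roles of the two classes
    f1_neg = 2 * tn / (2 * tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_f1": (f1 + f1_neg) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

validation = binary_metrics([[1390, 60], [28, 122]])
hand_coded = binary_metrics([[935, 4], [4, 31]])

for name, metrics in [("validation", validation), ("hand-coded", hand_coded)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under these assumptions the validation matrix gives about 0.67 precision, 0.81 recall, and 0.73 F1 for the organization class, with a macro-averaged F1 of about 0.85.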
## How to Use
```python
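import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "atsizelti/turkish_org_classifier"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Örnek metin buraya girilir."  # "Sample text goes here."
# Truncate to 128 tokens, the maximum sequence length used in training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)  # predicted class index per input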
```