---
language: "tr"
tags:
  - "bert"
  - "turkish"
  - "text-classification"
license: "apache-2.0"
datasets:
  - "custom"
metrics:
  - "precision"
  - "recall"
  - "f1"
  - "accuracy"
---


# BERT-based Organization Detection Model for Turkish Texts

## Model Description

This model is fine-tuned from `dbmdz/bert-base-turkish-uncased` to detect organization accounts on Turkish Twitter. It was developed as part of the Politus Project's work on analyzing organizational presence in social media data.

## Model Architecture

- **Base Model:** BERT (dbmdz/bert-base-turkish-uncased)
- **Training Data:** Twitter data from 8,000 accounts in total: 4,000 random accounts and 4,000 accounts with high organization-related activity (m3inference organization scores above 0.7). The data was annotated from user names, screen names, and profile descriptions using ChatGPT-4.

## Training Setup

- **Tokenization:** Used Hugging Face's AutoTokenizer, padding sequences to a maximum length of 128 tokens.
- **Dataset Split:** 80% training, 20% validation.
- **Training Parameters:** 
  - Epochs: 3
  - Training batch size: 8
  - Evaluation batch size: 16
  - Warmup steps: 500
  - Weight decay: 0.01
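In `transformers`, these parameters map onto the `Trainer`'s `TrainingArguments`. A minimal sketch of the configuration (the output directory is a placeholder assumption; the remaining values mirror those reported above):

```python
# Keyword arguments as they would be passed to transformers.TrainingArguments;
# "output_dir" is a placeholder path, not taken from the original run.
training_args = dict(
    output_dir="./org-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
)
```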

## Hyperparameter Tuning

Hyperparameters were tuned with Optuna; the best settings were:
- **Learning rate:** 1.84e-05
- **Batch size:** 16
- **Epochs:** 3

## Evaluation Metrics

- **Precision on Validation Set:** 0.67 (organization class)
- **Recall on Validation Set:** 0.81 (organization class)
- **F1-Score (Organization Class):** 0.73
- **Accuracy:** 0.94
- **Confusion Matrix on Validation Set:**
  ```
  [[1390,   60],
   [  28,  122]]
  ```
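The reported precision, recall, and accuracy follow directly from this matrix (rows are true labels, columns are predictions, with the organization class second, as implied by the reported per-class numbers):

```python
# Validation confusion matrix: rows = true class, columns = predicted class,
# class order: [non-organization, organization]
cm = [[1390, 60],
      [28, 122]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)                   # 122 / 182 ≈ 0.67
recall = tp / (tp + fn)                      # 122 / 150 ≈ 0.81
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 1512 / 1600 ≈ 0.945
```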

- **Hand-coded Sample of 1000 Accounts:**
  - **Precision:** 0.89
  - **Recall:** 0.89
  - **F1-Score (Organization Class):** 0.89
  - **Confusion Matrix:**
  ```
  [[935, 4],
   [ 4, 31]]
  ```

## How to Use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("atsizelti/turkish_org_classifier")
tokenizer = AutoTokenizer.from_pretrained("atsizelti/turkish_org_classifier")

text = "Örnek metin buraya girilir."  # "Sample text goes here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class
predicted_class = outputs.logits.argmax(dim=-1).item()
```
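The `argmax` above yields only a hard label. To obtain a confidence score, apply a softmax to the logits; the computation is shown here in pure Python with made-up logit values (in practice the values come from `outputs.logits`):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores, class order: [non-organization, organization]
logits = [2.1, -1.3]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
```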