	---
	language: tr
	tags:
	- bert
	- turkish
	- text-classification
	- offensive-language-detection
	license: mit
	datasets:
	- offenseval2020_tr
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	---

	# Offensive Language Detection For Turkish Language

	## Model Description
	This model was obtained by fine-tuning [dbmdz/bert-base-turkish-128k-uncased](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) on the Turkish [OffensEval 2020](https://huggingface.co/datasets/offenseval2020_tr) dataset (offenseval-tr).
	The training split of offenseval-tr contains 31,756 annotated tweets.

	## Dataset Distribution

	| Split | Non-offensive (0) | Offensive (1) |
	|-------|-------------------|---------------|
	| Train | 25625 | 6131 |
	| Test | 2812 | 716 |
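
The class balance implied by the table can be checked with a few lines of Python. This is a minimal sketch that only reuses the counts from the table above; the dictionary and function names are illustrative.

```python
# Tweet counts copied from the dataset distribution table
counts = {
    "train": {"non_offensive": 25625, "offensive": 6131},
    "test": {"non_offensive": 2812, "offensive": 716},
}


def offensive_share(split):
    """Return the fraction of tweets labelled offensive in a split."""
    c = counts[split]
    return c["offensive"] / (c["non_offensive"] + c["offensive"])


for split in counts:
    total = sum(counts[split].values())
    print(f"{split}: {total} tweets, {offensive_share(split):.1%} offensive")
```

Roughly one tweet in five is offensive in both splits, so overall accuracy should be read together with the per-class scores.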


	## Preprocessing Steps

	| Process | Description |
	|---------|-------------|
	| Accented character transformation | Converting accented characters to their unaccented equivalents |
	| Lowercase transformation | Converting all text to lowercase |
	| Removing @user mentions | Removing @user formatted user mentions from text |
	| Removing hashtag expressions | Removing #hashtag formatted expressions from text |
	| Removing URLs | Removing URLs from text |
	| Removing punctuation and punctuated emojis | Removing punctuation marks and emojis presented with punctuation from text |
	| Removing emojis | Removing emojis from text |
	| Deasciification | Converting ASCII text into text containing Turkish characters |


	The effect of each preprocessing step on model performance was analyzed.
	Removing digits and keeping hashtags had no effect on performance.


	## Usage

	Install the necessary libraries:

	```pip install git+https://github.com/emres/turkish-deasciifier.git```

	```pip install keras_preprocessing```


	The preprocessing functions are defined below:

	```python

	from turkish.deasciifier import Deasciifier


	def deasciifier(text):
	    # Convert ASCII-only text into text containing Turkish characters
	    converter = Deasciifier(text)
	    return converter.convert_to_turkish()


	def remove_circumflex(text):
	    # Map circumflexed vowels to their plain equivalents
	    circumflex_map = {
	        'â': 'a',
	        'î': 'i',
	        'û': 'u',
	        'ô': 'o',
	        'Â': 'A',
	        'Î': 'I',
	        'Û': 'U',
	        'Ô': 'O'
	    }

	    return ''.join(circumflex_map.get(c, c) for c in text)


	def turkish_lower(text):
	    # Lowercase with Turkish rules: 'I' becomes dotless 'ı', 'İ' becomes 'i'
	    turkish_map = {
	        'I': 'ı',
	        'İ': 'i',
	        'Ç': 'ç',
	        'Ş': 'ş',
	        'Ğ': 'ğ',
	        'Ü': 'ü',
	        'Ö': 'ö'
	    }
	    return ''.join(turkish_map.get(c, c).lower() for c in text)
	```
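
A custom `turkish_lower` is needed because Python's built-in `str.lower()` follows the default Unicode case mapping, which does not apply the Turkish rules for dotted and dotless I. The sketch below illustrates the difference; `turkish_lower_sketch` is a hypothetical, reduced copy of the function above that keeps only the two letters that matter here.

```python
# Reduced mapping: only the two letters whose default lowercase is wrong for Turkish
TURKISH_I_MAP = {"I": "\u0131", "\u0130": "i"}  # I -> ı, İ -> i


def turkish_lower_sketch(text):
    """Lowercase text, mapping I/İ by Turkish rules and everything else by default."""
    return "".join(TURKISH_I_MAP.get(c, c).lower() for c in text)


word = "I\u015eIK"  # "IŞIK" (light)
print(word.lower())                # işik  (default mapping turns I into dotted i)
print(turkish_lower_sketch(word))  # ışık  (Turkish mapping keeps the dotless ı)

# The default mapping of İ yields two code points: "i" plus a combining dot above
print(len("\u0130".lower()))       # 2
```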

	Clean the text with the function below:

	```python
	import re

	def clean_text(text):
	    # Replace circumflexed letters with their plain equivalents
	    text = remove_circumflex(text)
	    # Convert the text to lowercase
	    text = turkish_lower(text)
	    # Restore Turkish characters in ASCII-only text
	    text = deasciifier(text)
	    # Remove user mentions
	    text = re.sub(r"@\S*", " ", text)
	    # Remove hashtags
	    text = re.sub(r'#\S+', ' ', text)
	    # Remove URLs
	    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
	    # Remove punctuation and text-based emoticons
	    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
	    # Remove emojis
	    emoji_pattern = re.compile("["
	                               u"\U0001F600-\U0001F64F"  # emoticons
	                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
	                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
	                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
	                               u"\U00002702-\U000027B0"
	                               u"\U000024C2-\U0001F251"
	                               "]+", flags=re.UNICODE)
	    text = emoji_pattern.sub(r' ', text)

	    # Collapse runs of whitespace into a single space
	    text = re.sub(r'\s+', ' ', text).strip()
	    return text
	```
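
To see what the regex-based steps of `clean_text` do on their own, the following sketch applies only the mention, hashtag, URL, and punctuation removal to a made-up tweet. It deliberately skips the circumflex, lowercase, and deasciification steps so that it runs with the standard library alone; `strip_social_markup` and the sample text are hypothetical.

```python
import re


def strip_social_markup(text):
    """Remove mentions, hashtags, URLs and punctuation, then tidy whitespace."""
    text = re.sub(r"@\S*", " ", text)            # user mentions
    text = re.sub(r"#\S+", " ", text)            # hashtags
    text = re.sub(r"http\S+|www\S+", " ", text)  # URLs
    text = re.sub(r"[^\w\s]", " ", text)         # punctuation, including ":)"
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace


sample = "@USER harika bir gün! #istanbul https://example.com/x :)"
print(strip_social_markup(sample))  # harika bir gün
```

Mentions and hashtags are removed before punctuation on purpose: once `@` or `#` has been stripped, the name or tag that followed would stay in the text as an ordinary word.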

	## Model Initialization

	```python
	# Load model directly
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
	model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

	```
	Check whether a sentence is offensive as follows:

	```python
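	import numpy as np
	import torch

	# Run on a GPU when one is available, otherwise on the CPU
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)
	model.eval()

	def is_offensive(sentence):
	    d = {
	        0: 'non-offensive',
	        1: 'offensive'
	    }
	    normalize_text = clean_text(sentence)
	    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')

	    # Move the input tensors to the same device as the model
	    test_sample = {k: v.to(device) for k, v in test_sample.items()}

	    # Inference only, so gradients are not needed
	    with torch.no_grad():
	        output = model(**test_sample)
	    y_pred = np.argmax(output.logits.cpu().numpy(), axis=1)

	    print(normalize_text, "-->", d[y_pred[0]])
	    return y_pred[0]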

	```

	```python
	is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
	is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
	```
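
`is_offensive` returns only the predicted class index. If a confidence score is also useful, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python and made-up logit values so that it runs without loading the model; `softmax`, `LABELS`, and `example_logits` are illustrative names and are not part of the code above.

```python
import math

LABELS = {0: "non-offensive", 1: "offensive"}


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one sentence, ordered as [non-offensive, offensive]
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

# The predicted class is the index with the highest probability
pred = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[pred], round(probs[pred], 3))
```

With the real model, the same probabilities could be obtained by applying `torch.softmax(output.logits, dim=1)` inside `is_offensive`.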

	## Evaluation

	Evaluation results on the test set are shown in the table below.
	The model achieves 89% accuracy on the test set.

	## Model Performance Metrics

	| Class | Precision | Recall | F1-score | Accuracy |
	|---------|-----------|--------|----------|----------|
	| Class 0 | 0.92 | 0.94 | 0.93 | 0.89 |
	| Class 1 | 0.73 | 0.67 | 0.70 | |
	| Macro | 0.83 | 0.80 | 0.81 | |
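
The figures in the table are related to one another, which can be checked with a short calculation. The sketch below recomputes the per-class F1-scores from precision and recall, and the accuracy from the per-class recall. It assumes that the per-class support equals the test-set counts in the dataset distribution table (2812 and 716); because the inputs are rounded to two decimals, the results are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the metrics table; support assumed from the test split
class0 = {"precision": 0.92, "recall": 0.94, "support": 2812}
class1 = {"precision": 0.73, "recall": 0.67, "support": 716}

f1_0 = f1(class0["precision"], class0["recall"])
f1_1 = f1(class1["precision"], class1["recall"])

# Recall * support is the number of correctly classified tweets in each class
correct = class0["recall"] * class0["support"] + class1["recall"] * class1["support"]
accuracy = correct / (class0["support"] + class1["support"])

print(f"{f1_0:.2f} {f1_1:.2f} {accuracy:.2f}")  # 0.93 0.70 0.89
```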