metadata
language: tr
tags:
  - bert
  - turkish
  - text-classification
  - offensive-language-detection
license: mit
datasets:
  - offenseval2020_tr
metrics:
  - accuracy
  - f1
  - precision
  - recall

Offensive Language Detection For Turkish Language

Model Description

This model was obtained by fine-tuning dbmdz/bert-base-turkish-128k-uncased on the Turkish portion of the OffensEval 2020 dataset (offenseval-tr). The training split contains 31,756 annotated tweets.

Dataset Distribution

| Split | Non Offensive (0) | Offensive (1) |
|-------|-------------------|---------------|
| Train | 25625 | 6131 |
| Test | 2812 | 716 |
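
The class counts above imply a clear imbalance, which matters when reading the accuracy figure later in this card. The following is a minimal sketch that derives the offensive share and the accuracy of a trivial "always predict non-offensive" baseline from those counts. The names `counts`, `offensive_share`, and `majority_baseline` are illustrative and not part of the model's code.

```python
# Counts copied from the dataset distribution table
counts = {
    "train": {"non_offensive": 25625, "offensive": 6131},
    "test": {"non_offensive": 2812, "offensive": 716},
}

def offensive_share(split):
    """Fraction of tweets in the split that are labeled offensive."""
    c = counts[split]
    return c["offensive"] / (c["non_offensive"] + c["offensive"])

def majority_baseline(split):
    """Accuracy obtained by always predicting the most frequent class."""
    c = counts[split]
    return max(c.values()) / sum(c.values())

for split in counts:
    total = sum(counts[split].values())
    print(
        f"{split}: {total} tweets, "
        f"{offensive_share(split):.1%} offensive, "
        f"majority-class accuracy {majority_baseline(split):.1%}"
    )
```

Roughly one tweet in five is offensive in both splits, so accuracy alone is a weak indicator and the per-class precision and recall are the more informative numbers.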

Preprocessing Steps

| Process | Description |
|---------|-------------|
| Accented character transformation | Converting accented characters to their unaccented equivalents |
| Lowercase transformation | Converting all text to lowercase |
| Removing @user mentions | Removing @user formatted user mentions from text |
| Removing hashtag expressions | Removing #hashtag formatted expressions from text |
| Removing URLs | Removing URLs from text |
| Removing punctuation and punctuated emojis | Removing punctuation marks and emojis presented with punctuation from text |
| Removing emojis | Removing emojis from text |
| Deasciification | Converting ASCII text into text containing Turkish characters |

The effect of each preprocessing step on model performance was analyzed. Removing digits and keeping hashtags had no effect on the results.

Usage

Install the necessary libraries (the examples below additionally require transformers, torch, and numpy):

pip install git+https://github.com/emres/turkish-deasciifier.git

pip install keras_preprocessing

The preprocessing helper functions are defined below:


from turkish.deasciifier import Deasciifier

def deasciifier(text):
    # Restore Turkish characters (e.g. "s" -> "ş") in ASCII-only text
    converter = Deasciifier(text)
    return converter.convert_to_turkish()

def remove_circumflex(text):
    # Replace circumflexed vowels with their plain equivalents
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }

    return ''.join(circumflex_map.get(c, c) for c in text)

def turkish_lower(text):
    # Lowercase with Turkish rules: 'I' -> 'ı' and 'İ' -> 'i'
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
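
A dedicated lowercase function is needed because Python's built-in `str.lower()` applies the default Unicode case mapping rather than Turkish rules. The short sketch below illustrates the difference. `TURKISH_I_MAP` and `lower_tr` are hypothetical names used only for this demonstration and cover just the two problematic letters.

```python
# Default Unicode lowercasing does not follow Turkish rules for I and İ.
print("I".lower() == "i")        # True: capital dotless I becomes dotted i
print("İ".lower() == "i\u0307")  # True: i followed by a combining dot above

# Hypothetical minimal map for this demo: handle the two letters first,
# then let str.lower() take care of everything else.
TURKISH_I_MAP = {"I": "ı", "İ": "i"}

def lower_tr(text):
    """Lowercase text using Turkish rules for I and İ."""
    return "".join(TURKISH_I_MAP.get(c, c) for c in text).lower()

print("IŞIK".lower())         # işik (not a correct Turkish lowercase form)
print(lower_tr("IŞIK İÇİN"))  # ışık için
```

Since the base checkpoint is an uncased Turkish model, lowercasing the input consistently with Turkish rules before tokenization is presumably what keeps the input close to what the model saw during fine-tuning.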

Clean the text with the function below:

import re

def clean_text(text):
    # Replace circumflexed letters with their plain equivalents
    text = remove_circumflex(text)
    # Convert the text to lowercase using Turkish rules
    text = turkish_lower(text)
    # Restore Turkish characters in ASCII-only text (deasciification)
    text = deasciifier(text)
    # Remove user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and text-based emojis
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
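
To see what the mention, hashtag, and URL patterns do in isolation, the sketch below applies only those three substitutions plus the whitespace cleanup to a made-up tweet. The helper name `strip_social_tokens` and the sample text are illustrative assumptions. The deasciifier and the other steps are left out so that the snippet runs with the standard library alone.

```python
import re

def strip_social_tokens(text):
    """Apply only the mention, hashtag, and URL removal steps (illustrative helper)."""
    text = re.sub(r"@\S*", " ", text)                     # user mentions
    text = re.sub(r"#\S+", " ", text)                     # hashtags
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)  # URLs
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

sample = "@USER bugün hava çok güzel #istanbul https://t.co/abc"
print(strip_social_tokens(sample))  # bugün hava çok güzel
```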

Model Initialization

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

Check whether a sentence is offensive as follows:

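import numpy as np
import torch

# Use a GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def is_offensive(sentence):
    # Map class ids to human-readable labels
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')

    # Move the input tensors to the same device as the model
    test_sample = {k: v.to(device) for k, v in test_sample.items()}

    # Inference only, so gradients are not needed
    with torch.no_grad():
        output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)

    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]

is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")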
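
The function above returns only the predicted class id. If a confidence value is also useful, the logits can be turned into probabilities with a softmax. The sketch below is one possible way to do that, written in pure Python so that it runs without the model. The names `softmax` and `label_with_confidence` are hypothetical and the logits in the example are made up.

```python
import math

# Same id-to-label mapping as in is_offensive
LABELS = {0: "non-offensive", 1: "offensive"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    """Return the most likely label and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]

# Made-up logits for illustration only, not real model output
label, prob = label_with_confidence([-1.2, 2.3])
print(f"{label} (p={prob:.2f})")
```

With the real model, the input would be the logits of a single sentence, for example `output.logits[0].tolist()`. Note that softmax probabilities are not calibrated by default, so they are best treated as a relative score.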

Evaluation

Evaluation results on the test set are shown in the table below. The model achieves 89% accuracy on the test set.

Model Performance Metrics

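| Class | Precision | Recall | F1-score | Accuracy |
|-------|-----------|--------|----------|----------|
| Class 0 | 0.92 | 0.94 | 0.93 | 0.89 |
| Class 1 | 0.73 | 0.67 | 0.70 | |
| Macro | 0.83 | 0.80 | 0.81 | |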
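
As a sanity check on how these numbers relate, the sketch below recomputes the F1 scores from the reported precision and recall, and the overall accuracy from the per-class recall. It assumes that the per-class support equals the test counts from the Dataset Distribution section (2812 and 716). Because the inputs are already rounded to two decimals, the results only match the table up to rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the table; support is assumed to be the
# test-set class counts from the Dataset Distribution section.
per_class = {
    0: {"precision": 0.92, "recall": 0.94, "support": 2812},
    1: {"precision": 0.73, "recall": 0.67, "support": 716},
}

f1 = {c: f1_score(m["precision"], m["recall"]) for c, m in per_class.items()}
macro_f1 = sum(f1.values()) / len(f1)

# Recall times support gives the number of correctly classified tweets per class
correct = sum(m["recall"] * m["support"] for m in per_class.values())
accuracy = correct / sum(m["support"] for m in per_class.values())

print(f"F1 class 0: {f1[0]:.2f}, F1 class 1: {f1[1]:.2f}")
print(f"Macro F1: {macro_f1:.2f}, accuracy: {accuracy:.2f}")
```

The gap between the class 0 and class 1 F1 scores reflects the class imbalance: the offensive class is both rarer and harder, which the single accuracy figure does not show.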