Thaweewat's picture
Update README.md
15dd611
|
raw
history blame
3.95 kB
metadata
license: apache-2.0
datasets:
  - KDAI-NLP/traffy-fondue-type-only
language:
  - th
metrics:
  - f1
tags:
  - roberta
widget:
  - text: แยกอโศกฝนตกน้ำท่วมหนักมากครับ ต้นไม้ก็ล้มขวางทางรถติดชห

Traffy Complaint Classification

This multi-label model is trained to automatically classify various types of traffic complaints expressed in Thai text, with the goal of minimizing the need for manual classification. Please note that the example inference provided by Hugging Face (Right-side UI) does not yet support multi-label classification. If you require multi-label classification, please use the code provided below.

Model Details

Model Name: KDAI-NLP/wangchanberta-traffy-multi
Tokenizer: airesearch/wangchanberta-base-att-spm-uncased
License: Apache License 2.0

How to Use


!pip install sentencepiece

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.nn.functional import sigmoid
import json

# Target lists
target_list = [
    'ความสะอาด', 'สายไฟ', 'สะพาน', 'ถนน', 'น้ำท่วม',
    'ร้องเรียน', 'ท่อระบายน้ำ', 'ความปลอดภัย', 'คลอง', 'แสงสว่าง',
    'ทางเท้า', 'จราจร', 'กีดขวาง', 'การเดินทาง', 'เสียงรบกวน',
    'ต้นไม้', 'สัตว์จรจัด', 'เสนอแนะ', 'คนจรจัด', 'ห้องน้ำ',
    'ป้ายจราจร', 'สอบถาม', 'ป้าย', 'PM2.5'
]

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
model = AutoModelForSequenceClassification.from_pretrained("KDAI-NLP/wangchanberta-traffy-multi")

# Example text to classify
text = "ช่วยด้วยครับถนนน้ำท่วมอีกแล้ว ต้นไม้ก็ล้มขวางทาง กลับบ้านไม่ได้"

# Encode the text using the tokenizer
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get model predictions (logits)
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid function to convert logits to probabilities
probabilities = sigmoid(logits)

# Map probabilities to corresponding labels
probabilities = probabilities.squeeze().tolist()
label_probabilities = zip(target_list, probabilities)

# Print labels with probabilities
for label, probability in label_probabilities:
    print(f"{label}: {probability:.4f}")

# Or JSON
# Create a dictionary for labels and probabilities
results_dict = {label: probability for label, probability in label_probabilities}

# Convert dictionary to JSON string
results_json = json.dumps(results_dict, ensure_ascii=False, indent=4)

# Print the JSON string
print(results_json)

Training Details

The model was trained on traffic complaint data API (included stopwords) using the airesearch/wangchanberta-base-att-spm-uncased base model. This is a multi-label classification task with a total of 24 classes.

Training Scores

Model Stopword Epoch Training Loss Validation Loss F1 Accuracy
wangchanberta-base-att-spm-uncased Included 0 0.0322 0.034822 0.7015 0.7569
wangchanberta-base-att-spm-uncased Included 2 0.0207 0.026364 0.8405 0.7821
wangchanberta-base-att-spm-uncased Included 4 0.0165 0.025142 0.8458 0.7934

Feel free to customize the README further if needed.