Sentiment Analysis for Darija (Arabic Dialect)
This repository hosts a Sentiment Analysis model for Darija (Moroccan Arabic dialect), built using BERT. The model is fine-tuned to classify text into two categories: positive and negative sentiment. It is designed to facilitate sentiment analysis in applications involving Darija text data, such as social media analysis, customer feedback, or market research.
Model Details
- Base Model: SI2M-Lab/DarijaBERT
- Task: Sentiment Classification (Binary)
- Architecture: BERT with a custom classification head and dropout regularization (0.3 dropout rate).
- Fine-Tuning Data: Dataset of labeled Darija text samples (positive and negative).
- Max Sequence Length: 128 tokens
How to Use
Load the Model and Tokenizer
To use this model for sentiment analysis, you can load it using the Transformers library:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")
model = AutoModelForSequenceClassification.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")
# Example text
test_text = "هذا المنتج رائع جدا"
# Tokenize the text
inputs = tokenizer(test_text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Get model predictions
outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax().item()
print(f"Predicted class: {predicted_class}")
Output Classes
- 0: Negative
- 1: Positive
Fine-Tuning Process
The model was fine-tuned using the following:
- Dataset: A dataset of Darija text labeled for sentiment.
- Loss Function: Cross-entropy loss for binary classification.
- Optimizer: AdamW with weight decay (0.01).
- Learning Rate: 5e-5 with linear warmup.
- Batch Size: 16 for training, 64 for evaluation.
- Early Stopping: Training stops if validation loss does not improve after 1 epoch.
Evaluation Metrics
The model's performance was evaluated using the following metrics:
- Accuracy: 80%
- Precision: 0.81%
- Recall: 0.79%
- F1-Score: 0.80%
Publishing on Hugging Face
The model and tokenizer were saved and uploaded to Hugging Face using the huggingface_hub
library. To reproduce or fine-tune this model, follow these steps:
- Save the model and tokenizer:
model.save_pretrained("darija-bert-model") tokenizer.save_pretrained("darija-bert-model")
Future Work
- Expand the dataset to include more labeled examples from diverse sources.
- Fine-tune the model for multi-class sentiment analysis (e.g., neutral, positive, negative).
- Explore the use of data augmentation techniques for better generalization.
Citation
If you use this model, please cite it as:
@misc{benhamdanenawfal2025darijabert,
author = {Benhamdane Nawfal},
title = {Sentiment Analysis for Darija (Arabic Dialect)},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/BenhamdaneNawfal/sentiment-analysis-darija}
}
Contact
For any questions or issues, feel free to contact me at: [[email protected]].
- Downloads last month
- 45