---
language: ar
tags:
  - sentiment-analysis
  - darija
  - arabic
license: apache-2.0
datasets:
  - custom
metrics:
  - accuracy
  - precision
  - recall
  - f1
---

# Sentiment Analysis for Darija (Arabic Dialect)

This repository hosts a **sentiment analysis model for Darija** (Moroccan Arabic dialect), built on **BERT**. The model is fine-tuned to classify text into two categories, **positive** and **negative**, and is intended for applications involving Darija text, such as social media analysis, customer feedback, or market research.

---

## Model Details

- **Base Model**: [SI2M-Lab/DarijaBERT](https://huggingface.co/SI2M-Lab/DarijaBERT)
- **Task**: Sentiment classification (binary)
- **Architecture**: BERT with a custom classification head and dropout regularization (dropout rate 0.3)
- **Fine-Tuning Data**: Labeled Darija text samples (positive and negative)
- **Max Sequence Length**: 128 tokens

---

## How to Use

### Load the Model and Tokenizer

To use this model for sentiment analysis, load it with the Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")
model = AutoModelForSequenceClassification.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")

# Example text ("This product is really great")
test_text = "هذا المنتج رائع جدا"

# Tokenize the text
inputs = tokenizer(test_text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get model predictions
outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax().item()

print(f"Predicted class: {predicted_class}")
```

### Output Classes

- **0**: Negative
- **1**: Positive

---

## Fine-Tuning Process

The model was fine-tuned with the following setup:

- **Dataset**: Darija text labeled for sentiment.
- **Loss Function**: Cross-entropy loss for binary classification.
- **Optimizer**: AdamW with weight decay (0.01).
- **Learning Rate**: 5e-5 with linear warmup.
- **Batch Size**: 16 for training, 64 for evaluation.
- **Early Stopping**: Training stops if the validation loss does not improve for 1 epoch.

An illustrative training sketch based on these settings is included in the appendix at the end of this card.

---

## Evaluation Metrics

The model's performance was evaluated with the following results:

- **Accuracy**: 0.80
- **Precision**: 0.81
- **Recall**: 0.79
- **F1-Score**: 0.80

---

## Publishing on Hugging Face

The model and tokenizer were saved and uploaded to Hugging Face with the `huggingface_hub` library. To save a local copy of a fine-tuned model and tokenizer:

```python
model.save_pretrained("darija-bert-model")
tokenizer.save_pretrained("darija-bert-model")
```

An illustrative upload sketch is included in the appendix at the end of this card.

---

## Future Work

- Expand the dataset with more labeled examples from diverse sources.
- Fine-tune the model for multi-class sentiment analysis (e.g., negative, neutral, positive).
- Explore data augmentation techniques for better generalization.

---

## Citation

If you use this model, please cite it as:

```
@misc{benhamdanenawfal2025darijabert,
  author    = {Benhamdane Nawfal},
  title     = {Sentiment Analysis for Darija (Arabic Dialect)},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/BenhamdaneNawfal/sentiment-analysis-darija}
}
```

---

## Contact

For questions or issues, feel free to contact me at [n.benhamdane2003@gmail.com](mailto:n.benhamdane2003@gmail.com).
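
---

## Appendix: Illustrative Code Sketches

### Fine-Tuning Sketch

The following is a minimal sketch of the setup described in the "Fine-Tuning Process" section, written with the Hugging Face `Trainer` API. It is not the original training script: the dataset files (`train.csv`, `valid.csv`), the column names (`text`, `label`), the number of epochs, and the warmup ratio are assumptions.

```python
# Illustrative fine-tuning sketch; hyperparameters mirror the model card,
# while the dataset files, column names, epoch count, and warmup ratio are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

base_model = "SI2M-Lab/DarijaBERT"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "valid.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="darija-bert-model",
    learning_rate=5e-5,                # linear decay is the default scheduler
    warmup_ratio=0.1,                  # assumed warmup fraction
    weight_decay=0.01,                 # AdamW weight decay
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=5,                # assumed; early stopping usually ends training sooner
    eval_strategy="epoch",             # use evaluation_strategy on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # stop after 1 epoch without improvement
)

trainer.train()
```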
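
### Uploading to the Hub

A minimal sketch for uploading a locally saved model, as mentioned in the "Publishing on Hugging Face" section. It assumes the `darija-bert-model` directory produced by `save_pretrained` above and a Hugging Face access token with write permission; the repository name shown is the one this card is hosted under.

```python
# Illustrative upload sketch; requires an access token with write access.
from huggingface_hub import login
from transformers import AutoModelForSequenceClassification, AutoTokenizer

login()  # prompts for a token, or pass one directly: login(token="hf_...")

# Load the locally saved model and tokenizer (see "Publishing on Hugging Face") ...
model = AutoModelForSequenceClassification.from_pretrained("darija-bert-model")
tokenizer = AutoTokenizer.from_pretrained("darija-bert-model")

# ... and push them to a Hub repository.
model.push_to_hub("BenhamdaneNawfal/sentiment-analysis-darija")
tokenizer.push_to_hub("BenhamdaneNawfal/sentiment-analysis-darija")
```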