BenhamdaneNawfal
committed on
Upload README.md with huggingface_hub
README.md
CHANGED
@@ -1,8 +1,122 @@
# Sentiment Analysis for Darija (Arabic Dialect)

This repository hosts a **sentiment analysis model for Darija** (the Moroccan Arabic dialect), built on **BERT**. The model is fine-tuned to classify text as **positive** or **negative**. It is intended for applications that work with Darija text, such as social media analysis, customer feedback, and market research.

---

## Model Details

- **Base Model**: [SI2M-Lab/DarijaBERT](https://huggingface.co/SI2M-Lab/DarijaBERT)
- **Task**: Sentiment Classification (Binary)
- **Architecture**: BERT with a custom classification head and dropout regularization (0.3 dropout rate)
- **Fine-Tuning Data**: Dataset of labeled Darija text samples (positive and negative)
- **Max Sequence Length**: 128 tokens

---

## How to Use

### Load the Model and Tokenizer

To run sentiment analysis with this model, load it with the Transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")
model = AutoModelForSequenceClassification.from_pretrained("BenhamdaneNawfal/sentiment-analysis-darija")

# Example text ("This product is really great")
test_text = "هذا المنتج رائع جدا"

# Tokenize the text, truncating to the model's 128-token limit
inputs = tokenizer(test_text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get model predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Take the argmax over the class dimension (single input, so .item() is safe)
predicted_class = logits.argmax(dim=-1).item()

print(f"Predicted class: {predicted_class}")
```

### Output Classes

- **0**: Negative
- **1**: Positive
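
The class index can also be turned into a readable label with a confidence score by applying a softmax to the two logits. The following is a minimal sketch in plain Python; the names `ID2LABEL`, `softmax`, and `describe_prediction` are illustrative helpers rather than part of this repository, and the logits shown are made-up values.

```python
import math

# Label mapping taken from the output classes listed above.
ID2LABEL = {0: "negative", 1: "positive"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits):
    """Return (label, confidence) for one example's logits."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[class_id], probs[class_id]


# Made-up logits for illustration only.
label, confidence = describe_prediction([-1.2, 2.3])
print(f"{label} ({confidence:.2f})")
```

With the loading snippet from this README, the input would likely be `outputs.logits[0].tolist()` instead of the hard-coded list.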

---

## Fine-Tuning Process

The model was fine-tuned with the following setup:

- **Dataset**: A dataset of Darija text labeled for sentiment.
- **Loss Function**: Cross-entropy loss for binary classification.
- **Optimizer**: AdamW with weight decay (0.01).
- **Learning Rate**: 5e-5 with linear warmup.
- **Batch Size**: 16 for training, 64 for evaluation.
- **Early Stopping**: Training stops when validation loss fails to improve for 1 epoch.
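
To make the learning-rate schedule and the early-stopping rule concrete, here is a small plain-Python sketch. It is not the training code used for this model: the function and class names are hypothetical, the step counts and validation losses are made up, and the linear decay after warmup is an assumption, since the list above only states linear warmup.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, peak_lr=5e-5):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0.

    The decay phase is an assumption; only the warmup is documented.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=1):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical run: 1000 total steps, 100 of them warmup.
print(linear_warmup_lr(50, 1000, 100))  # halfway through warmup

stopper = EarlyStopping(patience=1)
for epoch, loss in enumerate([0.62, 0.48, 0.51]):  # made-up validation losses
    if stopper.should_stop(loss):
        print(f"stopping after epoch {epoch}")
        break
```

With a patience of 1, the first epoch whose validation loss is not a new best ends training, which matches the rule described in the list.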

---

## Evaluation Metrics

The model was evaluated with the following metrics:

- **Accuracy**: 85%
- **Precision**: 87%
- **Recall**: 83%
- **F1-Score**: 85%
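
For reference, these four metrics can be derived from the counts of a binary confusion matrix. The sketch below uses hypothetical helper names and toy counts that are not this model's actual results; the last line only checks that the reported precision and recall are consistent with the reported F1-score.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Toy counts for illustration only (not taken from this model's evaluation).
print(binary_metrics(tp=8, fp=2, fn=2, tn=8))

# Consistency check on the reported numbers: P = 0.87 and R = 0.83 give F1 of about 0.85.
print(round(f1_score(0.87, 0.83), 2))
```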

---

## Publishing on Hugging Face

The model and tokenizer were saved and uploaded to Hugging Face using the `huggingface_hub` library. To reproduce the upload, follow these steps:

1. Save the model and tokenizer:
   ```python
   # Write the weights, config, and tokenizer files to a local folder
   model.save_pretrained("darija-bert-model")
   tokenizer.save_pretrained("darija-bert-model")
   ```

2. Upload the model to Hugging Face:
   ```python
   from huggingface_hub import upload_folder

   # Requires prior authentication with a token that has write access
   upload_folder(
       folder_path="darija-bert-model",
       repo_id="BenhamdaneNawfal/sentiment-analysis-darija",
       repo_type="model",
   )
   ```

---

## Future Work

- Expand the dataset to include more labeled examples from diverse sources.
- Fine-tune the model for multi-class sentiment analysis (e.g., neutral, positive, negative).
- Explore data augmentation techniques for better generalization.

---

## Citation

If you use this model, please cite it as:

```
@misc{benhamdanenawfal2025darijabert,
  author = {Benhamdane Nawfal},
  title = {Sentiment Analysis for Darija (Arabic Dialect)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/BenhamdaneNawfal/sentiment-analysis-darija}
}
```

---

## Contact

For any questions or issues, feel free to contact me at: [[email protected]].