language: ar
tags:
- translation
- nllb
- fine-tuning
- darija
- moroccan
- transformers
datasets:
- json
library_name: transformers
model_name: tachicart/nllb-ft-darija
NLLB Fine-tuned for Darija to Modern Standard Arabic Translation
This model is a fine-tuned version of facebook/nllb-200-distilled-600M for translating Moroccan Darija (ary) to Modern Standard Arabic (ar). The model was fine-tuned on a custom dataset using the Hugging Face transformers library.
The model was developed by Tachicart Ridouane and Bouzoubaa Karim.
[email protected]
Model Details
- Base Model: facebook/nllb-200-distilled-600M
- Fine-tuning Library: Hugging Face transformers
- Languages Supported: Moroccan Darija (ary), Modern Standard Arabic (ar)
- Training Dataset: Custom dataset of Moroccan Darija and Modern Standard Arabic pairs in JSON format.
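The base NLLB-200 checkpoint identifies languages with FLORES-200 style tags: `ary_Arab` for Moroccan Arabic and `arb_Arab` for Modern Standard Arabic. The sketch below shows how those tags could be passed explicitly at inference time. It is an illustration only: this card does not say which tags were used during fine-tuning, and `translate_with_lang_codes` is a hypothetical helper name, so compare its output against a plain `generate` call before relying on it.

```python
# Language tags from the base NLLB-200 vocabulary.
# Assumption: the fine-tuned checkpoint keeps the same tags.
SRC_LANG = "ary_Arab"  # Moroccan Arabic (Darija)
TGT_LANG = "arb_Arab"  # Modern Standard Arabic


def translate_with_lang_codes(text, tokenizer, model,
                              src_lang=SRC_LANG, tgt_lang=TGT_LANG):
    """Translate one sentence with explicit source and target tags.

    `tokenizer` and `model` are expected to be already-loaded
    transformers objects (hypothetical helper, not part of the card).
    """
    # Tell the tokenizer which source-language tag to add to the input.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target-language tag.
    target_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    outputs = model.generate(**inputs, forced_bos_token_id=target_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

The helper takes the tokenizer and model as arguments, so it can be used with objects loaded through `AutoTokenizer.from_pretrained` and `AutoModelForSeq2SeqLM.from_pretrained`.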
Performance
The model was evaluated on a validation set to check translation quality. It handles colloquial Moroccan Arabic well, and further fine-tuning on additional data is expected to improve its performance.
Limitations
- Dataset Size: The custom dataset consists of 21,000 samples, which may limit coverage of diverse expressions and rare terms.
- Colloquial Variations: Moroccan Arabic has many dialectal variations, which might not all be covered equally.
How to Use
You can use the model with the transformers library as follows:
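from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("tachicart/nllb-ft-darija")
model = AutoModelForSeq2SeqLM.from_pretrained("tachicart/nllb-ft-darija")

# Example translation: tokenize a Darija sentence into PyTorch tensors
inputs = tokenizer("كيفاش نقدر نربح بزاف ديال الفلوس بالزربة ", return_tensors="pt")
# Generate the output token IDs
outputs = model.generate(**inputs)
# Decode the first generated sequence, dropping special tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))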
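To translate many sentences, the single-sentence call above can be wrapped in a small batching loop. The following is a minimal sketch, assuming the model runs on CPU and that a cap of 128 new tokens per sentence is enough. `chunked`, `translate_batch`, and the default values are illustrative choices, not part of the model's published API.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(texts, tokenizer, model, batch_size=8, max_new_tokens=128):
    """Translate a list of Darija sentences, one batch at a time.

    `tokenizer` and `model` are the objects loaded with from_pretrained.
    """
    translations = []
    for batch in chunked(list(texts), batch_size):
        # Pad to the longest sentence in the batch; truncate overlong inputs.
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Decode every sequence in the batch, dropping special tokens.
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True)
        )
    return translations
```

With the `tokenizer` and `model` loaded as shown earlier, `translate_batch(sentences, tokenizer, model)` returns one translated string per input sentence, in the same order. A smaller `batch_size` reduces memory use at the cost of more `generate` calls.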