metadata
language: ar
tags:
  - translation
  - nllb
  - fine-tuning
  - darija
  - moroccan
  - transformers
datasets:
  - json
library_name: transformers
model_name: tachicart/nllb-ft-darija

NLLB Fine-tuned for Darija to Modern Standard Arabic Translation

This model is a fine-tuned version of facebook/nllb-200-distilled-600M for translating Moroccan Darija (ary) into Modern Standard Arabic (ar). It was fine-tuned on a custom dataset using the Hugging Face transformers library. Developed by Tachicart Ridouane and Bouzoubaa Karim.

Model Details

  • Base Model: facebook/nllb-200-distilled-600M
  • Fine-tuning Library: Hugging Face transformers
  • Languages Supported: Moroccan Darija (ary), Modern Standard Arabic (ar)
  • Training Dataset: Custom dataset of Moroccan Darija and Modern Standard Arabic pairs in JSON format.
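A dataset of sentence pairs stored as JSON can be read and split with the standard library alone. The sketch below is illustrative only: the field names `darija` and `msa`, the list-of-records layout, and the 10% validation fraction are assumptions, since the actual schema and split of the training data are not documented here.

```python
import json
import random


def load_pairs(path):
    """Read a JSON file holding a list of {"darija": ..., "msa": ...} records.

    The field names are hypothetical; adapt them to the real file.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    pairs = []
    for rec in records:
        src = (rec.get("darija") or "").strip()
        tgt = (rec.get("msa") or "").strip()
        if src and tgt:  # skip records where either side is empty
            pairs.append((src, tgt))
    return pairs


def split_pairs(pairs, val_fraction=0.1, seed=0):
    """Shuffle deterministically, then return (train, validation)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

Fixing the seed keeps the validation set stable across runs, which makes scores from different training runs comparable.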

Performance

The model was evaluated on a validation set to check translation quality. It handles colloquial Moroccan Arabic well, though further fine-tuning and additional training data should improve its performance.
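The metric used for this evaluation is not stated. As one possible way to score model output against reference translations, the sketch below computes a simplified character n-gram F-score. It is loosely modelled on chrF but uses a single n-gram order, so its values are not comparable to official chrF scores; the function names and the defaults `n=3` and `beta=2.0` are illustrative assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after collapsing runs of whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_fscore(hypothesis, reference, n=3, beta=2.0):
    """F-beta over character n-grams of one order n (simplified, chrF-like)."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_score(hypotheses, references, n=3):
    """Average the sentence-level score over a validation set."""
    scores = [char_ngram_fscore(h, r, n) for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Character-level overlap tends to suit Arabic better than word-level matching because of its rich morphology. For numbers intended for publication, an established implementation such as sacreBLEU would be a safer choice than this sketch.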

Limitations

  • Dataset Size: The custom dataset consists of 21,000 samples, which may limit coverage of diverse expressions and rare terms.
  • Colloquial Variations: Moroccan Arabic has many dialectal variations, which might not all be covered equally.

How to Use

You can use the model with the transformers library as follows:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("tachicart/nllb-ft-darija")
model = AutoModelForSeq2SeqLM.from_pretrained("tachicart/nllb-ft-darija")

# Example translation: tokenize a Darija sentence into PyTorch tensors
inputs = tokenizer("كيفاش نقدر نربح بزاف ديال الفلوس بالزربة", return_tensors="pt")

# Generate the translation and decode it back to text
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
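For more than a handful of sentences, translating in batches is usually more practical than calling `generate` once per sentence. The sketch below shows one way to do that. The helper names `load_model` and `translate_batch`, and the values `batch_size=8` and `max_new_tokens=128`, are illustrative assumptions rather than settings documented for this model.

```python
def load_model(name="tachicart/nllb-ft-darija"):
    """Load tokenizer and model; transformers is imported lazily."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    return tokenizer, model


def translate_batch(texts, tokenizer, model, batch_size=8, max_new_tokens=128):
    """Translate a list of Darija sentences, preserving input order."""
    translations = []
    for start in range(0, len(texts), batch_size):
        # Strip stray whitespace before tokenizing
        batch = [t.strip() for t in texts[start:start + batch_size]]
        # Pad to the longest sentence in the batch; truncate overlong inputs
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

Calling `tokenizer, model = load_model()` and then `translate_batch(sentences, tokenizer, model)` returns one translated string per input sentence. A smaller `batch_size` reduces memory use at the cost of throughput.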