---
library_name: transformers
tags:
- darija
- moroccan_darija
- translation
- seamless
- text-generation-inference
- Machine translation
- MA
- NLP
datasets:
- AnasAber/DoDA_sentences_darija_english
- HANTIFARAH/cleaned_subtitles_all_videos2
language:
- en
- ar
base_model:
- facebook/seamless-m4t-v2-large
pipeline_tag: text2text-generation
---
# Seamless Enhanced Darija-English Translation Model
## Model Details
- **Model Name**: seamless-darija-eng
- **Base Model**: facebook/seamless-m4t-v2-large
- **Model Type**: Fine-tuned translation model
- **Languages**: Moroccan Arabic (Darija) ↔ English
- **Developer**: Anas ABERCHIH
## Model Description
This model is a fine-tuned version of Facebook's SeamlessM4T v2 Large model (`facebook/seamless-m4t-v2-large`), optimized for translation between Moroccan Arabic (Darija) and English.
It builds on the base Seamless model and is tailored to the nuances of Darija, which makes it particularly effective for Moroccan Arabic to English translation and vice versa.
### Training Data
The model was trained on two datasets.

First, on a dataset of 40,000 sentence pairs:
- Training set: 32,780 pairs
- Validation set: 5,785 pairs
- Test set: 6,806 pairs

Second, on a dataset of 82,332 sentence pairs:
- Training set: 59,484 pairs
- Validation set: 10,498 pairs
- Test set: 12,350 pairs
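
As a quick sanity check on the second dataset, the short sketch below recomputes the total and the share of each split from the counts listed above. The variable names are illustrative only; the result is a split of roughly 72% / 13% / 15%.

```python
# Split sizes for the second dataset, copied from the list above.
splits = {"train": 59_484, "validation": 10_498, "test": 12_350}

# Total number of sentence pairs across all three splits.
total = sum(splits.values())

# Share of each split as a fraction of the full dataset.
shares = {name: count / total for name, count in splits.items()}

print(f"total: {total:,} pairs")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```
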
Each entry in the dataset contains:
- Darija text (Arabic script)
- English translation
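
Because the Darija side of the training data is written in Arabic script, it can help to check the script of an input before sending it to the model. The following is a minimal sketch, not part of the model or its training pipeline: the helper names and the `0.8` threshold are assumptions chosen for illustration, and the check only looks at the script, so it cannot tell Darija apart from other Arabic varieties.

```python
def arabic_script_ratio(text: str) -> float:
    """Return the fraction of letters that fall in the basic Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # U+0600..U+06FF is the basic Arabic block; extended blocks are ignored here.
    arabic = [ch for ch in letters if "\u0600" <= ch <= "\u06FF"]
    return len(arabic) / len(letters)


def looks_like_arabic_script(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: True when most letters in `text` are Arabic script."""
    return arabic_script_ratio(text) >= threshold


print(looks_like_arabic_script("كيفاش نقدر نتعلم الإنجليزية بسرعة؟"))
print(looks_like_arabic_script("How can I learn English quickly?"))
```

Darija written in Latin characters would fail this check, so an application that expects such input would need a different strategy.
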
### Training Procedure
- **Training Duration**: Approximately 9 hours
- **Number of Epochs**: 5
## Intended Use
This model is intended to be used directly for translating text from Moroccan Arabic (Darija) to English.
It can also be fine-tuned further and deployed in applications that require translation services.
Compared with the base model, this version is more capable at Darija-to-English translation.
### Direct Use
This model is designed for:
1. Translating Moroccan Arabic (Darija) text to English
2. Translating English text to Moroccan Arabic (Darija)
It can be particularly useful for:
- Localization of content for Moroccan audiences
- Cross-cultural communication between Darija speakers and English speakers
- Assisting in the understanding of Moroccan social media content, informal writing, or dialect-heavy texts
### Downstream Use
The model can be integrated into various applications, such as:
- Machine translation systems focusing on Moroccan content
- Chatbots or virtual assistants for Moroccan users
- Content analysis tools for Moroccan social media or web content
- Educational tools for language learners (both Darija and English)
## Limitations and Bias
The model's performance may be influenced by biases present in the training data, such as the representation of certain dialectal variations or cultural nuances.
Additionally, the model's accuracy may vary depending on the complexity of the text being translated and the presence of out-of-vocabulary words.
### Out-of-Scope Use
This model should not be used for:
1. Legal or medical translations where certified human translators are required
2. Translating other Arabic dialects or Modern Standard Arabic (MSA) to English (or vice versa)
3. Understanding or generating spoken language directly (it's designed for text)
### Recommendations
- Always review the output for critical applications, especially when dealing with nuanced or context-dependent content
- Be aware that the model may not capture all regional variations within Moroccan Arabic
- For formal or professional content, consider post-editing by a human translator
## How to Get Started
To use this model:
1. Install the Transformers library:
```
pip install transformers
```
2. Load the model and tokenizer:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "AnasAber/seamless-darija-eng"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
3. Translate text:
```python
def translate(text, src_lang, tgt_lang):
    # Tell the tokenizer which language the input text is written in
    inputs = tokenizer(text, src_lang=src_lang, return_tensors="pt")
    # SeamlessM4T models select the output language through `tgt_lang`
    translated = model.generate(**inputs, tgt_lang=tgt_lang)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)[0]

# Darija to English
darija_text = "كيفاش نقدر نتعلم الإنجليزية بسرعة؟"
english_translation = translate(darija_text, src_lang="ary", tgt_lang="eng")
print(english_translation)
# English to Darija
english_text = "How can I learn English quickly?"
darija_translation = translate(english_text, src_lang="eng", tgt_lang="ary")
print(darija_translation)
```
Remember to handle exceptions and implement proper error checking in production environments.
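
One way to approach this is to wrap the translation call in a small guard that validates the input and catches failures. The sketch below is only an illustration: `safe_translate` is a hypothetical helper, and it takes the translation function as an argument (for example, the `translate` function defined earlier) so that it stays independent of how the model is loaded.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_translate(
    translate_fn: Callable[[str, str, str], str],
    text: str,
    src_lang: str,
    tgt_lang: str,
) -> Optional[str]:
    """Validate the input, call `translate_fn`, and return None on failure."""
    # Reject empty or non-string input before touching the model.
    if not isinstance(text, str) or not text.strip():
        logger.warning("Skipping empty or non-string input")
        return None
    try:
        result = translate_fn(text, src_lang, tgt_lang)
    except Exception:
        # Log the full traceback, but keep the calling application running.
        logger.exception("Translation failed for %s -> %s", src_lang, tgt_lang)
        return None
    # Treat a blank translation as a failure as well.
    return result.strip() or None
```

A caller would then use `safe_translate(translate, darija_text, "ary", "eng")` and handle a `None` result, for instance by falling back to the original text or queuing the request for retry.
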
## Ethical Considerations
- Respect privacy and data protection laws when using this model with user-generated content
- Be aware of potential biases in the training data that may affect translations
- Use the model responsibly and avoid applications that could lead to discrimination or harm
## Contact Information
For questions, citations, or feedback about this model, please contact Anas ABERCHIH on [LinkedIn](https://www.linkedin.com/in/anas-aberchih-%F0%9F%87%B5%F0%9F%87%B8-b6007121b/) or through the linked GitHub account.