AnasAber
/

seamless-darija-eng

@@ -1,199 +1,152 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
 ### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+tags:
+- darija
+- moroccan_darija
+- translation
+- seamless
+- text-generation-inference
+- Machine translation
+- MA
+- NLP
+datasets:
+- AnasAber/DoDA_sentences_darija_english
+- HANTIFARAH/cleaned_subtitles_all_videos2
+language:
+- en
+- ar
+base_model:
+- facebook/seamless-m4t-v2-large
+pipeline_tag: text2text-generation
 ---
+# Seamless Enhanced Darija-English Translation Model
 ## Model Details
+- **Model Name**: AnasAber/seamless-enhanced-darija-eng_v1.2
+- **Base Model**: facebook/seamless-m4t-v2-large
+- **Model Type**: Fine-tuned translation model
+- **Languages**: Moroccan Arabic (Darija) ↔ English
+- **Developer**: Anas ABERCHIH
+## Model Description
+This model is a fine-tuned version of Facebook's Seamless large m4t-v2 model, specifically optimized for translation between Moroccan Arabic (Darija) and English.
+It leverages the power of the base Seamless model while being tailored for the nuances of Darija, making it particularly effective for Moroccan Arabic to English translations and vice versa.
 ### Training Data
+The model was trained on two datasets.
+First on a dataset of 40,000 sentence pairs:
+Training set: 32,780 pairs
+Validation set: 5,785 pairs
+Test set: 6,806 pairs
+And second, on a dataset of 82,332 sentence pairs:
+- Training set: 59,484 pairs
+- Validation set: 10,498 pairs
+- Test set: 12,350 pairs
+Each entry in the dataset contains:
+- Darija text (Arabic script)
+- English translation
+### Training Procedure
+- **Training Duration**: Approximately 9 hours
+- **Number of Epochs**: 5
+## Intended Use
+This model is intended to be used directly for translating text from Moroccan Arabic (Darija) to English.
+It can be further fine tuned, and deployed in various applications requiring translation services.
+This version is more capable than the original model in Darija to English translation.
+### Direct Use
+This model is designed for:
+1. Translating Moroccan Arabic (Darija) text to English
+2. Translating English text to Moroccan Arabic (Darija)
+It can be particularly useful for:
+- Localization of content for Moroccan audiences
+- Cross-cultural communication between Darija speakers and English speakers
+- Assisting in the understanding of Moroccan social media content, informal writing, or dialect-heavy texts
+### Downstream Use
+The model can be integrated into various applications, such as:
+- Machine translation systems focusing on Moroccan content
+- Chatbots or virtual assistants for Moroccan users
+- Content analysis tools for Moroccan social media or web content
+- Educational tools for language learners (both Darija and English)
+## Limitations and Bias
+The model's performance may be influenced by biases present in the training data, such as the representation of certain dialectal variations or cultural nuances.
+Additionally, the model's accuracy may vary depending on the complexity of the text being translated and the presence of out-of-vocabulary words.
+### Out-of-Scope Use
+This model should not be used for:
+1. Legal or medical translations where certified human translators are required
+2. Translating other Arabic dialects or Modern Standard Arabic (MSA) to English (or vice versa)
+3. Understanding or generating spoken language directly (it's designed for text)
+### Recommendations
+- Always review the output for critical applications, especially when dealing with nuanced or context-dependent content
+- Be aware that the model may not capture all regional variations within Moroccan Arabic
+- For formal or professional content, consider post-editing by a human translator
+## How to Get Started
+To use this model:
+1. Install the Transformers library:
+   ```
+   pip install transformers
+   ```
+2. Load the model and tokenizer:
+   ```python
+   from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+   model_name = "AnasAber/seamless-enhanced-darija-eng_v1.2"
+   model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+   tokenizer = AutoTokenizer.from_pretrained(model_name)
+   ```
+3. Translate text:
+   ```python
+   def translate(text, src_lang, tgt_lang):
+       inputs = tokenizer(text, return_tensors="pt")
+       translated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang])
+       return tokenizer.batch_decode(translated, skip_special_tokens=True)[0]
+   # Darija to English
+   darija_text = "كيفاش نقدر نتعلم الإنجليزية بسرعة؟"
+   english_translation = translate(darija_text, src_lang="ary", tgt_lang="eng")
+   print(english_translation)
+   # English to Darija
+   english_text = "How can I learn English quickly?"
+   darija_translation = translate(english_text, src_lang="eng", tgt_lang="ary")
+   print(darija_translation)
+   ```
+Remember to handle exceptions and implement proper error checking in production environments.
+## Ethical Considerations
+- Respect privacy and data protection laws when using this model with user-generated content
+- Be aware of potential biases in the training data that may affect translations
+- Use the model responsibly and avoid applications that could lead to discrimination or harm
+## Contact Information
+For questions, citations, or feedback about this model, please contact Anas ABERCHIH at ![https://www.linkedin.com/in/anas-aberchih-%F0%9F%87%B5%F0%9F%87%B8-b6007121b/] or my linked github account.