AnasAber committed on
Commit
4a3f682
·
verified ·
1 Parent(s): 687422e

Updated the README file

Files changed (1)
  1. README.md +121 -168
README.md CHANGED
@@ -1,199 +1,152 @@
1
  ---
2
  library_name: transformers
3
- tags: []
4
  ---
5
-
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
-
11
 
12
  ## Model Details
13
 
14
- ### Model Description
15
-
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
-
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
-
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
-
36
- ## Uses
37
-
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
- ### Direct Use
41
-
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
 
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
-
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
-
64
- ### Recommendations
65
-
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
 
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
-
76
- ## Training Details
77
 
78
  ### Training Data
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
-
84
- ### Training Procedure
85
-
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
 
 
92
 
93
- #### Training Hyperparameters
 
 
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
 
97
- #### Speeds, Sizes, Times [optional]
 
 
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
 
100
 
101
- [More Information Needed]
102
-
103
- ## Evaluation
104
-
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
-
107
- ### Testing Data, Factors & Metrics
108
-
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
-
131
- #### Summary
132
-
133
-
134
-
135
- ## Model Examination [optional]
136
-
137
- <!-- Relevant interpretability work for the model goes here -->
138
-
139
- [More Information Needed]
140
-
141
- ## Environmental Impact
142
-
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
-
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
-
153
- ## Technical Specifications [optional]
154
-
155
- ### Model Architecture and Objective
156
-
157
- [More Information Needed]
158
-
159
- ### Compute Infrastructure
160
-
161
- [More Information Needed]
162
-
163
- #### Hardware
164
-
165
- [More Information Needed]
166
-
167
- #### Software
168
-
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
 
 
178
 
179
- **APA:**
180
 
181
- [More Information Needed]
 
 
182
 
183
- ## Glossary [optional]
 
 
 
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
 
 
 
 
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
 
192
 
193
- ## Model Card Authors [optional]
194
 
195
- [More Information Needed]
 
 
 
196
 
197
- ## Model Card Contact
198
 
199
- [More Information Needed]
1
  ---
2
  library_name: transformers
3
+ tags:
4
+ - darija
5
+ - moroccan_darija
6
+ - translation
7
+ - seamless
8
+ - text-generation-inference
9
+ - Machine translation
10
+ - MA
11
+ - NLP
12
+ datasets:
13
+ - AnasAber/DoDA_sentences_darija_english
14
+ - HANTIFARAH/cleaned_subtitles_all_videos2
15
+ language:
16
+ - en
17
+ - ar
18
+ base_model:
19
+ - facebook/seamless-m4t-v2-large
20
+ pipeline_tag: text2text-generation
21
  ---
22
+ # Seamless Enhanced Darija-English Translation Model
23
 
24
  ## Model Details
25
 
26
+ - **Model Name**: AnasAber/seamless-enhanced-darija-eng_v1.2
27
+ - **Base Model**: facebook/seamless-m4t-v2-large
28
+ - **Model Type**: Fine-tuned translation model
29
+ - **Languages**: Moroccan Arabic (Darija) ↔ English
30
+ - **Developer**: Anas ABERCHIH
31
 
32
+ ## Model Description
33
 
34
+ This model is a fine-tuned version of Facebook's SeamlessM4T v2 large model (`facebook/seamless-m4t-v2-large`), optimized for translation between Moroccan Arabic (Darija) and English.
35
+ It builds on the base model's multilingual translation ability and is tailored to the nuances of Darija, which makes it better suited to Darija-English translation in both directions.
36
 
37
  ### Training Data
38
 
39
+ The model was trained on two datasets.
40
 
41
+ First on a dataset of 40,000 sentence pairs:
42
 
43
+ - Training set: 32,780 pairs
44
+ - Validation set: 5,785 pairs
45
+ - Test set: 6,806 pairs
46
 
47
+ And second, on a dataset of 82,332 sentence pairs:
48
 
49
+ - Training set: 59,484 pairs
50
+ - Validation set: 10,498 pairs
51
+ - Test set: 12,350 pairs
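
As a quick arithmetic check on the split sizes listed above, the following minimal sketch totals each dataset and prints each split's share. The helper name `split_shares` and the `SPLITS` dictionary are illustrative and not part of the model card. With these figures the second dataset sums to the stated 82,332 pairs, while the listed splits of the first dataset sum to 45,371, somewhat more than the rounded 40,000 figure.

```python
# Split sizes exactly as reported in this model card.
SPLITS = {
    "first dataset": {"train": 32_780, "validation": 5_785, "test": 6_806},
    "second dataset": {"train": 59_484, "validation": 10_498, "test": 12_350},
}


def split_shares(counts):
    """Return (total, {split: share of total}) for one dataset."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


for dataset, counts in SPLITS.items():
    total, shares = split_shares(counts)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{dataset}: {total:,} pairs ({summary})")
```

In both datasets the test split comes out at about 15% of the total.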
52
 
53
+ Each entry in the dataset contains:
54
+ - Darija text (Arabic script)
55
+ - English translation
56
 
57
+ ### Training Procedure
58
 
59
+ - **Training Duration**: Approximately 9 hours
60
+ - **Number of Epochs**: 5
61
 
62
+ ## Intended Use
63
 
64
+ This model is intended to be used directly for translating text from Moroccan Arabic (Darija) to English.
65
+ It can also be fine-tuned further and deployed in applications that need translation services.
66
+ This version is more capable than the base model at Darija-to-English translation.
67
 
68
+ ### Direct Use
69
 
70
+ This model is designed for:
71
+ 1. Translating Moroccan Arabic (Darija) text to English
72
+ 2. Translating English text to Moroccan Arabic (Darija)
73
 
74
+ It can be particularly useful for:
75
+ - Localization of content for Moroccan audiences
76
+ - Cross-cultural communication between Darija speakers and English speakers
77
+ - Assisting in the understanding of Moroccan social media content, informal writing, or dialect-heavy texts
78
 
79
+ ### Downstream Use
80
 
81
+ The model can be integrated into various applications, such as:
82
+ - Machine translation systems focusing on Moroccan content
83
+ - Chatbots or virtual assistants for Moroccan users
84
+ - Content analysis tools for Moroccan social media or web content
85
+ - Educational tools for language learners (both Darija and English)
86
 
87
+ ## Limitations and Bias
88
 
89
+ The model's performance may be influenced by biases present in the training data, such as the representation of certain dialectal variations or cultural nuances.
90
+ Additionally, the model's accuracy may vary depending on the complexity of the text being translated and the presence of out-of-vocabulary words.
91
 
92
+ ### Out-of-Scope Use
93
 
94
+ This model should not be used for:
95
+ 1. Legal or medical translations where certified human translators are required
96
+ 2. Translating other Arabic dialects or Modern Standard Arabic (MSA) to English (or vice versa)
97
+ 3. Understanding or generating spoken language directly (it's designed for text)
98
 
99
+ ### Recommendations
100
 
101
+ - Always review the output for critical applications, especially when dealing with nuanced or context-dependent content
102
+ - Be aware that the model may not capture all regional variations within Moroccan Arabic
103
+ - For formal or professional content, consider post-editing by a human translator
104
+
105
+ ## How to Get Started
106
+
107
+ To use this model:
108
+
109
+ 1. Install the Transformers library:
110
+ ```
111
+ pip install transformers
112
+ ```
113
+
114
+ 2. Load the model and tokenizer:
115
+ ```python
116
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
117
+
118
+ model_name = "AnasAber/seamless-enhanced-darija-eng_v1.2"
119
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
120
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
121
+ ```
122
+
123
+ 3. Translate text:
124
+ ```python
125
+ def translate(text, src_lang, tgt_lang):
126
+     inputs = tokenizer(text, src_lang=src_lang, return_tensors="pt")  # tag the input with its source language
127
+     translated = model.generate(**inputs, tgt_lang=tgt_lang)  # SeamlessM4T selects the output language via tgt_lang
128
+     return tokenizer.batch_decode(translated, skip_special_tokens=True)[0]
129
+
130
+ # Darija to English
131
+ darija_text = "كيفاش نقدر نتعلم الإنجليزية بسرعة؟"
132
+ english_translation = translate(darija_text, src_lang="ary", tgt_lang="eng")
133
+ print(english_translation)
134
+
135
+ # English to Darija
136
+ english_text = "How can I learn English quickly?"
137
+ darija_translation = translate(english_text, src_lang="eng", tgt_lang="ary")
138
+ print(darija_translation)
139
+ ```
140
+
141
+ Remember to handle exceptions and implement proper error checking in production environments.
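
The reminder about error checking could look like the following minimal sketch. The names `safe_translate` and `SUPPORTED_LANGS` are hypothetical and not part of this model card or the Transformers API. The wrapper accepts any callable with the same `(text, src_lang, tgt_lang)` signature as the `translate` function above, so it can be tried without loading the model.

```python
from typing import Callable, Optional

# Hypothetical allow-list: the two language codes used in this card's examples.
SUPPORTED_LANGS = {"ary", "eng"}


def safe_translate(
    translate_fn: Callable[..., str],
    text: str,
    src_lang: str,
    tgt_lang: str,
) -> Optional[str]:
    """Validate the inputs, call translate_fn, and return None if it fails."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    for lang in (src_lang, tgt_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
    if src_lang == tgt_lang:
        return text  # nothing to translate
    try:
        return translate_fn(text, src_lang=src_lang, tgt_lang=tgt_lang)
    except (RuntimeError, KeyError) as exc:
        # Assumed failure modes, e.g. a runtime error during generation
        # or a missing language token; adjust to what you observe.
        print(f"translation failed: {exc}")
        return None
```

Invalid input raises `ValueError` so the caller can fix it, while a failure during translation is logged and reported as `None`; which exceptions to catch is a design choice that depends on the deployment.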
142
+
143
+ ## Ethical Considerations
144
+
145
+ - Respect privacy and data protection laws when using this model with user-generated content
146
+ - Be aware of potential biases in the training data that may affect translations
147
+ - Use the model responsibly and avoid applications that could lead to discrimination or harm
148
+
149
+
150
+ ## Contact Information
151
+
152
+ For questions, citations, or feedback about this model, please contact Anas ABERCHIH on [LinkedIn](https://www.linkedin.com/in/anas-aberchih-%F0%9F%87%B5%F0%9F%87%B8-b6007121b/) or through the linked GitHub account.