HamzaNaser committed on
Commit
679fcf5
·
verified ·
1 Parent(s): 5926441

Update README.md

  1. README.md +84 -36
README.md CHANGED
@@ -1,59 +1,107 @@
1
  ---
2
- base_model: HamzaNaser/Dialects-to-MSA-Transformer
3
  library_name: transformers
4
- license: mit
5
- tags:
6
- - generated_from_trainer
7
  model-index:
8
  - name: Dialects-to-MSA-Transformer
9
  results: []
10
  ---
11
 
12
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
13
- should probably proofread and complete it, then remove this comment. -->
14
 
15
- # Dialects-to-MSA-Transformer
16
 
17
- This model is a fine-tuned version of [HamzaNaser/Dialects-to-MSA-Transformer](https://huggingface.co/HamzaNaser/Dialects-to-MSA-Transformer) on an unknown dataset.
18
- It achieves the following results on the evaluation set:
19
- - Loss: 0.5466
20
- - Blue: 46.2134
21
 
22
- ## Model description
 
 
 
23
 
24
- More information needed
25
 
26
- ## Intended uses & limitations
 
27
 
28
- More information needed
29
 
30
- ## Training and evaluation data
 
31
 
32
- More information needed
33
 
34
- ## Training procedure
 
35
 
36
- ### Training hyperparameters
37
 
38
- The following hyperparameters were used during training:
39
- - learning_rate: 4.5e-05
40
- - train_batch_size: 48
41
- - eval_batch_size: 48
42
- - seed: 42
43
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
44
- - lr_scheduler_type: linear
45
- - num_epochs: 1
46
 
47
- ### Training results
48
 
49
- | Training Loss | Epoch | Step | Validation Loss | Blue |
50
- |:-------------:|:-----:|:----:|:---------------:|:-------:|
51
- | 0.6811 | 1.0 | 8695 | 0.5466 | 46.2134 |
 
52
 
 
 
 
 
 
 
53
 
54
- ### Framework versions
55
 
56
- - Transformers 4.44.2
57
- - Pytorch 2.4.0+cu121
58
- - Datasets 2.16.1
59
- - Tokenizers 0.19.1
 
 
1
  ---
 
2
  library_name: transformers
3
+ license: apache-2.0
4
+ base_model: facebook/m2m100_418M
 
5
  model-index:
6
  - name: Dialects-to-MSA-Transformer
7
  results: []
8
+ datasets:
9
+ - HamzaNaser/Dialects-To-MSA-800K
10
+ language:
11
+ - ar
12
+ metrics:
13
+ - bleu
14
+ pipeline_tag: text2text-generation
15
+ tags:
16
+ - Dialects Conversion
17
+ - Text Correction
18
+ - Punctuation
19
+ - Diacritization
20
  ---
21
 
 
 
22
 
23
+ # Dialects-to-MSA-Transformer overview
24
+
25
+ ??????? Check and update the dataset name in the card YAML file, and please check all other data in it ???????
26
+
27
+ This model is optimized to convert written text in various non-standard Arabic dialects into Classical Arabic. It was fine-tuned on 0.8M sentence pairs generated with the OpenAI API gpt-4o-mini text generation model. Besides converting dialects into Classical Arabic, the model can also be used for other NLP tasks such as text correction, diacritization, and sentence punctuation.
28
+
29
+
30
+
31
+ # Model
32
+ Dialects-to-MSA-Transformer was fine-tuned from m2m100_418M, which has ~400M parameters. A larger base model could improve the performance of the resulting model, but it would require more compute.
33
+
34
+
35
+ # Dataset
36
+ ??????? Update the name of the dataset ???????
37
+ The model was trained on the `DialectsGeneration-Long-Finalized` dataset, which consists of 0.8M randomly selected crawled Arabic tweet sentences with their corresponding Classical Arabic conversions.
38
+ The Arabic tweets were selected randomly from the Arabic-Tweets dataset (https://huggingface.co/datasets/pain/Arabic-Tweets), and the Classical Arabic sentences were generated with the gpt-4o-mini model by prompting it to convert the given sentences into corrected Classical Arabic text.
39
+
40
+
41
+ ## Dataset Limitations
42
+ The dataset used to train the model consists of random Arabic tweets that were not checked, as described by the dataset authors, so it is possible to find grammatically incorrect or semantically incomplete sentences. The text was also normalized in the Arabic-Tweets dataset, which in some cases might make it harder for the model to infer the meaning of a sentence. Even though the quality of the original tweets crawled from the internet was not perfect for our use case, the resulting trained model is relatively good, achieving a Bleu score of ??? on the testing data.
43
+
44
+
45
+ - The image below shows one of the issues the input dataset might have: missing punctuation can flip the meaning of a sentence completely. For example, if the word "No" appears at the beginning of a sentence, it negates the speech that follows. However, if "No" is followed by a comma, it negates the previous speech and affirms the upcoming sentence instead. The two readings contradict each other, and they differ by a single comma.
46
+
47
+ ??????? Normalized punc image ???????
48
+
49
+ - Another data limitation is that a sentence might be inaccurate or incomplete, making it harder for the GPT model to convert it to MSA or Classical Arabic. Many examples of inconsistent input text can be found in the dataset the model was trained on.
50
+
51
+
52
+ ## GPT Generated Data
53
+ The Classical Arabic sentences used to train the model were generated with a GPT model. We used gpt-4o-mini because it is cheaper and faster and produces fine-quality results. The target text might not be 100% accurate, since the quality of the dataset is not perfect and the mini version of GPT was used to reduce cost and time. We expect model performance to improve if the bigger GPT-4 version, or GPT-5 or later versions once they are released, were used instead.
54
+
55
+
56
+ ## Dialects Used to Train the Model
57
+ Based on an estimate from 200K random samples, most training samples are from the Gulf region. The dialect regions are an estimate over the input Arabic-Tweets dataset, classified by the DialectIdentifier provided by CAMeL Tools (https://camel-tools.readthedocs.io/en/latest/api/dialectid.html). We can therefore expect the model to work better when converting the dialects of the regions below into MSA. The figure below shows the approximate distribution of dialects by region that the model was trained on.
58
+
59
+
60
+ ??????? Regions Image ???????
61
+
62
+
63
+
64
 
 
 
 
 
65
 
66
+ # Other use cases for the model
67
+ - Text Correction: Some of the input text contains typos. Besides converting the dialect text into MSA, the GPT model also fixed these typos, making it possible to use the model for text correction.
68
+ - Punctuation: The input text is normalized and has no punctuation, and the GPT model added the needed punctuation during conversion, making the model capable of adding missing punctuation or fixing incorrect punctuation.
69
+ - Diacritization: As with punctuation, the input text has no diacritics. We noticed that the GPT output includes some of the most necessary diacritics, those that might affect the meaning of the sentence. This behavior of the GPT model can be adjusted through prompt engineering: halfway through generating the data, the GPT model was prompted to include diacritics while converting the text into MSA. Although not all of the produced texts contain these diacritics even after adjusting the prompt, it might be possible to use the model to add some of the most important diacritics to an input sentence.
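
The use cases above all go through the same text-to-text interface. A minimal sketch of how the model might be called is shown here. It assumes the checkpoint is published as `HamzaNaser/Dialects-to-MSA-Transformer`, that it loads through the `text2text-generation` pipeline named in the card metadata, and that a Transformers 4.x release providing that pipeline is installed. The helper names `batched` and `convert_to_msa` and the default values are illustrative, not part of the model's API.

```python
# Assumed repository id, taken from the model name used in this card.
MODEL_ID = "HamzaNaser/Dialects-to-MSA-Transformer"


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def convert_to_msa(sentences, batch_size=8, max_new_tokens=128):
    """Convert a list of dialect sentences with the text2text-generation pipeline.

    Hypothetical helper: batch size and generation length are illustrative defaults.
    """
    # Imported lazily so `batched` can be used without transformers installed.
    from transformers import pipeline

    converter = pipeline("text2text-generation", model=MODEL_ID)
    outputs = []
    for batch in batched(sentences, batch_size):
        # For a list of strings the pipeline returns one dict per input.
        results = converter(batch, max_new_tokens=max_new_tokens)
        outputs.extend(result["generated_text"] for result in results)
    return outputs
```

Calling `convert_to_msa(["..."])` downloads the checkpoint on first use. Whether the output also carries corrected spelling, punctuation, or diacritics depends on the input sentence, as described in the list above.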
70
 
 
71
 
72
+ # Training
73
+ This section discusses the training procedure we followed, as well as the metrics used and the final results.
74
 
 
75
 
76
+ ## Steps
77
+ Developing the model involved several steps. The first step was to train the model on an 800K portion of the data twice, once using LoRA and once using traditional full-parameter training, to see how much performance would be affected by using a PEFT technique when training the final model on a larger split of the data. We proceeded with full-parameter training, as the model's accuracy was affected by the reduction in the number of trainable parameters without much reduction in training time. We might consider experimenting with PEFT in a future project.
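
To give a sense of how strongly LoRA reduces the number of trainable parameters, the sketch below counts them for a single weight matrix. LoRA freezes a `d_out x d_in` matrix and trains two low-rank factors of shapes `d_out x r` and `r x d_in` instead. The hidden size of 1024 and the rank of 8 are assumptions chosen for illustration; the card does not state which rank or which layers were used in the experiment.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters LoRA adds for one matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out, d_in, rank):
    """Share of the matrix's parameters that LoRA actually trains."""
    return lora_params(d_out, d_in, rank) / full_params(d_out, d_in)


# Illustrative values only: hidden size and rank are assumptions, not the card's settings.
d_model, rank = 1024, 8
print(f"full: {full_params(d_model, d_model):,} parameters")
print(f"LoRA: {lora_params(d_model, d_model, rank):,} parameters")
print(f"fraction trained: {trainable_fraction(d_model, d_model, rank):.2%}")
```

Under these assumptions LoRA trains under 2% of each adapted matrix, which is consistent with the accuracy gap to full-parameter training described above.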
78
 
 
79
 
80
+ ## Bleu Score Metric
81
+ Dialects to MSA or Classical Arabic conversion is a sequence-to-sequence task close to machine translation, as we are trying to express exactly the same sentence in a different form or set of rules. We ended up choosing the Bleu score to evaluate our model's performance, as it measures how closely the word sequences in the predicted text match those in the correct MSA text, and it is a good and widely used metric for Seq2Seq MT.
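
As an illustration of what the metric measures, here is a simplified sentence-level Bleu sketch: clipped n-gram precision for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. It is not the evaluation script used for this model. It assumes whitespace tokenization and no smoothing, so a standard implementation such as sacreBLEU would give different numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(prediction, reference, max_n=4):
    """Simplified sentence-level Bleu in the range 0..100 (whitespace tokens, no smoothing)."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipping: a predicted n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if total == 0 or matches == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: predictions shorter than the reference are penalized.
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)


reference = "ذهبت إلى السوق اليوم صباحا"
exact = sentence_bleu(reference, reference)                    # identical text
partial = sentence_bleu("ذهبت إلى السوق اليوم مساء", reference)  # last word differs
```

An identical prediction gets the maximum score, while changing a single word lowers every n-gram precision that touches it, which is why the metric is sensitive to small wording differences in short sentences.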
82
 
 
83
 
84
+ ## Testing
85
+ To measure the model's performance correctly, the input and reference sequences used for evaluation need to be ~100% accurate. Since both the input and the generated text have some data quality issues, we need to inspect a portion of the data before using it as a test split.
86
+ Inspecting a large number of text pairs would be tedious, so we took a sample of 553 sentence pairs as a test split and inspected them manually. ~25% of the data needed modification or was discarded, and the resulting test set contains 477 inspected sentences. Even though the test split is small, it still provides enough indication of whether the model is improving when certain parameters are changed, or whether it is overfitting or underfitting the training data.
 
 
 
 
 
87
 
88
+ ## Results
89
 
90
+ | Dataset Size | GPU Device | Epochs | Training Time | Bleu Score |
91
+ |:-------------:|:----------:|:----------:|:---------------:|:------------:|
92
+ | 0.8M | A100 | 3 | 6Hrs | ??????? |
93
+ | 3.0M | A100 | 1 | ??????? | ??????? |
94
 
95
+ ## Costs and Resources
96
+ ??????? Update costs as per exact used after finishing the model ???????
97
+ ??????? Adjust GPU costs after finishing up training ???????
98
+ Two main computing resources were used to build the Dialects to MSA Transformer: the first is the GPT model used to generate the MSA sequences, and the second is the GPU used to train and adjust the parameters of the pretrained model.
99
+ - OpenAI API: Generating the data took around a week, with small batches fed into the API due to limited max token sizes and due to Arabic being tokenized at the char level in the GPT model. The total cost of the API was around $30.
100
+ - GPU: T4 and A100 GPUs provided by Google Colab. Total compute units used: ??????, which cost around $10.
101
 
 
102
 
103
+ # Possible Future Improvements
104
+ As discussed earlier, there are some limitations related to the data and the model used, and we expect to achieve better model performance by using better data or a better GPT model to generate the Classical Arabic text.
105
+ - Data: The data used to train the model consists of random Arabic tweets, and it may contain grammatically and syntactically incorrect sentences. We may use customized and more accurate data in the future, which is expected to increase the model's performance. Using a larger set of input texts with a variety of dialects is also being considered.
106
+ - GPT: We could use a larger GPT model, or any future released version, instead of the smaller "gpt-4o-mini", which is expected to improve the quality of the generated Classical Arabic sentences.
107
+ - Model: We might also consider using a larger base model, which is also expected to improve performance.