chad-brouze committed
Commit 819655b · verified · 1 Parent(s): c8db23c

Update README.md

Files changed (1):
  1. README.md +154 -93
README.md CHANGED
@@ -1,118 +1,179 @@
  ---
  base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
- datasets:
- - generator
- library_name: peft
  license: apache-2.0
  tags:
- - trl
- - sft
- - generated_from_trainer
  - african-languages
-
- benchmark_visualization: assets/Benchmarks_(1).pdf
-
- model-index:
- - name: llama-8b-south-africa
-   results:
-   - task:
-       type: text-generation
-       name: African Language Evaluation
-     dataset:
-       name: afrimgsm_direct_xho
-       type: text-classification
-       split: test
-     metrics:
-     - name: Accuracy
-       type: accuracy
-       value: 0.02
-   - task:
-       type: text-generation
-       name: African Language Evaluation
-     dataset:
-       name: afrimgsm_direct_zul
-       type: text-classification
-       split: test
-     metrics:
-     - name: Accuracy
-       type: accuracy
-       value: 0.045
-   - task:
-       type: text-generation
-       name: African Language Evaluation
-     dataset:
-       name: afrimmlu_direct_xho
-       type: text-classification
-       split: test
-     metrics:
-     - name: Accuracy
-       type: accuracy
-       value: 0.29
-   - task:
-       type: text-generation
-       name: African Language Evaluation
-     dataset:
-       name: afrimmlu_direct_zul
-       type: text-classification
-       split: test
-     metrics:
-     - name: Accuracy
-       type: accuracy
-       value: 0.29
-   - task:
-       type: text-generation
-       name: African Language Evaluation
-     dataset:
-       name: afrixnli_en_direct_xho
-       type: text-classification
-       split: test
-     metrics:
-     - name: Accuracy
-       type: accuracy
-       value: 0.44
-   - task:
-       type: text-generation
-       name: African Language Evaluation
-     dataset:
-       name: afrixnli_en_direct_zul
-       type: text-classification
-       split: test
-     metrics:
-     - name: Accuracy
-       type: accuracy
-       value: 0.43
 
  model_description: |
-   This model is a fine-tuned version of [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) on the generator dataset.
-   [Alpaca Cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) translated into Xhosa, Zulu, Tswana, Northern Sotho and Afrikaans using machine translation.

-   The model could only be evaluated in Xhosa and Zulu due to Iroko language availability. Its aim is to show that cross-lingual transfer can be achieved at low cost. Translation cost roughly $370 per language, and training cost roughly $15 using an Akash Compute Network GPU.
 
  training_details:
-   loss: 1.0571
    hyperparameters:
      learning_rate: 0.0002
      train_batch_size: 4
      eval_batch_size: 8
-     seed: 42
-     distributed_type: multi-GPU
      gradient_accumulation_steps: 2
      total_train_batch_size: 8
-     optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
      lr_scheduler_type: cosine
      lr_scheduler_warmup_ratio: 0.1
      num_epochs: 1
 
- training_results:
-   final_loss: 1.0959
-   epochs: 0.9999
-   steps: 5596
-   validation_loss: 1.0571
 
  framework_versions:
-   peft: 0.12.0
-   transformers: 4.44.2
    pytorch: 2.4.1+cu121
    datasets: 3.0.0
    tokenizers: 0.19.1
- ---
  ---
  base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
+ model_name: llama-8b-south-africa
+ languages:
+ - Xhosa
+ - Zulu
+ - Tswana
+ - Northern Sotho
+ - Afrikaans
  license: apache-2.0
  tags:
  - african-languages
+ - multilingual
+ - instruction-tuning
+ - transfer-learning
+ library_name: peft
 
  model_description: |
+   This model is a fine-tuned version of Meta's LLaMA-3.1-8B-Instruct, adapted for South African languages. The training data consists of the Alpaca Cleaned dataset translated into five South African languages (Xhosa, Zulu, Tswana, Northern Sotho and Afrikaans) using machine translation.

+   Key Features:
+   - Base architecture: LLaMA-3.1-8B-Instruct
+   - Training approach: Instruction tuning via translated datasets
+   - Target languages: 5 South African languages
+   - Cost-efficient: Total cost ~$1,865 ($370/language for translation + $15 for training)
 
  training_details:
    hyperparameters:
      learning_rate: 0.0002
      train_batch_size: 4
      eval_batch_size: 8
      gradient_accumulation_steps: 2
      total_train_batch_size: 8
+     optimizer: "Adam with betas=(0.9,0.999) and epsilon=1e-08"
      lr_scheduler_type: cosine
      lr_scheduler_warmup_ratio: 0.1
      num_epochs: 1
+     seed: 42
+     distributed_type: multi-GPU
+
+   results:
+     final_loss: 1.0959
+     validation_loss: 1.0571
+     total_steps: 5596
+     completed_epochs: 0.9999
+
+ model_evaluation:
+   xhosa:
+     afrimgsm:
+       accuracy: 0.02
+     afrimmlu:
+       accuracy: 0.29
+     afrixnli:
+       accuracy: 0.44
+   zulu:
+     afrimgsm:
+       accuracy: 0.045
+     afrimmlu:
+       accuracy: 0.29
+     afrixnli:
+       accuracy: 0.43
 
+ limitations: |
+   - Current evaluation is limited to Xhosa and Zulu because the Iroko benchmarks do not yet cover the other target languages
+   - Machine translation was used to generate the training data, which may impact quality
+   - Low performance on certain tasks (particularly AfriMGSM) suggests room for improvement
 
  framework_versions:
    pytorch: 2.4.1+cu121
+   transformers: 4.44.2
+   peft: 0.12.0
    datasets: 3.0.0
    tokenizers: 0.19.1
+
+ resources:
+   benchmark_visualization: assets/Benchmarks_(1).pdf
+   training_dataset: https://huggingface.co/datasets/yahma/alpaca-cleaned
+ ---
+
+ # LLaMA-3.1-8B South African Languages Model
+
+ This model card provides detailed information about the LLaMA-3.1-8B model fine-tuned for South African languages. The model demonstrates cost-effective cross-lingual transfer learning for African language processing.
+
+ ## Model Overview
+
+ The model is based on Meta's LLaMA-3.1-8B-Instruct architecture and has been fine-tuned on translated versions of the Alpaca Cleaned dataset. The training approach leverages machine translation to create instruction-tuning data in five South African languages, making it a cost-effective solution for multilingual AI development.
+
+ ## Training Methodology
+
+ ### Dataset Preparation
+ The training data was created by translating the Alpaca Cleaned dataset into five target languages:
+ - Xhosa
+ - Zulu
+ - Tswana
+ - Northern Sotho
+ - Afrikaans
+
+ Machine translation was used to generate the training data, with a cost of $370 per language.
+
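+ The card does not say which machine-translation system was used, so the following is only a minimal sketch of the translation step: it loads yahma/alpaca-cleaned and maps a placeholder `translate()` function (an assumption to be replaced with a real MT backend) over the Alpaca fields for each target language.
+
+ ```python
+ # Illustrative sketch only: translate() is a placeholder, not the pipeline used for this model.
+ from datasets import load_dataset
+
+ TARGET_LANGS = ["xho", "zul", "tsn", "nso", "afr"]  # ISO 639-3 codes for the five languages
+
+ def translate(text: str, target_lang: str) -> str:
+     # Plug in your machine-translation model or API of choice here.
+     raise NotImplementedError
+
+ def translate_example(example, target_lang):
+     # Translate the three Alpaca fields; keep empty inputs empty.
+     return {
+         "instruction": translate(example["instruction"], target_lang),
+         "input": translate(example["input"], target_lang) if example["input"] else "",
+         "output": translate(example["output"], target_lang),
+     }
+
+ alpaca = load_dataset("yahma/alpaca-cleaned", split="train")
+ translated = {
+     lang: alpaca.map(translate_example, fn_kwargs={"target_lang": lang})
+     for lang in TARGET_LANGS
+ }
+ ```
+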
+ ### Training Process
+ The model was trained using the PEFT (Parameter-Efficient Fine-Tuning) library on the Akash Compute Network. Key aspects of the training process include (a configuration sketch follows this list):
+ - Single epoch training
+ - Multi-GPU distributed training setup
+ - Cosine learning rate schedule with 10% warmup
+ - Adam optimizer with β1=0.9, β2=0.999, ε=1e-08
+ - Total training cost: $15
+
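+ The original training script is not included in this repository. The sketch below shows an equivalent TRL + PEFT (LoRA) setup using the hyperparameters listed in the metadata above; the dataset path, the LoRA settings, and the assumption that examples are pre-formatted into a `text` column are illustrative.
+
+ ```python
+ # Minimal sketch (not the original script): LoRA instruction tuning with TRL + PEFT.
+ # Exact argument names can vary slightly between TRL versions.
+ from datasets import load_dataset
+ from peft import LoraConfig
+ from trl import SFTConfig, SFTTrainer
+
+ dataset = load_dataset(
+     "json", data_files="alpaca_cleaned_translated.jsonl", split="train"
+ )  # assumed path; examples already formatted into a single "text" column
+
+ peft_config = LoraConfig(  # assumed adapter settings
+     r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+ )
+
+ # Hyperparameters taken from this model card.
+ args = SFTConfig(
+     output_dir="llama-8b-south-africa",
+     learning_rate=2e-4,
+     per_device_train_batch_size=4,
+     per_device_eval_batch_size=8,
+     gradient_accumulation_steps=2,
+     num_train_epochs=1,
+     lr_scheduler_type="cosine",
+     warmup_ratio=0.1,
+     seed=42,
+     dataset_text_field="text",
+ )
+
+ trainer = SFTTrainer(
+     model="meta-llama/Meta-Llama-3.1-8B-Instruct",
+     args=args,
+     train_dataset=dataset,
+     peft_config=peft_config,
+ )
+ trainer.train()
+ ```
+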
+ ## Performance Evaluation
+
+ ### Evaluation Scope
+ Current evaluation metrics are available for two languages:
+ 1. Xhosa (xho)
+ 2. Zulu (zul)
+
+ Evaluation was conducted using three benchmark datasets (a reproduction sketch follows the results below):
+
+ ### AfriMGSM Results
+ - Xhosa: 2.0% accuracy
+ - Zulu: 4.5% accuracy
+
+ ### AfriMMLU Results
+ - Xhosa: 29.0% accuracy
+ - Zulu: 29.0% accuracy
+
+ ### AfriXNLI Results
+ - Xhosa: 44.0% accuracy
+ - Zulu: 43.0% accuracy
+
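+ The card does not state which evaluation harness produced these scores. As a hedged sketch, the task identifiers from the metadata above could be passed to EleutherAI's lm-evaluation-harness, assuming those tasks are registered in your installation and the adapter ID is replaced with this repository's full Hub path:
+
+ ```python
+ # Sketch only: assumes the Iroko task names above are available in lm-eval.
+ import lm_eval
+
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args=(
+         "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,"
+         "peft=llama-8b-south-africa"  # replace with the adapter's full Hub ID
+     ),
+     tasks=[
+         "afrimgsm_direct_xho", "afrimgsm_direct_zul",
+         "afrimmlu_direct_xho", "afrimmlu_direct_zul",
+         "afrixnli_en_direct_xho", "afrixnli_en_direct_zul",
+     ],
+     batch_size=8,
+ )
+ print(results["results"])
+ ```
+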
+ ## Limitations and Considerations
+
+ 1. **Evaluation Coverage**
+    - Only Xhosa and Zulu could be evaluated due to limitations in available benchmarking tools
+    - Performance on the other supported languages remains unknown
+
+ 2. **Training Data Quality**
+    - Reliance on machine translation may impact the quality of training data
+    - Artifacts or errors from the translation process could affect model performance
+
+ 3. **Performance Gaps**
+    - Notably low performance on AfriMGSM tasks indicates room for improvement
+    - Further investigation is needed to understand performance disparities across tasks
+
+ ## Technical Requirements
+
+ The model requires the following framework versions:
+ - PyTorch: 2.4.1+cu121
+ - Transformers: 4.44.2
+ - PEFT: 0.12.0
+ - Datasets: 3.0.0
+ - Tokenizers: 0.19.1
+
+ ## Usage Example
+
+ Because the model was trained with PEFT (see `library_name: peft` above), this example assumes the weights are distributed as a LoRA adapter on top of meta-llama/Meta-Llama-3.1-8B-Instruct. Replace the adapter ID below with this repository's full Hub path.
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the base model and tokenizer
+ base_model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(base_model_name)
+ base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
+
+ # Apply the fine-tuned adapter (replace with this repository's full Hub ID)
+ model = PeftModel.from_pretrained(base_model, "llama-8b-south-africa")
+
+ # Example usage for text generation
+ text = "Translate to Xhosa: Hello, how are you?"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=50)
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(result)
+ ```
+
+ ## License
+
+ This model is released under the Apache 2.0 license. The full license text can be found at https://www.apache.org/licenses/LICENSE-2.0.txt
+
+ ## Acknowledgments
+
+ - Meta AI for the base LLaMA-3.1-8B-Instruct model
+ - Akash Network for providing computing resources
+ - Contributors to the Alpaca Cleaned dataset
+ - The African NLP community for benchmark datasets and evaluation tools