	---
	base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
	model_name: llama-8b-south-africa
	languages:
	- Xhosa
	- Zulu
	- Tswana
	- Northern Sotho
	- Afrikaans
	license: apache-2.0
	tags:
	- african-languages
	- multilingual
	- instruction-tuning
	- transfer-learning
	library_name: peft

	model_description: |
	  This model is a fine-tuned version of Meta's LLaMA-3.1-8B-Instruct model, specifically adapted for South African languages. The training data consists of the Alpaca Cleaned dataset translated into five South African languages: Xhosa, Zulu, Tswana, Northern Sotho, and Afrikaans using machine translation techniques.

	  Key Features:
	  - Base architecture: LLaMA-3.1-8B-Instruct
	  - Training approach: Instruction tuning via translated datasets
	  - Target languages: 5 South African languages
	  - Cost-efficient: Total cost ~$1,870 ($370/language for translation + $15 for training)

	training_details:
	  hyperparameters:
	    learning_rate: 0.0002
	    train_batch_size: 4
	    eval_batch_size: 8
	    gradient_accumulation_steps: 2
	    total_train_batch_size: 8
	    optimizer: "Adam with betas=(0.9,0.999) and epsilon=1e-08"
	    lr_scheduler_type: cosine
	    lr_scheduler_warmup_ratio: 0.1
	    num_epochs: 1
	    seed: 42
	    distributed_type: multi-GPU

	  results:
	    final_loss: 1.0959
	    validation_loss: 0.0571
	    total_steps: 5596
	    completed_epochs: 0.9999

	model_evaluation:
	  xhosa:
	    afrimgsm:
	      accuracy: 0.02
	    afrimmlu:
	      accuracy: 0.29
	    afrixnli:
	      accuracy: 0.44
	  zulu:
	    afrimgsm:
	      accuracy: 0.045
	    afrimmlu:
	      accuracy: 0.29
	    afrixnli:
	      accuracy: 0.43

	limitations: |
	  - Evaluation is limited to Xhosa and Zulu, the only target languages covered by the Iroko benchmarks used here (AfriMGSM, AfriMMLU, AfriXNLI)
	  - Machine translation was used to generate the training data, which may affect its quality
	  - Low accuracy on some tasks (particularly AfriMGSM) suggests room for improvement

	framework_versions:
	  pytorch: 2.4.1+cu121
	  transformers: 4.44.2
	  peft: 0.12.0
	  datasets: 3.0.0
	  tokenizers: 0.19.1

	resources:
	  benchmark_visualization: assets/Benchmarks_(1).pdf
	  training_dataset: https://huggingface.co/datasets/yahma/alpaca-cleaned
	---

	# LLaMA-3.1-8B South African Languages Model

	This model card provides detailed information about the LLaMA-3.1-8B model fine-tuned for South African languages. The model demonstrates cost-effective cross-lingual transfer learning for African language processing.

	## Model Overview

	The model is based on Meta's LLaMA-3.1-8B-Instruct architecture and has been fine-tuned on translated versions of the Alpaca Cleaned dataset. The training approach leverages machine translation to create instruction-tuning data in five South African languages, making it a cost-effective solution for multilingual AI development.
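
The cost claim can be checked against the figures reported in this card ($370 per language for translation, $15 for training). The following is a minimal arithmetic sketch; the variable names are illustrative and not part of any released code.

```python
# Figures as reported in this card (USD).
TRANSLATION_COST_PER_LANGUAGE = 370
TRAINING_COST = 15
LANGUAGES = ["Xhosa", "Zulu", "Tswana", "Northern Sotho", "Afrikaans"]

# Translation is paid once per target language; training is a single run.
translation_total = TRANSLATION_COST_PER_LANGUAGE * len(LANGUAGES)
total_cost = translation_total + TRAINING_COST

print(f"Translation: ${translation_total:,}")  # Translation: $1,850
print(f"Total: ${total_cost:,}")               # Total: $1,865
```

The result, $1,865, is consistent with the rounded figure of about $1,870 quoted in the card metadata.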

	## Training Methodology

	### Dataset Preparation
	The training data was created by translating the Alpaca Cleaned dataset into five target languages:
	- Xhosa
	- Zulu
	- Tswana
	- Northern Sotho
	- Afrikaans

	Machine translation was used to generate the training data, with a cost of $370 per language.
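
The translation step can be pictured as mapping each Alpaca record field by field. The sketch below is illustrative only: the card does not name the translation system, so `translate` is a hypothetical callable supplied by the caller, `fake_translate` is a stand-in for demonstration, and the language code is an assumption. The `instruction`, `input`, and `output` fields follow the Alpaca Cleaned schema.

```python
from typing import Callable, Dict

ALPACA_FIELDS = ("instruction", "input", "output")


def translate_record(
    record: Dict[str, str],
    translate: Callable[[str, str], str],
    target_lang: str,
) -> Dict[str, str]:
    """Translate every non-empty Alpaca field into target_lang."""
    translated = {}
    for field in ALPACA_FIELDS:
        text = record.get(field, "")
        # Many Alpaca records have an empty "input"; keep empty strings empty.
        translated[field] = translate(text, target_lang) if text else ""
    return translated


# Hypothetical stand-in translator; a real pipeline would call an MT system here.
def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


example = {
    "instruction": "Name three primary colors.",
    "input": "",
    "output": "Red, blue, and yellow.",
}
print(translate_record(example, fake_translate, "xho"))
```

Keeping the translator as a parameter means the same loop could be reused for each of the five target languages.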

	### Training Process
	The model was trained using the PEFT (Parameter-Efficient Fine-Tuning) library on the Akash Compute Network. Key aspects of the training process include:
	- Single epoch training
	- Multi-GPU distributed training setup
	- Cosine learning rate schedule with 10% warmup
	- Adam optimizer with β1=0.9, β2=0.999, ε=1e-08
	- Total training cost: $15
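
To make the learning rate schedule concrete, the sketch below computes the rate at a given step from the reported hyperparameters (peak rate 0.0002, 5596 total steps, 10% warmup). It assumes the common shape of linear warmup from zero followed by cosine decay to zero, and it assumes the warmup length is rounded up; the card does not state the exact scheduler implementation, so treat this as an approximation.

```python
import math

PEAK_LR = 2e-4        # learning_rate from the card
TOTAL_STEPS = 5596    # total_steps from the card
WARMUP_RATIO = 0.1    # lr_scheduler_warmup_ratio from the card


def lr_at(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 280, 560, 3078, 5596):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at step 560 and falls to half of the peak at the midpoint of the decay phase (step 3078).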

	## Performance Evaluation

	### Evaluation Scope
	Current evaluation metrics are available for two languages:
	1. Xhosa (xho)
	2. Zulu (zul)

	Evaluation was conducted using three benchmark datasets: AfriMGSM, AfriMMLU, and AfriXNLI.

	### AfriMGSM Results
	- Xhosa: 2.0% accuracy
	- Zulu: 4.5% accuracy

	### AfriMMLU Results
	- Xhosa: 29.0% accuracy
	- Zulu: 29.0% accuracy

	### AfriXNLI Results
	- Xhosa: 44.0% accuracy
	- Zulu: 43.0% accuracy
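
For side-by-side comparison, the reported scores can be collected into one structure and printed per language. This is a small convenience sketch using only the numbers stated in this card; the helper name `format_rows` is illustrative.

```python
# Accuracies as reported in the card metadata (fractions, not percentages).
RESULTS = {
    "xhosa": {"afrimgsm": 0.02, "afrimmlu": 0.29, "afrixnli": 0.44},
    "zulu": {"afrimgsm": 0.045, "afrimmlu": 0.29, "afrixnli": 0.43},
}


def format_rows(results):
    """Return one text row per language with each benchmark as a percentage."""
    benchmarks = sorted({name for scores in results.values() for name in scores})
    rows = []
    for lang, scores in results.items():
        cells = " | ".join(f"{name}: {scores[name]:.1%}" for name in benchmarks)
        rows.append(f"{lang:<6} | {cells}")
    return rows


for row in format_rows(RESULTS):
    print(row)
```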

	## Limitations and Considerations

	1. Evaluation Coverage
	   - Only Xhosa and Zulu could be evaluated due to limitations in available benchmarking tools
	   - Performance on other supported languages remains unknown

	2. Training Data Quality
	   - Reliance on machine translation may impact the quality of training data
	   - Potential artifacts or errors from the translation process could affect model performance

	3. Performance Gaps
	   - Notably low performance on AfriMGSM tasks indicates room for improvement
	   - Further investigation needed to understand performance disparities across tasks

	## Technical Requirements

	The model was trained with the following framework versions, which are the recommended starting point for reproducing the setup:
	- PyTorch: 2.4.1+cu121
	- Transformers: 4.44.2
	- PEFT: 0.12.0
	- Datasets: 3.0.0
	- Tokenizers: 0.19.1
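
One way to compare a local environment against these versions is to query installed package metadata with the standard library. This is a minimal sketch: the distribution names (for example `torch` for PyTorch) are assumed to match what is installed locally, and an exact string match may be stricter than needed, since nearby versions may also work.

```python
from importlib import metadata

# Versions listed in this card, keyed by assumed PyPI distribution name.
EXPECTED = {
    "torch": "2.4.1+cu121",
    "transformers": "4.44.2",
    "peft": "0.12.0",
    "datasets": "3.0.0",
    "tokenizers": "0.19.1",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed, installed == wanted)
    return report


for package, (wanted, installed, ok) in check_versions().items():
    status = "ok" if ok else "differs"
    print(f"{package}: expected {wanted}, found {installed} ({status})")
```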

	## Usage Example

	```python
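	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load the model and tokenizer.
	# The repository ID below is copied from this card; adjust it if the model is
	# hosted under a different namespace. If the repository holds PEFT adapter
	# weights (the card lists `library_name: peft`), `peft` must be installed.
	model_name = "meta-llama/llama-8b-south-africa"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name)

	# Example usage for text generation
	text = "Translate to Xhosa: Hello, how are you?"
	inputs = tokenizer(text, return_tensors="pt")
	# max_new_tokens bounds only the generated text; max_length would also count the prompt
	outputs = model.generate(**inputs, max_new_tokens=50)
	result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(result)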
	```

	## License

	This model is released under the Apache 2.0 license. The full license text can be found at https://www.apache.org/licenses/LICENSE-2.0.txt

	## Acknowledgments

	- Meta AI for the base LLaMA-3.1-8B-Instruct model
	- Akash Network for providing computing resources
	- Contributors to the Alpaca Cleaned dataset
	- The African NLP community for benchmark datasets and evaluation tools