|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Alfaxad/Inkuba-Mono-Swahili |
|
language: |
|
- sw |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
tags: |
|
- gemma2 |
|
- text-2-text |
|
- text-generation |
|
- llms |
|
base_model: |
|
- google/gemma-2-2b |
|
--- |
|
|
# Gemma2-2B-Swahili-Preview |
|
Gemma2-2B-Swahili-Preview is a Swahili adaptation of the Gemma 2 2B base model, fine-tuned on the Inkuba-Mono Swahili dataset to enhance Swahili language understanding through monolingual training.
|
|
|
## Model Details |
|
- **Developer:** Alfaxad Eyembe |
|
- **Base Model:** google/gemma-2-2b |
|
- **Model Type:** Decoder-only transformer |
|
- **Language:** Swahili |
|
- **License:** Apache 2.0 |
|
- **Fine-tuning Approach:** Low-Rank Adaptation (LoRA) |
|
|
|
## Training Data |
|
The model was fine-tuned on a focused subset of the Inkuba-Mono dataset (a sampling sketch follows the list):
|
- 1,000,000 randomly selected examples |
|
- Total tokens: 60,831,073 |
|
- Average text length: 101.33 characters |
|
- Diverse Swahili text sources including news, social media, and various domains |
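
The exact subset-selection script is not published with this card; the sketch below is only one way such a 1,000,000-example sample could be drawn with the `datasets` library (the split name and seed are assumptions).

```python
from datasets import load_dataset

# Illustrative sampling sketch (not the published preprocessing script):
# draw a random 1,000,000-example subset of the Inkuba-Mono Swahili corpus.
dataset = load_dataset("Alfaxad/Inkuba-Mono-Swahili", split="train")  # assumed split name
subset = dataset.shuffle(seed=42).select(range(1_000_000))            # assumed seed

print(subset.num_rows)  # expected: 1000000
```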
|
|
|
## Training Details |
|
- **Fine-tuning Method:** LoRA (see the configuration sketch after this list)
|
- **Training Steps:** 2,500 |
|
- **Batch Size:** 2 |
|
- **Gradient Accumulation Steps:** 32 |
|
- **Learning Rate:** 2e-4 |
|
- **Training Time:** ~7.5 hours |
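
The adapter configuration itself is not reported in this card, so the sketch below only illustrates how the hyperparameters listed above map onto a typical `peft` + `transformers` setup; the rank, alpha, and target modules are assumptions, not the values actually used.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Load the base model that was fine-tuned.
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assumed adapter settings (rank, alpha, and target modules are illustrative).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# Hyperparameters taken from the list above; output_dir is hypothetical.
training_args = TrainingArguments(
    output_dir="gemma2-2b-swahili-preview",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    learning_rate=2e-4,
    max_steps=2500,
    bf16=True,
)
```

With a per-device batch size of 2 and 32 gradient-accumulation steps, the effective batch size is 64 sequences per optimizer step.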
|
|
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6375af60e3413701a9f01c0f/8fVULkKb92JTk8-65KE5R.png) |
|
|
|
|
|
|
|
## Model Capabilities |
|
This model is designed for: |
|
- Swahili text continuation |
|
- Natural language understanding |
|
- Contextual text generation |
|
- Base language modeling for Swahili |
|
|
|
## Usage |
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("alfaxadeyembe/gemma2-2b-swahili-preview") |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"alfaxadeyembe/gemma2-2b-swahili-preview", |
|
device_map="auto", |
|
torch_dtype=torch.bfloat16 |
|
) |
|
|
|
# Set to evaluation mode |
|
model.eval() |
|
|
|
# Example usage |
|
prompt = "Katika soko la Kariakoo, teknolojia mpya imewezesha" |
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=500, |
|
do_sample=True, |
|
temperature=0.7, |
|
top_p=0.95 |
|
) |
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(response) |
|
``` |
|
|
|
## Key Features |
|
- Natural Swahili text continuation |
|
- Strong cultural context understanding |
|
- Efficient parameter updates through LoRA |
|
- Diverse domain knowledge integration |
|
|
|
## Limitations |
|
- Not instruction-tuned |
|
- Provides base language-modeling behavior only (text continuation rather than dialogue or task completion)
|
- Performance varies across different text domains |
|
|
|
## Citation |
|
```bibtex |
|
@misc{gemma2-2b-swahili-preview, |
|
author = {Alfaxad Eyembe}, |
|
title = {Gemma2-2B-Swahili-Preview: Swahili Variation of Gemma2 2B}, |
|
year = {2025}, |
|
publisher = {Hugging Face}, |
|
journal = {Hugging Face Model Hub}, |
|
} |
|
``` |
|
|
|
## Contact |
|
For questions or feedback, please reach out through: |
|
- HuggingFace: [@alfaxadeyembe](https://huggingface.co/alfaxad) |
|
- X: [@alfxad](https://twitter.com/alfxad)