---
license: mit
datasets:
- eligapris/kirundi-english
language:
- rn
library_name: transformers
tags:
- kirundi
- rn
---

# Kirundi Tokenizer and LoRA Model

## Model Description

This repository contains two main components:

1. A BPE tokenizer trained specifically for the Kirundi language (ISO code: run)
2. A LoRA adapter trained for Kirundi language processing

### Tokenizer Details

- **Type**: BPE (Byte-Pair Encoding)
- **Vocabulary Size**: 30,000 tokens
- **Special Tokens**: [UNK], [CLS], [SEP], [PAD], [MASK]
- **Pre-tokenization**: Whitespace-based

### LoRA Adapter Details

- **Base Model**: [To be filled with your chosen base model]
- **Rank**: 8
- **Alpha**: 32
- **Target Modules**: Query and value attention matrices
- **Dropout**: 0.05

## Intended Uses & Limitations

### Intended Uses

- Text processing for the Kirundi language
- Machine translation tasks involving Kirundi
- Natural language understanding tasks for Kirundi content
- Foundation for developing Kirundi language applications

### Limitations

- The tokenizer is trained on a specific corpus and may not cover all Kirundi dialects
- Limited to the vocabulary observed in the training data
- Performance may vary on domain-specific text

## Training Data

The model components were trained on the Kirundi-English parallel corpus:

- **Dataset**: eligapris/kirundi-english
- **Size**: 21.4k sentence pairs
- **Nature**: Parallel corpus with Kirundi and English translations
- **Domain**: Mixed, including religious, general, and conversational text

## Training Procedure

### Tokenizer Training

- Trained using Hugging Face's Tokenizers library
- BPE algorithm with a vocabulary size of 30k
- Includes special tokens for task-specific usage
- Trained on the Kirundi portion of the parallel corpus

### LoRA Training

[To be filled with your specific training details]

- Number of epochs:
- Batch size:
- Learning rate:
- Training hardware:
- Training time:

## Evaluation Results

[To be filled with your evaluation metrics]

- Coverage statistics:
- Out-of-vocabulary rate:
- Task-specific metrics:

## Environmental Impact

[To be filled with training compute details]

- Estimated CO2 emissions:
- Hardware used:
- Training duration:

## Technical Specifications

### Model Architecture

- Tokenizer: BPE-based with custom vocabulary
- LoRA configuration:
  - r=8 (rank)
  - α=32 (scaling)
  - Applied to the query and value attention projections listed above
  - Dropout rate: 0.05

An example `LoraConfig` matching these values is shown in the worked examples at the end of this card.

### Software Requirements

```python
dependencies = {
    "transformers": ">=4.30.0",
    "tokenizers": ">=0.13.0",
    "peft": ">=0.4.0"
}
```

## How to Use

### Loading the Tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")
```

### Loading the LoRA Model

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

config = PeftConfig.from_pretrained("path_to_lora_model")
model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, "path_to_lora_model")
```

Worked examples covering tokenization, LoRA configuration, and inference are collected at the end of this card.

## Contact

Eligapris

---

## Updates and Versions

- v1.0.0 (Initial Release)
  - Base tokenizer and LoRA model
  - Trained on Kirundi-English parallel corpus
  - Basic functionality and documentation

## Acknowledgments

- Dataset provided by eligapris
- Hugging Face's Transformers and Tokenizers libraries
- PEFT library for LoRA implementation
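
## Worked Examples

### Tokenizing Kirundi Text

A minimal sketch of encoding and decoding text with the tokenizer described above. The repository path and the sample sentence are placeholders; substitute the actual local path or Hub ID of this repository.

```python
from transformers import PreTrainedTokenizerFast

# "path_to_tokenizer" is a placeholder for the local path or Hub ID of this repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")

# Any Kirundi string works here; "amahoro" ("peace", a common greeting) is used as a short example.
text = "Amahoro"
encoding = tokenizer(text)

print(encoding["input_ids"])                                   # token IDs
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # BPE subword pieces
print(tokenizer.decode(encoding["input_ids"]))                 # round-trip back to text
```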
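
### Declaring the LoRA Configuration

A sketch of how a PEFT `LoraConfig` matching the hyperparameters stated above (r=8, alpha=32, dropout=0.05, query/value attention modules) could be declared. The base model name, the task type, and the target module names (`q_proj`, `v_proj`) are assumptions that depend on the base model actually chosen.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Placeholder: the base model is not fixed by this card.
base_model = AutoModelForSequenceClassification.from_pretrained("base_model_name")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,           # assumed task type; adjust for your use case
    r=8,                                  # rank, as documented above
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed names for the query/value projections
)

# Wrap the base model with the adapter and report how many parameters are trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```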
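
### Running Inference with the Adapter

A sketch combining the tokenizer and the LoRA-adapted model for a single forward pass. The paths are the same placeholders used in the loading examples, and the classification setup is illustrative.

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForSequenceClassification, PreTrainedTokenizerFast

# Placeholders, as in the loading examples above.
config = PeftConfig.from_pretrained("path_to_lora_model")
base_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "path_to_lora_model")
tokenizer = PreTrainedTokenizerFast.from_pretrained("path_to_tokenizer")

model.eval()
inputs = tokenizer("Amahoro", return_tensors="pt")
with torch.no_grad():
    # Pass only input_ids; some base models do not accept token_type_ids.
    logits = model(input_ids=inputs["input_ids"]).logits

print(logits.argmax(dim=-1).item())  # predicted class index
```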