---
license: apache-2.0
datasets:
- mlabonne/FineTome-100k
language:
- en
metrics:
- accuracy
- bertscore
- code_eval
new_version: meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: adapter-transformers
---

# Llama-3.2-3B-Instruct (Unsloth Fine-Tune)

![License](https://img.shields.io/badge/License-Apache%202.0-blue)
![Python](https://img.shields.io/badge/Python-3.8%2B-green)
![Framework](https://img.shields.io/badge/Framework-Unsloth-ff69b4)
![Model](https://img.shields.io/badge/Model-Llama_3.2_3B-orange)

This repository contains the code used to fine-tune **Llama-3.2-3B-Instruct** with Unsloth for efficient training. The model is optimized for conversational tasks and supports 4-bit quantization, LoRA adapters, and GGUF export.

## Model Overview

- **Base Model**: [`Llama-3.2-3B-Instruct`](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct)
- **Fine-Tuning Dataset**: [FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) (converted to the Llama-3.1 chat format)
- **Features**:
  - 4-bit quantization for reduced memory usage
  - LoRA adapters (updating roughly 1–10% of parameters)
  - Sequence length: 2048 (RoPE scaling supported)
  - Optimized for Tesla T4 GPUs

## 🚀 Quick Start

### Load the GGUF Export with llama.cpp

```python
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Download the GGUF file from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="Omarrran/llama3_2_3B",
    filename="unsloth.Q4_K_M.gguf",
    cache_dir="./models"  # Save to the local models directory
)

# Initialize the model
llm = Llama(
    model_path=model_path,
    n_ctx=2048,       # Context window size
    n_threads=8,      # CPU threads to use
    n_gpu_layers=35   # Layers offloaded to the GPU (set to 0 for CPU-only)
)

# Simple generation helper
def generate_text(prompt, max_tokens=200):
    output = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.7,
        stop=["\n"]  # Stops at the first newline; remove for multi-line answers
    )
    return output['choices'][0]['message']['content']

# Example usage
if __name__ == "__main__":
    prompt = "Explain quantum computing in simple terms:"
    response = generate_text(prompt)
    print(f"Prompt: {prompt}\nResponse: {response}")
```

### Installation

```bash
pip install unsloth
# Optional: reinstall the latest Unsloth directly from GitHub
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
```

### Load Model

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    dtype=None,  # Auto-detect (bf16 on Ampere+ GPUs)
    load_in_4bit=True,
)
```

### Run Inference

```python
FastLanguageModel.for_inference(model)  # Enables Unsloth's 2x faster inference path

messages = [{"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=64,
    temperature=1.5,
    min_p=0.1,
)
print(tokenizer.decode(outputs[0]))
```

## 🛠️ Training

### Data Preparation

The dataset is standardized to the Llama-3.1 chat format:

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

tokenizer = get_chat_template(tokenizer, "llama-3.1")  # Applies the Llama-3.1 prompt template
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)  # Converts ShareGPT data to role/content format
```
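The trainer in the next section reads a plain-text `text` column (`dataset_text_field="text"`), which `standardize_sharegpt` alone does not create. The sketch below shows one way to render it, assuming the standardized dataset still carries a `conversations` column (FineTome-100k does); the `formatting_prompts_func` name is illustrative rather than part of this repository.

```python
# Render each conversation into a single prompt string for SFTTrainer (illustrative sketch)
def formatting_prompts_func(examples):
    convos = examples["conversations"]  # Assumed column name after standardize_sharegpt
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)  # Adds the "text" column
```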
### LoRA Configuration

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # ~30% less VRAM
)
```

### Training Arguments

```python
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=60,  # Demo: 60 steps. For a full pass over the data, use num_train_epochs=1 instead
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
    ),
)

trainer.train()
```

## 💾 Saving & Deployment

### Save LoRA Adapters

```python
model.save_pretrained("llama3_2_3B")
tokenizer.save_pretrained("llama3_2_3B")
```

### Export to GGUF (for llama.cpp)

```python
model.save_pretrained_gguf(
    "model",
    tokenizer,
    quantization_method="q4_k_m",  # Recommended quantization
)
```

### Upload to Hugging Face Hub

```python
model.push_to_hub_gguf(
    "your-username/llama3_2_3B",
    tokenizer,
    quantization_method=["q4_k_m", "q8_0"],  # Export multiple formats
    token="hf_your_token_here",
)
```

## 📊 Performance

| Metric                    | Value        |
|---------------------------|--------------|
| Training time (60 steps)  | ~7.5 minutes |
| Peak VRAM usage           | 6.5 GB       |
| Quantized size (Q4_K_M)   | ~1.9 GB      |

## 📜 Notes

- **Knowledge cutoff**: December 2023 (updated to July 2024 via fine-tuning)
- Use `temperature=1.5` and `min_p=0.1` for best results ([reference](https://x.com/menhguin/status/1826132708508213629))
- For 2x faster inference, enable `FastLanguageModel.for_inference(model)`

## 🤝 Contributing

- Report issues
- Star the repo if you find this useful! ⭐

## License

Apache 2.0. See the License badge at the top of this model card.

### Notes on the Quick Start Loading Code

1. **Model download**: `huggingface_hub.hf_hub_download` fetches the GGUF file from the Hub
2. **Initialization**: the `Llama()` constructor is used because llama-cpp-python has no `from_pretrained()`
3. **GPU support**: `n_gpu_layers` offloads layers to the GPU (set it to 0 for CPU-only inference)
4. **Chat completion**: `create_chat_completion` is the recommended method for chat-formatted prompts

### Requirements

```bash
pip install llama-cpp-python huggingface_hub
```

### For Better Performance

- Set `n_gpu_layers` based on your VRAM (40+ for large models)
- Pass `verbose=False` to the constructor to suppress logs
- Use `llama.cpp` optimizations:

```python
Llama(
    model_path=model_path,
    n_batch=512,    # Prompt-processing batch size
    use_mmap=True,  # Memory-map the model file
    use_mlock=True  # Lock the model in RAM to avoid swapping
)
```

### Common Errors to Handle

```python
try:
    llm = Llama(model_path=model_path)
except Exception as e:
    print(f"Error loading model: {str(e)}")
    # Check that the file exists: os.path.exists(model_path)
    # Verify file integrity: the size should match the file listed on the Hub
```
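A slightly fuller version of the checks hinted at in the comments above. This is an illustrative sketch rather than part of the repository: the `load_gguf` helper name and the 1.5 GB threshold (derived from the ~1.9 GB Q4_K_M size in the Performance table) are assumptions you should adapt to the file you actually download.

```python
import os

from llama_cpp import Llama


def load_gguf(model_path: str, n_gpu_layers: int = 0) -> Llama:
    """Sanity-check a GGUF file before handing it to llama.cpp (illustrative helper)."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"GGUF file not found: {model_path}")

    size_gb = os.path.getsize(model_path) / 1024**3
    if size_gb < 1.5:  # Q4_K_M export is ~1.9 GB; much smaller usually means a truncated download
        raise ValueError(f"File looks truncated ({size_gb:.2f} GB); re-download it")

    try:
        return Llama(
            model_path=model_path,
            n_ctx=2048,
            n_gpu_layers=n_gpu_layers,  # 0 = CPU-only
            verbose=False,
        )
    except Exception as e:
        raise RuntimeError(f"llama.cpp failed to load the model: {e}") from e


llm = load_gguf(model_path, n_gpu_layers=35)  # Reuses model_path from the Quick Start section
```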