# Llama-3.2 1B 4-bit Quantized Model
## Model Overview
- Base Model: meta-llama/Llama-3.2-1B
- Model Name: rautaditya/llama-3.2-1b-4bit-gptq
- Quantization: 4-bit GPTQ (post-training quantization for generative pre-trained transformers)
## Model Description
This is a 4-bit quantized version of the Llama-3.2 1B model, designed to reduce model size and inference latency while maintaining reasonable output quality. Quantization makes deployment in resource-constrained environments more efficient.
## Key Features
- Reduced model size
- Faster inference times
- Compatible with Hugging Face Transformers
- GPTQ quantization for a favorable compression/accuracy trade-off
## Quantization Details
- Quantization Method: GPTQ (post-training quantization for generative pre-trained transformers)
- Bit Depth: 4-bit
- Base Model: Llama-3.2 1B
- Quantization Library: AutoGPTQ
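The calibration data and exact GPTQ hyperparameters used for this checkpoint are not documented in this card. As a rough illustration of how such an export is typically produced with AutoGPTQ, here is a minimal sketch; the `group_size`, `desc_act`, and calibration texts are illustrative assumptions, not the settings actually used:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# 4-bit GPTQ settings; group_size=128 and desc_act=False are common defaults,
# not the documented settings for this checkpoint
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# GPTQ calibrates on a small set of tokenized examples; a real export would use a
# larger, more representative sample than these placeholder sentences
calibration_texts = [
    "Quantization reduces the memory footprint of large language models.",
    "The quick brown fox jumps over the lazy dog.",
]
examples = [tokenizer(text) for text in calibration_texts]

model.quantize(examples)
model.save_quantized("llama-3.2-1b-4bit-gptq")
tokenizer.save_pretrained("llama-3.2-1b-4bit-gptq")
```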
## Installation Requirements
```bash
# optimum provides the GPTQ integration used by Transformers when loading this checkpoint
pip install transformers accelerate optimum auto-gptq torch
```
## Usage
### Transformers Pipeline
```python
from transformers import AutoTokenizer, pipeline

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" places the quantized weights on the available GPU(s)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
)

prompt = "What is the meaning of life?"
generated_text = pipe(prompt, max_length=100)
print(generated_text)
```
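The pipeline returns a list of dictionaries; each entry's `generated_text` field contains the prompt followed by the model's completion.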
### Direct Model Loading
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "rautaditya/llama-3.2-1b-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# from_quantized (rather than from_pretrained) loads an already-quantized GPTQ checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map="auto",
)
```
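A short generation example with the directly loaded model; the prompt and `max_new_tokens` value are arbitrary choices for illustration:

```python
import torch

prompt = "What is the meaning of life?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding of a short completion; adjust max_new_tokens or add sampling as needed
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```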
## Performance Considerations
- Memory Efficiency: 4-bit weights take roughly a quarter of the space of a 16-bit checkpoint (plus a small overhead for quantization scales), significantly reducing the memory footprint compared to the full-precision model
- Inference Speed: Decoding is typically faster because far less weight data has to be read from memory at each step
- Potential Accuracy Trade-off: Minor quality degradation compared to the full-precision model (a quick way to check memory use and latency on your own hardware is sketched below)
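A minimal sketch of that check, assuming the `pipe` object from the pipeline example above is already loaded; absolute numbers depend heavily on your hardware and generation settings:

```python
import time

# Approximate size of the loaded quantized weights, in GB
print(f"Memory footprint: {pipe.model.get_memory_footprint() / 1e9:.2f} GB")

# Rough generation latency for a short completion
start = time.perf_counter()
pipe("Quantization trades a little accuracy for", max_new_tokens=64)
print(f"Generation time: {time.perf_counter() - start:.2f} s")
```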
## Limitations
- May show slight differences in output quality compared to the original model
- Performance can vary based on specific use case and inference environment
## Recommended Use Cases
- Low-resource environments
- Edge computing
- Mobile applications
- Embedded systems
- Rapid prototyping
## License
This model is a derivative of Meta Llama 3.2; please refer to the Llama 3.2 Community License for usage restrictions and permissions.
## Citation
If you use this model, please cite:
```bibtex
@misc{llama3.2_4bit_quantized,
  title={Llama-3.2 1B 4-bit Quantized Model},
  author={Raut, Aditya},
  year={2024},
  publisher={Hugging Face}
}
```
## Contributions and Feedback
- Open to suggestions and improvements
- Please file issues on the GitHub repository for any bugs or performance concerns
## Acknowledgments
- Meta AI for the base Llama-3.2 model
- Hugging Face Transformers team
- AutoGPTQ library contributors