---
license: llama3.1
tags:
- gguf
- llama3
pipeline_tag: text-generation
datasets:
- froggeric/imatrix
language:
- en
library_name: ggml
---
|
|
|
# Meta-Llama-3.1-405B-Instruct-GGUF
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6604e5b21eb292d6df393365/o7DiWuILyzaPLh4Ne1JKr.png)
|
|
|
Low-bit quantizations of Meta's Llama 3.1 405B Instruct model, requantized from the Ollama Q4_0 GGUF.
|
|
|
Quantized with llama.cpp [b3449](https://github.com/ggerganov/llama.cpp/releases/tag/b3449).
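
For reference, requantizing a GGUF with llama.cpp is done with the `llama-quantize` tool; below is a minimal sketch of such an invocation (the filenames and imatrix path are illustrative placeholders, not the exact files used for this repo):

```
# Requantize an existing GGUF (here: a Q4_0 source) down to Q2_K, applying an importance matrix.
# --allow-requantize is needed because the source model is already quantized.
./llama-quantize --allow-requantize --imatrix imatrix.dat \
    Meta-Llama-3.1-405B-Instruct-Q4_0.gguf Meta-Llama-3.1-405B-Instruct-Q2_K.gguf Q2_K
```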
|
|
|
| Quant   | Notes                                                              |
|---------|--------------------------------------------------------------------|
| BF16    | Brain floating point, very high quality, same size as F16          |
| Q8_0    | 8-bit quantization, high quality, larger size                      |
| Q6_K    | 6-bit quantization, very good quality-to-size ratio                |
| Q5_K    | 5-bit quantization, good balance of quality and size               |
| Q5_0    | Alternative 5-bit quantization, slightly different balance         |
| Q4_K_M  | 4-bit quantization, good for production use                        |
| Q4_K_S  | 4-bit quantization, faster inference, efficient for scaling        |
| Q4_0    | Basic 4-bit quantization, good for experimentation                 |
| Q3_K_L  | 3-bit quantization, high quality with higher VRAM requirement      |
| Q3_K_M  | 3-bit quantization, good balance between speed and accuracy        |
| Q3_K_S  | 3-bit quantization, faster inference with minor quality loss       |
| Q2_K    | 2-bit quantization, suitable for general inference tasks           |
| IQ2_S   | Integer 2-bit quantization, optimized for small VRAM environments  |
| IQ2_XXS | Integer 2-bit quantization, best for ultra-low memory footprint    |
| IQ1_M   | Integer 1-bit quantization, usable                                 |
| IQ1_S   | Integer 1-bit quantization, not recommended                        |
|
|
|
For higher quality quantizations (Q4 and above), please refer to [nisten/meta-405b-instruct-cpu-optimized-gguf](https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf).
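
To sanity-check a downloaded quant, it can be run directly with llama.cpp's `llama-cli`; a minimal sketch, assuming a Q2_K file name (for models split into multiple parts, point `-m` at the first shard and llama.cpp will pick up the rest):

```
# Short test generation against the Q2_K quant (filename is a placeholder)
./llama-cli -m Meta-Llama-3.1-405B-Instruct-Q2_K.gguf -p "Why is the sky blue?" -n 128 -c 4096
```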
|
|
|
Regarding the `smaug-bpe` pre-tokenizer: this makes no practical difference, since the `smaug-bpe` and `llama-bpe` pre-tokenizers are identical. However, if you have concerns, you can use the following command to set the pre-tokenizer to `llama-bpe`:
|
```
./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" Llama-3.1-405B-Instruct-old.gguf Llama-3.1-405B-Instruct-fixed.gguf
```
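
If you want to check which pre-tokenizer a file currently carries, the metadata can be dumped with the `gguf_dump.py` script from the same directory (a sketch; the filename and the `--no-tensors` flag are assumptions about your local script version):

```
# Print GGUF metadata and filter for the pre-tokenizer field (filename is a placeholder)
./gguf-py/scripts/gguf_dump.py --no-tensors Llama-3.1-405B-Instruct-fixed.gguf | grep tokenizer.ggml.pre
```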
|
|
|
## imatrix
|
|
|
Generated from the Q2_K quant.
|
|
|
imatrix calibration data: `groups_merged.txt` |
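
For reference, an importance matrix of this kind is computed with llama.cpp's `llama-imatrix` tool; a minimal sketch, assuming the Q2_K quant and the `groups_merged.txt` calibration file mentioned above (the output filename is a placeholder):

```
# Compute an importance matrix over the calibration data (filenames are illustrative)
./llama-imatrix -m Meta-Llama-3.1-405B-Instruct-Q2_K.gguf -f groups_merged.txt -o imatrix.dat
```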