---
license: llama3.1
tags:
- gguf
- llama3
pipeline_tag: text-generation
datasets:
- froggeric/imatrix
language:
- en
library_name: ggml
---
# Meta-Llama-3.1-405B-Instruct-GGUF
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6604e5b21eb292d6df393365/o7DiWuILyzaPLh4Ne1JKr.png)
Low-bit quantizations of Meta's Llama 3.1 405B Instruct model, requantized from the Ollama Q4_0 GGUF.
Quantized with llama.cpp [b3449](https://github.com/ggerganov/llama.cpp/releases/tag/b3449).
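For reference, a minimal sketch of the requantization step using llama.cpp's `llama-quantize` tool (file names here are hypothetical; `--allow-requantize` is required because the source is already quantized):
```
# Requantize a Q4_0 source GGUF down to Q2_K, guided by an importance matrix.
./llama-quantize --allow-requantize --imatrix imatrix.dat \
    Meta-Llama-3.1-405B-Instruct-Q4_0.gguf \
    Meta-Llama-3.1-405B-Instruct-Q2_K.gguf Q2_K
```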
| Quant | Notes |
|-------------|--------------------------------------------|
| Q2_K | Suitable for general inference tasks |
| IQ2_XXS | Best for ultra-low memory footprint |
| IQ2_S | Optimized for small VRAM environments |
| Q3_K_M | Good balance between speed and accuracy |
| Q3_K_S | Faster inference with minor quality loss |
| Q3_K_L | High-quality with more VRAM requirement |
| Q4_K_M      | Good production balance (dequantized from Q4_0, so don't expect quality beyond the Q4_0 source) |
| Q4_0 | Basic quantization, good for experimentation|
| Q4_K_S | Fast inference, efficient for scaling |
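To try one of these quants locally, a minimal sketch with llama.cpp's `llama-cli` (the file name is hypothetical, and even at 2-bit a 405B model needs well over 100 GB of memory):
```
# Load a quantized GGUF and generate 128 tokens from a prompt.
./llama-cli -m Meta-Llama-3.1-405B-Instruct.Q2_K.gguf \
    -p "Why is the sky blue?" -n 128
```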
For higher quality quantizations (q4+), please refer to [nisten/meta-405b-instruct-cpu-optimized-gguf](https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf).
Regarding the `smaug-bpe` pre-tokenizer metadata: it makes no practical difference, as the `smaug-bpe` and `llama-bpe` pre-tokenizers are identical. However, if you have concerns, you can use the following command to set the `llama-bpe` pre-tokenizer:
```
./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" Llama-3.1-405B-Instruct-old.gguf Llama-3.1-405B-Instruct-fixed.gguf
```
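To confirm the change took effect, `gguf_dump.py` (shipped alongside in `gguf-py/scripts`) can print the metadata; the pre-tokenizer is stored under the `tokenizer.ggml.pre` key:
```
# Dump KV metadata only and check the pre-tokenizer field.
./gguf-py/scripts/gguf_dump.py --no-tensors Llama-3.1-405B-Instruct-fixed.gguf | grep tokenizer.ggml.pre
```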
## imatrix
Generated from the Q2_K quant.
imatrix calibration data: `groups_merged.txt` (from [froggeric/imatrix](https://huggingface.co/datasets/froggeric/imatrix))
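A minimal sketch of how an importance matrix like this can be produced with llama.cpp's `llama-imatrix` tool (file names here are hypothetical):
```
# Compute an importance matrix over the calibration data;
# -o names the output file later passed to llama-quantize via --imatrix.
./llama-imatrix -m Meta-Llama-3.1-405B-Instruct.Q2_K.gguf \
    -f groups_merged.txt -o imatrix.dat
```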