---
license: llama3.1
tags:
- gguf
- llama3
pipeline_tag: text-generation
datasets:
- froggeric/imatrix
language:
- en
library_name: ggml
---

# Meta-Llama-3.1-405B-Instruct-GGUF

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6604e5b21eb292d6df393365/o7DiWuILyzaPLh4Ne1JKr.png)

Low-bit quantizations of Meta's Llama 3.1 405B Instruct model, quantized from the Ollama Q4_0 GGUF.

Quantized with llama.cpp release [b3449](https://github.com/ggerganov/llama.cpp/releases/tag/b3449).

| Quant       | Notes                                      |
|-------------|--------------------------------------------|
| BF16        | Brain floating point, very high quality, smaller than F16 |
| Q8_0        | 8-bit quantization, high quality, larger size |
| Q6_K        | 6-bit quantization, very good quality-to-size ratio |
| Q5_K        | 5-bit quantization, good balance of quality and size |
| Q5_0        | Alternative 5-bit quantization, slightly different balance |
| Q4_K_M      | 4-bit quantization, good for production use |
| Q4_K_S      | 4-bit quantization, faster inference, efficient for scaling |
| Q4_0        | Basic 4-bit quantization, good for experimentation |
| Q3_K_L      | 3-bit quantization, high-quality with more VRAM requirement |
| Q3_K_M      | 3-bit quantization, good balance between speed and accuracy |
| Q3_K_S      | 3-bit quantization, faster inference with minor quality loss |
| Q2_K        | 2-bit quantization, suitable for general inference tasks |
| IQ2_S       | Integer 2-bit quantization, optimized for small VRAM environments |
| IQ2_XXS     | Integer 2-bit quantization, best for ultra-low memory footprint |
| IQ1_M       | Integer 1-bit quantization, usable |
| IQ1_S       | Integer 1-bit quantization, not recommended |

For higher-quality quantizations (Q4 and above), please refer to [nisten/meta-405b-instruct-cpu-optimized-gguf](https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf).
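
Any of these files can be run directly with the llama.cpp CLI. A minimal sketch, assuming binaries built from the release above and a placeholder file name (for split GGUFs, point `-m` at the first shard):
```
./llama-cli -m Meta-Llama-3.1-405B-Instruct-Q2_K.gguf \
    -p "Why is the sky blue?" -n 128 -c 4096
```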

Regarding the `smaug-bpe` pre-tokenizer: this makes no difference, since the `smaug-bpe` and `llama-bpe` pre-tokenizers are identical. However, if you have concerns, you can use the following command to set the pre-tokenizer to `llama-bpe`:
```
./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" Llama-3.1-405B-Instruct-old.gguf Llama-3.1-405B-Instruct-fixed.gguf
```
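
To check which pre-tokenizer a file actually carries, you can dump its metadata and look for the `tokenizer.ggml.pre` key. A quick sketch using the `gguf_dump.py` script from the same `gguf-py/scripts` directory (file name is a placeholder):
```
./gguf-py/scripts/gguf_dump.py Llama-3.1-405B-Instruct-fixed.gguf | grep tokenizer.ggml.pre
```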

## imatrix

The imatrix was generated from the Q2_K quant.

imatrix calibration data: `groups_merged.txt`
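
For reference, this is roughly how an importance matrix is produced and applied with the llama.cpp tools. A sketch of the general workflow with placeholder file names; the exact commands used for this repo may have differed:
```
# Generate the importance matrix from the Q2_K quant using the calibration text
./llama-imatrix -m Meta-Llama-3.1-405B-Instruct-Q2_K.gguf -f groups_merged.txt -o imatrix.dat

# Apply it when producing a low-bit quant (e.g. IQ2_S) from the higher-bit source
./llama-quantize --imatrix imatrix.dat Meta-Llama-3.1-405B-Instruct-Q4_0.gguf Meta-Llama-3.1-405B-Instruct-IQ2_S.gguf IQ2_S
```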