---
license: llama3.1
tags:
  - gguf
  - llama3
pipeline_tag: text-generation
datasets:
  - froggeric/imatrix
language:
  - en
library_name: ggml
---

# Meta-Llama-3.1-405B-Instruct-GGUF


Low-bit quantizations of Meta's Llama 3.1 405B Instruct model, quantized from the Ollama q4_0 GGUF.

Quantized with llama.cpp b3449.

## Quant Notes

| Quant | Notes |
| --- | --- |
| Q2_K | Suitable for general inference tasks |
| IQ2_XXS | Best for an ultra-low memory footprint |
| IQ2_S | Optimized for small VRAM environments |
| Q3_K_M | Good balance between speed and accuracy |
| Q3_K_S | Faster inference with minor quality loss |
| Q3_K_L | Higher quality, with a larger VRAM requirement |
| Q4_K_M | Superior balance, suitable for production (although this is dequantized from q4_0, don't expect higher quality) |
| Q4_0 | Basic quantization, good for experimentation |
| Q4_K_S | Fast inference, efficient for scaling |

For higher quality quantizations (q4+), please refer to nisten/meta-405b-instruct-cpu-optimized-gguf.
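
As a minimal usage sketch (the filename below is illustrative; substitute the quant and shard you actually download), any of these quants can be run with llama.cpp's `llama-cli`, pointing it at the first shard of a split GGUF:

```sh
# Illustrative filename; for split GGUFs, pass the first shard and the rest are loaded automatically
./llama-cli \
  -m Meta-Llama-3.1-405B-Instruct.Q2_K-00001-of-00005.gguf \
  -p "Explain GGUF quantization in one paragraph." \
  -n 256 -c 4096
```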

Regarding the smaug-bpe pre-tokenizer: it makes no difference in practice, as the smaug-bpe and llama-bpe pre-tokenizers are identical. However, if you have concerns, you can use the following command to set the llama-bpe pre-tokenizer:

```sh
./gguf-py/scripts/gguf_new_metadata.py --pre-tokenizer "llama-bpe" Llama-3.1-405B-Instruct-old.gguf Llama-3.1-405B-Instruct-fixed.gguf
```
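
If you want to confirm which pre-tokenizer a file carries, one option (a sketch, assuming the gguf-py scripts from the same llama.cpp checkout and an illustrative filename) is to dump the metadata and look at the `tokenizer.ggml.pre` key:

```sh
# Illustrative filename; tokenizer.ggml.pre should read llama-bpe after the fix (smaug-bpe before)
./gguf-py/scripts/gguf_dump.py --no-tensors Llama-3.1-405B-Instruct-fixed.gguf | grep tokenizer.ggml.pre
```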

## imatrix

Generated from the Q2_K quant.

imatrix calibration data: groups_merged.txt
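
For reference, an importance matrix like this is typically produced with llama.cpp's `llama-imatrix` tool; the invocation below is a sketch with illustrative filenames, not the exact command used for this repository:

```sh
# Illustrative filenames; -f is the calibration text, -o the resulting importance matrix
./llama-imatrix \
  -m Meta-Llama-3.1-405B-Instruct.Q2_K.gguf \
  -f groups_merged.txt \
  -o imatrix.dat
```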