What is the meaning of these suffixes?

#3
by zenwangzy24 - opened

What is the meaning of these suffixes, like Q5_K?

I got the answer from ChatGPT, does it make sense?
Q2_K, Q3_K_L, Q3_K_M, Q3_K_S: These appear to specify a version or configuration of the model. "Q" might stand for "Quarter" or another relevant metric, followed by a number that could indicate a version number or a sequence. "K" might represent a specific configuration or feature, and "L", "M", "S" might indicate different sizes or performance levels (e.g., Large, Medium, Small).
Q4_0, Q4_1: Here, "Q4" might similarly indicate a version of the model, with the following numbers "0" and "1" potentially representing different iterations or variants of that version.
Q5_0, Q5_1, Q5_K_M, Q5_K_S: Similarly, "Q5" represents another version, with "0" and "1" possibly being different iterations, and "K_M" and "K_S" indicating specific configurations or sizes.
Q6_K, Q8_0: These are different version numbers again, with "Q6" and "Q8" potentially marking two different points in a sequence, and "K" and "0" possibly signifying specific configurations or iterations.

They are different levels of quantization.
Smaller Q numbers indicate heavier quantization (i.e. greater quality loss) but with reduced memory usage. K means it uses llama.cpp's K-type quants; a name without K, such as Q4_0, uses an older quant method.
The S, M, L (small, medium, large) just mean more or less quantization within that same level (e.g. Q3_K_S is quantized more heavily than Q3_K_L).
I'm not an expert in this field but I hope you get the idea.
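
To make the naming scheme concrete, here is a rough sketch that splits a suffix into the pieces described above. The function name `parse_quant_suffix` and the regular expression are my own illustration, not anything from llama.cpp, and the pattern only covers the forms that appear in this thread (Q4_0, Q6_K, Q3_K_S and so on).

```python
import re

# Matches suffixes such as Q4_0, Q5_1, Q6_K, Q3_K_S, Q5_K_M.
# Assumption: only the forms mentioned in this thread are covered.
SUFFIX_RE = re.compile(r"Q(?P<level>\d+)_(?P<kind>K|\d)(?:_(?P<mix>[SML]))?")


def parse_quant_suffix(suffix):
    """Split a quant suffix into its level, K-quant flag, and S/M/L mix."""
    match = SUFFIX_RE.fullmatch(suffix)
    if match is None:
        raise ValueError(f"unrecognized quant suffix: {suffix!r}")
    return {
        "level": int(match.group("level")),      # smaller = heavier quantization
        "k_quant": match.group("kind") == "K",   # True for K-type quants
        "mix": match.group("mix"),               # "S", "M", "L", or None
    }


print(parse_quant_suffix("Q3_K_S"))
print(parse_quant_suffix("Q4_0"))
```

A name like Q4_0 comes back with `k_quant` set to `False`, which corresponds to the older quant method mentioned above.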

When I run this Meta-Llama-3-8B-Instruct.Q6_K.gguf in LM Studio, it shows Meta-Llama-3-7B-Instruct.Q6_K.
Why is that? Is that normal?

One thing I really miss about TheBloke's uploads is that he provided estimated VRAM usage for each quant type. Is there any way to determine that?

Quant Factory org

@x3v0 Not sure about that yet, but we will see if we can include those estimates.

Quant Factory org

@x3v0 The VRAM can be estimated as the size of the file you want to load plus some buffer for context (1-2 GB could be fine). E.g., if you want to load Q2_K (3.18 GB), you would need approximately 4.18 GB of VRAM or more to run it.
I will try to include these in the model descriptions soon.
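
As a quick sketch of that rule of thumb (file size plus a context buffer), something like the following would do. The function name `estimate_vram_gb` and the 1 GB default buffer are assumptions for illustration; the real requirement depends on context length and backend.

```python
def estimate_vram_gb(file_size_gb, context_buffer_gb=1.0):
    """Rough lower bound on VRAM: model file size plus a buffer for context.

    Assumption: the buffer is a flat 1-2 GB, as suggested in this thread.
    """
    return file_size_gb + context_buffer_gb


# Q2_K example from the post above (3.18 GB file, 1 GB buffer).
print(f"{estimate_vram_gb(3.18):.2f} GB")
```

Passing `context_buffer_gb=2.0` gives the more conservative end of the 1-2 GB range.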

Do you have a recommended version in terms of the tradeoff between quality loss and VRAM usage?

@cbML I usually go with Q6_K as the default, then if I run into trouble (like not enough VRAM, or inference that is too slow) I drop to Q5_K_M, Q5_K_S or Q4_K_M.
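
That fallback order can be sketched as a small helper that walks the list and returns the first quant that fits in VRAM. The name `pick_quant`, the 1.5 GB buffer, and the file sizes in the example are all hypothetical; plug in the real sizes from the model's file list. It only covers the VRAM case, not slow inference.

```python
# Preference order taken from the post above: best quality first.
PREFERENCE = ["Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M"]


def pick_quant(vram_gb, file_sizes_gb, buffer_gb=1.5):
    """Return the first preferred quant whose file plus buffer fits in VRAM.

    Assumption: VRAM needed is file size plus a flat context buffer.
    Returns None if nothing in the preference list fits.
    """
    for name in PREFERENCE:
        size = file_sizes_gb.get(name)
        if size is not None and size + buffer_gb <= vram_gb:
            return name
    return None


# Made-up file sizes in GB, for illustration only.
example_sizes = {"Q6_K": 6.6, "Q5_K_M": 5.7, "Q5_K_S": 5.6, "Q4_K_M": 4.9}
print(pick_quant(8.0, example_sizes))
```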

You can look at the perplexity (PPL) changes caused by different quantization methods, measured on Llama 2 70B, here: https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md

I spent a while looking into this, and this is the best info I could find, based on a conversation with Claude and some impressive research/data on this topic (https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md?plain=1). Here is how the quants compare, from the full-size model all the way down to Q2:

Full size model (FP16) vs Q8_0:
Practically indistinguishable. Even very discerning users would be hard-pressed to notice any difference in outputs.

Q6_K:
Still very close to the original. Most users wouldn't notice any difference in day-to-day use.
In extremely nuanced or complex tasks, there might be very occasional, minor discrepancies.

Q5_K_M/S:
For general use, still very good. Most users wouldn't notice issues in typical interactions.
In more demanding tasks (e.g., complex reasoning, nuanced language understanding), there might be occasional small errors or slightly less precise answers.
Creative writing might be a tiny bit less polished, but still high quality.

Q4_K_M/S:
General use still quite good, but more discerning users might start to notice some differences.
More frequent minor errors or imprecisions in complex tasks.
Slightly less nuanced understanding of context or subtle implications.
Creative outputs might be a bit less sophisticated or varied.

Q3_K_L/M:
Noticeable reduction in quality for more complex tasks.
More frequent errors or misunderstandings, especially with nuanced queries.
Less consistent performance in long-form content generation.
May struggle more with very specific or technical topics.

Q2_K:
Significant reduction in overall quality, noticeable to most users.
More frequent and obvious errors in various tasks.
Less coherent or relevant responses to complex queries.
Reduced ability to handle nuance or context.
Creative writing would be notably less sophisticated.

In the context of llama.cpp's quantization methods, the letters "S", "M", and "L" in the filenames stand for Small, Medium, and Large, respectively. These letters indicate different quantization mixes that balance model size and performance quality.

Here's how they differ:

Small ("S"): The model uses a uniform quantization type across all tensors, prioritizing the smallest possible size. For example:
LLAMA_FTYPE_MOSTLY_Q3_K_S: Uses GGML_TYPE_Q3_K for all tensors.

Medium ("M"): Certain critical tensors are quantized with higher precision to improve performance while keeping the model size moderate. For example:
LLAMA_FTYPE_MOSTLY_Q3_K_M: Uses GGML_TYPE_Q4_K for tensors like attention.wv, attention.wo, and feed_forward.w2, and GGML_TYPE_Q3_K for the rest.

Large ("L"): More tensors are quantized at higher precision, leading to better quality but a larger model size. For example:
LLAMA_FTYPE_MOSTLY_Q3_K_L: Uses GGML_TYPE_Q5_K for key tensors, with the rest being GGML_TYPE_Q3_K.

Why This Matters:

Model Size vs. Quality Trade-off: The different quantization mixes allow users to choose a model that fits their hardware constraints while achieving the desired performance.
Flexible Deployment: Smaller models (with "S") are suitable for edge devices with limited resources, while larger models (with "L") can be used where more computational power is available.

Summary:

"S" (Small): Prioritizes minimal size, uses lower-precision quantization throughout.
"M" (Medium): Balances size and quality by using higher precision on essential tensors.
"L" (Large): Focuses on quality with higher precision on more tensors, resulting in a larger model.

Reference from the Posts:

The quantization mixes are defined as follows in the llama.cpp code:

LLAMA_FTYPE_MOSTLY_Q3_K_S: Uses GGML_TYPE_Q3_K for all tensors.
LLAMA_FTYPE_MOSTLY_Q3_K_M: Uses GGML_TYPE_Q4_K for certain tensors, else GGML_TYPE_Q3_K.
LLAMA_FTYPE_MOSTLY_Q3_K_L: Uses GGML_TYPE_Q5_K for certain tensors, else GGML_TYPE_Q3_K.

These definitions illustrate how the "S," "M," and "L" variants adjust the quantization types for specific tensors to achieve the desired balance between model size and performance.
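
To illustrate, the three Q3_K mixes can be written as a small lookup table. This is only a sketch, not llama.cpp's actual code: `tensor_type` is a made-up helper, the example tensor names are hypothetical, and it assumes the "certain tensors" for Q3_K_L are the same three listed for Q3_K_M (attention.wv, attention.wo, feed_forward.w2).

```python
# Tensors that get the higher-precision type in the M and L mixes.
# Assumption: Q3_K_L boosts the same tensors that are listed for Q3_K_M.
HIGH_PRECISION_TENSORS = ("attention.wv", "attention.wo", "feed_forward.w2")

# ftype -> (type for the key tensors, type for everything else)
MIXES = {
    "LLAMA_FTYPE_MOSTLY_Q3_K_S": ("GGML_TYPE_Q3_K", "GGML_TYPE_Q3_K"),
    "LLAMA_FTYPE_MOSTLY_Q3_K_M": ("GGML_TYPE_Q4_K", "GGML_TYPE_Q3_K"),
    "LLAMA_FTYPE_MOSTLY_Q3_K_L": ("GGML_TYPE_Q5_K", "GGML_TYPE_Q3_K"),
}


def tensor_type(ftype, tensor_name):
    """Return the quant type a tensor would get under the given mix."""
    boosted, default = MIXES[ftype]
    if any(key in tensor_name for key in HIGH_PRECISION_TENSORS):
        return boosted
    return default


# Hypothetical tensor name, for illustration only.
for ftype in MIXES:
    print(ftype, tensor_type(ftype, "layers.0.attention.wv.weight"))
```

The same tensor comes out as Q3_K, Q4_K, or Q5_K depending on whether the S, M, or L mix is chosen, while every other tensor stays at Q3_K.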

Sources:
https://github.com/ggerganov/llama.cpp/pull/1684
https://github.com/ggerganov/llama.cpp/discussions/5063
