Regarding quantized .gguf models
Wanted to ask which quantized version gives outputs closest to the original 7B TIR hosted on Numina (safetensors). I tried the q6_k and it performs decently.
Specifically, which would you recommend from the versions available:
f16
q5_k
q6_k
q8_0
q8_p
I will use the suggested version to further fine-tune the model on various other datasets.
They are in this order:
f16
q8_0
q6_k
q5_k
but q6_k is not degraded as far as I can tell.
Try them all and decide... there is obviously a trade-off with size, but all of these versions keep the output and embedding tensors in f16, which makes them much better than the standard quantizations.
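If you want a quick side-by-side, here is a minimal sketch using llama-cpp-python (the filenames and the prompt are placeholders - substitute the files you actually downloaded); temperature 0 keeps decoding deterministic so the files are directly comparable:

```python
from llama_cpp import Llama

# Placeholder filenames - point these at the .gguf files you downloaded.
QUANTS = [
    "numina-math-7b-tir.f16.gguf",
    "numina-math-7b-tir.q8_0.gguf",
    "numina-math-7b-tir.q6_k.gguf",
    "numina-math-7b-tir.q5_k.gguf",
]

PROMPT = "Solve for x: 3x + 5 = 20."  # placeholder prompt

for path in QUANTS:
    # Load one file at a time; the f16 file alone is ~14 GB for a 7B model.
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    out = llm(PROMPT, max_tokens=256, temperature=0.0)  # greedy decoding
    print(f"--- {path} ---")
    print(out["choices"][0]["text"].strip())
    del llm  # release the weights before loading the next file
```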
I tried the q6_k and it seems to work well (at least for the initial prompt inferences). I had downloaded a q8_0 from someone else - https://huggingface.co/reach-vb/NuminaMath-7B-TIR-Q8_0-GGUF - and the outputs were completely off (continuous hallucinations). I will try your q8_0; hopefully that is something I can work with. The best (and largest) model I could get working with open-webui so far was that other q8_0, so hopefully I can compare your q8_0 against the q6_k and decide which to keep for the rest of my project.
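One thing I might do to understand the difference between the two q8_0 files is check whether each keeps the output and embedding tensors in f16, as you mentioned. A minimal sketch with the gguf Python package (the path is hypothetical - point it at either file):

```python
from gguf import GGUFReader  # pip install gguf

# Hypothetical local path - point it at the q8_0 file to inspect.
reader = GGUFReader("NuminaMath-7B-TIR-Q8_0.gguf")

# Standard llama.cpp names for the embedding and output-head tensors.
for t in reader.tensors:
    if t.name in ("token_embd.weight", "output.weight"):
        print(t.name, "->", t.tensor_type.name)  # e.g. F16 vs Q8_0
```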
As always, thanks for the prompt reply!