Graphed by gpt4
Ah, thank you Ipechman. What a sweet sight, thank you very much!
Your graph can only be wrong if my data are wrong, or were not entered in the right order.
One exception: I see a mismatch on Wintergoddess 32k TQA, which is at 39.65728274 and not 20. The rest seems coherent.
The IQ3_XXS quant is absolutely amazing: it goes beyond 3_K_XS & 3_K_S and rivals Q3_K_M as well (even if the token divergence is likely higher on IQ3_XXS than on Q3_K_M, as Artefact2 illustrated on his graph).
Hey guys, can you please share how much vRAM the smallest models require?
Well, the size of the model + a couple of gigabytes for 4k context.
So, even for IQ2_XXS, you'll need 24GB for a full offload, and 16GB for a quasi-full offload (70+ layers) with the Lowvram option.
BUT, by offloading something like 45 layers instead of 81 to your GPU with LlamaCPP or KoboldCPP, you should be able to run IQ2_XXS on a 3060 with 12GB VRAM at a low but nevertheless sustainable speed, getting an answer in a few minutes. The Lowvram option is also useful. Check the documentation of your inference tool to get the exact command line for your needs, and run tests with GPU-Z to monitor your VRAM usage precisely.
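If you want a rough starting point before testing, here is a minimal back-of-the-envelope sketch of that rule of thumb (model size + a couple of gigabytes for context). It assumes every layer takes an equal share of the file size, which is only approximately true, and the function name and the 18 GB file size in the example are my own illustrative assumptions, not measured values. Treat the result as a first guess and confirm with GPU-Z.

```python
def estimate_offload_layers(model_size_gb, n_layers, vram_gb, context_overhead_gb=2.0):
    """Rough guess at how many layers fit on the GPU.

    Assumption: all layers are the same size (model_size_gb / n_layers),
    and context_overhead_gb is reserved for the context ("a couple of
    gigabytes" at 4k). Real usage varies by tool and settings.
    """
    budget_gb = vram_gb - context_overhead_gb
    if budget_gb <= 0:
        return 0  # nothing left for layers once context is reserved
    # Layers that fit in the remaining budget, capped at the model's layer count.
    fit = int(budget_gb * n_layers / model_size_gb)
    return min(n_layers, fit)


# Hypothetical example: an 18 GB file with 81 layers on a 12 GB card.
print(estimate_offload_layers(18.0, 81, 12.0))
```

With those hypothetical numbers the estimate is 45 layers; start there and lower the count if you run out of VRAM.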
Source?
"as Artefact2 illustrated on his graph".
https://huggingface.co/Artefact2 is the guy; he published these graphs in the discussions of his models.