bartowski
posted an update 24 days ago
Looks like Q4_0_N_M file types are going away

Before you panic, there's a new "preferred" method: online (I prefer the term on-the-fly) repacking. If you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (with minor losses, I think, from using intrinsics instead of assembly, but intrinsics are more maintainable).
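To make "interleaved rows" a bit more concrete, here is a toy sketch in plain Python. It is not llama.cpp's actual layout or code: the function name `interleave_rows` and the `n_rows` and `chunk` parameters are made up for illustration. The idea it shows is only the general one, that a group of rows is rewritten so that the matching chunks of each row sit next to each other in memory.

```python
def interleave_rows(rows, n_rows=4, chunk=4):
    """Toy repack: interleave groups of n_rows rows, chunk elements at a time.

    Hypothetical helper for illustration only; real repacking works on
    quantized blocks, not on plain Python lists.
    """
    if not rows or len(rows) % n_rows != 0:
        raise ValueError("row count must be a non-zero multiple of n_rows")
    width = len(rows[0])
    if any(len(r) != width for r in rows) or width % chunk != 0:
        raise ValueError("rows must share a width that is a multiple of chunk")

    packed = []
    for g in range(0, len(rows), n_rows):
        group = rows[g:g + n_rows]
        out = []
        # Walk along the rows chunk by chunk, taking one chunk from each row
        # of the group before moving on to the next chunk position.
        for start in range(0, width, chunk):
            for row in group:
                out.extend(row[start:start + chunk])
        packed.append(out)
    return packed


if __name__ == "__main__":
    rows = [[r * 10 + c for c in range(4)] for r in range(2)]
    print(interleave_rows(rows, n_rows=2, chunk=2))
```

With two rows `[0, 1, 2, 3]` and `[10, 11, 12, 13]` and a chunk of 2, the result is a single packed row holding the first chunk of each row, then the second chunk of each row. On-the-fly repacking does this kind of reshuffle at load time, so only the plain Q4_0 file needs to be stored.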

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should run at the same speed (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably at the end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!

Also, IQ4_NL supports repacking, though not in as many shapes yet. It should still get a respectable speedup on ARM chips. The PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon, since those chips use the GPU and don't benefit from repacking the weights

Huh, interesting. However, all inference engines need to adopt the newer llama.cpp version, correct? For both Q4_0 and IQ4_NL? I just scrolled through the pull request. How do you know IQ4_NL should work this way as well?

oh right sorry, forgot to include that PR, i'll add it above but it's here:

https://github.com/ggerganov/llama.cpp/pull/10541

I think the inference engines will just need to update to the newer versions and they'll get the repacking logic for free, if that's what you meant then yes

Btw for anyone late to the game - Q4_0_N_M quants should still work as expected in KoboldCpp. The runtime repack for q4_0 should work as well, so you have multiple options.

hell yeah. wish we could still offline compile, i get why it's not sustainable in the future but also until there's better support and more options would be nice to keep it around

Interesting. In that case, will the description of Q4_0 ("Legacy format, generally not worth using over similarly sized formats") change to something like "ARM recommended (do not use on Apple Silicon)"? Or will IQ4_NL be added to the list and recommended over Q4_0?

I've updated it to "Legacy format, offers online repacking for ARM and AVX CPU inference." It is still legacy overall, but with online repacking it is worth considering for speed

I'm hoping that IQ4_NL gets a few more packing options in the near future

Hello bartowski, have you considered doing q4_1 quants? My testing has consistently found that q4_1 is the best quant for Macs. More info here:
https://huggingface.co/mradermacher/model_requests/discussions/299

Don't love adding more formats but if your results are accurate it does seem worth including

A bit annoying, isn't it? Some time ago I asked you for an ARM version of gemma-2-9b-it-abliterated, and now it won't work anymore. I guess there is no Q4_0?

oh, yeah, of course.. I added all the ARM quants but then not Q4_0 which is now the only one that would work haha..

I'll go and make a Q4_0 for it I suppose! just this once

Now that the software I'm using has updated its llama.cpp version, I'm changing GGUFs. I don't get what's meant by IQ4_NL: does this include IQ4_XS? So is IQ4_XS also supposed to perform well on ARM, or just Q4_0?

On a side note, since I had good performance with Q4_K_S in the past, I wish it would also benefit from these changes.

No, it does not include XS. I think the reason Q4_0 and IQ4_NL work is that they don't do any clever packing of the scaling factors; that's why K quants and IQ4_XS (which is like NL but with some K-quant logic) don't work yet
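For anyone wondering what "no clever packing of the scaling factors" looks like, here is a rough sketch of decoding one Q4_0 block in plain Python. It assumes the block layout as I understand it from ggml (one fp16 scale followed by 16 bytes holding 32 four-bit values, each offset by 8); the function and constant names are mine, not llama.cpp's, so treat it as an illustration rather than a reference.

```python
import struct

QK4_0 = 32                    # weights per block (assumed Q4_0 block size)
BLOCK_BYTES = 2 + QK4_0 // 2  # one fp16 scale + 16 bytes of packed nibbles


def dequantize_q4_0_block(block):
    """Decode one Q4_0-style block into 32 floats (illustrative helper)."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("expected %d bytes, got %d" % (BLOCK_BYTES, len(block)))

    # The scale is a single plain little-endian fp16 value, stored as-is.
    (d,) = struct.unpack("<e", block[:2])
    qs = block[2:]

    out = [0.0] * QK4_0
    half = QK4_0 // 2
    for j, byte in enumerate(qs):
        # Low nibble holds weight j, high nibble holds weight j + 16.
        out[j] = ((byte & 0x0F) - 8) * d
        out[j + half] = ((byte >> 4) - 8) * d
    return out


if __name__ == "__main__":
    demo = struct.pack("<e", 0.5) + bytes([0x80] * 16)
    print(dequantize_q4_0_block(demo))
```

Because every block carries its own scale in plain form, blocks from several rows can be shuffled next to each other without touching how the scale is read. Formats that bit-pack the scales of several sub-blocks together would need extra handling before they could be repacked the same way.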