iMatrix, IQ2_XS & IQ2_XXS

#2
by Nexesenex - opened

Hey Wolfram,

Did you use an iMatrix for IQ3_XXS? (Wikitext? Which ctx size and how many chunks?)
And if yes, could you use it to make IQ2_XS & IQ2_XXS quants?

I'm stuck with 36GB of VRAM and I would love to test this second version of Miquliz!

Also, a 103b-parameter version would be lovely for reaching higher context, if it doesn't damage the model much.
Then, the upcoming IQ1_S quant might allow people with 24GB VRAM to test the model, and to fully offload a 103b.

Thanks for all your great work!

I second this. Would it be possible to quantize it to IQ2_XS or IQ2_XXS, please?

@Nexesenex :

Quantized the IQ3_XXS with default settings, so it didn't need an imatrix. The smaller quants do, and I wasn't sure which one would be best, so I didn't make those yet.

Is there already a consensus on what imatrix dataset to use? I've read "Importance matrix calculations work best on near-random data" (ggerganov/llama.cpp, Discussion #5006), but there doesn't seem to be a clear recommendation, and the thread ended abruptly without a real conclusion.

So if I'm going to do this, I'd like to do it properly, and that means I need more information. Happy to get any further info or pointers to look into this some more.

Hey @wolfram ,

IQ quants benefit more from an iMatrix than K quants do, and the perplexity drop with an iMatrix is not negligible. It is already more than 1% at Q3_K_S size (I tried with and without on a linear rope 8 WinterGoddess 32k model), so it is much larger for a "heavily driven quant" like the smaller IQ quants: the lower the bpw of the quant, the higher the iMatrix benefit.

And there is a consensus on the necessity.

According to this PR:
https://github.com/ggerganov/llama.cpp/pull/5334

"IQ3_XXS can give a very bad quantization when used without an importance matrix (imatrix), see #5332.

Instead of adding a warning or even disallowing IQ3_XXS quantization without an imatrix, this PR prevents a bad outcome by using Q3_K for the attn_v tensors, and a mix of Q4_K and Q3_K for the ffn_down tensors when no imatrix has been supplied. This results in a somewhat larger quantized model (e.g., 2.61 GiB vs 2.5 GiB for 7B LLaMAs) but a more reasonable PPL (e.g., 5.4923 for LLaMA-v2-7B and a context of 4096 vs 100+)."

On methodology, I read all these threads extensively, and what I'd recommend from them is an iMatrix built on wiki.train.raw with 512 ctx and 2,000 chunks.
That's the solution adopted by Artefact2, and his quantizations are my baseline for now.
Beyond that, the person to ask for his settings is ikawrakow, because he's the author of the iMatrix system and the IQ quants, and his IQ quants of Llama2 were excellent from day one.

As for myself, with my rig I can't do better than ctx 32 for 70b models (which performed well in my extensive tests on small models, starting from 50 chunks or even 25, lol). I kept that setup for most of my quants, even for lower-parameter models, but with more chunks (hundreds to thousands). But that's the iMatrix of the poor.

Thank you for the super detailed comment. All of that is very insightful and warrants further investigation and experimentation. So I'm going to make a new IQ3_XXS with imatrix for comparison.

Glad that I found this discussion. Once you have an idea of the best way forward, I would love it if you could re-upload the IQ3_XXS version with an imatrix. Thanks in advance!

I am uploading IQ2_XXS/XS to https://huggingface.co/dranger003/miquliz-120b-v2.0-iMat.GGUF for those interested.
And thanks to @wolfram for this great model! I think I'm going to post my IQ3_XXS as well so we can compare both imatrix models.

@dranger003 Hey, that's great, you beat me to it - saving my weekend. Oh, and you even made a Q8, that's excellent.

I'll update my model cards and link to your quants. Thank you very much!

Thanks a lot both @dranger003 and @wolfram

As an AI researcher and consultant, I will also take a deep dive into the importance matrix concept soon. From my theoretical thinking, we should be able to heavily increase coherence and model output quality by using best practices in these areas; these have to be found by experimenting and working together.

It will be very interesting to see what we as a community can do, led by Iwan who makes these excellent SOTA quants and of course Georgi from llama.cpp!

Well, @Xonaz81 , that thread might interest you.

https://github.com/ggerganov/llama.cpp/discussions/5263

An iMatrix can boost the precision of GGUF quants not only for English but for the other languages supported by an LLM as well, even all at once when the iMatrix is built from a polyglot dataset.

@Nexesenex Hi, I would also like to improve the quality of my quants, especially for DarkForest-20B-v2.0. Could you please share which dataset you are using for calibration (wiki, something custom?), and how to compute the imatrix and then use it during quantization? A command line example would be greatly appreciated, I'm a little bit lost right now. Thank you in advance.

Hello @TeeZee .

For now, I'm using wiki.train.raw to train my iMatrixes. But it's English only.
I spent a lot of time on Kyllene, and a polyglot iMatrix, for example, would very much help the Yi 34b models make fewer grammatical mistakes, to the point of being usable in French and German (I invite you to read this thread: https://github.com/ggerganov/llama.cpp/discussions/5263 ). I made a request for such a training file here: https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8506910

These are the Windows command lines I used for Kyllene after converting it to Q8_0 (I'm too short on RAM to handle FP16 efficiently, so I give up <0.1% of perplexity in the conversion to Q8_0). I have 64GB of RAM and 36GB of VRAM.

  1. iMatrix

imatrix -m Y:\text-generation-webui\models\TeeZee_Kyllene-34B-v1.1-b1924-Q8_0.gguf -f wiki.train.raw -o Y:\iMatrix\TeeZee_Kyllene-34B-v1.1-b1924-Q8_0.iMatrix_Wiki_c32_ch3250.dat --verbosity 1 --keep-imatrix 250 -b 32 -ngl 99 -mg 0 -ts 24,12 -c 32 --chunks 3250

-c sets the context size, --chunks the number of chunks to process. 32/2500 or 512/2000 (ctx/chunks) are good compromises.

  2. Quantize

quantize --allow-requantize --imatrix Y:\iMatrix\TeeZee_Kyllene-34B-v1.1-b1924-Q8_0.iMatrix_Wiki_c32_ch3250.dat Y:\text-generation-webui\models\TeeZee_Kyllene-34B-v1.1-b1924-Q8_0.gguf Z:\text-generation-webui\models\TeeZee_Kyllene-34B-v1.1-b2128-iMat-c32_ch3250-IQ1_S.gguf IQ1_S

Adapt to the quantization desired, of course.
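For anyone who prefers to script the two steps above for several quant types, here is a minimal Python sketch. It only assembles the command lines and assumes the imatrix and quantize binaries are on the PATH. It uses only flags that already appear in the commands above; the build_commands helper, the file names and the output naming scheme are hypothetical placeholders, not anything from llama.cpp.

```python
from typing import List


def build_commands(model: str, calib: str, imatrix_out: str,
                   ctx: int, chunks: int, quant_types: List[str]) -> List[List[str]]:
    """Return one imatrix command, then one quantize command per quant type."""
    # Step 1: compute the importance matrix from the calibration text.
    commands = [[
        "imatrix", "-m", model, "-f", calib, "-o", imatrix_out,
        "-c", str(ctx), "--chunks", str(chunks),
    ]]
    # Hypothetical naming scheme for the output files.
    stem = model[:-len(".gguf")] if model.endswith(".gguf") else model
    # Step 2: one quantize call per target type, all reusing the same imatrix.
    for qtype in quant_types:
        commands.append([
            # --allow-requantize is there because the source here is already Q8_0.
            "quantize", "--allow-requantize", "--imatrix", imatrix_out,
            model, f"{stem}-iMat-{qtype}.gguf", qtype,
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("model-Q8_0.gguf", "wiki.train.raw",
                              "model.imatrix.dat", 512, 2000,
                              ["IQ2_XXS", "IQ2_XS", "IQ3_XXS"]):
        # To actually run a step: subprocess.run(cmd, check=True)
        print(" ".join(cmd))
```

Keeping each command as a list of arguments avoids quoting problems with Windows paths if the lists are later handed to subprocess.run.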

Hello @Nexesenex

Is there a generally recommended baseline for context size and chunk count before you see improvements?
I've recently tried computing imatrix files with different settings and testing them on IQ2_XS, IQ2_XXS, IQ3_XXS and even Q5_K_M quants to see if it made a difference.
Tried with many different types of datasets, like:

  • 20k_wikitext which some recommend
  • A mix of training data (open-hermes-2.5, openmathinstruct, open-orca), which I scrambled so that the data would be more random, but not incoherent, since I loaded the files one by one
  • A large file of random data, consisting of one random string per line plus random numbers, which I sampled from random.org

But when I measure the perplexity on wikitext2, the results seem to show either no change or a worse score.
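As an illustration of the scrambling idea in the second bullet, here is a small Python sketch that shuffles whole samples from several datasets, so the order is random while each sample stays coherent. It is only a guess at the general approach, not the actual script used; the function name, the seed and the demo strings are all made up.

```python
import random
from typing import Dict, List


def mix_calibration_samples(sources: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Pool whole samples from every dataset and shuffle them reproducibly."""
    # Sort dataset names so the result does not depend on dict insertion order.
    pooled = [sample for name in sorted(sources) for sample in sources[name]]
    # Shuffle at the sample level: the order is random, each text stays intact.
    rng = random.Random(seed)
    rng.shuffle(pooled)
    return pooled


if __name__ == "__main__":
    # Placeholder samples standing in for the real datasets.
    demo = {
        "chat": ["chat sample 1", "chat sample 2"],
        "math": ["math sample 1", "math sample 2"],
    }
    # Blank lines separate samples in the resulting calibration text.
    print("\n\n".join(mix_calibration_samples(demo, seed=42)))
```

The joined text can be written to a file and passed to the imatrix program with -f.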

Tried different context sizes, but I have no idea whether longer or shorter context sizes are ideal.
And will that somehow affect the model at longer context sizes, for better or worse?

Do you run the llama.cpp imatrix program with or without samplers?

I tried letting Imatrix run for 2500 chunks, thinking it was not enough compute time, but that didn't work either. But I haven't tried all combinations, since it takes a while.
So, either the calibration data I chose is bad, or I simply didn't run the calibration for long enough. Do you have any experience with how much data needs to be processed before getting results?

Or I'm doing something completely wrong or measuring wrongly, and I haven't seen it yet.

Kind regards, would greatly appreciate feedback :-)

Well @fieryTransition , you're already ahead of me in terms of depth of testing.

@Kalomaze , @Artefact2 , and of course @ikawrakow are the people you want to talk to, because they are the most knowledgeable about iMatrix, on the thread that I linked and the threads mentioned in that thread.

On my side, I observed that -c 32 --chunks 50 on wiki.train.raw is a good baseline for the iMatrix of the poor; it seems that most of the guidance offered by an iMatrix is already approximated from a few dozen chunks. I didn't observe degradation at high context even with such an imatrix (I understand why you ask, I asked myself the same, lol), but the emerging consensus is ctx 512 and 2000 chunks (about 1 million tokens total). I did indeed observe a slight perplexity decrease with such an imatrix (0.1 to 0.3% compared to my small ones, and a bit less than that if I make 2000 or 3000 chunks at ctx 32 instead of 25 or 50).
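To put numbers on those settings, here is a small Python sketch of the arithmetic: the calibration token budget is the context size times the number of chunks, and a perplexity gain is the relative change against a baseline. Both helper names are made up for illustration.

```python
def calibration_tokens(ctx: int, chunks: int) -> int:
    """Tokens seen during imatrix computation: context size times chunk count."""
    return ctx * chunks


def relative_ppl_change(ppl_baseline: float, ppl_new: float) -> float:
    """Relative perplexity change in percent; negative means the new quant is better."""
    return (ppl_new - ppl_baseline) / ppl_baseline * 100.0


if __name__ == "__main__":
    # ctx/chunks settings mentioned in this thread.
    for ctx, chunks in [(32, 50), (32, 3250), (512, 2000)]:
        print(f"-c {ctx} --chunks {chunks}: {calibration_tokens(ctx, chunks):,} tokens")
    # -c 32 --chunks 50: 1,600 tokens
    # -c 32 --chunks 3250: 104,000 tokens
    # -c 512 --chunks 2000: 1,024,000 tokens
```

So the 512/2000 setting processes roughly ten times the text of 32/3250, and 640 times that of 32/50.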

Kind regards as well. :-)
