Generational loss?

#1
by FlareRebellion - opened

I don't intend this in a mean way, but is this conversion really worth it? Isn't this like making a copy of an already poor copy? Or, another analogy: it seems to me this is akin to transcoding a 128 kbit/s MP3 into a 64 kbit/s AAC. Maybe I'm entirely off base, but this must be many times worse than an fp16-based Q3_K_M. I'd rather offload fewer layers to the GPU, use the original Q4 and deal with it being slow.

And you're perfectly correct to do as you see fit.

Now, my take: a Q5_K_M quant is actually not "so far away" from an fp16.

Untitled.png

Credit for the graph : Artefact2

And a Q3_K_M quant made with an iMatrix from Q5_K_M (especially with a Q8_0 intermediary step) rivals a Q3_K_M quant made from fp16 without one, by regaining the lost perplexity, and of course beats hands down a Q2_K made from the fp16, as the benchmarks I'm running to verify this are actually demonstrating.

So, I think that your analogy with a 128 kbit/s MP3, which unlike Q5_K_M is already very lossy compared to the lossless source, is a bit harsh. See it instead as a 320 kbit/s file passed into .wav, then shrunk to 192 kbit/s, so folks who can't play the 320 or the 224 kbit/s don't have to go straight down to 128 kbit/s.

Well, yeah, you are right: in the median, as your graph shows, q5 is quite alright. The problem with repeated quantisation is that it exacerbates the bad quantisations, which are already quite significant in q5 (the q99 quantile).

Just a quick example: here's lenna.png converted to a q75 JPEG (yes, I know, apples and oranges, but I think it relates at least a little bit).

original:

image.png

q75:

image.png

looks quite alright, right?

and here's a q50 made from the q75 (ouch):

image.png

and this is the q50 from the original png we started with (not nearly as terrible)

image.png

The added degradation is quite shocking, at least to me, and entirely unexpected, since the q75 we started with seems so close to the original. But the previously imperceptible errors were compounded, leading to gross artifacts.
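The same experiment can be sketched on numbers instead of pixels. This is a toy, not llama.cpp's K-quant code: it assumes a simple per-block min/max quantizer, Gaussian random "weights", a block size of 32, and 5-bit and 3-bit widths picked only for illustration, and every function name is made up for the example. It compares quantizing straight to 3 bits against going through a 5-bit copy first, and reports both the RMS error and the 99th-percentile absolute error.

```python
import math
import random


def quantize_block(block, bits):
    """Round a block onto 2**bits evenly spaced levels between its min and max,
    then return the dequantized values (toy scheme, not a real GGUF format)."""
    lo, hi = min(block), max(block)
    steps = (1 << bits) - 1
    scale = (hi - lo) / steps
    if scale == 0.0:  # constant block: nothing to round
        return list(block)
    return [lo + round((x - lo) / scale) * scale for x in block]


def quantize(values, bits, block_size=32):
    """Apply quantize_block to consecutive blocks of block_size values."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size], bits))
    return out


def rms_error(reference, approx):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference))


def p99_abs_error(reference, approx):
    """99th-percentile absolute error (simple nearest-rank estimate)."""
    errors = sorted(abs(r - a) for r, a in zip(reference, approx))
    return errors[int(0.99 * (len(errors) - 1))]


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(32 * 512)]  # stand-in for fp16 weights

q5 = quantize(weights, 5)          # first generation
q3_direct = quantize(weights, 3)   # 3-bit straight from the original
q3_requant = quantize(q5, 3)       # 3-bit made from the 5-bit copy

for name, approx in [("5-bit", q5), ("3-bit direct", q3_direct), ("3-bit from 5-bit", q3_requant)]:
    print(f"{name:>17}: rms={rms_error(weights, approx):.5f}  p99={p99_abs_error(weights, approx):.5f}")
```

In this toy the block min and max survive the first pass, so both 3-bit versions land on practically the same grid, and the requantized copy can only match or exceed the direct one in error (up to floating-point rounding). How large the gap is in real K-quants, and whether an importance matrix closes it, is something only benchmarks on the actual models can settle.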

The specifics of musical, visual, and statistical compression are way beyond my pay grade, but I of course understand your point, which is pertinent. Two arguments and a conjecture in answer, though:

  1. My idea of loss for LLMs, which as I understand it are basically "sentence completion" engines built on a "statistical matrix", is that an fp16 is already a heavily compressed "database".

  2. Also, we're only adding a very small fraction to a "compression ratio" that is already huge, and thus a much smaller fraction than in the example you're showing me, which starts from an original source.

  3. Also, for LLMs, it's more Ogg Vorbis than MP3, in the sense that it's somehow "rescalable" if we're close enough to the "best approximation" (and that's the case with Q6_K and Q5_K_M). When I requant from Q5_K_M and bench, the perplexity of the resulting Q8_0 is actually a bit lower than that of the original Q5_K_M. I'm sorry I don't know the right technical words to explain what I mean better.

Then, the benchmarks are quite flawed, but I'm used to these numbers, and they check out. Also, in use, you can indeed feel the loss of a quant. But between a requant with iMatrix from Q5_K_M and a quant from fp16 without iMatrix, the difference in "loss" from the source is imperceptible to me. It's like multiplying 0.02 (an arbitrary number: the original compression of the fp16 compared to its training data) by 0.9 for a "proper quant", versus 0.01998 by 0.902 for the requant from Q5_K_M with the iMatrix.
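Spelled out, with the same arbitrary figures as in the paragraph above (they are placeholders for the sake of the argument, not measurements, and the variable names are only for this example):

```python
# Arbitrary illustrative figures from the paragraph above, not measurements.
proper_quant = 0.02 * 0.9      # fp16 "compression" times a quant made from fp16
requant = 0.01998 * 0.902      # Q5_K_M-derived source times an iMatrix requant

# How far apart the two products are, relative to the "proper quant".
relative_gap = abs(requant - proper_quant) / proper_quant

print(f"proper quant: {proper_quant:.8f}")
print(f"requant:      {requant:.8f}")
print(f"relative gap: {relative_gap:.3%}")
```

With these made-up inputs the two products differ by roughly a tenth of a percent, which is the scale of difference I mean.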

Forgive me if this is a noob question, but instead of trying to upscale the model to an fp16 70B model, would it be possible to just convert the Q5 itself into a base model as is? Say, in the ~20B range given the file's size, keeping the model's capabilities in the state they're already in. Or is such a thing not possible?

What you regain by reconverting to Q8_0 from Q5_K_M is 0.1% of perplexity at best, like an approximation that happens to turn out right. It can't get better than that, even if you go to fp16 from that Q5_K_M.
Some folks have already pulled an fp16 out of the Q5_K_M by dequantizing it, and it will be a base for further training.
But it can't be as good as the original fp16 was, only very close, and the benchmarks we have do not actually measure all the subtle degradations between a real fp16 and a Q5_K_M dequantized and converted to fp16. So that's still a loss.
But the fp16 dequantized from the Q5_K_M retains the capabilities that the Q5_K_M has, at least.
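A toy sketch of that last point, under loud assumptions: a made-up per-block min/max quantizer stands in for Q5_K_M and Q8_0, plain Python floats stand in for fp16, and all names are invented for the example. It checks that the dequantized values already sit on the 5-bit grid (requantizing them at 5 bits changes essentially nothing) and that storing them at 8 bits moves them by at most half an 8-bit step, so the distance to the original weights is still there.

```python
import math
import random


def quantize(values, bits, block_size=32):
    """Toy per-block min/max quantizer; returns the dequantized floats."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / ((1 << bits) - 1)
        if scale == 0.0:  # constant block: keep as is
            out.extend(block)
        else:
            out.extend(lo + round((x - lo) / scale) * scale for x in block)
    return out


def rms(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


rng = random.Random(1)
original = [rng.gauss(0.0, 1.0) for _ in range(32 * 256)]  # stand-in for the real fp16

dequant = quantize(original, 5)   # "fp16 pulled out of the Q5"
again_q5 = quantize(dequant, 5)   # requantize the dequantized copy at the same width
as_q8 = quantize(dequant, 8)      # store the dequantized copy at 8 bits

print("dequant vs original :", rms(original, dequant))
print("requant-5 vs dequant:", rms(dequant, again_q5))
print("8-bit vs dequant    :", rms(dequant, as_q8))
print("8-bit vs original   :", rms(original, as_q8))
```

The second figure should come out at floating-point noise level and the third far below the first, which is the toy version of "it keeps what the Q5_K_M has, but not what the original fp16 had".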
