[llama.cpp PR#6844] Custom Quantizations

#8
by Virt-io - opened

Conversation about quantization, mainly this PR

@Lewdiculous

I think this is worth exploring to find a good balance between quality and speed.

I am currently experimenting with the config below:

# Used for everything not specified below.
ftype=IQ4_NL

token_embd.weight=Q8_0
output.weight=Q8_0

# These are quite small, keeping them in a higher quantization to help with context.
blk.*.attn_output.weight=F16
blk.*.attn_?.weight=F16
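
To make the override rules above concrete, here is a minimal Python sketch of how such a config could be resolved per tensor. It assumes glob-style matching where the first matching pattern wins and `ftype` is the fallback; the PR's actual parsing and precedence may differ. `parse_config` and `resolve` are hypothetical helper names, not llama.cpp functions.

```python
from fnmatch import fnmatchcase

CONFIG = """\
# Used for everything not specified below.
ftype=IQ4_NL

token_embd.weight=Q8_0
output.weight=Q8_0

blk.*.attn_output.weight=F16
blk.*.attn_?.weight=F16
"""

def parse_config(text):
    """Return (default_type, [(pattern, type), ...]) from key=value lines."""
    default, rules = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        if key == "ftype":
            default = value
        else:
            rules.append((key, value))
    return default, rules

def resolve(name, default, rules):
    """Assumed semantics: first matching glob pattern wins, else the default."""
    for pattern, qtype in rules:
        if fnmatchcase(name, pattern):
            return qtype
    return default

default, rules = parse_config(CONFIG)
for name in ("token_embd.weight", "blk.0.attn_q.weight",
             "blk.12.attn_output.weight", "blk.3.ffn_down.weight"):
    print(name, "->", resolve(name, default, rules))
```

Under these assumptions, `attn_?` covers the single-letter tensors (`attn_q`, `attn_k`, `attn_v`), while the FFN tensors fall through to the `ftype` default.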

Edit: The config above seems to come out at 6.95 BPW, so I will try reducing it. It is pretty fast, though.
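
For a rough sanity check of that BPW figure, the sketch below computes a weighted average of bits per weight over the tensors. The shapes are an assumption (Llama-3-8B-like dimensions; the thread does not say which model was quantized), the small norm tensors are ignored, and `estimate_bpw` is a hypothetical helper. Under these assumptions the estimate lands close to the number reported above.

```python
# Bits per weight of each format. Q8_0 and IQ4_NL both store one fp16
# scale per block of 32 weights, hence the extra 0.5 bit.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "IQ4_NL": 4.5}

def estimate_bpw(tensors):
    """tensors: iterable of (n_weights, quant_type) pairs."""
    tensors = list(tensors)
    total_bits = sum(n * BITS_PER_WEIGHT[q] for n, q in tensors)
    total_weights = sum(n for n, _ in tensors)
    return total_bits / total_weights

# Assumed Llama-3-8B-like dimensions (not stated in the thread).
N_LAYERS, D_MODEL, D_KV, D_FFN, N_VOCAB = 32, 4096, 1024, 14336, 128256

tensors = [
    (N_VOCAB * D_MODEL, "Q8_0"),  # token_embd.weight
    (N_VOCAB * D_MODEL, "Q8_0"),  # output.weight
]
for _ in range(N_LAYERS):
    tensors += [
        (D_MODEL * D_MODEL, "F16"),    # attn_q
        (D_KV * D_MODEL, "F16"),       # attn_k
        (D_KV * D_MODEL, "F16"),       # attn_v
        (D_MODEL * D_MODEL, "F16"),    # attn_output
        (D_MODEL * D_FFN, "IQ4_NL"),   # ffn_gate
        (D_MODEL * D_FFN, "IQ4_NL"),   # ffn_up
        (D_MODEL * D_FFN, "IQ4_NL"),   # ffn_down
    ]

bpw = estimate_bpw(tensors)
print(f"{bpw:.2f} BPW")
```

Most of the extra bits come from keeping the attention tensors in F16, so dropping those to Q8_0 would be the first thing to try when reducing the average.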

Lewdiculous changed discussion title from Quantization to Custom Quantizations - llama.cpp PR#6844
AetherArchitectural org

I have never played with customizing individual layers like this.

Lewdiculous changed discussion title from Custom Quantizations - llama.cpp PR#6844 to Quantizations and llama.cpp PR#6844

I updated llama.cpp but still only get degraded quants. Are there any tutorials or similar resources on using llama.cpp with Llama 3 models? I only know the convert.py method (python convert.py ./models/myllama3merge --vocab-type bpe).

AetherArchitectural org

That's why we're here <3

Lewdiculous changed discussion title from Quantizations and llama.cpp PR#6844 to [llama.cpp PR#6844] Custom Quantizations
Lewdiculous changed discussion status to closed
