Is this tested?
I've tried the Q2 and received:
llama_model_load: error loading model: tensor 'blk.8.ffn_up_exps.weight' data is not within the file bounds, model is corrupted or incomplete
llama_load_model_from_file: failed to load model
main: error: unable to load model
Yes, it is tested. This is my new favorite model and I'm using it all the time. It is fully uncensored. Despite being larger than Llama 405B, it delivers 16 tokens/second token generation speed on CPU at i1-IQ4_XS. It achieves this performance because it consists of 128 experts with only two being active for each token. I'm currently using i1-IQ4_XS, but I used the static quants a few days ago and they worked perfectly fine. I do, however, recommend using something larger than static Q2, or at least i1-Q2_K, for better quality.
@csabakecskemeti
The reason you are getting this error is probably that you concatenated the parts incorrectly. If you are on Linux you need to do the following:
cat snowflake-arctic-instruct.Q2_K.gguf.part1of4 snowflake-arctic-instruct.Q2_K.gguf.part2of4 snowflake-arctic-instruct.Q2_K.gguf.part3of4 snowflake-arctic-instruct.Q2_K.gguf.part4of4 > snowflake-arctic-instruct.Q2_K.gguf
I do, however, recommend simply redownloading the model from http://hf.tst.eu/model#snowflake-arctic-instruct-GGUF so that it is already concatenated correctly. I further recommend going with at least the same-sized but much higher quality i1-Q2_K if you can't go any larger.
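For anyone not on Linux, the same concatenation can be sketched in Python. This is only an illustration, assuming the parts follow the `.partNofM` naming shown above and all sit in one directory; the function name `concatenate_parts` is made up for this example.

```python
import re
import shutil
from pathlib import Path

# Matches names such as "model.Q2_K.gguf.part1of4" (naming taken from this thread).
PART_RE = re.compile(r"^(?P<base>.+)\.part(?P<idx>\d+)of(?P<total>\d+)$")


def concatenate_parts(any_part: str) -> Path:
    """Join <base>.part1ofM ... <base>.partMofM, in numeric order, into <base>."""
    given = Path(any_part)
    match = PART_RE.match(given.name)
    if match is None:
        raise ValueError(f"not a .partNofM file: {given.name}")
    base, total = match["base"], int(match["total"])

    # Build the part list explicitly so the order never depends on directory sorting.
    parts = [given.with_name(f"{base}.part{i}of{total}") for i in range(1, total + 1)]
    missing = [p.name for p in parts if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing parts: {missing}")

    output = given.with_name(base)
    with output.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, low memory use
    return output
```

Calling `concatenate_parts("snowflake-arctic-instruct.Q2_K.gguf.part1of4")` would write `snowflake-arctic-instruct.Q2_K.gguf` next to the parts. A missing or misordered part is the kind of thing that would leave tensor data outside the file bounds.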
I've tried with llama.cpp (only have 64GB vram so also used: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1).
I think I see what the issue was: when I split a model I use the llama.cpp splitter, which names the files like xyz.Q5_K_M-00012-of-00013.gguf, so when I run it I don't need to cat anything. I just refer to the first part of the GGUF and llama.cpp figures it out.
The main reason I'm asking is that I've also quantized it myself, but I cannot get any output tokens.
The conversion didn't work out of the box; it failed in gguf_writer.py:
expert_params += (size // shape[-3])
IndexError: tuple index out of range
particularly here:
if "_exps." in name:
    expert_params += (size // shape[-3])
    expert_sum += shape[-3]
    n_expert_tensors += 1
I found that blk.0.ffn_norm_exps.weight, although it has the "_exps" suffix, has a 1D shape: (7168,)
I'm not sure whether my download is corrupt (I don't think so) or whether this layer simply doesn't represent "separate tensors for each expert" (I hope you get my point).
Anyway, I just wanted to ask whether you faced the same issue during quantization, or what was used to quantize this?
Thanks!
I've tried with llama.cpp (only have 64GB vram so also used: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1).
I recommend using -ngl NUMBER_OF_LAYERS_THAT_FIT_IN_GPU_MEMORY instead, unless you use llama.cpp RPC across multiple PCs, where GGML_CUDA_ENABLE_UNIFIED_MEMORY is mandatory. With -ngl you get double the token processing speed and can load a model as large as RAM + GPU memory. Then just choose the largest quant that fits.
I think I see what the issue was: when I split a model I use the llama.cpp splitter, which names the files like xyz.Q5_K_M-00012-of-00013.gguf, so when I run it I don't need to cat anything. I just refer to the first part of the GGUF and llama.cpp figures it out.
This only works for GGUFs in the llama.cpp split format, not for GGUFs that were simply split into parts, so this method will not work for any of the models provided by mradermacher. You must manually concatenate the parts before loading the concatenated file with llama.cpp.
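As a rough way to tell the two cases apart, here is a small sketch that classifies a file by its name alone. It assumes only the two naming patterns that appear in this thread (`-00012-of-00013.gguf` and `.gguf.part1of4`); the function name and return labels are made up for illustration, and other uploaders may name their files differently.

```python
import re

# Naming as produced by the llama.cpp splitter in the example above (assumed 5-digit fields).
LLAMA_SPLIT_RE = re.compile(r"-\d{5}-of-\d{5}\.gguf$")
# Naming of the plain parts used for this model's downloads.
RAW_PART_RE = re.compile(r"\.gguf\.part\d+of\d+$")


def split_kind(filename: str) -> str:
    """Guess from the filename how a GGUF download has to be handled."""
    if LLAMA_SPLIT_RE.search(filename):
        return "llama-split"  # point llama.cpp at the first shard
    if RAW_PART_RE.search(filename):
        return "raw-parts"    # concatenate all parts before loading
    if filename.endswith(".gguf"):
        return "single"       # load directly
    return "unknown"
```

For example, `split_kind("xyz.Q5_K_M-00012-of-00013.gguf")` gives `"llama-split"`, while `split_kind("snowflake-arctic-instruct.Q2_K.gguf.part1of4")` gives `"raw-parts"`.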
I found that blk.0.ffn_norm_exps.weight, although it has the "_exps" suffix, has a 1D shape: (7168,)
Anyway, I just wanted to ask whether you faced the same issue during quantization, or what was used to quantize this?
Yes, we did face the same issue. I solved it by replacing shape[-3] with 128, as it is well known that the model has 128 experts. The sole purpose of the get_total_parameter_count function is to count the number of parameters for the GGUF metadata, so even if this is wrong it should not affect inference in any way.
if "_exps." in name:
    print(name, shape)
    expert_params += (size // 128)  # was: shape[-3]
    expert_sum += 128  # was: shape[-3]
    n_expert_tensors += 1
else:
    shared_params += size
total_params += size
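An alternative to hardcoding 128 would be to only treat a tensor as per-expert when it actually has an expert dimension. The following is a standalone sketch of that idea, not the real gguf_writer.py code: the function name and the plain name-to-shape mapping are made up for illustration, and counting the 1D `_exps` tensor as shared is an assumption that may not match how the model really uses it.

```python
from math import prod
from typing import Mapping, Sequence, Tuple


def count_parameters(tensor_shapes: Mapping[str, Sequence[int]]) -> Tuple[int, int, int]:
    """Return (total, shared, per-expert) parameter counts from tensor shapes."""
    total_params = shared_params = expert_params = 0
    for name, shape in tensor_shapes.items():
        size = prod(shape)
        # Only index shape[-3] when the tensor has at least three dimensions;
        # a 1D tensor such as (7168,) would raise IndexError otherwise.
        if "_exps." in name and len(shape) >= 3:
            expert_params += size // shape[-3]
        else:
            # ASSUMPTION: "_exps" tensors without an expert dimension count as shared.
            shared_params += size
        total_params += size
    return total_params, shared_params, expert_params
```

Like the hardcoded version, this only influences the parameter count reported in the metadata, so the same caveat applies: if the assumption is wrong, the number is off but inference should be unaffected.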
We also ran into another issue during imatrix computation. No matter what dataset we tried, we were unable to activate 1 of the 128 experts on layer 0, so in the currently released imatrix quants the first layer is statically quantized while the other layers have the imatrix applied. Despite this, the imatrix quants turned out much better than the static quants.
Yeah, this is just statistics...
Unfortunately my quants are not working correctly. I'll check yours again.
Thanks!
By re-converting and re-splitting, I've got my version working too.