'blk.49.ffn_gate.weight' data is not within the file bounds
Using koboldcpp and Luminum-v0.1-123B.Q5_K_S.gguf.part1of2. Also seems to be happening with Luminum-v0.1-123B.Q4_K_S.gguf.part1of2
Is this just an incompatibility with koboldcpp? Am I the only one experiencing this? Is there a command line option missing?
llama_model_load: error loading model: tensor 'blk.49.ffn_gate.weight' data is not within the file bounds, model is corrupted or incomplete
llama_load_model_from_file: failed to load model
Segmentation fault
You need to concatenate the parts with cat before loading them in llama.cpp:
cat Luminum-v0.1-123B.Q5_K_S.gguf.part1of2 Luminum-v0.1-123B.Q5_K_S.gguf.part2of2 > Luminum-v0.1-123B.Q5_K_S.gguf
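If you want to double-check the result of the cat step, here is a minimal sketch. It assumes the merged file and all parts are still on disk, and it relies on two facts: the merged file must be exactly as large as the parts combined, and a GGUF file starts with the four ASCII bytes GGUF. The helper name check_merged is made up for this example and is not part of llama.cpp or koboldcpp.

```shell
#!/bin/sh
# check_merged OUT PART... (hypothetical helper, for illustration only)
# Succeeds only if OUT is as large as all PARTs combined and starts with "GGUF".
check_merged() {
    out=$1
    shift
    total=0
    for part in "$@"; do
        size=$(wc -c < "$part") || return 1
        total=$((total + size))
    done
    merged=$(wc -c < "$out") || return 1
    merged=$((merged + 0))   # normalize any padding some wc versions print
    if [ "$merged" -ne "$total" ]; then
        echo "size mismatch: merged=$merged parts=$total" >&2
        return 1
    fi
    magic=$(head -c 4 "$out")
    if [ "$magic" != "GGUF" ]; then
        echo "missing GGUF magic at start of $out" >&2
        return 1
    fi
    echo "ok: $merged bytes"
}

# Example call (file names taken from this thread):
# check_merged Luminum-v0.1-123B.Q5_K_S.gguf \
#     Luminum-v0.1-123B.Q5_K_S.gguf.part1of2 \
#     Luminum-v0.1-123B.Q5_K_S.gguf.part2of2
```

A size mismatch usually means a part is missing or was only partially downloaded, which would produce the same "not within the file bounds" error.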
Thank you for that guidance. I had not done that step before with a multi-part gguf. Working now! Appreciated.
No problem. This happens because these are multi-part GGUFs and not split GGUFs. The official llama.cpp split format is kind of a pain, so many like to ignore it and instead take the GGUF file and split it into multiple parts. What's even more confusing is that you cannot concatenate split GGUFs, as a special tool (llama-gguf-split --merge) is required to convert them to a normal GGUF. Most users don't care, though, because unlike multi-part GGUFs, split GGUFs can be loaded as they are: llama.cpp supports them natively. No idea why the llama.cpp developers came up with such an overcomplicated solution to split a file. If you want to read mradermacher's opinion about this topic, read https://huggingface.co/mradermacher/model_requests#why-dont-you-use-gguf-split. It is quite likely that he will switch to split GGUFs in the near future, but for the time being all his quants larger than 50 GB are multi-part GGUFs and not split GGUFs.
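To tell the two kinds apart at a glance, here is a rough sketch that only looks at the file name. It assumes the .partXofY suffix used in this repo for multi-part files, and it assumes split GGUFs carry a -00001-of-00003.gguf style suffix, which is how llama-gguf-split names its output as far as I know. The helper gguf_kind is made up for this example and inspects nothing inside the file.

```shell
#!/bin/sh
# gguf_kind FILE (hypothetical helper, guesses the packaging from the name alone)
gguf_kind() {
    case $1 in
        *.part[0-9]*of[0-9]*)
            # e.g. model.gguf.part1of2 : plain byte-wise pieces of one file
            echo "multi-part: concatenate all parts with cat" ;;
        *-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].gguf)
            # e.g. model-00001-of-00003.gguf : assumed llama-gguf-split naming
            echo "split: load as is, or merge with llama-gguf-split --merge" ;;
        *.gguf)
            echo "single file: load directly" ;;
        *)
            echo "unknown" ;;
    esac
}
```

For example, gguf_kind Luminum-v0.1-123B.Q5_K_S.gguf.part1of2 reports the multi-part case, so cat is the right tool there.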
Yes, I plan to do gguf-split versions "soon", but am already wary of the support nightmare that two formats will cause :)
I think there are many small reasons for the overcomplicated solution. Most importantly, it's to avoid splitting tensors in a bad way. On unix, you'd have to split them on page-size boundaries so you can still mmap everything. And maybe that can't even be done on windows, making a special split program practically a requirement (some people have split at boundaries not divisible by 4k, for example). llama.cpp splits on tensor boundaries (explaining the uneven sizes of parts). And maybe it helps to have all the information about parts more explicitly in the first part. And lastly, it is "nice" to have self-describing files. Oh, and both my early quants and quants by others had wild west split filenames (mostly .aa, .ab from posix split, but sometimes others). Arguably, I also invented my own file format by calling them .partXofX. And that's just what I can come up with.
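To make the "divisible by 4k" point concrete, here is a tiny sketch that checks whether a part of a given size ends on a page boundary. The 4096-byte default is an assumption (a common page size, not a universal one), and boundary_aligned is a made-up helper name.

```shell
#!/bin/sh
# boundary_aligned SIZE [PAGE] (hypothetical helper)
# Succeeds if a part of SIZE bytes ends on a multiple of PAGE bytes.
# PAGE defaults to 4096, which is assumed here; real systems may differ.
boundary_aligned() {
    size=$1
    page=${2:-4096}
    [ $((size % page)) -eq 0 ]
}

# Example: a part of exactly 8192 bytes ends on a 4k boundary, 5000 bytes does not.
# boundary_aligned 8192 && echo aligned
# boundary_aligned 5000 || echo "not aligned"
```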
So, I do see the engineering reasons why they might have done it. Still, I would have first tried to work with the existing de-facto standard. Especially since this whole format had to be developed only because of (arbitrary?) huggingface limitations - a service whose main purpose is to store humungously large files should have humungously large file support, really.