GGUF?
Please ^-^
There is a GGUF file provided in the repo.
It's quite large; can we get a quant?
That seems really large for a GGUF file. I have enough memory, but why is it so large? Is it FP16 etc.? Other variants should be provided. I did find these, though I haven't tested them:
https://huggingface.co/mlabonne/gemma-2b-GGUF
He also has a 7B version, but the repo seems empty:
https://huggingface.co/mlabonne/gemma-7b-it-GGUF
Download at your own risk of course
You can run quantize (included in the llama.cpp repo) to get Q8_0 versions. I expect the community will come up with various quantized versions very soon too.
why is it so large? Is it FP16 etc.?
Yes, it is float32 (FP32).
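To see why a float32 export is so large, a rough back-of-the-envelope estimate helps: file size is roughly parameter count times bits per weight. The sketch below assumes a nominal 7 billion parameters (a hypothetical round figure, not the exact Gemma count) and ignores metadata and tokenizer overhead. The Q8_0 figure of 8.5 bits per weight comes from its block layout of 32 int8 values plus one 16-bit scale.

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8 bytes.
# Ignores metadata, tokenizer data, and any tensors kept at higher precision.

BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 int8 weights + one fp16 scale = 34 bytes per 32 weights
}


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate file size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical nominal parameter count, for illustration only.
NOMINAL_PARAMS = 7e9

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gb(NOMINAL_PARAMS, bpw):.1f} GB")
```

Under these assumptions, going from F32 to Q8_0 cuts the file to roughly a quarter of its size, which is why people in this thread are asking for quants.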
These are the commands to quantize to 8-bit and 4-bit. They assume you have llama.cpp built and installed.
8-bit: quantize gemma-7b.gguf ./gemma-7b-Q8_0.gguf Q8_0
4-bit: quantize gemma-7b.gguf ./gemma-7b-Q4_K_M.gguf Q4_K_M
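If you want several quant types from the same file, the two commands above can be wrapped in a small script. This is only a sketch: it assumes the quantize binary from your llama.cpp build is on PATH under that name, and the helper names build_quantize_cmd and run_quantize are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_quantize_cmd(src: str, qtype: str, binary: str = "quantize") -> list[str]:
    """Build the argv for: quantize <src> <dst> <qtype>.

    The output name follows the pattern used above, e.g.
    gemma-7b.gguf -> gemma-7b-Q8_0.gguf.
    """
    src_path = Path(src)
    dst_path = src_path.with_name(f"{src_path.stem}-{qtype}.gguf")
    return [binary, str(src_path), str(dst_path), qtype]


def run_quantize(src: str, qtype: str) -> None:
    """Run the quantize tool and raise if it exits with a non-zero status."""
    subprocess.run(build_quantize_cmd(src, qtype), check=True)


# Example (not executed here):
#   for qtype in ("Q8_0", "Q4_K_M"):
#       run_quantize("gemma-7b.gguf", qtype)
```

Keeping the command construction separate from the subprocess call makes it easy to print and inspect the exact command before running it on a multi-gigabyte file.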
I tried the GGUF from https://huggingface.co/rahuldshetty/gemma-7b-it-gguf-quantized in ollama
But it crashes! Is anyone else facing the same issue?
@aptha, a dumb question, but are you compiling from the latest ollama source, including updating its llama.cpp submodule?
llm = CTransformers(model="mlabonne/Gemmalpaca-2B-GGUF", model_file="gemmalpaca-2b.Q8_0.gguf", model_type="gemma", gpu_layers=0)
This one doesn't work. Is there a generic way to open GGUF files with CTransformers?
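When a loader refuses a file, one quick sanity check is to confirm the download is actually a GGUF file by reading its fixed header. The sketch below assumes the GGUF v2/v3 layout (4-byte magic GGUF, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count). read_gguf_header is an illustrative helper, not part of CTransformers, and a valid header does not guarantee that a given loader supports the model architecture.

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic (4 bytes), version (uint32), tensor count (uint64), metadata KV count (uint64)
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes, no padding with "<"


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header and return its fields.

    Raises ValueError if the file is too short or the magic does not match.
    """
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("file is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic was {magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Usage would look like read_gguf_header("gemmalpaca-2b.Q8_0.gguf") on a locally downloaded copy of the file. If this succeeds but the loader still fails, the problem is more likely on the loader side than in the download.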