Converted to HF with transformers 4.30.0.dev0, then quantized to 4 bit with GPTQ (group size 32):
python llama.py ../llama-65b-hf c4 --wbits 4 --true-sequential --act-order --groupsize 32 --save_safetensors 4bit-32g.safetensors
Perplexity (PPL) should be marginally better than with group size 128, at the cost of more VRAM. An A6000 (48 GB) should still be able to fit the whole model at the full 2048-token context.
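For a rough sense of where the extra VRAM goes, here is a back-of-the-envelope sketch. It is not taken from the quantization code: it assumes each group stores one fp16 scale plus one 4-bit zero point, treats all of a nominal 65e9 parameters as quantized, and ignores activations, the KV cache, and the unquantized embedding and output layers. The function names are made up for illustration.

```python
def bits_per_weight(wbits: int, groupsize: int, scale_bits: int = 16) -> float:
    """Effective storage per weight: the quantized weight itself plus its
    share of the per-group scale and zero point (assumed layout)."""
    zero_bits = wbits  # assumption: zero points are stored at the weight bit width
    return wbits + (scale_bits + zero_bits) / groupsize


def weights_gib(n_params: float, wbits: int, groupsize: int) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bits = n_params * bits_per_weight(wbits, groupsize)
    return total_bits / 8 / 2**30


N_PARAMS = 65e9  # assumption: nominal parameter count, all treated as quantized

for groupsize in (32, 128):
    bpw = bits_per_weight(4, groupsize)
    size = weights_gib(N_PARAMS, 4, groupsize)
    print(f"group size {groupsize:>3}: {bpw:.3f} bits/weight, ~{size:.1f} GiB")
```

Under these assumptions, smaller groups mean more scales and zero points per weight, which is the source of the extra memory; the actual footprint at load time will differ.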
Note that this model was quantized under GPTQ's cuda branch, which means it should work with 0cc4m's KoboldAI fork:
https://github.com/0cc4m/KoboldAI