My exllamav2-based quantization of Xwin-LM-70B-V0.1, targeted at 48G of VRAM, seems to have hit a sweet spot in evaluations.
- Original model: https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1
- Exllamav2 4.8bpw conversion from https://huggingface.co/firelzrd/Xwin-LM-70B-V0.1-fp16-safetensors.
- Fits in 48G (2x24G) VRAM at 4k or 8k context, with or without the 8bit cache enabled.
- Recommended settings: 6400 context, alpha_value 1.6, gpu_split 20,23.5
- alpha_value at or above 1.75 seems to cause an occasional 'stutter', most obvious when the model outputs dates, e.g. "The Sixth Sense (19999)".
- The 4.800b quantization seems to have gotten lucky: its perplexity was better than the 4bit-128g, 4bit-32g, Q4_K_S, 4.650b, 4.900b and even the 5.000b!
- Experimentation has shown that alpha_value 1.6 seems better than 1.75 at 1.5x context, and even at 1.5625x.
- Maybe obvious to some, but using the 8bit cache had no perplexity impact.
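To make the context multipliers in the notes above concrete, here is a small sketch that relates each tested context length to the native length and shows how an alpha_value would change the RoPE base. The native context of 4096, the RoPE base of 10000, the head dimension of 128, and the NTK-style formula `base * alpha ** (d / (d - 2))` are assumptions on my part, not something this card states, so treat the numbers as illustrative only.

```python
# Sketch only: NTK-style RoPE scaling as it is commonly applied for alpha_value.
# Assumptions (not from this model card): native context 4096, RoPE base 10000,
# head dimension 128, and the scaling formula base * alpha ** (d / (d - 2)).

NATIVE_CTX = 4096
ROPE_BASE = 10000.0
HEAD_DIM = 128


def context_ratio(ctx: int, native_ctx: int = NATIVE_CTX) -> float:
    """Return how far a context length stretches past the native length."""
    return ctx / native_ctx


def scaled_rope_base(alpha: float, base: float = ROPE_BASE, head_dim: int = HEAD_DIM) -> float:
    """Return the RoPE base after NTK-style alpha scaling (assumed formula)."""
    return base * alpha ** (head_dim / (head_dim - 2))


# (context, alpha_value) pairs taken from the perplexity table in this card.
SETTINGS = [(5120, 1.375), (6144, 1.75), (6400, 1.6), (6656, 1.95), (7168, 2.125), (8192, 2.5)]

if __name__ == "__main__":
    for ctx, alpha in SETTINGS:
        print(
            f"{ctx} ctx = {context_ratio(ctx):.4f}x native, "
            f"alpha_value {alpha} -> RoPE base ~{scaled_rope_base(alpha):.0f}"
        )
```

Under these assumptions, 6144 context is 1.5x native and 6400 context is 1.5625x native, which is where the two multipliers in the notes come from.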
Made using exllamav2/convert.py with the following command:
python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
-cf models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
-o tmp/ \
-c parquet/wikitext-test.parquet \
-b 4.800
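As a rough sanity check on why `-b 4.800` lands near the 48G budget, the sketch below estimates the weight size and the key/value cache size. Every constant in it is an assumption about a Llama-2 70B style architecture (roughly 69 billion parameters, 80 layers, 8 key/value heads, head dimension 128), not a measurement from this conversion, and it ignores activations, framework overhead and the uneven split across two cards.

```python
# Back-of-the-envelope VRAM estimate. All constants are assumptions about a
# Llama-2 70B style architecture, not values reported by this model card.
PARAMS = 69e9    # assumed approximate parameter count
LAYERS = 80      # assumed number of transformer layers
KV_HEADS = 8     # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128   # assumed per-head dimension
GIB = 1024 ** 3


def weight_bytes(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate size of the quantized weights in bytes."""
    return params * bits_per_weight / 8


def kv_cache_bytes(ctx: int, bytes_per_value: int = 2) -> int:
    """Approximate key/value cache size: 2 tensors (K and V) per layer."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * ctx


if __name__ == "__main__":
    weights = weight_bytes(4.8)
    for ctx in (4096, 6400, 8192):
        fp16 = kv_cache_bytes(ctx, bytes_per_value=2)
        int8 = kv_cache_bytes(ctx, bytes_per_value=1)
        print(
            f"{ctx} ctx: weights ~{weights / GIB:.1f} GiB, "
            f"cache ~{fp16 / GIB:.2f} GiB (16bit) or ~{int8 / GIB:.2f} GiB (8bit)"
        )
```

The point of the sketch is only that the weights dominate the budget and that an 8bit cache halves the (much smaller) cache term. Actual usage will differ.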
Perplexity (wikitext) evaluated as:
Model | Perplexity | Context (alpha_value, notes) |
---|---|---|
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.21780776977539 | 4096 ctx |
matatonic_Xwin-LM-70B-V0.1-exl2-4.900b | 3.2188525199890137 | 4096 ctx (not released) |
firelzrd_Xwin-LM-70B-V0.1-exl2_5-bpw | 3.22019362449646 | 4096 ctx (8b cache) |
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.239454746246338 | 5120 ctx (1.375) |
LoneStriker_Xwin-LM-70B-V0.1-4.65bpw-h6-exl2 | 3.2419090270996094 | 4096 ctx |
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6) |
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6, 8b cache) |
xwin-lm-70b-v0.1.Q4_K_S.gguf | 3.2480294704437256 | 4096 ctx |
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.253002405166626 | 6144 ctx (1.75) |
TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-32g-actorder_True | 3.266364574432373 | 4096 ctx |
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.278069496154785 | 6656 ctx (1.95) |
TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-128g-actorder_True | 3.2803425788879395 | 4096 ctx |
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.304278612136841 | 7168 ctx (2.125) |
matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.359946727752685 | 8192 ctx (2.5) |
Note: it should also be better than xwin-lm-70b-v0.1.Q4_K_M.gguf, which reports 4.8bpw, but so far I have not been able to complete a perplexity evaluation of that file.
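To put the 4096 context results in perspective, this short sketch computes each model's perplexity gap relative to the 4.800b quantization. The values are copied from the table above. The shortened labels and the `relative_gap` helper are mine, for illustration only.

```python
# Perplexity at 4096 context, copied from the table in this card.
# The short labels are illustrative names, not the full repository names.
RESULTS_4096 = {
    "matatonic exl2 4.800b": 3.21780776977539,
    "matatonic exl2 4.900b": 3.2188525199890137,
    "firelzrd exl2 5-bpw (8b cache)": 3.22019362449646,
    "LoneStriker exl2 4.65bpw-h6": 3.2419090270996094,
    "Q4_K_S gguf": 3.2480294704437256,
    "TheBloke GPTQ 4bit-32g": 3.266364574432373,
    "TheBloke GPTQ 4bit-128g": 3.2803425788879395,
}


def relative_gap(perplexity: float, reference: float) -> float:
    """Return the gap to the reference perplexity, in percent."""
    return (perplexity - reference) / reference * 100.0


if __name__ == "__main__":
    reference = RESULTS_4096["matatonic exl2 4.800b"]
    # Print models from lowest (best) to highest perplexity.
    for name, ppl in sorted(RESULTS_4096.items(), key=lambda item: item[1]):
        print(f"{name:32s} {ppl:.4f}  (+{relative_gap(ppl, reference):.2f}%)")
```

All of the gaps are small, so the ranking is better read as "no worse than the larger quantizations" than as a decisive win.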