|
---
license: llama2
---
|
|
|
My exllamav2-based quantization of Xwin-LM-70B-V0.1, targeted at 48G VRAM, seems to have hit a sweet spot in evaluations.
|
|
|
* Original model: https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1

* Exllamav2 4.8bpw conversion from https://huggingface.co/firelzrd/Xwin-LM-70B-V0.1-fp16-safetensors.
|
* Fits in 48G (2x24G) VRAM at 4k or 8k context, with or without the 8bit cache enabled.
|
* Recommended settings: 6400 context, alpha_value 1.6, gpu_split 20,23.5 (see the launch sketch after this list).
|
* An alpha_value at or above 1.75 seems to cause an occasional 'stutter', most obvious when the model outputs dates, e.g. "The Sixth Sense (19999)".
|
* This seems to be a lucky quantization: the 4.800b came out better than the 4bit-128g, 4bit-32g, Q4_K_S, 4.650b, 4.900b, and even the 5.000b!
|
* Experimentation shows that an alpha_value of 1.6 rather than 1.75 works better at 1.5x context, and even at 1.5625x.
|
* Perhaps obvious to some, but using the 8bit cache has no impact on perplexity.
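
The alpha_value/gpu_split names above match text-generation-webui's exllamav2 loader; assuming that frontend, a hypothetical launch with the recommended settings might look like the sketch below (flag names shift between releases, so verify against `--help`):

```bash
# hypothetical text-generation-webui launch with the recommended settings:
# 6400 context, NTK RoPE alpha 1.6, and a 20/23.5 split across 2x24G GPUs
python3 server.py --loader exllamav2 \
    --model matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
    --max_seq_len 6400 \
    --alpha_value 1.6 \
    --gpu-split 20,23.5 \
    --cache_8bit  # optional; no perplexity impact observed (see above)
```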
|
|
|
Made using exllamav2/convert.py with the following command:
|
|
|
```bash
# -i:  input model directory (fp16 safetensors)
# -cf: output directory for the finished quantized model
# -o:  working directory for temporary files
# -c:  calibration dataset (parquet)
# -b:  target bits per weight
python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
    -cf models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
    -o tmp/ \
    -c parquet/wikitext-test.parquet \
    -b 4.800
```
|
|
|
Perplexity (wikitext) evaluated as follows:
|
|
|
| Model | Perplexity | Comment (alpha_value) |
|---|---|---|
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.21780776977539 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.900b | 3.2188525199890137 | 4096 ctx (not released) |
| firelzrd_Xwin-LM-70B-V0.1-exl2_5-bpw | 3.22019362449646 | 4096 ctx (8b cache) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.239454746246338 | 5120 ctx (1.375) |
| LoneStriker_Xwin-LM-70B-V0.1-4.65bpw-h6-exl2 | 3.2419090270996094 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6, 8b cache) |
| xwin-lm-70b-v0.1.Q4_K_S.gguf | 3.2480294704437256 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.253002405166626 | 6144 ctx (1.75) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-32g-actorder_True | 3.266364574432373 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.278069496154785 | 6656 ctx (1.95) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-128g-actorder_True | 3.2803425788879395 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.304278612136841 | 7168 ctx (2.125) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.359946727752685 | 8192 ctx (2.5) |
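
The exact eval invocation isn't documented here, but exllamav2 ships a test_inference.py with a perplexity evaluation option; a hypothetical sketch along those lines, matching the 6400 ctx (1.6) row (flag names vary between exllamav2 versions, so verify against your checkout):

```bash
# hypothetical perplexity run; -ed points at the same wikitext parquet
# used for calibration, -l/-ra set context length and RoPE alpha
python3 test_inference.py \
    -m models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
    -ed parquet/wikitext-test.parquet \
    -l 6400 \
    -ra 1.6 \
    -gs 20,23.5
```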
|
|
|
Note: this should also be better than xwin-lm-70b-v0.1.Q4_K_M.gguf, which reports 4.8bpw, but so far my perplexity eval of that file has not been successful.
|
|