|
---
license: llama2
---
|
|
|
My exllamav2-based quantization of Xwin-LM-70B-V0.1, targeted at 48G VRAM, seems to have hit a sweet spot in evaluations.
|
|
|
* Original model: https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1

* Exllamav2 4.8bpw conversion from https://huggingface.co/firelzrd/Xwin-LM-70B-V0.1-fp16-safetensors.
|
* Fits in 48G (2x24G) VRAM at 4k or 8k context, with or without the 8bit cache enabled.
|
* Recommended settings: 6400 context, alpha_value 1.6, gpu_split 20,23.5 (see the launch sketch after this list).
|
* An alpha_value at or above 1.75 seems to cause an occasional 'stutter', most obvious when the model outputs dates, e.g. "The Sixth Sense (19999)".
|
* This seems to be a lucky quantization: the 4.800b came out better than the 4bit-128g, 4bit-32g, Q4_K_S, 4.650b, 4.900b, and even the 5.000b!
|
* Experimentation shows that an alpha_value of 1.6 rather than 1.75 works better at 1.5x context, and even at 1.5625x.
|
* Perhaps obvious to some, but using the 8bit cache has no impact on perplexity.
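
The alpha_value/gpu_split names above match text-generation-webui's exllamav2 loader; assuming that frontend, a hypothetical launch with the recommended settings might look like the sketch below (flag names shift between releases, so verify against `--help`):

```bash
# hypothetical text-generation-webui launch with the recommended settings:
# 6400 context, NTK RoPE alpha 1.6, and a 20/23.5 split across 2x24G GPUs
python3 server.py --loader exllamav2 \
    --model matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
    --max_seq_len 6400 \
    --alpha_value 1.6 \
    --gpu-split 20,23.5 \
    --cache_8bit  # optional; no perplexity impact observed (see above)
```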
|
|
|
Made using exllamav2/convert.py with the following command:
|
|
|
```bash
# -i:  input model directory (fp16 safetensors)
# -cf: output directory for the finished quantized model
# -o:  working directory for temporary files
# -c:  calibration dataset (parquet)
# -b:  target bits per weight
python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
    -cf models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
    -o tmp/ \
    -c parquet/wikitext-test.parquet \
    -b 4.800
```
|
|
|
Perplexity (wikitext) evaluated as follows:
|
|
|
| Model | Perplexity | Comment (alpha_value) |
|---|---|---|
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.21780776977539 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.900b | 3.2188525199890137 | 4096 ctx (not released) |
| firelzrd_Xwin-LM-70B-V0.1-exl2_5-bpw | 3.22019362449646 | 4096 ctx (8b cache) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.239454746246338 | 5120 ctx (1.375) |
| LoneStriker_Xwin-LM-70B-V0.1-4.65bpw-h6-exl2 | 3.2419090270996094 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6, 8b cache) |
| xwin-lm-70b-v0.1.Q4_K_S.gguf | 3.2480294704437256 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.253002405166626 | 6144 ctx (1.75) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-32g-actorder_True | 3.266364574432373 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.278069496154785 | 6656 ctx (1.95) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-128g-actorder_True | 3.2803425788879395 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.304278612136841 | 7168 ctx (2.125) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.359946727752685 | 8192 ctx (2.5) |
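
The exact eval invocation isn't documented here, but exllamav2 ships a test_inference.py with a perplexity evaluation option; a hypothetical sketch along those lines, matching the 6400 ctx (1.6) row (flag names vary between exllamav2 versions, so verify against your checkout):

```bash
# hypothetical perplexity run; -ed points at the same wikitext parquet
# used for calibration, -l/-ra set context length and RoPE alpha
python3 test_inference.py \
    -m models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
    -ed parquet/wikitext-test.parquet \
    -l 6400 \
    -ra 1.6 \
    -gs 20,23.5
```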
|
|
|
Note: this should also be better than xwin-lm-70b-v0.1.Q4_K_M.gguf, which reports 4.8bpw, but so far my perplexity eval of that file has not been successful.
|
|