V8 vs V16?
I noticed you made three quantized versions of this model: two V16 and this one (V8). Which one is more intelligent, and what do V8 and V16 mean?
I'm sorry if this is confusing. The model name encodes the vector length, codebook (lookup table) size, and residual codebook size. For example, "Qwen2.5-72B-Instruct-v8-k65536-256-woft" refers to "Qwen2.5-72B-Instruct" quantized with: Vector Length 8, Number of Centroids 65536 (2^16), and Number of Residual Centroids 256 (2^8). The equivalent bitwidth calculation is:
Index: log2(65536) / 8 = 16 / 8 = 2 bits,
Residual Index: log2(256) / 8 = 8 / 8 = 1 bit,
Total Bitwidth: 2 + 1 = 3 bits,
Model Size Estimation: ~70B parameters * 3 bits / 8 bits per byte ≈ 26.25 GB.
You can refer to this table for an estimation of the bitwidth: https://github.com/microsoft/vptq?tab=readme-ov-file#models-from-open-source-community
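If it helps, here is a small Python sketch of the same arithmetic. The helper names (vptq_equivalent_bitwidth, estimated_size_gb) are made up for illustration and are not part of the vptq package, and the size estimate only counts the quantized weights (it ignores codebooks and any layers kept in higher precision):

```python
import math

def vptq_equivalent_bitwidth(vector_length, num_centroids, num_res_centroids=0):
    # Bits for the main codebook index, amortized over the whole vector.
    index_bits = math.log2(num_centroids) / vector_length
    # Bits for the residual codebook index, if a residual codebook is used.
    residual_bits = math.log2(num_res_centroids) / vector_length if num_res_centroids else 0
    return index_bits + residual_bits

def estimated_size_gb(num_params, bits_per_weight):
    # Rough size of the quantized weights only.
    return num_params * bits_per_weight / 8 / 1e9

# v8-k65536-256: 16/8 + 8/8 = 2 + 1 = 3 bits per weight
bpw_v8 = vptq_equivalent_bitwidth(vector_length=8, num_centroids=65536, num_res_centroids=256)
print(bpw_v8)                           # 3.0
print(estimated_size_gb(70e9, bpw_v8))  # ~26.25 GB
```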
V16 means the vector length is 16, i.e. each group of 16 weights is represented by a single codebook index.
For example, the model "Qwen2.5-72B-Instruct-v16-k65536-65536-woft" available at https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft uses: Vector Length 16, Number of Centroids 65536 (2^16), and Number of Residual Centroids 65536 (2^16). The equivalent bitwidth calculation is:
Index: log2(65536) / 16 = 16 / 16 = 1 bit,
Residual Index: log2(65536) / 16 = 16 / 16 = 1 bit,
Total Bitwidth: 1 + 1 = 2 bits.
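Using the same hypothetical helper from the sketch above, the v16 configuration comes out to 2 bits per weight:

```python
# v16-k65536-65536: 16/16 + 16/16 = 1 + 1 = 2 bits per weight
bpw_v16 = vptq_equivalent_bitwidth(vector_length=16, num_centroids=65536, num_res_centroids=65536)
print(bpw_v16)                           # 2.0
print(estimated_size_gb(70e9, bpw_v16))  # ~17.5 GB
```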
Typically, a larger bitwidth preserves more of the original model's quality, so the 3-bit v8 model should generally be "more intelligent" than the 2-bit v16 one.