KVQuant is a methodology for efficient KV cache quantization that incorporates several innovations to achieve accurate low-precision quantization, thereby enabling efficient long-context inference.
TLDR: KVQuant addresses the memory bottleneck of long-context inference by quantizing the KV cache to low precision. It achieves high accuracy at low precision by exploiting several consistent patterns observed in cached KV values across different LLMs, using methods that include:
- Per-channel, Pre-RoPE Key quantization to better match the outlier channels in Keys
- Non-Uniform Quantization (NUQ) to better represent the non-uniform activations
- Dense-and-Sparse Quantization to mitigate the impacts of numerical outliers on quantization difficulty
- Q-Norm to mitigate distribution shift at ultra-low precisions (e.g. 2-bit)
- Attention-Sink Aware Quantization to avoid quantization error in the first token, which is disproportionately sensitive to quantization error
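To make two of the ideas above concrete, here is a minimal NumPy sketch of per-channel quantization combined with a dense-and-sparse split. It is an illustration under stated assumptions, not the KVQuant implementation: the function name `dense_and_sparse_quantize` and its parameters are hypothetical, it uses plain uniform quantization instead of NUQ, it picks outliers by global magnitude, and it assumes every channel keeps at least one non-outlier value.

```python
import numpy as np

def dense_and_sparse_quantize(keys, bits=2, outlier_frac=0.01):
    """Illustrative only (hypothetical helper, not the KVQuant API).

    keys: array of shape (tokens, channels).
    Returns (dequantized, codes, outlier_mask).
    """
    keys = np.asarray(keys, dtype=np.float64)

    # Sparse part: mark the largest-magnitude entries as outliers.
    # Assumption: outliers are chosen by global magnitude.
    k = int(round(outlier_frac * keys.size))
    flat_mask = np.zeros(keys.size, dtype=bool)
    if k > 0:
        flat_mask[np.argpartition(np.abs(keys).ravel(), -k)[-k:]] = True
    mask = flat_mask.reshape(keys.shape)

    # Dense part: per-channel range over tokens, ignoring outliers.
    # Assumption: each channel has at least one non-outlier entry.
    dense = np.where(mask, np.nan, keys)
    lo = np.nanmin(dense, axis=0, keepdims=True)
    hi = np.nanmax(dense, axis=0, keepdims=True)

    # Uniform quantization per channel (a stand-in for NUQ).
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    codes = np.clip(np.round((keys - lo) / scale), 0, levels).astype(np.uint8)

    # Dequantize, then restore outliers at full precision.
    dequantized = np.where(mask, keys, codes * scale + lo)
    return dequantized, codes, mask
```

Because the scale is computed per channel, a channel with unusually large values does not stretch the quantization range of the others, and removing a small fraction of outliers keeps the remaining range tight.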
For more details please check out our paper.
Model description
Quantizer file for running DBRX with 2-bit KV cache using KVQuant.
- Base Model: DBRX-Instruct
- Bitwidth: 2-bit
- Sparsity Level: 1%