# Pixtral-12B-2409: FP8 Dynamic Quant + FP8 KV Cache

Quantized from mistral-community/pixtral-12b with LLM Compressor for optimised inference on vLLM.

FP8 dynamic quantization is applied to the language model, and the KV cache is quantized to FP8. The multi_modal_projector and vision_tower are left in FP16 since they make up only a small part of the model.

Calibrated on 2,048 UltraChat samples.
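
For reference, a quantization like this can be reproduced along the following lines. This is a minimal sketch patterned on LLM Compressor's FP8 KV-cache example, not the exact script used for this checkpoint; the output directory name is illustrative, and the `oneshot` import path may differ across LLM Compressor versions.

```python
from datasets import load_dataset
from llmcompressor.transformers import oneshot
from transformers import AutoTokenizer, LlavaForConditionalGeneration

MODEL_ID = "mistral-community/pixtral-12b"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048

model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Text-only calibration data from UltraChat.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def tokenize(sample):
    text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
    return tokenizer(
        text, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# FP8 dynamic quant for Linear layers (static per-channel weights, dynamic
# per-token activations) plus a static FP8 KV-cache scheme. vision_tower and
# multi_modal_projector are ignored so they stay in higher precision.
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"]
      kv_cache_scheme: {num_bits: 8, type: float, strategy: tensor, dynamic: false, symmetric: true}
      config_groups:
        group_0:
          targets: ["Linear"]
          weights: {num_bits: 8, type: float, strategy: channel, dynamic: false, symmetric: true}
          input_activations: {num_bits: 8, type: float, strategy: token, dynamic: true, symmetric: true}
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir="pixtral-12b-FP8-dynamic-FP8-KV-cache",
)
```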

## Example vLLM usage

```bash
vllm serve nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache --quantization fp8 --kv-cache-dtype fp8
```

Supported on NVIDIA GPUs with compute capability 8.9 or higher (Ada Lovelace, Hopper).
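
You can quickly check whether a GPU qualifies by querying its compute capability with PyTorch (assuming CUDA is available):

```python
import torch

# FP8 (E4M3) needs compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper) and up.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (8, 9), "This GPU lacks native FP8 support."
```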

Edit: Something seems to be wrong with the tokenizer. If you run into issues, add `--tokenizer mistral-community/pixtral-12b` to your vLLM command-line args.
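
Once the server is up, requests go through vLLM's OpenAI-compatible API. Below is a minimal sketch using the `openai` client, assuming the default `localhost:8000` endpoint; the image URL is just a placeholder.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nintwentydo/pixtral-12b-FP8-dynamic-FP8-KV-cache",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Placeholder image URL; swap in your own.
                {"type": "image_url", "image_url": {"url": "https://picsum.photos/seed/pixtral/512"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```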
