QVQ-72B-Preview AWQ 4-Bit Quantized Version

This repository provides the AWQ 4-bit quantized version of the QVQ-72B-Preview model, originally developed by Qwen. This model's weights are padded with zeros before quantization to ensure compatibility with multi-GPU tensor parallelism by resolving divisibility constraints. The padding minimally impacts computation while enabling efficient scaling across multiple GPUs.

QVQ-72B-Preview

Introduction

QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities.

Performance

QVQ-72B-Preview o1-2024-12-17 gpt-4o-2024-05-13 Claude3.5 Sonnet-20241022 Qwen2VL-72B
MMMU(val) 70.3 77.3 69.1 70.4 64.5
MathVista(mini) 71.4 71.0 63.8 65.3 70.5
MathVision(full) 35.9 30.4 35.6 25.9
OlympiadBench 20.4 25.9 11.2

QVQ-72B-Preview has achieved remarkable performance on various benchmarks. It scored a remarkable 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark, showcasing QVQ's powerful ability in multidisciplinary understanding and reasoning. Furthermore, the significant improvements on MathVision highlight the model's progress in mathematical reasoning tasks. OlympiadBench also demonstrates the model's enhanced ability to tackle challenging problems.

But It's Not All Perfect: Acknowledging the Limitations

While QVQ-72B-Preview exhibits promising performance that surpasses expectations, it’s important to acknowledge several limitations:

  1. Language Mixing and Code-Switching: The model might occasionally mix different languages or unexpectedly switch between them, potentially affecting the clarity of its responses.
  2. Recursive Reasoning Loops: There's a risk of the model getting caught in recursive reasoning loops, leading to lengthy responses that may not even arrive at a final answer.
  3. Safety and Ethical Considerations: Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.
  4. Performance and Benchmark Limitations: Despite the improvements in visual reasoning, QVQ doesn’t entirely replace the capabilities of Qwen2-VL-72B. During multi-step visual reasoning, the model might gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ doesn’t show significant improvement over Qwen2-VL-72B in basic recognition tasks like identifying people, animals, or plants.

Note: Currently, the model only supports single-round dialogues and image outputs. It does not support video inputs.

Citation

If you find our work helpful, feel free to give us a cite.

@misc{qvq-72b-preview,
    title = {QVQ: To See the World with Wisdom},
    url = {https://qwenlm.github.io/blog/qvq-72b-preview/},
    author = {Qwen Team},
    month = {December},
    year = {2024}
}
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
Downloads last month
2,932
Safetensors
Model size
12.6B params
Tensor type
I32
·
FP16
·
Inference Examples
Inference API (serverless) does not yet support transformers models for this pipeline type.

Model tree for kosbu/QVQ-72B-Preview-AWQ

Base model

Qwen/Qwen2-VL-72B
Quantized
(29)
this model