Are Vision-Language Models Truly Understanding Multi-vision Sensor?
Abstract
Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance on computer vision tasks. For VLMs to be effective in real-world applications, they must also understand diverse multi-vision sensor data, such as thermal, depth, and X-ray information. However, we find that current VLMs process multi-vision sensor images without a deep understanding of the underlying sensor information, disregarding each sensor's unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions requiring multi-vision sensor reasoning. To address this, we propose the Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, which assesses VLMs on their capacity for sensor-specific reasoning. We further introduce Diverse Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning on multi-vision sensor tasks, helping to bridge the core information gap between images and sensor data. Extensive experimental results validate that the proposed DNA method significantly improves multi-vision sensor reasoning in VLMs.
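The abstract does not spell out the DNA training objective, so the snippet below is only a minimal sketch of what a "diverse negative attributes" style optimization might look like: a DPO-style preference loss in which one sensor-grounded positive answer is contrasted against several negative answers that ignore the sensor's physical properties. The function name `dna_style_loss`, the tensor layout, and the loss form are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a "diverse negative attributes" style preference loss.
# Assumption (not from the paper): each training sample pairs one sensor-aware
# positive answer with K diverse negative answers, and we have per-sequence
# log-likelihoods from both the trained policy and a frozen reference model.
import torch
import torch.nn.functional as F


def dna_style_loss(pos_logp: torch.Tensor,
                   neg_logps: torch.Tensor,
                   ref_pos_logp: torch.Tensor,
                   ref_neg_logps: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss with multiple diverse negatives per example.

    pos_logp:      (B,)   policy log-likelihood of the sensor-aware answer
    neg_logps:     (B, K) policy log-likelihoods of K negative answers
    ref_pos_logp:  (B,)   same positive likelihood under the reference model
    ref_neg_logps: (B, K) same negative likelihoods under the reference model
    """
    # Implicit reward of each answer, measured relative to the reference model.
    pos_reward = beta * (pos_logp - ref_pos_logp)        # (B,)
    neg_reward = beta * (neg_logps - ref_neg_logps)      # (B, K)
    # Push the positive answer above every negative; average over the K negatives.
    margins = pos_reward.unsqueeze(1) - neg_reward       # (B, K)
    return -F.logsigmoid(margins).mean()


# Toy usage with random log-likelihoods (batch of 2, 3 negatives each).
B, K = 2, 3
loss = dna_style_loss(torch.randn(B, requires_grad=True),
                      torch.randn(B, K, requires_grad=True),
                      torch.randn(B),
                      torch.randn(B, K))
print(float(loss))
loss.backward()  # in real training this would update the VLM policy parameters
```

Averaging over the K negatives is only one simple way to exploit diverse negatives; the actual DNA objective may weight, sample, or structure the negative attributes differently.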
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models (2024)
- Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning (2024)
- Synthetic Vision: Training Vision-Language Models to Understand Physics (2024)
- PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation (2024)
- Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models (2024)
- CompCap: Improving Multimodal Large Language Models with Composite Captions (2024)
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (2024)