Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Abstract
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
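As a rough illustration of the design described above, the sketch below shows how a special [SEG] token emitted by the LLM could be projected into a prompt embedding for SAM-2's mask decoder. This is our own minimal PyTorch sketch, not the released Sa2VA code: the class and variable names (SegTokenBridge, seg_token_mask, etc.) and the hidden dimensions are placeholders.

```python
import torch
import torch.nn as nn


class SegTokenBridge(nn.Module):
    """Illustrative bridge from LLM hidden states to SAM-2 prompt embeddings.

    Hypothetical sketch of the decoupled design described in the abstract:
    the LLM emits a special [SEG] token, and its final hidden state is
    projected into the prompt space consumed by SAM-2's mask decoder.
    """

    def __init__(self, llm_hidden_dim: int = 4096, sam2_prompt_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_hidden_dim, sam2_prompt_dim),
            nn.GELU(),
            nn.Linear(sam2_prompt_dim, sam2_prompt_dim),
        )

    def forward(self, hidden_states: torch.Tensor, seg_token_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, llm_hidden_dim), LLM last-layer states
        # seg_token_mask: (batch, seq_len) boolean mask marking [SEG] positions
        seg_hidden = hidden_states[seg_token_mask]   # (num_seg_tokens, llm_hidden_dim)
        return self.proj(seg_hidden)                 # (num_seg_tokens, sam2_prompt_dim)


# Hypothetical wiring: the projected [SEG] embeddings would be passed to
# SAM-2's mask decoder as prompt embeddings, alongside SAM-2's visual
# features, to produce per-object masks.
bridge = SegTokenBridge()
hidden = torch.randn(1, 32, 4096)                # fake LLM hidden states
mask = torch.zeros(1, 32, dtype=torch.bool)
mask[0, -1] = True                               # pretend the last token is [SEG]
prompt_embeddings = bridge(hidden, mask)         # shape: (1, 256)
```

The full model additionally feeds visual features into the LLM and relies on SAM-2's memory mechanism to propagate masks across video frames; the sketch only captures the text-to-mask grounding direction.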
Community
We present Sa2VA, the first unified model for dense grounded understanding of both images and videos (see the abstract above for details).
Code: https://github.com/magic-research/Sa2VA
Paper: https://arxiv.org/abs/2501.04001
Hugging Face: https://huggingface.co/ByteDance/Sa2VA-4B
Project Page: https://lxtgh.github.io/project/sa2va/
You are welcome to try our MLLM models and send your feedback to [email protected], [email protected], or [email protected].
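To get started, here is a minimal usage sketch for the Sa2VA-4B checkpoint. Loading with `AutoModel`/`AutoTokenizer` and `trust_remote_code=True` is standard Transformers usage; the `predict_forward` call and its keyword arguments are assumptions about the repository's custom code, so please consult the model card for the exact interface.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "ByteDance/Sa2VA-4B"

# Standard Transformers loading; the repo ships custom modeling code,
# hence trust_remote_code=True. Assumes a CUDA-capable GPU is available.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")

# NOTE: `predict_forward` and its keyword arguments are an assumption about
# the custom interface; check the model card for the exact call signature.
result = model.predict_forward(
    image=image,
    text="<image>Please segment the person on the left.",
    tokenizer=tokenizer,
)
print(result.get("prediction"))  # text answer; segmentation masks are also returned
```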
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models (2024)
- HyperSeg: Towards Universal Visual Segmentation with Large Language Model (2024)
- SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation (2024)
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity (2024)
- LinVT: Empower Your Image-level Large Language Model to Understand Videos (2024)
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM (2024)
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability (2024)