Abstract
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions of these. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, which significantly improves tool-use performance and enables new scenarios.
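The skill-repository idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Skill` class, the tool names, and the keyword-based `select_skills` function are all hypothetical, and keyword matching only stands in for the tool-use behavior that LLaVA-Plus learns from instruction-following data. The placeholder tools return strings where real entries would call pre-trained vision models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    """A hypothetical entry in the skill repository."""
    name: str
    keywords: List[str]               # toy trigger words (assumption, not learned)
    run: Callable[[str, str], str]    # (image, instruction) -> tool output


def build_repository() -> Dict[str, Skill]:
    # Placeholder tools; real entries would wrap pre-trained models.
    skills = [
        Skill("detection", ["detect", "find", "locate"],
              lambda image, text: f"[boxes for {image}]"),
        Skill("generation", ["generate", "draw", "create"],
              lambda image, text: f"[new image from '{text}']"),
        Skill("retrieval", ["who", "where", "history"],
              lambda image, text: f"[external facts for {image}]"),
    ]
    return {s.name: s for s in skills}


def select_skills(instruction: str, repo: Dict[str, Skill]) -> List[str]:
    """Pick every skill whose trigger word appears in the instruction."""
    words = instruction.lower().split()
    return [name for name, skill in repo.items()
            if any(k in words for k in skill.keywords)]


def answer(image: str, instruction: str, repo: Dict[str, Skill]) -> str:
    """Run the selected tools on the same image and join their outputs."""
    chosen = select_skills(instruction, repo)
    if not chosen:
        return "no tool needed: answer directly"
    return " ".join(repo[name].run(image, instruction) for name in chosen)
```

Note that the same image is passed to every selected tool, which loosely mirrors the abstract's point that the image query stays engaged throughout the session; composing several tools in one turn corresponds to the "compositions" case.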
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing (2023)
- Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning (2023)
- Improved Baselines with Visual Instruction Tuning (2023)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants (2023)
- Lightweight In-Context Tuning for Multimodal Unified Models (2023)
LLaVA-Plus: Revolutionizing Multimodal Assistants with Tool Learning
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/
Models citing this paper: 7
Datasets citing this paper: 0