arxiv:2310.03744

Improved Baselines with Visual Instruction Tuning

Published on Oct 5, 2023
· Submitted by akhaliq on Oct 6, 2023

Abstract

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
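
The "MLP projection" mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation. It assumes a two-layer Linear, GELU, Linear connector applied to each visual token independently, a ViT patch size of 14 (so a 336px image yields a 24 x 24 grid of patches), and toy feature dimensions chosen only to keep the example fast. All function names are hypothetical.

```python
import math
import random


def gelu(x: float) -> float:
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    # weight has shape [out_dim][in_dim]; returns a vector of length out_dim.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Small uniform initialisation; the scheme is an arbitrary choice here.
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    # A ViT splits the image into a grid of non-overlapping square patches.
    return (image_size // patch_size) ** 2


def mlp_project(tokens, layer1, layer2):
    # Map each visual token into the language model's embedding space:
    # Linear -> GELU -> Linear, applied token by token.
    projected = []
    for tok in tokens:
        hidden = [gelu(h) for h in linear(tok, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16           # toy sizes (assumption, not the real widths)
n_tokens = num_patch_tokens(336, 14)  # patch size 14 is an assumption

layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)

# Four random stand-in "visual tokens" instead of real vision-encoder features.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(4)]
out = mlp_project(tokens, layer1, layer2)
print(n_tokens, len(out), len(out[0]))
```

The projected tokens have the language model's embedding width, so under these assumptions they could be placed alongside text token embeddings in the input sequence.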

Community

Paper author

Check out our LLaVA-1.6 blog post as well!

LLaVA-1.6: Improved reasoning, OCR, and world knowledge
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/

Demo: https://llava.hliu.cc/

There is also an updated technical report here: Improved Baselines with Visual Instruction Tuning

Unlocking the Power of Simple Modifications in Multimodal Learning

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

Models citing this paper 39

Datasets citing this paper 4

Spaces citing this paper 44

Collections including this paper 22