MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Abstract
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (2024)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (2024)
- Efficient Multimodal Learning from Data-centric Perspective (2024)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (2024)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter (2024)
Here's my summary:
This paper from Apple presents MM1, a family of multimodal AI models that combine vision and language understanding. The researchers conducted extensive experiments to identify the key factors driving performance in these models, testing different architectural choices and pre-training data mixtures.
My highlights from the paper:
Big one of course: The largest MM1 model (30B dense) achieves state-of-the-art few-shot learning on multimodal benchmarks
Key points:
- MM1 includes both dense models up to 30B parameters and mixture-of-experts (MoE) variants
- Among image-encoder choices, image resolution has the biggest impact on performance, more than encoder model size
- Specific vision-language connector design has little effect
- Mixing interleaved image+text, caption, and text-only data in pre-training is crucial
- 5:5:1 ratio of caption, interleaved, and text data works best
- Synthetic caption data helps for few-shot learning
- The 30B dense model beats prior SOTA on VQA and captioning tasks
The core insight is that deliberate data and architecture choices, not just scale, are key to building performant multimodal models. The MM1 models also exhibit impressive emergent abilities like multi-image reasoning and in-context few-shot learning.
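To make the 5:5:1 ratio concrete, here is a minimal sketch that turns it into per-source sampling probabilities and draws a few examples. The source names and helper functions are hypothetical, and this is only one way to read a mixing ratio; the paper's actual data loader is not described in this thread.

```python
import random
from collections import Counter

# Ratio quoted in the summary above: caption : interleaved : text-only.
# The key names are illustrative labels, not identifiers from the paper.
RATIO = {"caption": 5, "interleaved": 5, "text_only": 1}


def mixture_weights(ratio):
    """Normalize integer ratio parts into probabilities that sum to 1."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n source names with probability proportional to the ratio."""
    rng = random.Random(seed)
    names = list(ratio)
    return rng.choices(names, weights=[ratio[k] for k in names], k=n)


weights = mixture_weights(RATIO)
counts = Counter(sample_sources(RATIO, 11_000))

for name, weight in weights.items():
    print(f"{name}: p={weight:.3f}, drawn={counts[name]}")
```

With these weights, caption and interleaved data each make up about 45% of the samples and text-only data about 9%.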
Amazing report. Thanks.
Wen models?
Thanks for providing such a large collection of recipes for building vision-language models.
I have one question regarding this paper.
Do you have experiments with (a) a simple linear connector that does not reduce the number of image tokens, (b) a linear connector that does reduce it, (c) a C-Abstractor that reduces the number of image tokens, and (d) a C-Abstractor that does not?
I would like to know more about recipes for compressing image tokens.
Good questions. I'm assuming that by "compression token number" you mean a connector that outputs fewer image tokens than it receives as input. In this work, we only considered connectors that support reducing the total number of image tokens, because we train with 16 images in each sequence at a resolution of 378x378 pixels per image. With patch size 14, this results in (378/14)^2 = 729 output patches for every image. Multiplied by 16 images, this gives 11,664 image patches ("tokens") for each sequence (and we use a batch of 512 sequences per pre-training step).
This is a lot of image tokens! Instead, we explored using at most 144 tokens per image (roughly a 5x reduction). This number is partially motivated by the results from the HoneyBee paper, which provides some ablations you may be interested in: https://arxiv.org/abs/2312.06742
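The arithmetic in this reply can be checked with a short sketch. The function and constant names are illustrative only; the numbers are the ones quoted above.

```python
def patches_per_image(resolution: int, patch_size: int) -> int:
    """Count non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    return side * side


RESOLUTION = 378                  # pixels per side
PATCH_SIZE = 14                   # pixels per patch side
IMAGES_PER_SEQUENCE = 16
CONNECTOR_TOKENS_PER_IMAGE = 144  # upper bound explored in the reply

raw_per_image = patches_per_image(RESOLUTION, PATCH_SIZE)
raw_per_sequence = raw_per_image * IMAGES_PER_SEQUENCE
reduced_per_sequence = CONNECTOR_TOKENS_PER_IMAGE * IMAGES_PER_SEQUENCE
reduction = raw_per_image / CONNECTOR_TOKENS_PER_IMAGE

print(f"patches per image:            {raw_per_image}")
print(f"image tokens per sequence:    {raw_per_sequence}")
print(f"after connector (at most):    {reduced_per_sequence}")
print(f"per-image reduction factor:   {reduction:.2f}x")
```

Under these numbers, the connector cuts the image tokens in a 16-image sequence from 11,664 to at most 2,304, which fits inside a 4096-token sequence with room left for text.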
How did you choose the empirical setup before running the ablations on the image encoder, resolution, VL connector, and data composition?
I am unsure whether the conclusions of a given ablation would still hold if the other settings had been fixed differently. [The whole work is very impressive, because the number of possible configurations is very large.]
@bmckinz How many tokens are used in pre-training? The paper says 100B tokens are used for pre-training, but by the paper's own numbers, 200k (steps) * 4096 (seq) * 512 (bsz) ≈ 400B tokens seem to be used for training.
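The estimate in this question can be reproduced with a few lines. This is a rough upper bound that assumes every sequence is fully packed to the maximum length and that image tokens count toward the total; the constant names are illustrative.

```python
STEPS = 200_000    # pre-training steps
SEQ_LEN = 4096     # tokens per sequence
BATCH_SIZE = 512   # sequences per step

tokens_per_step = SEQ_LEN * BATCH_SIZE
total_tokens = STEPS * tokens_per_step

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,} (~{total_tokens / 1e9:.1f}B)")
```

The product comes to about 419B tokens, which is consistent with a figure of roughly 400B.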
Unpacking MM1: The Future of Multimodal Large Language Models
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/
In the paper, the authors mentioned the following on page 10:
We initialize both the image encoder and the underlying LLM decoder weights for MM1 from in-house pre-trained models. We then perform multimodal pre-training on the above data mix for 200k steps (approx. 400B tokens). All models are pretrained entirely unfrozen with sequence length 4096, up to 16 images per sequence at 378×378 resolution, with a batch size of 512 sequences.
Given that the multimodal pre-training dataset contains both text and images, I am wondering what loss function was used during this multimodal pre-training phase. It does not seem to be mentioned in the paper.