Richard "TacImpulse" Scott's picture
1 35

Richard "TacImpulse" Scott

TacImpulse

AI & ML interests

None yet

Recent Activity

liked a Space about 1 month ago
gokaygokay/AuraSR-v2
liked a Space about 1 month ago
jhj0517/AdvancedLivePortrait-WebUI
liked a Space about 1 month ago
jhj0517/musepose

Organizations

Tactical Impulse

TacImpulse's activity

reacted to singhsidhukuldeep's post with πŸ‘ 4 months ago
The good folks at @nvidia have just released NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results across vision-language tasks.

Here is how they did it:

1. Model Architecture Design:
- Developed three model architectures:
a) NVLM-D: Decoder-only architecture
b) NVLM-X: Cross-attention-based architecture
c) NVLM-H: Novel hybrid architecture (the two underlying fusion styles are sketched below)
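
To make the distinction concrete, here is a minimal, hypothetical sketch of the two fusion styles the family is built on (module names and dimensions are invented for illustration, not taken from NVLM's code). Roughly: NVLM-D projects image tokens into the LLM's input sequence, NVLM-X keeps them outside and reads them via gated cross-attention, and NVLM-H mixes the two.

```python
import torch
import torch.nn as nn

class DecoderOnlyFusion(nn.Module):
    """NVLM-D style: project image features into the LLM's embedding space
    and concatenate them with the text tokens into one long sequence."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.projector = nn.Sequential(      # the modality-alignment MLP
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats, text_embeds):
        # image_feats: (B, N_img, vision_dim); text_embeds: (B, N_txt, llm_dim)
        img_tokens = self.projector(image_feats)
        return torch.cat([img_tokens, text_embeds], dim=1)

class CrossAttentionFusion(nn.Module):
    """NVLM-X style: image tokens never enter the LLM's sequence; text hidden
    states read them through gated cross-attention layers instead."""
    def __init__(self, llm_dim: int = 4096, vision_dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.xattn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as identity

    def forward(self, text_hidden, image_feats):
        kv = self.kv_proj(image_feats)
        attended, _ = self.xattn(text_hidden, kv, kv)
        return text_hidden + self.gate.tanh() * attended
```

The trade-off is the familiar one: the decoder-only route is simple but inflates sequence length with image tokens, while cross-attention keeps sequences short at the cost of extra layers; the hybrid tries to capture the best of both.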

2. Vision Encoder:
- Used InternViT-6B-448px-V1-5 as the vision encoder
- Implemented dynamic high-resolution (DHR) input handling (tiling sketched below)
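
A rough sketch of what DHR tiling involves: split the image into a grid of encoder-sized crops whose grid shape roughly preserves the aspect ratio, plus a global thumbnail. The 448-pixel tile size follows the encoder's input resolution; the grid-selection heuristic here is illustrative, not the paper's exact procedure.

```python
from PIL import Image

def dynamic_tiles(img: Image.Image, tile: int = 448, max_tiles: int = 6):
    """Split an image into a grid of tile x tile crops whose grid shape
    roughly matches the image's aspect ratio, plus a global thumbnail."""
    w, h = img.size
    # Pick the grid (cols, rows) with cols * rows <= max_tiles whose aspect
    # ratio is closest to the image's.
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(cols / rows - w / h)
            if err < best_err:
                best, best_err = (cols, rows), err
    cols, rows = best
    resized = img.resize((cols * tile, rows * tile))
    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    return tiles + [img.resize((tile, tile))]  # global thumbnail last
```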

3. Language Model:
- Used Qwen2-72B-Instruct as the base LLM (loading snippet below)
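
The base model is available on the Hub, and loading it is standard transformers usage. Note that NVLM's released weights are separate checkpoints; this only identifies the text backbone, and a 72B model needs serious multi-GPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",   # take precision from the checkpoint
    device_map="auto",    # shard across available GPUs (needs `accelerate`)
)
```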

4. Training Data Curation:
- Carefully curated high-quality pretraining and supervised fine-tuning datasets
- Included diverse task-oriented datasets for various capabilities

5. Pretraining:
- Froze both the LLM and the vision encoder
- Trained only the modality-alignment modules (e.g., the MLP projector, cross-attention layers); see the freezing sketch below
- Used a large batch size of 2048
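
In code, this stage boils down to freezing the two backbones and handing the optimizer only the alignment parameters. A minimal sketch, with tiny stand-in modules in place of the real 6B/72B networks:

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False

# Stand-ins for the real backbones and alignment module (illustrative only).
llm = nn.Linear(4096, 4096)
vision_encoder = nn.Linear(1024, 1024)
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))

freeze(llm)
freeze(vision_encoder)

# Only the alignment module's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
```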

6. Supervised Fine-Tuning (SFT):
- Unfroze the LLM while keeping the vision encoder frozen
- Trained on multimodal SFT datasets plus high-quality text-only SFT data
- Implemented 1-D tile tagging for dynamic high-resolution inputs (sketched below)
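
The idea of 1-D tile tagging is to mark where each DHR tile begins once all tiles are flattened into a single token sequence, so the decoder can tell them apart. A hypothetical sketch (tag names, the embedding helper, and dimensions are all invented for illustration, not NVLM's exact scheme):

```python
import torch

def toy_tag_embedder(tag: str, dim: int = 64) -> torch.Tensor:
    # Stand-in for however a tag would really be embedded: a deterministic
    # pseudo-embedding seeded by the tag's characters.
    g = torch.Generator().manual_seed(sum(ord(c) for c in tag))
    return torch.randn(1, dim, generator=g)

def tag_tiles(tile_embeds, thumbnail_embed, embed_tag=toy_tag_embedder):
    """Prefix each tile's token embeddings with a tag embedding, then
    flatten everything into one 1-D sequence for the decoder."""
    pieces = [embed_tag("<tile_global>"), thumbnail_embed]
    for i, emb in enumerate(tile_embeds):            # emb: (N_tokens, dim)
        pieces.append(embed_tag(f"<tile_{i + 1}>"))  # illustrative tag names
        pieces.append(emb)
    return torch.cat(pieces, dim=0)

# Example: four tiles plus a thumbnail, 10 tokens each, embedding dim 64.
tiles = [torch.randn(10, 64) for _ in range(4)]
seq = tag_tiles(tiles, torch.randn(10, 64))
print(seq.shape)  # torch.Size([55, 64]): 5 tags + 5 * 10 tile tokens
```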

7. Evaluation:
- Evaluated on multiple vision-language benchmarks
- Compared performance to leading proprietary and open-source models

8. Optimization:
- Iterated on model designs and training approaches
- Used smaller 34B models for faster experimentation before scaling to 72B

9. Now comes the best part... Open-Sourcing:
- Released model weights and full technical details to the research community

The paper provides fascinating insights into architecture design, training data curation, and achieving production-grade multimodality. A must-read for anyone working on multimodal AI!