-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 182 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 17 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 49 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 41
Collections
Discover the best community collections!
Collections including paper arxiv:2411.19930
-
instruction-pretrain/finance-Llama3-8B
Text Generation • Updated • 815 • 55 -
AdaptLLM/finance-chat
Text Generation • Updated • 928 • 87 -
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper • 2411.19930 • Published • 25 -
HuggingFaceM4/Idefics3-8B-Llama3
Image-Text-to-Text • Updated • 27.1k • 261
-
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper • 2411.19930 • Published • 25 -
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
Paper • 2412.10704 • Published • 15 -
Multi-task retriever fine-tuning for domain-specific and efficient RAG
Paper • 2501.04652 • Published • 10 -
M-A-D/Mixed-Arabic-Datasets-Repo
Viewer • Updated • 209M • 1.7k • 29
-
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper • 2410.13861 • Published • 53 -
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Paper • 2411.07975 • Published • 27 -
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 72 -
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 43
-
Rethinking Data Selection at Scale: Random Selection is Almost All You Need
Paper • 2410.09335 • Published • 16 -
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Paper • 2410.06456 • Published • 36 -
Emergent properties with repeated examples
Paper • 2410.07041 • Published • 8 -
Personalized Visual Instruction Tuning
Paper • 2410.07113 • Published • 70
-
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Paper • 2410.02740 • Published • 52 -
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Paper • 2410.01215 • Published • 30 -
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 106 -
EuroLLM: Multilingual Language Models for Europe
Paper • 2409.16235 • Published • 26