Collections including paper arxiv:2410.11779

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 181
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 17
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 49
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 41

- Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
  Paper • 2411.14257 • Published • 9
- Distinguishing Ignorance from Error in LLM Hallucinations
  Paper • 2410.22071 • Published
- DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations
  Paper • 2410.18860 • Published • 9
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
  Paper • 2410.11779 • Published • 25

- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 51
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
  Paper • 2408.08872 • Published • 98
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 124
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 51

- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
  Paper • 2403.05530 • Published • 62
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
  Paper • 2410.11779 • Published • 25
- What Matters in Transformers? Not All Attention is Needed
  Paper • 2406.15786 • Published • 30
- Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
  Paper • 2410.10774 • Published • 25

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 41
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22