- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 73
- BRAVE: Broadening the visual encoding of vision-language models
  Paper • 2404.07204 • Published • 18
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 45
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 106
Collections including paper arxiv:2409.12191
- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 12
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 53
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 87
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 31

- The Llama 3 Herd of Models
  Paper • 2407.21783 • Published • 110
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- Baichuan Alignment Technical Report
  Paper • 2410.14940 • Published • 50
- A Survey of Small Language Models
  Paper • 2410.20011 • Published • 40

- Qwen2.5-Coder Technical Report
  Paper • 2409.12186 • Published • 139
- Attention Heads of Large Language Models: A Survey
  Paper • 2409.03752 • Published • 89
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
  Paper • 2409.02634 • Published • 90
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 109

- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 106
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- mistralai/Pixtral-12B-2409
  Image-Text-to-Text • Updated • 562
- HuggingFaceTB/SmolVLM-Instruct
  Image-Text-to-Text • Updated • 54.9k • 309

- Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
  Paper • 2409.08513 • Published • 11
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
  Paper • 2409.08264 • Published • 43
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- LLMs + Persona-Plug = Personalized LLMs
  Paper • 2409.11901 • Published • 32

- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 87
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 13
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 37
- PALO: A Polyglot Large Multimodal Model for 5B People
  Paper • 2402.14818 • Published • 23