aslessor's Collections
Vision
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
•
2406.16860
•
Published
•
59
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
•
2407.02477
•
Published
•
21
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
•
2408.10188
•
Published
•
51
Building and better understanding vision-language models: insights and future directions
Paper
•
2408.12637
•
Published
•
124
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Paper
•
2408.13257
•
Published
•
26
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
•
2408.16500
•
Published
•
56
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Paper
•
2408.17253
•
Published
•
37
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper
•
2409.01704
•
Published
•
83
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
Paper
•
2409.01322
•
Published
•
95
NVLM: Open Frontier-Class Multimodal LLMs
Paper
•
2409.11402
•
Published
•
72
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
Paper
•
2409.11406
•
Published
•
25
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Paper
•
2410.09733
•
Published
•
8
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper
•
2409.02889
•
Published
•
55
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper
•
2412.08737
•
Published
•
52
Progressive Multimodal Reasoning via Active Retrieval
Paper
•
2412.14835
•
Published
•
69