che111's Collections
General Multimodal Learning (updated)
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (arXiv:2401.14405, 12 upvotes)
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (arXiv:2406.18521, 28 upvotes)
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations (arXiv:2408.12590, 35 upvotes)
Law of Vision Representation in MLLMs (arXiv:2408.16357, 92 upvotes)
CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500, 56 upvotes)
Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637, 124 upvotes)
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution (arXiv:2409.12961, 25 upvotes)
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740, 52 upvotes)
Video Instruction Tuning With Synthetic Data (arXiv:2410.02713, 38 upvotes)
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (arXiv:2410.03051, 4 upvotes)
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models (arXiv:2410.03290, 7 upvotes)
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation (arXiv:2410.01912, 14 upvotes)
MIO: A Foundation Model on Multimodal Tokens (arXiv:2409.17692, 53 upvotes)
Emu3: Next-Token Prediction is All You Need (arXiv:2409.18869, 94 upvotes)
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (arXiv:2409.20566, 54 upvotes)
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (arXiv:2410.13848, 32 upvotes)
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (arXiv:2408.15998, 84 upvotes)
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos (arXiv:2411.04923, 20 upvotes)
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval (arXiv:2412.01558, 4 upvotes)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? (arXiv:2412.02611, 23 upvotes)
VisionZip: Longer is Better but Not Necessary in Vision Language Models (arXiv:2412.04467, 105 upvotes)
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (arXiv:2412.04424, 58 upvotes)
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions (arXiv:2412.08737, 52 upvotes)
Multimodal Latent Language Modeling with Next-Token Diffusion (arXiv:2412.08635, 41 upvotes)
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation (arXiv:2412.09585, 10 upvotes)
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283, 19 upvotes)
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval (arXiv:2412.14475, 52 upvotes)
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks (arXiv:2412.15204, 32 upvotes)
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception (arXiv:2412.14233, 6 upvotes)