MLLM-as-a-Judge for Image Safety without Human Labeling Paper • 2501.00192 • Published 6 days ago • 22
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published 4 days ago • 75
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Paper • 2412.18619 • Published 21 days ago • 49
On the Compositional Generalization of Multimodal LLMs for Medical Imaging Paper • 2412.20070 • Published 9 days ago • 40
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization Paper • 2412.18525 • Published 13 days ago • 62
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks Paper • 2412.18072 • Published 13 days ago • 14
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation Paper • 2412.18176 • Published 13 days ago • 15
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Paper • 2412.18609 • Published 12 days ago • 13
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper • 2412.18319 • Published 13 days ago • 34
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding Paper • 2412.17295 • Published 14 days ago • 9
Diving into Self-Evolving Training for Multimodal Reasoning Paper • 2412.17451 • Published 14 days ago • 41
FastVLM: Efficient Vision Encoding for Vision Language Models Paper • 2412.13303 • Published 19 days ago • 13
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception Paper • 2412.14233 • Published 18 days ago • 6
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer Paper • 2412.13871 • Published 19 days ago • 17
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper • 2412.14171 • Published 18 days ago • 23
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning Paper • 2412.11974 • Published 21 days ago • 9
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain Paper • 2412.13018 • Published 20 days ago • 41
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models Paper • 2412.12606 • Published 20 days ago • 41