Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models Paper • 2411.04996 • Published Nov 7, 2024 • 50
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents Paper • 2410.10594 • Published Oct 14, 2024 • 24
UI Agent Collection a collection of algorithmic agents for user interfaces/interactions and program synthesis • 236 items • Updated 2 days ago • 38
GUICourse: From General Vision Language Models to Versatile GUI Agents Paper • 2406.11317 • Published Jun 17, 2024 • 1
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18, 2024 • 16
Article ColPali: Efficient Document Retrieval with Vision Language Models By manu • Jul 5, 2024 • 183
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published Jun 26, 2024 • 28
Article An Analysis of Chinese LLM Censorship and Bias with Qwen 2 Instruct By leonardlin • Jun 11, 2024 • 50
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024 • 20
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation Paper • 2405.14598 • Published May 23, 2024 • 11
RoHM: Robust Human Motion Reconstruction via Diffusion Paper • 2401.08570 • Published Jan 16, 2024 • 1
MultiBooth: Towards Generating All Your Concepts in an Image from Text Paper • 2404.14239 • Published Apr 22, 2024 • 8
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper • 2405.09818 • Published May 16, 2024 • 126
What matters when building vision-language models? Paper • 2405.02246 • Published May 3, 2024 • 101
DistilBERT release Collection Original DistilBERT model, checkpoints obtained from using teacher-student learning from the original BERT checkpoints. • 6 items • Updated Apr 17, 2024 • 15