Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published Dec 5, 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024
Automated Data Curation for Robust Language Model Fine-Tuning Paper • 2403.12776 • Published Mar 19, 2024
Image Sculpting: Precise Object Editing with 3D Geometry Control Paper • 2401.01702 • Published Jan 2, 2024
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition Paper • 2203.07996 • Published Feb 24, 2022
Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models Paper • 2211.10950 • Published Nov 20, 2022