MultiModal - a Feynman Collection

updated Jul 30, 2024

MM-LLMs: Recent Advances in MultiModal Large Language Models

Paper • 2401.13601 • Published Jan 24, 2024 • 45
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20, 2024 • 14
Neural Network Diffusion

Paper • 2402.13144 • Published Feb 20, 2024 • 95
FlashTex: Fast Relightable Mesh Texturing with LightControlNet

Paper • 2402.13251 • Published Feb 20, 2024 • 13
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

Paper • 2403.00522 • Published Mar 1, 2024 • 44
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Paper • 2403.04692 • Published Mar 7, 2024 • 39
OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on

Paper • 2403.01779 • Published Mar 4, 2024 • 28
CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Paper • 2403.05034 • Published Mar 8, 2024 • 20
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Paper • 2403.05121 • Published Mar 8, 2024 • 22
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Paper • 2403.01422 • Published Mar 3, 2024 • 26
DressCode: Autoregressively Sewing and Generating Garments from Text Guidance

Paper • 2401.16465 • Published Jan 29, 2024 • 11
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer

Paper • 2405.17405 • Published May 27, 2024 • 14
Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Paper • 2405.15757 • Published May 24, 2024 • 14
Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Paper • 2405.20204 • Published May 30, 2024 • 35
ColPali: Efficient Document Retrieval with Vision Language Models

Paper • 2407.01449 • Published Jun 27, 2024 • 42
Honeybee: Locality-enhanced Projector for Multimodal LLM

Paper • 2312.06742 • Published Dec 11, 2023 • 9