henern's Collections
Vision
updated
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper
•
2402.17177
•
Published
•
88
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Paper
•
2402.17485
•
Published
•
190
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper
•
2403.00522
•
Published
•
44
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Paper
•
2403.04692
•
Published
•
39
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Paper
•
2311.12793
•
Published
•
18
FlashFace: Human Image Personalization with High-fidelity Identity Preservation
Paper
•
2403.17008
•
Published
•
19
An Introduction to Vision-Language Modeling
Paper
•
2405.17247
•
Published
•
87
Paper
•
2406.09414
•
Published
•
95
Vision language models are blind
Paper
•
2407.06581
•
Published
•
83
SAM 2: Segment Anything in Images and Videos
Paper
•
2408.00714
•
Published
•
110
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
•
2408.01800
•
Published
•
79
Paper
•
2408.07009
•
Published
•
61
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
•
2408.08872
•
Published
•
98
Building and better understanding vision-language models: insights and future directions
Paper
•
2408.12637
•
Published
•
124
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
•
2408.16500
•
Published
•
56
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation
Paper
•
2409.12576
•
Published
•
16
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Paper
•
2412.07589
•
Published
•
46