Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
Abstract
We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive text-to-video generation capabilities, maintaining a consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. Code and models will be made publicly available at: https://github.com/dvlab-research/MagicMirror/
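The abstract names Conditioned Adaptive Normalization as the mechanism for injecting identity features into the diffusion transformer. As an illustration only, the sketch below shows one common way such identity conditioning is realized: an adaLN-style layer whose per-channel scale and shift are predicted from an identity embedding. The class name, dimensions, and overall structure are assumptions for clarity, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ConditionedAdaptiveNorm(nn.Module):
    """Illustrative sketch: LayerNorm whose scale/shift are predicted from an
    identity embedding, in the spirit of adaLN used in diffusion transformers.
    Names and shapes are assumptions, not Magic Mirror's implementation."""

    def __init__(self, hidden_dim: int, id_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict per-channel scale and shift from the identity condition.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(id_dim, 2 * hidden_dim),
        )

    def forward(self, x: torch.Tensor, id_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, hidden_dim); id_emb: (batch, id_dim)
        scale, shift = self.to_scale_shift(id_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


if __name__ == "__main__":
    block = ConditionedAdaptiveNorm(hidden_dim=1024, id_dim=512)
    tokens = torch.randn(2, 16, 1024)    # video/latent tokens
    identity = torch.randn(2, 512)       # pooled facial identity features
    print(block(tokens, identity).shape) # torch.Size([2, 16, 1024])
```

Because only the scale/shift projection carries the identity signal, a conditioning layer of this kind adds relatively few parameters on top of a frozen backbone, which is consistent with the lightweight-adapter framing in the abstract.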
Community
Project page: https://julianjuaner.github.io/projects/MagicMirror/index.html
Code (coming soon): https://github.com/dvlab-research/MagicMirror
Related papers recommended by the Semantic Scholar API:
- Ingredients: Blending Custom Photos with Video Diffusion Transformers (2025)
- DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting (2024)
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation (2024)
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition (2024)
- StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart (2024)
- REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents (2024)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation (2024)
No models, datasets, or Spaces currently cite this paper.