VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
Abstract
While diffusion models show extraordinary capabilities in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there remains a gap between generated images and real-world aesthetic images in finer-grained dimensions such as color, lighting, and composition. In this paper, we propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to improve the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into a content description and an aesthetic description through the initialization of an aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving image-text alignment. Through this careful design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.
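To make the mechanism concrete, here is a minimal PyTorch sketch of what value-mixed cross-attention with a zero-initialized linear layer could look like. It is one interpretation of the description above, not the official implementation; the class name, argument names, and shape conventions are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class ValueMixedCrossAttention(nn.Module):
    """Cross-attention whose values mix a content and an aesthetic condition (sketch)."""

    def __init__(self, query_dim, context_dim, aes_dim, heads=8, dim_head=64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        # Extra value projection for the aesthetic embedding, zero-initialized so that
        # before any training the module reproduces the base model's cross-attention.
        self.to_v_aes = nn.Linear(aes_dim, inner_dim, bias=False)
        nn.init.zeros_(self.to_v_aes.weight)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, content_emb, aes_emb):
        # x:           (B, N, query_dim)   latent image tokens
        # content_emb: (B, L, context_dim) content text embedding
        # aes_emb:     (B, L, aes_dim)     aesthetic embedding (same token length L here)
        b, n, _ = x.shape
        q, k = self.to_q(x), self.to_k(content_emb)
        # Only the values are mixed; queries and keys, and hence the attention map
        # that governs image-text alignment, are left untouched.
        v = self.to_v(content_emb) + self.to_v_aes(aes_emb)

        def split_heads(t):  # (B, S, H*D) -> (B, H, S, D)
            return t.view(b, t.shape[1], self.heads, -1).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```

Because `to_v_aes` starts at zero, the module initially behaves exactly like the base cross-attention, which is what would allow such an adapter to be dropped into pretrained community models without retraining them.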
Community
🔥Make your model aesthetic again🔥
We introduce VMix, which offers improved aesthetic guidance to the model via a novel condition control method called value-mixed cross-attention. VMix serves as an innovative plug-and-play adapter, designed to systematically enhance aesthetic quality.
Highlight:
- We analyze the differences in generated images across fine-grained aesthetic dimensions and propose disentangling these attributes in the text prompt, providing a clear direction for model optimization (see the sketch after this list).
- We introduce VMix, which disentangles the input text prompt into content description and aesthetic description, offering improved guidance to the model via a novel condition control method called value-mixed cross-attention.
- The proposed VMix approach is universally effective for existing diffusion models, serving as a plug-and-play aesthetic adapter that is highly compatible with community modules.
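For the disentanglement highlighted above, the following is a hypothetical sketch of how a prompt could be split into a content description plus fine-grained aesthetic tags and encoded separately. The tag list, dataclass, and function names are invented for illustration, and the encoder call assumes a Hugging Face CLIP-style text model rather than the official codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DisentangledPrompt:
    content: str                    # what to draw
    aesthetics: List[str] = field(  # how it should look (color, lighting, composition, ...)
        default_factory=lambda: ["vivid color", "cinematic lighting", "balanced composition"]
    )


def encode_disentangled(dp, tokenizer, text_encoder):
    """Encode content and aesthetic descriptions separately with a frozen
    CLIP-style text encoder (Hugging Face `transformers` API assumed)."""

    def embed(text):
        # Pad both descriptions to the tokenizer's fixed max length so the two
        # value streams in the adapter can later be added token-wise.
        tokens = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True)
        return text_encoder(**tokens).last_hidden_state

    content_emb = embed(dp.content)
    aes_emb = embed(", ".join(dp.aesthetics))  # feeds the zero-initialized value branch
    return content_emb, aes_emb
```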
Project page: https://vmix-diffusion.github.io/VMix/
Code: https://github.com/fenfenfenfan/VMix
Paper: https://arxiv.org/abs/2412.20800v1
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models (2024)
- Conditional Text-to-Image Generation with Reference Guidance (2024)
- CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis (2024)
- SeedEdit: Align Image Re-Generation to Image Editing (2024)
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation (2024)
- EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM (2024)
- Efficient Scaling of Diffusion Transformers for Text-to-Image Generation (2024)