VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control
Abstract
While diffusion models show extraordinary capabilities in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there remains a gap between generated images and real-world aesthetic images in finer-grained dimensions such as color, lighting, and composition. In this paper, we propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to improve the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into a content description and an aesthetic description through the initialization of an aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving image-text alignment. Through this careful design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.
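To make the mechanism concrete, here is a minimal PyTorch sketch of what value-mixed cross-attention with a zero-initialized linear layer could look like. It is one interpretation of the description above, not the official implementation; the class name, argument names, and shape conventions are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class ValueMixedCrossAttention(nn.Module):
    """Cross-attention whose values mix a content and an aesthetic condition (sketch)."""

    def __init__(self, query_dim, context_dim, aes_dim, heads=8, dim_head=64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        # Extra value projection for the aesthetic embedding, zero-initialized so that
        # before any training the module reproduces the base model's cross-attention.
        self.to_v_aes = nn.Linear(aes_dim, inner_dim, bias=False)
        nn.init.zeros_(self.to_v_aes.weight)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, content_emb, aes_emb):
        # x:           (B, N, query_dim)   latent image tokens
        # content_emb: (B, L, context_dim) content text embedding
        # aes_emb:     (B, L, aes_dim)     aesthetic embedding (same token length L here)
        b, n, _ = x.shape
        q, k = self.to_q(x), self.to_k(content_emb)
        # Only the values are mixed; queries and keys, and hence the attention map
        # that governs image-text alignment, are left untouched.
        v = self.to_v(content_emb) + self.to_v_aes(aes_emb)

        def split_heads(t):  # (B, S, H*D) -> (B, H, S, D)
            return t.view(b, t.shape[1], self.heads, -1).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```

Because `to_v_aes` starts at zero, the module initially behaves exactly like the base cross-attention, which is what would allow such an adapter to be dropped into pretrained community models without retraining them.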
Community
🔥Make your model aesthetic again🔥
We introduce VMix, which offers improved aesthetic guidance to the model via a novel condition control method called value-mixed cross-attention. VMix serves as an innovative plug-and-play adapter, designed to systematically enhance aesthetic quality.
Highlight:
- We analyze the differences in generated images across fine-grained aesthetic dimensions and propose disentangling these attributes in the text prompt, providing a clear direction for model optimization (see the sketch after this list).
- We introduce VMix, which disentangles the input text prompt into content description and aesthetic description, offering improved guidance to the model via a novel condition control method called value-mixed cross-attention.
- The proposed VMix approach is universally effective for existing diffusion models, serving as a plug-and-play aesthetic adapter that is highly compatible with community modules.
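For the disentanglement highlighted above, the following is a hypothetical sketch of how a prompt could be split into a content description plus fine-grained aesthetic tags and encoded separately. The tag list, dataclass, and function names are invented for illustration, and the encoder call assumes a Hugging Face CLIP-style text model rather than the official codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DisentangledPrompt:
    content: str                    # what to draw
    aesthetics: List[str] = field(  # how it should look (color, lighting, composition, ...)
        default_factory=lambda: ["vivid color", "cinematic lighting", "balanced composition"]
    )


def encode_disentangled(dp, tokenizer, text_encoder):
    """Encode content and aesthetic descriptions separately with a frozen
    CLIP-style text encoder (Hugging Face `transformers` API assumed)."""

    def embed(text):
        # Pad both descriptions to the tokenizer's fixed max length so the two
        # value streams in the adapter can later be added token-wise.
        tokens = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True)
        return text_encoder(**tokens).last_hidden_state

    content_emb = embed(dp.content)
    aes_emb = embed(", ".join(dp.aesthetics))  # feeds the zero-initialized value branch
    return content_emb, aes_emb
```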
Project page: https://vmix-diffusion.github.io/VMix/
Code: https://github.com/fenfenfenfan/VMix
Paper: https://arxiv.org/abs/2412.20800v1
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models (2024)
- Conditional Text-to-Image Generation with Reference Guidance (2024)
- CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis (2024)
- SeedEdit: Align Image Re-Generation to Image Editing (2024)
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation (2024)
- EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM (2024)
- Efficient Scaling of Diffusion Transformers for Text-to-Image Generation (2024)