---
license: mit
language:
- en
base_model:
- black-forest-labs/FLUX.1-dev
- Qwen/Qwen2-VL-7B-Instruct
library_name: diffusers
tags:
- flux
- qwen2vl
- stable-diffusion
- text-to-image
- image-to-image
- controlnet
pipeline_tag: text-to-image
---

# Qwen2vl-Flux

<div align="center">
<img src="landing-1.png" alt="Qwen2vl-Flux Banner" width="100%">
</div>

Qwen2vl-Flux is a state-of-the-art multimodal image generation model that enhances FLUX with Qwen2VL's vision-language understanding capabilities. It generates high-quality images from text prompts, visual references, or both, offering finer multimodal understanding and control than text-only conditioning.

## Model Architecture

<div align="center">
<img src="flux-architecture.svg" alt="Flux Architecture" width="800px">
</div>

The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, enabling more precise and context-aware image generation. Key components include (a conceptual sketch of the conditioning flow follows the list):

- Vision-Language Understanding Module (Qwen2VL)
- Enhanced FLUX backbone
- Multi-mode Generation Pipeline
- Structural Control Integration
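
The exact wiring between the two models lives in the GitHub repository; the snippet below is only a minimal sketch of the general idea, assuming Qwen2VL's final hidden states are projected by a small connector into FLUX's text-conditioning space. All module names, dimensions, and the connector itself are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: Qwen2-VL-7B hidden size and a typical FLUX
# text-conditioning width. The real model may use different values.
QWEN_HIDDEN = 3584
FLUX_COND = 4096

class VisionLanguageConnector(nn.Module):
    """Hypothetical MLP mapping Qwen2VL token features to FLUX conditioning."""
    def __init__(self, in_dim: int = QWEN_HIDDEN, out_dim: int = FLUX_COND):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, qwen_hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, QWEN_HIDDEN) -> (batch, seq_len, FLUX_COND)
        return self.proj(qwen_hidden_states)

# Stand-in for Qwen2VL output on an image+text prompt.
dummy_features = torch.randn(1, 77, QWEN_HIDDEN)
connector = VisionLanguageConnector()
prompt_embeds = connector(dummy_features)  # would be fed to the FLUX transformer
print(prompt_embeds.shape)  # torch.Size([1, 77, 4096])
```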

## Features

- **Enhanced Vision-Language Understanding**: Leverages Qwen2VL for superior multimodal comprehension
- **Multiple Generation Modes**: Supports variation, img2img, inpainting, and controlnet-guided generation
- **Structural Control**: Integrates depth estimation and line detection for precise structural guidance
- **Flexible Attention Mechanism**: Supports focused generation with spatial attention control
- **High-Resolution Output**: Supports various aspect ratios up to 1536x1024

## Generation Examples

### Image Variation
Create diverse variations while maintaining the essence of the original image:

<div align="center">
<table>
<tr>
<td><img src="variation_1.png" alt="Variation Example 1" width="256px"></td>
<td><img src="variation_2.png" alt="Variation Example 2" width="256px"></td>
<td><img src="variation_3.png" alt="Variation Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="variation_4.png" alt="Variation Example 4" width="256px"></td>
<td><img src="variation_5.png" alt="Variation Example 5" width="256px"></td>
</tr>
</table>
</div>

### Image Blending
Seamlessly blend multiple images with intelligent style transfer:

<div align="center">
<table>
<tr>
<td><img src="blend_1.png" alt="Blend Example 1" width="256px"></td>
<td><img src="blend_2.png" alt="Blend Example 2" width="256px"></td>
<td><img src="blend_3.png" alt="Blend Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="blend_4.png" alt="Blend Example 4" width="256px"></td>
<td><img src="blend_5.png" alt="Blend Example 5" width="256px"></td>
<td><img src="blend_6.png" alt="Blend Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="blend_7.png" alt="Blend Example 7" width="256px"></td>
</tr>
</table>
</div>

### Text-Guided Image Blending
Control image generation with textual prompts:

<div align="center">
<table>
<tr>
<td><img src="textblend_1.png" alt="Text Blend Example 1" width="256px"></td>
<td><img src="textblend_2.png" alt="Text Blend Example 2" width="256px"></td>
<td><img src="textblend_3.png" alt="Text Blend Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="textblend_4.png" alt="Text Blend Example 4" width="256px"></td>
<td><img src="textblend_5.png" alt="Text Blend Example 5" width="256px"></td>
<td><img src="textblend_6.png" alt="Text Blend Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="textblend_7.png" alt="Text Blend Example 7" width="256px"></td>
<td><img src="textblend_8.png" alt="Text Blend Example 8" width="256px"></td>
<td><img src="textblend_9.png" alt="Text Blend Example 9" width="256px"></td>
</tr>
</table>
</div>

### Grid-Based Style Transfer
Apply fine-grained style control with grid attention:

<div align="center">
<table>
<tr>
<td><img src="griddot_1.png" alt="Grid Example 1" width="256px"></td>
<td><img src="griddot_2.png" alt="Grid Example 2" width="256px"></td>
<td><img src="griddot_3.png" alt="Grid Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="griddot_4.png" alt="Grid Example 4" width="256px"></td>
<td><img src="griddot_5.png" alt="Grid Example 5" width="256px"></td>
<td><img src="griddot_6.png" alt="Grid Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="griddot_7.png" alt="Grid Example 7" width="256px"></td>
<td><img src="griddot_8.png" alt="Grid Example 8" width="256px"></td>
<td><img src="griddot_9.png" alt="Grid Example 9" width="256px"></td>
</tr>
</table>
</div>

## Usage

The inference code is available via our [GitHub repository](https://github.com/erwold/qwen2vl-flux), which provides comprehensive Python interfaces and examples.

### Installation

1. Clone the repository and install dependencies:

```bash
git clone https://github.com/erwold/qwen2vl-flux
cd qwen2vl-flux
pip install -r requirements.txt
```

2. Download model checkpoints from Hugging Face:

```python
from huggingface_hub import snapshot_download

snapshot_download("Djrango/Qwen2vl-Flux")
```
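
If you prefer the weights in a predictable location instead of the default Hugging Face cache, `snapshot_download` also accepts a `local_dir` argument; the target path below is only an example:

```python
from huggingface_hub import snapshot_download

# Download into a local folder next to the inference code.
snapshot_download(
    repo_id="Djrango/Qwen2vl-Flux",
    local_dir="./checkpoints/qwen2vl-flux",  # example path, adjust as needed
)
```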

### Basic Examples

```python
from PIL import Image

# FluxModel is provided by the repository; run this from the repo root.
from model import FluxModel

# Initialize the model (requires a CUDA GPU; see memory requirements below)
model = FluxModel(device="cuda")

# Load your own images; the file names here are only placeholders.
input_image = Image.open("input.jpg")
source_image = Image.open("source.jpg")
reference_image = Image.open("reference.jpg")
content_image = Image.open("content.jpg")
style_image = Image.open("style.jpg")

# Image Variation: re-imagine a single input image
outputs = model.generate(
    input_image_a=input_image,
    prompt="Your text prompt",
    mode="variation"
)

# Image Blending: blend a source image with a reference image
outputs = model.generate(
    input_image_a=source_image,
    input_image_b=reference_image,
    mode="img2img",
    denoise_strength=0.8
)

# Text-Guided Blending: steer the result with a text prompt
outputs = model.generate(
    input_image_a=input_image,
    prompt="Transform into an oil painting style",
    mode="variation",
    guidance_scale=7.5
)

# Grid-Based Style Transfer: structural guidance via line and depth control
outputs = model.generate(
    input_image_a=content_image,
    input_image_b=style_image,
    mode="controlnet",
    line_mode=True,
    depth_mode=True
)
```
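
The exact return type of `generate` is defined by the repository; assuming it returns a list of PIL images (as the plural `outputs` suggests), saving the results might look like this:

```python
# Assumes `outputs` is a list of PIL.Image.Image objects.
for i, image in enumerate(outputs):
    image.save(f"output_{i}.png")
```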

## Technical Specifications

- **Framework**: PyTorch 2.4.1+
- **Base Models**:
  - FLUX.1-dev
  - Qwen2-VL-7B-Instruct
- **Memory Requirements**: 48GB+ VRAM
- **Supported Image Sizes** (a quick-reference lookup follows this list):
  - 1024x1024 (1:1)
  - 1344x768 (16:9)
  - 768x1344 (9:16)
  - 1536x640 (2.4:1)
  - 896x1152 (3:4)
  - 1152x896 (4:3)
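
For convenience, the supported sizes can be kept in a small lookup keyed by aspect ratio. How a target size is passed to `generate` depends on the repository's API, so the dictionary below is only a reference table in code form, not a model parameter:

```python
# Supported output sizes as (width, height); labels match the list above.
SUPPORTED_SIZES = {
    "1:1": (1024, 1024),
    "16:9": (1344, 768),
    "9:16": (768, 1344),
    "2.4:1": (1536, 640),
    "3:4": (896, 1152),
    "4:3": (1152, 896),
}

width, height = SUPPORTED_SIZES["16:9"]
print(width, height)  # 1344 768
```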

## Citation

```bibtex
@misc{erwold-2024-qwen2vl-flux,
    title={Qwen2VL-Flux: Unifying Image and Text Guidance for Controllable Image Generation},
    author={Pengqi Lu},
    year={2024},
    url={https://github.com/erwold/qwen2vl-flux}
}
```

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

## Acknowledgments

- Based on the FLUX architecture
- Integrates Qwen2VL for vision-language understanding
- Thanks to the open-source communities of FLUX and Qwen