---
license: mit
language:
- en
base_model:
- black-forest-labs/FLUX.1-dev
- Qwen/Qwen2-VL-7B-Instruct
library_name: diffusers
tags:
- flux
- qwen2vl
- stable-diffusion
- text-to-image
- image-to-image
- controlnet
pipeline_tag: text-to-image
---
# Qwen2vl-Flux
<div align="center">
<img src="landing-1.png" alt="Qwen2vl-Flux Banner" width="100%">
</div>
Qwen2vl-Flux is a multimodal image generation model that augments FLUX with Qwen2VL's vision-language understanding. It generates images conditioned on text prompts, reference images, or both.
## Model Architecture
<div align="center">
<img src="flux-architecture.svg" alt="Flux Architecture" width="800px">
</div>
The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, enabling more precise and context-aware image generation. Key components include:
- Vision-Language Understanding Module (Qwen2VL)
- Enhanced FLUX backbone
- Multi-mode Generation Pipeline
- Structural Control Integration
## Features
- **Enhanced Vision-Language Understanding**: Uses Qwen2VL to interpret both image and text inputs
- **Multiple Generation Modes**: Supports variation, img2img, inpainting, and controlnet-guided generation
- **Structural Control**: Integrates depth estimation and line detection for precise structural guidance
- **Flexible Attention Mechanism**: Supports focused generation with spatial attention control
- **High-Resolution Output**: Supports several aspect ratios, up to 1536 pixels on the longer side (see Supported Image Sizes under Technical Specifications)
## Generation Examples
### Image Variation
Create diverse variations while maintaining the essence of the original image:
<div align="center">
<table>
<tr>
<td><img src="variation_1.png" alt="Variation Example 1" width="256px"></td>
<td><img src="variation_2.png" alt="Variation Example 2" width="256px"></td>
<td><img src="variation_3.png" alt="Variation Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="variation_4.png" alt="Variation Example 4" width="256px"></td>
<td><img src="variation_5.png" alt="Variation Example 5" width="256px"></td>
</tr>
</table>
</div>
### Image Blending
Seamlessly blend multiple images with intelligent style transfer:
<div align="center">
<table>
<tr>
<td><img src="blend_1.png" alt="Blend Example 1" width="256px"></td>
<td><img src="blend_2.png" alt="Blend Example 2" width="256px"></td>
<td><img src="blend_3.png" alt="Blend Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="blend_4.png" alt="Blend Example 4" width="256px"></td>
<td><img src="blend_5.png" alt="Blend Example 5" width="256px"></td>
<td><img src="blend_6.png" alt="Blend Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="blend_7.png" alt="Blend Example 7" width="256px"></td>
</tr>
</table>
</div>
### Text-Guided Image Blending
Control image generation with textual prompts:
<div align="center">
<table>
<tr>
<td><img src="textblend_1.png" alt="Text Blend Example 1" width="256px"></td>
<td><img src="textblend_2.png" alt="Text Blend Example 2" width="256px"></td>
<td><img src="textblend_3.png" alt="Text Blend Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="textblend_4.png" alt="Text Blend Example 4" width="256px"></td>
<td><img src="textblend_5.png" alt="Text Blend Example 5" width="256px"></td>
<td><img src="textblend_6.png" alt="Text Blend Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="textblend_7.png" alt="Text Blend Example 7" width="256px"></td>
<td><img src="textblend_8.png" alt="Text Blend Example 8" width="256px"></td>
<td><img src="textblend_9.png" alt="Text Blend Example 9" width="256px"></td>
</tr>
</table>
</div>
### Grid-Based Style Transfer
Apply fine-grained style control with grid attention:
<div align="center">
<table>
<tr>
<td><img src="griddot_1.png" alt="Grid Example 1" width="256px"></td>
<td><img src="griddot_2.png" alt="Grid Example 2" width="256px"></td>
<td><img src="griddot_3.png" alt="Grid Example 3" width="256px"></td>
</tr>
<tr>
<td><img src="griddot_4.png" alt="Grid Example 4" width="256px"></td>
<td><img src="griddot_5.png" alt="Grid Example 5" width="256px"></td>
<td><img src="griddot_6.png" alt="Grid Example 6" width="256px"></td>
</tr>
<tr>
<td><img src="griddot_7.png" alt="Grid Example 7" width="256px"></td>
<td><img src="griddot_8.png" alt="Grid Example 8" width="256px"></td>
<td><img src="griddot_9.png" alt="Grid Example 9" width="256px"></td>
</tr>
</table>
</div>
## Usage
The inference code is available in our [GitHub repository](https://github.com/erwold/qwen2vl-flux), which provides Python interfaces and examples.
### Installation
1. Clone the repository and install dependencies:
```bash
git clone https://github.com/erwold/qwen2vl-flux
cd qwen2vl-flux
pip install -r requirements.txt
```
2. Download model checkpoints from Hugging Face:
```python
from huggingface_hub import snapshot_download
snapshot_download("Djrango/Qwen2vl-Flux")
```
### Basic Examples
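```python
from PIL import Image

from model import FluxModel

# Initialize the model on the GPU
model = FluxModel(device="cuda")

# Load inputs. The file names are placeholders, and passing PIL images is an
# assumption; check the repository for the accepted input types.
input_image = Image.open("input.png").convert("RGB")
source_image = content_image = input_image
reference_image = style_image = Image.open("reference.png").convert("RGB")

# Image Variation
outputs = model.generate(
    input_image_a=input_image,
    prompt="Your text prompt",
    mode="variation",
)

# Image Blending
outputs = model.generate(
    input_image_a=source_image,
    input_image_b=reference_image,
    mode="img2img",
    denoise_strength=0.8,
)

# Text-Guided Blending
outputs = model.generate(
    input_image_a=input_image,
    prompt="Transform into an oil painting style",
    mode="variation",
    guidance_scale=7.5,
)

# Grid-Based Style Transfer
outputs = model.generate(
    input_image_a=content_image,
    input_image_b=style_image,
    mode="controlnet",
    line_mode=True,
    depth_mode=True,
)
```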
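The calls above differ only in their keyword arguments, so it can help to validate the mode name before the model is invoked. The following is a minimal sketch, not part of the repository: `build_generate_kwargs` and `SUPPORTED_MODES` are hypothetical names, and the mode list is copied from the Features section.

```python
# Mode names copied from the Features section. The helper itself is
# hypothetical and is not part of the qwen2vl-flux repository.
SUPPORTED_MODES = ("variation", "img2img", "inpainting", "controlnet")


def build_generate_kwargs(mode, input_image_a, input_image_b=None, prompt=None, **extra):
    """Assemble keyword arguments for model.generate and reject unknown modes."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(
            f"Unknown mode {mode!r}; expected one of {', '.join(SUPPORTED_MODES)}"
        )
    if input_image_a is None:
        raise ValueError("input_image_a is required")

    kwargs = {
        "mode": mode,
        "input_image_a": input_image_a,
        "input_image_b": input_image_b,
        "prompt": prompt,
    }
    # Extra options such as denoise_strength or guidance_scale pass through.
    kwargs.update(extra)
    # Drop unset options so the model's own defaults apply.
    return {key: value for key, value in kwargs.items() if value is not None}
```

With this helper, the blending call could be written as `model.generate(**build_generate_kwargs("img2img", source_image, reference_image, denoise_strength=0.8))`. A misspelled mode then fails immediately with a readable error instead of after the model has loaded.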
## Technical Specifications
- **Framework**: PyTorch 2.4.1+
- **Base Models**:
  - FLUX.1-dev
  - Qwen2-VL-7B-Instruct
- **Memory Requirements**: 48GB+ VRAM
- **Supported Image Sizes**:
  - 1024x1024 (1:1)
  - 1344x768 (16:9)
  - 768x1344 (9:16)
  - 1536x640 (2.4:1)
  - 896x1152 (3:4)
  - 1152x896 (4:3)
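
When an input image does not match one of the sizes above, one option is to snap it to the preset with the nearest aspect ratio. The sketch below is illustrative only: `closest_supported_size` is a hypothetical helper that is not part of the repository, and the size table is copied from the list above. It compares ratios on a log scale so that landscape and portrait images are treated symmetrically.

```python
import math

# (width, height) pairs copied from the Supported Image Sizes list.
SUPPORTED_SIZES = [
    (1024, 1024),
    (1344, 768),
    (768, 1344),
    (1536, 640),
    (896, 1152),
    (1152, 896),
]


def closest_supported_size(width, height):
    """Return the supported (width, height) with the nearest aspect ratio.

    Hypothetical helper for illustration; not part of the repository.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(
        SUPPORTED_SIZES,
        key=lambda size: abs(math.log(size[0] / size[1]) - target),
    )
```

For example, `closest_supported_size(1920, 1080)` selects the 1344x768 preset, after which the input can be resized or cropped to that size before generation.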
## Citation
```bibtex
@misc{erwold-2024-qwen2vl-flux,
title={Qwen2VL-Flux: Unifying Image and Text Guidance for Controllable Image Generation},
author={Pengqi Lu},
year={2024},
url={https://github.com/erwold/qwen2vl-flux}
}
```
## License
This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.
## Acknowledgments
- Based on the FLUX architecture
- Integrates Qwen2VL for vision-language understanding
- Thanks to the open-source communities of FLUX and Qwen