MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
Yanbo Ding*, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang
💡 Motivation
Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries.
🤖 Architecture
Our MUSES realizes 3D-controllable image generation through a progressive workflow with three key components:
- Layout Manager for 2D-to-3D layout lifting;
- Model Engineer for 3D object acquisition and calibration;
- Image Artist for 3D-to-2D image rendering.
By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation.
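The minimal Python sketch below illustrates this three-stage workflow; all class and method names are hypothetical and do not correspond to the actual repository code.

```python
# Hypothetical sketch of the MUSES workflow; names are illustrative only.

class LayoutManager:
    def plan(self, query: str) -> dict:
        # Lift a 2D layout inferred from the user query into a 3D layout
        # (object names, positions, sizes, orientations).
        ...

class ModelEngineer:
    def acquire(self, layout: dict) -> list:
        # Retrieve a 3D model for each planned object and calibrate its
        # orientation to match the layout.
        ...

class ImageArtist:
    def render(self, scene: list):
        # Render the assembled 3D scene into a 2D condition image, then
        # guide a diffusion model (e.g., via ControlNet) with it.
        ...

def muses_generate(query: str):
    layout = LayoutManager().plan(query)     # top-down planning
    scene = ModelEngineer().acquire(layout)  # bottom-up 3D assembly
    return ImageArtist().render(scene)       # 3D-to-2D rendering
```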
🔨 Installation
Clone this GitHub repository and install the required packages:
```shell
git clone https://github.com/DINGYANB/MUSES.git
cd MUSES

conda create -n MUSES python=3.10
conda activate MUSES

pip install -r requirements.txt
```
Download other required models:
| Model | Storage Path | Role |
| --- | --- | --- |
| OpenAI ViT-L-14 | `model/CLIP/` | Similarity Comparison |
| Meta Llama-3-8B | `model/Llama3/` | 3D Layout Planning |
| stabilityai stable-diffusion-3-medium (SD3) | `model/SD3-Base/` | Image Generation |
| InstantX SD3-Canny-ControlNet | `model/SD3-ControlNet-Canny/` | Controllable Image Generation |
| `examples_features.npy` | `dataset/` | In-Context Learning |
| `finetuned_clip_epoch_20.pth` | `model/CLIP/` | Orientation Calibration |

Since our MUSES is a training-free multi-model collaboration system, feel free to replace the generative models with other competitive ones. For example, we recommend replacing Llama-3-8B with more powerful LLMs such as Llama-3.1-8B or GPT-4o.
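As a minimal sketch of such a swap, assuming the layout planner loads its LLM from `model/Llama3/` via Hugging Face `transformers` (the repository's actual loading code may differ):

```python
# Minimal sketch: place a stronger checkpoint (e.g., Llama-3.1-8B-Instruct)
# in model/Llama3/, and the rest of the pipeline stays unchanged.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "model/Llama3/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
```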
Optional Downloads:
- Download our self-built 3D model shop at this link, which includes 300 high-quality 3D models and 1,500 images of various objects with different orientations for fine-tuning the CLIP model.
- Download multiple ControlNets such as SD3-Tile-ControlNet, SDXL-Canny-ControlNet, SDXL-Depth-ControlNet, and other image generation models, e.g., SDXL with VAE.
🚀 Usage
Use the following command to generate images.
```shell
cd MUSES && bash multi_runs.sh "test_prompt.txt" "test"
```
The first argument is the input .txt file containing one prompt per line, and the second is an identifier for the current run, which is appended to the output folder name. With SD3-Canny-ControlNet, each prompt yields 5 images at different control scales.
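An illustrative input file (these prompts are examples, not from the repository):

```text
a red car parked to the left of a blue truck, viewed from above
three books stacked on a wooden desk, side view
```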
📊 Dataset & Benchmark
Expanded NSR-1K
Since the original NSR-1K dataset lacks layouts for 3D scenes and complex scenes, we manually add prompts with corresponding layouts. Our expanded NSR-1K dataset is in the directory `dataset/NSR-1K-Expand`.
Benchmark Evaluation
For T2I-CompBench evaluation, we follow its official evaluation code at this link. Note that we choose the best score among the 5 images as the final score.
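A minimal Python sketch of this best-of-5 aggregation (the function name and numbers are illustrative; the per-image scores come from the official evaluator):

```python
# Best-of-5: each prompt yields 5 images (one per control scale); the
# highest evaluator score among them counts as that prompt's score.
def benchmark_score(scores_per_prompt: list[list[float]]) -> float:
    best = [max(scores) for scores in scores_per_prompt]
    return sum(best) / len(best)  # mean of per-prompt maxima

# Example with two prompts and five scores each (illustrative numbers).
print(benchmark_score([[0.61, 0.72, 0.68, 0.70, 0.66],
                       [0.55, 0.59, 0.62, 0.58, 0.60]]))  # ~0.67
```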
Since T2I-CompBench lacks detailed descriptions of complex 3D spatial relationships among multiple objects, we construct our own T2I-3DisBench (`dataset/T2I-3DisBench.txt`), which describes diverse 3D image scenes with 50 detailed prompts.
For T2I-3DisBench evaluation, we employ Mini-InternVL-2B-1.5 to score the generated images from 0.0 to 1.0 across four dimensions: object count, object orientation, 3D spatial relationship, and camera view. You can download the weights at this link and put them into the folder `model/InternVL/`.
```shell
python inference_code/internvl_vqa.py
```
After running it, the script outputs an average score. Our MUSES demonstrates state-of-the-art performance on both benchmarks, verifying its effectiveness.
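As a rough sketch of that aggregation, assuming one 0.0–1.0 score per dimension per image (the actual output format of `inference_code/internvl_vqa.py` may differ):

```python
# Illustrative averaging over the four T2I-3DisBench dimensions; the
# real script may structure and combine its scores differently.
DIMENSIONS = ("object count", "object orientation",
              "3D spatial relationship", "camera view")

def average_score(per_image_scores: list[dict[str, float]]) -> float:
    # Mean over the four dimensions per image, then mean over all images.
    means = [sum(s[d] for d in DIMENSIONS) / len(DIMENSIONS)
             for s in per_image_scores]
    return sum(means) / len(means)
```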
🙏 Acknowledgement
MUSES is built upon Llama, NSR-1K, Shap-E, CLIP, SD, and ControlNet. We acknowledge these open-source codes and models, and the website CGTrader for supporting free 3D model downloads. We also appreciate the valuable insights from the researchers at the Shenzhen Institute of Advanced Technology and the Shanghai AI Laboratory.
📖 Citation
```bibtex
@article{ding2024muses,
  title={MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration},
  author={Yanbo Ding and Shaobin Zhuang and Kunchang Li and Zhengrong Yue and Yu Qiao and Yali Wang},
  journal={arXiv preprint arXiv:2408.10605},
  year={2024},
}
```