
💡 Motivation

Despite recent advances in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, MUSES, for 3D-controllable image generation from user queries.


🤖 Architecture

MUSES realizes 3D-controllable image generation through a progressive workflow with three key components:

  1. Layout Manager for 2D-to-3D layout lifting;
  2. Model Engineer for 3D object acquisition and calibration;
  3. Image Artist for 3D-to-2D image rendering.

By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation.
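
The three-stage flow described above can be pictured as a chain of functions that each enrich a shared state. The following is only a minimal sketch of that idea, not the repository's implementation: `SceneState`, the stage function names, and every placeholder value are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SceneState:
    """Hypothetical container handed from one stage to the next."""
    prompt: str
    layout_3d: list = field(default_factory=list)    # filled by the Layout Manager
    models_3d: dict = field(default_factory=dict)    # filled by the Model Engineer
    image_path: Optional[str] = None                 # filled by the Image Artist
    stages_run: list = field(default_factory=list)   # records the order of stages


def layout_manager(state: SceneState) -> SceneState:
    # Placeholder for 2D-to-3D layout lifting (a real stage would plan with an LLM).
    state.layout_3d = [{"object": "placeholder", "position": (0.0, 0.0, 0.0)}]
    state.stages_run.append("layout_manager")
    return state


def model_engineer(state: SceneState) -> SceneState:
    # Placeholder for 3D object acquisition and calibration.
    for item in state.layout_3d:
        state.models_3d[item["object"]] = "<3D asset for " + item["object"] + ">"
    state.stages_run.append("model_engineer")
    return state


def image_artist(state: SceneState) -> SceneState:
    # Placeholder for 3D-to-2D image rendering.
    state.image_path = "<rendered image>"
    state.stages_run.append("image_artist")
    return state


STAGES = (layout_manager, model_engineer, image_artist)


def run_pipeline(prompt: str, stages=STAGES) -> SceneState:
    """Run the stages in order, passing the state from one to the next."""
    state = SceneState(prompt=prompt)
    for stage in stages:
        state = stage(state)
    return state
```

Because each stage only reads what earlier stages wrote, a stage can be swapped for another model without touching the rest of the chain, which matches the training-free, replaceable-model design described in this README.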

image

🔨 Installation

  1. Clone this GitHub repository and install the required packages:

    git clone https://github.com/DINGYANB/MUSES.git
    cd MUSES
    
    conda create -n MUSES python=3.10
    conda activate MUSES
    
    pip install -r requirements.txt
    
  2. Download other required models:

    | Model | Storage Path | Role |
    | --- | --- | --- |
    | OpenAI ViT-L-14 | model/CLIP/ | Similarity Comparison |
    | Meta Llama-3-8B | model/Llama3/ | 3D Layout Planning |
    | stabilityai stable-diffusion-3-medium (SD3) | model/SD3-Base/ | Image Generation |
    | InstantX SD3-Canny-ControlNet | model/SD3-ControlNet-Canny/ | Controllable Image Generation |
    | examples_features.npy | /dataset/ | In-Context Learning |
    | finetuned_clip_epoch_20.pth | /model/CLIP/ | Orientation Calibration |

    Since MUSES is a training-free multi-model collaboration system, feel free to replace the generative models with other competitive ones. For example, we recommend replacing Llama-3-8B with a more powerful LLM such as Llama-3.1-8B or GPT-4o.

  3. Optional Downloads:
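
Before running the system, it can help to confirm that the models listed in step 2 are where the code expects them. The sketch below checks the storage paths from that table. It assumes the paths are relative to the repository root (leading slashes dropped) and that the two named files sit directly inside their listed folders; `EXPECTED_PATHS` and `find_missing` are hypothetical names, not part of the repository.

```python
from pathlib import Path

# Storage paths copied from the table in step 2.
# Assumption: all paths are relative to the repository root.
EXPECTED_PATHS = {
    "OpenAI ViT-L-14": "model/CLIP",
    "Meta Llama-3-8B": "model/Llama3",
    "stabilityai stable-diffusion-3-medium (SD3)": "model/SD3-Base",
    "InstantX SD3-Canny-ControlNet": "model/SD3-ControlNet-Canny",
    "examples_features.npy": "dataset/examples_features.npy",
    "finetuned_clip_epoch_20.pth": "model/CLIP/finetuned_clip_epoch_20.pth",
}


def find_missing(repo_root, expected=EXPECTED_PATHS):
    """Return the names of entries whose path does not exist under repo_root."""
    root = Path(repo_root)
    return [name for name, rel in expected.items() if not (root / rel).exists()]


# Example usage from inside the cloned repository:
#   for name in find_missing("."):
#       print("missing:", name)
```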

🌟 Usage

Use the following command to generate images.

cd MUSES && bash multi_runs.sh "test_prompt.txt" "test"

The first argument is the input text file containing one prompt per line, and the second argument is an identifier for the current run, which is appended to the output folder name. With SD3-Canny-ControlNet, each prompt yields 5 images at different control scales.
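
If the prompts come from another program, the input file can be written programmatically. The sketch below writes one prompt per line and assembles the same command shown above. The helper names `write_prompt_file` and `build_command` are hypothetical; only `multi_runs.sh` and its two arguments come from this README.

```python
from pathlib import Path


def write_prompt_file(prompts, path):
    """Write one prompt per line and return the number of prompts written.

    Internal whitespace (including newlines) is collapsed so that every
    prompt occupies exactly one line; empty prompts are skipped.
    """
    lines = [" ".join(p.split()) for p in prompts if p.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def build_command(prompt_file, run_id):
    """Return the argument list for the generation script."""
    return ["bash", "multi_runs.sh", str(prompt_file), run_id]


# Example usage (run from the directory that contains the MUSES checkout):
#   import subprocess
#   write_prompt_file(["a red chair facing a wooden table"], "MUSES/my_prompts.txt")
#   subprocess.run(build_command("my_prompts.txt", "test"), cwd="MUSES", check=True)
```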

📊 Dataset & Benchmark

Expanded NSR-1K

Since the original NSR-1K dataset lacks layouts for 3D scenes and complex scenes, we manually add prompts with corresponding layouts. Our expanded NSR-1K dataset is in the directory dataset/NSR-1K-Expand.

Benchmark Evaluation

For T2I-CompBench evaluation, we follow its official evaluation code. Note that we take the best score among the 5 images generated for each prompt as the final score.
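
The best-of-5 rule can be sketched as follows. This is an illustrative aggregation only: the per-image scores are assumed to have been produced already by the benchmark's own evaluation code, and averaging the per-prompt best scores into one number is an assumption rather than something stated here. The function names are hypothetical.

```python
from statistics import mean


def best_of_n(scores_per_prompt):
    """Map each prompt to the highest score among its generated images."""
    return {prompt: max(scores) for prompt, scores in scores_per_prompt.items()}


def final_score(scores_per_prompt):
    """Average the per-prompt best scores (assumed aggregation)."""
    if not scores_per_prompt:
        raise ValueError("no scores to aggregate")
    return mean(list(best_of_n(scores_per_prompt).values()))
```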

Since T2I-CompBench lacks detailed descriptions of complex 3D spatial relationships among multiple objects, we construct T2I-3DisBench (dataset/T2I-3DisBench.txt), which describes diverse 3D image scenes with 50 detailed prompts. For T2I-3DisBench evaluation, we employ Mini-InternVL-2B-1.5 to score the generated images from 0.0 to 1.0 across four dimensions: object count, object orientation, 3D spatial relationship, and camera view. Download the model weights and put them into the folder model/InternVL/, then run:

python inference_code/internvl_vqa.py

The script outputs an average score. MUSES achieves state-of-the-art performance on both benchmarks, verifying its effectiveness.
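
To make the averaging step concrete, the sketch below turns free-text replies from a vision-language model into scores in the range 0.0 to 1.0 and averages them. It is not the logic of inference_code/internvl_vqa.py: the reply format, the regular expression, the clamping, and the names `parse_score` and `average_score` are all assumptions made for illustration.

```python
import re
from statistics import mean

# Matches the first integer or decimal number in a reply (assumed reply format).
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_score(reply):
    """Return the first number in the reply, clamped to [0.0, 1.0], or None."""
    match = _NUMBER.search(reply)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


def average_score(replies):
    """Average all parseable scores, skipping replies without a number."""
    scores = [s for s in (parse_score(r) for r in replies) if s is not None]
    if not scores:
        raise ValueError("no parseable scores")
    return mean(scores)
```

Skipping unparseable replies instead of counting them as zero is a design choice that affects the final average, so whichever convention is used should be reported alongside the score.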

💙 Acknowledgement

MUSES is built upon Llama, NSR-1K, Shap-E, CLIP, SD, and ControlNet. We acknowledge these open-source codebases and models, and the website CGTrader for providing free 3D model downloads. We also appreciate the valuable insights from researchers at the Shenzhen Institute of Advanced Technology and the Shanghai AI Laboratory.

📝 Citation

@article{ding2024muses,
      title={MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration}, 
      author={Yanbo Ding and Shaobin Zhuang and Kunchang Li and Zhengrong Yue and Yu Qiao and Yali Wang},
      journal={arXiv preprint arXiv:2408.10605},
      year={2024},
}