<div align="center">

<h2><a href="https://arxiv.org/abs/2408.10605">MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration</a></h2>

[Yanbo Ding*](https://github.com/DINGYANB),
[Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao),
[Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ),
[Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue),
[Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl),
[Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)

[![arXiv](https://img.shields.io/badge/arXiv-2408.10605-b31b1b.svg)](https://arxiv.org/abs/2408.10605)
[![GitHub](https://img.shields.io/badge/GitHub-MUSES-blue?logo=github)](https://github.com/DINGYANB/MUSES)
[![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/yanboding/MUSES/)

</div>

## 💡 Motivation

Despite recent advances in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To address this limitation, we introduce MUSES, a generic AI system for 3D-controllable image generation from user queries.

<img width="800" alt="MUSES demo results" src="https://github.com/DINGYANB/MUSES/blob/main/assets/demo.png">


## 🤖 Architecture

MUSES realizes 3D-controllable image generation through a progressive workflow with three key components (sketched in code after the overview figure):
1. a Layout Manager for 2D-to-3D layout lifting;
2. a Model Engineer for 3D object acquisition and calibration;
3. an Image Artist for 3D-to-2D image rendering.

By mimicking the collaboration of human professionals, this multi-modal agent pipeline creates images with 3D-controllable objects effectively and automatically, through an explainable integration of top-down planning and bottom-up generation.

<img width="800" alt="MUSES overview" src="https://github.com/DINGYANB/MUSES/blob/main/assets/overview.png">
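
Read as a pipeline, the three agents simply run in sequence. Below is a minimal sketch of that flow; the class and method names are hypothetical, not the repository's actual API:

``` python
# Illustrative sketch of the three-stage workflow described above;
# all class and method names here are hypothetical.
class LayoutManager:
    def lift(self, query: str) -> dict:
        """Plan a 2D layout from the query, then lift it to 3D."""

class ModelEngineer:
    def acquire(self, layout_3d: dict) -> list:
        """Retrieve/generate 3D objects and calibrate their size and orientation."""

class ImageArtist:
    def render(self, layout_3d: dict, objects: list):
        """Render the 3D scene to 2D and refine it with ControlNet-guided generation."""

def muses(query: str):
    layout_3d = LayoutManager().lift(query)       # top-down planning
    objects = ModelEngineer().acquire(layout_3d)  # bottom-up object acquisition
    return ImageArtist().render(layout_3d, objects)
```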


## 🔨 Installation

1. Clone this GitHub repository and install the required packages:

``` shell
git clone https://github.com/DINGYANB/MUSES.git
cd MUSES

conda create -n MUSES python=3.10
conda activate MUSES

pip install -r requirements.txt
```

2. Download the other required models:

| Model | Storage Path | Role |
|-------|--------------|------|
| [OpenAI ViT-L-14](https://huggingface.co/openai/clip-vit-large-patch14) | `model/CLIP/` | Similarity Comparison |
| [Meta Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | `model/Llama3/` | 3D Layout Planning |
| [stabilityai stable-diffusion-3-medium (SD3)](https://huggingface.co/stabilityai/stable-diffusion-3-medium) | `model/SD3-Base/` | Image Generation |
| [InstantX SD3-Canny-ControlNet](https://huggingface.co/InstantX/SD3-Controlnet-Canny) | `model/SD3-ControlNet-Canny/` | Controllable Image Generation |
| [examples_features.npy](https://huggingface.co/yanboding/MUSES/upload/main) | `dataset/` | In-Context Learning |
| [finetuned_clip_epoch_20.pth](https://huggingface.co/yanboding/MUSES/upload/main) | `model/CLIP/` | Orientation Calibration |

Since MUSES is a training-free multi-model collaboration system, feel free to replace the generative models with other competitive ones. For example, we recommend replacing Llama-3-8B with more powerful LLMs such as [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) or [GPT-4o](https://platform.openai.com/docs/models/gpt-4o).
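
As one example, the checkpoints above can be fetched with `huggingface_hub` into the listed storage paths. A minimal sketch, assuming you have accepted the licenses of the gated Llama-3 and SD3 repositories and logged in via `huggingface-cli login`:

``` python
# Sketch: fetch each required checkpoint into the storage paths listed above.
# Note: Meta-Llama-3-8B and stable-diffusion-3-medium are gated repositories.
from huggingface_hub import snapshot_download

repos = {
    "openai/clip-vit-large-patch14": "model/CLIP",
    "meta-llama/Meta-Llama-3-8B": "model/Llama3",
    "stabilityai/stable-diffusion-3-medium": "model/SD3-Base",
    "InstantX/SD3-Controlnet-Canny": "model/SD3-ControlNet-Canny",
}
for repo_id, local_dir in repos.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```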

3. *Optional* Downloads:
- Download our self-built 3D model shop at this [link](https://huggingface.co/yanboding/MUSES/upload/main), which includes 300 high-quality 3D models and 1500 images of various objects in different orientations for fine-tuning [CLIP](https://huggingface.co/openai/clip-vit-base-patch32).
- Download additional ControlNets such as [SD3-Tile-ControlNet](https://huggingface.co/InstantX/SD3-Controlnet-Tile), [SDXL-Canny-ControlNet](https://huggingface.co/TheMistoAI/MistoLine), and [SDXL-Depth-ControlNet](https://huggingface.co/diffusers/controlnet-zoe-depth-sdxl-1.0), as well as other image generation models, e.g., [SDXL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with the [fp16-fix VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix).


## 🌟 Usage
Use the following command to generate images:
``` shell
cd MUSES && bash multi_runs.sh "test_prompt.txt" "test"
```
The **first argument** is a text file containing one prompt per line, and the **second argument** is an identifier for the current run, which is appended to the output folder name. With SD3-Canny-ControlNet, each prompt yields 5 images at different control scales.
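
For example, a run can also be prepared and launched programmatically; the prompts below are illustrative placeholders:

``` python
# Write one prompt per line, then launch the generation script.
# The prompts below are illustrative placeholders.
from pathlib import Path
import subprocess

prompts = [
    "two apples in front of a teapot on a wooden table, front view",
    "a cat to the left of a dog, both facing the camera",
]
Path("test_prompt.txt").write_text("\n".join(prompts) + "\n")
subprocess.run(["bash", "multi_runs.sh", "test_prompt.txt", "test"], check=True)
```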


## 📊 Dataset & Benchmark
### Expanded NSR-1K
Since the original [NSR-1K](https://github.com/Karine-Huang/T2I-CompBench) dataset lacks layouts for 3D and complex scenes, we manually add prompts with corresponding layouts.
Our expanded NSR-1K dataset is in the directory `dataset/NSR-1K-Expand`.

### Benchmark Evaluation

For *T2I-CompBench* evaluation, we follow the official evaluation code at this [link](https://github.com/Karine-Huang/T2I-CompBench). Note that we take the best score among the 5 images as the final score.

Since T2I-CompBench lacks detailed descriptions of complex 3D spatial relationships among multiple objects, we construct T2I-3DisBench (`dataset/T2I-3DisBench.txt`), which describes diverse 3D image scenes with 50 detailed prompts.
For *T2I-3DisBench* evaluation, we employ [Mini-InternVL-2B-1.5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) to score the generated images from 0.0 to 1.0 across four dimensions: object count, object orientation, 3D spatial relationship, and camera view. You can download the weights at this [link](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) and put them into the folder `model/InternVL/`.

``` shell
python inference_code/internvl_vqa.py
```

The script outputs an average score.
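
For reference, assuming the final score is the mean of the four dimension scores per image, averaged over all images, the aggregation looks like this (dimension names and values are hypothetical):

``` python
# Illustrative aggregation, assuming each image is scored 0.0-1.0 per
# dimension and the final score is the mean over dimensions and images.
# Dimension names and example values are hypothetical.
DIMS = ("object_count", "object_orientation", "spatial_relationship", "camera_view")

all_scores = [  # one dict of per-dimension scores per generated image
    {"object_count": 1.0, "object_orientation": 0.8,
     "spatial_relationship": 0.9, "camera_view": 1.0},
]

per_image = [sum(s[d] for d in DIMS) / len(DIMS) for s in all_scores]
print(f"average score: {sum(per_image) / len(per_image):.3f}")
```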
Our MUSES demonstrates state-of-the-art performance on both benchmarks, verifying its effectiveness.

## 💙 Acknowledgement
MUSES is built upon
[Llama](https://github.com/meta-llama/llama3),
[NSR-1K](https://github.com/Karine-Huang/T2I-CompBench),
[Shap-E](https://github.com/openai/shap-e),
[CLIP](https://github.com/openai/CLIP),
[SD](https://github.com/Stability-AI/generative-models), and
[ControlNet](https://github.com/lllyasviel/ControlNet).
We acknowledge these open-source codes and models, as well as the website [CGTrader](https://www.cgtrader.com) for providing free 3D model downloads.
We also appreciate the valuable insights from the researchers at the Shenzhen Institute of Advanced Technology and the Shanghai AI Laboratory.


## 📝 Citation
```bib
@article{ding2024muses,
  title={MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration},
  author={Yanbo Ding and Shaobin Zhuang and Kunchang Li and Zhengrong Yue and Yu Qiao and Yali Wang},
  journal={arXiv preprint arXiv:2408.10605},
  year={2024},
}
```