[Yanbo Ding*](https://github.com/DINGYANB),
[Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao),
[Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ),
[Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue),
[Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl),
[Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)
[![arXiv](https://img.shields.io/badge/arXiv-2408.10605-b31b1b.svg)](https://arxiv.org/abs/2408.10605) [![GitHub](https://img.shields.io/badge/GitHub-MUSES-blue?logo=github)](https://github.com/DINGYANB/MUSES) [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/yanboding/MUSES/)
## 💡 Motivation
Despite recent advances in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce MUSES, a generic AI system for 3D-controllable image generation from user queries.