Abstract
While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
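For readers who want the core idea concretely: below is a minimal, hedged sketch (in PyTorch) of next-token prediction over a single mixed sequence of text and vision tokens drawn from one shared vocabulary, which is the training recipe the abstract describes. The toy model size, vocabulary split, and sequence lengths are illustrative assumptions, not the released Emu3 configuration.

```python
# Minimal sketch (not the authors' code): one next-token-prediction step over an
# interleaved text + vision token sequence. All sizes below are illustrative.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000      # hypothetical number of text tokens
VISION_VOCAB = 4096    # hypothetical number of discrete vision (VQ) codes
VOCAB = TEXT_VOCAB + VISION_VOCAB

class TinyDecoder(nn.Module):
    """Toy decoder-only transformer; Emu3 itself is a much larger LLM-style model."""
    def __init__(self, vocab, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)

# One fake interleaved training example: text tokens followed by flattened VQ image codes.
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))
seq = torch.cat([text, image], dim=1)

model = TinyDecoder(VOCAB)
logits = model(seq[:, :-1])          # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```

The point of the sketch is that once every modality is mapped to discrete tokens in a shared vocabulary, the loss is the same cross-entropy used for language modeling; no diffusion head or separate vision branch is needed.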
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MIO: A Foundation Model on Multimodal Tokens (2024)
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (2024)
- PixelBytes: Catching Unified Embedding for Multimodal Generation (2024)
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding (2024)
- OLMoE: Open Mixture-of-Experts Language Models (2024)
If you want recommendations for any paper on Hugging Face, check out this Space.
You can also ask Librarian Bot for paper recommendations directly by tagging it in a comment: @librarian-bot recommend
Despite the discussion around VideoPoet, this doesn't seem significantly different from the architecture presented there. As I understand it, the main differences highlighted by the authors are:
- Emu3 does not perform a second super-resolution step
- Emu3 does not use a pre-trained text encoder
However, these differences seem fairly superficial. It might be worthwhile to discuss, e.g., the choice of MAGVIT-v2 vs. SBER-MoVQGAN, since the choice of image tokenizer seems to be the real difference between the two works.
My read of this paper:
This is the most important research in months: we're now very close to having a single architecture to handle all modalities. The folks at Beijing Academy of Artificial Intelligence (BAAI) just released Emu3, a single model that handles text, images, and videos all at once.
What's the big deal?
Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token.
And it's only 8B, but really strong:
- For image generation, it's matching the best specialized models out there, like SDXL.
- In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
- It's the first to nail video generation without using complicated diffusion techniques.
How does it work?
- Emu3 uses a special tokenizer (SBER-MoVQGAN) to turn images and video clips into sequences of 4,096 tokens.
- Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
- During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame (see the generation sketch after this list).
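As a hedged illustration of that loop (not the released Emu3 code), the sketch below samples a fixed-length grid of vision tokens autoregressively from any causal LM over the shared text+vision vocabulary; the offsets, vocabulary sizes, and 64x64 grid are assumptions chosen to match the 4,096-token figure quoted above.

```python
import torch

def generate_image_tokens(model, prompt_tokens, num_image_tokens=4096,
                          vision_offset=1000, vision_vocab=32768,
                          temperature=1.0, grid=(64, 64)):
    """Sample `num_image_tokens` discrete vision codes one at a time.

    `model` is any causal LM mapping (1, T) token ids to (1, T, V) logits over a
    shared text+vision vocabulary; the offsets and sizes here are illustrative,
    not the actual Emu3 configuration.
    """
    seq = prompt_tokens.clone()                        # (1, T_prompt) text prompt ids
    for _ in range(num_image_tokens):
        with torch.no_grad():
            logits = model(seq)[:, -1].clone()         # distribution over the next token
        # Constrain sampling to the vision-token range of the vocabulary.
        logits[:, :vision_offset] = float("-inf")
        logits[:, vision_offset + vision_vocab:] = float("-inf")
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, nxt], dim=1)
    codes = seq[:, -num_image_tokens:] - vision_offset
    return codes.view(1, *grid)                        # VQ indices for the tokenizer's decoder
```

The returned grid of code indices would then be mapped back to pixels by the vision tokenizer's decoder (in Emu3's case, one derived from SBER-MoVQGAN); that decoding step sits outside the language model itself.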
Caveats on the results:
- In image generation, Emu3 beats SDXL, but it's also much bigger (8B vs. 3.5B). It would be harder to beat the real diffusion GOAT, FLUX-dev.
- In vision, the authors also don't show a comparison against all the current SOTA models like Qwen-VL or Pixtral.
This approach is exciting because it's simple (next-token prediction) and scalable (it handles all sorts of data)!
I'm a total beginner.
Next-Token Prediction is great, but it's really slow...
Is there any way to predict the entire answer or generate a complete picture all at once? After all, humans don't think word by word, nor do they start drawing a picture from the top-left corner.
I mean, not like diffusion, but generating all tokens at once?
Found SentenceVAE (2408.00655)