Abstract
Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves performance competitive with state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, outperforming the MaskGIT baseline by 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant at higher resolution. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the number of image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
Community
Can anyone see any mention of the embedding dimension for these models? I can't see it stated anywhere in the paper and they're yet to release any code 🤨
Thanks for your interest.
The tokenizer and detokenizer are standard ViT-S/B/L (for TiTok-S/B/L). For the codebook, the config is given in "4.1 Preliminary Experiments of 1D Tokenization - Preliminary Experimental Setup.", quoted as "the codebook C is configured to have N = 1024 entries with each entry a vector with 16 channels".
The final models increase the codebook size to 4096, as mentioned in "4.2 Main Experiments - Implementation Details", quoted as "In the final setting for TiTok training, the codebook is configured to N = 4096".
The code & model are currently under internal review and we will try our best to release them to the public ASAP :)
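To make the quoted codebook configuration concrete, here is a minimal sketch of a nearest-neighbour codebook lookup. It is not the authors' implementation: the random codebook, the perturbed "encoder outputs", and the function name `quantize` are all made up for illustration. Only N = 1024, the 16 channels, and the 32 tokens per image come from the paper and the reply above.

```python
import random

def quantize(tokens, codebook):
    """Return, for each latent token, the index of its nearest codebook
    entry under squared Euclidean distance (plain vector quantization)."""
    indices = []
    for t in tokens:
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(t, codebook[k])),
        )
        indices.append(best)
    return indices

random.seed(0)
N, C, K = 1024, 16, 32  # codebook entries, channels per entry, tokens per image

# Hypothetical codebook: random Gaussian vectors stand in for learned entries.
codebook = [[random.gauss(0.0, 1.0) for _ in range(C)] for _ in range(N)]

# Hypothetical encoder output: known codebook entries plus tiny noise,
# so we can check that the lookup recovers the indices we started from.
true_ids = [random.randrange(N) for _ in range(K)]
tokens = [[v + 1e-3 * random.gauss(0.0, 1.0) for v in codebook[i]] for i in true_ids]

ids = quantize(tokens, codebook)
print(len(ids), ids == true_ids)
```

Under these assumptions, each image is represented by 32 integers in the range [0, N), and the 16-channel vectors are only needed inside the tokenizer and detokenizer.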
If I understand correctly, this can be used as a lossy compression system to achieve compression ratios in excess of 1000:1 (e.g. 256 x 256 x 24 bits ≈ 1.5 Mbit, i.e. 192 KiB, but 32 tokens x 12 bits per token = 384 bits). Is this correct?
If so, have you evaluated this against other extreme lossy compression systems? I'd be very curious to see the result!
Thanks for your interest and comments.
The compression ratio computation seems correct and reasonable to me, and it is an interesting perspective from which to view the problem! (We currently use the number of tokens to measure how compact the latent space is, but measuring in bits as you did also sounds reasonable and interesting.)
TBH I am not very familiar with lossy compression systems themselves. Are there any references you could point me to for a comparison against other "extreme lossy compression systems"?
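For reference, the bit-level arithmetic discussed above can be checked with a short sketch. It assumes 8 bits per colour channel for the raw image and a fixed-length code of ceil(log2(N)) bits per token, with no entropy coding; the helper name `bits_per_token` is made up for illustration.

```python
def bits_per_token(codebook_size):
    """Bits needed to index one codebook entry with a fixed-length code."""
    return (codebook_size - 1).bit_length()

# Raw image: 256 x 256 pixels, 3 channels, 8 bits each (24 bits per pixel).
raw_bits = 256 * 256 * 24
raw_kib = raw_bits / 8 / 1024

# TiTok latent: 32 tokens, each an index into a 4096-entry codebook.
token_bits = 32 * bits_per_token(4096)

ratio = raw_bits / token_bits
print(raw_bits, raw_kib, token_bits, ratio)
```

With these assumptions the raw image is 1,572,864 bits (192 KiB), the token sequence is 384 bits (48 bytes), and the ratio is 4096:1. With the preliminary 1024-entry codebook, each token would need 10 bits instead of 12.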
This is very exciting! Using this tokenizer for a multimodal LLM like LLaVA, with such a high compression ratio, would be a great solution.
Thanks for sharing this amazing paper! I have a question: what module did you use for upsampling from the output mask tokens to pixels, i.e., (H/f, W/f, D) -> (224, 224, 3)?
Thank you in advance!
We use a small conv decoder (re-using the MaskGIT-VQGAN's decoder at the decoder-finetuning stage) to upsample the mask tokens to pixels. Ideally this should not be a problem, and a simple linear layer (similar to MAE's last layer) should work as well.
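The linear-layer alternative mentioned above can be sketched as follows: each token is projected to patch x patch x 3 values, and the resulting patches are tiled back into an image. This is only an illustration of an MAE-style linear head, not the conv decoder the authors actually use. The function name `linear_unpatchify`, the random weights, and the toy sizes are assumptions.

```python
import random

def linear_unpatchify(tokens, weight, bias, patch):
    """Project each D-dim token to patch*patch*3 values and tile the
    patches into an (h*patch) x (w*patch) x 3 image (nested lists)."""
    h, w = len(tokens), len(tokens[0])
    image = [[None] * (w * patch) for _ in range(h * patch)]
    for i in range(h):
        for j in range(w):
            t = tokens[i][j]
            # Linear layer: one token -> one flattened RGB patch.
            flat = [
                bias[k] + sum(t[d] * weight[d][k] for d in range(len(t)))
                for k in range(len(bias))
            ]
            # Scatter the flattened patch into its place in the image.
            for py in range(patch):
                for px in range(patch):
                    base = (py * patch + px) * 3
                    image[i * patch + py][j * patch + px] = flat[base:base + 3]
    return image

random.seed(0)
h, w, D, patch = 2, 2, 4, 2  # toy sizes; a real model would use e.g. patch = 16
out_dim = patch * patch * 3
tokens = [[[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(w)] for _ in range(h)]
weight = [[random.gauss(0.0, 1.0) for _ in range(out_dim)] for _ in range(D)]
bias = [0.0] * out_dim

image = linear_unpatchify(tokens, weight, bias, patch)
print(len(image), len(image[0]), len(image[0][0]))
```

A grid of (H/f, W/f) tokens with patch = f yields an H x W x 3 output, which matches the shape change asked about in the question.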
Transformers are strong enough to encode and decode. Could the impressive compression ratio be attributed to the high codebook usage?
The fact that only a few thousand fully utilized discrete codes are enough to describe all the images is also meaningful for VLMs.
Hi, I have some questions. The paper mentions that the proxy codes come from MaskGIT (codebook 1024 x 16), which is convolution-based. But TiTok has 4096 codes and is based on ViT. How does the distillation work with a different architecture and codebook setting?
I have figured out the training a little, but I am still wondering what the proxy codes refer to: Ze or Zq (i.e., before or after MaskGIT's quantization)?
Does the first-stage loss only include the alignment between TiTok's output and the proxy codes, without considering the loss of the RGB result (TiTok encoder -> TiTok decoder -> MaskGIT decoder) against the GT image?
Thank you in advance
Hi all, thanks a lot for your interest in our work. You are welcome to check out our code at https://github.com/bytedance/1d-tokenizer or play with the HF demo at https://huggingface.co/spaces/fun-research/TiTok
Hello, it looks like the model cannot reconstruct an image larger than 512.
(It needs cropping; resizing does not work.)
Is this a limitation?
Hi,
Thanks for your interest in our work. In the paper we have verified TiTok at both 256 and 512 resolution, and both work fine. If you mean arbitrary input size/aspect ratio for a trained TiTok, I believe that is a standalone research topic for vision transformers and beyond the scope of this paper. Here are some references that aim to remove this ViT limitation, if you are interested.
https://arxiv.org/abs/2307.06304
https://huggingface.co/adept/fuyu-8b
Where is the example of reconstructing a 512 image?