# **Scaling Image Tokenizers with Grouped Spherical Quantization**

---

[Paper link](https://arxiv.org/abs/2412.02632) | [GitHub repo](https://github.com/HelmholtzAI-FZJ/flex_gen) | [HF Checkpoints](https://huggingface.co/collections/HelmholtzAI-FZJ/grouped-spherical-quantization-674d6f9f548e472d0eaf179e)

In [GSQ](https://arxiv.org/abs/2412.02632), we present optimized training hyper-parameters and configurations for quantization-based image tokenizers. We also show how to scale the latent dimension, vocabulary size, etc. appropriately to achieve better reconstruction performance.

![dim-vocab-scaling.png](https://github.com/HelmholtzAI-FZJ/flex_gen/raw/main/figures/dim-vocab-scaling.png)

We also show how to scale the latent dimension (and groups) appropriately when pursuing a high down-sampling ratio in compression.

![spatial_scale.png](https://github.com/HelmholtzAI-FZJ/flex_gen/raw/main/figures/spatial_scale.png)

---

The group scaling experiment of GSQ:

| **Models** | $G \times d$ | **rFID ↓** | **IS ↑** | **LPIPS ↓** | **PSNR ↑** | **SSIM ↑** | **Usage ↑** | **PPL ↑** |
|------------|--------------|------------|----------|-------------|------------|------------|-------------|-----------|
| **GSQ F8-D64** ($V$ = 8K) | $1 \times 64$ | 0.63 | 205 | 0.08 | 22.95 | 0.67 | 99.87% | 8,055 |
| | $2 \times 32$ | 0.32 | 220 | 0.05 | 25.42 | 0.76 | 100% | 8,157 |
| | $4 \times 16$ | 0.18 | 226 | 0.03 | 28.02 | 0.08 | 100% | 8,143 |
| | $16 \times 4$ | **0.03** | **233** | **0.004** | **34.61** | **0.91** | **99.98%** | **6,775** |
| **GSQ F16-D16** ($V$ = 256K) | $1 \times 16$ | 1.63 | 179 | 0.13 | 20.70 | 0.56 | 100% | 254,044 |
| | $2 \times 8$ | 0.82 | 199 | 0.09 | 22.20 | 0.63 | 100% | 257,273 |
| | $4 \times 4$ | 0.74 | 202 | 0.08 | 22.75 | 0.63 | 62.46% | 43,767 |
| | $8 \times 2$ | 0.50 | 211 | 0.06 | 23.62 | 0.66 | 46.83% | 22,181 |
| | $16 \times 1$ | 0.52 | 210 | 0.06 | 23.54 | 0.66 | 50.81% | 181 |
| | $16 \times 1^*$ | 0.51 | 210 | 0.06 | 23.52 | 0.66 | 52.64% | 748 |
| **GSQ F32-D32** ($V$ = 256K) | $1 \times 32$ | 6.84 | 95 | 0.24 | 17.83 | 0.40 | 100% | 245,715 |
| | $2 \times 16$ | 3.31 | 139 | 0.18 | 19.01 | 0.47 | 100% | 253,369 |
| | $4 \times 8$ | 1.77 | 173 | 0.13 | 20.60 | 0.53 | 100% | 253,199 |
| | $8 \times 4$ | 1.67 | 176 | 0.12 | 20.88 | 0.54 | 59% | 40,307 |
| | $16 \times 2$ | 1.13 | 190 | 0.10 | 21.73 | 0.57 | 46% | 30,302 |
| | $32 \times 1$ | 1.21 | 187 | 0.10 | 21.64 | 0.57 | 54% | 247 |

---

## Use Pre-trained GSQ-Tokenizer

```python
# Import kept from the original example; it appears to be needed so that
# 'flexTokenizer' is registered with timm before create_model is called.
from flex_gen import autoencoders
from timm import create_model

# ============= From HF's repo
model = create_model(
    'flexTokenizer',
    pretrained=True,
    repo_id='HelmholtzAI-FZJ/GSQ-F8-D8-V64k',
)

# ============= From Local Checkpoint
model = create_model(
    'flexTokenizer',
    pretrained=True,
    path='PATH/your_checkpoint.pt',
)
```

---

## Training your tokenizer

### Set-up Python Virtual Environment

```bash
sh gen_env/setup.sh
source ./gen_env/activate.sh
# This will run pip install to download all required libraries
sh ./gen_env/install_requirements.sh
```

### Run Training

```bash
# Single GPU
python -W ignore ./scripts/train_autoencoder.py

# Multi GPU
torchrun --nnodes=1 --nproc_per_node=4 ./scripts/train_autoencoder.py --config-file=PATH/config_name.yaml \
    --output_dir=./logs_test/test opts train.num_train_steps=100 train_batch_size=16
```

### Run Evaluation

Add the checkpoint path that you want to test in `evaluation/run_tokenizer_eval.sh`:

```bash
# For example
...
configs_of_training_lists=()
configs_of_training_lists=("logs_test/test/")
...
```

Then run `sh evaluation/run_tokenizer_eval.sh`; it automatically scans `folder/model/eval_xxx.pth` for tokenizer evaluation.

---

# **Citation**

```bibtex
@misc{GSQ,
      title={Scaling Image Tokenizers with Grouped Spherical Quantization},
      author={Jiangtao Wang and Zhen Qin and Yifan Zhang and Vincent Tao Hu and Björn Ommer and Rania Briq and Stefan Kesselheim},
      year={2024},
      eprint={2412.02632},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02632},
}
```
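
---

## Sketch: reading the G × d column

In the group scaling table, every G × d entry multiplies out to the latent dimension D of its model (for example, 16 × 4 = 64 for GSQ F8-D64). The following is a minimal, dependency-free sketch of what such a grouped lookup could look like, not the repository's implementation. It assumes a single codebook shared by all groups and nearest-neighbour lookup by cosine similarity on the unit sphere; neither detail is stated in this README, and the names `l2_normalize` and `grouped_spherical_lookup` are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Project a vector onto the unit sphere (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def grouped_spherical_lookup(latent, codebook, num_groups):
    """Split a D-dim latent into G groups of d channels and return one code index per group.

    Assumption (not from the README): all groups share one codebook of d-dim
    codes, and the nearest code is the one with the highest cosine similarity.
    """
    dim = len(latent)
    if dim % num_groups != 0:
        raise ValueError("latent dimension D must be divisible by the number of groups G")
    d = dim // num_groups
    if any(len(code) != d for code in codebook):
        raise ValueError("every codebook entry must have d = D / G channels")

    unit_codes = [l2_normalize(code) for code in codebook]
    indices = []
    for g in range(num_groups):
        chunk = l2_normalize(latent[g * d:(g + 1) * d])
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(chunk, code)) for code in unit_codes]
        indices.append(max(range(len(sims)), key=sims.__getitem__))
    return indices


# Toy example: D = 4 split into G = 2 groups of d = 2 channels, V = 4 codes.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
latent = [3.0, 0.1, -0.2, -5.0]
print(grouped_spherical_lookup(latent, codebook, num_groups=2))  # [0, 3]
```

With G groups, each spatial position is represented by G indices instead of one, which is one way to read why larger G at fixed D changes reconstruction quality and codebook usage in the table.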