SimMIM: A Simple Framework for Masked Image Modeling

This repository primarily hosts the SimMIM-pretrained Swin-V2 models used in the "On Data Scaling in Masked Image Modeling" study. If you have questions about SimMIM or the data scaling study, please file an issue in this repository or contact [email protected] directly. Please note that the Microsoft-managed SimMIM and Swin-Transformer repositories are no longer within my scope.

SimMIM Pretrained Swin-V2 Models

You can download the checkpoints through the direct links in the table below, or fetch them from Python with the huggingface_hub library.
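As a rough illustration of the huggingface_hub route, the sketch below wraps two functions from that library, `list_repo_files` and `hf_hub_download`. The helper names `list_checkpoints` and `download_checkpoint`, the `.pth` suffix filter, and the placeholder repository id and filename in the usage comment are assumptions made for this example; check the repository's file listing for the actual names.

```python
from pathlib import Path
from typing import List, Optional


def list_checkpoints(repo_id: str, suffix: str = ".pth") -> List[str]:
    """Return the files in a Hub repository whose names end with `suffix`.

    The default suffix is an assumption; adjust it to match the repository.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return [name for name in list_repo_files(repo_id) if name.endswith(suffix)]


def download_checkpoint(
    repo_id: str, filename: str, cache_dir: Optional[str] = None
) -> Path:
    """Download one file from a Hub repository and return its local path."""
    from huggingface_hub import hf_hub_download

    # hf_hub_download caches the file locally and returns the cached path.
    local_path = hf_hub_download(
        repo_id=repo_id, filename=filename, cache_dir=cache_dir
    )
    return Path(local_path)


# Usage (placeholders, not real identifiers):
# files = list_checkpoints("<user>/<repo>")
# path = download_checkpoint("<user>/<repo>", files[0])
```

Listing the files first avoids guessing checkpoint names, and repeated calls to `download_checkpoint` reuse the local cache rather than downloading again.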

  • Model size only includes the backbone weights and excludes weights in the decoders/classification heads.
  • Batch size for all models is set to 2048.
  • Validation loss is calculated on the ImageNet-1K validation set.
  • Fine-tuned acc@1 refers to the top-1 accuracy on the ImageNet-1K validation set after fine-tuning.

| name | model size | pre-train dataset | pre-train iterations | validation loss | fine-tuned acc@1 | pre-trained model | fine-tuned model |
|---|---|---|---|---|---|---|---|
| SwinV2-Small | 49M | ImageNet-1K 10% | 125k | 0.4820 | 82.69 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 10% | 250k | 0.4961 | 83.11 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 10% | 500k | 0.5115 | 83.17 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 20% | 125k | 0.4751 | 83.05 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 20% | 250k | 0.4722 | 83.56 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 20% | 500k | 0.4734 | 83.75 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 50% | 125k | 0.4732 | 83.04 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 50% | 250k | 0.4681 | 83.67 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 50% | 500k | 0.4646 | 83.96 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K | 125k | 0.4728 | 82.92 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K | 250k | 0.4674 | 83.66 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K | 500k | 0.4641 | 84.08 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 10% | 125k | 0.4822 | 83.33 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 10% | 250k | 0.4997 | 83.60 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 10% | 500k | 0.5112 | 83.41 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 20% | 125k | 0.4703 | 83.86 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 20% | 250k | 0.4679 | 84.37 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 20% | 500k | 0.4711 | 84.61 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 50% | 125k | 0.4683 | 84.04 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 50% | 250k | 0.4633 | 84.57 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 50% | 500k | 0.4598 | 84.95 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K | 125k | 0.4680 | 84.13 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K | 250k | 0.4626 | 84.65 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K | 500k | 0.4588 | 85.04 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-22K | 125k | 0.4695 | 84.11 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-22K | 250k | 0.4649 | 84.57 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-22K | 500k | 0.4614 | 85.11 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 10% | 125k | 0.4995 | 83.69 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 10% | 250k | 0.5140 | 83.66 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 10% | 500k | 0.5150 | 83.50 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 20% | 125k | 0.4675 | 84.38 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 20% | 250k | 0.4746 | 84.71 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 20% | 500k | 0.4960 | 84.59 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 50% | 125k | 0.4622 | 84.78 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 50% | 250k | 0.4566 | 85.38 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 50% | 500k | 0.4530 | 85.80 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K | 125k | 0.4611 | 84.98 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K | 250k | 0.4552 | 85.45 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K | 500k | 0.4507 | 85.91 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-22K | 125k | 0.4649 | 84.61 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-22K | 250k | 0.4586 | 85.39 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-22K | 500k | 0.4536 | 85.81 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 20% | 125k | 0.4789 | 84.35 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 20% | 250k | 0.5038 | 84.16 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 20% | 500k | 0.5071 | 83.44 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 50% | 125k | 0.4549 | 85.09 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 50% | 250k | 0.4511 | 85.64 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 50% | 500k | 0.4559 | 85.69 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K | 125k | 0.4531 | 85.23 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K | 250k | 0.4464 | 85.90 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K | 500k | 0.4416 | 86.34 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-22K | 125k | 0.4564 | 85.14 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-22K | 250k | 0.4499 | 85.86 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-22K | 500k | 0.4444 | 86.27 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K 50% | 125k | 0.4534 | 85.44 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K 50% | 250k | 0.4515 | 85.76 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K 50% | 500k | 0.4719 | 85.51 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K | 125k | 0.4513 | 85.57 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K | 250k | 0.4442 | 86.12 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K | 500k | 0.4395 | 86.46 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-22K | 125k | 0.4544 | 85.39 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-22K | 250k | 0.4475 | 85.96 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-22K | 500k | 0.4416 | 86.53 | huggingface | huggingface |

Citations

Citing SimMIM

@inproceedings{xie2021simmim,
  title={SimMIM: A Simple Framework for Masked Image Modeling},
  author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Citing "On Data Scaling in Masked Image Modeling"

@article{xie2022data,
  title={On Data Scaling in Masked Image Modeling},
  author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Wei, Yixuan and Dai, Qi and Hu, Han},
  journal={arXiv preprint arXiv:2206.04664},
  year={2022}
}

Citing Swin V2

@inproceedings{liu2021swinv2,
  title={Swin Transformer V2: Scaling Up Capacity and Resolution},
  author={Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}