SimMIM: A Simple Framework for Masked Image Modeling
This repository primarily hosts the SimMIM-pretrained Swin-V2 models used in the "On Data Scaling in Masked Image Modeling" study. If you have any questions about SimMIM or the data scaling study, please file an issue in this repository or contact [email protected] directly. Please note that the SimMIM and Swin-Transformer repositories managed by Microsoft are no longer within my scope.
SimMIM Pretrained Swin-V2 Models
You can download the checkpoints through the direct links in the table below, or from Python with the huggingface_hub library.
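As a minimal sketch of the Python route, the helper below wraps `hf_hub_download` from the huggingface_hub library. The function name `download_checkpoint` is hypothetical, and this README does not spell out the Hub repository ID or the checkpoint filenames, so both are left as arguments to be copied from the corresponding "huggingface" link in the table.

```python
from pathlib import Path
from typing import Optional


def download_checkpoint(repo_id: str, filename: str,
                        cache_dir: Optional[str] = None) -> Path:
    """Fetch one file from the Hugging Face Hub and return its local path.

    repo_id and filename are NOT given in this README; take them from the
    "huggingface" link of the table row you are interested in.
    """
    # Imported lazily so this helper can be defined even when
    # huggingface_hub is not installed (pip install huggingface_hub).
    from huggingface_hub import hf_hub_download

    # hf_hub_download caches the file locally and returns the cached path.
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        cache_dir=cache_dir,
    )
    return Path(local_path)


# Example call (placeholders, not real identifiers):
# path = download_checkpoint("<repo-id>", "<checkpoint-filename>")
```

Repeated calls with the same arguments should reuse the locally cached file rather than downloading it again, which is convenient when working through several rows of the table.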
- Model size only includes the backbone weights and excludes weights in the decoders/classification heads.
- Batch size for all models is set to 2048.
- Validation loss is calculated on the ImageNet-1K validation set.
- Fine-tuned acc@1 refers to the top-1 accuracy on the ImageNet-1K validation set after fine-tuning.
name | model size | pre-train dataset | pre-train iterations | validation loss | fine-tuned acc@1 | pre-trained model | fine-tuned model |
---|---|---|---|---|---|---|---|
SwinV2-Small | 49M | ImageNet-1K 10% | 125k | 0.4820 | 82.69 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 10% | 250k | 0.4961 | 83.11 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 10% | 500k | 0.5115 | 83.17 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 20% | 125k | 0.4751 | 83.05 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 20% | 250k | 0.4722 | 83.56 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 20% | 500k | 0.4734 | 83.75 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 50% | 125k | 0.4732 | 83.04 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 50% | 250k | 0.4681 | 83.67 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K 50% | 500k | 0.4646 | 83.96 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K | 125k | 0.4728 | 82.92 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K | 250k | 0.4674 | 83.66 | huggingface | huggingface |
SwinV2-Small | 49M | ImageNet-1K | 500k | 0.4641 | 84.08 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 10% | 125k | 0.4822 | 83.33 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 10% | 250k | 0.4997 | 83.60 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 10% | 500k | 0.5112 | 83.41 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 20% | 125k | 0.4703 | 83.86 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 20% | 250k | 0.4679 | 84.37 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 20% | 500k | 0.4711 | 84.61 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 50% | 125k | 0.4683 | 84.04 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 50% | 250k | 0.4633 | 84.57 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K 50% | 500k | 0.4598 | 84.95 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K | 125k | 0.4680 | 84.13 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K | 250k | 0.4626 | 84.65 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-1K | 500k | 0.4588 | 85.04 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-22K | 125k | 0.4695 | 84.11 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-22K | 250k | 0.4649 | 84.57 | huggingface | huggingface |
SwinV2-Base | 87M | ImageNet-22K | 500k | 0.4614 | 85.11 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 10% | 125k | 0.4995 | 83.69 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 10% | 250k | 0.5140 | 83.66 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 10% | 500k | 0.5150 | 83.50 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 20% | 125k | 0.4675 | 84.38 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 20% | 250k | 0.4746 | 84.71 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 20% | 500k | 0.4960 | 84.59 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 50% | 125k | 0.4622 | 84.78 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 50% | 250k | 0.4566 | 85.38 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K 50% | 500k | 0.4530 | 85.80 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K | 125k | 0.4611 | 84.98 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K | 250k | 0.4552 | 85.45 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-1K | 500k | 0.4507 | 85.91 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-22K | 125k | 0.4649 | 84.61 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-22K | 250k | 0.4586 | 85.39 | huggingface | huggingface |
SwinV2-Large | 195M | ImageNet-22K | 500k | 0.4536 | 85.81 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K 20% | 125k | 0.4789 | 84.35 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K 20% | 250k | 0.5038 | 84.16 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K 20% | 500k | 0.5071 | 83.44 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K 50% | 125k | 0.4549 | 85.09 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K 50% | 250k | 0.4511 | 85.64 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K 50% | 500k | 0.4559 | 85.69 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K | 125k | 0.4531 | 85.23 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K | 250k | 0.4464 | 85.90 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-1K | 500k | 0.4416 | 86.34 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-22K | 125k | 0.4564 | 85.14 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-22K | 250k | 0.4499 | 85.86 | huggingface | huggingface |
SwinV2-Huge | 655M | ImageNet-22K | 500k | 0.4444 | 86.27 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-1K 50% | 125k | 0.4534 | 85.44 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-1K 50% | 250k | 0.4515 | 85.76 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-1K 50% | 500k | 0.4719 | 85.51 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-1K | 125k | 0.4513 | 85.57 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-1K | 250k | 0.4442 | 86.12 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-1K | 500k | 0.4395 | 86.46 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-22K | 125k | 0.4544 | 85.39 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-22K | 250k | 0.4475 | 85.96 | huggingface | huggingface |
SwinV2-giant | 1.06B | ImageNet-22K | 500k | 0.4416 | 86.53 | huggingface | huggingface |
Citations
Citing SimMIM
@inproceedings{xie2021simmim,
title={SimMIM: A Simple Framework for Masked Image Modeling},
author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
Citing "On Data Scaling in Masked Image Modeling"
@article{xie2022data,
title={On Data Scaling in Masked Image Modeling},
author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Wei, Yixuan and Dai, Qi and Hu, Han},
journal={arXiv preprint arXiv:2206.04664},
year={2022}
}
Citing Swin V2
@inproceedings{liu2021swinv2,
title={Swin Transformer V2: Scaling Up Capacity and Resolution},
author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}