Denoising Vision Transformer (DVT)

Introduction

We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts (“Original features” in the teaser), which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (Fig. 1, right, Tabs. 2 to 4). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.

Model Summary

We include four model variants in this space:

  • voc_denoised: These are single-layer Transformer models that are trained to denoise the output of the original ViT models. These models are trained on the VOC dataset.
  • voc_distilled: These are models distilled from the denoiser using the ImageNet-1k dataset, where all model parameters are jointly fine-tuned. The distillation process involves three stages:
    1. Stage 1: Perform per-image denoising on the VOC dataset.
    2. Stage 2: Train the denoiser on the VOC dataset, using the features obtained from the per-image denoising in Stage 1.
    3. Stage 3: Fine-tune the entire model on the ImageNet-1k dataset, using the outputs from the Stage 2 denoiser as supervision.
  • imgnet_denoised: The same as voc_denoised, but trained on the ImageNet-1k dataset.
  • imgnet_distilled: The same as voc_distilled, but with both the denoiser and the distilled model trained on the ImageNet-1k dataset.
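As a quick reference, the four variants above can be summarized by which dataset each training step uses. The following is a minimal sketch, not part of the released code: the `Variant` class, its field names, and the `describe` helper are hypothetical names introduced here for illustration, and the dataset assignments are read directly from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    """Hypothetical record describing one released model variant."""
    name: str
    denoiser_data: str           # dataset used to train the single-layer denoiser
    distill_data: Optional[str]  # dataset used to fine-tune the full model; None if not distilled


# Dataset assignments as stated in the Model Summary list.
VARIANTS = {
    v.name: v
    for v in (
        Variant("voc_denoised", "VOC", None),
        Variant("voc_distilled", "VOC", "ImageNet-1k"),
        Variant("imgnet_denoised", "ImageNet-1k", None),
        Variant("imgnet_distilled", "ImageNet-1k", "ImageNet-1k"),
    )
}


def describe(name: str) -> str:
    """Return a one-line description of the training data behind a variant."""
    v = VARIANTS[name]
    text = f"{v.name}: denoiser trained on {v.denoiser_data}"
    if v.distill_data is not None:
        text += f", full model fine-tuned on {v.distill_data}"
    return text


for variant_name in VARIANTS:
    print(describe(variant_name))
```

Note that the distilled variants differ from the denoised ones only in the extra full-model fine-tuning step; the denoiser itself is trained the same way in both cases.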

Performance Summary

  • Baseline: The original ViT models.
| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_small_patch14_dinov2.lvd142m | 81.78 | 88.44 | 44.05 | 55.53 | 0.4340 | 0.1331 | 84.49% |
| vit_base_patch14_dinov2.lvd142m | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| vit_large_patch14_dinov2.lvd142m | 83.43 | 90.38 | 47.53 | 59.64 | 0.3831 | 0.1145 | 88.89% |
| vit_small_patch14_reg4_dinov2.lvd142m | 80.88 | 88.69 | 44.36 | 55.90 | 0.4328 | 0.1303 | 85.00% |
| vit_base_patch14_reg4_dinov2.lvd142m | 83.48 | 90.95 | 47.73 | 60.17 | 0.3967 | 0.1177 | 87.92% |
| vit_large_patch14_reg4_dinov2.lvd142m | 83.21 | 90.67 | 48.44 | 61.28 | 0.3852 | 0.1139 | 88.53% |
| deit3_base_patch16_224.fb_in1k | 71.03 | 80.67 | 32.84 | 42.79 | 0.5837 | 0.1772 | 73.03% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 77.75 | 86.68 | 40.50 | 52.81 | 0.5585 | 0.1678 | 74.30% |
| vit_base_patch16_224.dino | 62.92 | 75.98 | 31.03 | 40.62 | 0.5742 | 0.1694 | 74.55% |
| vit_base_patch16_224.mae | 50.29 | 63.10 | 23.84 | 32.06 | 0.6629 | 0.2275 | 66.24% |
| eva02_base_patch16_clip_224.merged2b | 71.49 | 82.69 | 37.89 | 50.31 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 73.51 | 83.60 | 36.46 | 48.65 | 0.6360 | 0.1898 | 69.10% |
  • DVT (voc_denoised): The denoised models trained on the VOC dataset.
| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_small_patch14_dinov2.lvd142m | 82.78 | 90.69 | 45.14 | 56.35 | 0.4368 | 0.1337 | 84.34% |
| vit_base_patch14_dinov2.lvd142m | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| vit_large_patch14_dinov2.lvd142m | 85.25 | 91.69 | 49.80 | 61.98 | 0.3826 | 0.1118 | 89.32% |
| vit_small_patch14_reg4_dinov2.lvd142m | 81.93 | 89.54 | 45.55 | 57.52 | 0.4251 | 0.1292 | 85.01% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.58 | 91.17 | 49.24 | 61.66 | 0.3898 | 0.1146 | 88.60% |
| vit_large_patch14_reg4_dinov2.lvd142m | 84.37 | 91.42 | 49.19 | 62.21 | 0.3852 | 0.1141 | 88.45% |
| deit3_base_patch16_224.fb_in1k | 73.52 | 83.65 | 33.57 | 43.56 | 0.5817 | 0.1774 | 73.05% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.50 | 88.43 | 41.33 | 53.54 | 0.5512 | 0.1639 | 75.30% |
| vit_base_patch16_224.dino | 66.41 | 77.75 | 32.45 | 42.42 | 0.5784 | 0.1738 | 73.75% |
| vit_base_patch16_224.mae | 50.65 | 62.90 | 23.25 | 31.03 | 0.6651 | 0.2271 | 65.44% |
| eva02_base_patch16_clip_224.merged2b | 73.76 | 84.50 | 37.99 | 50.40 | 0.6196 | 0.1904 | 69.86% |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 74.82 | 84.40 | 36.75 | 48.82 | 0.6316 | 0.1921 | 69.37% |
  • DVT (voc_distilled): The distilled models trained on the VOC dataset.
| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_base_patch14_dinov2.lvd142m | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.36 | 90.80 | 49.20 | 61.56 | 0.3838 | 0.1143 | 88.97% |
| deit3_base_patch16_224.fb_in1k | 73.63 | 82.74 | 34.43 | 44.96 | 0.5712 | 0.1747 | 74.00% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.86 | 88.33 | 42.28 | 54.26 | 0.5253 | 0.1571 | 77.23% |
| vit_base_patch16_224.dino | 66.80 | 78.47 | 32.68 | 42.58 | 0.5750 | 0.1696 | 73.86% |
| vit_base_patch16_224.mae | 51.91 | 64.67 | 23.73 | 31.88 | 0.6733 | 0.2282 | 65.33% |
| eva02_base_patch16_clip_224.merged2b | 75.93 | 85.44 | 40.15 | 52.04 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 76.26 | 85.14 | 38.62 | 50.61 | 0.5825 | 0.1768 | 73.14% |
  • DVT (imgnet_denoised) and DVT (imgnet_distilled): The denoised and distilled models trained on the ImageNet-1k dataset.
| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vit_base_patch14_dinov2.lvd142m (denoised) | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| vit_base_patch14_dinov2.lvd142m (distilled) | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
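This card does not define the metric columns used in the tables above. The sketch below assumes the definitions commonly used for semantic segmentation (mIoU, mAcc) and for NYU depth evaluation (RMSE, abs_rel, a1 with a 1.25 ratio threshold); the evaluation code behind the reported numbers may differ in details such as ignored labels, masking, or depth clipping. The function names and toy inputs are hypothetical.

```python
import math


def segmentation_metrics(pred, gt, num_classes):
    """Mean IoU and mean per-class accuracy over flat lists of class labels.

    Simplification (assumption): classes absent from the ground truth are skipped.
    """
    ious, accs = [], []
    for c in range(num_classes):
        tp = sum(p == c and g == c for p, g in zip(pred, gt))
        fp = sum(p == c and g != c for p, g in zip(pred, gt))
        fn = sum(p != c and g == c for p, g in zip(pred, gt))
        if tp + fn == 0:
            continue
        ious.append(tp / (tp + fp + fn))  # intersection over union for class c
        accs.append(tp / (tp + fn))       # recall (pixel accuracy) for class c
    return {"mIoU": sum(ious) / len(ious), "mAcc": sum(accs) / len(accs)}


def depth_metrics(pred, gt, threshold=1.25):
    """RMSE, absolute relative error, and a1 over paired positive depth values."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    pairs = list(zip(pred, gt))
    n = len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # a1: fraction of values whose prediction/ground-truth ratio is within the threshold.
    a1 = sum(max(p / g, g / p) < threshold for p, g in pairs) / n
    return {"rmse": rmse, "abs_rel": abs_rel, "a1": a1}


# Toy inputs, not real model outputs.
seg = segmentation_metrics(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)
depth = depth_metrics(pred=[1.0, 2.0, 4.0, 3.0], gt=[1.0, 2.2, 2.0, 3.0])
print(seg)
print(depth)
```

Under these definitions, higher is better for mIoU, mAcc, and a1, while lower is better for RMSE and abs_rel. The tables report a1 as a percentage, whereas this sketch returns a fraction between 0 and 1.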

A summary of the DINOv2-base model is shown below:

| vit_base_patch14_dinov2.lvd142m | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| voc_denoised | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| voc_distilled | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| imgnet_denoised | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| imgnet_distilled | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
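To make the differences in the DINOv2-base summary easier to read, the sketch below computes each variant's change relative to the baseline row. The numbers are copied from the table above; the helper names (`deltas`, `improved`) are hypothetical, and treating RMSE and abs_rel as lower-is-better follows from their being error metrics.

```python
COLUMNS = ("VOC_mIoU", "VOC_mAcc", "ADE_mIoU", "ADE_mAcc",
           "NYU_RMSE", "NYU_abs_rel", "NYU_a1")

# Rows copied from the DINOv2-base summary table (a1 in percent).
ROWS = {
    "baseline":         (83.52, 90.60, 47.02, 58.45, 0.3965, 0.1197, 87.59),
    "voc_denoised":     (84.92, 91.74, 48.54, 60.21, 0.3811, 0.1166, 88.42),
    "voc_distilled":    (85.10, 91.41, 48.57, 60.35, 0.3850, 0.1207, 88.25),
    "imgnet_denoised":  (85.17, 91.55, 48.68, 60.60, 0.3832, 0.1152, 88.50),
    "imgnet_distilled": (85.33, 91.48, 48.85, 60.47, 0.3704, 0.1115, 89.74),
}

# Error metrics: a negative change is an improvement.
LOWER_IS_BETTER = {"NYU_RMSE", "NYU_abs_rel"}


def deltas(variant):
    """Per-column difference (variant minus baseline), rounded to 4 decimals."""
    base = ROWS["baseline"]
    return {col: round(val - ref, 4)
            for col, val, ref in zip(COLUMNS, ROWS[variant], base)}


def improved(variant):
    """Columns on which the variant is better than the baseline."""
    result = []
    for col, diff in deltas(variant).items():
        better = diff < 0 if col in LOWER_IS_BETTER else diff > 0
        if better:
            result.append(col)
    return result


for name in ROWS:
    if name != "baseline":
        print(name, deltas(name), "improved on:", improved(name))
```

On these numbers, imgnet_distilled improves on the baseline in every column, while voc_distilled improves in every column except NYU_abs_rel.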

During our exploration, we found that the settings used for denoiser training and distillation can slightly affect the performance of the final model. For example, whether the cls token is included in the denoiser's Transformer feedforward layer can affect depth estimation performance. Our best model during this exploration achieves around 85.56 mIoU on VOC, 49.02 mIoU on ADE, and 89.98% a1 on NYU.

However, we do not include this model in the final release because the improvement is not significant enough to justify the additional complexity.

Citation

If you find this project useful, please consider citing:

@inproceedings{yang2024denoising,
  title={Denoising vision transformers},
  author={Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  booktitle={ECCV},
  year={2024}
}