This is a masked autoencoder (MAE) trained on an anime dataset. The main goal is a model that is efficient for image search, retrieval, and clustering.

The model has two parts: an encoder and a decoder. The encoder encodes the full image into an 8x512 embedding and the masked-out image into an 8 (28x28/10) x 512 embedding. The decoder tries to reconstruct the image.
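As a rough illustration of how an 8x512 embedding could be used for search and retrieval, here is a minimal sketch. It does not load this model: the embeddings are random stand-ins, and the mean-pooling step and the helper names `pool_embedding` and `top_k_similar` are assumptions made for this example, not part of the released code.

```python
import numpy as np


def pool_embedding(tokens: np.ndarray) -> np.ndarray:
    """Mean-pool an (8, 512) embedding into one unit-length (512,) vector.

    Mean pooling is an assumption for this sketch; flattening to 4096
    dimensions would be another reasonable choice.
    """
    vec = tokens.mean(axis=0)
    return vec / (np.linalg.norm(vec) + 1e-12)


def top_k_similar(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k gallery items closest to the query by cosine similarity."""
    q = pool_embedding(query)
    g = np.stack([pool_embedding(t) for t in gallery])
    scores = g @ q  # cosine similarity, since all vectors are unit length
    return np.argsort(-scores)[:k]


# Random stand-ins for encoder outputs: 100 images, each an 8x512 embedding.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 8, 512))

# A query that is a slightly perturbed copy of gallery item 42.
query = gallery[42] + 0.01 * rng.normal(size=(8, 512))

nearest = top_k_similar(query, gallery, k=3)
print(nearest)
```

The same pooled vectors could be passed to a clustering routine such as k-means; for large galleries, an approximate nearest-neighbor index would likely replace the brute-force dot product.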

The model architecture is LocalViT-small, but with 16 layers instead of 12. The decoder is a simple transformer with a LocalViT-style MLP.
