nielsr HF staff committed on
Commit 04132fb · 1 Parent(s): 79030f8

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -5,7 +5,7 @@ tags:
  - vision
  ---

- # Vision Transformer (base-sized model, patch size 16) trained using DINOv2
+ # Vision Transformer (base-sized model) trained using DINOv2

  Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Oquab et al. and first released in [this repository](https://github.com/facebookresearch/dinov2).

@@ -15,7 +15,7 @@ Disclaimer: The team releasing DINOv2 did not write a model card for this model

  The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion at a resolution of 224x224 pixels.

- Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
+ Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.

  Note that this model does not include any fine-tuned heads.
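
Note on the change: the edited paragraph describes splitting an image into fixed-size patches and prepending a [CLS] token, so the patch size determines how many tokens the encoder sees. The sketch below is only an illustration of that arithmetic, not code from the DINOv2 repository. The helper names `sequence_length` and `split_into_patches` are hypothetical, and the patch size of 14 is an illustrative second value that this diff does not state.

```python
def sequence_length(image_size: int, patch_size: int, add_cls: bool = True) -> int:
    """Number of tokens for a square image: one per patch, plus an optional [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + (1 if add_cls else 0)


def split_into_patches(image, patch_size):
    """Split a 2D grid (list of equal-length rows) into flattened patches.

    Patches are returned in row-major order; each patch is itself flattened
    row by row, which is the vector a linear embedding layer would consume.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # 224x224 input at the resolution named in the README.
    print(sequence_length(224, 16))  # 14 * 14 patches + 1 [CLS] = 197
    print(sequence_length(224, 14))  # 16 * 16 patches + 1 [CLS] = 257

    # Toy 4x4 "image" split into 2x2 patches.
    toy = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(split_into_patches(toy, 2))
```

Because the token count changes with the patch size, a hard-coded "16x16" in the model card would mislead readers if the checkpoint uses a different value. Dropping the number, as this commit does, avoids that.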