EdoAbati committed on
Commit 1027762 · 1 Parent(s): 1e90e58

added author and url links

Files changed (1)
  1. README.md +8 -2
README.md CHANGED
@@ -7,6 +7,8 @@ tags:
  # Compact Convolutional Transformers

  ## Model description

  As discussed in the [Vision Transformers (ViT)](https://arxiv.org/abs/2010.11929) paper, a Transformer-based architecture for vision typically requires a larger dataset than usual, as well as a longer pre-training schedule. ImageNet-1k (which has about a million images) is considered to fall under the medium-sized data regime with respect to ViTs. This is primarily because, unlike CNNs, ViTs (or a typical Transformer-based architecture) do not have well-informed inductive biases (such as convolutions for processing images). This begs the question: can't we combine the benefits of convolution and the benefits of Transformers in a single network architecture? These benefits include parameter-efficiency, and self-attention to process long-range and global dependencies (interactions between different regions in an image).
@@ -20,7 +22,7 @@ In [Escaping the Big Data Paradigm with Compact Transformers](https://arxiv.org/
  ## Training and evaluation data

- The model is trained using the CIFAR-10 dataset. 10% of the data is used for validation.

  ## Training procedure

@@ -39,4 +41,8 @@ The following hyperparameters were used during training:
  ![Model Image](./model.png)

- </details>

  # Compact Convolutional Transformers

+ Based on the _Compact Convolutional Transformers_ example on [keras.io](https://keras.io/examples/vision/cct/) created by [Sayak Paul](https://twitter.com/RisingSayak).
+
  ## Model description

  As discussed in the [Vision Transformers (ViT)](https://arxiv.org/abs/2010.11929) paper, a Transformer-based architecture for vision typically requires a larger dataset than usual, as well as a longer pre-training schedule. ImageNet-1k (which has about a million images) is considered to fall under the medium-sized data regime with respect to ViTs. This is primarily because, unlike CNNs, ViTs (or a typical Transformer-based architecture) do not have well-informed inductive biases (such as convolutions for processing images). This begs the question: can't we combine the benefits of convolution and the benefits of Transformers in a single network architecture? These benefits include parameter-efficiency, and self-attention to process long-range and global dependencies (interactions between different regions in an image).
 
  ## Training and evaluation data

+ The model is trained using the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html).

  ## Training procedure

  ![Model Image](./model.png)

+ </details>
+
+ <center>
+ Model reproduced by <a href="https://github.com/EdAbati" target="_blank">Edoardo Abati</a>
+ </center>