An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Published March 12, 2023
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Link: https://arxiv.org/pdf/2010.11929.pdf

Overview

This week I took a closer look at the ViT paper, which contains some interesting experiments on how learning in the Transformer model scales with increasing compute and data [1].

The paper experiments with applying the Transformer model to images, making as few changes as possible to the original architecture. The authors are interested in whether a Transformer can learn, from data and compute alone, the inductive biases that might otherwise be designed into the network architecture itself. For example, convolutional neural networks (CNNs) are translation equivariant by design: if an object in an image is shifted, the corresponding feature activations shift along with it, so the network responds in essentially the same way regardless of where the object appears.

Model

The standard Transformer was originally applied in Natural Language Processing (NLP) to sequences of word tokens. In order to use the same architecture for images, they need to be tokenized as well. The authors use a simple technique: an image is split into a sequence of non-overlapping patches, ordered from the top-left corner of the image to the bottom-right corner. To obtain a patch embedding, each patch is flattened into a 1D vector and multiplied by a learned weight matrix. Once the image is tokenized in this way, the sequence of patch embeddings can be fed to a Transformer as an input sequence.
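
As a rough sketch of this tokenization step (my own PyTorch illustration, not the authors' code; the patch size of 16 and embedding dimension of 768 match the ViT-Base configuration):

```python
# Minimal sketch of the patch tokenization described above (my own PyTorch
# illustration, not the authors' code). Patch size 16 and embedding
# dimension 768 correspond to the ViT-Base configuration.
import torch

def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split images of shape (B, C, H, W) into flattened patches of shape
    (B, N, patch_size * patch_size * C), ordered row by row from the
    top-left to the bottom-right corner."""
    b, c, h, w = images.shape
    p = patch_size
    x = images.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1)              # (B, H/p, W/p, p, p, C)
    return x.reshape(b, (h // p) * (w // p), p * p * c)

patch_size, embed_dim = 16, 768
proj = torch.nn.Linear(patch_size * patch_size * 3, embed_dim)  # the weight matrix

images = torch.randn(2, 3, 224, 224)             # a dummy batch of RGB images
tokens = proj(patchify(images, patch_size))      # (2, 196, 768): 14 x 14 patches
```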

In this paper, the authors use ViT primarily for image classification. To do this, an extra learnable embedding, the classification token, is prepended to the sequence, and the Transformer's output at that position is used to predict the target class label of the image [2].
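
A minimal sketch of this classification token, under the same assumptions as the snippet above (the single linear head mirrors the paper's fine-tuning setup; the Transformer encoder itself is omitted to keep the sketch short):

```python
# Sketch of the extra classification token (my own illustration, not the
# authors' code). A learnable [class] embedding is prepended to the patch
# sequence, and the class prediction is read off the encoder output at that
# position; the single linear head mirrors the paper's fine-tuning setup.
import torch

embed_dim, num_classes = 768, 1000
tokens = torch.randn(2, 196, embed_dim)                        # patch embeddings as above
cls_token = torch.nn.Parameter(torch.zeros(1, 1, embed_dim))   # learned [class] embedding
head = torch.nn.Linear(embed_dim, num_classes)                 # classification head

x = torch.cat([cls_token.expand(tokens.shape[0], -1, -1), tokens], dim=1)  # (2, 197, 768)
# x would now go through the Transformer encoder; we reuse it directly here
# just to keep the sketch self-contained.
logits = head(x[:, 0])                                         # prediction from the [class] token
```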

It is important to note that although Transformers do not have translation equivariance baked into their architecture, nor do they start with any a priori spatial understanding of each patch relative to the others, they do exploit one particular inductive bias, namely the order of the patches in the sequence, through the use of positional embeddings.
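
The positional information could be injected roughly like this (again my own sketch; ViT uses learnable 1D positional embeddings, one per sequence position):

```python
# Sketch of the learned positional embeddings (my own illustration). ViT adds
# one learnable embedding per sequence position ([class] token plus patches),
# so the only spatial prior the model receives is the ordering of the patches.
import torch

num_patches, embed_dim = 196, 768
pos_embed = torch.nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

x = torch.randn(2, num_patches + 1, embed_dim)   # [class] token + patch embeddings
x = x + pos_embed                                 # broadcasts over the batch dimension
```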

The authors also experimented with a hybrid architecture, where the patch embeddings fed to the Transformer are instead produced by convolutional layers of a state-of-the-art CNN [3]. As the experiments showed, the hybrid model outperformed the plain ViT at small model sizes, but as more computation was spent during pre-training, ViT overtook the hybrid. This somewhat suggests that the convolutional features can interfere with what the Transformer would otherwise learn on its own.
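
For intuition, the flatten-and-project step from the first sketch can equivalently be written as a strided convolution; the hybrid model goes further and builds the Transformer input from CNN feature maps rather than raw pixels. A sketch of the convolutional formulation (my own illustration):

```python
# Sketch of a convolution-based patch embedding (my own illustration). A
# convolution whose kernel size equals its stride computes, for every
# non-overlapping patch, a linear projection of that patch's pixels, i.e. the
# same operation as the flatten-and-multiply step above. In the hybrid model,
# the Transformer input is built from CNN feature maps instead of raw pixels.
import torch

patch_size, embed_dim = 16, 768
conv_embed = torch.nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(2, 3, 224, 224)
feat = conv_embed(images)                        # (2, 768, 14, 14)
tokens = feat.flatten(2).transpose(1, 2)         # (2, 196, 768) patch embeddings
```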

Results

In the experiments, the models were pre-trained on large datasets and fine-tuned on smaller datasets with higher-resolution images. Fine-tuning at a higher resolution with the same patch size yields more patches than there are pre-trained positional embeddings, so to reuse them the authors interpolated the existing embeddings: the pre-trained positional embeddings were placed at their corresponding patch locations in the higher-resolution image, and the missing ones were computed by bicubic interpolation between the existing ones.
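
A sketch of what this interpolation could look like (my own illustration, assuming a square patch grid and a separate classification-token embedding that is left untouched):

```python
# Sketch of positional-embedding interpolation for fine-tuning at a higher
# resolution (my own illustration, assuming a square patch grid). The [class]
# token's embedding is kept unchanged; the patch embeddings are laid out on
# their 2D grid and resized with bicubic interpolation.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid**2, D) -> (1, 1 + new_grid**2, D)."""
    cls_embed, patch_embed = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_embed.shape[1] ** 0.5)
    d = patch_embed.shape[2]
    grid = patch_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    patch_embed = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_embed, patch_embed], dim=1)

pos_embed = torch.randn(1, 1 + 14 * 14, 768)      # pre-trained at 224x224, patch size 16
pos_embed_hi = resize_pos_embed(pos_embed, 24)    # fine-tune at 384x384, patch size 16
```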

In one experiment, after pre-training ViT on a large dataset and then fine-tuning it on smaller datasets, the model outperformed the ResNet baseline [4] in accuracy on all of the datasets. I am a little sceptical about the significance of these results, mainly because of the small number of runs (three in this case) that were averaged over. What was striking to me, though, was that the fine-tuning was more than four times faster, assuming that the same number of TPUv3 cores was used for all models.

In a second experiment, the authors trained the models on subsets of the large JFT-300M dataset of increasing size. On the smaller subsets, the ResNet models performed better, but as the data size increased, the ViT models caught up and overtook them, while the ResNets plateaued early. This seems to indicate that, when there is plenty of data, learning inductive biases directly from it trumps careful engineering of inductive biases into the network architecture.

Another interesting experiment showed that the mean attention distance, measured in image space, increases with the depth of the self-attention layers. In a way, this mirrors the hierarchy of abstraction found in CNNs, where it is well known that the first layers encode basic structures such as edges and circles, while deeper layers capture more complex structures such as cars, ears, and mouths.
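
As a rough sketch of how this metric can be computed from an attention map (my own reading: for each query patch, take the attention-weighted average of the pixel distance to every key patch, then average over queries and report one value per head):

```python
# Rough sketch of the mean attention distance metric (my own reading of it):
# for each query patch, take the attention-weighted average of the distance
# in image space to every key patch, then average over all queries.
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int, patch_size: int) -> torch.Tensor:
    """attn: (heads, N, N) attention weights over N = grid * grid patches.
    Returns one mean distance (in pixels) per head."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_size
    dists = torch.cdist(coords, coords)              # (N, N) pairwise pixel distances
    return (attn * dists).sum(dim=-1).mean(dim=-1)   # weighted avg per query, then mean over queries

attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)   # dummy attention map, 12 heads
print(mean_attention_distance(attn, grid=14, patch_size=16))
```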

Conclusion

  • Images have a nice structure that allows them to be represented simply as a sequence of tokens.
  • Providing the learning algorithm with the means to infer its own inductive biases from large datasets is more economical than building those biases into the network architecture a priori.
    • Let the algorithm decide which inductive biases are useful (some of which a network designer might otherwise have had to build in by hand).
    • Engineered inductive biases may even interfere with the learned ones, thus degrading the performance of the algorithm.

References

  1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  2. The output of the last Transformer layer at this token's position is fed into a classification head (an MLP) that learns to predict the class label, using for example a cross-entropy loss for single-label classification.

  3. Specifically, in the hybrid model the input sequence is formed from the feature maps of a CNN applied to the image: each spatial position of the feature map is flattened and projected to obtain the corresponding patch embedding.

  4. The authors use a modified ResNet to turn it into the strongest competitor for ViT: it replaces Batch Normalization with Group Normalization and uses standardized convolutions.