Vision Transformer Architecture
- Shashank Shekhar
- Dec 23, 2024
- 3 min read

The Vision Transformer (ViT) processes images differently from traditional Convolutional Neural Networks (CNNs): instead of operating on individual pixels, it divides the image into fixed-size patches. Each patch is flattened into a vector whose length is determined by the patch size and the number of color channels. Since transformers have no innate notion of the spatial relationships between these patches, a positional embedding is added to each patch embedding to encode its position within the image.

Once the patches are embedded and combined with positional encodings, they are fed into a standard transformer encoder. Using its Multi-Head Self-Attention mechanism, the encoder computes relationships between patches and extracts global image features. ViT also prepends a learnable classification ([CLS]) token to the patch sequence; its output representation is passed to a Multi-Layer Perceptron (MLP) head for the final classification. The MLP consists of two layers with Gaussian Error Linear Unit (GELU) activations, which add non-linearity to the model.
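As a rough sketch of this pipeline, the patching and embedding steps can be written in a few lines of NumPy. The projection matrix, positional embeddings, and [CLS] token below are random placeholders standing in for trained weights:

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    ph = pw = patch_size
    patches = image.reshape(H // ph, ph, W // pw, pw, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, ph * pw * C)
    return patches  # shape: (num_patches, patch_size**2 * C)

def embed_patches(patches, W_proj, pos_embed, cls_token):
    """Project patches, prepend the [CLS] token, add positional embeddings."""
    tokens = patches @ W_proj                 # (N, D)
    tokens = np.vstack([cls_token, tokens])   # (N + 1, D) after prepending [CLS]
    return tokens + pos_embed                 # (N + 1, D)

# Toy example: a 32x32 RGB image, 8x8 patches, embedding dimension 16
rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))
patches = patchify(img, 8)                    # 16 patches, each of length 192
W_proj = rng.normal(size=(8 * 8 * 3, 16))
cls = np.zeros((1, 16))
pos = rng.normal(size=(patches.shape[0] + 1, 16))
tokens = embed_patches(patches, W_proj, pos, cls)
print(tokens.shape)  # (17, 16)
```

The resulting `(17, 16)` token matrix — 16 patch tokens plus the [CLS] token — is what the transformer encoder consumes.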
ViT's performance compared to CNN-based architectures, as reported in the original ViT paper, is summarized below:

As the results show, ViT outperforms typical CNN-based architectures such as ResNet by a wide margin on large-scale benchmarks like ImageNet. However, EfficientNet still edges ahead on many typical vision tasks. That gap motivated a few cross-pollination efforts to modify ViT, which are detailed at the end of this blog. But first, let's summarize the differences between ViT and EfficientNet:
| Feature | Vision Transformer (ViT) | EfficientNet |
|---|---|---|
| Core Idea | Transformer-based | CNN-based |
| Input Representation | Image patches + positional embeddings | Full-image convolution |
| Inductive Bias | Minimal | Strong |
| Scalability | Needs large datasets | Scales well across depth, width, and resolution |
| Inference Efficiency | Computationally expensive | Resource-efficient |
| Best Suited For | Long-range dependencies, large datasets | Small datasets, real-time apps |
| Architecture | Multi-head self-attention, feed-forward networks, layer normalization, skip connections | Depthwise separable convolutions, squeeze-and-excitation modules (channel attention), batch normalization, Swish activation (an improvement on ReLU) |
| Training Performance | Requires more training data to reach full potential; performs better on tasks that benefit from global context and long-range dependency modeling (e.g., image classification, segmentation) | More data-efficient, making it suitable for scenarios with limited training data |
| Applications | Image classification, object detection, image segmentation | Most vision tasks; efficient enough to run on edge devices |
| Stay Away When | Labeled data is scarce, or very high-resolution images must be processed during training | — |
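The multi-head self-attention listed under ViT's architecture above can be sketched in NumPy as follows. The weight matrices are random placeholders, and the per-head split mirrors the standard scaled dot-product formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product attention over a token sequence X of shape (N, D)."""
    N, D = X.shape
    d_head = D // num_heads
    # Project to queries/keys/values, then split into heads: (heads, N, d_head)
    Q = (X @ Wq).reshape(N, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(N, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(N, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, N, N)
    attn = softmax(scores, axis=-1)                       # rows sum to 1
    out = (attn @ V).transpose(1, 0, 2).reshape(N, D)     # merge heads back
    return out @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(17, 16))                 # 16 patch tokens + [CLS]
W = [rng.normal(size=(16, 16)) for _ in range(4)]
Y = multi_head_attention(X, *W, num_heads=4)
print(Y.shape)  # (17, 16)
```

Every token attends to every other token, which is exactly where ViT's global-context strength (and its quadratic compute cost) comes from.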
Inspired Architectures and Key Innovations
The success of Vision Transformers (ViTs) has spurred numerous derivative architectures and innovative methods aimed at addressing their limitations and broadening their applicability to a wide range of visual tasks, including segmentation, object detection, and self-supervised learning. Notable advancements include:
Patch-Based Enhancements and Improved Interaction
Swin Transformer (2021) by Liu et al.: Features a hierarchical architecture with shifted windows for efficient computation. By employing local self-attention within non-overlapping windows, Swin captures local features at lower layers and scales to global features at deeper layers, excelling in dense prediction tasks like image segmentation.
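A minimal sketch of Swin-style window partitioning is below; the shifted variant simply rolls the feature map before partitioning, so window boundaries move between layers. Tensor layout and function names are simplified assumptions, not the paper's implementation:

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows."""
    H, W, C = x.shape
    ws = window_size
    windows = x.reshape(H // ws, ws, W // ws, ws, C)
    windows = windows.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)
    return windows  # (num_windows, window_size**2, C)

def shift_windows(x, shift):
    """Cyclically shift the feature map so the next layer's windows straddle
    the previous layer's boundaries."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

x = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
w = window_partition(x, 4)                        # 4 windows of 16 tokens each
shifted = window_partition(shift_windows(x, 2), 4)
print(w.shape)  # (4, 16, 2)
```

Self-attention is then computed within each window independently, turning the quadratic cost over all tokens into a quadratic cost over a small fixed window.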
Masked Autoencoders (2022) by He et al.: Leverages self-supervised learning by masking portions of input images during training, encouraging the model to reconstruct the missing content. This approach minimizes the need for extensive labeled datasets while enhancing training efficiency.
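The core masking step can be sketched as follows; the mask ratio and shapes are illustrative, and in the actual MAE the encoder sees only the visible patches while a lightweight decoder reconstructs the masked ones:

```python
import numpy as np

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset of patches; the rest become reconstruction targets."""
    N = patches.shape[0]
    n_keep = int(N * (1 - mask_ratio))
    perm = rng.permutation(N)
    keep_idx = np.sort(perm[:n_keep])   # visible patches fed to the encoder
    mask_idx = np.sort(perm[n_keep:])   # masked patches the decoder must predict
    return patches[keep_idx], keep_idx, mask_idx

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 192))             # 16 flattened patches
visible, keep_idx, mask_idx = random_masking(patches, 0.75, rng)
print(visible.shape)  # (4, 192) -- only 25% of patches are encoded
```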
DINO (2021) by Caron et al.: Implements self-supervised learning via a Momentum Teacher Network and multi-crop training. This process helps a student network learn robust features by aligning predictions from augmented views generated by the teacher network.
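The momentum teacher at the heart of DINO is an exponential moving average (EMA) of the student's weights; a toy sketch of that update (the momentum value here is illustrative):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Momentum (EMA) update: the teacher slowly tracks the student's weights,
    so its targets change smoothly across training steps."""
    return [momentum * t + (1 - momentum) * s for t, s in zip(teacher, student)]

teacher = [np.ones((4, 4))]
student = [np.zeros((4, 4))]
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher[0][0, 0])  # 0.9
```

Because the teacher receives no gradients of its own, it provides stable targets that the student's multi-crop predictions are aligned against.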
Hierarchical and Multi-Scale Feature Extraction
TransUNet (2021) by Chen et al.: Merges the ViT framework with U-Net, combining global context modeling with spatial feature retention, making it particularly effective for medical image segmentation.
Pyramid Vision Transformer (2021) by Wang et al.: Adopts a pyramid structure to extract features across multiple scales, enhancing performance in dense predictions for high-resolution images.
SegFormer (2021) by Xie et al.: Utilizes a lightweight MLP decoder to efficiently aggregate multi-scale features, achieving superior segmentation accuracy with significant efficiency gains.
Efficiency-Driven Innovations
DeiT (2021) by Touvron et al.: Designed to make ViTs viable for smaller datasets, DeiT incorporates a distillation-based training approach using a CNN teacher model, enabling effective training with limited data.
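DeiT's objective blends the usual cross-entropy on the true label with a term matching the teacher's output. The sketch below shows the soft (KL-based) variant for simplicity; DeiT's reported default actually uses hard-label distillation through a dedicated distillation token, which is omitted here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, tau=3.0):
    """Blend cross-entropy on the true label with KL divergence to the
    teacher's temperature-softened distribution."""
    p_student = softmax(student_logits)
    ce = -np.log(p_student[label])                       # supervised term
    ps_t = softmax(student_logits / tau)
    pt_t = softmax(teacher_logits / tau)
    kl = np.sum(pt_t * (np.log(pt_t) - np.log(ps_t)))    # distillation term
    return (1 - alpha) * ce + alpha * tau**2 * kl

loss = distillation_loss(np.array([2.0, 0.5, -1.0]),
                         np.array([1.5, 1.0, -0.5]), label=0)
```

The CNN teacher's strong spatial inductive bias is thus transferred into the transformer student, which is what lets DeiT train well on ImageNet-scale data alone.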
Conclusion
Vision Transformers represent a paradigm shift in computer vision, challenging the dominance of Convolutional Neural Networks (CNNs). By leveraging self-attention to process images holistically and globally, ViTs exhibit exceptional scalability and performance, especially with large datasets and high-resolution images.
The continual evolution of ViTs and their derivatives ensures they remain at the forefront of computer vision research and application. Ongoing advancements aim to refine their efficiency, extend their reach, and unlock new possibilities across industries, further cementing their transformative impact on the field.


