Paper · Computer Vision · v1.0

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

by Google Brain · free · Last verified 2026-03-17

Introduces the Vision Transformer (ViT), showing that a pure transformer applied directly to sequences of fixed-size image patches matches or exceeds state-of-the-art convolutional networks on image classification when pretrained on large datasets such as ImageNet-21k or JFT-300M. The paper challenged the dominance of convolutional neural networks in computer vision.
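To make the "sequence of image patches" idea concrete, here is a minimal NumPy sketch of the patch-splitting step only. The function name `image_to_patch_sequence` and the 224x224 RGB input are illustrative assumptions, not code from the paper. The sketch also leaves out the learned linear projection, the class token, and the position embeddings that the full model applies before the transformer encoder.

```python
import numpy as np


def image_to_patch_sequence(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C).
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Expose the patch grid: (rows, P, cols, P, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes together: (rows, cols, P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector, in row-major patch order
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Assumed example input: a 224x224 RGB image filled with placeholder values
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patch_sequence(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

With 16x16 patches, a 224x224 image yields 14 x 14 = 196 patches, each a vector of 16 x 16 x 3 = 768 values. This is the sense in which an image becomes a sequence of "words" for the transformer.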

https://arxiv.org/abs/2010.11929

C+ (Average)

Adoption: A+ · Quality: A+ · Freshness: B+ · Citations: F · Engagement: F

Specifications

License: Open Access
Pricing: free
Capabilities: image-classification, feature-extraction, transfer-learning
Integrations: PyTorch (via timm), TensorFlow/Keras, Hugging Face Transformers
Use Cases: image-classification, vision-pretraining, feature-extraction
API Available: No
Tags: vision-transformer, image-classification, attention, self-supervised, pretraining
Added: 2026-03-17
Completeness: 100%
