Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Research · free · Last verified 2026-03-17
Introduces Switch Transformers, which simplify Mixture-of-Experts (MoE) routing by selecting a single expert per token (top-1), enabling stable training of trillion-parameter models built on T5 with up to a 7× pre-training speedup over the dense baseline. Demonstrates that parameter count and per-token compute can be decoupled through sparsity.
https://arxiv.org/abs/2101.03961
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: C+ · Citations: B+ · Engagement: F
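To make the top-1 routing idea concrete, here is a minimal sketch of a switch routing layer in PyTorch. It is illustrative only; the names (`SwitchRouter`, `gate`, `num_experts`) are assumptions for this sketch and do not come from the paper's released code. A single linear gate produces one logit per expert, and each token is dispatched to the highest-probability expert, so per-token compute stays fixed while total parameters grow with the expert count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchRouter(nn.Module):
    """Hypothetical top-1 ("switch") router: one linear gate, one expert per token."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router weights

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)       # [num_tokens, num_experts]
        gate_value, expert_index = probs.max(dim=-1)  # top-1 selection per token
        return gate_value, expert_index

# Toy usage: route 8 tokens of width 16 across 4 experts.
router = SwitchRouter(d_model=16, num_experts=4)
tokens = torch.randn(8, 16)
gate, idx = router(tokens)
# Each token t would be processed only by expert idx[t], with the expert's
# output scaled by gate[t]; adding experts grows parameters, not per-token FLOPs.
```

Scaling the selected expert's output by the gate probability is what keeps the router differentiable even though only one expert fires per token.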
Specifications
- License: Apache 2.0
- Pricing: free
- Capabilities: sparse-computation, trillion-parameter-training, efficient-scaling
- Integrations: huggingface-transformers
- Use Cases: large-scale-pretraining, multi-task-learning, efficient-model-scaling
- API Available: No
- Tags: switch-transformer, mixture-of-experts, trillion-parameters, sparse, t5, scaling
- Added: 2026-03-17
- Completeness: 100%
Index Score: 66.1
- Adoption: 75
- Quality: 88
- Freshness: 50
- Citations: 74
- Engagement: 0