FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
by Stanford University · free · Last verified 2026-03-17
Introduces FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. Achieves 2-4× speedup over standard attention and enables training on much longer sequences.
https://arxiv.org/abs/2205.14135
Grade: B+ (Good)
Adoption: A+ · Quality: A+ · Freshness: B · Citations: A · Engagement: F
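The core idea summarized above can be sketched in a few lines: process attention over key/value tiles with an online softmax, so the full N×N score matrix is never materialized in slow memory. The snippet below is an illustrative pure-PyTorch rendering of that tiling idea with assumed shapes and block size; it is not the paper's fused CUDA kernel, which additionally tiles over queries and manages the HBM/SRAM hierarchy explicitly.

```python
# Illustrative sketch of blockwise attention with an online softmax.
# Assumptions: 2-D (seq_len, head_dim) inputs, float32, block_size divides seq_len.
import torch

def tiled_attention(q, k, v, block_size=64):
    """Exact softmax attention computed tile-by-tile over keys/values,
    keeping only a running row-wise max and normalizer instead of the
    full (n, n) score matrix."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)

    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]       # one key tile
        v_blk = v[start:start + block_size]       # matching value tile
        scores = (q @ k_blk.T) * scale            # (n, block) partial scores only

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        # Rescale previously accumulated output and normalizer to the new max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)           # numerically stable partial softmax
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

# Quick check against the naive reference implementation.
if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(256, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) * (64 ** -0.5), dim=-1) @ v
    assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5)
```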
Specifications
- License: BSD-3-Clause
- Pricing: free
- Capabilities: memory-efficient-attention, long-context-training, cuda-optimization
- Integrations: pytorch, huggingface-transformers, vllm (usage sketch after this list)
- Use Cases: long-context-training, inference-optimization, gpu-memory-reduction
- API Available: No
- Tags: flash-attention, io-aware, memory-efficient, attention, cuda, hardware-optimization
- Added: 2026-03-17
- Completeness: 100%
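For the PyTorch integration listed above, a minimal usage sketch follows. It assumes PyTorch 2.x, where torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel on supported GPUs; the shapes, dtype, and device handling are illustrative assumptions, not a definitive integration guide.

```python
# Minimal usage sketch, assuming PyTorch >= 2.0.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 2048, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch picks the fastest available backend (flash, memory-efficient, or math)
# for these inputs; with a fused kernel, no (seq_len, seq_len) attention matrix
# is materialized in GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, heads, seq_len, head_dim)
```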
Index Score: 75.7
- Adoption: 90
- Quality: 96
- Freshness: 68
- Citations: 82
- Engagement: 0