FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
by Princeton University / Together AI · free · Last verified 2026-03-17
Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving roughly a 2× speedup over FlashAttention and about 9× over standard attention. Enables efficient training of models with context lengths up to 256K tokens.
https://arxiv.org/abs/2307.08691
Overall grade: B+ (Good)
Adoption: A · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
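Below is a minimal PyTorch sketch of the tiled, online-softmax attention that FlashAttention-2 implements in fused CUDA kernels. The block sizes, function name, and the sequential Python loops are illustrative assumptions only; the actual kernel keeps each tile in SRAM and parallelizes the outer query-block loop across thread blocks and warps (the improved work partitioning described above), rather than running it sequentially.

```python
import torch

def tiled_attention(q, k, v, block_q=128, block_k=128):
    """q, k, v: (batch, heads, seqlen, head_dim).
    Computes softmax(q @ k^T / sqrt(d)) @ v without materializing
    the full (seqlen x seqlen) score matrix."""
    b, h, n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    for qs in range(0, n, block_q):
        qi = q[:, :, qs:qs + block_q] * scale            # current query tile
        # running statistics for the online softmax over this query tile
        m = torch.full((b, h, qi.shape[2], 1), float("-inf"),
                       device=q.device, dtype=q.dtype)   # running row max
        l = torch.zeros_like(m)                          # running denominator
        acc = torch.zeros_like(qi)                       # running numerator @ V
        for ks in range(0, n, block_k):
            kj = k[:, :, ks:ks + block_k]
            vj = v[:, :, ks:ks + block_k]
            s = qi @ kj.transpose(-1, -2)                # score tile
            m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
            p = torch.exp(s - m_new)                     # rescaled probabilities
            correction = torch.exp(m - m_new)            # rescale old statistics
            l = l * correction + p.sum(dim=-1, keepdim=True)
            acc = acc * correction + p @ vj
            m = m_new
        out[:, :, qs:qs + block_q] = acc / l
    return out

if __name__ == "__main__":
    q, k, v = (torch.randn(2, 4, 512, 64) for _ in range(3))
    ref = torch.softmax(q @ k.transpose(-1, -2) * 64 ** -0.5, dim=-1) @ v
    assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

The output matches standard attention up to floating-point error; the memory savings come from never storing the full score matrix, which is what makes the long-context training mentioned above feasible.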
Specifications
- License
- BSD-3-Clause
- Pricing
- free
- Capabilities
- memory-efficient-attention, long-context-training, gpu-parallelism
- Integrations
- pytorch, huggingface-transformers, vllm, flash-attn-package (usage sketch after this list)
- Use Cases
- long-context-training, inference-optimization, multi-head-attention
- API Available
- No
- Tags
- flash-attention-2, attention, parallelism, cuda, performance, gpu
- Added
- 2026-03-17
- Completeness
- 100%
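A hedged usage sketch for the integrations listed above, assuming the flash-attn package's flash_attn_func interface and the Transformers attn_implementation flag as commonly documented; the tensor shapes, checkpoint name, and arguments are illustrative assumptions, not part of this listing.

```python
# Illustrative only: assumes the flash-attn package and a recent
# transformers release that supports attn_implementation="flash_attention_2".
import torch
from flash_attn import flash_attn_func

# q, k, v: (batch, seqlen, nheads, head_dim), fp16/bf16, on a CUDA device.
q = torch.randn(1, 4096, 16, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)  # fused FlashAttention-2 kernel

# Or route a Hugging Face model's attention through FlashAttention-2:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # example checkpoint; any FA2-supported model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```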
Index Score: 72.2
- Adoption: 88
- Quality: 95
- Freshness: 76
- Citations: 72
- Engagement: 0