Paper · infrastructure · v2.0

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

by Princeton University / Together AI · free · Last verified 2026-03-17

Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving roughly a 2× speedup over FlashAttention and a ~9× speedup over standard attention. This enables efficient training of models with context lengths up to 256K tokens.

https://arxiv.org/abs/2307.08691
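
As a quick orientation, here is a minimal sketch of how the FlashAttention-2 kernel is typically invoked through the flash-attn package's flash_attn_func; the tensor sizes and dtype choice below are illustrative assumptions, and a CUDA GPU with fp16/bf16 support is required.

    # Minimal usage sketch (assumption: flash-attn 2.x installed, CUDA GPU available).
    import torch
    from flash_attn import flash_attn_func

    # flash_attn_func expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on CUDA.
    batch, seqlen, nheads, headdim = 2, 4096, 16, 64  # illustrative sizes
    q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # causal=True applies the standard autoregressive (lower-triangular) mask.
    out = flash_attn_func(q, k, v, causal=True)  # -> (batch, seqlen, nheads, headdim)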
Overall Grade: B+ (Good)
Adoption: A · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License: BSD-3-Clause
Pricing: free
Capabilities: memory-efficient-attention, long-context-training, gpu-parallelism
Integrations: pytorch, huggingface-transformers, vllm, flash-attn-package (loading sketch below)
Use Cases: long-context-training, inference-optimization, multi-head-attention
API Available: No
Tags: flash-attention-2, attention, parallelism, cuda, performance, gpu
Added: 2026-03-17
Completeness: 100%
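
For the Hugging Face Transformers integration listed above, the sketch below shows how a model is typically loaded with the FlashAttention-2 backend; the model name is an illustrative assumption, and the attn_implementation flag requires a recent Transformers version plus an installed flash-attn package.

    import torch
    from transformers import AutoModelForCausalLM

    # Illustrative model choice; any architecture with FlashAttention-2 support works.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16/bf16
        attn_implementation="flash_attention_2",  # select the flash-attn backend
    ).to("cuda")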

Index Score

Overall: 72.2
Adoption: 88
Quality: 95
Freshness: 76
Citations: 72
Engagement: 0
