Paper · infrastructure · v2.0

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

by Princeton University / Together AI · free · Last verified 2026-03-17

Extends FlashAttention with better work partitioning across GPU thread blocks and warps, achieving roughly a 2× speedup over FlashAttention and a ~9× speedup over a standard attention implementation. Enables efficient training of models with context lengths up to 256K tokens.
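
The memory-efficient idea behind this family of kernels can be illustrated with a small pure-Python sketch: attention is computed one key/value block at a time with a running softmax maximum and sum, so the full score matrix is never stored. This is only an illustration under stated assumptions, not the paper's CUDA kernel. The function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, and the GPU thread-block and warp partitioning described above is not modeled.

```python
import math


def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, holding every score for a query row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, visiting keys/values block by block (hypothetical helper)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator relative to running_max
        acc = [0.0] * dv          # unnormalized output accumulator
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [
                a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                for c, a in enumerate(acc)
            ]
            running_max = new_max
        # Normalize once, after the last block, instead of after every block.
        out.append([a / running_sum for a in acc])
    return out
```

Both functions should agree up to floating-point rounding for any `block_size`; the tiled version only ever holds one block of scores per query row, which is what keeps memory use from growing with the square of the sequence length.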

https://arxiv.org/abs/2307.08691
C+ (Average)
Adoption: A · Quality: A+ · Freshness: B+ · Citations: F · Engagement: F

Specifications

License: BSD-3-Clause
Pricing: free
Capabilities: memory-efficient-attention, long-context-training, gpu-parallelism
Integrations: pytorch, huggingface-transformers, vllm, flash-attn-package
Use Cases: long-context-training, inference-optimization, multi-head-attention
API Available: No
Tags: flash-attention-2, attention, parallelism, cuda, performance, gpu
Added: 2026-03-17
Completeness: 100%

Index Score

54
Adoption: 88
Quality: 95
Freshness: 76
Citations: 0
Engagement: 0
