Paper · infrastructure · v2.0

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

by Princeton University / Together AI · free · Last verified 2026-03-17

Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving roughly a 2× speedup over FlashAttention and a ~9× speedup over standard attention. This enables efficient training of models with context lengths up to 256K tokens.

https://arxiv.org/abs/2307.08691
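
As a quick orientation, here is a minimal sketch of how the FlashAttention-2 kernel is typically invoked through the flash-attn package's flash_attn_func; the tensor sizes and dtype choice below are illustrative assumptions, and a CUDA GPU with fp16/bf16 support is required.

    # Minimal usage sketch (assumption: flash-attn 2.x installed, CUDA GPU available).
    import torch
    from flash_attn import flash_attn_func

    # flash_attn_func expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on CUDA.
    batch, seqlen, nheads, headdim = 2, 4096, 16, 64  # illustrative sizes
    q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # causal=True applies the standard autoregressive (lower-triangular) mask.
    out = flash_attn_func(q, k, v, causal=True)  # -> (batch, seqlen, nheads, headdim)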
Overall Grade: B+ (Good)
Adoption: A · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License: BSD-3-Clause
Pricing: free
Capabilities: memory-efficient-attention, long-context-training, gpu-parallelism
Integrations: pytorch, huggingface-transformers, vllm, flash-attn-package (loading sketch below)
Use Cases: long-context-training, inference-optimization, multi-head-attention
API Available: No
Tags: flash-attention-2, attention, parallelism, cuda, performance, gpu
Added: 2026-03-17
Completeness: 100%
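
For the Hugging Face Transformers integration listed above, the sketch below shows how a model is typically loaded with the FlashAttention-2 backend; the model name is an illustrative assumption, and the attn_implementation flag requires a recent Transformers version plus an installed flash-attn package.

    import torch
    from transformers import AutoModelForCausalLM

    # Illustrative model choice; any architecture with FlashAttention-2 support works.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16/bf16
        attn_implementation="flash_attention_2",  # select the flash-attn backend
    ).to("cuda")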

Index Score

Overall: 72.2
Adoption: 88
Quality: 95
Freshness: 76
Citations: 72
Engagement: 0
