Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
by Google Brain · free · Last verified 2026-03-17
Introduces Multi-Query Attention (MQA), an efficient attention mechanism for autoregressive decoding. By sharing a single key and value head across all query heads, MQA drastically reduces the size of the KV cache, yielding significant memory-bandwidth savings and faster inference with minimal impact on model quality.
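As a minimal sketch of the mechanism (illustrative only, not the paper's reference implementation; all shapes and names below are assumptions), multi-query attention in PyTorch looks like:

```python
import torch
import torch.nn.functional as F

def multi_query_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim),
    # i.e. a single K/V head that broadcasts across all query heads.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (batch, n_heads, seq, seq)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)  # (batch, n_heads, seq, head_dim)

# Toy shapes: 8 query heads share one key head and one value head,
# so the KV cache holds 1 head's worth of K and V instead of 8.
b, h, s, d = 2, 8, 16, 64
q = torch.randn(b, h, s, d)
k = torch.randn(b, 1, s, d)
v = torch.randn(b, 1, s, d)
print(multi_query_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```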
https://arxiv.org/abs/1911.02150
Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: C · Citations: B · Engagement: F
Specifications
- License: Open Access
- Pricing: free
- Capabilities: Reduced KV Cache Size, Faster Autoregressive Decoding, Lower Inference Latency, Increased Throughput for LLM Services, Reduced Memory Bandwidth Consumption, Enables Larger Batch Sizes During Inference, Facilitates LLM Deployment on Memory-Constrained Devices (see the cache-size sketch after this list)
- Integrations: PyTorch, TensorFlow, JAX, vLLM, TensorRT-LLM, Hugging Face Transformers
- API Available: No
- Tags: multi-query-attention, mqa, inference-speed, kv-cache, decoding, llm-inference, model-optimization, attention-mechanism, transformer-architecture, memory-bandwidth, grouped-query-attention
- Added: 2026-03-17
- Completeness: 0.8%
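To make the "Reduced KV Cache Size" and memory-bandwidth claims concrete, a back-of-the-envelope comparison (the model dimensions here are hypothetical, chosen only for illustration):

```python
# Back-of-the-envelope KV cache sizes; model dimensions are hypothetical.
layers, n_heads, head_dim, seq_len, batch = 32, 32, 128, 2048, 8
bytes_per_elem = 2  # fp16

def kv_cache_bytes(kv_heads):
    # factor of 2 for keys plus values
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

mha = kv_cache_bytes(n_heads)  # standard MHA: one K/V head per query head
mqa = kv_cache_bytes(1)        # MQA: a single shared K/V head
print(f"MHA: {mha / 2**30:.2f} GiB  MQA: {mqa / 2**30:.2f} GiB  ({n_heads}x smaller)")
# MHA: 8.00 GiB  MQA: 0.25 GiB  (32x smaller)
```

The cache shrinks by the number of attention heads, which is what frees memory bandwidth during decoding and allows larger batch sizes.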
Index Score: 63.2
Adoption: 75 · Quality: 85 · Freshness: 45 · Citations: 65 · Engagement: 0