
Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)

by Google Brain · free · Last verified 2026-03-17

Introduces Multi-Query Attention (MQA), an efficient attention mechanism for fast autoregressive decoding. By sharing a single key head and a single value head across all query heads, MQA shrinks the KV cache by a factor equal to the number of attention heads. This cuts memory-bandwidth consumption during incremental decoding and speeds up inference with minimal loss in model quality.
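
A minimal PyTorch sketch of the idea (illustrative module and variable names, not the paper's reference code): queries keep one projection per head, while keys and values use a single shared head that is broadcast across all query heads, so only head_dim-sized K and V vectors need to be cached per token per layer.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiQueryAttention(nn.Module):
    """Multi-query attention: h query heads, 1 shared key/value head."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)        # one projection per query head
        self.k_proj = nn.Linear(d_model, self.head_dim)  # single shared key head
        self.v_proj = nn.Linear(d_model, self.head_dim)  # single shared value head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (b, h, t, d) queries; (b, 1, t, d) keys/values broadcast over all heads
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).unsqueeze(1)
        v = self.v_proj(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (b, h, t, t)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v                        # (b, h, t, d)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))
```

At decode time only k and v are appended to the cache, so the per-token cache cost per layer drops from 2 * n_heads * head_dim values to 2 * head_dim.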

https://arxiv.org/abs/1911.02150
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: C · Citations: B · Engagement: F

Specifications

License
Open Access
Pricing
free
Capabilities
Reduced KV Cache Size, Faster Autoregressive Decoding, Lower Inference Latency, Increased Throughput for LLM Services, Reduced Memory Bandwidth Consumption, Enables Larger Batch Sizes During Inference, Facilitates LLM Deployment on Memory-Constrained Devices
Integrations
PyTorch, TensorFlow, JAX, vLLM, TensorRT-LLM, Hugging Face Transformers
API Available
No
Tags
multi-query-attention, mqa, inference-speed, kv-cache, decoding, llm-inference, model-optimization, attention-mechanism, transformer-architecture, memory-bandwidth, grouped-query-attention
Added
2026-03-17
Completeness
0.8%
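
As a rough illustration of the "Reduced KV Cache Size" and "Enables Larger Batch Sizes" capabilities listed above, the sketch below compares fp16 KV-cache footprints for a hypothetical 32-layer, 32-head, head-dim-128 model (assumed configuration, not figures from the paper):

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 covers both keys and values; fp16 = 2 bytes per element.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
mqa = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32, n_kv_heads=1, head_dim=128)
print(f"MHA: {mha / 2**30:.1f} GiB, MQA: {mqa / 2**30:.2f} GiB")
# MHA: 16.0 GiB, MQA: 0.50 GiB -- a 32x reduction, freeing memory for larger batches
```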

Index Score

63.2
Adoption
75
Quality
85
Freshness
45
Citations
65
Engagement
0
