Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
by Google Brain · free · Last verified 2026-03-17
Introduces Multi-Query Attention (MQA), an efficient attention mechanism for autoregressive decoding. By sharing a single key and value head across all query heads, MQA drastically reduces the size of the KV cache, yielding significant memory-bandwidth savings and faster inference with minimal impact on model quality.
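As a minimal sketch of the mechanism (illustrative only, not the paper's reference implementation; all shapes and names below are assumptions), multi-query attention in PyTorch looks like:

```python
import torch
import torch.nn.functional as F

def multi_query_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim),
    # i.e. a single K/V head that broadcasts across all query heads.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (batch, n_heads, seq, seq)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)  # (batch, n_heads, seq, head_dim)

# Toy shapes: 8 query heads share one key head and one value head,
# so the KV cache holds 1 head's worth of K and V instead of 8.
b, h, s, d = 2, 8, 16, 64
q = torch.randn(b, h, s, d)
k = torch.randn(b, 1, s, d)
v = torch.randn(b, 1, s, d)
print(multi_query_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```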
https://arxiv.org/abs/1911.02150
Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: C · Citations: B · Engagement: F
Specifications
- License: Open Access
- Pricing: free
- Capabilities: Reduced KV Cache Size, Faster Autoregressive Decoding, Lower Inference Latency, Increased Throughput for LLM Services, Reduced Memory Bandwidth Consumption, Enables Larger Batch Sizes During Inference, Facilitates LLM Deployment on Memory-Constrained Devices (see the cache-size sketch after this list)
- Integrations: PyTorch, TensorFlow, JAX, vLLM, TensorRT-LLM, Hugging Face Transformers
- API Available: No
- Tags: multi-query-attention, mqa, inference-speed, kv-cache, decoding, llm-inference, model-optimization, attention-mechanism, transformer-architecture, memory-bandwidth, grouped-query-attention
- Added: 2026-03-17
- Completeness: 0.8%
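To make the "Reduced KV Cache Size" and memory-bandwidth claims concrete, a back-of-the-envelope comparison (the model dimensions here are hypothetical, chosen only for illustration):

```python
# Back-of-the-envelope KV cache sizes; model dimensions are hypothetical.
layers, n_heads, head_dim, seq_len, batch = 32, 32, 128, 2048, 8
bytes_per_elem = 2  # fp16

def kv_cache_bytes(kv_heads):
    # factor of 2 for keys plus values
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

mha = kv_cache_bytes(n_heads)  # standard MHA: one K/V head per query head
mqa = kv_cache_bytes(1)        # MQA: a single shared K/V head
print(f"MHA: {mha / 2**30:.2f} GiB  MQA: {mqa / 2**30:.2f} GiB  ({n_heads}x smaller)")
# MHA: 8.00 GiB  MQA: 0.25 GiB  (32x smaller)
```

The cache shrinks by the number of attention heads, which is what frees memory bandwidth during decoding and allows larger batch sizes.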
Index Score: 63.2
Adoption: 75 · Quality: 85 · Freshness: 45 · Citations: 65 · Engagement: 0