GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
by Google Research · free · Last verified 2026-03-17
Introduces Grouped-Query Attention (GQA), an efficient attention mechanism that generalizes Multi-Head and Multi-Query Attention. GQA partitions the query heads into groups that each share a single key and value head, sharply reducing KV cache size and memory bandwidth during decoding, which speeds up inference while preserving near Multi-Head quality.
https://arxiv.org/abs/2305.13245
Overall: B (Above Average)
Adoption: A · Quality: A · Freshness: B+ · Citations: B · Engagement: F
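To make the grouping concrete, here is a minimal sketch of the mechanism in plain PyTorch. It is not the paper's implementation; the function name, shapes, and head counts are illustrative assumptions.

```python
# Minimal GQA sketch: each group of query heads shares one K/V head.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, head_dim)
    # k, v: (batch, n_kv_heads, seq, head_dim), with n_q_heads % n_kv_heads == 0
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads
    # Broadcast each shared K/V head across its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# n_kv_heads = n_q_heads recovers Multi-Head Attention; n_kv_heads = 1 recovers Multi-Query.
q = torch.randn(2, 8, 16, 64)  # 8 query heads
k = torch.randn(2, 2, 16, 64)  # 2 shared key heads -> groups of 4
v = torch.randn(2, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # shape (2, 8, 16, 64)
```

Only K and V are cached during autoregressive decoding, so the KV cache shrinks by a factor of n_q_heads / n_kv_heads relative to Multi-Head Attention (4x for the toy shapes above).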
Specifications
- License
- Open Access
- Pricing
- free
- Capabilities
- Grouped-Query Attention (GQA), Reduced KV Cache Size, Faster Inference Throughput, Reduced Memory Bandwidth During Decoding, Maintains High Model Quality, Up-training from Multi-Head Attention Checkpoints (see the conversion sketch after this list), Scalable Attention Mechanism, Optimized Autoregressive Decoding
- API Available
- No
- Tags
- grouped-query-attention, gqa, multi-query-attention, inference-speed, kv-cache, llm-optimization, attention-mechanism, transformer-architecture, memory-efficiency, autoregressive-decoding, llama-2, mistral
- Added
- 2026-03-17
- Completeness
- 1%
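The up-training capability noted above refers to the paper's recipe for converting an existing Multi-Head checkpoint: mean-pool the key and value projection heads within each group to form the shared head, then briefly continue pre-training. A rough sketch, assuming heads are laid out consecutively in the projection matrix (hypothetical helper, not the authors' code):

```python
# Convert an MHA key/value projection into a GQA one by mean-pooling groups of heads.
import torch

def mean_pool_kv_heads(w_kv, n_q_heads, n_kv_heads, head_dim):
    # w_kv: MHA key or value projection, shape (n_q_heads * head_dim, d_model),
    # assuming consecutive rows belong to consecutive heads.
    d_model = w_kv.shape[1]
    group_size = n_q_heads // n_kv_heads
    w = w_kv.view(n_kv_heads, group_size, head_dim, d_model)
    # Average the heads inside each group into one shared K/V head.
    return w.mean(dim=1).reshape(n_kv_heads * head_dim, d_model)

w_k = torch.randn(8 * 64, 512)               # 8-head MHA key projection
w_k_gqa = mean_pool_kv_heads(w_k, 8, 2, 64)  # -> (2 * 64, 512) for 2 KV groups
```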
Index Score: 67.4
- Adoption: 82
- Quality: 88
- Freshness: 72
- Citations: 68
- Engagement: 0