Efficient Memory Management for Large Language Model Serving with PagedAttention
by UC Berkeley · open-source · Last verified 2026-03-17
Introduces PagedAttention and the vLLM serving system. vLLM manages the KV cache in non-contiguous physical memory blocks, inspired by OS virtual-memory paging, enabling near-zero memory waste and efficient sharing of KV cache across requests (a rough block-table sketch follows below). vLLM achieves 2-4x higher throughput than HuggingFace Transformers and 1.7x over Orca.
https://arxiv.org/abs/2309.06180
Grade: B+ (Good)
Adoption: A+ · Quality: A+ · Freshness: A · Citations: A · Engagement: F
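Below is a minimal Python sketch of the paged KV-cache idea summarized above: a per-request block table maps logical KV blocks to non-contiguous physical blocks, shared blocks are reference-counted, and writes to a shared block trigger copy-on-write. All names (PagedKVCache, append_token, fork) are illustrative and are not vLLM's internal API.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalBlock:
    """One fixed-size slab holding K/V entries for up to block_size tokens."""
    block_id: int
    ref_count: int = 0                          # how many requests map to this block
    tokens: list = field(default_factory=list)  # stand-in for the real K/V tensors


class PagedKVCache:
    """Maps each request's logical KV blocks to non-contiguous physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = [PhysicalBlock(i) for i in range(num_blocks)]
        # Per-request block table: logical block index -> physical block.
        self.block_tables = {}

    def _take_free_block(self) -> PhysicalBlock:
        # No preemption or swapping in this sketch: raises if memory is exhausted.
        block = self.free_blocks.pop()
        block.ref_count = 1
        return block

    def allocate(self, request_id: str) -> None:
        self.block_tables[request_id] = []

    def append_token(self, request_id: str, token: int) -> None:
        table = self.block_tables[request_id]
        if not table or len(table[-1].tokens) == self.block_size:
            # Last logical block is full: internal waste is at most one partial block.
            table.append(self._take_free_block())
        elif table[-1].ref_count > 1:
            # Copy-on-write: the last block is shared with another request,
            # so copy its contents into a fresh block before writing.
            shared = table[-1]
            shared.ref_count -= 1
            fresh = self._take_free_block()
            fresh.tokens = list(shared.tokens)
            table[-1] = fresh
        table[-1].tokens.append(token)

    def fork(self, parent_id: str, child_id: str) -> None:
        # Share the parent's blocks (e.g. a common prompt prefix) by copying
        # only the block table and bumping reference counts, not the K/V data.
        parent_table = self.block_tables[parent_id]
        for block in parent_table:
            block.ref_count += 1
        self.block_tables[child_id] = list(parent_table)

    def free(self, request_id: str) -> None:
        for block in self.block_tables.pop(request_id):
            block.ref_count -= 1
            if block.ref_count == 0:
                block.tokens = []
                self.free_blocks.append(block)
```

Because physical blocks need not be contiguous, fragmentation is bounded to one partially filled block per request, and forked sequences (e.g. parallel sampling from the same prompt) share prompt blocks until a write forces a copy.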
Specifications
- License: Apache 2.0
- Pricing: open-source
- Capabilities: high-throughput-inference, kv-cache-management, continuous-batching, memory-efficiency
- Integrations: vllm, ray, huggingface
- Use Cases: llm-serving, production-inference, batch-inference
- API Available: Yes (see the usage sketch after this list)
- Tags: paged-attention, vllm, inference, memory-management, kv-cache
- Added: 2026-03-17
- Completeness: 100%
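Since the entry lists an available API, here is a minimal offline-inference sketch following vLLM's published quickstart; the model name and sampling settings are placeholders, not recommendations.

```python
from vllm import LLM, SamplingParams

prompts = ["Explain paged virtual memory in one sentence."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")   # any HuggingFace-compatible model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For production serving, vLLM also ships an OpenAI-compatible HTTP server in addition to this offline API.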
Index Score: 77.7
- Adoption: 93
- Quality: 96
- Freshness: 83
- Citations: 85
- Engagement: 0