
Efficient Memory Management for Large Language Model Serving with PagedAttention

by UC Berkeley · open-source · Last verified 2026-03-17

Introduced PagedAttention and the vLLM serving system, which manage the KV cache in non-contiguous, fixed-size physical memory blocks inspired by OS paging, enabling near-zero memory waste and efficient sharing of the KV cache across requests. vLLM achieves 2-4x higher throughput than HuggingFace Transformers and 1.7x higher throughput than Orca.

https://arxiv.org/abs/2309.06180
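
As a rough illustration of the OS-paging analogy (a minimal sketch only, not vLLM's actual implementation; all names are hypothetical), a paged KV cache keeps a per-request block table that maps logical token positions to fixed-size physical blocks, allocating blocks on demand and returning them to a shared pool when a request finishes:

BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

class PagedKVCache:
    """Toy block-table allocator sketch; not vLLM's implementation."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared pool of physical block ids
        self.block_tables = {}                      # request id -> list of physical block ids
        self.num_tokens = {}                        # request id -> tokens cached so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % BLOCK_SIZE == 0:                  # last block is full (or no block yet)
            table.append(self.free_blocks.pop())     # allocate exactly one block on demand
        self.num_tokens[request_id] = count + 1

    def release(self, request_id):
        # A finished request's blocks return to the pool for reuse by other requests.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)
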
Overall Grade: B+ (Good)
Adoption: A+ · Quality: A+ · Freshness: A · Citations: A · Engagement: F

Specifications

License: Apache 2.0
Pricing: open-source
Capabilities: high-throughput-inference, kv-cache-management, continuous-batching, memory-efficiency
Integrations: vllm, ray, huggingface
Use Cases: llm-serving, production-inference, batch-inference
API Available: Yes
Tags: paged-attention, vllm, inference, memory-management, kv-cache
Added: 2026-03-17
Completeness: 100%
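
Since the entry lists vllm among the integrations with an available API, a minimal offline-inference sketch of the open-source library may help; the model name below is an arbitrary placeholder:

from vllm import LLM, SamplingParams

# Offline batched generation; PagedAttention manages the KV cache internally.
llm = LLM(model="facebook/opt-125m")  # placeholder model, swap for any supported checkpoint
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
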

Index Score

Overall: 77.7
Adoption: 93
Quality: 96
Freshness: 83
Citations: 85
Engagement: 0
