Efficient Memory Management for Large Language Model Serving with PagedAttention
by UC Berkeley · open-source · Last verified 2026-03-17
Introduces PagedAttention and the vLLM serving system. vLLM manages the KV cache in non-contiguous physical memory blocks, inspired by OS virtual-memory paging, enabling near-zero memory waste and efficient sharing of KV cache across requests (a rough block-table sketch follows below). vLLM achieves 2-4x higher throughput than HuggingFace Transformers and 1.7x over Orca.
https://arxiv.org/abs/2309.06180
Grade: B+ (Good)
Adoption: A+ · Quality: A+ · Freshness: A · Citations: A · Engagement: F
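Below is a minimal Python sketch of the paged KV-cache idea summarized above: a per-request block table maps logical KV blocks to non-contiguous physical blocks, shared blocks are reference-counted, and writes to a shared block trigger copy-on-write. All names (PagedKVCache, append_token, fork) are illustrative and are not vLLM's internal API.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalBlock:
    """One fixed-size slab holding K/V entries for up to block_size tokens."""
    block_id: int
    ref_count: int = 0                          # how many requests map to this block
    tokens: list = field(default_factory=list)  # stand-in for the real K/V tensors


class PagedKVCache:
    """Maps each request's logical KV blocks to non-contiguous physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = [PhysicalBlock(i) for i in range(num_blocks)]
        # Per-request block table: logical block index -> physical block.
        self.block_tables = {}

    def _take_free_block(self) -> PhysicalBlock:
        # No preemption or swapping in this sketch: raises if memory is exhausted.
        block = self.free_blocks.pop()
        block.ref_count = 1
        return block

    def allocate(self, request_id: str) -> None:
        self.block_tables[request_id] = []

    def append_token(self, request_id: str, token: int) -> None:
        table = self.block_tables[request_id]
        if not table or len(table[-1].tokens) == self.block_size:
            # Last logical block is full: internal waste is at most one partial block.
            table.append(self._take_free_block())
        elif table[-1].ref_count > 1:
            # Copy-on-write: the last block is shared with another request,
            # so copy its contents into a fresh block before writing.
            shared = table[-1]
            shared.ref_count -= 1
            fresh = self._take_free_block()
            fresh.tokens = list(shared.tokens)
            table[-1] = fresh
        table[-1].tokens.append(token)

    def fork(self, parent_id: str, child_id: str) -> None:
        # Share the parent's blocks (e.g. a common prompt prefix) by copying
        # only the block table and bumping reference counts, not the K/V data.
        parent_table = self.block_tables[parent_id]
        for block in parent_table:
            block.ref_count += 1
        self.block_tables[child_id] = list(parent_table)

    def free(self, request_id: str) -> None:
        for block in self.block_tables.pop(request_id):
            block.ref_count -= 1
            if block.ref_count == 0:
                block.tokens = []
                self.free_blocks.append(block)
```

Because physical blocks need not be contiguous, fragmentation is bounded to one partially filled block per request, and forked sequences (e.g. parallel sampling from the same prompt) share prompt blocks until a write forces a copy.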
Specifications
- License: Apache 2.0
- Pricing: open-source
- Capabilities: high-throughput-inference, kv-cache-management, continuous-batching, memory-efficiency
- Integrations: vllm, ray, huggingface
- Use Cases: llm-serving, production-inference, batch-inference
- API Available: Yes (see the usage sketch after this list)
- Tags: paged-attention, vllm, inference, memory-management, kv-cache
- Added: 2026-03-17
- Completeness: 100%
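Since the entry lists an available API, here is a minimal offline-inference sketch following vLLM's published quickstart; the model name and sampling settings are placeholders, not recommendations.

```python
from vllm import LLM, SamplingParams

prompts = ["Explain paged virtual memory in one sentence."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")   # any HuggingFace-compatible model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For production serving, vLLM also ships an OpenAI-compatible HTTP server in addition to this offline API.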
Index Score: 77.7
- Adoption: 93
- Quality: 96
- Freshness: 83
- Citations: 85
- Engagement: 0