Introduction
You have a fine-tuned 70B model. It works. Now you need to serve it to users: quickly, reliably, and without burning money on GPU hours. This is where most teams hit a wall: naive inference with HuggingFace Transformers wastes 60–80% of KV cache memory, handles one request at a time, and leaves expensive hardware sitting idle.
vLLM is an open-source inference engine from UC Berkeley that solves this. Its core innovation, PagedAttention, borrows virtual memory concepts from operating systems to manage the KV cache, achieving 14–24x higher throughput than HuggingFace Transformers. I use it to serve models on an NVIDIA RTX Pro 6000 Blackwell (96GB VRAM) and it's become the standard for serious LLM deployment.
The Problem: GPU Memory Waste
During autoregressive generation, every token's attention keys and values (the KV cache) must live in GPU memory. For a 13B model, each token needs ~800KB of KV cache. With a 2048-token max sequence, that's up to 1.6GB per request.
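The ~800KB figure falls out of the model's dimensions. A quick sanity check, assuming a typical 13B architecture with 40 transformer layers and a hidden size of 5120 (these shape values are assumptions for illustration) in FP16:

```python
# KV cache per token = 2 (keys and values) * layers * hidden_size * bytes per value
num_layers = 40      # transformer blocks in a typical 13B model (assumed)
hidden_size = 5120   # num_heads * head_dim (assumed)
fp16_bytes = 2

kv_bytes_per_token = 2 * num_layers * hidden_size * fp16_bytes
print(kv_bytes_per_token / 1024)        # 800.0 KB per token
print(kv_bytes_per_token * 2048 / 1e9)  # ~1.68 GB for a full 2048-token sequence
```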
Traditional systems pre-allocate a contiguous block of GPU memory for each request, sized for the maximum possible sequence length. If a user sends a short prompt and gets a short reply, most of that block goes unused. Across many concurrent requests, this creates three forms of waste:
- Internal fragmentation: unused slots within each allocated block
- Reservation waste: entire blocks held for sequences that never grow
- External fragmentation: gaps between blocks that can't be reused
The result: existing systems waste 60–80% of KV cache memory. Fewer concurrent requests fit in memory, batch sizes stay small, and GPU compute sits idle.
PagedAttention: Virtual Memory for KV Cache
PagedAttention is vLLM's core innovation, published at SOSP 2023 (one of the top systems conferences). The idea is simple: apply the same techniques your OS uses for RAM management to GPU KV cache management.
Instead of one contiguous block per request, PagedAttention:
- Divides the KV cache into fixed-size blocks (typically 16 tokens per block)
- Allocates blocks on demand as tokens are generated, not upfront
- Maps logical blocks to physical blocks via a block table (like an OS page table)
- Shares physical blocks across requests with common prefixes (copy-on-write)
| OS Concept | PagedAttention Equivalent |
|---|---|
| Virtual address space | Logical KV blocks (sequential view) |
| Physical memory frames | Physical KV blocks in GPU VRAM |
| Page table | Block table |
| On-demand paging | Incremental block allocation |
| Shared memory pages | Shared blocks for common prefixes |
The only wasted memory per sequence is the last partially-filled block, at most 15 tokens. Overall KV cache waste drops from 60–80% to under 4%.
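The block-table indirection fits in a few lines of Python. This is a toy allocator, not vLLM's implementation, but it shows the key move: physical blocks come from a shared pool and are claimed only when the previous block fills up.

```python
BLOCK_SIZE = 16  # tokens per KV block, matching vLLM's typical default

class BlockTable:
    """Toy logical-to-physical block mapping, in the spirit of an OS page table."""
    def __init__(self, free_blocks):
        self.free = free_blocks  # pool of physical block ids shared by all sequences
        self.table = []          # logical block index -> physical block id

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the last one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())

free_pool = list(range(100))  # pretend VRAM holds 100 physical blocks
seq = BlockTable(free_pool)
for t in range(40):           # generate 40 tokens
    seq.append_token(t)

print(len(seq.table))  # 3 blocks (ceil(40/16)); waste is only the tail of the last one
```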
Why This Matters for GPU Utilization
With near-zero memory waste, vLLM can fit far more concurrent requests in the same GPU. Combined with continuous batching (inserting new requests into the batch as others finish, rather than waiting for a static batch), the throughput gains are massive:
| Comparison | Throughput Gain |
|---|---|
| vLLM vs. HuggingFace Transformers | 14–24x |
| vLLM vs. HuggingFace TGI | 2.2–2.5x |
| vLLM vs. FasterTransformer | 2–4x |
The advantage grows with longer sequences, larger models, and more complex decoding (beam search, parallel sampling) because those are exactly the scenarios where KV cache memory pressure is worst.
Key Features
Continuous Batching
Static batching waits for a fixed batch, pads shorter sequences, and blocks everything until the longest finishes. vLLM's continuous batching changes the batch composition at every decoding step. As soon as one sequence finishes, a new request takes its slot. No waiting, no padding.
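The scheduling idea can be sketched as a toy loop: a queue of pending requests, a fixed number of batch slots, and refills at every decoding step (a simplification, not vLLM's actual scheduler):

```python
from collections import deque

MAX_SLOTS = 4
# (request id, tokens still to generate); lengths vary as in real traffic
pending = deque([(f"req{i}", 2 + i % 3) for i in range(8)])
running = []
steps = 0

while pending or running:
    # Admit new requests the moment a slot frees up -- no waiting for a full batch.
    while pending and len(running) < MAX_SLOTS:
        running.append(list(pending.popleft()))
    # One decoding step: every running sequence emits one token.
    for seq in running:
        seq[1] -= 1
    running = [s for s in running if s[1] > 0]
    steps += 1

print(steps)  # 7 steps; two static batches of 4, each waiting for its longest member, need 8
```

The gap widens as request lengths diverge: with static batching, one long sequence holds every slot in its batch hostage.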
Tensor Parallelism
For models that don't fit on a single GPU, vLLM shards weight matrices across multiple GPUs. A 70B model in FP16 (~140GB) can run on 2x H100s (80GB each) with `--tensor-parallel-size 2`. NVLink interconnect is strongly recommended, since PCIe degrades throughput significantly.
Quantization Support
vLLM supports GPTQ, AWQ, FP8, bitsandbytes, and more. FP8 is particularly interesting on modern hardware (H100, L40S), offering 2x memory reduction with up to 1.6x throughput improvement and minimal quality loss.
OpenAI-Compatible API
vLLM serves an OpenAI-compatible API out of the box. You can point any existing OpenAI client at your vLLM server by changing the base URL. Zero code changes for migration.
Prefix Caching
Requests sharing common prefixes (e.g., the same system prompt) automatically share KV cache blocks via copy-on-write. For chat applications with repeated system prompts, this saves significant memory.
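The sharing mechanism can be illustrated with a toy content-addressed cache (a simplification of vLLM's prefix caching, not its actual code): full blocks with identical token content map to one physical block with a reference count.

```python
BLOCK_SIZE = 16

class PrefixCache:
    """Toy prefix sharing: identical full KV blocks resolve to one physical block."""
    def __init__(self):
        self.blocks = {}   # block content -> [physical_id, refcount]
        self.next_id = 0

    def allocate(self, tokens):
        physical = []
        for i in range(0, len(tokens), BLOCK_SIZE):
            chunk = tuple(tokens[i:i + BLOCK_SIZE])
            if len(chunk) == BLOCK_SIZE and chunk in self.blocks:
                entry = self.blocks[chunk]
                entry[1] += 1  # share the existing physical block
            else:
                entry = [self.next_id, 1]
                self.next_id += 1
                if len(chunk) == BLOCK_SIZE:  # only full blocks are shareable
                    self.blocks[chunk] = entry
            physical.append(entry[0])
        return physical

cache = PrefixCache()
system_prompt = list(range(32))                 # 32 shared "system prompt" tokens
a = cache.allocate(system_prompt + [100, 101])  # request A
b = cache.allocate(system_prompt + [200, 201])  # request B
print(a[:2] == b[:2])  # True: both full system-prompt blocks are shared, tails differ
```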
How to Use vLLM
Installation
```bash
# Recommended: install with pip
pip install vllm

# Or with uv (faster)
uv pip install vllm --torch-backend=auto
```
Offline Batch Inference
```python
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one paragraph:",
    "What is the capital of Kenya?",
    "Write a Python function that reverses a string:",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256,
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}\n")
```
Starting the API Server
```bash
# Basic
vllm serve Qwen/Qwen2.5-7B-Instruct

# Production: FP8 quantization, 64K context, 90% GPU memory
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --quantization fp8 \
    --max-model-len 65536 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 \
    --port 8000
```
Querying the Server
```python
from openai import OpenAI

# Point the standard OpenAI client at your vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```
Or with curl:
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
    }'
```
Key Launch Parameters
| Parameter | What It Does | Example |
|---|---|---|
| `--model` | HuggingFace model ID or local path | `meta-llama/Llama-3.1-8B-Instruct` |
| `--quantization` | Quantization method | `fp8`, `awq`, `gptq` |
| `--max-model-len` | Max context window | `65536` |
| `--tensor-parallel-size` | Number of GPUs | `2`, `4` |
| `--gpu-memory-utilization` | Fraction of VRAM to use | `0.90` |
| `--dtype` | Model data type | `bfloat16`, `auto` |
vLLM vs. the Alternatives
| Factor | vLLM | Ollama | llama.cpp | TGI |
|---|---|---|---|---|
| Best for | Production GPU serving | Local dev/prototyping | CPU/edge inference | Legacy HF deployments |
| Throughput | 120–160 req/s | 1–3 req/s | <1 req/s concurrent | Good (2x slower than vLLM) |
| Multi-GPU | Tensor + pipeline parallel | No | Limited | Yes |
| API | OpenAI-compatible | Custom + OpenAI | C++ library | Custom |
| Setup | Moderate | Minimal | Minimal | Moderate |
| Hardware | NVIDIA, AMD, TPU | Any | Any (CPU, Metal) | NVIDIA, AMD |
Rule of thumb: Use vLLM when you have GPUs and need to serve multiple users. Use Ollama for quick local testing. Use llama.cpp for CPU-only or edge deployment. TGI entered maintenance mode in late 2025. HuggingFace now recommends vLLM or SGLang for new projects.
VRAM Estimation Cheat Sheet
| Model Size | FP16 | FP8 / INT8 | INT4 |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 6.5 GB |
| 70B | 140 GB | 70 GB | 35 GB |
Remember to reserve 4–5GB for the CUDA runtime, driver overhead, and activation buffers. A 70B FP8 model needs ~70GB for weights plus overhead, so it fits on a single 96GB GPU with room for KV cache.
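The table reduces to simple arithmetic: parameter count times bytes per parameter, plus a fixed overhead. A rough estimator in that spirit (a rule of thumb only; it ignores KV cache and activation spikes):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions, dtype, overhead_gb=5.0):
    """Rough VRAM for model weights plus CUDA runtime/driver overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[dtype]  # 1B params * 1 byte = 1 GB
    return weights_gb + overhead_gb

print(estimate_vram_gb(70, "fp8"))   # 75.0 -> fits a 96GB GPU with KV cache headroom
print(estimate_vram_gb(70, "fp16"))  # 145.0 -> needs multi-GPU
```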
Deployment Tips
- Start with `--gpu-memory-utilization 0.90` and drop to 0.85 if you see OOM errors
- Single GPU is faster than multi-GPU when the model fits, because tensor parallelism adds communication overhead
- Use FP8 on modern GPUs (H100, L40S, Blackwell) for 2x memory savings with minimal quality loss
- Enable prefix caching for chat apps with repeated system prompts
- Monitor KV cache utilization; if it stays above 90%, reduce `--max-num-seqs` or enable chunked prefill
- Always benchmark on your actual task before deploying a quantized model
The Research Paper
vLLM is based on the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon et al. from UC Berkeley, published at SOSP 2023.
Key findings from the paper:
- Existing systems waste 60–80% of KV cache memory due to contiguous pre-allocation
- PagedAttention reduces this to under 4% using block-based, on-demand allocation
- Copy-on-write sharing enables efficient parallel sampling and beam search
- 2–4x throughput over FasterTransformer and Orca, 14–24x over HuggingFace Transformers
Key Takeaways
- vLLM's PagedAttention treats GPU VRAM like OS virtual memory: paged, on-demand, and shared
- This eliminates 60–80% KV cache waste, enabling far more concurrent requests per GPU
- Combined with continuous batching, it achieves 14–24x throughput over naive inference
- The OpenAI-compatible API makes migration from hosted APIs trivial
- For GPU-based production LLM serving, vLLM is the current standard