Introduction

You have a fine-tuned 70B model. It works. Now you need to serve it to users: quickly, reliably, and without burning money on GPU hours. This is where most teams hit a wall: naive inference with HuggingFace Transformers wastes 60–80% of KV cache memory, handles one request at a time, and leaves expensive hardware sitting idle.

vLLM is an open-source inference engine from UC Berkeley that solves this. Its core innovation, PagedAttention, borrows virtual memory concepts from operating systems to manage the KV cache, achieving 14–24x higher throughput than HuggingFace Transformers. I use it to serve models on an NVIDIA RTX Pro 6000 Blackwell (96GB VRAM) and it's become the standard for serious LLM deployment.

The Problem: GPU Memory Waste

During autoregressive generation, every token's attention keys and values (the KV cache) must live in GPU memory. For a 13B model, each token needs ~800KB of KV cache. With a 2048-token max sequence, that's up to 1.6GB per request.
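That ~800KB figure follows from a back-of-the-envelope formula: 2 tensors (keys and values) × number of layers × hidden size × bytes per value. The 40-layer, 5120-hidden shape below is an assumed 13B-class configuration for illustration; real architectures vary.

```python
# Rough KV cache footprint per token:
#   2 (K and V) x num_layers x hidden_size x bytes_per_value
def kv_bytes_per_token(num_layers: int, hidden_size: int,
                       bytes_per_value: int = 2) -> int:
    return 2 * num_layers * hidden_size * bytes_per_value

# Assumed 13B-class model: 40 layers, hidden size 5120, FP16 (2 bytes)
per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)
per_request = per_token * 2048  # 2048-token max sequence

print(f"{per_token / 1024:.0f} KB per token")         # 800 KB
print(f"{per_request / 1024**3:.2f} GB per request")  # 1.56 GB
```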

Traditional systems pre-allocate a contiguous block of GPU memory for each request, sized for the maximum possible sequence length. If a user sends a short prompt and gets a short reply, most of that block goes unused. Across many concurrent requests, this creates three forms of waste:

  • Internal fragmentation: unused slots within each allocated block
  • Reservation waste: entire blocks held for sequences that never grow
  • External fragmentation: gaps between blocks that can't be reused

The result: existing systems waste 60–80% of KV cache memory. Fewer concurrent requests fit in memory, batch sizes stay small, and GPU compute sits idle.

PagedAttention: Virtual Memory for KV Cache

PagedAttention is vLLM's core innovation, published at SOSP 2023 (one of the top systems conferences). The idea is simple: apply the same techniques your OS uses for RAM management to GPU KV cache management.

Instead of one contiguous block per request, PagedAttention:

  1. Divides the KV cache into fixed-size blocks (typically 16 tokens per block)
  2. Allocates blocks on demand as tokens are generated, not upfront
  3. Maps logical blocks to physical blocks via a block table (like an OS page table)
  4. Shares physical blocks across requests with common prefixes (copy-on-write)

OS Concept               PagedAttention Equivalent
Virtual address space    Logical KV blocks (sequential view)
Physical memory frames   Physical KV blocks in GPU VRAM
Page table               Block table
On-demand paging         Incremental block allocation
Shared memory pages      Shared blocks for common prefixes

The only wasted memory per sequence is the last partially-filled block, at most 15 tokens. Overall KV cache waste drops from 60–80% to under 4%.
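The block-table mechanics can be sketched in a few lines. This is a toy illustration of the idea, not vLLM's internals; the `BlockTable` class and `append_token` method are invented names, while the 16-token block size comes from the description above.

```python
# Toy sketch of PagedAttention-style allocation: each sequence's logical
# blocks map to physical blocks that are claimed only on demand.
BLOCK_SIZE = 16  # tokens per block

class BlockTable:
    def __init__(self, free_blocks):
        self.free = free_blocks   # pool of free physical block ids
        self.tables = {}          # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Claim a new physical block only when a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:        # first token of a new logical block
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE]  # physical block holding this token

mgr = BlockTable(free_blocks=list(range(100)))
for pos in range(20):                    # generate 20 tokens for one sequence
    mgr.append_token("seq-0", pos)

# 20 tokens occupy ceil(20/16) = 2 blocks; the only waste is the 12 unused
# slots in the last block, never a whole max-length pre-allocation.
print(len(mgr.tables["seq-0"]))  # 2
```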

Why This Matters for GPU Utilization

With near-zero memory waste, vLLM can fit far more concurrent requests in the same GPU. Combined with continuous batching (inserting new requests into the batch as others finish, rather than waiting for a static batch), the throughput gains are massive:

Comparison                          Throughput Gain
vLLM vs. HuggingFace Transformers   14–24x
vLLM vs. HuggingFace TGI            2.2–2.5x
vLLM vs. FasterTransformer          2–4x

The advantage grows with longer sequences, larger models, and more complex decoding (beam search, parallel sampling) because those are exactly the scenarios where KV cache memory pressure is worst.

Key Features

Continuous Batching

Static batching waits for a fixed batch, pads shorter sequences, and blocks everything until the longest finishes. vLLM's continuous batching changes the batch composition at every decoding step. As soon as one sequence finishes, a new request takes its slot. No waiting, no padding.
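A toy simulation makes the slot-recycling behavior concrete. The `continuous_batching` function and its request format are invented for illustration; real vLLM schedules at token granularity with far more machinery.

```python
from collections import deque

# Toy continuous-batching loop: the batch is recomposed at every decoding
# step, so a finished sequence frees its slot immediately.
def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)   # (request_id, tokens_remaining)
    running, steps = [], 0
    while waiting or running:
        # refill free slots from the waiting queue at every step
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # one decoding step: every running sequence emits a token
        running = [(rid, left - 1) for rid, left in running]
        running = [r for r in running if r[1] > 0]  # drop finished sequences
        steps += 1
    return steps

# 8 requests of mixed lengths on 4 slots: short sequences hand their slots
# to waiting requests instead of idling until the longest one finishes.
lengths = [2, 8, 3, 8, 2, 8, 3, 8]
reqs = [(f"r{i}", n) for i, n in enumerate(lengths)]
print(continuous_batching(reqs))  # 15 steps; static batches of 4 would take 16
```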

Tensor Parallelism

For models that don't fit on a single GPU, vLLM shards weight matrices across multiple GPUs. A 70B model in FP16 (~140GB) can run on 2x H100s (80GB each) with --tensor-parallel-size 2. NVLink interconnect is strongly recommended since PCIe degrades throughput significantly.

Quantization Support

vLLM supports GPTQ, AWQ, FP8, bitsandbytes, and more. FP8 is particularly interesting on modern hardware (H100, L40S), offering 2x memory reduction with up to 1.6x throughput improvement and minimal quality loss.

OpenAI-Compatible API

vLLM serves an OpenAI-compatible API out of the box. You can point any existing OpenAI client at your vLLM server by changing the base URL. Zero code changes for migration.

Prefix Caching

Requests sharing common prefixes (e.g., the same system prompt) automatically share KV cache blocks via copy-on-write. For chat applications with repeated system prompts, this saves significant memory.
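In recent vLLM releases this can be toggled explicitly; a configuration sketch, with the caveat that flag names and defaults vary across versions (newer releases enable prefix caching by default, and the server-side equivalent is the --enable-prefix-caching flag), so check vllm serve --help for your install:

```python
from vllm import LLM

# Configuration sketch: turn on automatic prefix caching so requests
# sharing a system prompt reuse the same KV cache blocks.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    enable_prefix_caching=True,
)
```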

How to Use vLLM

Installation

# Recommended: install with pip
pip install vllm

# Or with uv (faster)
uv pip install vllm --torch-backend=auto

Offline Batch Inference

from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one paragraph:",
    "What is the capital of Kenya?",
    "Write a Python function that reverses a string:",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256,
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}\n")

Starting the API Server

# Basic
vllm serve Qwen/Qwen2.5-7B-Instruct

# Production: FP8 quantization, 64K context, 90% GPU memory
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --quantization fp8 \
    --max-model-len 65536 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 \
    --port 8000

Querying the Server

from openai import OpenAI

# Point the standard OpenAI client at your vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)

Or with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'

Key Launch Parameters

Parameter                  What It Does                         Example
--model                    HuggingFace model ID or local path   meta-llama/Llama-3.1-8B-Instruct
--quantization             Quantization method                  fp8, awq, gptq
--max-model-len            Max context window                   65536
--tensor-parallel-size     Number of GPUs                       2, 4
--gpu-memory-utilization   Fraction of VRAM to use              0.90
--dtype                    Model data type                      bfloat16, auto

vLLM vs. the Alternatives

Factor       vLLM                         Ollama                  llama.cpp             TGI
Best for     Production GPU serving       Local dev/prototyping   CPU/edge inference    Legacy HF deployments
Throughput   120–160 req/s                1–3 req/s               <1 req/s concurrent   Good (2x slower than vLLM)
Multi-GPU    Tensor + pipeline parallel   No                      Limited               Yes
API          OpenAI-compatible            Custom + OpenAI         C++ library           Custom
Setup        Moderate                     Minimal                 Minimal               Moderate
Hardware     NVIDIA, AMD, TPU             Any                     Any (CPU, Metal)      NVIDIA, AMD

Rule of thumb: Use vLLM when you have GPUs and need to serve multiple users. Use Ollama for quick local testing. Use llama.cpp for CPU-only or edge deployment. TGI entered maintenance mode in late 2025. HuggingFace now recommends vLLM or SGLang for new projects.

VRAM Estimation Cheat Sheet

Model Size   FP16     FP8 / INT8   INT4
7B           14 GB    7 GB         3.5 GB
13B          26 GB    13 GB        6.5 GB
70B          140 GB   70 GB        35 GB

Remember to reserve 4–5GB for the CUDA runtime, driver overhead, and activation buffers. A 70B FP8 model needs ~70GB for weights plus overhead, so it fits on a single 96GB GPU with room for KV cache.
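That arithmetic is easy to script. This is a rough heuristic, not an exact accounting (actual usage depends on context length, batch size, and runtime version), and `estimate_vram_gb` is a made-up helper:

```python
# Rule of thumb: weights = params (billions) x bytes per param, plus a
# fixed reserve for CUDA runtime, driver, and activation buffers.
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     overhead_gb: float = 5.0) -> float:
    return params_b * bytes_per_param + overhead_gb

# 70B at FP8 (1 byte/param): ~70 GB weights + ~5 GB overhead = ~75 GB,
# leaving roughly 21 GB of a 96 GB card for KV cache.
print(estimate_vram_gb(70, 1.0))  # 75.0
```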

Deployment Tips

  • Start with --gpu-memory-utilization 0.90 and drop to 0.85 if you see OOM errors
  • Single GPU is faster than multi-GPU when the model fits because tensor parallelism adds communication overhead
  • Use FP8 on modern GPUs (H100, L40S, Blackwell) for 2x memory savings with minimal quality loss
  • Enable prefix caching for chat apps with repeated system prompts
  • Monitor KV cache utilization. If it stays above 90%, reduce --max-num-seqs or enable chunked prefill
  • Always benchmark on your actual task before deploying a quantized model

The Research Paper

vLLM is based on the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon et al. from UC Berkeley, published at SOSP 2023.

Key findings from the paper:

  • Existing systems waste 60–80% of KV cache memory due to contiguous pre-allocation
  • PagedAttention reduces this to under 4% using block-based, on-demand allocation
  • Copy-on-write sharing enables efficient parallel sampling and beam search
  • 2–4x throughput over FasterTransformer and Orca, 14–24x over HuggingFace Transformers

Key Takeaways

  • vLLM's PagedAttention treats GPU VRAM like OS virtual memory: paged, on-demand, and shared
  • This eliminates 60–80% KV cache waste, enabling far more concurrent requests per GPU
  • Combined with continuous batching, it achieves 14–24x throughput over naive inference
  • The OpenAI-compatible API makes migration from hosted APIs trivial
  • For GPU-based production LLM serving, vLLM is the current standard

Further Reading