Introduction
You have a fine-tuned 70B model. It works. Now you need to serve it to users: quickly, reliably, and without burning money on GPU hours. This is where most teams hit a wall: naive inference with HuggingFace Transformers wastes 60–80% of KV cache memory, handles one request at a time, and leaves expensive hardware sitting idle.
vLLM is an open-source inference engine from UC Berkeley that solves this. Its core innovation, PagedAttention, borrows virtual memory concepts from operating systems to manage the KV cache, achieving 14–24x higher throughput than HuggingFace Transformers. I use it to serve models on an NVIDIA RTX Pro 6000 Blackwell (96GB VRAM) and it's become the standard for serious LLM deployment.
The Problem: GPU Memory Waste
During autoregressive generation, every token's attention keys and values (the KV cache) must live in GPU memory. For a 13B model, each token needs ~800KB of KV cache. With a 2048-token max sequence, that's up to 1.6GB per request.
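The ~800KB figure falls out of the model's dimensions. A quick sanity check, assuming a typical 13B architecture with 40 transformer layers and a hidden size of 5120 (these shape values are assumptions for illustration) in FP16:

```python
# KV cache per token = 2 (keys and values) * layers * hidden_size * bytes per value
num_layers = 40      # transformer blocks in a typical 13B model (assumed)
hidden_size = 5120   # num_heads * head_dim (assumed)
fp16_bytes = 2

kv_bytes_per_token = 2 * num_layers * hidden_size * fp16_bytes
print(kv_bytes_per_token / 1024)        # 800.0 KB per token
print(kv_bytes_per_token * 2048 / 1e9)  # ~1.68 GB for a full 2048-token sequence
```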
Traditional systems pre-allocate a contiguous block of GPU memory for each request, sized for the maximum possible sequence length. If a user sends a short prompt and gets a short reply, most of that block goes unused. Across many concurrent requests, this creates three forms of waste:
- Internal fragmentation: unused slots within each allocated block
- Reservation waste: entire blocks held for sequences that never grow
- External fragmentation: gaps between blocks that can't be reused
The result: existing systems waste 60–80% of KV cache memory. Fewer concurrent requests fit in memory, batch sizes stay small, and GPU compute sits idle.
PagedAttention: Virtual Memory for KV Cache
PagedAttention is vLLM's core innovation, published at SOSP 2023 (one of the top systems conferences). The idea is simple: apply the same techniques your OS uses for RAM management to GPU KV cache management.
Instead of one contiguous block per request, PagedAttention:
- Divides the KV cache into fixed-size blocks (typically 16 tokens per block)
- Allocates blocks on demand as tokens are generated, not upfront
- Maps logical blocks to physical blocks via a block table (like an OS page table)
- Shares physical blocks across requests with common prefixes (copy-on-write)
| OS Concept | PagedAttention Equivalent |
|---|---|
| Virtual address space | Logical KV blocks (sequential view) |
| Physical memory frames | Physical KV blocks in GPU VRAM |
| Page table | Block table |
| On-demand paging | Incremental block allocation |
| Shared memory pages | Shared blocks for common prefixes |
The only wasted memory per sequence is the last partially-filled block, at most 15 tokens. Overall KV cache waste drops from 60–80% to under 4%.
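The block-table indirection fits in a few lines of Python. This is a toy allocator, not vLLM's implementation, but it shows the key move: physical blocks come from a shared pool and are claimed only when the previous block fills up.

```python
BLOCK_SIZE = 16  # tokens per KV block, matching vLLM's typical default

class BlockTable:
    """Toy logical-to-physical block mapping, in the spirit of an OS page table."""
    def __init__(self, free_blocks):
        self.free = free_blocks  # pool of physical block ids shared by all sequences
        self.table = []          # logical block index -> physical block id

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the last one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())

free_pool = list(range(100))  # pretend VRAM holds 100 physical blocks
seq = BlockTable(free_pool)
for t in range(40):           # generate 40 tokens
    seq.append_token(t)

print(len(seq.table))  # 3 blocks (ceil(40/16)); waste is only the tail of the last one
```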
Why This Matters for GPU Utilization
With near-zero memory waste, vLLM can fit far more concurrent requests in the same GPU. Combined with continuous batching (inserting new requests into the batch as others finish, rather than waiting for a static batch), the throughput gains are massive:
| Comparison | Throughput Gain |
|---|---|
| vLLM vs. HuggingFace Transformers | 14–24x |
| vLLM vs. HuggingFace TGI | 2.2–2.5x |
| vLLM vs. FasterTransformer | 2–4x |
The advantage grows with longer sequences, larger models, and more complex decoding (beam search, parallel sampling) because those are exactly the scenarios where KV cache memory pressure is worst.
Key Features
Continuous Batching
Static batching waits for a fixed batch, pads shorter sequences, and blocks everything until the longest finishes. vLLM's continuous batching changes the batch composition at every decoding step. As soon as one sequence finishes, a new request takes its slot. No waiting, no padding.
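The scheduling idea can be sketched as a toy loop: a queue of pending requests, a fixed number of batch slots, and refills at every decoding step (a simplification, not vLLM's actual scheduler):

```python
from collections import deque

MAX_SLOTS = 4
# (request id, tokens still to generate); lengths vary as in real traffic
pending = deque([(f"req{i}", 2 + i % 3) for i in range(8)])
running = []
steps = 0

while pending or running:
    # Admit new requests the moment a slot frees up -- no waiting for a full batch.
    while pending and len(running) < MAX_SLOTS:
        running.append(list(pending.popleft()))
    # One decoding step: every running sequence emits one token.
    for seq in running:
        seq[1] -= 1
    running = [s for s in running if s[1] > 0]
    steps += 1

print(steps)  # 7 steps; two static batches of 4, each waiting for its longest member, need 8
```

The gap widens as request lengths diverge: with static batching, one long sequence holds every slot in its batch hostage.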
Tensor Parallelism
For models that don't fit on a single GPU, vLLM shards weight matrices across multiple GPUs. A 70B model in FP16 (~140GB) can run on 2x H100s (80GB each) with `--tensor-parallel-size 2`. NVLink interconnect is strongly recommended, since PCIe degrades throughput significantly.
Quantization Support
vLLM supports GPTQ, AWQ, FP8, bitsandbytes, and more. FP8 is particularly interesting on modern hardware (H100, L40S), offering 2x memory reduction with up to 1.6x throughput improvement and minimal quality loss.
OpenAI-Compatible API
vLLM serves an OpenAI-compatible API out of the box. You can point any existing OpenAI client at your vLLM server by changing the base URL. Zero code changes for migration.
Prefix Caching
Requests sharing common prefixes (e.g., the same system prompt) automatically share KV cache blocks via copy-on-write. For chat applications with repeated system prompts, this saves significant memory.
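The sharing mechanism can be illustrated with a toy content-addressed cache (a simplification of vLLM's prefix caching, not its actual code): full blocks with identical token content map to one physical block with a reference count.

```python
BLOCK_SIZE = 16

class PrefixCache:
    """Toy prefix sharing: identical full KV blocks resolve to one physical block."""
    def __init__(self):
        self.blocks = {}   # block content -> [physical_id, refcount]
        self.next_id = 0

    def allocate(self, tokens):
        physical = []
        for i in range(0, len(tokens), BLOCK_SIZE):
            chunk = tuple(tokens[i:i + BLOCK_SIZE])
            if len(chunk) == BLOCK_SIZE and chunk in self.blocks:
                entry = self.blocks[chunk]
                entry[1] += 1  # share the existing physical block
            else:
                entry = [self.next_id, 1]
                self.next_id += 1
                if len(chunk) == BLOCK_SIZE:  # only full blocks are shareable
                    self.blocks[chunk] = entry
            physical.append(entry[0])
        return physical

cache = PrefixCache()
system_prompt = list(range(32))                 # 32 shared "system prompt" tokens
a = cache.allocate(system_prompt + [100, 101])  # request A
b = cache.allocate(system_prompt + [200, 201])  # request B
print(a[:2] == b[:2])  # True: both full system-prompt blocks are shared, tails differ
```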
How to Use vLLM
Installation
```bash
# Recommended: install with pip
pip install vllm

# Or with uv (faster)
uv pip install vllm --torch-backend=auto
```
Offline Batch Inference
```python
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one paragraph:",
    "What is the capital of Kenya?",
    "Write a Python function that reverses a string:",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256,
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}\n")
```
Starting the API Server
```bash
# Basic
vllm serve Qwen/Qwen2.5-7B-Instruct

# Production: FP8 quantization, 64K context, 90% GPU memory
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --quantization fp8 \
    --max-model-len 65536 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 \
    --port 8000
```
Querying the Server
```python
from openai import OpenAI

# Point the standard OpenAI client at your vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```
Or with curl:
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
    }'
```
Key Launch Parameters
| Parameter | What It Does | Example |
|---|---|---|
| `--model` | HuggingFace model ID or local path | `meta-llama/Llama-3.1-8B-Instruct` |
| `--quantization` | Quantization method | `fp8`, `awq`, `gptq` |
| `--max-model-len` | Max context window | `65536` |
| `--tensor-parallel-size` | Number of GPUs | `2`, `4` |
| `--gpu-memory-utilization` | Fraction of VRAM to use | `0.90` |
| `--dtype` | Model data type | `bfloat16`, `auto` |
vLLM vs. the Alternatives
| Factor | vLLM | Ollama | llama.cpp | TGI |
|---|---|---|---|---|
| Best for | Production GPU serving | Local dev/prototyping | CPU/edge inference | Legacy HF deployments |
| Throughput | 120–160 req/s | 1–3 req/s | <1 req/s concurrent | Good (2x slower than vLLM) |
| Multi-GPU | Tensor + pipeline parallel | No | Limited | Yes |
| API | OpenAI-compatible | Custom + OpenAI | C++ library | Custom |
| Setup | Moderate | Minimal | Minimal | Moderate |
| Hardware | NVIDIA, AMD, TPU | Any | Any (CPU, Metal) | NVIDIA, AMD |
Rule of thumb: Use vLLM when you have GPUs and need to serve multiple users. Use Ollama for quick local testing. Use llama.cpp for CPU-only or edge deployment. TGI entered maintenance mode in late 2025. HuggingFace now recommends vLLM or SGLang for new projects.
VRAM Estimation Cheat Sheet
| Model Size | FP16 | FP8 / INT8 | INT4 |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 6.5 GB |
| 70B | 140 GB | 70 GB | 35 GB |
Remember to reserve 4–5GB for the CUDA runtime, driver overhead, and activation buffers. A 70B FP8 model needs ~70GB for weights plus overhead, so it fits on a single 96GB GPU with room for KV cache.
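The table reduces to simple arithmetic: parameter count times bytes per parameter, plus a fixed overhead. A rough estimator in that spirit (a rule of thumb only; it ignores KV cache and activation spikes):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions, dtype, overhead_gb=5.0):
    """Rough VRAM for model weights plus CUDA runtime/driver overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[dtype]  # 1B params * 1 byte = 1 GB
    return weights_gb + overhead_gb

print(estimate_vram_gb(70, "fp8"))   # 75.0 -> fits a 96GB GPU with KV cache headroom
print(estimate_vram_gb(70, "fp16"))  # 145.0 -> needs multi-GPU
```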
Deployment Tips
- Start with `--gpu-memory-utilization 0.90` and drop to 0.85 if you see OOM errors
- Single GPU is faster than multi-GPU when the model fits, because tensor parallelism adds communication overhead
- Use FP8 on modern GPUs (H100, L40S, Blackwell) for 2x memory savings with minimal quality loss
- Enable prefix caching for chat apps with repeated system prompts
- Monitor KV cache utilization; if it stays above 90%, reduce `--max-num-seqs` or enable chunked prefill
- Always benchmark on your actual task before deploying a quantized model
The Research Paper
vLLM is based on the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon et al. from UC Berkeley, published at SOSP 2023.
Key findings from the paper:
- Existing systems waste 60–80% of KV cache memory due to contiguous pre-allocation
- PagedAttention reduces this to under 4% using block-based, on-demand allocation
- Copy-on-write sharing enables efficient parallel sampling and beam search
- 2–4x throughput over FasterTransformer and Orca, 14–24x over HuggingFace Transformers
Key Takeaways
- vLLM's PagedAttention treats GPU VRAM like OS virtual memory: paged, on-demand, and shared
- This eliminates 60–80% KV cache waste, enabling far more concurrent requests per GPU
- Combined with continuous batching, it achieves 14–24x throughput over naive inference
- The OpenAI-compatible API makes migration from hosted APIs trivial
- For GPU-based production LLM serving, vLLM is the current standard