Hardware: a CUDA GPU with at least 8 GB VRAM for a 4-bit AWQ 4B model, 24 GB for a 7B. Python: 3.11+.
Why vLLM
You can serve an LLM with transformers.pipeline(). You'll handle exactly one user at a time, and your GPU will sit at 30% utilisation. vLLM's job is to make the same hardware do 10× the work via continuous batching, paged attention, and prefix caching.
The deeper dive on why these techniques work lives at How I Get the Best Out of My GPU Using vLLM. This post is the operational pipeline: install, launch, configure, integrate.
Install
# vLLM brings its own pinned torch
uv venv --python 3.11
uv pip install vllm==0.7.0
vLLM doesn't run on Windows natively; use WSL2 or a Linux box. Mac M-series users: vLLM doesn't target Apple Silicon — use Ollama or llama.cpp with the GGUF artefact from Part 7 instead.
Launch the Server
Save as scripts/launch_vllm.sh:
#!/usr/bin/env bash
set -euo pipefail
MODEL_PATH=${MODEL_PATH:-./outputs/qwen3-4b-domain-awq}
LORA_PATH=${LORA_PATH:-./outputs/qwen3-4b-domain-adapter}
PORT=${PORT:-8001}
vllm serve "$MODEL_PATH" \
--port "$PORT" \
--quantization awq_marlin \
--gpu-memory-utilization 0.85 \
--max-model-len 4096 \
--max-num-seqs 16 \
--enable-prefix-caching \
--enable-lora \
--max-loras 4 \
--max-lora-rank 16 \
--lora-modules "domain=$LORA_PATH"
Run it:
chmod +x scripts/launch_vllm.sh && ./scripts/launch_vllm.sh
vLLM exposes an OpenAI-compatible API at http://localhost:8001/v1. Test it:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "domain",
"messages": [{"role":"user","content":"Convert temperature.\n\n75 F"}],
"max_tokens": 50
}'
Use "model": "domain" to hit the LoRA-adapted alias, or the base model path for un-adapted base behaviour. One process, two endpoints.
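The server takes a while to load weights after launch, so scripts that drive it should poll for readiness before sending traffic. A minimal sketch in Python, assuming the /health endpoint vLLM's OpenAI-compatible server exposes; the probe is injected so the loop itself can be exercised offline:

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


def wait_for_ready(probe: Callable[[], bool],
                   timeout_s: float = 120.0,
                   interval_s: float = 1.0) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

A deploy script can then gate traffic on `wait_for_ready(lambda: http_ok("http://localhost:8001/health"))` instead of sleeping a fixed number of seconds.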
The Flags That Matter
| Flag | What it controls | What goes wrong if you skip it |
|---|---|---|
| --quantization awq_marlin | Tells vLLM to use the Marlin INT4 kernel for AWQ weights | Loads via a slow path or fails outright |
| --gpu-memory-utilization 0.85 | Fraction of VRAM vLLM reserves for weights + KV cache | Default 0.9 OOMs on shared GPUs; lower it to share with other workloads |
| --max-model-len 4096 | Context window. Smaller = larger KV-cache budget = more concurrent users | Set this to your real workload, not the model's max |
| --max-num-seqs 16 | Concurrency cap | Default is high; cap to your KV-cache budget. Set lower if you OOM mid-batch |
| --enable-prefix-caching | KV-cache reuse for stable prefixes | You leave free latency on the table for any workload with shared system prompts |
| --enable-lora | Allows LoRA modules to be loaded on top of the base | Can't serve adapters at all without this |
| --max-loras 4 --max-lora-rank 16 | Per-process LoRA slot count and max rank | Adapters with rank > this are rejected |
| --lora-modules name=path | Pre-load adapters at startup; reference them by name in requests | Without this you can serve only the base |
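--max-model-len and --max-num-seqs trade off against each other through simple arithmetic: every token of context costs a fixed number of KV-cache bytes, so a smaller context window buys more concurrent sequences. A back-of-the-envelope sketch with illustrative dimensions (36 layers, 8 KV heads, head dim 128, FP16 cache), not any particular model's real config:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors, per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent_seqs(cache_budget_gb: float, max_model_len: int,
                        per_token: int) -> int:
    # Worst case: every sequence fills the full context window
    return int(cache_budget_gb * 1024**3 // (max_model_len * per_token))


per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)
print(per_token)                                  # 147456 bytes, ~144 KiB/token
print(max_concurrent_seqs(4.0, 4096, per_token))  # 7 full-length seqs in 4 GB
```

Halving --max-model-len roughly doubles the worst-case concurrency for the same cache budget, which is why it should track your real workload rather than the model's maximum.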
Multi-LoRA Serving
The single biggest operational win vLLM gives you: one base model, many specialised adapters, near-zero memory cost per adapter. Add more --lora-modules entries:
--lora-modules \
"support=./adapters/customer-support" \
"code-review=./adapters/code-review" \
"summariser=./adapters/summariser"
Then in requests:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"code-review","messages":[{"role":"user","content":"Review this PR..."}]}'
vLLM swaps the active LoRA per-request based on the model field. The base weights stay loaded once. This is how you serve 10 specialised models on a single GPU without 10× the memory.
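Request routing then becomes a lookup on the model field. A hypothetical sketch: the adapter names match the --lora-modules flags above, and the "domain" fallback assumes the alias from the launch script:

```python
# Hypothetical task -> adapter routing on top of the multi-LoRA server above
ADAPTERS = {"support": "support", "review": "code-review", "summary": "summariser"}


def chat_request(task: str, user_content: str) -> dict:
    """Build an OpenAI-style payload; the model field selects the LoRA."""
    model = ADAPTERS.get(task, "domain")  # unknown tasks fall back to the default adapter
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }


print(chat_request("review", "Review this PR...")["model"])  # code-review
```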
Prefix Caching: The Free Win
Most LLM workloads have a stable system prompt — "You are a helpful assistant ..." or a multi-page RAG context that's the same across many calls in a session. Prefix caching means vLLM keeps the KV cache for that shared prefix and reuses it on every subsequent call. Latency on calls with shared prefixes drops by 30–70%.
The flag costs nothing. Always enable it.
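The one thing to get right in client code: the shared prefix must be byte-identical across calls and must come first, with per-call content after it. A small sketch of that discipline, with an assumed system prompt:

```python
import json

# Stable across every call in the session; one-character drift breaks cache reuse
SYSTEM_PROMPT = "You are a helpful assistant for unit conversions."


def build_messages(question: str) -> list[dict]:
    # Shared part first and byte-identical, variable content last,
    # so the cached KV prefix covers the system prompt on every request
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


a = build_messages("75 F in C?")
b = build_messages("12 km in miles?")
# The first message serialises identically, so its KV cache can be reused
assert json.dumps(a[0]) == json.dumps(b[0])
```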
Continuous Batching (One Paragraph)
Static batching: gather N requests, run them together, return all responses when the slowest finishes. Short requests therefore wait on long ones. Continuous batching: scheduling happens at the granularity of a decoding step, so the moment one sequence finishes, its slot goes to the next waiting request. The GPU stays full, and no request is held hostage by an unrelated long one. This is the biggest single reason vLLM beats transformers.pipeline() by an order of magnitude. Deeper treatment in the vLLM deep-dive post.
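The effect can be made concrete with a toy simulation under heavily idealised assumptions: one token per time step, a freed slot is refilled instantly, and prefill is free:

```python
import heapq


def static_batching(lengths: list[int], batch: int) -> list[int]:
    """Each batch returns only when its longest member finishes."""
    done, t = [], 0
    for i in range(0, len(lengths), batch):
        chunk = lengths[i:i + batch]
        t += max(chunk)
        done.extend([t] * len(chunk))
    return done


def continuous_batching(lengths: list[int], slots: int) -> list[int]:
    """A finished sequence frees its slot for the next request immediately."""
    free_at = [0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    done = []
    for n in lengths:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + n)
        done.append(start + n)
    return done


lengths = [5, 5, 5, 100]
print(static_batching(lengths, batch=4))      # [100, 100, 100, 100]
print(continuous_batching(lengths, slots=2))  # [5, 5, 10, 105]
```

Under static batching every short request pays for the 100-token straggler; under continuous batching the short requests complete in roughly their own length.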
Swap vLLM into the Part 5 RAG System
The wrapper from Part 2 was built provider-agnostic on purpose. Swapping in vLLM is a config change, not a code change. Add a self-hosted backend to llm_client.py:
import os

from openai import OpenAI
class VLLMClient(OpenAIClient):
"""Same as OpenAIClient but talks to a local vLLM endpoint."""
def __init__(self, model: str = "domain", tracker: CostTracker | None = None,
base_url: str = "http://localhost:8001/v1") -> None:
# vLLM's API is OpenAI-compatible — reuse the OpenAI SDK
self._client = OpenAI(api_key="EMPTY", base_url=base_url, timeout=30.0)
self._model = model
self._tracker = tracker or CostTracker()
def get_client(provider: str | None = None) -> LLMClient:
provider = (provider or os.environ.get("LLM_PROVIDER", "openai")).lower()
if provider == "openai": return OpenAIClient(tracker=_TRACKER)
if provider == "anthropic": return AnthropicClient(tracker=_TRACKER)
if provider == "vllm": return VLLMClient(tracker=_TRACKER)
raise ValueError(f"unknown provider: {provider}")
Now switch the FastAPI service to use it:
export LLM_PROVIDER=vllm
uv run uvicorn text_tool.server:app --host 0.0.0.0 --port 8000
The /rag endpoint from Part 5 now grounds answers via your own quantised, fine-tuned model. Zero downstream API spend. The same Pydantic schemas, the same retrieval, the same FastAPI endpoint — only the LLM provider has changed. That is the whole point of the wrapper from Part 2.
Update the Pricing Map
The CostTracker from Part 2 prices OpenAI/Anthropic per token. Self-hosted vLLM has a different cost model — GPU-hour cost amortised over throughput. Add a flat zero entry so you don't get bogus dollar figures:
PRICING = {
"domain": {"in": 0.0, "out": 0.0}, # self-hosted; cost lives in the GPU bill
"gpt-4o-mini": {"in": 0.15, "out": 0.60},
# ...
}
For real cost accounting on a self-hosted endpoint, divide GPU $/hour by tokens/hour; that is your effective per-token cost. Track it separately from API spend: you pay per GPU-hour regardless of traffic, so the number that actually appears on the bill is GPU $/hour, and the derived $/token falls as throughput rises.
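A worked example with made-up numbers, neither a real GPU price nor a measured throughput:

```python
def per_million_token_cost(gpu_dollars_per_hour: float,
                           tokens_per_second: float) -> float:
    """Amortise GPU-hour cost over sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Illustrative: a $1.20/hr GPU sustaining 800 output tok/s across the batch
print(round(per_million_token_cost(1.20, 800), 3))  # 0.417 dollars per 1M tokens
```

The same GPU at 80 tok/s costs ten times as much per token, which is why utilisation, not list price, decides whether self-hosting beats the API.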
Operational Concerns
systemd Unit
Save as /etc/systemd/system/vllm.service:
[Unit]
Description=vLLM serving stack
After=network.target
[Service]
Type=simple
User=ai
WorkingDirectory=/opt/text-tool
Environment="HF_HOME=/opt/cache/hf"
ExecStart=/opt/text-tool/scripts/launch_vllm.sh
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/vllm/stdout.log
StandardError=append:/var/log/vllm/stderr.log
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo journalctl -u vllm -f
Log Rotation
Drop into /etc/logrotate.d/vllm:
/var/log/vllm/*.log {
daily
rotate 14
compress
missingok
notifempty
copytruncate
}
GPU Monitoring
Quick check from the host:
watch -n 1 nvidia-smi
For Prometheus-style metrics, vLLM serves a /metrics endpoint on the same port; --otlp-traces-endpoint additionally exports per-request traces via OpenTelemetry. Pair with Grafana for a real dashboard. Detailed setup is in Part 10.
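The /metrics output is plain Prometheus text exposition, so it's easy to scrape ad hoc. A tiny parser over a captured sample; the sample lines mirror the shape of vLLM's metrics, though exact metric names vary by version:

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus exposition lines into {name: value}."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # ignore malformed lines in this sketch
    return out


# Captured sample in the shape vLLM emits (names vary by version)
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.42
"""

m = parse_metrics(SAMPLE)
print(m["vllm:num_requests_running"])  # 3.0
```

In production you'd point Prometheus at the endpoint directly; this sketch is for quick shell-side checks when a dashboard is overkill.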
When vLLM Is the Wrong Choice
- Very small models (<1B). vLLM's overhead dominates; raw transformers or llama.cpp wins
- CPU-only deployment. Use llama.cpp with the GGUF from Part 7
- Single-user workload. Use Ollama. It's friendlier and the multi-user advantage doesn't matter
- Apple Silicon. vLLM doesn't target it; llama.cpp / Ollama do, natively and well
Key Takeaways
- vLLM is the default GPU serving stack in 2026 because continuous batching + paged attention + prefix caching is exactly what an LLM workload needs
- One vLLM process, one base model, many LoRA adapters — pick the adapter per request via the model field
- Always enable prefix caching. It's free latency reduction
- The Part 2 client wrapper makes the swap from API to self-hosted a one-line config change
- Run it under systemd with log rotation. nvidia-smi is a crude but effective monitoring loop
- vLLM is the wrong tool for tiny models, CPU-only deployment, or Apple Silicon. Use llama.cpp / Ollama there
Next Up
Part 9 takes a different optimisation angle: distillation. Use the Part 6 fine-tuned model as a teacher; train a much smaller student that runs on a laptop. The student plus the AWQ quantization compose — you can have both.