Hardware: a CUDA GPU with at least 8 GB VRAM for a 4-bit AWQ 4B model, 24 GB for a 7B. Python: 3.11+.
Why vLLM
You can serve an LLM with transformers.pipeline(). You'll handle exactly one user at a time, and your GPU will sit at 30% utilisation. vLLM's job is to make the same hardware do 10× the work via continuous batching, paged attention, and prefix caching.
The deeper dive on why these techniques work lives at How I Get the Best Out of My GPU Using vLLM. This post is the operational pipeline: install, launch, configure, integrate.
Install
# vLLM brings its own pinned torch
uv venv --python 3.11
uv pip install vllm==0.7.0
vLLM doesn't run on Windows natively; use WSL2 or a Linux box. Mac M-series users: vLLM doesn't target Apple Silicon — use Ollama or llama.cpp with the GGUF artefact from Part 7 instead.
Launch the Server
Save as scripts/launch_vllm.sh:
#!/usr/bin/env bash
set -euo pipefail
MODEL_PATH=${MODEL_PATH:-./outputs/qwen3-4b-domain-awq}
LORA_PATH=${LORA_PATH:-./outputs/qwen3-4b-domain-adapter}
PORT=${PORT:-8001}
vllm serve "$MODEL_PATH" \
--port "$PORT" \
--quantization awq_marlin \
--gpu-memory-utilization 0.85 \
--max-model-len 4096 \
--max-num-seqs 16 \
--enable-prefix-caching \
--enable-lora \
--max-loras 4 \
--max-lora-rank 16 \
--lora-modules "domain=$LORA_PATH"
Run it:
chmod +x scripts/launch_vllm.sh && ./scripts/launch_vllm.sh
vLLM exposes an OpenAI-compatible API at http://localhost:8001/v1. Test it:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "domain",
"messages": [{"role":"user","content":"Convert temperature.\n\n75 F"}],
"max_tokens": 50
}'
Use "model": "domain" to hit the LoRA-adapted alias, or the base model path for un-adapted base behaviour. One process, two endpoints.
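The server takes a while to load weights after launch, so scripts that drive it should poll for readiness before sending traffic. A minimal sketch in Python, assuming the /health endpoint vLLM's OpenAI-compatible server exposes; the probe is injected so the loop itself can be exercised offline:

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


def wait_for_ready(probe: Callable[[], bool],
                   timeout_s: float = 120.0,
                   interval_s: float = 1.0) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

A deploy script can then gate traffic on `wait_for_ready(lambda: http_ok("http://localhost:8001/health"))` instead of sleeping a fixed number of seconds.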
The Flags That Matter
| Flag | What it controls | What goes wrong if you skip it |
|---|---|---|
| --quantization awq_marlin | Tells vLLM to use the Marlin INT4 kernel for AWQ weights | Loads via a slow path or fails outright |
| --gpu-memory-utilization 0.85 | Fraction of VRAM vLLM reserves for weights + KV cache | Default 0.9 OOMs on shared GPUs; lower it to share with other workloads |
| --max-model-len 4096 | Context window. Smaller = larger KV-cache budget = more concurrent users | Set this to your real workload, not the model's max |
| --max-num-seqs 16 | Concurrency cap | Default is high; cap to your KV-cache budget. Set lower if you OOM mid-batch |
| --enable-prefix-caching | KV-cache reuse for stable prefixes | You leave free latency on the table for any workload with shared system prompts |
| --enable-lora | Allows LoRA modules to be loaded on top of the base | Can't serve adapters at all without this |
| --max-loras 4 --max-lora-rank 16 | Per-process LoRA slot count and max rank | Adapters with rank > this are rejected |
| --lora-modules name=path | Pre-load adapters at startup; reference them by name in requests | Without this you can serve only the base |
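--max-model-len and --max-num-seqs trade off against each other through simple arithmetic: every token of context costs a fixed number of KV-cache bytes, so a smaller context window buys more concurrent sequences. A back-of-the-envelope sketch with illustrative dimensions (36 layers, 8 KV heads, head dim 128, FP16 cache), not any particular model's real config:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors, per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent_seqs(cache_budget_gb: float, max_model_len: int,
                        per_token: int) -> int:
    # Worst case: every sequence fills the full context window
    return int(cache_budget_gb * 1024**3 // (max_model_len * per_token))


per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)
print(per_token)                                  # 147456 bytes, ~144 KiB/token
print(max_concurrent_seqs(4.0, 4096, per_token))  # 7 full-length seqs in 4 GB
```

Halving --max-model-len roughly doubles the worst-case concurrency for the same cache budget, which is why it should track your real workload rather than the model's maximum.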
Multi-LoRA Serving
The single biggest operational win vLLM gives you: one base model, many specialised adapters, near-zero memory cost per adapter. Add more --lora-modules entries:
--lora-modules \
"support=./adapters/customer-support" \
"code-review=./adapters/code-review" \
"summariser=./adapters/summariser"
Then in requests:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"code-review","messages":[{"role":"user","content":"Review this PR..."}]}'
vLLM swaps the active LoRA per-request based on the model field. The base weights stay loaded once. This is how you serve 10 specialised models on a single GPU without 10× the memory.
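Request routing then becomes a lookup on the model field. A hypothetical sketch: the adapter names match the --lora-modules flags above, and the "domain" fallback assumes the alias from the launch script:

```python
# Hypothetical task -> adapter routing on top of the multi-LoRA server above
ADAPTERS = {"support": "support", "review": "code-review", "summary": "summariser"}


def chat_request(task: str, user_content: str) -> dict:
    """Build an OpenAI-style payload; the model field selects the LoRA."""
    model = ADAPTERS.get(task, "domain")  # unknown tasks fall back to the default adapter
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }


print(chat_request("review", "Review this PR...")["model"])  # code-review
```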
Prefix Caching: The Free Win
Most LLM workloads have a stable system prompt — "You are a helpful assistant ..." or a multi-page RAG context that's the same across many calls in a session. Prefix caching means vLLM keeps the KV cache for that shared prefix and reuses it on every subsequent call. Latency on calls with shared prefixes drops by 30–70%.
The flag costs nothing. Always enable it.
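The one thing to get right in client code: the shared prefix must be byte-identical across calls and must come first, with per-call content after it. A small sketch of that discipline, with an assumed system prompt:

```python
import json

# Stable across every call in the session; one-character drift breaks cache reuse
SYSTEM_PROMPT = "You are a helpful assistant for unit conversions."


def build_messages(question: str) -> list[dict]:
    # Shared part first and byte-identical, variable content last,
    # so the cached KV prefix covers the system prompt on every request
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


a = build_messages("75 F in C?")
b = build_messages("12 km in miles?")
# The first message serialises identically, so its KV cache can be reused
assert json.dumps(a[0]) == json.dumps(b[0])
```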
Continuous Batching (One Paragraph)
Static batching: gather N requests, run them together, return all responses when the slowest finishes. Short requests therefore wait on long ones. Continuous batching: scheduling happens at the granularity of a decoding step, so the moment one sequence finishes, its slot goes to the next waiting request. The GPU stays full, and no request is held hostage by an unrelated long one. This is the biggest single reason vLLM beats transformers.pipeline() by an order of magnitude. Deeper treatment in the vLLM deep-dive post.
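The effect can be made concrete with a toy simulation under heavily idealised assumptions: one token per time step, a freed slot is refilled instantly, and prefill is free:

```python
import heapq


def static_batching(lengths: list[int], batch: int) -> list[int]:
    """Each batch returns only when its longest member finishes."""
    done, t = [], 0
    for i in range(0, len(lengths), batch):
        chunk = lengths[i:i + batch]
        t += max(chunk)
        done.extend([t] * len(chunk))
    return done


def continuous_batching(lengths: list[int], slots: int) -> list[int]:
    """A finished sequence frees its slot for the next request immediately."""
    free_at = [0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    done = []
    for n in lengths:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + n)
        done.append(start + n)
    return done


lengths = [5, 5, 5, 100]
print(static_batching(lengths, batch=4))      # [100, 100, 100, 100]
print(continuous_batching(lengths, slots=2))  # [5, 5, 10, 105]
```

Under static batching every short request pays for the 100-token straggler; under continuous batching the short requests complete in roughly their own length.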
Swap vLLM into the Part 5 RAG System
The wrapper from Part 2 was built provider-agnostic on purpose. Swapping in vLLM is a config change, not a code change. Add a self-hosted backend to llm_client.py:
import os

from openai import OpenAI
class VLLMClient(OpenAIClient):
"""Same as OpenAIClient but talks to a local vLLM endpoint."""
def __init__(self, model: str = "domain", tracker: CostTracker | None = None,
base_url: str = "http://localhost:8001/v1") -> None:
# vLLM's API is OpenAI-compatible — reuse the OpenAI SDK
self._client = OpenAI(api_key="EMPTY", base_url=base_url, timeout=30.0)
self._model = model
self._tracker = tracker or CostTracker()
def get_client(provider: str | None = None) -> LLMClient:
provider = (provider or os.environ.get("LLM_PROVIDER", "openai")).lower()
if provider == "openai": return OpenAIClient(tracker=_TRACKER)
if provider == "anthropic": return AnthropicClient(tracker=_TRACKER)
if provider == "vllm": return VLLMClient(tracker=_TRACKER)
raise ValueError(f"unknown provider: {provider}")
Now switch the FastAPI service to use it:
export LLM_PROVIDER=vllm
uv run uvicorn text_tool.server:app --host 0.0.0.0 --port 8000
The /rag endpoint from Part 5 now grounds answers via your own quantised, fine-tuned model. Zero downstream API spend. The same Pydantic schemas, the same retrieval, the same FastAPI endpoint — only the LLM provider has changed. That is the whole point of the wrapper from Part 2.
Update the Pricing Map
The CostTracker from Part 2 prices OpenAI/Anthropic per token. Self-hosted vLLM has a different cost model — GPU-hour cost amortised over throughput. Add a flat zero entry so you don't get bogus dollar figures:
PRICING = {
"domain": {"in": 0.0, "out": 0.0}, # self-hosted; cost lives in the GPU bill
"gpt-4o-mini": {"in": 0.15, "out": 0.60},
# ...
}
For real cost accounting on a self-hosted endpoint, divide GPU $/hour by tokens/hour; that is your effective per-token cost. Track it separately from API spend: you pay per GPU-hour regardless of traffic, so the number that actually appears on the bill is GPU $/hour, and the derived $/token falls as throughput rises.
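A worked example with made-up numbers, neither a real GPU price nor a measured throughput:

```python
def per_million_token_cost(gpu_dollars_per_hour: float,
                           tokens_per_second: float) -> float:
    """Amortise GPU-hour cost over sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Illustrative: a $1.20/hr GPU sustaining 800 output tok/s across the batch
print(round(per_million_token_cost(1.20, 800), 3))  # 0.417 dollars per 1M tokens
```

The same GPU at 80 tok/s costs ten times as much per token, which is why utilisation, not list price, decides whether self-hosting beats the API.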
Operational Concerns
systemd Unit
Save as /etc/systemd/system/vllm.service:
[Unit]
Description=vLLM serving stack
After=network.target
[Service]
Type=simple
User=ai
WorkingDirectory=/opt/text-tool
Environment="HF_HOME=/opt/cache/hf"
ExecStart=/opt/text-tool/scripts/launch_vllm.sh
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/vllm/stdout.log
StandardError=append:/var/log/vllm/stderr.log
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo journalctl -u vllm -f
Log Rotation
Drop into /etc/logrotate.d/vllm:
/var/log/vllm/*.log {
daily
rotate 14
compress
missingok
notifempty
copytruncate
}
GPU Monitoring
Quick check from the host:
watch -n 1 nvidia-smi
For Prometheus-style metrics, vLLM serves a /metrics endpoint on the same port; --otlp-traces-endpoint additionally exports per-request traces via OpenTelemetry. Pair with Grafana for a real dashboard. Detailed setup is in Part 10.
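The /metrics output is plain Prometheus text exposition, so it's easy to scrape ad hoc. A tiny parser over a captured sample; the sample lines mirror the shape of vLLM's metrics, though exact metric names vary by version:

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus exposition lines into {name: value}."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # ignore malformed lines in this sketch
    return out


# Captured sample in the shape vLLM emits (names vary by version)
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.42
"""

m = parse_metrics(SAMPLE)
print(m["vllm:num_requests_running"])  # 3.0
```

In production you'd point Prometheus at the endpoint directly; this sketch is for quick shell-side checks when a dashboard is overkill.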
When vLLM Is the Wrong Choice
- Very small models (<1B). vLLM's overhead dominates; raw transformers or llama.cpp wins
- CPU-only deployment. Use llama.cpp with the GGUF from Part 7
- Single-user workload. Use Ollama. It's friendlier and the multi-user advantage doesn't matter
- Apple Silicon. vLLM doesn't target it; llama.cpp / Ollama do, natively and well
Key Takeaways
- vLLM is the default GPU serving stack in 2026 because continuous batching + paged attention + prefix caching is exactly what an LLM workload needs
- One vLLM process, one base model, many LoRA adapters — pick the adapter per request via the model field
- Always enable prefix caching. It's free latency reduction
- The Part 2 client wrapper makes the swap from API to self-hosted a one-line config change
- Run it under systemd with log rotation. nvidia-smi is a crude but effective monitoring loop
- vLLM is the wrong tool for tiny models, CPU-only deployment, or Apple Silicon. Use llama.cpp / Ollama there
Next Up
Part 9 takes a different optimisation angle: distillation. Use the Part 6 fine-tuned model as a teacher; train a much smaller student that runs on a laptop. The student plus the AWQ quantization compose — you can have both.