Hardware: the same stack you've built across Parts 3, 5, and 8 — FastAPI service + Qdrant + vLLM. Python: 3.11+.
The Threshold
Every project in this series so far works for one user. The interesting failures start when you have ten. The interesting engineering starts when you have a thousand. This post is about the patterns that hold the stack up across that gap.
Concurrency: Async All the Way
LLM calls are I/O-bound. Most of a request's wall-clock time is spent waiting on the network, and a synchronous worker can do nothing else while it waits. The default for an LLM service should be async everywhere:
```python
# Async wrapper around the OpenAI/Anthropic client.
import os

from openai import AsyncOpenAI


class AsyncOpenAIClient:
    def __init__(self, model: str, base_url: str | None = None) -> None:
        self._client = AsyncOpenAI(
            api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
            base_url=base_url,
            timeout=30.0,
        )
        self._model = model

    async def complete(self, prompt: str, **kw) -> str:
        resp = await self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
            **kw,
        )
        return resp.choices[0].message.content or ""
```
For parallelisable subtasks (a fan-out RAG that hits 5 retrievers; an agent that calls 3 tools concurrently), use asyncio.gather:
```python
results = await asyncio.gather(
    client.complete(prompt_a),
    client.complete(prompt_b),
    client.complete(prompt_c),
)
```
Three calls, one wait. Same pattern at the API layer: async def endpoints in FastAPI. Don't slip back to synchronous database calls or file reads inside an async endpoint — they block the whole event loop.
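If a blocking call really is unavoidable inside an async endpoint, push it off the event loop instead. A minimal sketch using asyncio.to_thread; load_report_from_disk is a hypothetical stand-in for any blocking helper:

```python
import asyncio

from fastapi import FastAPI

app = FastAPI()


@app.get("/report/{report_id}")
async def get_report(report_id: str) -> dict:
    # The blocking read runs in a worker thread; the event loop keeps serving.
    # load_report_from_disk is a stand-in for any synchronous call you can't avoid.
    text = await asyncio.to_thread(load_report_from_disk, report_id)
    return {"report_id": report_id, "text": text}
```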
Connection Pooling
Naive code creates a new HTTP client per request. The setup cost (DNS lookup, TCP connect, TLS handshake) eats into every call. Use a process-wide singleton:
```python
import httpx
from contextlib import asynccontextmanager

from fastapi import FastAPI

_HTTP_CLIENT: httpx.AsyncClient | None = None


def http() -> httpx.AsyncClient:
    global _HTTP_CLIENT
    if _HTTP_CLIENT is None:
        _HTTP_CLIENT = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_keepalive_connections=20,
                max_connections=100,
                keepalive_expiry=60.0,
            ),
        )
    return _HTTP_CLIENT


# In FastAPI lifespan: close the pooled client on shutdown.
@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    if _HTTP_CLIENT is not None:
        await _HTTP_CLIENT.aclose()
```
max_keepalive_connections is the number of connections kept warm between requests; max_connections is the hard ceiling on concurrent connections. For a service that fans out to multiple downstream APIs, this is the difference between latency that grows with load and latency that stays flat.
Caching at Every Layer
| Layer | What | How |
|---|---|---|
| HTTP edge | Identical requests inside a TTL | Cloudflare / nginx; respects Cache-Control headers from your service |
| Application | Deterministic prompts (temp 0) | Redis with prompt-hash key; ~10 ms vs ~2,000 ms LLM call |
| Embedding | Same query computed twice | LRU cache or Redis; embedding is small (a few KB) so cache hit ratios are high |
| vLLM prefix cache | Shared system prompts / RAG context | One flag in Part 8; reuses KV across calls with same prefix |
Each layer multiplies the savings. A cached HTTP response never touches the embedding cache; a cached embedding never enters the LLM cache; a cache miss at every level still benefits from the vLLM prefix cache. The compounding effect is enormous on real workloads.
Application-level cache, simplest version:
```python
import hashlib

import redis.asyncio as redis  # uv add redis==5.2.0

_REDIS = redis.from_url("redis://localhost:6379")


async def cached_complete(client, prompt: str, *, ttl: int = 86400) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = await _REDIS.get(key)
    if hit:
        return hit.decode()
    result = await client.complete(prompt)
    await _REDIS.set(key, result, ex=ttl)
    return result
```
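The embedding layer from the table can use the same idea in-process. A minimal sketch, simpler than a true LRU; embed_query is a stand-in for however your service computes embeddings, and the cache bound is illustrative:

```python
import hashlib

_EMBED_CACHE: dict[str, list[float]] = {}
_EMBED_CACHE_MAX = 10_000  # illustrative bound


async def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _EMBED_CACHE:
        return _EMBED_CACHE[key]
    vector = await embed_query(text)  # stand-in for your embedding call
    if len(_EMBED_CACHE) >= _EMBED_CACHE_MAX:
        # Evict the oldest insertion to keep memory bounded.
        _EMBED_CACHE.pop(next(iter(_EMBED_CACHE)))
    _EMBED_CACHE[key] = vector
    return vector
```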
Rate Limiting and Admission Control
Rate limiting (Part 3 with slowapi) decides whether a single client is over their quota. Admission control decides whether the service can take any more work. They're different problems:
- Rate limit: 60 requests / minute / API key. Returns 429
- Admission: in-flight LLM calls > threshold → queue or reject. Protects against thundering herd
```python
import asyncio

_INFLIGHT = asyncio.Semaphore(50)


async def serve(prompt: str) -> str:
    async with _INFLIGHT:
        return await client.complete(prompt)
```
Past 50 in-flight calls, the next request blocks until one finishes. For a public API you'd want a queue with a timeout instead, so users aren't held indefinitely. For an internal API, the semaphore is enough.
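A sketch of the public-API variant, assuming Python 3.11+ for asyncio.timeout: wait briefly for a slot, then shed load with a 503 rather than holding the caller indefinitely:

```python
import asyncio

from fastapi import HTTPException

_INFLIGHT = asyncio.Semaphore(50)


async def serve_public(prompt: str) -> str:
    try:
        # Wait at most 2 seconds for a slot before giving up.
        async with asyncio.timeout(2.0):
            await _INFLIGHT.acquire()
    except TimeoutError:
        raise HTTPException(status_code=503, detail="server busy, retry shortly")
    try:
        return await client.complete(prompt)
    finally:
        _INFLIGHT.release()
```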
Graceful Degradation
The LLM call should be enriching, not load-bearing, wherever possible. Detailed treatment in Designing a Consumer-Tier AI Product That Degrades Gracefully. The pattern in code:
```python
import asyncio


async def enriched_response(query: str) -> dict:
    base = compute_deterministic_response(query)  # always succeeds
    try:
        ai_layer = await asyncio.wait_for(call_llm(query), timeout=5.0)
        return {**base, "ai_extras": ai_layer}
    except Exception:  # includes asyncio.TimeoutError from wait_for
        log.warning("ai layer failed; serving deterministic response only")
        return base
```
The user-visible failure mode becomes "AI features missing" rather than "request failed". Where this is appropriate is the architectural decision; where it isn't, fail loudly.
Observability
Structured Logging
```python
# uv add structlog==25.1.0
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()

# Use it everywhere, with structured fields:
log.info("rag.query", question_len=len(q), top_k=k, hits=len(hits))
log.info("llm.call", model=m, in_tok=in_t, out_tok=out_t, latency_ms=lat)
```
JSON output goes to stdout. Pipe to Loki, Datadog, or whatever ingests JSON logs. The fields are queryable, the timestamps are sortable, and there's no string-parsing fragility.
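The merge_contextvars processor pays off once you bind request-scoped fields at the start of each request; every log line inside that request then carries them automatically. A minimal FastAPI middleware sketch, with illustrative field names:

```python
import uuid

import structlog
from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def bind_request_context(request: Request, call_next):
    # Reset leftover context from the previous request on this worker.
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=str(uuid.uuid4()),
        path=request.url.path,
    )
    return await call_next(request)
```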
Metrics with Prometheus
```python
# uv add prometheus-client==0.21.0 starlette-exporter==0.23.0
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter("llm_calls_total", "LLM calls", ["model", "status"])
LLM_LATENCY = Histogram("llm_latency_seconds", "LLM latency", ["model"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens used", ["model", "direction"])


# Wrap the LLM call:
async def tracked_complete(client, model: str, prompt: str) -> str:
    with LLM_LATENCY.labels(model).time():
        try:
            result = await client.complete(prompt)
            LLM_CALLS.labels(model, "ok").inc()
            return result
        except Exception:
            LLM_CALLS.labels(model, "error").inc()
            raise
```
Mount the metrics endpoint:
```python
from starlette_exporter import PrometheusMiddleware, handle_metrics

app.add_middleware(PrometheusMiddleware)
app.add_route("/metrics", handle_metrics)
```
Scrape /metrics from Prometheus. Build dashboards in Grafana. Alert when:
- Error rate > 1% over 5 minutes
- p95 latency > 5s
- Queue depth (in-flight semaphore) saturated for > 1 minute
- Daily token spend > threshold
Cost Monitoring
The CostTracker from Part 2 logs per-call cost. Aggregate it:
```python
DAILY_USD_BUDGET = 50.0
_DAILY_SPEND = 0.0


def record(cost: float) -> None:
    global _DAILY_SPEND
    _DAILY_SPEND += cost
    if _DAILY_SPEND > DAILY_USD_BUDGET:
        log.error("daily.budget.exceeded", spent=_DAILY_SPEND, budget=DAILY_USD_BUDGET)
        # trip a circuit breaker, page on-call, etc.
```
Cron a daily reset; in a multi-process deployment, use Redis for the counter so all workers share state. The "stop-the-bleed" mechanism on cost — switching the LLM client to a self-hosted vLLM, returning a static fallback, or rejecting traffic outright — is the kind of decision you want to make before the bill arrives.
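A sketch of the multi-process variant, reusing the redis.asyncio client (_REDIS) and log from the earlier examples; the key scheme and expiry are assumptions:

```python
import datetime

DAILY_USD_BUDGET = 50.0


async def record(cost_usd: float) -> None:
    # One counter per calendar day, shared by every worker process.
    key = "spend:" + datetime.date.today().isoformat()
    spent = await _REDIS.incrbyfloat(key, cost_usd)
    await _REDIS.expire(key, 2 * 86400)  # old counters lapse on their own
    if spent > DAILY_USD_BUDGET:
        log.error("daily.budget.exceeded", spent=spent, budget=DAILY_USD_BUDGET)
```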
Deploy-Time Eval
Every deploy runs a small eval set against the new model/config. Catches regressions before users do. Save as scripts/deploy_eval.py:
"""Run the held-out eval against the deployed endpoint. Fail the deploy if quality drops."""
import json, sys
from openai import OpenAI
ENDPOINT = sys.argv[1] # e.g. "http://staging.internal:8000/rag"
THRESHOLD = 0.85 # required exact-match rate
EVAL = [json.loads(l) for l in open("data/deploy_eval.jsonl")]
import requests
hits = 0
for case in EVAL:
r = requests.post(ENDPOINT, json={"question": case["q"], "top_k": 5}, timeout=30)
answer = r.json()["answer"].strip().lower()
if any(kw.lower() in answer for kw in case["expected_keywords"]):
hits += 1
rate = hits / len(EVAL)
print(f"deploy eval: {hits}/{len(EVAL)} ({rate:.0%})")
sys.exit(0 if rate >= THRESHOLD else 1)
Wire this into your CI / deploy pipeline. Exit code drives whether the deploy proceeds.
Beyond One Box
When you outgrow a single GPU, the patterns:
- Replicate the vLLM stack horizontally. Multiple GPUs, multiple processes, a load balancer in front. Sticky routing by request-hash gives prefix-cache locality: the same conversation lands on the same backend (a minimal routing sketch follows this list)
- Shard a model across GPUs. vLLM's --tensor-parallel-size N splits the model across N GPUs. Use it when one GPU can't hold the model; otherwise replicate
- Separate inference targets per workload tier. Pattern from the graceful-degradation post: enterprise tier on a 32B AWQ on big GPUs, consumer tier on a 4B on smaller GPUs, separate failure domains
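A minimal sketch of the sticky-routing idea from the first bullet; the backend list and the choice of conversation id as the hash key are assumptions for illustration:

```python
import hashlib

BACKENDS = ["http://gpu-0:8000", "http://gpu-1:8000", "http://gpu-2:8000"]


def pick_backend(conversation_id: str) -> str:
    # Same conversation -> same hash -> same backend -> warm prefix cache.
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]
```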
What's Not in This Post
Each of these is its own series:
- Full Kubernetes setups. Helm charts, autoscaling, GPU operator, NodePool selectors. Use Helm + a managed cluster (EKS / GKE) when you need it; otherwise systemd is fine for a long time
- Multi-region deployments. Latency-aware routing, region-local caches, eventual-consistency on KV stores. Most projects don't need this. The ones that do already have a team
- Regulatory compliance. Data residency, audit logs, role-based access. Talk to your legal team. Architecture decisions follow what they tell you, not the other way around
Shipping AI to Production: One-Page Checklist
Use this as a deploy gate. Each line is a yes/no.
| Category | Check |
|---|---|
| API | Every request body validated by a Pydantic schema |
| API | Auth required on every non-health endpoint |
| API | Rate limiting per API key |
| API | Admission control / inflight semaphore set |
| API | Streaming endpoints handled, not just batch |
| Async | All endpoints are async def |
| Async | HTTP client is a process-wide singleton with keepalive |
| LLM | Wrapper module isolates retries, cost tracking, provider swap |
| LLM | Timeout set explicitly (not SDK default) |
| LLM | Cost tracked per call, aggregated per day, alert on budget |
| LLM | Cache layer for deterministic prompts |
| vLLM | Prefix caching enabled |
| vLLM | GPU memory utilisation set, not default |
| vLLM | Running under systemd with restart-on-failure |
| RAG | Chunks validated against source documents (no parser bugs) |
| RAG | Citations returned with answers |
| RAG | Eval set built; precision@k tracked |
| Logging | Structured JSON logs, queryable fields |
| Metrics | Prometheus /metrics endpoint exposed |
| Metrics | Alerts on error rate, p95 latency, queue saturation, daily spend |
| Reliability | Graceful degradation: deterministic core ships even when LLM fails (where appropriate) |
| Deploy | Eval set runs against the new deploy; fails the deploy if quality drops |
| Deploy | Health endpoint exists, distinguishes liveness from readiness |
Series Wrap-Up
Across the 10 parts you've built:
- A typed, tested CLI scaffold (Part 1)
- A provider-agnostic LLM wrapper with retries, streaming, structured output, cost tracking (Part 2)
- A FastAPI service with auth, rate limiting, streaming, and a Dockerfile (Part 3)
- Embeddings + Qdrant semantic search bolted onto the same service (Part 4)
- A full RAG pipeline with hybrid retrieval and citations (Part 5)
- A QLoRA fine-tune of a 4B instruct model on your own dataset (Part 6)
- AWQ + GGUF quantization with benchmarks (Part 7)
- A vLLM serving stack with multi-LoRA, prefix caching, and a config-only swap into the RAG (Part 8)
- A 1.5B distilled student that retains most of the teacher's behaviour (Part 9)
- A hardened version of the whole stack with metrics, structured logs, deploy-time eval (this part)
Where to Read Next
- How I Build AI Agents That Actually Work in Production — agentic patterns that turn this stack into a multi-agent system
- How I Get the Best Out of My GPU Using vLLM — the "why" behind the techniques in Part 8
- Demystifying LLM Quantization — theory behind Part 7
- Generating Long-Form Compliance Documents, Hybrid-RAG over Statute, Graceful Degradation, Audio-Video Diffusion on a Single GPU — case studies of stacks like this one in real production
Key Takeaways
- Async everywhere. LLM workloads are I/O-bound; sync code wastes the easy wins
- Connection pool the HTTP client. Don't pay TLS handshakes per call
- Cache at every layer: HTTP, application, embedding, vLLM prefix. The savings compound
- Rate-limit per key. Add admission control on top to protect the service from itself
- Make the LLM enriching, not load-bearing, where you can. Failure mode becomes "AI missing" not "site down"
- Structured logs. Prometheus metrics. Alerts on error rate, latency, and spend
- Deploy-time eval. Catch regressions before users do
- One vLLM, one base, many adapters; replicate horizontally before sharding
- The shipping checklist is more useful than any one trick. Run it before every release
The End
That's the series. You've gone from a CLI that counts words to a self-hosted, fine-tuned, quantized, distilled, RAG-augmented, monitored AI service. Build something with it.