Hardware: the same stack you've built across Parts 3, 5, and 8 — FastAPI service + Qdrant + vLLM. Python: 3.11+.

The Threshold

Every project in this series so far works for one user. The interesting failures start when you have ten. The interesting engineering starts when you have a thousand. This post is about the patterns that hold the stack up across that gap.

Concurrency: Async All the Way

LLM calls are I/O-bound. Every wait on the network is wasted CPU on the server side. The default for an LLM service should be async everywhere:

# Async wrapper around the OpenAI/Anthropic client.
import os

from openai import AsyncOpenAI


class AsyncOpenAIClient:
    def __init__(self, model: str, base_url: str | None = None) -> None:
        self._client = AsyncOpenAI(
            api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
            base_url=base_url,
            timeout=30.0,
        )
        self._model = model

    async def complete(self, prompt: str, **kw) -> str:
        resp = await self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
            **kw,
        )
        return resp.choices[0].message.content or ""

For parallelisable subtasks (a fan-out RAG that hits 5 retrievers; an agent that calls 3 tools concurrently), use asyncio.gather:

results = await asyncio.gather(
    client.complete(prompt_a),
    client.complete(prompt_b),
    client.complete(prompt_c),
)

Three calls, one wait. Same pattern at the API layer: async def endpoints in FastAPI. Don't slip back to synchronous database calls or file reads inside an async endpoint — they block the whole event loop.
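
When a blocking call is unavoidable (a legacy SDK, a heavy parser), push it onto a worker thread rather than calling it directly in the endpoint. A minimal sketch, assuming FastAPI; parse_pdf here is a hypothetical stand-in for whatever the blocking work is:

import asyncio

from fastapi import FastAPI

app = FastAPI()


def parse_pdf(path: str) -> str:
    """Stand-in for any blocking call: a sync SDK, a heavy parser, a slow disk read."""
    with open(path, "rb") as f:
        return f.read().decode(errors="ignore")


@app.get("/report")
async def report(path: str) -> dict:
    # asyncio.to_thread moves the blocking call onto a worker thread,
    # so the event loop keeps serving other requests in the meantime.
    text = await asyncio.to_thread(parse_pdf, path)
    return {"chars": len(text)}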

Connection Pooling

Naive code creates an HTTP client per request. The setup cost — DNS, TLS handshake, TCP — eats into every call. Use a process-wide singleton:

import httpx

_HTTP_CLIENT: httpx.AsyncClient | None = None


def http() -> httpx.AsyncClient:
    global _HTTP_CLIENT
    if _HTTP_CLIENT is None:
        _HTTP_CLIENT = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_keepalive_connections=20,
                max_connections=100,
                keepalive_expiry=60.0,
            ),
        )
    return _HTTP_CLIENT


# In FastAPI lifespan (pass it via FastAPI(lifespan=lifespan)):
from contextlib import asynccontextmanager


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    if _HTTP_CLIENT is not None:
        await _HTTP_CLIENT.aclose()

The max_keepalive_connections is the number of connections kept warm between requests. max_connections is the upper bound. For a service that fans out to multiple downstream APIs, this is the difference between latency that grows linearly with load and latency that stays flat.
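
Using the pooled client looks the same as using an ad-hoc one, minus the per-request setup. A sketch of a fan-out that reuses the shared pool; the two internal URLs are placeholders:

import asyncio


async def fan_out(query: str) -> list[dict]:
    # Both requests reuse warm keepalive connections from the shared pool,
    # so there is no fresh DNS lookup or TLS handshake per call.
    search, rerank = await asyncio.gather(
        http().post("http://search.internal/query", json={"q": query}),
        http().post("http://rerank.internal/score", json={"q": query}),
    )
    return [search.json(), rerank.json()]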

Caching at Every Layer

Layer             | What                                | How
HTTP edge         | Identical requests inside a TTL     | Cloudflare / nginx; respects Cache-Control headers from your service
Application       | Deterministic prompts (temp 0)      | Redis with prompt-hash key; ~10 ms vs ~2,000 ms LLM call
Embedding         | Same query computed twice           | LRU cache or Redis; embeddings are small (a few KB), so caching them is cheap
vLLM prefix cache | Shared system prompts / RAG context | One flag in Part 8; reuses the KV cache across calls with the same prefix

Each layer multiplies the savings. A hit at the HTTP edge never reaches the application; an application-cache hit skips both the embedding call and the LLM; an embedding-cache hit still calls the LLM but skips the embedding model; and a miss at every level still benefits from the vLLM prefix cache. With even a 30% hit rate at the HTTP and application layers alone, roughly half of requests never touch the LLM.

Application-level cache, simplest version:

import hashlib
import redis.asyncio as redis  # uv add redis==5.2.0

_REDIS = redis.from_url("redis://localhost:6379")


async def cached_complete(client, prompt: str, *, ttl: int = 86400) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = await _REDIS.get(key)
    if hit:
        return hit.decode()
    result = await client.complete(prompt)
    await _REDIS.set(key, result, ex=ttl)
    return result
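
The embedding layer follows the same shape. A minimal in-process version; embed() stands in for whichever async embedding call the service already has (swap in Redis if the cache must be shared across workers):

import hashlib

_EMBED_CACHE: dict[str, list[float]] = {}
_EMBED_CACHE_MAX = 10_000


async def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _EMBED_CACHE:
        return _EMBED_CACHE[key]
    vector = await embed(text)  # your existing async embedding call
    if len(_EMBED_CACHE) >= _EMBED_CACHE_MAX:
        _EMBED_CACHE.pop(next(iter(_EMBED_CACHE)))  # crude FIFO eviction
    _EMBED_CACHE[key] = vector
    return vector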

Rate Limiting and Admission Control

Rate limiting (Part 3 with slowapi) decides whether a single client is over their quota. Admission control decides whether the service can take any more work. They're different problems:

  • Rate limit: 60 requests / minute / API key. Returns 429
  • Admission: in-flight LLM calls > threshold → queue or reject. Protects against thundering herd

The simplest admission control is a process-wide semaphore around the LLM call:

import asyncio

_INFLIGHT = asyncio.Semaphore(50)


async def serve(prompt: str) -> str:
    async with _INFLIGHT:
        return await client.complete(prompt)

Past 50 in-flight calls, the next request blocks until one finishes. For a public API you'd want a queue with a timeout instead, so users aren't held indefinitely. For an internal API, the semaphore is enough.
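
A sketch of the queue-with-timeout variant for a public API, assuming FastAPI and reusing the client from earlier: the caller waits a few seconds for a slot and then gets a 503 instead of hanging:

import asyncio

from fastapi import HTTPException

_INFLIGHT = asyncio.Semaphore(50)
QUEUE_WAIT_S = 5.0


async def serve_public(prompt: str) -> str:
    try:
        # Wait at most QUEUE_WAIT_S for a free slot instead of blocking forever.
        await asyncio.wait_for(_INFLIGHT.acquire(), timeout=QUEUE_WAIT_S)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=503, detail="server busy, retry shortly")
    try:
        return await client.complete(prompt)
    finally:
        _INFLIGHT.release()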

Graceful Degradation

The LLM call should be enriching, not load-bearing, wherever possible. Detailed treatment in Designing a Consumer-Tier AI Product That Degrades Gracefully. The pattern in code:

async def enriched_response(query: str) -> dict:
    base = compute_deterministic_response(query)   # always succeeds
    try:
        ai_layer = await asyncio.wait_for(call_llm(query), timeout=5.0)
        return {**base, "ai_extras": ai_layer}
    except Exception:  # asyncio.TimeoutError is already an Exception subclass
        log.warning("ai layer failed; serving deterministic response only")
        return base

The user-visible failure mode becomes "AI features missing" rather than "request failed". Where this is appropriate is the architectural decision; where it isn't, fail loudly.

Observability

Structured Logging

uv add structlog==25.1.0

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()


# Use it everywhere, with structured fields:
log.info("rag.query", question_len=len(q), top_k=k, hits=len(hits))
log.info("llm.call",  model=m, in_tok=in_t, out_tok=out_t, latency_ms=lat)

JSON output goes to stdout. Pipe to Loki, Datadog, or whatever ingests JSON logs. The fields are queryable, the timestamps are sortable, and there's no string-parsing fragility.
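
The merge_contextvars processor in the config above is what makes per-request fields show up on every line. Bind them once per request in a middleware; a sketch, with the header name as an assumption:

import uuid

import structlog
from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def bind_request_context(request: Request, call_next):
    # Every log call made while handling this request inherits these fields
    # through the merge_contextvars processor configured above.
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=request.headers.get("x-request-id", str(uuid.uuid4())),
        path=request.url.path,
    )
    return await call_next(request)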

Metrics with Prometheus

uv add prometheus-client==0.21.0 starlette-exporter==0.23.0

from prometheus_client import Counter, Histogram

LLM_CALLS = Counter("llm_calls_total", "LLM calls", ["model", "status"])
LLM_LATENCY = Histogram("llm_latency_seconds", "LLM latency", ["model"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens used", ["model", "direction"])


# Wrap the LLM call:
async def tracked_complete(client, model: str, prompt: str) -> str:
    with LLM_LATENCY.labels(model).time():
        try:
            result = await client.complete(prompt)
            LLM_CALLS.labels(model, "ok").inc()
            return result
        except Exception:
            LLM_CALLS.labels(model, "error").inc()
            raise

Mount the metrics endpoint:

from starlette_exporter import PrometheusMiddleware, handle_metrics
app.add_middleware(PrometheusMiddleware)
app.add_route("/metrics", handle_metrics)

Scrape /metrics from Prometheus. Build dashboards in Grafana. Alert when:

  • Error rate > 1% over 5 minutes
  • p95 latency > 5s
  • Queue depth (in-flight semaphore) saturated for > 1 minute (see the gauge sketch after this list)
  • Daily token spend > threshold
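
The queue-depth alert needs something to fire on. One way to expose the in-flight count is a Gauge wrapped around the admission semaphore from earlier; the metric name is an assumption:

from prometheus_client import Gauge

INFLIGHT_GAUGE = Gauge("llm_inflight_requests", "LLM calls currently in flight")


async def serve(prompt: str) -> str:
    async with _INFLIGHT:
        # inc on entry, dec on exit: when the value sits at the semaphore
        # limit for a minute, the queue-depth alert fires.
        INFLIGHT_GAUGE.inc()
        try:
            return await client.complete(prompt)
        finally:
            INFLIGHT_GAUGE.dec()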

Cost Monitoring

The CostTracker from Part 2 logs per-call cost. Aggregate it:

DAILY_USD_BUDGET = 50.0
_DAILY_SPEND = 0.0


def record(cost: float) -> None:
    global _DAILY_SPEND
    _DAILY_SPEND += cost
    if _DAILY_SPEND > DAILY_USD_BUDGET:
        log.error("daily.budget.exceeded", spent=_DAILY_SPEND, budget=DAILY_USD_BUDGET)
        # — trip a circuit breaker, page on-call, etc.

Cron a daily reset; in a multi-process deployment, use Redis for the counter so all workers share state. The "stop-the-bleed" mechanism on cost — switching the LLM client to a self-hosted vLLM, returning a static fallback, or rejecting traffic outright — is the kind of decision you want to make before the bill arrives.
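
A sketch of the shared counter, reusing the Redis connection from the caching section; the key scheme is an assumption, and date-keying the counter makes the daily reset implicit:

from datetime import datetime, timezone


async def record_shared(cost: float) -> float:
    """Accumulate today's spend in Redis so every worker sees the same total."""
    key = "spend:" + datetime.now(timezone.utc).strftime("%Y-%m-%d")
    total = await _REDIS.incrbyfloat(key, cost)
    await _REDIS.expire(key, 2 * 86400)  # old keys clean themselves up
    if total > DAILY_USD_BUDGET:
        log.error("daily.budget.exceeded", spent=total, budget=DAILY_USD_BUDGET)
    return total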

Deploy-Time Eval

Every deploy runs a small eval set against the new model/config. Catches regressions before users do. Save as scripts/deploy_eval.py:

"""Run the held-out eval against the deployed endpoint. Fail the deploy if quality drops."""
import json, sys
from openai import OpenAI

ENDPOINT = sys.argv[1]   # e.g. "http://staging.internal:8000/rag"
THRESHOLD = 0.85         # required exact-match rate
EVAL = [json.loads(l) for l in open("data/deploy_eval.jsonl")]

import requests
hits = 0
for case in EVAL:
    r = requests.post(ENDPOINT, json={"question": case["q"], "top_k": 5}, timeout=30)
    answer = r.json()["answer"].strip().lower()
    if any(kw.lower() in answer for kw in case["expected_keywords"]):
        hits += 1

rate = hits / len(EVAL)
print(f"deploy eval: {hits}/{len(EVAL)} ({rate:.0%})")
sys.exit(0 if rate >= THRESHOLD else 1)

Wire this into your CI / deploy pipeline. Exit code drives whether the deploy proceeds.

Beyond One Box

When you outgrow a single GPU, the patterns:

  • Replicate the vLLM stack horizontally. Multiple GPUs, multiple processes, a load balancer in front. Sticky routing by request-hash gives prefix-cache locality — the same conversation lands on the same backend (see the routing sketch after this list)
  • Shard a model across GPUs. vLLM's --tensor-parallel-size N splits the model across N GPUs. Use when one GPU can't hold the model; otherwise replicate
  • Separate inference targets per workload tier. Pattern from the graceful-degradation post: enterprise tier on a 32B AWQ on big GPUs, consumer tier on a 4B on smaller GPUs, separate failure domains
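
A minimal sketch of the request-hash routing, with placeholder backend URLs; naive modulo hashing reshuffles everything when the pool changes, so use consistent hashing or your load balancer's hash-based routing if backends come and go:

import hashlib

VLLM_BACKENDS = [
    "http://gpu-0.internal:8000",
    "http://gpu-1.internal:8000",
    "http://gpu-2.internal:8000",
]


def pick_backend(session_id: str) -> str:
    # Hash the conversation/session id so every turn of the same conversation
    # lands on the same backend and hits its warm prefix cache.
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return VLLM_BACKENDS[int(digest, 16) % len(VLLM_BACKENDS)]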

What's Not in This Post

Each of these is its own series:

  • Full Kubernetes setups. Helm charts, autoscaling, GPU operator, NodePool selectors. Use Helm + a managed cluster (EKS / GKE) when you need it; otherwise systemd is fine for a long time
  • Multi-region deployments. Latency-aware routing, region-local caches, eventual-consistency on KV stores. Most projects don't need this. The ones that do already have a team
  • Regulatory compliance. Data residency, audit logs, role-based access. Talk to your legal team. Architecture decisions follow what they tell you, not the other way around

Shipping AI to Production: One-Page Checklist

Use this as a deploy gate. Each line is a yes/no.

Category    | Check
API         | Every request body validated by a Pydantic schema
API         | Auth required on every non-health endpoint
API         | Rate limiting per API key
API         | Admission control / in-flight semaphore set
API         | Streaming endpoints handled, not just batch
Async       | All endpoints are async def
Async       | HTTP client is a process-wide singleton with keepalive
LLM         | Wrapper module isolates retries, cost tracking, provider swap
LLM         | Timeout set explicitly (not SDK default)
LLM         | Cost tracked per call, aggregated per day, alert on budget
LLM         | Cache layer for deterministic prompts
vLLM        | Prefix caching enabled
vLLM        | GPU memory utilisation set, not default
vLLM        | Running under systemd with restart-on-failure
RAG         | Chunks validated against source documents (no parser bugs)
RAG         | Citations returned with answers
RAG         | Eval set built; precision@k tracked
Logging     | Structured JSON logs, queryable fields
Metrics     | Prometheus /metrics endpoint exposed
Metrics     | Alerts on error rate, p95 latency, queue saturation, daily spend
Reliability | Graceful degradation: deterministic core ships even when LLM fails (where appropriate)
Deploy      | Eval set runs against the new deploy; fails the deploy if quality drops
Deploy      | Health endpoint exists, distinguishes liveness from readiness

Series Wrap-Up

Across the 10 parts you've built:

  • A typed, tested CLI scaffold (Part 1)
  • A provider-agnostic LLM wrapper with retries, streaming, structured output, cost tracking (Part 2)
  • A FastAPI service with auth, rate limiting, streaming, and a Dockerfile (Part 3)
  • Embeddings + Qdrant semantic search bolted onto the same service (Part 4)
  • A full RAG pipeline with hybrid retrieval and citations (Part 5)
  • A QLoRA fine-tune of a 4B instruct model on your own dataset (Part 6)
  • AWQ + GGUF quantization with benchmarks (Part 7)
  • A vLLM serving stack with multi-LoRA, prefix caching, and a config-only swap into the RAG (Part 8)
  • A 1.5B distilled student that retains most of the teacher's behaviour (Part 9)
  • A hardened version of the whole stack with metrics, structured logs, deploy-time eval (this part)

Key Takeaways

  • Async everywhere. LLM workloads are I/O-bound; sync code wastes the easy wins
  • Connection pool the HTTP client. Don't pay TLS handshakes per call
  • Cache at every layer: HTTP, application, embedding, vLLM prefix. The savings compound
  • Rate-limit per key. Add admission control on top to protect the service from itself
  • Make the LLM enriching, not load-bearing, where you can. Failure mode becomes "AI missing" not "site down"
  • Structured logs. Prometheus metrics. Alerts on error rate, latency, and spend
  • Deploy-time eval. Catch regressions before users do
  • One vLLM, one base, many adapters; replicate horizontally before sharding
  • The shipping checklist is more useful than any one trick. Run it before every release

The End

That's the series. You've gone from a CLI that counts words to a self-hosted, fine-tuned, quantized, distilled, RAG-augmented, monitored AI service. Build something with it.