Hardware: the same stack you've built across Parts 3, 5, and 8 — FastAPI service + Qdrant + vLLM. Python: 3.11+.
The Threshold
Every project in this series so far works for one user. The interesting failures start when you have ten. The interesting engineering starts when you have a thousand. This post is about the patterns that hold the stack up across that gap.
Concurrency: Async All the Way
LLM calls are I/O-bound. Most of a request's wall-clock time is spent waiting on the network, and a synchronous worker can do nothing else while it waits. The default for an LLM service should be async everywhere:
```python
# Async wrapper around the OpenAI/Anthropic client.
import os

from openai import AsyncOpenAI


class AsyncOpenAIClient:
    def __init__(self, model: str, base_url: str | None = None) -> None:
        self._client = AsyncOpenAI(
            api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
            base_url=base_url,
            timeout=30.0,
        )
        self._model = model

    async def complete(self, prompt: str, **kw) -> str:
        resp = await self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
            **kw,
        )
        return resp.choices[0].message.content or ""
```
For parallelisable subtasks (a fan-out RAG that hits 5 retrievers; an agent that calls 3 tools concurrently), use asyncio.gather:
```python
results = await asyncio.gather(
    client.complete(prompt_a),
    client.complete(prompt_b),
    client.complete(prompt_c),
)
```
Three calls, one wait. Same pattern at the API layer: async def endpoints in FastAPI. Don't slip back to synchronous database calls or file reads inside an async endpoint — they block the whole event loop.
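If a blocking call really is unavoidable inside an async endpoint, push it off the event loop instead. A minimal sketch using asyncio.to_thread; load_report_from_disk is a hypothetical stand-in for any blocking helper:

```python
import asyncio

from fastapi import FastAPI

app = FastAPI()


@app.get("/report/{report_id}")
async def get_report(report_id: str) -> dict:
    # The blocking read runs in a worker thread; the event loop keeps serving.
    # load_report_from_disk is a stand-in for any synchronous call you can't avoid.
    text = await asyncio.to_thread(load_report_from_disk, report_id)
    return {"report_id": report_id, "text": text}
```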
Connection Pooling
Naive code creates a new HTTP client per request. The setup cost (DNS lookup, TCP connect, TLS handshake) eats into every call. Use a process-wide singleton:
```python
import httpx
from contextlib import asynccontextmanager

from fastapi import FastAPI

_HTTP_CLIENT: httpx.AsyncClient | None = None


def http() -> httpx.AsyncClient:
    global _HTTP_CLIENT
    if _HTTP_CLIENT is None:
        _HTTP_CLIENT = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_keepalive_connections=20,
                max_connections=100,
                keepalive_expiry=60.0,
            ),
        )
    return _HTTP_CLIENT


# In FastAPI lifespan: close the pooled client on shutdown.
@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    if _HTTP_CLIENT is not None:
        await _HTTP_CLIENT.aclose()
```
max_keepalive_connections is the number of connections kept warm between requests; max_connections is the hard ceiling on concurrent connections. For a service that fans out to multiple downstream APIs, this is the difference between latency that grows with load and latency that stays flat.
Caching at Every Layer
| Layer | What | How |
|---|---|---|
| HTTP edge | Identical requests inside a TTL | Cloudflare / nginx; respects Cache-Control headers from your service |
| Application | Deterministic prompts (temp 0) | Redis with prompt-hash key; ~10 ms vs ~2,000 ms LLM call |
| Embedding | Same query computed twice | LRU cache or Redis; embedding is small (a few KB) so cache hit ratios are high |
| vLLM prefix cache | Shared system prompts / RAG context | One flag in Part 8; reuses KV across calls with same prefix |
Each layer multiplies the savings. A cached HTTP response never touches the embedding cache; a cached embedding never enters the LLM cache; a cache miss at every level still benefits from the vLLM prefix cache. The compounding effect is enormous on real workloads.
Application-level cache, simplest version:
```python
import hashlib

import redis.asyncio as redis  # uv add redis==5.2.0

_REDIS = redis.from_url("redis://localhost:6379")


async def cached_complete(client, prompt: str, *, ttl: int = 86400) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = await _REDIS.get(key)
    if hit:
        return hit.decode()
    result = await client.complete(prompt)
    await _REDIS.set(key, result, ex=ttl)
    return result
```
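The embedding layer from the table can use the same idea in-process. A minimal sketch, simpler than a true LRU; embed_query is a stand-in for however your service computes embeddings, and the cache bound is illustrative:

```python
import hashlib

_EMBED_CACHE: dict[str, list[float]] = {}
_EMBED_CACHE_MAX = 10_000  # illustrative bound


async def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _EMBED_CACHE:
        return _EMBED_CACHE[key]
    vector = await embed_query(text)  # stand-in for your embedding call
    if len(_EMBED_CACHE) >= _EMBED_CACHE_MAX:
        # Evict the oldest insertion to keep memory bounded.
        _EMBED_CACHE.pop(next(iter(_EMBED_CACHE)))
    _EMBED_CACHE[key] = vector
    return vector
```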
Rate Limiting and Admission Control
Rate limiting (Part 3 with slowapi) decides whether a single client is over their quota. Admission control decides whether the service can take any more work. They're different problems:
- Rate limit: 60 requests / minute / API key. Returns 429
- Admission: in-flight LLM calls > threshold → queue or reject. Protects against thundering herd
```python
import asyncio

_INFLIGHT = asyncio.Semaphore(50)


async def serve(prompt: str) -> str:
    async with _INFLIGHT:
        return await client.complete(prompt)
```
Past 50 in-flight calls, the next request blocks until one finishes. For a public API you'd want a queue with a timeout instead, so users aren't held indefinitely. For an internal API, the semaphore is enough.
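A sketch of the public-API variant, assuming Python 3.11+ for asyncio.timeout: wait briefly for a slot, then shed load with a 503 rather than holding the caller indefinitely:

```python
import asyncio

from fastapi import HTTPException

_INFLIGHT = asyncio.Semaphore(50)


async def serve_public(prompt: str) -> str:
    try:
        # Wait at most 2 seconds for a slot before giving up.
        async with asyncio.timeout(2.0):
            await _INFLIGHT.acquire()
    except TimeoutError:
        raise HTTPException(status_code=503, detail="server busy, retry shortly")
    try:
        return await client.complete(prompt)
    finally:
        _INFLIGHT.release()
```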
Graceful Degradation
The LLM call should be enriching, not load-bearing, wherever possible. Detailed treatment in Designing a Consumer-Tier AI Product That Degrades Gracefully. The pattern in code:
```python
import asyncio


async def enriched_response(query: str) -> dict:
    base = compute_deterministic_response(query)  # always succeeds
    try:
        ai_layer = await asyncio.wait_for(call_llm(query), timeout=5.0)
        return {**base, "ai_extras": ai_layer}
    except Exception:  # includes asyncio.TimeoutError from wait_for
        log.warning("ai layer failed; serving deterministic response only")
        return base
```
The user-visible failure mode becomes "AI features missing" rather than "request failed". Where this is appropriate is the architectural decision; where it isn't, fail loudly.
Observability
Structured Logging
```python
# uv add structlog==25.1.0
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()

# Use it everywhere, with structured fields:
log.info("rag.query", question_len=len(q), top_k=k, hits=len(hits))
log.info("llm.call", model=m, in_tok=in_t, out_tok=out_t, latency_ms=lat)
```
JSON output goes to stdout. Pipe to Loki, Datadog, or whatever ingests JSON logs. The fields are queryable, the timestamps are sortable, and there's no string-parsing fragility.
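The merge_contextvars processor pays off once you bind request-scoped fields at the start of each request; every log line inside that request then carries them automatically. A minimal FastAPI middleware sketch, with illustrative field names:

```python
import uuid

import structlog
from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def bind_request_context(request: Request, call_next):
    # Reset leftover context from the previous request on this worker.
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=str(uuid.uuid4()),
        path=request.url.path,
    )
    return await call_next(request)
```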
Metrics with Prometheus
```python
# uv add prometheus-client==0.21.0 starlette-exporter==0.23.0
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter("llm_calls_total", "LLM calls", ["model", "status"])
LLM_LATENCY = Histogram("llm_latency_seconds", "LLM latency", ["model"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens used", ["model", "direction"])


# Wrap the LLM call:
async def tracked_complete(client, model: str, prompt: str) -> str:
    with LLM_LATENCY.labels(model).time():
        try:
            result = await client.complete(prompt)
            LLM_CALLS.labels(model, "ok").inc()
            return result
        except Exception:
            LLM_CALLS.labels(model, "error").inc()
            raise
```
Mount the metrics endpoint:
```python
from starlette_exporter import PrometheusMiddleware, handle_metrics

app.add_middleware(PrometheusMiddleware)
app.add_route("/metrics", handle_metrics)
```
Scrape /metrics from Prometheus. Build dashboards in Grafana. Alert when:
- Error rate > 1% over 5 minutes
- p95 latency > 5s
- Queue depth (in-flight semaphore) saturated for > 1 minute
- Daily token spend > threshold
Cost Monitoring
The CostTracker from Part 2 logs per-call cost. Aggregate it:
```python
DAILY_USD_BUDGET = 50.0
_DAILY_SPEND = 0.0


def record(cost: float) -> None:
    global _DAILY_SPEND
    _DAILY_SPEND += cost
    if _DAILY_SPEND > DAILY_USD_BUDGET:
        log.error("daily.budget.exceeded", spent=_DAILY_SPEND, budget=DAILY_USD_BUDGET)
        # trip a circuit breaker, page on-call, etc.
```
Cron a daily reset; in a multi-process deployment, use Redis for the counter so all workers share state. The "stop-the-bleed" mechanism on cost — switching the LLM client to a self-hosted vLLM, returning a static fallback, or rejecting traffic outright — is the kind of decision you want to make before the bill arrives.
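A sketch of the multi-process variant, reusing the redis.asyncio client (_REDIS) and log from the earlier examples; the key scheme and expiry are assumptions:

```python
import datetime

DAILY_USD_BUDGET = 50.0


async def record(cost_usd: float) -> None:
    # One counter per calendar day, shared by every worker process.
    key = "spend:" + datetime.date.today().isoformat()
    spent = await _REDIS.incrbyfloat(key, cost_usd)
    await _REDIS.expire(key, 2 * 86400)  # old counters lapse on their own
    if spent > DAILY_USD_BUDGET:
        log.error("daily.budget.exceeded", spent=spent, budget=DAILY_USD_BUDGET)
```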
Deploy-Time Eval
Every deploy runs a small eval set against the new model/config. Catches regressions before users do. Save as scripts/deploy_eval.py:
"""Run the held-out eval against the deployed endpoint. Fail the deploy if quality drops."""
import json, sys
from openai import OpenAI
ENDPOINT = sys.argv[1] # e.g. "http://staging.internal:8000/rag"
THRESHOLD = 0.85 # required exact-match rate
EVAL = [json.loads(l) for l in open("data/deploy_eval.jsonl")]
import requests
hits = 0
for case in EVAL:
r = requests.post(ENDPOINT, json={"question": case["q"], "top_k": 5}, timeout=30)
answer = r.json()["answer"].strip().lower()
if any(kw.lower() in answer for kw in case["expected_keywords"]):
hits += 1
rate = hits / len(EVAL)
print(f"deploy eval: {hits}/{len(EVAL)} ({rate:.0%})")
sys.exit(0 if rate >= THRESHOLD else 1)
Wire this into your CI / deploy pipeline. Exit code drives whether the deploy proceeds.
Beyond One Box
When you outgrow a single GPU, the patterns:
- Replicate the vLLM stack horizontally. Multiple GPUs, multiple processes, a load balancer in front. Sticky routing by request-hash gives prefix-cache locality: the same conversation lands on the same backend (a minimal routing sketch follows this list)
- Shard a model across GPUs. vLLM's --tensor-parallel-size N splits the model across N GPUs. Use it when one GPU can't hold the model; otherwise replicate
- Separate inference targets per workload tier. Pattern from the graceful-degradation post: enterprise tier on a 32B AWQ on big GPUs, consumer tier on a 4B on smaller GPUs, separate failure domains
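A minimal sketch of the sticky-routing idea from the first bullet; the backend list and the choice of conversation id as the hash key are assumptions for illustration:

```python
import hashlib

BACKENDS = ["http://gpu-0:8000", "http://gpu-1:8000", "http://gpu-2:8000"]


def pick_backend(conversation_id: str) -> str:
    # Same conversation -> same hash -> same backend -> warm prefix cache.
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]
```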
What's Not in This Post
Each of these is its own series:
- Full Kubernetes setups. Helm charts, autoscaling, GPU operator, NodePool selectors. Use Helm + a managed cluster (EKS / GKE) when you need it; otherwise systemd is fine for a long time
- Multi-region deployments. Latency-aware routing, region-local caches, eventual-consistency on KV stores. Most projects don't need this. The ones that do already have a team
- Regulatory compliance. Data residency, audit logs, role-based access. Talk to your legal team. Architecture decisions follow what they tell you, not the other way around
Shipping AI to Production: One-Page Checklist
Use this as a deploy gate. Each line is a yes/no.
| Category | Check |
|---|---|
| API | Every request body validated by a Pydantic schema |
| API | Auth required on every non-health endpoint |
| API | Rate limiting per API key |
| API | Admission control / inflight semaphore set |
| API | Streaming endpoints handled, not just batch |
| Async | All endpoints are async def |
| Async | HTTP client is a process-wide singleton with keepalive |
| LLM | Wrapper module isolates retries, cost tracking, provider swap |
| LLM | Timeout set explicitly (not SDK default) |
| LLM | Cost tracked per call, aggregated per day, alert on budget |
| LLM | Cache layer for deterministic prompts |
| vLLM | Prefix caching enabled |
| vLLM | GPU memory utilisation set, not default |
| vLLM | Running under systemd with restart-on-failure |
| RAG | Chunks validated against source documents (no parser bugs) |
| RAG | Citations returned with answers |
| RAG | Eval set built; precision@k tracked |
| Logging | Structured JSON logs, queryable fields |
| Metrics | Prometheus /metrics endpoint exposed |
| Metrics | Alerts on error rate, p95 latency, queue saturation, daily spend |
| Reliability | Graceful degradation: deterministic core ships even when LLM fails (where appropriate) |
| Deploy | Eval set runs against the new deploy; fails the deploy if quality drops |
| Deploy | Health endpoint exists, distinguishes liveness from readiness |
Series Wrap-Up
Across the 10 parts you've built:
- A typed, tested CLI scaffold (Part 1)
- A provider-agnostic LLM wrapper with retries, streaming, structured output, cost tracking (Part 2)
- A FastAPI service with auth, rate limiting, streaming, and a Dockerfile (Part 3)
- Embeddings + Qdrant semantic search bolted onto the same service (Part 4)
- A full RAG pipeline with hybrid retrieval and citations (Part 5)
- A QLoRA fine-tune of a 4B instruct model on your own dataset (Part 6)
- AWQ + GGUF quantization with benchmarks (Part 7)
- A vLLM serving stack with multi-LoRA, prefix caching, and a config-only swap into the RAG (Part 8)
- A 1.5B distilled student that retains most of the teacher's behaviour (Part 9)
- A hardened version of the whole stack with metrics, structured logs, deploy-time eval (this part)
Where to Read Next
- How I Build AI Agents That Actually Work in Production — agentic patterns that turn this stack into a multi-agent system
- How I Get the Best Out of My GPU Using vLLM — the "why" behind the techniques in Part 8
- Demystifying LLM Quantization — theory behind Part 7
- Generating Long-Form Compliance Documents, Hybrid-RAG over Statute, Graceful Degradation, Audio-Video Diffusion on a Single GPU — case studies of stacks like this one in real production
Key Takeaways
- Async everywhere. LLM workloads are I/O-bound; sync code wastes the easy wins
- Connection pool the HTTP client. Don't pay TLS handshakes per call
- Cache at every layer: HTTP, application, embedding, vLLM prefix. The savings compound
- Rate-limit per key. Add admission control on top to protect the service from itself
- Make the LLM enriching, not load-bearing, where you can. Failure mode becomes "AI missing" not "site down"
- Structured logs. Prometheus metrics. Alerts on error rate, latency, and spend
- Deploy-time eval. Catch regressions before users do
- One vLLM, one base, many adapters; replicate horizontally before sharding
- The shipping checklist is more useful than any one trick. Run it before every release
The End
That's the series. You've gone from a CLI that counts words to a self-hosted, fine-tuned, quantized, distilled, RAG-augmented, monitored AI service. Build something with it.