The default pattern for AI products is: user clicks a button, you call an LLM, you show them the response. When the LLM call fails — model loading, inference layer down, network blip — the user sees an error. The purchase fails, the deliverable is incomplete, the trust hit is real.

For a consumer-tier product I built recently, this pattern was unacceptable. The product is "buy a compliance pack and download it." If the LLM call fails after the user has paid, that's a refund and a complaint. So I designed the product so the LLM call is enriching rather than load-bearing — and the deterministic path carries the SLA.

This post is about that design decision and why it matters more than people think.

Two Tiers, Two Inference Targets

Tier           | Deliverable                         | Inference target           | Why
Full generator | Tailored multi-section deliverables | 32B model, AWQ-quantised   | Quality > cost
Self-serve     | "Buy and download"                  | 4B model with domain LoRA  | Cost > complexity

Two queues, two batches, two failure domains. A spike in consumer traffic can't degrade the high-tier product's latency. The split also makes capacity planning honest — the consumer tier scales horizontally on small GPUs, the high tier scales vertically on big ones.
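
As a sketch, the routing decision itself is small. The endpoint URLs, model labels, and the target_for helper below are hypothetical; the point is that a request's tier picks its inference target and queue before it ever touches a GPU:

```python
# Hypothetical tier-to-target routing. Real endpoints, model names, and queue
# names will differ; the structural point is that the two tiers never share
# an endpoint or a queue.
INFERENCE_TARGETS = {
    "full_generator": {
        "model": "32b-awq",                            # large model, AWQ-quantised
        "endpoint": "http://inference-large:8000/v1",  # hypothetical URL
        "queue": "high-tier",
    },
    "self_serve": {
        "model": "4b-domain-lora",                     # small model + domain LoRA
        "endpoint": "http://inference-small:8000/v1",  # hypothetical URL
        "queue": "consumer",
    },
}

def target_for(tier: str) -> dict:
    """Pick the inference target and queue for a request's tier."""
    return INFERENCE_TARGETS[tier]
```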

Why a Smaller Model on a Separate Stack

Two reasons, both deliberate.

Cost. A 4B model with a domain-specific adapter handles the simpler artefact generations at a small fraction of the GPU cost of the 32B model used by the full generator. At consumer-tier pricing, the per-purchase margin only works if inference is cheap. Running the 32B model for every consumer purchase would invert the unit economics.
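
To see the shape of that argument, here is a back-of-envelope sketch. Every number below (GPU prices, generation times) is an illustrative placeholder rather than a figure from the real deployment; only the ratio matters:

```python
def inference_cost(gpu_dollars_per_hour: float, seconds_per_purchase: float) -> float:
    """GPU cost attributable to a single purchase (illustrative arithmetic only)."""
    return gpu_dollars_per_hour * seconds_per_purchase / 3600

# Hypothetical figures: a small GPU serving the 4B model vs. a large GPU
# serving the 32B model, with made-up generation times per purchase.
small = inference_cost(gpu_dollars_per_hour=0.50, seconds_per_purchase=20)
large = inference_cost(gpu_dollars_per_hour=4.00, seconds_per_purchase=120)

print(f"4B per purchase:  ${small:.3f}")   # ~$0.003 with these made-up numbers
print(f"32B per purchase: ${large:.3f}")   # ~$0.133, roughly 50x the 4B cost
```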

Isolation. Routing the consumer workload to its own inference target means the consumer tier and the high tier don't share a queue. Even if every consumer in the country hit "buy" simultaneously, the higher-tier customers wouldn't see degraded latency. This is the kind of property that's invisible until you need it, and impossible to retrofit when you do.
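
The same property can be shown as a toy sketch: two independent queues, each drained by its own worker pool. Queue names and worker counts are invented; the point is that a backlog in one queue can't delay the other:

```python
import asyncio
from typing import Awaitable, Callable

consumer_queue: asyncio.Queue = asyncio.Queue()   # self-serve purchases
high_tier_queue: asyncio.Queue = asyncio.Queue()  # full-generator jobs

async def worker(queue: asyncio.Queue,
                 run_inference: Callable[[dict], Awaitable[None]]) -> None:
    """Drain one queue forever; a backlog here never touches the other queue."""
    while True:
        job = await queue.get()
        try:
            await run_inference(job)
        finally:
            queue.task_done()

def start_workers(run_small, run_large) -> None:
    # Many cheap workers for the consumer tier, a few for the high tier.
    # The counts are made up, but the pools are independent by construction.
    for _ in range(8):
        asyncio.create_task(worker(consumer_queue, run_small))
    for _ in range(2):
        asyncio.create_task(worker(high_tier_queue, run_large))
```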

The Deliverable: Deterministic Core, AI Enrichment

What ships in every purchase:

  • A zip containing a curated set of fillable templates (deterministic — no LLM)
  • A README with instructions (deterministic)
  • A small number of LLM-generated artefacts layered on top (enrichment)

The deterministic part of the pipeline doesn't touch an LLM at all. Template assembly, packaging, README generation: all of it runs on plain Python. The LLM-generated artefacts are added to the pack when inference is available, and replaced with static-content fallbacks when it isn't.
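
The shape of the assembly step, as a minimal sketch. The file layout and parameter names here are illustrative, and the real templates and generators are whatever the product needs; what matters is that the zip is always built and the LLM only adds to it:

```python
import zipfile
from pathlib import Path
from typing import Callable, Mapping, Sequence

def build_pack(
    templates: Sequence[Path],                        # curated fillable templates (deterministic)
    readme_text: str,                                 # rendered README (deterministic)
    ai_generators: Mapping[str, Callable[[], str]],   # artefact name -> LLM-backed generator
    static_fallbacks: Mapping[str, str],              # artefact name -> fallback content
    out_path: Path,
) -> Path:
    """Assemble the pack: the deterministic core always ships, the AI layer is best effort."""
    with zipfile.ZipFile(out_path, "w") as pack:
        # Deterministic core: no LLM on this path.
        for template in templates:
            pack.write(template, arcname=f"templates/{template.name}")
        pack.writestr("README.md", readme_text)

        # Enrichment layer: an inference failure degrades to the static
        # fallback instead of failing the purchase.
        for name, generate in ai_generators.items():
            try:
                content = generate()
            except Exception:
                content = static_fallbacks[name]
            pack.writestr(f"ai/{name}.md", content)
    return out_path
```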

Graceful Degradation as a Design Property

This is the part most AI products get wrong.

Default pattern:

user → button → LLM call → response → done
                  ↓ (fails)
                error

What I built:

user → button → deterministic pack → ALWAYS ships
                       │
                       └─→ LLM artefacts → enriching layer
                                  ↓ (fails)
                              static fallback

The user-visible failure mode becomes "AI features missing" rather than "purchase failed". The first is recoverable in the user's mind ("I'll get the AI version next time"). The second is a refund.

This design property has a cost: you have to write static fallbacks for every AI artefact. For a small number of artefacts, this is cheap. For dozens, it's not. The right place to apply this pattern is anywhere the AI output is enriching rather than core — and being honest about which is which.
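
One way to keep that cost from being skipped is to make the fallback part of each artefact's definition, so the two are written together. A sketch with invented artefact names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Artefact:
    """An AI artefact can't be registered without its static fallback."""
    name: str
    generate: Callable[[], str]   # LLM-backed generator
    fallback: str                 # static content shipped when inference is down

# Illustrative entry only; the real artefact list depends on the product.
ARTEFACTS = [
    Artefact(
        name="gap_analysis",
        generate=lambda: "call the small model here",
        fallback="A short, static guide to completing the gap analysis by hand.",
    ),
]
```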

When Not to Use This Pattern

Graceful degradation is the wrong default if the AI is the product. If a user is paying specifically for the AI-generated output, shipping a static fallback is worse than failing — it ships a worse product silently. The pattern only makes sense when:

  • The deterministic part of the deliverable has standalone value
  • The AI part is enrichment, not core
  • The user can tell the difference (so silent degradation isn't deceptive)

For a compliance pack, the templates have standalone value, the AI artefacts add convenience, and the difference is visible. For a "ChatGPT for X" product, none of this applies.

Key Takeaways

  • The default LLM product pattern propagates inference failures to the user; this is often the wrong choice
  • Two inference targets with separate failure domains means consumer load doesn't degrade enterprise latency
  • Smaller models on separate stacks make consumer-tier unit economics work
  • Make the LLM call enriching, not load-bearing, whenever the deterministic path has standalone value
  • Static fallbacks are the cost of admission for graceful degradation — write them at the same time as the AI artefacts