The default pattern for AI products is: user clicks a button, you call an LLM, you show them the response. When the LLM call fails — model loading, inference layer down, network blip — the user sees an error. The purchase fails, the deliverable is incomplete, the trust hit is real.
For a consumer-tier product I built recently, this pattern was unacceptable. The product is "buy a compliance pack and download it." If the LLM call fails after the user has paid, that's a refund and a complaint. So I designed the product so the LLM call is enriching rather than load-bearing — and the deterministic path carries the SLA.
This post is about that design decision and why it matters more than people think.
## Two Tiers, Two Inference Targets
| Tier | Use case | Inference target | Why |
|---|---|---|---|
| Full generator | Tailored multi-section deliverables | 32B model, AWQ-quantised | Quality > cost |
| Self-serve | "Buy and download" | 4B model with domain LoRA | Cost > complexity |
Two queues, two batches, two failure domains. A spike in consumer traffic can't degrade the high-tier product's latency. The split also makes capacity planning honest — the consumer tier scales horizontally on small GPUs, the high tier scales vertically on big ones.
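Concretely, the split can be captured in a per-tier config. Here's a minimal sketch of the shape (the names and numbers are illustrative, not the production values):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    queue: str       # dedicated work queue, never shared across tiers
    model: str       # inference target for this tier
    max_batch: int   # batching is tuned per tier, not globally

# Illustrative values only. The point is that the two tiers share nothing.
TIERS = {
    "full":       TierConfig(queue="q-full",     model="generator-32b-awq", max_batch=4),
    "self_serve": TierConfig(queue="q-consumer", model="domain-4b-lora",    max_batch=32),
}
```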
## Why a Smaller Model on a Separate Stack
Two reasons, both deliberate.
Cost. A 4B model with a domain-specific adapter generates the simpler artefacts at a small fraction of the GPU cost of the 32B used by the full generator. At consumer-tier pricing, the per-purchase margin only works if inference is cheap; running the 32B for every consumer purchase would invert the unit economics.
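Back-of-the-envelope, with placeholder numbers (none of these are the product's real figures), the gap looks something like this:

```python
# Placeholder numbers for illustration; substitute your own GPU pricing.
gpu_hour_large = 4.00   # $/hr for a GPU that fits the 32B AWQ model
gpu_hour_small = 0.50   # $/hr for a GPU that fits the 4B + LoRA
secs_per_pack  = 30     # assumed wall-clock inference time per purchase

cost_32b = gpu_hour_large * secs_per_pack / 3600   # ~$0.033 per purchase
cost_4b  = gpu_hour_small * secs_per_pack / 3600   # ~$0.004 per purchase

# An ~8x gap in inference cost is the difference between a margin
# and a loss at consumer-tier pricing.
print(f"32B: ${cost_32b:.3f}/purchase  4B: ${cost_4b:.3f}/purchase")
```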
Isolation. Routing the consumer workload to its own inference target means the consumer tier and the high tier don't share a queue. Even if every consumer in the country hit "buy" simultaneously, the higher-tier customers don't see degraded latency. This is the kind of property that's invisible until you need it, and impossible to retrofit when you do.
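A toy, in-process sketch of the isolation property (in production these are separate services, but the shape is the same): each tier owns its queue and its workers, so a backlog in one is structurally invisible to the other.

```python
import asyncio

async def run_inference(model: str, job: dict) -> None:
    # Stand-in for a real inference call.
    await asyncio.sleep(0.01)

async def worker(model: str, queue: asyncio.Queue) -> None:
    # Workers read only from their own tier's queue.
    while True:
        job = await queue.get()
        await run_inference(model, job)
        queue.task_done()

async def main() -> None:
    q_full, q_consumer = asyncio.Queue(), asyncio.Queue()
    asyncio.create_task(worker("generator-32b-awq", q_full))
    asyncio.create_task(worker("domain-4b-lora", q_consumer))
    # A consumer stampede fills q_consumer only; q_full never sees it.
    for i in range(100):
        q_consumer.put_nowait({"purchase_id": i})
    await q_consumer.join()

asyncio.run(main())
```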
## The Deliverable: Deterministic Core, AI Enrichment
What ships in every purchase:
- A zip containing a curated set of fillable templates (deterministic — no LLM)
- A README with instructions (deterministic)
- A small number of LLM-generated artefacts layered on top (enrichment)
The deterministic part of the pipeline doesn't touch an LLM at all. Template assembly, packaging, README generation — all of it runs on plain Python. The LLM-generated artefacts are added to the pack if inference is available, and omitted (with static-content fallbacks substituted) if it isn't.
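In code, the shape is roughly this. A minimal sketch with hypothetical helper names, not the actual pipeline:

```python
import io
import zipfile

def render_templates() -> dict[str, bytes]:
    # Deterministic: curated fillable templates. No LLM anywhere.
    return {"templates/risk-register.xlsx": b"..."}

def render_readme() -> bytes:
    # Deterministic: plain string assembly.
    return b"How to use this pack..."

def generate_ai_artefacts() -> dict[str, bytes]:
    # The only step that can fail; raises when inference is unavailable.
    raise ConnectionError("inference layer unreachable")

STATIC_FALLBACKS = {
    "extras/summary.md": b"(static fallback) Generic summary...",
}

def build_pack() -> bytes:
    files = render_templates()
    files["README.md"] = render_readme()
    try:
        files.update(generate_ai_artefacts())   # enrichment layer
    except Exception:
        files.update(STATIC_FALLBACKS)          # degrade, never fail
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()   # the purchase always ships
```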
## Graceful Degradation as a Design Property
This is the part most AI products get wrong.
Default pattern:
```
user → button → LLM call → response → done
                    ↓ (fails)
                  error
```
What I built:
```
user → button → deterministic pack → ALWAYS ships
         │
         └─→ LLM artefacts → enriching layer
                     ↓ (fails)
                static fallback
```
The user-visible failure mode becomes "AI features missing" rather than "purchase failed". The first is recoverable in the user's mind ("I'll get the AI version next time"). The second is a refund.
This design property has a cost: you have to write a static fallback for every AI artefact. For a small number of artefacts, that's cheap; for dozens, it's not. The right place to apply this pattern is anywhere the AI output is enriching rather than core, and the hard part is being honest about which is which.
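One way to make "every artefact has a fallback" structural rather than a convention is to register the generator and its fallback together, so an artefact without a fallback cannot exist. A sketch; the decorator and names are hypothetical:

```python
from typing import Callable

def call_llm(prompt: str) -> bytes:
    # Stand-in for the real inference client.
    raise ConnectionError("inference unavailable")

# name -> (generator, static fallback); registration requires both.
ARTEFACTS: dict[str, tuple[Callable[[], bytes], bytes]] = {}

def artefact(name: str, fallback: bytes):
    def register(generate: Callable[[], bytes]) -> Callable[[], bytes]:
        ARTEFACTS[name] = (generate, fallback)
        return generate
    return register

@artefact("extras/summary.md", fallback=b"(static) Generic summary...")
def summary() -> bytes:
    return call_llm("Summarise the pack contents")
```

The pack builder can then iterate ARTEFACTS and substitute the fallback per artefact on failure, rather than dropping the whole enrichment layer at once.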
## When Not to Use This Pattern
Graceful degradation is the wrong default if the AI is the product. If a user is paying specifically for the AI-generated output, shipping a static fallback is worse than failing: it silently delivers a degraded product. The pattern only makes sense when:
- The deterministic part of the deliverable has standalone value
- The AI part is enrichment, not core
- The user can tell the difference (so silent degradation isn't deceptive)
For a compliance pack, the templates have standalone value, the AI artefacts add convenience, and the difference is visible. For a "ChatGPT for X" product, none of this applies.
## Key Takeaways
- The default LLM product pattern propagates inference failures to the user; this is often the wrong choice
- Two inference targets with separate failure domains mean consumer load can't degrade high-tier latency
- Smaller models on separate stacks make consumer-tier unit economics work
- Make the LLM call enriching, not load-bearing, whenever the deterministic path has standalone value
- Static fallbacks are the cost of admission for graceful degradation — write them at the same time as the AI artefacts