Most LLM document generators are wrappers around a single big prompt. You send the model everything you know, ask for the whole document, and pray the output parses. For a one-pager, this works. For a 40-section regulatory document where every section has to be internally consistent, schema-typed, and grounded in primary law — it falls apart fast.

I learned this the hard way. The first version of my generator did exactly that. The model invented section numbers, contradicted itself across sections, and produced JSON that broke the renderer about one run in five. So I rebuilt it as a structured pipeline of small, typed LLM calls, each constrained by Pydantic, each doing one thing. This post is about what that architecture looks like and the specific decisions that made it work.

The Problem with Big-Prompt Generation

When you ask a 32B model to produce a structured 40-section document in one call, three things go wrong:

  • Schema drift. The model decides 38 sections is enough, or invents a 41st, or returns the JSON with one field renamed
  • Internal contradiction. Section 12 says the business is medium-risk; section 27 (deeper in the same response) treats it as high-risk
  • Context dilution. The instruction to ground every claim in a specific source-of-truth gets attenuated by the time the model is generating section 30

Big-prompt generation is also impossible to debug. If section 19 is wrong, you regenerate the entire document. There's no surgical fix.

The Architecture: Five Typed Artefacts

I broke the pipeline into five LLM-backed artefacts, each with its own narrow schema:

  1. Per-item structured assessment — yes/no decisions over a fixed list of inputs from the intake
  2. Narrative profile — a paragraph summarising the business in its own words
  3. Cross-section critic — reads the outputs of the first two and flags inconsistencies
  4. Repair pass — surgically rewrites a single field flagged by the critic, leaving everything else untouched
  5. Customisation overlay — applies user-specific rules layered on top of the generic output

Each call returns Pydantic-validated JSON. If validation fails, the call retries with the same prompt. If retries fail, the pipeline raises rather than papering over the error with a fallback.
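
A sketch of that loop in miniature, with ItemAssessment, call_llm, and strip_think_and_fences standing in for the real schema and helpers:

```python
import json
from pydantic import BaseModel, ValidationError

# Illustrative artefact schema; the real per-item assessment carries more fields.
class ItemAssessment(BaseModel):
    item_id: str
    applies: bool
    rationale: str

def generate_assessment(prompt: str, max_retries: int = 2) -> ItemAssessment:
    """Call the model, validate against the schema, retry, then raise."""
    for attempt in range(max_retries + 1):
        raw = call_llm(prompt)              # assumed helper hitting vLLM
        raw = strip_think_and_fences(raw)   # defensive strips (see below)
        try:
            return ItemAssessment.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError):
            if attempt == max_retries:
                raise  # no silent fallback: a bad section fails loudly
    raise AssertionError("unreachable")
```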

Why No RAG in the Generator

This was the most counter-intuitive decision. RAG is the default move for any LLM application that touches a corpus, and skipping it on a compliance generator sounds like negligence.

But RAG inside a structured pipeline introduces two failure modes:

  • Retrieved chunks vary across runs. Same intake, same prompt, slightly different retrieved context, slightly different output. For a deliverable that customers compare side-by-side, this is unacceptable
  • Retrieved chunks can contradict the intake. If the user describes their business one way and a retrieved chunk describes a similar business differently, the model often defers to the chunk

The fix: all grounding comes from a typed intake context plus static, curated source-of-truth JSONs loaded at startup. The corpus is still there — but it's used by the chat assistant (different module, different write-up), where the question shape is open and grounding is the whole point. For structured generation against a known schema, RAG is noise.
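
In outline, the grounding wiring looks something like this; the IntakeContext fields, the source_of_truth directory, and build_section_prompt are illustrative names, not the project's actual layout:

```python
import json
from pathlib import Path
from pydantic import BaseModel

# Illustrative intake shape; the real intake carries far more fields.
class IntakeContext(BaseModel):
    business_name: str
    sector: str
    risk_answers: dict[str, bool]

# Curated source-of-truth JSONs, loaded once at startup rather than retrieved per call.
SOURCE_OF_TRUTH = {
    p.stem: json.loads(p.read_text())
    for p in Path("source_of_truth").glob("*.json")  # hypothetical directory
}

def build_section_prompt(template: str, intake: IntakeContext, key: str) -> str:
    # Same intake + same static JSON -> same prompt -> repeatable output.
    return template.format(
        intake=intake.model_dump_json(),
        grounding=json.dumps(SOURCE_OF_TRUTH[key]),
    )
```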

Inference Layer: One vLLM Process, Two Aliases

| What | Detail |
|---|---|
| Base model | Qwen3-32B-AWQ |
| Kernel | Marlin (fastest INT4 path on Hopper) |
| LoRA adapter | Domain-specific, r=16, served as a second alias |
| GPU memory | --gpu-memory-utilization 0.35 |
| Concurrency cap | --max-num-seqs 8 |
| Prefix caching | Enabled |

Both the chat assistant and the generator point at the same vLLM process. Chat hits the base alias; the generator hits the LoRA-adapted alias. One set of base weights loaded, two endpoints served. This saves ~30 GB of GPU memory compared to running two separate 32B instances and means I can serve both workloads on one box.
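
On the client side that reduces to one URL and two model names. A rough sketch, assuming vLLM's OpenAI-compatible server with the adapter registered as a named alias (alias names and URL here are placeholders):

```python
import httpx

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # the single vLLM process

def complete(model_alias: str, messages: list[dict], client: httpx.Client) -> str:
    resp = client.post(
        VLLM_URL,
        json={"model": model_alias, "messages": messages, "temperature": 0.01},
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Chat assistant -> base alias; generator -> LoRA alias (names illustrative).
# reply   = complete("qwen3-32b-awq", chat_messages, client)
# section = complete("domain-lora", generator_messages, client)
```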

Prefix caching is the underrated win here. The cross-section critic prompt is ~2 KB and identical on every call. With prefix caching enabled, vLLM reuses the KV cache for that shared prefix across every critic invocation in the run. On a six-section parallel fan-out, that's a measurable latency reduction for free.

LoRA Configuration

| Parameter | Value | Why |
|---|---|---|
| Rank | 16 | Matches --max-lora-rank |
| Alpha | 32 | Effective scaling 2.0 |
| Dropout | 0.05 | Standard |
| Target modules | All 7 projections | Full LoRA, attention + MLP |
| Adapter size | ~537 MB | Trained at checkpoint-177 |

The decision worth calling out: target modules cover both attention (q/k/v/o) and MLP (gate/up/down). Most LoRA tutorials default to attention-only because it's cheaper. For a domain where the structure of the output matters as much as the content — every section follows a specific template — full LoRA gave noticeably better adherence to that structure.
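
For reference, the table above corresponds roughly to a PEFT LoraConfig like the following; the projection-module names assume Qwen-style layer naming:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,      # alpha / rank = 2.0 effective scaling
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    task_type="CAUSAL_LM",
)
```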

The Honest Trade-off

The adapter was trained against the FP16 base but loaded against the AWQ-quantised base at inference. This is documented in the code with a comment. Quality drift from this mismatch is unmeasured and on the calibration backlog.

The cleaner path is to re-train against the AWQ base directly. The reason I didn't: the GPU budget didn't allow for a parallel training run during the initial build, and visible quality on held-out examples was good enough to ship. This is the kind of decision that's fine to make explicitly and revisit, and bad to make implicitly and forget.

The Defensive Strips That Save You

Two post-processing steps run on every LLM response, regardless of prompt (a minimal sketch of both follows the list):

  • Strip <think>...</think> blocks. Qwen3 leaks reasoning blocks even with enable_thinking=False. Maybe one call in fifty. Strip them defensively and stop worrying about it
  • Strip JSON code-fences. The model adds them inconsistently regardless of system prompt instructions. Strip the ```json and ``` wrappers post-hoc and parse what's left
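
A minimal version of both strips; the regexes are illustrative rather than the exact production code:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think_and_fences(raw: str) -> str:
    """Remove leaked <think> blocks and stray ```json fences before JSON parsing."""
    cleaned = THINK_RE.sub("", raw)
    cleaned = re.sub(r"^\s*```(?:json)?\s*", "", cleaned)  # leading code fence
    cleaned = re.sub(r"\s*```\s*$", "", cleaned)           # trailing code fence
    return cleaned.strip()
```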

Temperature is fixed at 0.01 across all calls. For structured output, low temperature is the largest single lever for stability. The residual variance comes from the non-determinism of batched inference, not from sampling.

Concurrency

Six section tasks fan out in parallel via a ThreadPoolExecutor. The HTTP layer uses a process-wide httpx.Client singleton with max_keepalive_connections=10 and max_connections=20 so connection setup isn't paid per call. vLLM handles the actual batching on the GPU side via continuous batching — my job at the application layer is just to feed it requests fast enough to keep the batch full.
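
In outline (generate_section stands in for the real per-section call):

```python
from concurrent.futures import ThreadPoolExecutor
import httpx

# Process-wide client: connection setup is paid once, not per call.
HTTP_CLIENT = httpx.Client(
    limits=httpx.Limits(max_keepalive_connections=10, max_connections=20),
    timeout=httpx.Timeout(120.0),
)

def generate_all_sections(section_tasks: list) -> list:
    # Fan the section tasks out; vLLM's continuous batching handles the GPU side.
    with ThreadPoolExecutor(max_workers=len(section_tasks)) as pool:
        return list(pool.map(generate_section, section_tasks))
```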

Key Takeaways

  • Big-prompt generation breaks at scale; small typed calls compose
  • RAG belongs in chat, not in structured generation pipelines
  • One vLLM process with two aliases beats two vLLM processes for any workload that shares a base model
  • Prefix caching is free latency reduction whenever your prompts have stable prefixes
  • Full LoRA (attention + MLP) matters when structural adherence matters
  • Defensive output stripping is cheaper than fighting the model's leak modes