The promise of LLM chat over a legal corpus is huge. A user asks a question, the model retrieves the relevant law, and produces a grounded answer with citations. The reality is that most attempts at this hallucinate. They invent section numbers. They confidently misstate thresholds. They produce answers that look like statute but aren't.
I spent a lot of time getting this right because the use case demanded it. This post is about what actually worked: hybrid retrieval, careful chunking, and one specific bug that taught me to validate parsers before tuning anything else.
## Why Statutory QA Is Hard
Three reasons general-purpose LLMs fail on statute:
- Section numbers are tokenised badly. "Rule 5-13" gets split into several tokens that don't carry meaning, so dense embeddings don't reliably retrieve the chunk containing that exact reference (see the tokeniser sketch after this list)
- Plausible is not correct. The model has seen enough legal text in pretraining to produce confidently-shaped wrong answers
- Paraphrase gaps. A user asks about "regulated activities"; the law says "designated services". Lexical search misses the connection; dense search sometimes does too
Each of these breaks a different part of the retrieval-and-generation chain. You can't fix them with a better prompt.
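To make the first failure concrete, here's a minimal sketch of how a section number fragments under a subword tokeniser. It uses tiktoken's cl100k_base vocabulary purely as a representative BPE example — the embedding model in this post has its own tokeniser, and the exact split varies — but the fragmentation pattern is typical:

```python
import tiktoken

# cl100k_base stands in here for any BPE vocabulary; the exact split
# depends on the tokeniser, but the fragmentation is representative.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Rule 5-13")
pieces = [enc.decode([t]) for t in tokens]
print(pieces)
# Something like ['Rule', ' ', '5', '-', '13'] — the reference becomes
# fragments that carry no legal meaning individually, so the embedding
# has nothing distinctive to anchor on.
```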
## The Corpus
| Source | Chunks | Share |
|---|---|---|
| Primary legislation (future-law compilation) | ~950 | ~60% |
| Regulator-issued rules | ~500 | ~31% |
| Regulator's published implementation kits | ~140 | ~9% |
~1,600 chunks total. Heavily weighted toward statute, with regulator guidance as a small but high-value minority. This split is the moat. Answers are grounded in the actual law and the regulator's own materials, not in whatever the base model absorbed during pretraining.
## Embeddings
| Component | Value |
|---|---|
| Model | A 0.6B embedding model |
| Dimension | 1024 |
| Distance | Cosine |
| Loading | Lazy (on first query) |
| Vector store | Qdrant, file-backed, single collection |
The lazy load matters. The generator workload doesn't need embeddings, and loading them at process start would waste memory on every server that doesn't run a chat query. First-query latency takes a one-time hit; everything after is hot.
File-backed Qdrant is the right choice for one host with zero operational overhead. It's the wrong choice the moment you need failover. The migration path to a clustered Qdrant deployment is well-understood, so I made the call to start simple and scale later.
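A minimal sketch of the lazy-load pattern plus the file-backed setup, assuming sentence-transformers for the embedding model and qdrant-client's local mode. The model ID, collection name, and path are placeholders, not the production values:

```python
from functools import lru_cache

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from sentence_transformers import SentenceTransformer

EMBED_MODEL = "some-0.6b-embedding-model"  # placeholder, not the real model ID

@lru_cache(maxsize=1)
def get_embedder() -> SentenceTransformer:
    # Loaded on first call only, so the generator workload — which never
    # embeds anything — never pays the memory cost.
    return SentenceTransformer(EMBED_MODEL)

# Local (file-backed) mode: one host, zero operational overhead.
client = QdrantClient(path="./qdrant_data")
if not client.collection_exists("statutes"):
    client.create_collection(
        collection_name="statutes",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

def embed_query(text: str):
    # First call triggers the one-time load; everything after is hot.
    return get_embedder().encode(text, normalize_embeddings=True)
```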
## Why Hybrid Retrieval
I tested three configurations on a held-out question set drawn from the regulator's own worked examples:
| Retrieval | Recall@3 | Notes |
|---|---|---|
| Pure dense | Low on section-number queries | Misses "Rule 5-13" |
| Pure BM25 | Low on paraphrase queries | Misses "designated services" ↔ "regulated activities" |
| Hybrid: 0.7 dense + 0.3 BM25 | Best on both | Tuned empirically |
The 70/30 split wasn't picked from a tutorial. It came out of running retrieval against that held-out question set and tuning until both classes of query landed in the top-3.
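The post doesn't pin down the exact fusion scheme, so here is one plausible implementation of the 0.7/0.3 weighting: score-level fusion with min-max normalisation, so a cosine similarity in [0, 1] and an unbounded BM25 score become comparable before the weighted sum:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    # Normalise one retriever's scores to [0, 1] so the weights mean something.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_scores(
    dense: dict[str, float],   # chunk_id -> cosine similarity
    bm25: dict[str, float],    # chunk_id -> BM25 score
    w_dense: float = 0.7,
    w_bm25: float = 0.3,
) -> list[tuple[str, float]]:
    d, b = minmax(dense), minmax(bm25)
    ids = set(d) | set(b)
    # A chunk found by only one retriever scores 0 on the other.
    fused = {i: w_dense * d.get(i, 0.0) + w_bm25 * b.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Rank-based fusion (e.g. reciprocal rank fusion) would also work here; score fusion just makes the 0.7/0.3 knob directly interpretable.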
## The Bug That Taught Me Everything
The first version of the RAG store used a popular document parser. It looked correct. Chunks came out the right length. Metadata was attached properly. Embeddings indexed cleanly. By every visible signal, it worked.
But the chat assistant would occasionally produce confidently wrong answers — not vague ones, specific ones. It would cite "Section 84(2)(b)" with content that didn't match the actual text of that section.
The cause: the parser mishandled the two-column layout of the source PDFs. Chunks ended up with text from adjacent columns interleaved. To the embedding model, the chunk read as syntactically coherent (because each column individually was fluent legal English). But semantically, the chunk was scrambled — a sentence from column 1 followed by a sentence from column 2 followed by a sentence from column 1.
The fix:
- Parser: switched to `pymupdf4llm`
- Chunking: switched to `MarkdownHeaderTextSplitter` so chunks respected section boundaries
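The rebuilt ingestion path, sketched with the two libraries named above. Both APIs are real; the header-to-metadata mapping is an assumption about how the statute's part/section structure renders in markdown:

```python
import pymupdf4llm
from langchain_text_splitters import MarkdownHeaderTextSplitter

# pymupdf4llm converts the PDF to markdown and is designed to detect
# multi-column pages and emit text in reading order — the exact failure
# mode of the previous parser.
md_text = pymupdf4llm.to_markdown("statute.pdf")

# Split on headers so chunks respect section boundaries instead of an
# arbitrary character window. The header levels are illustrative.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "part"), ("##", "section")],
)
chunks = splitter.split_text(md_text)

for chunk in chunks[:3]:
    print(chunk.metadata, chunk.page_content[:80])
```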
The hallucination rate dropped substantially after the rebuild. The lesson: when RAG quality is bad, the model is rarely the problem. The parser is. Validate chunks against the source document before tuning anything else.
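A sketch of the validation step that lesson implies: check that every chunk appears as a contiguous, whitespace-normalised substring of an independently extracted copy of the source text. Column interleaving fails this check immediately, because the interleaved sentence order never occurs in the original. The normalisation scheme here is an assumption:

```python
import re

def normalize(text: str) -> str:
    # Strip markdown header markers and collapse whitespace so the
    # comparison tolerates formatting differences between extractors.
    return re.sub(r"\s+", " ", re.sub(r"^#+\s*", "", text, flags=re.M)).strip()

def validate_chunks(chunks: list[str], source_text: str) -> list[int]:
    """Return indices of chunks that are NOT contiguous in the source.

    source_text should come from a second, independent extraction —
    validating against the same parser that produced the chunks would
    never catch the interleaving bug.
    """
    haystack = normalize(source_text)
    return [i for i, c in enumerate(chunks) if normalize(c) not in haystack]
```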
## Inference Path
Same vLLM process as the structured generator, different alias. The chat assistant uses the base 32B model — no LoRA. For chat, the LoRA's domain bias is a liability rather than an asset: chat questions span a broader surface than the generator's narrow output schema, and the unadapted base behaves more predictably across that surface.
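A sketch of how the chat path selects the unadapted base on the shared process, assuming vLLM's OpenAI-compatible server, where the base model and LoRA adapters are exposed as separate model names. The URL and aliases are placeholders:

```python
from openai import OpenAI

# One vLLM process serves both workloads; the base model and any LoRA
# adapters appear as distinct model names. These aliases are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="base-32b",  # unadapted base; the generator would pass its LoRA alias
    messages=[
        {"role": "system", "content": "Answer only from the provided extracts. Cite sections."},
        {"role": "user", "content": "What does Rule 5-13 require?\n\n<retrieved chunks>"},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```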
## Key Takeaways
- Pure dense retrieval misses lexical anchors like section numbers; pure BM25 misses paraphrase. Hybrid solves both
- Tune retrieval weights against questions drawn from your domain, not from generic IR benchmarks
- When RAG quality is bad, the parser is almost always the problem. Validate chunks against source documents
- Lazy-load embedding models so workloads that don't need them don't pay the memory cost
- For chat over a corpus, the unadapted base often beats the LoRA-adapted alias