The promise of LLM chat over a legal corpus is huge. A user asks a question, the model retrieves the relevant law, and produces a grounded answer with citations. The reality is that most attempts at this hallucinate. They invent section numbers. They confidently misstate thresholds. They produce answers that look like statute but aren't.
I spent a lot of time getting this right because the use case demanded it. This post is about what actually worked: hybrid retrieval, careful chunking, and one specific bug that taught me to validate parsers before tuning anything else.
## Why Statutory QA Is Hard
Three reasons general-purpose LLMs fail on statute:
- Section numbers are tokenised badly. "Rule 5-13" gets split into several tokens that don't carry meaning, so dense embeddings don't reliably retrieve the chunk containing that exact reference (see the tokeniser sketch after this list)
- Plausible is not correct. The model has seen enough legal text in pretraining to produce confidently-shaped wrong answers
- Paraphrase gaps. A user asks about "regulated activities"; the law says "designated services". Lexical search misses the connection; dense search sometimes does too
Each of these breaks a different part of the retrieval-and-generation chain. You can't fix them with a better prompt.
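To make the first failure concrete, here's a minimal sketch of how a section number fragments under a subword tokeniser. It uses tiktoken's cl100k_base vocabulary purely as a representative BPE example — the embedding model in this post has its own tokeniser, and the exact split varies — but the fragmentation pattern is typical:

```python
import tiktoken

# cl100k_base stands in here for any BPE vocabulary; the exact split
# depends on the tokeniser, but the fragmentation is representative.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Rule 5-13")
pieces = [enc.decode([t]) for t in tokens]
print(pieces)
# Something like ['Rule', ' ', '5', '-', '13'] — the reference becomes
# fragments that carry no legal meaning individually, so the embedding
# has nothing distinctive to anchor on.
```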
## The Corpus
| Source | Chunks | Share |
|---|---|---|
| Primary legislation (future-law compilation) | ~950 | ~60% |
| Regulator-issued rules | ~500 | ~31% |
| Regulator's published implementation kits | ~140 | ~9% |
~1,600 chunks total. Heavily weighted toward statute, with regulator guidance as a small but high-value minority. This split is the moat. Answers are grounded in the actual law and the regulator's own materials, not in whatever the base model absorbed during pretraining.
## Embeddings
| Component | Value |
|---|---|
| Model | A 0.6B embedding model |
| Dimension | 1024 |
| Distance | Cosine |
| Loading | Lazy (on first query) |
| Vector store | Qdrant, file-backed, single collection |
The lazy load matters. The generator workload doesn't need embeddings, and loading them at process start would waste memory on every server that doesn't run a chat query. First-query latency takes a one-time hit; everything after is hot.
File-backed Qdrant is the right choice for one host with zero operational overhead. It's the wrong choice the moment you need failover. The migration path to a clustered Qdrant deployment is well-understood, so I made the call to start simple and scale later.
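A minimal sketch of the lazy-load pattern plus the file-backed setup, assuming sentence-transformers for the embedding model and qdrant-client's local mode. The model ID, collection name, and path are placeholders, not the production values:

```python
from functools import lru_cache

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from sentence_transformers import SentenceTransformer

EMBED_MODEL = "some-0.6b-embedding-model"  # placeholder, not the real model ID

@lru_cache(maxsize=1)
def get_embedder() -> SentenceTransformer:
    # Loaded on first call only, so the generator workload — which never
    # embeds anything — never pays the memory cost.
    return SentenceTransformer(EMBED_MODEL)

# Local (file-backed) mode: one host, zero operational overhead.
client = QdrantClient(path="./qdrant_data")
if not client.collection_exists("statutes"):
    client.create_collection(
        collection_name="statutes",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

def embed_query(text: str):
    # First call triggers the one-time load; everything after is hot.
    return get_embedder().encode(text, normalize_embeddings=True)
```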
## Why Hybrid Retrieval
I tested three configurations on a held-out question set drawn from the regulator's own worked examples:
| Retrieval | Recall@3 | Notes |
|---|---|---|
| Pure dense | Low on section-number queries | Misses "Rule 5-13" |
| Pure BM25 | Low on paraphrase queries | Misses "designated services" ↔ "regulated activities" |
| Hybrid: 0.7 dense + 0.3 BM25 | Best on both | Tuned empirically |
The 70/30 split wasn't picked from a tutorial. It came out of running retrieval against that held-out question set and tuning until both classes of query landed in the top-3.
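The post doesn't pin down the exact fusion scheme, so here is one plausible implementation of the 0.7/0.3 weighting: score-level fusion with min-max normalisation, so a cosine similarity in [0, 1] and an unbounded BM25 score become comparable before the weighted sum:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    # Normalise one retriever's scores to [0, 1] so the weights mean something.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_scores(
    dense: dict[str, float],   # chunk_id -> cosine similarity
    bm25: dict[str, float],    # chunk_id -> BM25 score
    w_dense: float = 0.7,
    w_bm25: float = 0.3,
) -> list[tuple[str, float]]:
    d, b = minmax(dense), minmax(bm25)
    ids = set(d) | set(b)
    # A chunk found by only one retriever scores 0 on the other.
    fused = {i: w_dense * d.get(i, 0.0) + w_bm25 * b.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Rank-based fusion (e.g. reciprocal rank fusion) would also work here; score fusion just makes the 0.7/0.3 knob directly interpretable.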
## The Bug That Taught Me Everything
The first version of the RAG store used a popular document parser. It looked correct. Chunks came out the right length. Metadata was attached properly. Embeddings indexed cleanly. By every visible signal, it worked.
But the chat assistant would occasionally produce confidently wrong answers — not vague ones, specific ones. It would cite "Section 84(2)(b)" with content that didn't match the actual text of that section.
The cause: the parser mishandled the two-column layout of the source PDFs. Chunks ended up with text from adjacent columns interleaved. To the embedding model, the chunk read as syntactically coherent (because each column individually was fluent legal English). But semantically, the chunk was scrambled — a sentence from column 1 followed by a sentence from column 2 followed by a sentence from column 1.
The fix:
- Parser: switched to `pymupdf4llm`
- Chunking: switched to `MarkdownHeaderTextSplitter` so chunks respected section boundaries
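The rebuilt ingestion path, sketched with the two libraries named above. Both APIs are real; the header-to-metadata mapping is an assumption about how the statute's part/section structure renders in markdown:

```python
import pymupdf4llm
from langchain_text_splitters import MarkdownHeaderTextSplitter

# pymupdf4llm converts the PDF to markdown and is designed to detect
# multi-column pages and emit text in reading order — the exact failure
# mode of the previous parser.
md_text = pymupdf4llm.to_markdown("statute.pdf")

# Split on headers so chunks respect section boundaries instead of an
# arbitrary character window. The header levels are illustrative.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "part"), ("##", "section")],
)
chunks = splitter.split_text(md_text)

for chunk in chunks[:3]:
    print(chunk.metadata, chunk.page_content[:80])
```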
The hallucination rate dropped substantially after the rebuild. The lesson: when RAG quality is bad, the model is rarely the problem. The parser is. Validate chunks against the source document before tuning anything else.
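A sketch of the validation step that lesson implies: check that every chunk appears as a contiguous, whitespace-normalised substring of an independently extracted copy of the source text. Column interleaving fails this check immediately, because the interleaved sentence order never occurs in the original. The normalisation scheme here is an assumption:

```python
import re

def normalize(text: str) -> str:
    # Strip markdown header markers and collapse whitespace so the
    # comparison tolerates formatting differences between extractors.
    return re.sub(r"\s+", " ", re.sub(r"^#+\s*", "", text, flags=re.M)).strip()

def validate_chunks(chunks: list[str], source_text: str) -> list[int]:
    """Return indices of chunks that are NOT contiguous in the source.

    source_text should come from a second, independent extraction —
    validating against the same parser that produced the chunks would
    never catch the interleaving bug.
    """
    haystack = normalize(source_text)
    return [i for i, c in enumerate(chunks) if normalize(c) not in haystack]
```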
## Inference Path
Same vLLM process as the structured generator, different alias. The chat assistant uses the base 32B model — no LoRA. For chat, the LoRA's domain bias is a liability rather than an asset: chat questions span a broader surface than the generator's narrow output schema, and the unadapted base behaves more predictably across that surface.
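A sketch of how the chat path selects the unadapted base on the shared process, assuming vLLM's OpenAI-compatible server, where the base model and LoRA adapters are exposed as separate model names. The URL and aliases are placeholders:

```python
from openai import OpenAI

# One vLLM process serves both workloads; the base model and any LoRA
# adapters appear as distinct model names. These aliases are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="base-32b",  # unadapted base; the generator would pass its LoRA alias
    messages=[
        {"role": "system", "content": "Answer only from the provided extracts. Cite sections."},
        {"role": "user", "content": "What does Rule 5-13 require?\n\n<retrieved chunks>"},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```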
## Key Takeaways
- Pure dense retrieval misses lexical anchors like section numbers; pure BM25 misses paraphrase. Hybrid solves both
- Tune retrieval weights against questions drawn from your domain, not from generic IR benchmarks
- When RAG quality is bad, the parser is almost always the problem. Validate chunks against source documents
- Lazy-load embedding models so workloads that don't need them don't pay the memory cost
- For chat over a corpus, the unadapted base often beats the LoRA-adapted alias