Hardware: 8 GB RAM. A CPU is enough. Python: 3.11+.
RAG Demos vs RAG in Production
RAG demos look like magic. RAG in production looks like 200 lines of glue code, two failure modes you didn't anticipate, and one parser bug. Most RAG tutorials skip the parts that actually matter.
What we'll build: a /rag endpoint that takes a question, retrieves relevant chunks via hybrid retrieval, builds a grounded prompt with citation IDs, calls the LLM via the Part 2 wrapper, and returns the answer plus citations.
The Minimal RAG Loop (Working but Bad)
def naive_rag(question: str, store: VectorStore, client) -> str:
    hits = store.search(question, top_k=5)
    context = "\n\n".join(h.text for h in hits)
    prompt = f"Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {question}"
    return client.complete(prompt, max_tokens=400).text
This works. It also has at least four problems we'll fix:
- No chunking strategy — the indexed "documents" might be paragraphs, sentences, or whole files
- Pure dense retrieval — misses queries that depend on exact terms (section numbers, function names, IDs)
- No citation — the model says things that look like answers, but you can't tell which retrieved chunk a claim came from
- No eval — no way to know whether changes make it better or worse
Chunking
Chunking strategies, ranked from worst to best for most cases:
| Strategy | Verdict |
|---|---|
| Fixed-size by characters | Bad — cuts mid-sentence, loses structure |
| Sentence splitter | Okay — respects sentence boundaries, but chunks are usually too small to carry enough context |
| Recursive character splitter | Good baseline — respects paragraphs and sentences |
| Markdown header splitter | Best when source is structured — chunks are coherent sections |
Install splitters and a robust PDF parser:
uv add langchain-text-splitters==0.3.0 pymupdf4llm==0.0.20 rank-bm25==0.2.2
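To see what the header splitter buys you, here's a minimal sketch (the sample markdown is made up for illustration). Each chunk comes back with its heading trail attached as metadata, which is exactly what you want to carry through to citations later:
from langchain_text_splitters import MarkdownHeaderTextSplitter

md = "# Guide\n\n## Install\n\nRun uv sync.\n\n## Usage\n\nPOST a question to /rag."
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")], strip_headers=False
)
for chunk in splitter.split_text(md):
    # metadata records the heading trail, e.g. {'h1': 'Guide', 'h2': 'Install'}
    print(chunk.metadata, "->", chunk.page_content[:40])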
The Parser Trap (Why pymupdf4llm)
Most PDF parsers mishandle multi-column layouts. Chunks come out with text from adjacent columns interleaved — syntactically fluent, semantically scrambled. Embeddings index it cleanly, retrieval looks fine, then the model produces confidently wrong answers because the chunk it cited literally doesn't say what the citation claims.
I wrote a whole post about this exact bug: Building a Hybrid-RAG Assistant That Doesn't Hallucinate Statute. Short version: use pymupdf4llm; convert PDFs to Markdown; chunk with MarkdownHeaderTextSplitter. Always validate chunks against the source document before tuning anything else.
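The cheapest validation step: convert one document and read the markdown before indexing anything. A quick sketch (the path is illustrative):
import pymupdf4llm

md = pymupdf4llm.to_markdown("docs/sample.pdf")  # illustrative path
# Eyeball the first ~40 lines against the source PDF: are the headings in order,
# is column text interleaved, did tables survive?
print("\n".join(md.splitlines()[:40]))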
Indexing Pipeline
Save as src/text_tool/indexing.py:
"""Parse documents, chunk by markdown headers, index in Qdrant."""
from __future__ import annotations
from pathlib import Path
from uuid import uuid4
from typing import Iterable
import pymupdf4llm
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from text_tool.vector_store import Document, VectorStore
HEADERS = [("#", "h1"), ("##", "h2"), ("###", "h3")]
def parse_document(path: Path) -> str:
"""Convert PDF or text to markdown."""
if path.suffix.lower() == ".pdf":
return pymupdf4llm.to_markdown(str(path))
return path.read_text(encoding="utf-8")
def chunk_markdown(md: str, *, source: str, max_chars: int = 1500) -> list[Document]:
"""Header-aware chunking, with a recursive fallback for sections that are too large."""
header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=HEADERS, strip_headers=False)
header_chunks = header_splitter.split_text(md)
fallback = RecursiveCharacterTextSplitter(chunk_size=max_chars, chunk_overlap=120)
out: list[Document] = []
for hc in header_chunks:
text = hc.page_content
meta = {"source": source, **hc.metadata}
if len(text) <= max_chars:
out.append(Document(id=str(uuid4()), text=text, metadata={k: str(v) for k, v in meta.items()}))
else:
for piece in fallback.split_text(text):
out.append(Document(id=str(uuid4()), text=piece, metadata={k: str(v) for k, v in meta.items()}))
return out
def index_documents(store: VectorStore, paths: Iterable[Path]) -> int:
docs: list[Document] = []
for p in paths:
md = parse_document(p)
docs.extend(chunk_markdown(md, source=str(p)))
return store.index(docs)
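Running the pipeline is a few lines. A sketch, assuming the VectorStore from the earlier part and an illustrative docs/ folder:
from pathlib import Path

from text_tool.indexing import index_documents
from text_tool.vector_store import VectorStore

store = VectorStore(db_path="./qdrant_data")
# assumes store.index() returns the number of points written
n_chunks = index_documents(store, Path("docs").glob("**/*.pdf"))
print(f"Indexed {n_chunks} chunks")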
Hybrid Retrieval
Dense retrieval misses lexical anchors (section numbers, function names, exact phrases). BM25 misses paraphrase. Hybrid solves both. Start with a 0.7/0.3 dense-to-BM25 blend — tune later against a real query set.
Save as src/text_tool/hybrid_retrieval.py:
"""Combine dense (Qdrant) and BM25 retrieval scores."""
from __future__ import annotations
from dataclasses import dataclass
from typing import Iterable
from rank_bm25 import BM25Okapi
from text_tool.vector_store import VectorStore, SearchHit
@dataclass
class RetrievedChunk:
id: str
text: str
metadata: dict[str, str]
score: float
sources: tuple[str, ...] # ("dense",) ("bm25",) or ("dense","bm25")
class HybridRetriever:
"""Build a BM25 index over the same corpus as Qdrant; score-blend at query time."""
def __init__(self, store: VectorStore, corpus: Iterable[tuple[str, str, dict[str, str]]]) -> None:
# corpus is iterable of (id, text, metadata)
self._store = store
self._ids: list[str] = []
self._texts: list[str] = []
self._meta: list[dict[str, str]] = []
for cid, text, meta in corpus:
self._ids.append(cid); self._texts.append(text); self._meta.append(meta)
tokenised = [t.lower().split() for t in self._texts]
self._bm25 = BM25Okapi(tokenised) if tokenised else None
def retrieve(self, query: str, *, top_k: int = 5,
dense_weight: float = 0.7) -> list[RetrievedChunk]:
bm25_weight = 1.0 - dense_weight
# Dense
dense_hits = self._store.search(query, top_k=top_k * 3)
dense_scores = {h.id: h.score for h in dense_hits}
# Min-max to [0, 1] so weights are comparable
dense_scores = _normalise(dense_scores)
# BM25
bm25_scores: dict[str, float] = {}
if self._bm25:
scores = self._bm25.get_scores(query.lower().split())
top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: top_k * 3]
bm25_scores = {self._ids[i]: float(scores[i]) for i in top_idx}
bm25_scores = _normalise(bm25_scores)
all_ids = set(dense_scores) | set(bm25_scores)
blended: list[RetrievedChunk] = []
for cid in all_ids:
score = dense_weight * dense_scores.get(cid, 0.0) + bm25_weight * bm25_scores.get(cid, 0.0)
sources = tuple([s for s, m in (("dense", dense_scores), ("bm25", bm25_scores)) if cid in m])
            # Pull text/metadata from the dense hit if present, otherwise from the BM25 corpus.
            idx = self._ids.index(cid) if cid in self._ids else -1
            text = next((h.text for h in dense_hits if h.id == cid), self._texts[idx] if idx >= 0 else None)
            meta = next((h.metadata for h in dense_hits if h.id == cid), self._meta[idx] if idx >= 0 else {})
            if text is None:
                continue
            blended.append(RetrievedChunk(id=cid, text=text, metadata=meta, score=score, sources=sources))
        blended.sort(key=lambda r: r.score, reverse=True)
        return blended[:top_k]


def _normalise(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}
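Usage sketch: in the service the corpus triples are pulled from Qdrant at startup (see the lifespan wiring below); here they're built from the same chunks you just indexed, with an illustrative path and query:
docs = chunk_markdown(parse_document(Path("docs/sample.pdf")), source="docs/sample.pdf")
store.index(docs)  # BM25 and Qdrant must cover the same chunks
retriever = HybridRetriever(store, [(d.id, d.text, d.metadata) for d in docs])
for chunk in retriever.retrieve("What does section 4.2 require?", top_k=5):
    print(f"{chunk.score:.2f} {chunk.sources} {chunk.metadata.get('source')}")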
Prompt Construction with Citations
The trick to citations is giving every chunk an ID the model can echo. Save as src/text_tool/rag.py:
"""RAG pipeline: question → retrieve → prompt → generate → structured answer."""
from __future__ import annotations
from pydantic import BaseModel, Field
from text_tool.hybrid_retrieval import HybridRetriever, RetrievedChunk
from text_tool.llm_client import LLMClient
class Citation(BaseModel):
chunk_id: str
text: str
class RagAnswer(BaseModel):
answer: str = Field(..., description="Final answer in 1-3 sentences.")
citations: list[str] = Field(..., description="IDs of chunks the answer used.")
SYSTEM = (
"You answer questions using only the provided context chunks. "
"Each chunk has an ID like [c01]. When you make a claim, cite the chunk you used. "
"If the context doesn't contain the answer, say so explicitly. "
"Output JSON: {answer: str, citations: [chunk_id, ...]}."
)
def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
pieces = [SYSTEM, ""]
pieces.append("Context:")
for i, ch in enumerate(chunks):
pieces.append(f"[c{i:02d}] (source: {ch.metadata.get('source', '?')})\n{ch.text}")
pieces.append("")
pieces.append(f"Question: {question}")
return "\n".join(pieces)
def answer(question: str, retriever: HybridRetriever, client: LLMClient,
*, top_k: int = 5) -> tuple[RagAnswer, list[Citation]]:
chunks = retriever.retrieve(question, top_k=top_k)
if not chunks:
return RagAnswer(answer="I don't have any relevant context for this question.", citations=[]), []
prompt = build_prompt(question, chunks)
raw = client.complete_structured(prompt, RagAnswer, max_tokens=600)
id_map = {f"c{i:02d}": chunks[i] for i in range(len(chunks))}
full_citations = [
Citation(chunk_id=cid, text=id_map[cid].text)
for cid in raw.citations if cid in id_map
]
return raw, full_citations
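Outside the service the whole pipeline runs in a handful of lines. A sketch, assuming get_client (the Part 2 wrapper factory) lives in the llm_client module and retriever is built as above:
from text_tool.llm_client import get_client  # assumed location of the Part 2 factory
from text_tool.rag import answer

raw, cites = answer("What is FastAPI good for?", retriever, get_client(), top_k=4)
print(raw.answer)
for c in cites:
    print(f"[{c.chunk_id}] {c.text[:80]}")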
Wire It into the FastAPI Service
Add to server.py:
from text_tool.hybrid_retrieval import HybridRetriever
from text_tool.rag import answer as rag_answer, Citation


# In lifespan:
@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.client = get_client()
    app.state.store = VectorStore(db_path="./qdrant_data")
    # Build the hybrid retriever from whatever's in the store at startup.
    raw_points, _ = app.state.store._client.scroll(collection_name="documents", limit=10000)
    corpus = [(str(p.id), p.payload["text"],
               {k: v for k, v in p.payload.items() if k != "text"}) for p in raw_points]
    app.state.retriever = HybridRetriever(app.state.store, corpus)
    yield


class RagRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=2000)
    top_k: int = Field(5, ge=1, le=20)


class RagResponse(BaseModel):
    answer: str
    citations: list[Citation]


@app.post("/rag", response_model=RagResponse)
@limiter.limit("30/minute")
async def rag(request: Request, body: RagRequest, _key: str = Depends(get_caller_key)):
    raw, cites = await run_in_thread(
        rag_answer, body.question, request.app.state.retriever, request.app.state.client,
        top_k=body.top_k,
    )
    return RagResponse(answer=raw.answer, citations=cites)
Worked example:
curl -X POST http://localhost:8000/rag \
-H "X-API-Key: dev-key-001" -H "Content-Type: application/json" \
-d '{"question":"What is FastAPI good for?","top_k":4}'
Re-ranking (Optional but Often Worth It)
A cross-encoder re-ranks the top-k from hybrid retrieval before they enter the prompt. It's an extra ~50ms per query and noticeably tightens precision when the corpus is large. Add it only after you've measured retrieval quality on a held-out set — otherwise you're optimising blind.
# Needs: uv add sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# query is the user question; chunks come from retriever.retrieve()
pairs = [(query, c.text) for c in chunks]
scores = reranker.predict(pairs)
chunks = [c for _, c in sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)]
Evaluation
If you don't measure, you're guessing. Build a small held-out set:
EVAL = [
    {"q": "What is FastAPI good for?", "expected_keywords": ["async", "Pydantic", "auto-documentation"]},
    {"q": "Why is the parser important in RAG?", "expected_keywords": ["multi-column", "scrambled", "hallucinate"]},
]


def precision_at_k(retriever, k=5):
    """Did the retriever surface a chunk containing any expected keyword?"""
    hits = 0
    for case in EVAL:
        chunks = retriever.retrieve(case["q"], top_k=k)
        text = " ".join(c.text.lower() for c in chunks)
        if any(kw.lower() in text for kw in case["expected_keywords"]):
            hits += 1
    return hits / len(EVAL)
Run it before and after every retrieval change. RAGAS gives a fancier suite (faithfulness, answer relevancy, context precision/recall) but the simple precision@k loop above catches 80% of regressions.
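A before/after run is two calls. For example, to check whether the hybrid blend beats dense-only retrieval, zero out BM25's contribution through a small adapter (illustrative, not part of the module):
class DenseOnly:
    """Same retrieve() signature, but with BM25's weight zeroed out."""
    def __init__(self, retriever):
        self._retriever = retriever

    def retrieve(self, query, *, top_k=5):
        return self._retriever.retrieve(query, top_k=top_k, dense_weight=1.0)

print("dense-only:", precision_at_k(DenseOnly(retriever), k=5))
print("hybrid:    ", precision_at_k(retriever, k=5))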
When RAG Is the Wrong Tool
- Small corpus. If everything fits in the prompt, just stuff it — cheaper, simpler, no parser bugs
- Structured data. SQL or filtered metadata search beats embeddings for tabular questions ("how many users signed up last month?")
- Tabular reasoning. Code execution against a dataframe will reliably outperform embeddings on math-y questions
- Single-document QA. Long-context models can swallow a 200-page document; the hassle of RAG isn't worth it
Key Takeaways
- Naive RAG works in five lines of code and breaks in five places. The interesting work is everywhere else
- Chunking by markdown headers beats fixed-size every time when source structure exists. Keep chunks under ~1500 chars; overlap on recursive fallback
- Hybrid retrieval (dense + BM25) catches the queries either method alone misses. Start at 0.7/0.3, tune against a real query set
- Citations need chunk IDs the model can quote back. Pydantic-validate the structured output
- Validate parser output against source documents before tuning anything else (see the parser-bug post)
- Build a tiny eval set early. Precision@k catches most regressions without any machinery
- RAG isn't the answer to every retrieval problem. Sometimes the answer is SQL, code execution, or just a long-context prompt
Next Up
Part 6 starts the fine-tuning track. RAG gives the model knowledge it doesn't have. Fine-tuning gives it behaviour it doesn't have. They solve different problems and they compose.