Hardware: 8 GB RAM. A CPU is enough. Python: 3.11+.
RAG Demos vs RAG in Production
RAG demos look like magic. RAG in production looks like 200 lines of glue code, two failure modes you didn't anticipate, and one parser bug. Most RAG tutorials skip the parts that actually matter.
What we'll build: a /rag endpoint that takes a question, retrieves relevant chunks via hybrid retrieval, builds a grounded prompt with citation IDs, calls the LLM via the Part 2 wrapper, and returns the answer plus citations.
The Minimal RAG Loop (Working but Bad)
def naive_rag(question: str, store: VectorStore, client) -> str:
    hits = store.search(question, top_k=5)
    context = "\n\n".join(h.text for h in hits)
    prompt = f"Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {question}"
    return client.complete(prompt, max_tokens=400).text
This works. It also has at least four problems we'll fix:
- No chunking strategy — the indexed "documents" might be paragraphs, sentences, or whole files
- Pure dense retrieval — misses queries that depend on exact terms (section numbers, function names, IDs)
- No citation — the model says things that look like answers, but you can't tell which retrieved chunk a claim came from
- No eval — no way to know whether changes make it better or worse
Chunking
Chunking strategies, ranked from worst to best for most cases:
| Strategy | Verdict |
|---|---|
| Fixed-size by characters | Bad — cuts mid-sentence, loses structure |
| Sentence splitter | Okay — respects sentence boundaries, but chunks are usually too small to carry enough context |
| Recursive character splitter | Good baseline — respects paragraphs and sentences |
| Markdown header splitter | Best when source is structured — chunks are coherent sections |
Install splitters and a robust PDF parser:
uv add langchain-text-splitters==0.3.0 pymupdf4llm==0.0.20 rank-bm25==0.2.2
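To see what the header splitter buys you, here's a minimal sketch (the sample markdown is made up for illustration). Each chunk comes back with its heading trail attached as metadata, which is exactly what you want to carry through to citations later:
from langchain_text_splitters import MarkdownHeaderTextSplitter

md = "# Guide\n\n## Install\n\nRun uv sync.\n\n## Usage\n\nPOST a question to /rag."
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")], strip_headers=False
)
for chunk in splitter.split_text(md):
    # metadata records the heading trail, e.g. {'h1': 'Guide', 'h2': 'Install'}
    print(chunk.metadata, "->", chunk.page_content[:40])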
The Parser Trap (Why pymupdf4llm)
Most PDF parsers mishandle multi-column layouts. Chunks come out with text from adjacent columns interleaved — syntactically fluent, semantically scrambled. Embeddings index it cleanly, retrieval looks fine, then the model produces confidently wrong answers because the chunk it cited literally doesn't say what the citation claims.
I wrote a whole post about this exact bug: Building a Hybrid-RAG Assistant That Doesn't Hallucinate Statute. Short version: use pymupdf4llm; convert PDFs to Markdown; chunk with MarkdownHeaderTextSplitter. Always validate chunks against the source document before tuning anything else.
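The cheapest validation step: convert one document and read the markdown before indexing anything. A quick sketch (the path is illustrative):
import pymupdf4llm

md = pymupdf4llm.to_markdown("docs/sample.pdf")  # illustrative path
# Eyeball the first ~40 lines against the source PDF: are the headings in order,
# is column text interleaved, did tables survive?
print("\n".join(md.splitlines()[:40]))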
Indexing Pipeline
Save as src/text_tool/indexing.py:
"""Parse documents, chunk by markdown headers, index in Qdrant."""
from __future__ import annotations
from pathlib import Path
from uuid import uuid4
from typing import Iterable
import pymupdf4llm
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from text_tool.vector_store import Document, VectorStore
HEADERS = [("#", "h1"), ("##", "h2"), ("###", "h3")]
def parse_document(path: Path) -> str:
"""Convert PDF or text to markdown."""
if path.suffix.lower() == ".pdf":
return pymupdf4llm.to_markdown(str(path))
return path.read_text(encoding="utf-8")
def chunk_markdown(md: str, *, source: str, max_chars: int = 1500) -> list[Document]:
"""Header-aware chunking, with a recursive fallback for sections that are too large."""
header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=HEADERS, strip_headers=False)
header_chunks = header_splitter.split_text(md)
fallback = RecursiveCharacterTextSplitter(chunk_size=max_chars, chunk_overlap=120)
out: list[Document] = []
for hc in header_chunks:
text = hc.page_content
meta = {"source": source, **hc.metadata}
if len(text) <= max_chars:
out.append(Document(id=str(uuid4()), text=text, metadata={k: str(v) for k, v in meta.items()}))
else:
for piece in fallback.split_text(text):
out.append(Document(id=str(uuid4()), text=piece, metadata={k: str(v) for k, v in meta.items()}))
return out
def index_documents(store: VectorStore, paths: Iterable[Path]) -> int:
docs: list[Document] = []
for p in paths:
md = parse_document(p)
docs.extend(chunk_markdown(md, source=str(p)))
return store.index(docs)
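Running the pipeline is a few lines. A sketch, assuming the VectorStore from the earlier part and an illustrative docs/ folder:
from pathlib import Path

from text_tool.indexing import index_documents
from text_tool.vector_store import VectorStore

store = VectorStore(db_path="./qdrant_data")
# assumes store.index() returns the number of points written
n_chunks = index_documents(store, Path("docs").glob("**/*.pdf"))
print(f"Indexed {n_chunks} chunks")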
Hybrid Retrieval
Dense retrieval misses lexical anchors (section numbers, function names, exact phrases). BM25 misses paraphrase. Hybrid solves both. Start with a 0.7/0.3 dense-to-BM25 blend — tune later against a real query set.
Save as src/text_tool/hybrid_retrieval.py:
"""Combine dense (Qdrant) and BM25 retrieval scores."""
from __future__ import annotations
from dataclasses import dataclass
from typing import Iterable
from rank_bm25 import BM25Okapi
from text_tool.vector_store import VectorStore, SearchHit
@dataclass
class RetrievedChunk:
id: str
text: str
metadata: dict[str, str]
score: float
sources: tuple[str, ...] # ("dense",) ("bm25",) or ("dense","bm25")
class HybridRetriever:
"""Build a BM25 index over the same corpus as Qdrant; score-blend at query time."""
def __init__(self, store: VectorStore, corpus: Iterable[tuple[str, str, dict[str, str]]]) -> None:
# corpus is iterable of (id, text, metadata)
self._store = store
self._ids: list[str] = []
self._texts: list[str] = []
self._meta: list[dict[str, str]] = []
for cid, text, meta in corpus:
self._ids.append(cid); self._texts.append(text); self._meta.append(meta)
tokenised = [t.lower().split() for t in self._texts]
self._bm25 = BM25Okapi(tokenised) if tokenised else None
def retrieve(self, query: str, *, top_k: int = 5,
dense_weight: float = 0.7) -> list[RetrievedChunk]:
bm25_weight = 1.0 - dense_weight
# Dense
dense_hits = self._store.search(query, top_k=top_k * 3)
dense_scores = {h.id: h.score for h in dense_hits}
# Min-max to [0, 1] so weights are comparable
dense_scores = _normalise(dense_scores)
# BM25
bm25_scores: dict[str, float] = {}
if self._bm25:
scores = self._bm25.get_scores(query.lower().split())
top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: top_k * 3]
bm25_scores = {self._ids[i]: float(scores[i]) for i in top_idx}
bm25_scores = _normalise(bm25_scores)
all_ids = set(dense_scores) | set(bm25_scores)
blended: list[RetrievedChunk] = []
for cid in all_ids:
score = dense_weight * dense_scores.get(cid, 0.0) + bm25_weight * bm25_scores.get(cid, 0.0)
sources = tuple([s for s, m in (("dense", dense_scores), ("bm25", bm25_scores)) if cid in m])
            # Pull text/metadata from the dense hit if present, otherwise from the BM25 corpus.
            idx = self._ids.index(cid) if cid in self._ids else -1
            text = next((h.text for h in dense_hits if h.id == cid), self._texts[idx] if idx >= 0 else None)
            meta = next((h.metadata for h in dense_hits if h.id == cid), self._meta[idx] if idx >= 0 else {})
            if text is None:
                continue
            blended.append(RetrievedChunk(id=cid, text=text, metadata=meta, score=score, sources=sources))
        blended.sort(key=lambda r: r.score, reverse=True)
        return blended[:top_k]


def _normalise(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}
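Usage sketch: in the service the corpus triples are pulled from Qdrant at startup (see the lifespan wiring below); here they're built from the same chunks you just indexed, with an illustrative path and query:
docs = chunk_markdown(parse_document(Path("docs/sample.pdf")), source="docs/sample.pdf")
store.index(docs)  # BM25 and Qdrant must cover the same chunks
retriever = HybridRetriever(store, [(d.id, d.text, d.metadata) for d in docs])
for chunk in retriever.retrieve("What does section 4.2 require?", top_k=5):
    print(f"{chunk.score:.2f} {chunk.sources} {chunk.metadata.get('source')}")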
Prompt Construction with Citations
The trick to citations is giving every chunk an ID the model can echo. Save as src/text_tool/rag.py:
"""RAG pipeline: question → retrieve → prompt → generate → structured answer."""
from __future__ import annotations
from pydantic import BaseModel, Field
from text_tool.hybrid_retrieval import HybridRetriever, RetrievedChunk
from text_tool.llm_client import LLMClient
class Citation(BaseModel):
chunk_id: str
text: str
class RagAnswer(BaseModel):
answer: str = Field(..., description="Final answer in 1-3 sentences.")
citations: list[str] = Field(..., description="IDs of chunks the answer used.")
SYSTEM = (
"You answer questions using only the provided context chunks. "
"Each chunk has an ID like [c01]. When you make a claim, cite the chunk you used. "
"If the context doesn't contain the answer, say so explicitly. "
"Output JSON: {answer: str, citations: [chunk_id, ...]}."
)
def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
pieces = [SYSTEM, ""]
pieces.append("Context:")
for i, ch in enumerate(chunks):
pieces.append(f"[c{i:02d}] (source: {ch.metadata.get('source', '?')})\n{ch.text}")
pieces.append("")
pieces.append(f"Question: {question}")
return "\n".join(pieces)
def answer(question: str, retriever: HybridRetriever, client: LLMClient,
*, top_k: int = 5) -> tuple[RagAnswer, list[Citation]]:
chunks = retriever.retrieve(question, top_k=top_k)
if not chunks:
return RagAnswer(answer="I don't have any relevant context for this question.", citations=[]), []
prompt = build_prompt(question, chunks)
raw = client.complete_structured(prompt, RagAnswer, max_tokens=600)
id_map = {f"c{i:02d}": chunks[i] for i in range(len(chunks))}
full_citations = [
Citation(chunk_id=cid, text=id_map[cid].text)
for cid in raw.citations if cid in id_map
]
return raw, full_citations
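Outside the service the whole pipeline runs in a handful of lines. A sketch, assuming get_client (the Part 2 wrapper factory) lives in the llm_client module and retriever is built as above:
from text_tool.llm_client import get_client  # assumed location of the Part 2 factory
from text_tool.rag import answer

raw, cites = answer("What is FastAPI good for?", retriever, get_client(), top_k=4)
print(raw.answer)
for c in cites:
    print(f"[{c.chunk_id}] {c.text[:80]}")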
Wire It into the FastAPI Service
Add to server.py:
from text_tool.hybrid_retrieval import HybridRetriever
from text_tool.rag import answer as rag_answer, Citation


# In lifespan:
@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.client = get_client()
    app.state.store = VectorStore(db_path="./qdrant_data")
    # Build the hybrid retriever from whatever's in the store at startup.
    raw_points, _ = app.state.store._client.scroll(collection_name="documents", limit=10000)
    corpus = [(str(p.id), p.payload["text"],
               {k: v for k, v in p.payload.items() if k != "text"}) for p in raw_points]
    app.state.retriever = HybridRetriever(app.state.store, corpus)
    yield


class RagRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=2000)
    top_k: int = Field(5, ge=1, le=20)


class RagResponse(BaseModel):
    answer: str
    citations: list[Citation]


@app.post("/rag", response_model=RagResponse)
@limiter.limit("30/minute")
async def rag(request: Request, body: RagRequest, _key: str = Depends(get_caller_key)):
    raw, cites = await run_in_thread(
        rag_answer, body.question, request.app.state.retriever, request.app.state.client,
        top_k=body.top_k,
    )
    return RagResponse(answer=raw.answer, citations=cites)
Worked example:
curl -X POST http://localhost:8000/rag \
-H "X-API-Key: dev-key-001" -H "Content-Type: application/json" \
-d '{"question":"What is FastAPI good for?","top_k":4}'
Re-ranking (Optional but Often Worth It)
A cross-encoder re-ranks the top-k from hybrid retrieval before they enter the prompt. It's an extra ~50ms per query and noticeably tightens precision when the corpus is large. Add it only after you've measured retrieval quality on a held-out set — otherwise you're optimising blind.
# Needs: uv add sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# query is the user question; chunks come from retriever.retrieve()
pairs = [(query, c.text) for c in chunks]
scores = reranker.predict(pairs)
chunks = [c for _, c in sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)]
Evaluation
If you don't measure, you're guessing. Build a small held-out set:
EVAL = [
    {"q": "What is FastAPI good for?", "expected_keywords": ["async", "Pydantic", "auto-documentation"]},
    {"q": "Why is the parser important in RAG?", "expected_keywords": ["multi-column", "scrambled", "hallucinate"]},
]


def precision_at_k(retriever, k=5):
    """Did the retriever surface a chunk containing any expected keyword?"""
    hits = 0
    for case in EVAL:
        chunks = retriever.retrieve(case["q"], top_k=k)
        text = " ".join(c.text.lower() for c in chunks)
        if any(kw.lower() in text for kw in case["expected_keywords"]):
            hits += 1
    return hits / len(EVAL)
Run it before and after every retrieval change. RAGAS gives a fancier suite (faithfulness, answer relevancy, context precision/recall) but the simple precision@k loop above catches 80% of regressions.
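A before/after run is two calls. For example, to check whether the hybrid blend beats dense-only retrieval, zero out BM25's contribution through a small adapter (illustrative, not part of the module):
class DenseOnly:
    """Same retrieve() signature, but with BM25's weight zeroed out."""
    def __init__(self, retriever):
        self._retriever = retriever

    def retrieve(self, query, *, top_k=5):
        return self._retriever.retrieve(query, top_k=top_k, dense_weight=1.0)

print("dense-only:", precision_at_k(DenseOnly(retriever), k=5))
print("hybrid:    ", precision_at_k(retriever, k=5))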
When RAG Is the Wrong Tool
- Small corpus. If everything fits in the prompt, just stuff it — cheaper, simpler, no parser bugs
- Structured data. SQL or filtered metadata search beats embeddings for tabular questions ("how many users signed up last month?")
- Tabular reasoning. Code execution against a dataframe will reliably outperform embeddings on math-y questions
- Single-document QA. Long-context models can swallow a 200-page document; the hassle of RAG isn't worth it
Key Takeaways
- Naive RAG works in five lines of code and breaks in five places. The interesting work is everywhere else
- Chunking by markdown headers beats fixed-size every time when source structure exists. Keep chunks under ~1500 chars; overlap on recursive fallback
- Hybrid retrieval (dense + BM25) catches the queries either method alone misses. Start at 0.7/0.3, tune against a real query set
- Citations need chunk IDs the model can quote back. Pydantic-validate the structured output
- Validate parser output against source documents before tuning anything else (see the parser-bug post)
- Build a tiny eval set early. Precision@k catches most regressions without any machinery
- RAG isn't the answer to every retrieval problem. Sometimes the answer is SQL, code execution, or just a long-context prompt
Next Up
Part 6 starts the fine-tuning track. RAG gives the model knowledge it doesn't have. Fine-tuning gives it behaviour it doesn't have. They solve different problems and they compose.