Hardware: 8 GB RAM minimum. CPU is fine for the embedding sizes we use here. Python: 3.11+.
The Pitch
Exact-keyword search matches "machine learning" but misses "ML"; embeddings match both. That's the whole pitch. The interesting questions are which embedding model to use, how to store the vectors, and how to retrieve them quickly.
By the end of this part, the service from Part 3 has a /search endpoint that semantically searches a corpus of documents. We use sentence-transformers for embeddings and Qdrant in file-backed mode as the vector store.
What an Embedding Actually Is
An embedding is a fixed-size vector (typically 384, 768, or 1024 floats) where similar meaning → similar vectors. Two examples and their cosine similarity score:
from sentence_transformers import SentenceTransformer
import numpy as np
m = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
a = m.encode("Machine learning models need lots of data.")
b = m.encode("ML algorithms are data-hungry.")
c = m.encode("I had eggs for breakfast.")
def cos(x, y): return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"a vs b: {cos(a, b):.3f}") # ~0.78 — close meaning, close vectors
print(f"a vs c: {cos(a, c):.3f}") # ~0.05 — unrelated, near-orthogonal
That's the whole intuition. Skip the linear-algebra deep dive unless you're training the embedding model itself. For using one, this is enough.
Choosing an Embedding Model
| Model | Dim | Size | Speed | MTEB | When |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 90 MB | fastest | 56 | Default for prototyping; English only |
| BAAI/bge-large-en-v1.5 | 1024 | 1.3 GB | medium | 64 | Quality matters; English |
| Qwen3-Embedding-0.6B | 1024 | 1.2 GB | medium | 67 | Multilingual; current SOTA-ish |
| OpenAI text-embedding-3-small | 1536 | API | API latency | 62 | Don't want to host; happy to pay per token |
Rule of thumb: start with all-MiniLM-L6-v2. If quality is the bottleneck (you're missing relevant results), upgrade to BGE or Qwen. If you can't host a model, use the API. Don't agonise over the choice on day one — the chunking strategy (Part 5) matters more than the embedding model.
Distance Metrics
- Cosine similarity. Measures the angle between vectors and ignores magnitude. The default for most embedding models because they're trained with a cosine objective
- Dot product. Same as cosine if vectors are unit-normalised. Faster on some hardware
- Euclidean. Measures distance in space. Almost never the right choice for text embeddings — embeddings are direction-meaningful, not magnitude-meaningful
Use cosine. Move on.
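That dot-product equivalence is worth seeing once, because it's why normalize_embeddings=True shows up throughout the code below. A quick check with random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=384)
w = rng.normal(size=384)
v /= np.linalg.norm(v)  # unit-normalise both vectors
w /= np.linalg.norm(w)

cosine = float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
dot = float(np.dot(v, w))
print(abs(cosine - dot) < 1e-12)  # True: identical once normalised
```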
Vector Stores
| Store | Hosted | File-backed local | Good for |
|---|---|---|---|
| Qdrant | yes | yes | One server or self-hosted; great DX |
| Pinecone | yes | no | "I never want to think about infra" |
| Weaviate | yes | partial | Hybrid search out-of-box |
| pgvector | self-hosted Postgres | n/a | You already have Postgres |
For this series we use file-backed Qdrant. It's a single process, zero config, the data lives in a directory you can tar up. When you outgrow it, the same client code talks to a hosted Qdrant cluster.
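Concretely, the only line that changes between the two deployments is the client constructor; the URL and API key below are placeholders:

```python
from qdrant_client import QdrantClient

# Local, file-backed: data lives in ./qdrant_data, no server process.
local = QdrantClient(path="./qdrant_data")

# Hosted cluster: same client class, same downstream calls.
hosted = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")
```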
Install
uv add sentence-transformers==3.4.0 qdrant-client==1.13.0 numpy==2.1.0
The Search Module
Save as src/text_tool/vector_store.py:
"""File-backed Qdrant + sentence-transformers wrapper."""
from __future__ import annotations
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, PointStruct, VectorParams, Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer
log = logging.getLogger(__name__)
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIM = 384
COLLECTION = "documents"
@dataclass
class Document:
id: str
text: str
metadata: dict[str, str]
@dataclass
class SearchHit:
id: str
score: float
text: str
metadata: dict[str, str]
class VectorStore:
def __init__(self, db_path: str = "./qdrant_data") -> None:
Path(db_path).mkdir(parents=True, exist_ok=True)
self._client = QdrantClient(path=db_path)
self._model: SentenceTransformer | None = None # lazy load
self._ensure_collection()
@property
def model(self) -> SentenceTransformer:
if self._model is None:
log.info("loading embedding model %s", EMBEDDING_MODEL)
self._model = SentenceTransformer(EMBEDDING_MODEL)
return self._model
def _ensure_collection(self) -> None:
existing = [c.name for c in self._client.get_collections().collections]
if COLLECTION not in existing:
self._client.create_collection(
collection_name=COLLECTION,
vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
)
def index(self, docs: Iterable[Document], batch_size: int = 64) -> int:
docs = list(docs)
if not docs: return 0
# Embed in batches so we don't OOM on large corpora.
n = 0
for i in range(0, len(docs), batch_size):
batch = docs[i:i + batch_size]
vectors = self.model.encode([d.text for d in batch], normalize_embeddings=True)
points = [
PointStruct(id=d.id, vector=v.tolist(), payload={"text": d.text, **d.metadata})
for d, v in zip(batch, vectors)
]
self._client.upsert(collection_name=COLLECTION, points=points)
n += len(batch)
log.info("indexed %d docs", n)
return n
def search(
self, query: str, *, top_k: int = 5, score_threshold: float | None = None,
filter_metadata: dict[str, str] | None = None,
) -> list[SearchHit]:
qv = self.model.encode(query, normalize_embeddings=True)
qfilter = None
if filter_metadata:
qfilter = Filter(
must=[FieldCondition(key=k, match=MatchValue(value=v))
for k, v in filter_metadata.items()]
)
hits = self._client.search(
collection_name=COLLECTION,
query_vector=qv.tolist(),
limit=top_k,
score_threshold=score_threshold,
query_filter=qfilter,
)
return [
SearchHit(id=str(h.id), score=h.score,
text=h.payload["text"],
metadata={k: v for k, v in h.payload.items() if k != "text"})
for h in hits
]
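One consequence of the hard-coded EMBEDDING_DIM: a collection's vector size is fixed when it's created, so upgrading to a 1024-dim model from the table above means rebuilding the index. A sketch of the teardown step:

```python
from qdrant_client import QdrantClient

# The collection was created with size=384; a 1024-dim model won't fit.
# Delete it, update the constants, then re-run the indexing script.
QdrantClient(path="./qdrant_data").delete_collection("documents")
```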
Index a Sample Corpus
Use this script to index ~50 short documents (swap in your own corpus once you have one). Note the uuid4() IDs: Qdrant point IDs must be UUIDs or unsigned integers, not arbitrary strings. Save as scripts/build_index.py:
from uuid import uuid4
from text_tool.vector_store import VectorStore, Document
CORPUS = [
("Machine learning is a subfield of AI focused on data-driven prediction.", "ai-101"),
("FastAPI is a modern Python web framework with async support.", "python-web"),
("Transformers use self-attention to process sequences in parallel.", "nlp"),
("Docker containers package applications with their dependencies.", "devops"),
("Postgres pgvector enables vector similarity search inside SQL.", "databases"),
# ... add more lines, ~50 total in the real script
]
def main():
store = VectorStore(db_path="./qdrant_data")
docs = [Document(id=str(uuid4()), text=t, metadata={"category": c}) for t, c in CORPUS]
n = store.index(docs)
print(f"indexed {n} documents")
if __name__ == "__main__":
main()
Run it once:
uv run python scripts/build_index.py
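Before touching the service, it's worth a quick sanity check from a REPL (this assumes the index build above succeeded):

```python
from text_tool.vector_store import VectorStore

store = VectorStore(db_path="./qdrant_data")
for hit in store.search("python web frameworks", top_k=3):
    print(f"{hit.score:.3f}  [{hit.metadata.get('category')}]  {hit.text}")
```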
Add /search to the FastAPI Service
Patch server.py from Part 3 with these additions:
from text_tool.vector_store import VectorStore
# In lifespan, alongside the LLM client:
@asynccontextmanager
async def lifespan(app: FastAPI):
app.state.client = get_client()
app.state.store = VectorStore(db_path="./qdrant_data")
yield
class SearchRequest(BaseModel):
query: str = Field(..., min_length=1, max_length=2000)
top_k: int = Field(5, ge=1, le=50)
score_threshold: float | None = Field(None, ge=0.0, le=1.0)
category: str | None = None
class SearchHitOut(BaseModel):
id: str
score: float
text: str
metadata: dict[str, str]
class SearchResponse(BaseModel):
hits: list[SearchHitOut]
@app.post("/search", response_model=SearchResponse)
@limiter.limit("120/minute")
async def search(request: Request, body: SearchRequest, _key: str = Depends(get_caller_key)):
store = request.app.state.store
filter_md = {"category": body.category} if body.category else None
hits = await run_in_thread(
store.search, body.query,
top_k=body.top_k, score_threshold=body.score_threshold,
filter_metadata=filter_md,
)
return SearchResponse(hits=[SearchHitOut(**h.__dict__) for h in hits])
Restart the service and hit it:
curl -X POST http://localhost:8000/search \
-H "X-API-Key: dev-key-001" \
-H "Content-Type: application/json" \
-d '{"query":"how does FastAPI handle async?","top_k":3}'
You'll get back semantically ranked hits. For the keyword-free demo from the pitch, try "ML models are data-hungry": it should still rank the machine-learning document first even though neither "machine" nor "learning" appears in the query.
Lazy Loading and Why It Matters
The VectorStore above doesn't load the embedding model until the first call that needs it. That's deliberate: when callers hit /complete but never /search, you never pay the ~90 MB memory cost of the MiniLM weights. It also keeps boot fast: the service responds to /health in under 100 ms even though loading the model takes several seconds.
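If you'd rather pay that cost at boot (say, to avoid a slow first /search), touch the property in the lifespan hook. A variant of the lifespan patch above:

```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.client = get_client()
    app.state.store = VectorStore(db_path="./qdrant_data")
    _ = app.state.store.model  # force the lazy load now instead of on first search
    yield
```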
Batch Sizes That Don't OOM
The index method batches at 64 by default. For a 384-dim model on CPU, this is comfortable. Things to know:
- Larger batch → better throughput, more peak memory
- On a 1024-dim model with long docs, drop batch size to 16 or 32
- If you OOM mid-batch, everything upserted so far is already in Qdrant; rerun from the failed offset rather than starting over (sketch below)
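That last bullet, made concrete. A minimal sketch; resume_index and failed_offset are hypothetical names, and it assumes you still hold the original docs list:

```python
from text_tool.vector_store import Document, VectorStore

def resume_index(docs: list[Document], failed_offset: int, batch_size: int = 16) -> int:
    """Resume indexing after a crash, starting just before the failure point.

    upsert is keyed by point ID, so re-processing an overlapping range
    overwrites identical points rather than duplicating them.
    """
    store = VectorStore(db_path="./qdrant_data")
    start = max(failed_offset - batch_size, 0)  # back up one batch to be safe
    return store.index(docs[start:], batch_size=batch_size)
```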
The Chunking Setup for Part 5
Notice we embedded one-sentence "documents". On a real corpus, "embed the whole document" rarely works because:
- Long documents have multiple topics; their embedding is the average and matches nothing well
- Models have token limits (256 for MiniLM under its sentence-transformers config, often 8192 for newer models), and longer inputs are truncated silently
- You want to retrieve the relevant passage, not the document — so you can fit it in the LLM's prompt
Part 5 is the chunking strategy in detail. The teaser: chunk on header boundaries with MarkdownHeaderTextSplitter when you can. Fixed-size chunks with overlap when you can't. Try 500-1000 tokens per chunk as a starting point.
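As a teaser of the fixed-size variant, a minimal sketch; character counts stand in for tokens here, and Part 5 does this properly with a real tokenizer:

```python
def chunk(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    # Slide a window of `size` chars, stepping so adjacent chunks share `overlap`.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

print(len(chunk("x" * 5000)))  # 3 chunks, each sharing 200 chars with the next
```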
Key Takeaways
- Embeddings are fixed-size vectors where similar meaning maps to similar vectors. Cosine similarity is the default metric
- Start with all-MiniLM-L6-v2. Upgrade to BGE or Qwen3 only if quality is the bottleneck
- File-backed Qdrant is right for one host. The same client code scales to a Qdrant cluster when you outgrow it
- Lazy-load the embedding model. Boot stays fast; memory cost paid only when needed
- Always normalise embeddings (normalize_embeddings=True). It makes cosine and dot product equivalent and avoids silent bugs
- Batch your indexing. 64 is a reasonable default for small models; drop it for bigger ones
Next Up
Part 5 combines retrieval and generation into a full RAG system — chunking, hybrid retrieval, prompt construction, citations, and evaluation. The bug at the end of that post is the same parser bug from Building a Hybrid-RAG Assistant.