Hardware: 8 GB RAM minimum. CPU is fine for the embedding sizes we use here. Python: 3.11+.
The Pitch
Exact-keyword search matches "machine learning" but misses "ML"; embeddings match both. That's the whole pitch. The interesting questions are which embedding model to use, how to store the vectors, and how to retrieve them quickly.
By the end of this part, the service from Part 3 has a /search endpoint that semantically searches a corpus of documents. We use sentence-transformers for embeddings and Qdrant in file-backed mode as the vector store.
What an Embedding Actually Is
An embedding is a fixed-size vector (typically 384, 768, or 1024 floats) where similar meaning → similar vectors. Two examples and their cosine similarity score:
from sentence_transformers import SentenceTransformer
import numpy as np
m = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
a = m.encode("Machine learning models need lots of data.")
b = m.encode("ML algorithms are data-hungry.")
c = m.encode("I had eggs for breakfast.")
def cos(x, y): return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"a vs b: {cos(a, b):.3f}") # ~0.78 — close meaning, close vectors
print(f"a vs c: {cos(a, c):.3f}") # ~0.05 — unrelated, near-orthogonal
That's the whole intuition. Skip the linear-algebra deep dive unless you're training the embedding model itself. For using one, this is enough.
Choosing an Embedding Model
| Model | Dim | Size | Speed | MTEB | When |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 90 MB | fastest | 56 | Default for prototyping; English only |
| BAAI/bge-large-en-v1.5 | 1024 | 1.3 GB | medium | 64 | Quality matters; English |
| Qwen3-Embedding-0.6B | 1024 | 1.2 GB | medium | 67 | Multilingual; current SOTA-ish |
| OpenAI text-embedding-3-small | 1536 | API | API latency | 62 | Don't want to host; happy to pay per token |
Rule of thumb: start with all-MiniLM-L6-v2. If quality is the bottleneck (you're missing relevant results), upgrade to BGE or Qwen. If you can't host a model, use the API. Don't agonise over the choice on day one — the chunking strategy (Part 5) matters more than the embedding model.
Distance Metrics
- Cosine similarity. Measures the angle between vectors and ignores magnitude. The default for most embedding models because they're trained with a cosine objective
- Dot product. Same as cosine if vectors are unit-normalised. Faster on some hardware
- Euclidean. Measures distance in space. Almost never the right choice for text embeddings — embeddings are direction-meaningful, not magnitude-meaningful
Use cosine. Move on.
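That dot-product equivalence is worth seeing once, because it's why normalize_embeddings=True shows up throughout the code below. A quick check with random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=384)
w = rng.normal(size=384)
v /= np.linalg.norm(v)  # unit-normalise both vectors
w /= np.linalg.norm(w)

cosine = float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
dot = float(np.dot(v, w))
print(abs(cosine - dot) < 1e-12)  # True: identical once normalised
```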
Vector Stores
| Store | Hosted | File-backed local | Good for |
|---|---|---|---|
| Qdrant | yes | yes | One server or self-hosted; great DX |
| Pinecone | yes | no | "I never want to think about infra" |
| Weaviate | yes | partial | Hybrid search out-of-box |
| pgvector | self-hosted Postgres | n/a | You already have Postgres |
For this series we use file-backed Qdrant. It's a single process, zero config, the data lives in a directory you can tar up. When you outgrow it, the same client code talks to a hosted Qdrant cluster.
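Concretely, the only line that changes between the two deployments is the client constructor; the URL and API key below are placeholders:

```python
from qdrant_client import QdrantClient

# Local, file-backed: data lives in ./qdrant_data, no server process.
local = QdrantClient(path="./qdrant_data")

# Hosted cluster: same client class, same downstream calls.
hosted = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")
```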
Install
uv add sentence-transformers==3.4.0 qdrant-client==1.13.0 numpy==2.1.0
The Search Module
Save as src/text_tool/vector_store.py:
"""File-backed Qdrant + sentence-transformers wrapper."""
from __future__ import annotations
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, PointStruct, VectorParams, Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer
log = logging.getLogger(__name__)
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIM = 384
COLLECTION = "documents"
@dataclass
class Document:
id: str
text: str
metadata: dict[str, str]
@dataclass
class SearchHit:
id: str
score: float
text: str
metadata: dict[str, str]
class VectorStore:
def __init__(self, db_path: str = "./qdrant_data") -> None:
Path(db_path).mkdir(parents=True, exist_ok=True)
self._client = QdrantClient(path=db_path)
self._model: SentenceTransformer | None = None # lazy load
self._ensure_collection()
@property
def model(self) -> SentenceTransformer:
if self._model is None:
log.info("loading embedding model %s", EMBEDDING_MODEL)
self._model = SentenceTransformer(EMBEDDING_MODEL)
return self._model
def _ensure_collection(self) -> None:
existing = [c.name for c in self._client.get_collections().collections]
if COLLECTION not in existing:
self._client.create_collection(
collection_name=COLLECTION,
vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
)
def index(self, docs: Iterable[Document], batch_size: int = 64) -> int:
docs = list(docs)
if not docs: return 0
# Embed in batches so we don't OOM on large corpora.
n = 0
for i in range(0, len(docs), batch_size):
batch = docs[i:i + batch_size]
vectors = self.model.encode([d.text for d in batch], normalize_embeddings=True)
points = [
PointStruct(id=d.id, vector=v.tolist(), payload={"text": d.text, **d.metadata})
for d, v in zip(batch, vectors)
]
self._client.upsert(collection_name=COLLECTION, points=points)
n += len(batch)
log.info("indexed %d docs", n)
return n
def search(
self, query: str, *, top_k: int = 5, score_threshold: float | None = None,
filter_metadata: dict[str, str] | None = None,
) -> list[SearchHit]:
qv = self.model.encode(query, normalize_embeddings=True)
qfilter = None
if filter_metadata:
qfilter = Filter(
must=[FieldCondition(key=k, match=MatchValue(value=v))
for k, v in filter_metadata.items()]
)
hits = self._client.search(
collection_name=COLLECTION,
query_vector=qv.tolist(),
limit=top_k,
score_threshold=score_threshold,
query_filter=qfilter,
)
return [
SearchHit(id=str(h.id), score=h.score,
text=h.payload["text"],
metadata={k: v for k, v in h.payload.items() if k != "text"})
for h in hits
]
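One consequence of the hard-coded EMBEDDING_DIM: a collection's vector size is fixed when it's created, so upgrading to a 1024-dim model from the table above means rebuilding the index. A sketch of the teardown step:

```python
from qdrant_client import QdrantClient

# The collection was created with size=384; a 1024-dim model won't fit.
# Delete it, update the constants, then re-run the indexing script.
QdrantClient(path="./qdrant_data").delete_collection("documents")
```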
Index a Sample Corpus
Use this script to index ~50 short documents (swap in your own corpus once you have one). Note the uuid4() IDs: Qdrant point IDs must be UUIDs or unsigned integers, not arbitrary strings. Save as scripts/build_index.py:
from uuid import uuid4
from text_tool.vector_store import VectorStore, Document
CORPUS = [
("Machine learning is a subfield of AI focused on data-driven prediction.", "ai-101"),
("FastAPI is a modern Python web framework with async support.", "python-web"),
("Transformers use self-attention to process sequences in parallel.", "nlp"),
("Docker containers package applications with their dependencies.", "devops"),
("Postgres pgvector enables vector similarity search inside SQL.", "databases"),
# ... add more lines, ~50 total in the real script
]
def main():
store = VectorStore(db_path="./qdrant_data")
docs = [Document(id=str(uuid4()), text=t, metadata={"category": c}) for t, c in CORPUS]
n = store.index(docs)
print(f"indexed {n} documents")
if __name__ == "__main__":
main()
Run it once:
uv run python scripts/build_index.py
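Before touching the service, it's worth a quick sanity check from a REPL (this assumes the index build above succeeded):

```python
from text_tool.vector_store import VectorStore

store = VectorStore(db_path="./qdrant_data")
for hit in store.search("python web frameworks", top_k=3):
    print(f"{hit.score:.3f}  [{hit.metadata.get('category')}]  {hit.text}")
```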
Add /search to the FastAPI Service
Patch server.py from Part 3 with these additions:
from text_tool.vector_store import VectorStore
# In lifespan, alongside the LLM client:
@asynccontextmanager
async def lifespan(app: FastAPI):
app.state.client = get_client()
app.state.store = VectorStore(db_path="./qdrant_data")
yield
class SearchRequest(BaseModel):
query: str = Field(..., min_length=1, max_length=2000)
top_k: int = Field(5, ge=1, le=50)
score_threshold: float | None = Field(None, ge=0.0, le=1.0)
category: str | None = None
class SearchHitOut(BaseModel):
id: str
score: float
text: str
metadata: dict[str, str]
class SearchResponse(BaseModel):
hits: list[SearchHitOut]
@app.post("/search", response_model=SearchResponse)
@limiter.limit("120/minute")
async def search(request: Request, body: SearchRequest, _key: str = Depends(get_caller_key)):
store = request.app.state.store
filter_md = {"category": body.category} if body.category else None
hits = await run_in_thread(
store.search, body.query,
top_k=body.top_k, score_threshold=body.score_threshold,
filter_metadata=filter_md,
)
return SearchResponse(hits=[SearchHitOut(**h.__dict__) for h in hits])
Restart the service and hit it:
curl -X POST http://localhost:8000/search \
-H "X-API-Key: dev-key-001" \
-H "Content-Type: application/json" \
-d '{"query":"how does FastAPI handle async?","top_k":3}'
You'll get back semantically ranked hits. For the keyword-free demo from the pitch, try "ML models are data-hungry": it should still rank the machine-learning document first even though neither "machine" nor "learning" appears in the query.
Lazy Loading and Why It Matters
The VectorStore above doesn't load the embedding model until the first call that needs it. That's deliberate: when callers hit /complete but never /search, you never pay the ~90 MB memory cost of the MiniLM weights. It also keeps boot fast: the service responds to /health in under 100 ms even though loading the model takes several seconds.
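If you'd rather pay that cost at boot (say, to avoid a slow first /search), touch the property in the lifespan hook. A variant of the lifespan patch above:

```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.client = get_client()
    app.state.store = VectorStore(db_path="./qdrant_data")
    _ = app.state.store.model  # force the lazy load now instead of on first search
    yield
```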
Batch Sizes That Don't OOM
The index method batches at 64 by default. For a 384-dim model on CPU, this is comfortable. Things to know:
- Larger batch → better throughput, more peak memory
- On a 1024-dim model with long docs, drop batch size to 16 or 32
- If you OOM mid-batch, everything upserted so far is already in Qdrant; rerun from the failed offset rather than starting over (sketch below)
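That last bullet, made concrete. A minimal sketch; resume_index and failed_offset are hypothetical names, and it assumes you still hold the original docs list:

```python
from text_tool.vector_store import Document, VectorStore

def resume_index(docs: list[Document], failed_offset: int, batch_size: int = 16) -> int:
    """Resume indexing after a crash, starting just before the failure point.

    upsert is keyed by point ID, so re-processing an overlapping range
    overwrites identical points rather than duplicating them.
    """
    store = VectorStore(db_path="./qdrant_data")
    start = max(failed_offset - batch_size, 0)  # back up one batch to be safe
    return store.index(docs[start:], batch_size=batch_size)
```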
The Chunking Setup for Part 5
Notice we embedded one-sentence "documents". On a real corpus, "embed the whole document" rarely works because:
- Long documents have multiple topics; their embedding is the average and matches nothing well
- Models have token limits (256 for MiniLM under its sentence-transformers config, often 8192 for newer models), and longer inputs are truncated silently
- You want to retrieve the relevant passage, not the document — so you can fit it in the LLM's prompt
Part 5 is the chunking strategy in detail. The teaser: chunk on header boundaries with MarkdownHeaderTextSplitter when you can. Fixed-size chunks with overlap when you can't. Try 500-1000 tokens per chunk as a starting point.
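As a teaser of the fixed-size variant, a minimal sketch; character counts stand in for tokens here, and Part 5 does this properly with a real tokenizer:

```python
def chunk(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    # Slide a window of `size` chars, stepping so adjacent chunks share `overlap`.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

print(len(chunk("x" * 5000)))  # 3 chunks, each sharing 200 chars with the next
```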
Key Takeaways
- Embeddings are fixed-size vectors where similar meaning maps to similar vectors. Cosine similarity is the default metric
- Start with all-MiniLM-L6-v2. Upgrade to BGE or Qwen3 only if quality is the bottleneck
- File-backed Qdrant is right for one host. The same client code scales to a Qdrant cluster when you outgrow it
- Lazy-load the embedding model. Boot stays fast; memory cost paid only when needed
- Always normalise embeddings (normalize_embeddings=True). It makes cosine and dot product equivalent and avoids silent bugs
- Batch your indexing. 64 is a reasonable default for small models; drop it for bigger ones
Next Up
Part 5 combines retrieval and generation into a full RAG system — chunking, hybrid retrieval, prompt construction, citations, and evaluation. The bug at the end of that post is the same parser bug from Building a Hybrid-RAG Assistant.