Hardware required: any laptop. Python: 3.11+. API key: you'll need either an OpenAI or Anthropic key, or both.
The Lesson Isn't the Four-Line Baseline
Calling an LLM API is pip install openai and four lines of code. That's not the lesson. The lesson is everything that goes wrong in production — rate limits, timeouts, malformed JSON, bills you didn't expect — and how to handle each.
This post is about building llm_client.py: a wrapper module that the rest of the series imports. By the end, calling an LLM looks like:
from text_tool.llm_client import get_client
client = get_client()
result = client.complete("Summarise this paragraph: ...", max_tokens=200)
And under the hood: retries on transient failures, optional streaming, optional structured output, automatic cost tracking, easy provider swap.
The Four-Line Baseline (and Why It Isn't Enough)
Install the SDKs:
uv add openai==2.5.0 anthropic==1.5.0 tenacity==9.0.0 pydantic==2.10.0 httpx==0.28.0
The naive call:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
This works once. It does not work in production. It has no timeout, no retries, no cost tracking, no error handling, no streaming, no way to swap to Anthropic without rewriting it.
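The usual first fix is to harden the call site in place. Roughly what that looks like, sketched here only to show the pattern you would end up repeating everywhere, not as a recommendation:

import time
from openai import OpenAI, RateLimitError, APITimeoutError

client = OpenAI(timeout=30.0)  # reads OPENAI_API_KEY from the environment

# Ad-hoc retry loop, duplicated at every call site
for attempt in range(3):
    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Hello"}],
        )
        break
    except (RateLimitError, APITimeoutError):
        time.sleep(2 ** attempt)  # crude backoff: 1s, 2s, 4s
else:
    raise RuntimeError("gave up after 3 attempts")
print(resp.choices[0].message.content)

Multiply that by every place you call the model and you have a maintenance problem. Write it once, behind an interface.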
The Wrapper Interface
I want one interface, two backends. The shape:
class LLMClient(Protocol):
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult: ...
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]: ...
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T: ...
Three methods cover ~95% of LLM use. complete for batch, stream for chat UX, complete_structured for anywhere the output goes into another piece of code.
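Concretely, the structured path looks like this from the caller's side (a sketch; TicketTriage is a made-up schema for illustration, and get_client is the factory built below):

from pydantic import BaseModel
from text_tool.llm_client import get_client

class TicketTriage(BaseModel):  # hypothetical schema, just for illustration
    category: str
    urgency: int
    summary: str

client = get_client()
triage = client.complete_structured(
    "Triage this support ticket: 'App crashes when I upload a PDF...'",
    schema=TicketTriage,
)
print(triage.category, triage.urgency)  # plain attribute access, already validated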
Build It: llm_client.py
Save this as src/text_tool/llm_client.py. It's longer than a snippet because the wrapper has to actually work; copy it whole.
"""Provider-agnostic LLM client. OpenAI + Anthropic. Retries, streaming, structured output."""
from __future__ import annotations
import json
import logging
import os
from dataclasses import dataclass
from typing import Iterator, Protocol, TypeVar
import httpx
from openai import OpenAI, APIError, APITimeoutError, RateLimitError
from anthropic import Anthropic, APIError as AnthropicAPIError
from pydantic import BaseModel
from tenacity import (
retry, stop_after_attempt, wait_exponential, retry_if_exception_type,
)
logger = logging.getLogger(__name__)
T = TypeVar("T", bound=BaseModel)
# Per-1M-token prices, USD. Update when models change.
PRICING = {
"gpt-4o-mini": {"in": 0.15, "out": 0.60},
"gpt-4o": {"in": 2.50, "out": 10.00},
"claude-haiku-4-5": {"in": 1.00, "out": 5.00},
"claude-sonnet-4-6": {"in": 3.00, "out": 15.00},
}
@dataclass
class CompletionResult:
text: str
model: str
input_tokens: int
output_tokens: int
@property
def cost_usd(self) -> float:
p = PRICING.get(self.model, {"in": 0.0, "out": 0.0})
return (self.input_tokens * p["in"] + self.output_tokens * p["out"]) / 1_000_000
@dataclass
class CostTracker:
"""Accumulates spend across a session. Singleton-friendly."""
calls: int = 0
input_tokens: int = 0
output_tokens: int = 0
cost_usd: float = 0.0
def record(self, r: CompletionResult) -> None:
self.calls += 1
self.input_tokens += r.input_tokens
self.output_tokens += r.output_tokens
self.cost_usd += r.cost_usd
def report(self) -> str:
return (f"calls={self.calls} in={self.input_tokens} out={self.output_tokens} "
f"cost=${self.cost_usd:.4f}")
class LLMClient(Protocol):
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult: ...
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]: ...
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T: ...
# ---------- OpenAI backend ----------
class OpenAIClient:
def __init__(self, model: str = "gpt-4o-mini", tracker: CostTracker | None = None) -> None:
self._client = OpenAI(
api_key=os.environ["OPENAI_API_KEY"],
timeout=httpx.Timeout(30.0, connect=5.0),
)
self._model = model
self._tracker = tracker or CostTracker()
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=20),
retry=retry_if_exception_type((RateLimitError, APITimeoutError, APIError)),
)
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult:
resp = self._client.chat.completions.create(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
)
result = CompletionResult(
text=resp.choices[0].message.content or "",
model=self._model,
input_tokens=resp.usage.prompt_tokens,
output_tokens=resp.usage.completion_tokens,
)
self._tracker.record(result)
logger.info("openai call: %s", self._tracker.report())
return result
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]:
with self._client.chat.completions.stream(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
) as stream:
for event in stream:
if event.type == "content.delta":
yield event.delta
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T:
resp = self._client.chat.completions.parse(
model=self._model,
messages=[{"role": "user", "content": prompt}],
response_format=schema,
max_tokens=max_tokens,
)
return resp.choices[0].message.parsed # type: ignore[return-value]
# ---------- Anthropic backend ----------
class AnthropicClient:
def __init__(self, model: str = "claude-haiku-4-5", tracker: CostTracker | None = None) -> None:
self._client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"], timeout=30.0)
self._model = model
self._tracker = tracker or CostTracker()
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=20),
retry=retry_if_exception_type(AnthropicAPIError),
)
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult:
resp = self._client.messages.create(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
)
result = CompletionResult(
text=resp.content[0].text if resp.content else "",
model=self._model,
input_tokens=resp.usage.input_tokens,
output_tokens=resp.usage.output_tokens,
)
self._tracker.record(result)
logger.info("anthropic call: %s", self._tracker.report())
return result
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]:
with self._client.messages.stream(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
) as stream:
for delta in stream.text_stream:
yield delta
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T:
# Anthropic doesn't have native parsed-response yet; do JSON-mode + parse.
instructions = (
f"Respond with JSON matching this schema:\n{schema.model_json_schema()}\n"
f"No prose, no code fences, just JSON."
)
resp = self._client.messages.create(
model=self._model,
messages=[{"role": "user", "content": f"{instructions}\n\n{prompt}"}],
max_tokens=max_tokens,
)
raw = resp.content[0].text.strip()
# Defensive: strip ```json fences if the model adds them
if raw.startswith("```"):
raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
return schema.model_validate(json.loads(raw))
# ---------- Factory ----------
_TRACKER = CostTracker()
def get_client(provider: str | None = None) -> LLMClient:
"""Pick a provider. Default: env LLM_PROVIDER, fallback OpenAI."""
provider = (provider or os.environ.get("LLM_PROVIDER", "openai")).lower()
if provider == "openai":
return OpenAIClient(tracker=_TRACKER)
if provider == "anthropic":
return AnthropicClient(tracker=_TRACKER)
raise ValueError(f"unknown provider: {provider}")
def session_cost() -> CostTracker:
return _TRACKER
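A quick smoke test before wiring it into anything else (assumes OPENAI_API_KEY is set; the filename is just a suggestion):

# scripts/smoke.py -- hypothetical location, run with: uv run python scripts/smoke.py
import logging
from text_tool.llm_client import get_client, session_cost

logging.basicConfig(level=logging.INFO)

client = get_client()  # respects LLM_PROVIDER, defaults to OpenAI
result = client.complete("Reply with the single word: ok", max_tokens=5)
print(result.text)
print(f"this call: ${result.cost_usd:.6f}")
print("session:", session_cost().report())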
What Each Piece Buys You
| Concern | Mechanism | Why |
|---|---|---|
| Rate limits / 5xx | tenacity retry with exponential backoff | Transient errors retry automatically; permanent errors raise after 3 attempts |
| Timeouts | httpx.Timeout(30.0, connect=5.0) | Default SDK timeout is too long — you want to fail fast and retry |
| Cost | CostTracker singleton | Every call updates the running total. Log it; alert on it; don't get billed in the dark |
| Streaming | stream() generator | Token-by-token rendering for chat UX. Doesn't change the integration shape |
| Structured output | OpenAI parse; Anthropic JSON-mode + Pydantic model_validate | Schema goes in, validated object comes out, no json.loads in your business logic |
| Provider swap | get_client(provider) + LLMClient Protocol | One env var change moves you between OpenAI, Anthropic, and (in Part 8) self-hosted vLLM |
Use It from the CLI
Add a --summarise flag to the Part 1 CLI. Save as src/text_tool/cli.py (replacing the previous version):
from __future__ import annotations
import argparse, logging, sys
from pathlib import Path
from text_tool.llm_client import get_client, session_cost
from text_tool.processing import top_words, read_text, FileNotReadable # from Part 1
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main(argv: list[str] | None = None) -> int:
p = argparse.ArgumentParser()
p.add_argument("path", type=Path)
p.add_argument("-n", "--top", type=int, default=10)
p.add_argument("--summarise", action="store_true")
p.add_argument("--stream", action="store_true")
args = p.parse_args(argv)
try:
text = read_text(args.path)
except FileNotReadable as e:
log.error("%s", e); return 2
if args.summarise:
client = get_client()
prompt = f"Summarise the following text in 3 bullet points:\n\n{text[:8000]}"
if args.stream:
for chunk in client.stream(prompt, max_tokens=300):
print(chunk, end="", flush=True)
print()
else:
result = client.complete(prompt, max_tokens=300)
print(result.text)
log.info("session %s", session_cost().report())
else:
for word, count in top_words(text, args.top):
print(f"{count:6d} {word}")
return 0
if __name__ == "__main__":
sys.exit(main())
Run it:
export OPENAI_API_KEY=sk-...
uv run python -m text_tool.cli sample.txt --summarise --stream
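Swapping providers is the same command with a different environment; nothing in the CLI changes:

export ANTHROPIC_API_KEY=sk-ant-...
LLM_PROVIDER=anthropic uv run python -m text_tool.cli sample.txt --summarise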
Caching: When and Why
For deterministic prompts (temperature 0, same input), repeated calls return essentially the same output, so every repeat pays again for an answer you already have. A simple in-memory cache:
from functools import lru_cache
@lru_cache(maxsize=512)
def cached_complete(prompt: str) -> str:
return get_client().complete(prompt).text
This is fine for development. For production you want a disk-backed cache (so the cache survives restarts) or a Redis cache (so it is shared across processes). Either way, don't cache creative generations: the user is paying for novelty, not a stale answer.
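A minimal disk-backed version using only the standard library (a sketch; the cache directory and the model_hint key are arbitrary choices, and it should only ever see temperature-0 prompts):

import hashlib
import json
from pathlib import Path
from text_tool.llm_client import get_client

CACHE_DIR = Path(".llm_cache")  # arbitrary location; pick one per project
CACHE_DIR.mkdir(exist_ok=True)

def cached_complete(prompt: str, model_hint: str = "default") -> str:
    # One file per (model, prompt) pair, keyed by a content hash.
    key = hashlib.sha256(f"{model_hint}\n{prompt}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["text"]
    text = get_client().complete(prompt).text
    path.write_text(json.dumps({"text": text}))
    return text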
Common Failure Modes (and What They Look Like)
- Truncation. The model hits max_tokens and stops mid-sentence. Always set max_tokens high enough, or detect finish_reason == "length" and continue the generation
- Malformed JSON. Even with structured output, sometimes the model emits a code fence around it. The wrapper above strips fences defensively. Always parse via Pydantic, never json.loads directly
- Rate-limit cascades. One slow request triggers a queue of retries that triggers more rate limits. Exponential backoff is the only correct response
- Surprise bills. Big context + big output + lots of calls = lots of dollars. Always log per-call cost. Always set a daily ceiling somewhere upstream, as sketched below
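The ceiling doesn't need to be elaborate. A sketch of a guard built on the session tracker (DAILY_LIMIT_USD and SpendLimitExceeded are illustrative names, not part of llm_client.py; a real deployment would persist the running total outside a single process):

from text_tool.llm_client import session_cost

DAILY_LIMIT_USD = 5.00  # illustrative ceiling; tune per deployment

class SpendLimitExceeded(RuntimeError):
    pass

def check_budget() -> None:
    # Call before each LLM request; raises once the session total crosses the line.
    spent = session_cost().cost_usd
    if spent >= DAILY_LIMIT_USD:
        raise SpendLimitExceeded(f"spent ${spent:.2f}, limit ${DAILY_LIMIT_USD:.2f}")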
Key Takeaways
- Wrap every LLM call. Never use the raw SDK in business logic. The wrapper is the seam where retries, cost tracking, streaming, and provider swap all live
- Use tenacity for retries with exponential backoff. The defaults will save you
- Always set explicit timeouts. The SDK defaults are too long
- Track cost on every call. Log it. Alert on it
- For structured output: schema in, validated Pydantic object out. The wrapper handles both native (OpenAI) and prompted (Anthropic) JSON modes
- Hide the provider behind a Protocol. Part 8 swaps OpenAI for self-hosted vLLM by changing one env var
Next Up
Part 3 wraps this client in a FastAPI service so other applications can call it over HTTP — with auth, rate limiting, streaming responses, and a Dockerfile.