Hardware required: any laptop. Python: 3.11+. API key: you'll need either an OpenAI or Anthropic key, or both.
The Lesson Isn't the Four-Line Baseline
Calling an LLM API is pip install openai and four lines of code. That's not the lesson. The lesson is everything that goes wrong in production — rate limits, timeouts, malformed JSON, bills you didn't expect — and how to handle each.
This post is about building llm_client.py: a wrapper module that the rest of the series imports. By the end, calling an LLM looks like:
from text_tool.llm_client import get_client
client = get_client()
result = client.complete("Summarise this paragraph: ...", max_tokens=200)
And under the hood: retries on transient failures, optional streaming, optional structured output, automatic cost tracking, easy provider swap.
The Four-Line Baseline (and Why It Isn't Enough)
Install the SDKs:
uv add openai==2.5.0 anthropic==1.5.0 tenacity==9.0.0 pydantic==2.10.0 httpx==0.28.0
The naive call:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
This works once. It does not work in production. It has no timeout, no retries, no cost tracking, no error handling, no streaming, no way to swap to Anthropic without rewriting it.
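The usual first fix is to harden the call site in place. Roughly what that looks like, sketched here only to show the pattern you would end up repeating everywhere, not as a recommendation:

import time
from openai import OpenAI, RateLimitError, APITimeoutError

client = OpenAI(timeout=30.0)  # reads OPENAI_API_KEY from the environment

# Ad-hoc retry loop, duplicated at every call site
for attempt in range(3):
    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Hello"}],
        )
        break
    except (RateLimitError, APITimeoutError):
        time.sleep(2 ** attempt)  # crude backoff: 1s, 2s, 4s
else:
    raise RuntimeError("gave up after 3 attempts")
print(resp.choices[0].message.content)

Multiply that by every place you call the model and you have a maintenance problem. Write it once, behind an interface.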
The Wrapper Interface
I want one interface, two backends. The shape:
class LLMClient(Protocol):
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult: ...
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]: ...
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T: ...
Three methods cover ~95% of LLM use. complete for batch, stream for chat UX, complete_structured for anywhere the output goes into another piece of code.
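Concretely, the structured path looks like this from the caller's side (a sketch; TicketTriage is a made-up schema for illustration, and get_client is the factory built below):

from pydantic import BaseModel
from text_tool.llm_client import get_client

class TicketTriage(BaseModel):  # hypothetical schema, just for illustration
    category: str
    urgency: int
    summary: str

client = get_client()
triage = client.complete_structured(
    "Triage this support ticket: 'App crashes when I upload a PDF...'",
    schema=TicketTriage,
)
print(triage.category, triage.urgency)  # plain attribute access, already validated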
Build It: llm_client.py
Save this as src/text_tool/llm_client.py. It's longer than a snippet because the wrapper has to actually work; copy it whole.
"""Provider-agnostic LLM client. OpenAI + Anthropic. Retries, streaming, structured output."""
from __future__ import annotations
import json
import logging
import os
from dataclasses import dataclass
from typing import Iterator, Protocol, TypeVar
import httpx
from openai import OpenAI, APIError, APITimeoutError, RateLimitError
from anthropic import Anthropic, APIError as AnthropicAPIError
from pydantic import BaseModel
from tenacity import (
retry, stop_after_attempt, wait_exponential, retry_if_exception_type,
)
logger = logging.getLogger(__name__)
T = TypeVar("T", bound=BaseModel)
# Per-1M-token prices, USD. Update when models change.
PRICING = {
"gpt-4o-mini": {"in": 0.15, "out": 0.60},
"gpt-4o": {"in": 2.50, "out": 10.00},
"claude-haiku-4-5": {"in": 1.00, "out": 5.00},
"claude-sonnet-4-6": {"in": 3.00, "out": 15.00},
}
@dataclass
class CompletionResult:
text: str
model: str
input_tokens: int
output_tokens: int
@property
def cost_usd(self) -> float:
p = PRICING.get(self.model, {"in": 0.0, "out": 0.0})
return (self.input_tokens * p["in"] + self.output_tokens * p["out"]) / 1_000_000
@dataclass
class CostTracker:
"""Accumulates spend across a session. Singleton-friendly."""
calls: int = 0
input_tokens: int = 0
output_tokens: int = 0
cost_usd: float = 0.0
def record(self, r: CompletionResult) -> None:
self.calls += 1
self.input_tokens += r.input_tokens
self.output_tokens += r.output_tokens
self.cost_usd += r.cost_usd
def report(self) -> str:
return (f"calls={self.calls} in={self.input_tokens} out={self.output_tokens} "
f"cost=${self.cost_usd:.4f}")
class LLMClient(Protocol):
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult: ...
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]: ...
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T: ...
# ---------- OpenAI backend ----------
class OpenAIClient:
def __init__(self, model: str = "gpt-4o-mini", tracker: CostTracker | None = None) -> None:
self._client = OpenAI(
api_key=os.environ["OPENAI_API_KEY"],
timeout=httpx.Timeout(30.0, connect=5.0),
)
self._model = model
self._tracker = tracker or CostTracker()
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=20),
retry=retry_if_exception_type((RateLimitError, APITimeoutError, APIError)),
)
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult:
resp = self._client.chat.completions.create(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
)
result = CompletionResult(
text=resp.choices[0].message.content or "",
model=self._model,
input_tokens=resp.usage.prompt_tokens,
output_tokens=resp.usage.completion_tokens,
)
self._tracker.record(result)
logger.info("openai call: %s", self._tracker.report())
return result
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]:
with self._client.chat.completions.stream(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
) as stream:
for event in stream:
if event.type == "content.delta":
yield event.delta
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T:
resp = self._client.chat.completions.parse(
model=self._model,
messages=[{"role": "user", "content": prompt}],
response_format=schema,
max_tokens=max_tokens,
)
return resp.choices[0].message.parsed # type: ignore[return-value]
# ---------- Anthropic backend ----------
class AnthropicClient:
def __init__(self, model: str = "claude-haiku-4-5", tracker: CostTracker | None = None) -> None:
self._client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"], timeout=30.0)
self._model = model
self._tracker = tracker or CostTracker()
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=20),
retry=retry_if_exception_type(AnthropicAPIError),
)
def complete(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> CompletionResult:
resp = self._client.messages.create(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
)
result = CompletionResult(
text=resp.content[0].text if resp.content else "",
model=self._model,
input_tokens=resp.usage.input_tokens,
output_tokens=resp.usage.output_tokens,
)
self._tracker.record(result)
logger.info("anthropic call: %s", self._tracker.report())
return result
def stream(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.0) -> Iterator[str]:
with self._client.messages.stream(
model=self._model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
) as stream:
for delta in stream.text_stream:
yield delta
def complete_structured(self, prompt: str, schema: type[T], *, max_tokens: int = 1024) -> T:
# Anthropic doesn't have native parsed-response yet; do JSON-mode + parse.
instructions = (
f"Respond with JSON matching this schema:\n{schema.model_json_schema()}\n"
f"No prose, no code fences, just JSON."
)
resp = self._client.messages.create(
model=self._model,
messages=[{"role": "user", "content": f"{instructions}\n\n{prompt}"}],
max_tokens=max_tokens,
)
raw = resp.content[0].text.strip()
# Defensive: strip ```json fences if the model adds them
if raw.startswith("```"):
raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
return schema.model_validate(json.loads(raw))
# ---------- Factory ----------
_TRACKER = CostTracker()
def get_client(provider: str | None = None) -> LLMClient:
"""Pick a provider. Default: env LLM_PROVIDER, fallback OpenAI."""
provider = (provider or os.environ.get("LLM_PROVIDER", "openai")).lower()
if provider == "openai":
return OpenAIClient(tracker=_TRACKER)
if provider == "anthropic":
return AnthropicClient(tracker=_TRACKER)
raise ValueError(f"unknown provider: {provider}")
def session_cost() -> CostTracker:
return _TRACKER
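A quick smoke test before wiring it into anything else (assumes OPENAI_API_KEY is set; the filename is just a suggestion):

# scripts/smoke.py -- hypothetical location, run with: uv run python scripts/smoke.py
import logging
from text_tool.llm_client import get_client, session_cost

logging.basicConfig(level=logging.INFO)

client = get_client()  # respects LLM_PROVIDER, defaults to OpenAI
result = client.complete("Reply with the single word: ok", max_tokens=5)
print(result.text)
print(f"this call: ${result.cost_usd:.6f}")
print("session:", session_cost().report())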
What Each Piece Buys You
| Concern | Mechanism | Why |
|---|---|---|
| Rate limits / 5xx | tenacity retry with exponential backoff | Transient errors retry automatically; permanent errors raise after 3 attempts |
| Timeouts | httpx.Timeout(30.0, connect=5.0) | Default SDK timeout is too long — you want to fail fast and retry |
| Cost | CostTracker singleton | Every call updates the running total. Log it; alert on it; don't get billed in the dark |
| Streaming | stream() generator | Token-by-token rendering for chat UX. Doesn't change the integration shape |
| Structured output | OpenAI parse; Anthropic JSON-mode + Pydantic model_validate | Schema goes in, validated object comes out, no json.loads in your business logic |
| Provider swap | get_client(provider) + LLMClient Protocol | One env var change moves you between OpenAI, Anthropic, and (in Part 8) self-hosted vLLM |
Use It from the CLI
Add a --summarise flag to the Part 1 CLI. Save as src/text_tool/cli.py (replacing the previous version):
from __future__ import annotations
import argparse, logging, sys
from pathlib import Path
from text_tool.llm_client import get_client, session_cost
from text_tool.processing import top_words, read_text, FileNotReadable # from Part 1
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main(argv: list[str] | None = None) -> int:
p = argparse.ArgumentParser()
p.add_argument("path", type=Path)
p.add_argument("-n", "--top", type=int, default=10)
p.add_argument("--summarise", action="store_true")
p.add_argument("--stream", action="store_true")
args = p.parse_args(argv)
try:
text = read_text(args.path)
except FileNotReadable as e:
log.error("%s", e); return 2
if args.summarise:
client = get_client()
prompt = f"Summarise the following text in 3 bullet points:\n\n{text[:8000]}"
if args.stream:
for chunk in client.stream(prompt, max_tokens=300):
print(chunk, end="", flush=True)
print()
else:
result = client.complete(prompt, max_tokens=300)
print(result.text)
log.info("session %s", session_cost().report())
else:
for word, count in top_words(text, args.top):
print(f"{count:6d} {word}")
return 0
if __name__ == "__main__":
sys.exit(main())
Run it:
export OPENAI_API_KEY=sk-...
uv run python -m text_tool.cli sample.txt --summarise --stream
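Swapping providers is the same command with a different environment; nothing in the CLI changes:

export ANTHROPIC_API_KEY=sk-ant-...
LLM_PROVIDER=anthropic uv run python -m text_tool.cli sample.txt --summarise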
Caching: When and Why
For deterministic prompts (temperature 0, same input), repeated calls return essentially the same output, so every repeat pays again for an answer you already have. A simple in-memory cache:
from functools import lru_cache
@lru_cache(maxsize=512)
def cached_complete(prompt: str) -> str:
return get_client().complete(prompt).text
This is fine for development. For production you want a disk-backed cache (so the cache survives restarts) or a Redis cache (so it is shared across processes). Either way, don't cache creative generations: the user is paying for novelty, not a stale answer.
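A minimal disk-backed version using only the standard library (a sketch; the cache directory and the model_hint key are arbitrary choices, and it should only ever see temperature-0 prompts):

import hashlib
import json
from pathlib import Path
from text_tool.llm_client import get_client

CACHE_DIR = Path(".llm_cache")  # arbitrary location; pick one per project
CACHE_DIR.mkdir(exist_ok=True)

def cached_complete(prompt: str, model_hint: str = "default") -> str:
    # One file per (model, prompt) pair, keyed by a content hash.
    key = hashlib.sha256(f"{model_hint}\n{prompt}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["text"]
    text = get_client().complete(prompt).text
    path.write_text(json.dumps({"text": text}))
    return text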
Common Failure Modes (and What They Look Like)
- Truncation. The model hits max_tokens and stops mid-sentence. Always set max_tokens high enough, or detect finish_reason == "length" and continue the generation
- Malformed JSON. Even with structured output, sometimes the model emits a code fence around it. The wrapper above strips fences defensively. Always parse via Pydantic, never json.loads directly
- Rate-limit cascades. One slow request triggers a queue of retries that triggers more rate limits. Exponential backoff is the only correct response
- Surprise bills. Big context + big output + lots of calls = lots of dollars. Always log per-call cost. Always set a daily ceiling somewhere upstream, as sketched below
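The ceiling doesn't need to be elaborate. A sketch of a guard built on the session tracker (DAILY_LIMIT_USD and SpendLimitExceeded are illustrative names, not part of llm_client.py; a real deployment would persist the running total outside a single process):

from text_tool.llm_client import session_cost

DAILY_LIMIT_USD = 5.00  # illustrative ceiling; tune per deployment

class SpendLimitExceeded(RuntimeError):
    pass

def check_budget() -> None:
    # Call before each LLM request; raises once the session total crosses the line.
    spent = session_cost().cost_usd
    if spent >= DAILY_LIMIT_USD:
        raise SpendLimitExceeded(f"spent ${spent:.2f}, limit ${DAILY_LIMIT_USD:.2f}")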
Key Takeaways
- Wrap every LLM call. Never use the raw SDK in business logic. The wrapper is the seam where retries, cost tracking, streaming, and provider swap all live
- Use tenacity for retries with exponential backoff. The defaults will save you
- Always set explicit timeouts. The SDK defaults are too long
- Track cost on every call. Log it. Alert on it
- For structured output: schema in, validated Pydantic object out. The wrapper handles both native (OpenAI) and prompted (Anthropic) JSON modes
- Hide the provider behind a Protocol. Part 8 swaps OpenAI for self-hosted vLLM by changing one env var
Next Up
Part 3 wraps this client in a FastAPI service so other applications can call it over HTTP — with auth, rate limiting, streaming responses, and a Dockerfile.