How I Build AI Agents That Actually Work in Production

Introduction

I've spent the past year building AI systems that go beyond chat. My TradingAgents project uses 13 LLM agents that debate and trade NSE equities. The AML/CTF compliance system I'm building at Supreme AI runs three specialized models in sequence, each making real decisions about real people. These aren't toy demos. They're production systems where a wrong decision has consequences.

Along the way, I've developed a clear mental model for what an AI agent actually is, which frameworks are worth using, and where things break in production. This post is everything I wish I'd known when I started.

What Is an AI Agent, Really?

An AI agent is a program where an LLM controls the workflow. Unlike a chatbot that takes input and returns output, an agent operates in a loop: it observes the current state, reasons about what to do, takes an action (calls a tool, queries an API, runs code), observes the result, and repeats until the task is done.

The core loop looks like this:

memory = [user_task]

while llm_should_continue(memory):
    action = llm_get_next_action(memory)    # Think
    observations = execute_action(action)    # Act
    memory += [action, observations]         # Observe

Here's a minimal working simulation you can run right now. No API keys needed:

# Simulated agent loop (no API needed)
import random

tools = {
    "search": lambda q: f"Result for '{q}': Python was created by Guido van Rossum in 1991",
    "calculate": lambda expr: str(eval(expr)),
}

def simple_agent(task):
    memory = [f"Task: {task}"]
    steps = [
        ("search", "who created Python"),
        ("calculate", "2026 - 1991"),
    ]

    for tool_name, tool_input in steps:
        print(f"Thought: I should use {tool_name}('{tool_input}')")
        result = tools[tool_name](tool_input)
        print(f"Observation: {result}")
        memory.append(result)
        print()

    print("Final Answer: Python was created by Guido van Rossum, 35 years ago.")
    print(f"\nMemory has {len(memory)} entries")

simple_agent("How old is Python?")

That's it. Every agent framework, regardless of complexity, implements some version of this loop. The differences come down to how they manage state, route between steps, and handle failures.

The Four Components Every Agent Needs

1. The LLM (Brain)

The LLM is the reasoning engine. It interprets instructions, decides which tools to call, and synthesizes results. I've found that model choice matters a lot here. Larger models (GPT-4o, Claude Opus, Qwen3-72B) handle complex multi-step reasoning far better than smaller ones. In production, I often use a cheaper model for routine steps and a more capable one for complex planning.

2. Tools

Tools give the agent hands. Without them, an LLM can only generate text. With tools, it can search the web, query databases, execute code, call APIs, and modify files. The quality of your tool descriptions directly impacts agent performance. Vague descriptions lead to the agent calling the wrong tool or hallucinating arguments.

3. Memory

Short-term memory is the conversation history in the context window. As conversations grow, older context gets pushed out or needs summarization.

Long-term memory persists across sessions, typically via vector stores (Pinecone, Chroma, FAISS) using RAG. Text is embedded, stored, and retrieved by semantic similarity when needed.

4. Planning

The agent's ability to break complex tasks into subtasks and adapt as new information arrives. Key strategies include Chain-of-Thought (step-by-step reasoning), ReAct (interleaved reasoning and acting), and Plan-and-Execute (generate a full plan upfront, then run it).

The ReAct Pattern: How Agents Think and Act

Most production agents use the ReAct pattern from Yao et al. (2022). The core idea is simple: interleave reasoning traces with actions. The LLM thinks out loud about what it needs to do, calls a tool, observes the result, then reasons again.

# ReAct in action:

# Thought 1: I need to find the population of Nairobi.
# Action 1:  search("population of Nairobi 2026")
# Observation 1: Nairobi's population is approximately 5.2 million.

# Thought 2: Now I need to compare it with Mombasa.
# Action 2:  search("population of Mombasa 2026")
# Observation 2: Mombasa's population is approximately 1.4 million.

# Thought 3: Nairobi is about 3.7x larger. I have my answer.
# Action 3:  finish("Nairobi (5.2M) is ~3.7x larger than Mombasa (1.4M)")

Neither reasoning alone nor action alone is sufficient. Chain-of-thought without tool access leads to hallucination. Action without reasoning leads to blind tool-calling with no strategy. ReAct combines both, and 1-2 shot ReAct prompting has been shown to outperform RL methods trained on thousands of examples.

Building an Agent from Scratch (OpenAI Function Calling)

Before reaching for a framework, I think it's important to understand the raw mechanics. Here's a minimal agent using nothing but the OpenAI SDK:

import json
from openai import OpenAI

client = OpenAI()

# Define your tools
def get_weather(city: str) -> str:
    weather_data = {
        "Nairobi": "Sunny, 24C",
        "London": "Cloudy, 12C",
        "Sydney": "Clear, 19C",
    }
    return weather_data.get(city, f"No data for {city}")

def calculate(expression: str) -> str:
    try:
        return str(eval(expression, {"__builtins__": {}}))
    except Exception as e:
        return f"Error: {e}"

available_functions = {
    "get_weather": get_weather,
    "calculate": calculate,
}

# Tool schemas for the API
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a math expression using Python syntax.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "e.g. '2**10 + 5'"}
                },
                "required": ["expression"],
            },
        },
    },
]

# The agent loop
def run_agent(user_message: str, max_iterations: int = 10):
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Use tools when needed."},
        {"role": "user", "content": user_message},
    ]

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )

        msg = response.choices[0].message

        # No tool calls means we have the final answer
        if not msg.tool_calls:
            print(f"Agent: {msg.content}")
            return msg.content

        messages.append(msg)

        for tool_call in msg.tool_calls:
            name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            print(f"  [Tool] {name}({args})")

            result = available_functions[name](**args)
            print(f"  [Result] {result}")

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Max iterations reached."

run_agent("What's the weather in Nairobi, and what is 2^10 + 42?")

Output:

  [Tool] get_weather({'city': 'Nairobi'})
  [Result] Sunny, 24C
  [Tool] calculate({'expression': '2**10 + 42'})
  [Result] 1066
Agent: The weather in Nairobi is Sunny at 24C. And 2^10 + 42 = 1,066.

That's 60 lines of code for a working agent. The LLM decided which tools to call, called them in parallel, and synthesized the results. No framework needed.

Try this simplified version that simulates the same pattern without needing an API key:

# Tool-calling agent simulation (runs in your browser)
import json

# Define tools
def get_weather(city):
    data = {"Nairobi": "Sunny, 24C", "London": "Cloudy, 12C", "Sydney": "Clear, 19C"}
    return data.get(city, f"No data for {city}")

def calculate(expression):
    return str(eval(expression))

tools = {"get_weather": get_weather, "calculate": calculate}

# Simulate an agent deciding to call tools
planned_calls = [
    {"name": "get_weather", "args": {"city": "Nairobi"}},
    {"name": "calculate", "args": {"expression": "2**10 + 42"}},
]

print("User: What's the weather in Nairobi, and what is 2^10 + 42?\n")

results = []
for call in planned_calls:
    fn = tools[call["name"]]
    result = fn(**call["args"])
    print(f"  [Tool] {call['name']}({call['args']})")
    print(f"  [Result] {result}\n")
    results.append(result)

print(f"Agent: The weather in Nairobi is {results[0]}.")
print(f"       And 2^10 + 42 = {results[1]}.")

Scaling Up with LangGraph

The bare-bones approach works for simple agents, but production systems need state management, persistence, error handling, and human-in-the-loop controls. That's where I reach for LangGraph.

LangGraph models your agent as a state machine: nodes (LLM calls, tool execution), edges (transitions), and conditional routing. Here's a proper agent with tools:

from typing import Literal
from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, MessagesState, StateGraph

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

@tool
def search_web(query: str) -> str:
    """Search the web for current information about a topic."""
    # In production, use Tavily, SerpAPI, etc.
    return f"Search results for: {query}"

@tool
def run_python(code: str) -> str:
    """Run Python code and return the output."""
    from io import StringIO
    import sys
    old_stdout = sys.stdout
    sys.stdout = captured = StringIO()
    try:
        exec(code, {})
        return captured.getvalue().strip() or "Executed successfully."
    except Exception as e:
        return f"Error: {e}"
    finally:
        sys.stdout = old_stdout

tools = [search_web, run_python]
tools_by_name = {t.name: t for t in tools}
model_with_tools = model.bind_tools(tools)

def call_llm(state: MessagesState):
    system = SystemMessage(content="You are a research assistant. Use tools when needed.")
    response = model_with_tools.invoke([system] + state["messages"])
    return {"messages": [response]}

def call_tools(state: MessagesState):
    results = []
    for tc in state["messages"][-1].tool_calls:
        result = tools_by_name[tc["name"]].invoke(tc["args"])
        results.append(ToolMessage(content=result, tool_call_id=tc["id"]))
    return {"messages": results}

def should_continue(state: MessagesState) -> Literal["tools", "__end__"]:
    if state["messages"][-1].tool_calls:
        return "tools"
    return END

# Build the graph
graph = StateGraph(MessagesState)
graph.add_node("llm", call_llm)
graph.add_node("tools", call_tools)
graph.add_edge(START, "llm")
graph.add_conditional_edges("llm", should_continue, {"tools": "tools", END: END})
graph.add_edge("tools", "llm")

agent = graph.compile()

# Run it
result = agent.invoke(
    {"messages": [HumanMessage(content="Search for LangGraph, then write Python to sort [5,2,8,1]")]}
)

The graph structure is the classic ReAct loop:

START -> [LLM] -> has_tool_calls? -> yes -> [Tools] -> back to [LLM]
                                    -> no  -> END

LangGraph gives me checkpointing (resume from any state), human-in-the-loop interrupts, and streaming. These are the things that matter when you're building agents that make real decisions.

Choosing a Framework

Framework	Best For	Learning Curve
LangGraph	Production agents, custom control flow	Medium-High
CrewAI	Role-based multi-agent pipelines	Low
AutoGen	Agent debate and negotiation	Medium
Smolagents	Lightweight code-executing agents	Low
OpenAI SDK	OpenAI-ecosystem agents with guardrails	Low

I use LangGraph for most of my production work because I need fine-grained control over agent behavior. For quick prototypes or role-based pipelines, CrewAI is surprisingly effective. If you're building with OpenAI models exclusively, their Agents SDK has excellent built-in tracing and guardrails.

Multi-Agent Systems: When One Agent Isn't Enough

My TradingAgents project taught me that some problems genuinely need multiple agents. When I need four different analytical perspectives (fundamentals, technicals, sentiment, news) evaluated independently and then debated, a single agent with 15 tools just doesn't cut it. The context gets too crowded and tool selection degrades.

The main orchestration patterns I've used:

Pipeline (Sequential): Agent A processes, passes to Agent B, then Agent C. I use this for my AML/CTF system where risk profiling feeds into the CDD interview, which feeds into the decision engine.
Fan-out/Fan-in (Concurrent): Multiple agents analyze the same input in parallel, then an aggregator synthesizes results. This is how TradingAgents works: four analyst agents run concurrently, then a debate agent resolves disagreements.
Debate (Group Chat): Agents argue for and against a position. Useful for high-stakes decisions where you want multiple perspectives before committing.

A word of caution: don't use multi-agent systems when a single agent can do the job. Each additional agent adds latency, cost, and failure modes. 88-95% of AI agent pilots never reach production, and over-engineering is a common reason.

Where Agents Break in Production

Here's what I've learned the hard way:

1. Tool Hallucination

The LLM invents tool names that don't exist or fabricates arguments. In a compliance system, a hallucinated function call doesn't just fail quietly. It can cascade into multiple downstream systems. Always validate tool names and arguments against the schema before execution.

2. Infinite Loops

An agent keeps calling the same tool with slightly different arguments, or loops back to verify something it already verified. Always set a hard max_iterations limit. I use 10-15 for most agents.

3. Context Window Overflow

As tool results accumulate, the context fills up. The LLM starts drowning in information, and paradoxically, sometimes less context produces better results. I summarize tool outputs before appending them to context, keeping only the key facts.

4. Prompt Injection

Any agent that processes external data (web pages, documents, user uploads) is vulnerable. A webpage could contain hidden instructions that the agent follows. I use input validation, output sanitization, and human-in-the-loop approval for sensitive actions.

Key Takeaways

An AI agent is just an LLM in a loop: Observe, Think, Act, repeat
Start with raw function calling to understand the mechanics, then add a framework when you need state management
The ReAct pattern (interleaved reasoning and acting) is the foundation of most production agents
Tool descriptions are as important as the model choice. Bad descriptions lead to bad tool selection
Multi-agent systems are powerful but add complexity. Use them only when a single agent genuinely can't handle the task
Production agents need hard iteration limits, input validation, tool schema enforcement, and human-in-the-loop controls

Research Papers Worth Reading

ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)
Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023)
Voyager: An Open-Ended Embodied Agent with LLMs (Wang et al., 2023)
HuggingGPT: Solving AI Tasks with ChatGPT and Hugging Face (Shen et al., 2023)
Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023)
A Survey on LLM-based Autonomous Agents (Wang et al., 2023)