How do AI agents work, anyway?

9 minute read

In this post, I’ll provide a brief conceptual introduction to AI agents and analyze the implementation of the smolagents library by examining its OpenAI API calls using mitmproxy and DuckDB.

What are agents?

An AI agent is a program that performs actions through a set of tools. For example, ChatGPT is an agent that can search the web via a tool. Agents use Large Language Models (LLMs) to break down tasks into smaller ones (planning), choose which tools to use at each step, and determine when the task is complete.

Tools are typically functions (like Python functions) that the agent calls to retrieve results or perform actions (such as writing to a database).
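
For example, here’s what a tool might look like in smolagents (the get_weather function below is a made-up example using the library’s @tool decorator, which requires type hints and a docstring describing each argument):

from smolagents import tool

@tool
def get_weather(city: str) -> str:
    """Returns the current weather for a city.

    Args:
        city: The name of the city to look up.
    """
    # A real tool would call a weather API here; this canned
    # response is for illustration only.
    return f"The weather in {city} is sunny."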

The plan is the series of steps that the agent will perform. Not all plans are created equal: shorter, less computationally expensive plans are desirable.

To learn more about AI agents, check out Chip Huyen’s blog post and the smolagents blog post.

Code agents

Since the LLM decides which tools to run at each step, we need a way to represent tool calling (aka function calling). Code agents represent their tool calls using actual code (e.g., Python code), in contrast to other agents which represent tool calls with JSON. Research has shown that code-based tool calling produces more effective agents.
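
To see the difference, consider a hypothetical multiply tool; a JSON agent and a code agent would express the same call as follows:

# Hypothetical "multiply" tool, for illustration only
def multiply(a: int, b: int) -> int:
    return a * b

# A JSON agent emits the call as structured data that the framework executes:
json_call = {"name": "multiply", "arguments": {"a": 2, "b": 21}}

# A code agent emits the call as executable Python:
result = multiply(2, 21)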

We’ll be using the smolagents framework with its code agent configuration to understand how agents work (though you can also use the JSON configuration).

Setup

First, let’s install the required dependencies:

  • smolagents for running the agent
  • mitmproxy for intercepting OpenAI API requests
  • DuckDB for querying the OpenAI API logs
  • rich for prettier terminal output

# Install packages (including litellm for OpenAI model support)
pip install 'smolagents[litellm]' mitmproxy duckdb rich

Download the code:

git clone --depth 1 https://github.com/edublancas/posts
cd posts/smolagents/

Next, start the reverse proxy to intercept OpenAI requests and log them to a .jsonl file that we’ll query later with DuckDB:

mitmdump -s proxy_logger.py --mode reverse:https://api.openai.com --listen-port 8080
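
The proxy_logger.py script does the logging; a minimal version of such a mitmproxy addon might look like this (a sketch, not the actual script from the repo; the openai_logs.jsonl filename is an assumption):

import json

from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    # mitmproxy calls this hook for every completed request/response pair;
    # we append each one as a JSON line (the filename is an assumption)
    record = {
        "request": flow.request.get_text(),
        "response": flow.response.get_text(),
    }
    with open("openai_logs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")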

Basic example (multiplication, no tools)

Let’s start with a basic example: asking the model to perform a simple multiplication without providing any tools.

from smolagents import CodeAgent, OpenAIServerModel

# Initialize the model with our reverse proxy
model = OpenAIServerModel(
    model_id="gpt-4o-mini",
    api_base="http://localhost:8080/v1",
)

# Create an agent with no tools
agent = CodeAgent(tools=[], model=model, add_base_tools=False)

# Run the agent with a simple multiplication task
agent.run("How much is 2 * 21?")

After running the code, we can view the execution logs by running python print.py, which displays all logs from the .jsonl file.
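
We can also query the .jsonl file directly with DuckDB, which reads JSON Lines natively. For example (assuming the log stores each request body under a request key; adjust to proxy_logger.py’s actual schema):

import duckdb

# Each line in the log file is one JSON object; the "request" column
# name is an assumption about the log schema
requests = duckdb.sql(
    "SELECT request FROM read_json_auto('openai_logs_no_base_tools.jsonl')"
).fetchall()
print(requests[0])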

Prompt

Here’s the prompt for the API call. I removed several parts to make it shorter (several parts are redundant, added to deal with the stubbornness of LLMs), but kept the overall message the same. You can see the complete logs in the openai_logs_no_base_tools.jsonl file.

System

The system prompt defines the model’s purpose and the rules it must abide by. It essentially tells the LLM that its job is to solve tasks with tools, and that solving a task involves cycling through three steps: Thought, Code, and Observation:

You are an expert assistant who can solve any task using code blobs. You have been
given access to a list of tools: these tools are Python functions.

To solve the task, you must plan forward to proceed in a series of steps, in a cycle
of 'Thought:', 'Code:', and 'Observation:' sequences.

In the 'Thought:' sequence, you should explain your reasoning and the tools that you want to use.

In the 'Code:' sequence, you should write the code in Python.

During each intermediate step, use `print()` to save important information.
These `print` outputs will then appear in the 'Observation:' field, which will be
available as input for the next step.

In the end you have to return a final answer using the `final_answer` tool.

Here are a few examples:

---

Task: "What is the result of the following operation: 5 + 3 + 2"

Thought: I will use python code to compute the result of the operation and then return
the final answer using the `final_answer` tool

Code:
```py
result = 5 + 3 + 2
final_answer(result)
```
<end_code>
---

[...MORE EXAMPLES HERE]

After a few more examples, the system prompt lists the available tools (we only have the final_answer tool) and the rules the model must follow:

You only have access to these tools:


- final_answer: Provides a final answer to the given problem.
    Takes inputs: {'answer': {'type': 'any', 'description': 'The final answer to the problem'}}
    Returns an output of type: any
[... MORE TOOLS ARE ADDED HERE, IF ANY]


Here are the rules you should always follow to solve your task:

1. Always provide a 'Thought:' sequence, and a 'Code:\n```py' sequence ending with '```<end_code>' sequence, else you will fail.
[...MORE RULES HERE]

User

The next message has {"role": "user"}, and it contains the task to perform:

New task: How much is 2 * 21?

Response

Remember that one of the rules in the system prompt says:

Always provide a ‘Thought:’ sequence, and a ‘Code:’ sequence

Hence, the model proceeds to return Thought: and Code:

Thought: This is a simple multiplication task. I will multiply 2 by 21 and return the
result using the `final_answer` tool.

Code:

```py
result = 2 * 21
final_answer(result)
```

The agent then runs the Python code; since the snippet already calls final_answer, it knows the task is finished.
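
Conceptually, the loop that drives a code agent looks roughly like this (a simplified sketch, not smolagents’ actual implementation):

import re

def extract_code(reply: str) -> str:
    """Pull the Python snippet out of the model's 'Code:' block."""
    return re.search(r"```(?:py)?\n(.*?)```", reply, re.DOTALL).group(1)

def run_agent(task: str, llm, system_prompt: str, max_steps: int = 10):
    """Minimal code-agent loop: ask the LLM for code, execute it,
    feed the observation back, and stop once final_answer() is called."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"New task: {task}"},
    ]
    for _ in range(max_steps):
        reply = llm(messages)  # returns a "Thought: ... Code: ..." string
        code = extract_code(reply)
        printed, final = [], []
        env = {
            "print": lambda *args: printed.append(" ".join(map(str, args))),
            "final_answer": lambda answer: final.append(answer),
        }
        exec(code, env)  # smolagents uses a sandboxed interpreter instead
        if final:
            return final[0]
        messages.append({"role": "assistant", "content": reply})
        messages.append(
            {"role": "user", "content": "Observation:\n" + "\n".join(printed)}
        )
    raise RuntimeError("Reached max steps without a final answer")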

Output

Here’s the output that agent.run displays:

╭──────────── New run ────────────╮
│                                 │
│ How much is 2 * 21?             │
│                                 │
╰─ OpenAIServerModel - gpt-4o-mini╯
━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  result = 2 * 21
  final_answer(result)
 ──────────────────────────────────────
Out - Final answer: 42
[Step 0: Duration 1.08 seconds| Input tokens: 1,956 | Output tokens: 52]

Multi-step example (Fibonacci sequence, no tools)

The previous task was trivial (no tools required beyond the final_answer tool). The true potential of AI agents lies in performing complex tasks with tools. Let’s look at an example where the agent’s plan involves two steps.

from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="gpt-4o-mini",
    api_base="http://localhost:8080/v1",
)

agent = CodeAgent(tools=[], model=model, add_base_tools=False)

agent.run(
    "Could you give me the 118th number in the Fibonacci sequence?",
)

Prompt (1st step)

System

The system prompt is the same as in the previous example.

User

New task: Could you give me the 118th number in the Fibonacci sequence?

Response (1st step)

Unlike the first example, this response doesn’t use final_answer:

Thought: To find the 118th number in the Fibonacci sequence, I can use a simple
iterative approach to compute Fibonacci numbers until the 118th index. This will be
efficient and easy to implement. I'll create two variables to hold the last two
Fibonacci numbers and iteratively compute the next Fibonacci number until I reach
the 118th one.

Code:
```py
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

fibonacci_118 = fibonacci(118)
print(fibonacci_118)
```

This example highlights one of the key challenges in agent systems: planning efficiency. While a task can often be solved in multiple ways, some plans are more efficient than others. Longer plans require more computational steps and time to execute, but current models don’t necessarily optimize for the shortest possible solution.

In this case, the model chose a two-step approach when a single step would have sufficed: it could have directly called final_answer(fibonacci_118) instead of using print(fibonacci_118).
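
That is, the model could have emitted this single step:

```py
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Calling final_answer directly ends the run in one step
final_answer(fibonacci(118))
```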

Since the model didn’t call final_answer to conclude the task, the agent proceeds with an additional step.

Prompt (2nd step)

System

The system prompt is the same as in the previous example.

User

New task: Could you give me the 118th number in the Fibonacci sequence?

Assistant

Unlike the first example where we had a single step, this time we have an assistant message. This message includes the output from the previous step (Thought: and Code:) along with information about the tools that were called (Calling tools:).

Thought: To find the 118th number in the Fibonacci sequence, I can use a simple iterative approach to compute Fibonacci
numbers until the 118th index. This will be efficient and easy to implement. I'll create two variables to hold the last
two Fibonacci numbers and iteratively compute the next Fibonacci number until I reach the 118th one.

Code:
```py
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

fibonacci_118 = fibonacci(118)
print(fibonacci_118)
```
Calling tools:
[{'id': 'call_1', 'type': 'function', 'function': {'name': 'python_interpreter', 'arguments': 'def fibonacci(n):\n    a,
b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n\nfibonacci_118 =
fibonacci(118)\nprint(fibonacci_118)'}}]

User

To generate the final message, the agent runs the Python code and reports the printed output as an observation:

Call id: call_1
Observation:
Execution logs:
2046711111473984623691759
Last output from code snippet:
None
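
Putting it all together, the second API call sends a message list shaped roughly like this (contents abridged):

messages = [
    {"role": "system", "content": "You are an expert assistant who can solve any task using code blobs. ..."},
    {"role": "user", "content": "New task: Could you give me the 118th number in the Fibonacci sequence?"},
    {"role": "assistant", "content": "Thought: To find the 118th number... Code: ... Calling tools: [...]"},
    {"role": "user", "content": "Call id: call_1\nObservation:\nExecution logs:\n2046711111473984623691759\n..."},
]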

Response

The response to the second API call shows that the model has identified the final output: it produces a new code snippet that simply calls final_answer, which finishes the agent run.

Thought: The 118th number in the Fibonacci sequence is 2046711111473984623691759. Now I will provide this as the final
answer using the `final_answer` tool.

Code:
```py
final_answer(2046711111473984623691759)
```

Output

This is the output we see in the terminal.

╭──────────── New run ────────────╮
│                                 │
│ Could you give me the 118th     │
│ number in the Fibonacci         │
│ sequence?                       │
│                                 │
╰─ OpenAIServerModel - gpt-4o-mini╯
━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  def fibonacci(n):
      a, b = 0, 1
      for _ in range(n):
          a, b = b, a + b
      return a

  fibonacci_118 = fibonacci(118)
  print(fibonacci_118)
 ──────────────────────────────────────
Execution logs:
2046711111473984623691759

Out: None
[Step 0: Duration 2.70 seconds| Input tokens: 1,961 | Output tokens: 127]
━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ─────────────
  final_answer(2046711111473984623691759)
 ──────────────────────────────────────
Out - Final answer: 2046711111473984623691759
[Step 1: Duration 1.55 seconds| Input tokens: 4,179 | Output tokens: 186]

Final thoughts

The concept of AI agents is rapidly evolving. Encouragingly, a consensus is emerging around their core concepts: agents plan and use tools to accomplish tasks. However, research in this field is still a work in progress. As new research emerges and more powerful models are developed, many existing frameworks will likely become outdated. This is why I believe it’s crucial to understand what’s happening behind the scenes; specifically, what API calls are being made to the LLM. This understanding allows us to grasp their strengths and weaknesses, customize their behavior, or even develop our own solutions when existing options don’t meet our needs.

A significant limitation of current frameworks is their reliance on hardcoded prompts, as there’s no guarantee these prompts will perform optimally for specific tasks. I predict that future agent frameworks will evolve into meta-frameworks, offering greater flexibility to customize prompts and choose between different planning strategies (such as defining a complete plan upfront versus incrementally adding steps until reaching a stopping condition, like smolagents does).
