Tool Calling and Agents

Interacting with the world

The Fundamental Limitation

Language models are remarkable at understanding and generating text. They can explain quantum physics, write poetry, debug code, and carry on thoughtful conversations. But there is a fundamental limitation built into their very nature.

LLMs can only produce text. They cannot send emails, browse websites, execute code, query databases, or check the weather. When you ask a language model what time it is, it cannot actually look at a clock. It can only guess based on patterns in its training data.

This limitation runs deeper than it first appears:

  • Knowledge is frozen. Everything the model knows comes from its training data. It cannot learn about events that happened yesterday.
  • No verification. The model cannot check if its claims are accurate against live sources.
  • No action. Generating a beautifully formatted email does nothing if the model cannot actually send it.

The core issue: to interact with the real world, LLMs need tools. The model's role shifts from being the one who does things to being the one who decides what should be done.

The Evolution of Tool Use

The journey from text generation to tool use happened remarkably fast:

2021: WebGPT — OpenAI trained a model to use a text-based web browser. The model could search, click links, and scroll through pages. It was clunky but proved the concept: language models could learn to use external tools.

2023: Toolformer — Meta showed that models could teach themselves when to use tools. Given a few examples, the model learned to insert API calls into its own generations—calling a calculator for math, a search engine for facts.

2023: Function Calling APIs — OpenAI released structured function calling in their API. Developers could define functions with JSON schemas, and the model would output properly formatted calls. This made tool use practical for production applications.

2024 and beyond — Sophisticated agent frameworks emerged. Models began chaining multiple tools together, planning multi-step tasks, and recovering from errors. The line between "language model" and "autonomous system" started to blur.

How Function Calling Works

Modern function calling follows a clean pattern. The model does not actually execute anything—it produces structured data that your application interprets.

Function Calling Flow

User Query ("What's the weather in Tokyo?") → Model Decides (needs real-time data) → JSON Output ({"function": "get_weather"}) → App Executes (calls weather API) → Result Back ({"temp": 18}) → Final Response ("It's 18°C and cloudy")

The model never executes functions directly. It outputs structured JSON that your application interprets and executes.

Here is the typical flow:

  1. Define available functions. You provide JSON schemas describing what functions exist, what parameters they accept, and what they return.

  2. User sends a query. "What's the weather in Tokyo?"

  3. Model outputs a function call. Instead of guessing the weather, the model outputs structured JSON: {"function": "get_weather", "arguments": {"city": "Tokyo"}}

  4. Your application executes. Your code parses the JSON, calls the real weather API, and gets the result.

  5. Return result to model. You send the API response back: {"temperature": 18, "conditions": "cloudy"}

  6. Model generates final response. "The weather in Tokyo is 18°C and cloudy."

The elegance of this approach is separation of concerns. The model decides what to do and when. Your application handles the how. The model never needs direct access to external systems—it just speaks a structured language that your code interprets.
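The whole flow can be sketched in a few lines of Python. The weather function, tool registry, and JSON shapes below are illustrative assumptions for this sketch, not any particular provider's API:

```python
import json

def get_weather(city: str) -> dict:
    # Stand-in for a real weather API call (hypothetical data).
    return {"temperature": 18, "conditions": "cloudy"}

# Tool registry: the schema is what the model sees; the function is
# what the application actually executes on the model's behalf.
TOOLS = {
    "get_weather": {
        "schema": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
        "fn": get_weather,
    }
}

def handle_model_output(model_output: str) -> dict:
    """Parse the model's structured JSON and execute the named function."""
    call = json.loads(model_output)
    tool = TOOLS[call["function"]]
    return tool["fn"](**call["arguments"])

# The model never calls the API itself; it only emits JSON like this:
model_output = '{"function": "get_weather", "arguments": {"city": "Tokyo"}}'
result = handle_model_output(model_output)
# `result` would then be sent back to the model for its final reply.
```

Note the division of labor: the model's output is just data, and the application remains the only component with credentials and network access.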

The ReAct Pattern

Simple function calling handles single-step tasks well. But complex problems require reasoning through multiple steps: gathering information, making decisions based on results, and adapting when things go wrong.

The ReAct pattern (Reasoning + Acting) addresses this by interleaving explicit reasoning with actions:

ReAct Trace

Thought: I need to find who won the most recent Super Bowl. My knowledge might be outdated.

Action: search("Super Bowl 2024 winner")

Observation: The Kansas City Chiefs defeated the San Francisco 49ers 25-22 in Super Bowl LVIII...

Thought: I found the answer. The Chiefs won the most recent Super Bowl.

Answer: The Kansas City Chiefs won Super Bowl LVIII in February 2024.


The model explicitly reasons before acting. This makes decisions interpretable and helps avoid mistakes.

The model alternates between three modes:

  • Thought: Explicit reasoning about what to do next and why. This is not sent to tools—it is the model "thinking out loud."
  • Action: A specific tool call to gather information or make changes.
  • Observation: The result returned from the tool.

The key insight: making reasoning explicit helps the model plan better. Instead of immediately jumping to an action, the model first considers what information it needs and why. This reduces errors and makes the model's decision-making interpretable.

For example, answering "Who won the most recent Super Bowl?" might look like:

Thought: I need to find the most recent Super Bowl and its winner. I should search for current information since my training data might be outdated.

Action: search("Super Bowl 2024 winner")

Observation: The Kansas City Chiefs won Super Bowl LVIII on February 11, 2024...

Thought: I now have the information I need to answer the question.

Answer: The Kansas City Chiefs won the most recent Super Bowl (LVIII) in 2024.
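A minimal ReAct driver can be sketched as a short loop. The `llm` and `tools` callables and the `Action: tool("arg")` text format are assumptions of this sketch, not a fixed standard:

```python
import re

def react_loop(question, llm, tools, max_steps=5):
    """Alternate model reasoning with tool execution until the model
    emits an Answer (i.e., a step with no Action) or the budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # model continues with Thought/Action or Answer
        transcript += step + "\n"
        match = re.search(r'Action:\s*(\w+)\("([^"]*)"\)', step)
        if match is None:       # no Action means the model gave its Answer
            return step
        name, arg = match.groups()
        observation = tools[name](arg)  # the app executes, not the model
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached"

# Usage: react_loop("Who won the most recent Super Bowl?", llm, tools)
# where `llm` wraps a model call and `tools` maps names to functions.
```

The transcript accumulates the full Thought/Action/Observation history, so each model call sees everything learned so far.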

The Agent Loop

ReAct handles individual reasoning chains. But true agents operate in a continuous loop, taking actions in the world, observing results, and deciding what to do next.

Agent Loop

Observe (gather information) → Think (reason & plan) → Act (execute action) → back to Observe

This simple loop enables complex behaviors: multi-step planning, error recovery, and adaptive problem-solving.

The agent loop follows a simple pattern that repeats until the task is complete:

  1. Observe — Gather information about the current state. What just happened? What data is available?
  2. Think — Reason about the observation. What does this mean? What should I do next?
  3. Act — Execute an action, whether that is calling a tool, sending a message, or declaring the task complete.

This loop is deceptively powerful. It allows agents to:

  • Plan multi-step tasks. Break complex goals into manageable steps.
  • Handle failures gracefully. If an action fails, observe the error and try a different approach.
  • Adapt to new information. Adjust the plan based on what is learned along the way.

The challenge is knowing when to stop. Without careful design, agents can get stuck in loops, waste resources on dead ends, or take unintended actions. Effective agents need clear termination conditions and boundaries on what they can do.
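One way to sketch the loop with explicit termination is a step budget plus a "done" decision from the model. The `think` and `act` callables here are hypothetical stand-ins for the model and the tool layer:

```python
def run_agent(goal, think, act, max_steps=10):
    """Observe-Think-Act loop with two stopping conditions:
    the model declares the task done, or the step budget runs out."""
    observation = f"Goal: {goal}"
    for _ in range(max_steps):
        decision = think(observation)      # Think: plan from the observation
        if decision.get("done"):           # explicit termination condition
            return decision["result"]
        observation = act(decision["action"])  # Act, then Observe the result
    return None  # budget exhausted: fail closed rather than loop forever
```

Returning a sentinel when the budget is exhausted, instead of retrying indefinitely, is one simple guardrail against the runaway loops described above.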

Model Context Protocol (MCP)

As tool use became central to AI applications, a problem emerged: every application defined tools differently. Integrating a language model with a database required custom code. Adding email required more custom code. Each new capability meant more bespoke integration work.

The Model Context Protocol (MCP) aims to solve this with standardization. Think of it as a "USB-C port for AI applications"—a common interface that any tool provider can implement and any AI application can consume.

MCP defines:

  • Resources — Data the model can read (files, database records, API responses)
  • Tools — Actions the model can take (send email, create file, query database)
  • Prompts — Pre-built templates for common tasks

A model using MCP does not need custom integrations for every service. It speaks a common protocol, and any MCP-compatible tool just works.
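As a rough illustration, a tool declaration exposed over MCP has the following shape (field names mirror the protocol's published `tools/list` schema; this sketch omits transport details such as JSON-RPC framing and capability negotiation):

```python
# Hypothetical MCP tool declaration: a name, a human-readable
# description, and a standard JSON Schema describing the inputs.
send_email_tool = {
    "name": "send_email",
    "description": "Send an email on the user's behalf",
    "inputSchema": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}
```

Because the declaration is plain JSON Schema, any MCP client can present this tool to any model without knowing anything about the email service behind it.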

Major providers have adopted MCP. Claude, the model you might be interacting with right now, uses MCP to connect to file systems, browsers, and external services. The standardization means tool ecosystems can grow independently of any single model provider.

The Path Forward

Tool calling transformed what language models can do. They went from impressive text generators to systems that can actually accomplish tasks in the world.

But this is just the beginning. As models become more capable at reasoning and planning, the line between "tool" and "collaborator" will continue to blur. The fundamental pattern—a model that decides what to do, with external systems that execute—will likely remain, even as the sophistication of both sides increases.

Understanding this pattern is essential for building AI applications today. The model provides intelligence and flexibility. Your tools provide capabilities and guardrails. Together, they create systems that neither could achieve alone.

Key Takeaways

  • Language models can only generate text—they need tools to interact with the world
  • Function calling lets models output structured JSON that applications execute
  • The ReAct pattern interleaves explicit reasoning with actions for better planning
  • The agent loop (Observe → Think → Act) enables multi-step, adaptive behavior
  • Model Context Protocol (MCP) standardizes tool integration across applications
  • Effective agents need clear boundaries and termination conditions