Building agents · 3 of 7

Tools — the agent's hands

An agent without tools is a chatbot. An agent with too many tools is paralyzed. Tool design is the most underrated engineering decision in agent building: each tool is a small API the agent has to understand, call correctly, handle errors from, and combine with others. The principles below are what separate agents that work in production from agents that work only in demos.

Where the binding constraint sits today

How many tools is too many? What does a good tool description look like? How do you handle errors that the model cannot recover from? These questions are still open enough that most teams discover the answers the hard way. The shortcut is to copy the patterns that have worked in production agents from frontier labs and serious app teams.

Anatomy of a tool the model can actually use

A tool, from the model's point of view, is a structured object with four parts: a name, a description, a parameter schema, and a return contract. The name is short and verb-led. The description is one paragraph telling the model when to use this tool, when not to use it, and what it returns. The parameter schema names every input, types it, and marks which are required. The return contract specifies what success and failure look like — what fields come back, what error shapes are possible, what each error means.

Most production failures trace back to one of these four being underspecified. A vague description leads the model to call the tool in the wrong situation. An ambiguous parameter schema leads to malformed calls. An undocumented error shape leads to the model attempting a sensible-looking recovery that is actually wrong. Spending an hour tightening a tool definition is often higher leverage than spending a day improving the prompt.

The description specifically deserves more attention than most teams give it. Good descriptions are written as if they are addressing a competent but uninformed engineer: explicit about preconditions, explicit about side effects, explicit about what other tool to use if this one is the wrong fit. The model performs noticeably better on tool selection when the descriptions actively help it rule tools out, not just rule them in.
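As a minimal sketch of the four parts described above, here is one way a tool definition might be written down in Python. The ToolSpec class, the lookup_order tool, the search_orders tool it points to, and every field name are hypothetical and chosen only for illustration; real agent frameworks each have their own format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """The four parts of a tool as the model sees it (illustrative shape)."""
    name: str          # short and verb-led
    description: str   # when to use it, when not to, what comes back
    parameters: dict   # JSON-Schema-style description of the inputs
    returns: dict      # success fields and the possible error shapes


# Hypothetical tool definition; all names here are assumptions.
LOOKUP_ORDER = ToolSpec(
    name="lookup_order",
    description=(
        "Fetch the current status of one order by its order ID. "
        "Use when the user asks about a specific order and the ID is known. "
        "Do not use to search by customer name; use search_orders for that. "
        "Read-only, no side effects. Returns status and last update time."
    ),
    parameters={
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID, e.g. 'ord_123'."},
        },
        "required": ["order_id"],
    },
    returns={
        "success": {"status": "string", "updated_at": "ISO 8601 timestamp"},
        "errors": {
            "not_found": "No order with this ID exists; ask the user to check it.",
            "rate_limited": "Temporary; retry after a short wait.",
        },
    },
)


def missing_required(spec: ToolSpec, args: dict) -> list[str]:
    """Return the required parameter names that are absent from args."""
    return [p for p in spec.parameters.get("required", []) if p not in args]
```

Note that the description spends as many words ruling the tool out as ruling it in, and that each error shape says what it means for the next step.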

The cardinality question

There is no universal right number of tools, but the shape of the curve is clear. Below roughly five tools, the agent has too little leverage and the engineering effort feels disproportionate to what was built. Between roughly five and fifteen tools, the agent has expressive range and the model can still hold the entire tool surface in mind. Somewhere between twenty and thirty, tool-selection accuracy starts to degrade. Above fifty, the degradation is usually severe enough that the agent should be redesigned.

The teams that ship the best agents in 2026 have learned to be disciplined about tool count. The first instinct when a feature is requested is to add a new tool. The disciplined instinct is to ask whether an existing tool can be extended with a new parameter, or whether the work can be hidden inside an existing tool rather than exposed as a new one. Every tool added is a small tax on every future tool-selection decision the agent makes.

If the agent genuinely needs many tools — and some do — the right pattern is usually to break the agent into subagents. A primary agent dispatches to a research subagent that has its own twenty tools, or to a billing subagent that has its own ten. Each subagent sees a smaller tool surface; the primary agent sees a smaller surface of just the subagent dispatch tools. This pattern is increasingly standard in serious production deployments.
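The dispatch pattern above can be sketched roughly as follows. This is a toy under stated assumptions: the Subagent class, the tool names, and the dispatch names are all hypothetical, and the step where a model picks a tool is replaced by the caller naming it directly.

```python
from typing import Callable

Tool = Callable[[str], str]

# Hypothetical tool surfaces; each subagent sees only its own tools.
RESEARCH_TOOLS: dict[str, Tool] = {
    "search_docs": lambda q: f"docs results for {q!r}",
    "fetch_page": lambda q: f"page contents for {q!r}",
}
BILLING_TOOLS: dict[str, Tool] = {
    "get_invoice": lambda q: f"invoice for {q!r}",
    "list_payments": lambda q: f"payments for {q!r}",
}


class Subagent:
    """A worker limited to the tools it was constructed with."""

    def __init__(self, name: str, tools: dict[str, Tool]) -> None:
        self.name = name
        self.tools = tools

    def run(self, tool_name: str, arg: str) -> dict:
        # In a real system a model would choose tool_name from self.tools.
        if tool_name not in self.tools:
            return {"ok": False, "error": "unknown_tool"}
        return {"ok": True, "result": self.tools[tool_name](arg)}


research = Subagent("research", RESEARCH_TOOLS)
billing = Subagent("billing", BILLING_TOOLS)

# The primary agent's entire tool surface is the dispatch tools.
PRIMARY_TOOLS = {
    "dispatch_research": research.run,
    "dispatch_billing": billing.run,
}
```

The primary agent chooses between two tools here, and a billing request can never reach a research tool by accident.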

5–15: tools per agent that production teams typically converge on

>30: tool count at which selection accuracy noticeably degrades for current frontier models

Read tools vs write tools

Tools split into two categories with very different engineering implications. Read tools query the world: fetch a user's profile, search a knowledge base, look up an order status. Read tools are forgiving. The cost of calling one is small, the cost of calling one in error is usually just wasted compute, and the result can be discarded if irrelevant.

Write tools change the world: send an email, charge a card, file a ticket, write a record to the database. Write tools are unforgiving. The cost of calling one in error can be a confused customer, a regulatory issue, or real money moved. A production agent should treat read tools and write tools differently. Read tools can be called freely. Write tools should be called only after explicit reasoning, often only after explicit human confirmation, and always with full logging.

The practical pattern: write tools should require structured inputs the model has to assemble explicitly, not free-text fields it can populate loosely. They should include a confirmation step where the agent surfaces the intended action to the user (or to a logging system if the user is absent) before executing. They should be idempotent where possible, so that re-running the same write does not double-execute. Where the operation is not naturally idempotent, the call should carry a unique transaction key that the agent generates once per intended action and reuses on every retry of that action.
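A rough sketch of a write tool with structured input, a confirmation step, and logging might look like this. SendEmailRequest, send_email_tool, and the confirm, send, and audit_log parameters are hypothetical names introduced for illustration; a real deployment would plug in its own UI prompt, mail client, and log store.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SendEmailRequest:
    """Structured input: every field must be assembled explicitly."""
    to: str
    subject: str
    body: str

    def validate(self) -> None:
        if "@" not in self.to:
            raise ValueError("'to' must be an email address")
        if not self.subject.strip() or not self.body.strip():
            raise ValueError("subject and body must be non-empty")


def send_email_tool(
    req: SendEmailRequest,
    confirm: Callable[[str], bool],            # asks the user, or a policy, to approve
    send: Callable[[SendEmailRequest], None],  # the actual side effect
    audit_log: list[str],                      # stand-in for a real log sink
) -> dict:
    """Validate, surface the intended action, and execute only if confirmed."""
    req.validate()
    summary = f"Send email to {req.to} with subject {req.subject!r}"
    audit_log.append(f"proposed: {summary}")
    if not confirm(summary):
        audit_log.append("declined")
        return {"ok": False, "error": "not_confirmed"}
    send(req)
    audit_log.append("executed")
    return {"ok": True}
```

Passing confirm and send in as functions keeps the side effect out of the tool's own logic, which also makes the tool easy to test without sending anything.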

Idempotency and error handling

Idempotency is the property that calling the same operation twice has the same effect as calling it once. For agents, idempotency is not a nice-to-have. Agents retry. They retry because the network fails, because the API rate-limits, because the model decides on reflection that the previous call did not work. If the underlying tool is not idempotent, a retry double-executes — two emails, two charges, two tickets.

The standard fix for non-idempotent operations is to generate a unique idempotency key per logical action and pass it through. Most modern APIs support this pattern (Stripe was one of the early adopters and the pattern has spread). The agent generates the key once per intended action; if the call is retried, the server recognizes the key and returns the result of the original call instead of executing the operation a second time.
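To make the mechanism concrete, here is a toy in-memory version of server-side deduplication on an idempotency key. PaymentServer and its charge method are hypothetical and do not model any real payment API; the only point is that a retry with the same key does not execute twice.

```python
import uuid


class PaymentServer:
    """Toy in-memory server that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A key seen before means this is a retry: return the stored result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


def new_idempotency_key() -> str:
    """Generate one key per intended action, not one per attempt."""
    return str(uuid.uuid4())


server = PaymentServer()
key = new_idempotency_key()           # created once for this logical charge
first = server.charge(key, 1999)
retry = server.charge(key, 1999)      # e.g. after a network timeout
```

Here first and retry are the same stored result, and the server has made exactly one charge. A genuinely new action would get a new key and a new charge.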

Error handling more broadly should distinguish between errors the agent can recover from and errors it should escalate. Rate-limit errors are recoverable with backoff. Authentication errors are not recoverable from inside the agent — they escalate. Schema-validation errors are usually recoverable because the agent can retry with corrected parameters. Permanent business-logic errors (the customer does not exist, the account is closed) should be returned to the agent as structured data, not as exceptions, so the agent can reason about them and decide what to do.
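One possible shape for this split is a thin wrapper around every tool call. The exception classes, call_with_policy, and lookup_customer below are all hypothetical names, and the retry counts and delays are placeholder values, not recommendations.

```python
import time
from typing import Callable


class RateLimitError(Exception): ...
class AuthError(Exception): ...
class SchemaError(Exception): ...
class EscalationRequired(Exception): ...


def lookup_customer(customer_id: str) -> dict:
    """Hypothetical tool: business-logic failures come back as data."""
    if not isinstance(customer_id, str):
        raise SchemaError("customer_id must be a string")
    customers = {"c_1": {"name": "Ada"}}
    if customer_id not in customers:
        return {"ok": False, "error": "customer_not_found", "recoverable": False}
    return {"ok": True, "customer": customers[customer_id]}


def call_with_policy(
    tool: Callable[..., dict],
    args: dict,
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Apply one error policy to every tool call."""
    for attempt in range(max_retries + 1):
        try:
            return tool(**args)
        except RateLimitError:
            # Recoverable: wait with exponential backoff, then try again.
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)
        except SchemaError as exc:
            # Recoverable by the agent: hand back details so it can fix the call.
            return {"ok": False, "error": "invalid_arguments",
                    "detail": str(exc), "recoverable": True}
        except AuthError as exc:
            # Not recoverable from inside the agent: escalate.
            raise EscalationRequired("authentication failed") from exc
    return {"ok": False, "error": "rate_limited", "recoverable": False}
```

The agent only ever sees dictionaries it can reason about; the one exception that escapes is the one a human or supervising system has to handle.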

Telemetry from day one

Every tool call should produce a log entry that includes the tool name, the inputs, the output (or error), the latency, and the agent context that produced the call (which step, which sub-task, which conversation). This sounds like operational hygiene, but it is the foundation of agent debugging, agent evaluation, and agent improvement, and teams consistently underinvest in it early on.

The cost of adding telemetry on day one is one engineer-week. The cost of adding it on day ninety, after the agent is in production and the team is debugging customer-facing failures with no historical data, is a month and a customer-trust deficit. Production agents accumulate failure modes faster than the team can keep up; without telemetry, the team is debugging by guesswork and customer complaint.

The pattern that works: a single agent-trace object that follows the agent from goal to final response, with each tool call appended as it happens. The trace is persisted, queryable, and exposed in a debugging UI. When something goes wrong, the engineer reads the trace, not the source code. The trace is also the input to the eval system covered in chapter five.
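A minimal sketch of such a trace object, assuming an in-process agent and JSON as the persistence format. AgentTrace, ToolCallRecord, and their fields are hypothetical names; the fields mirror the log entry described earlier (tool name, inputs, output or error, latency, and step).

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolCallRecord:
    tool: str
    inputs: dict
    output: Any
    error: Optional[str]
    latency_ms: float
    step: int


@dataclass
class AgentTrace:
    """Follows one agent run from goal to final response."""
    conversation_id: str
    goal: str
    calls: list[ToolCallRecord] = field(default_factory=list)
    final_response: Optional[str] = None

    def call(self, tool_name: str, fn: Callable[..., Any], **inputs: Any) -> Any:
        """Run a tool and append a record whether it succeeds or raises."""
        start = time.perf_counter()
        output, error = None, None
        try:
            output = fn(**inputs)
            return output
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
            raise  # the caller still sees the failure; the trace keeps a copy
        finally:
            latency_ms = (time.perf_counter() - start) * 1000
            self.calls.append(ToolCallRecord(
                tool_name, inputs, output, error, latency_ms, len(self.calls) + 1))

    def to_json(self) -> str:
        """Serialize for persistence; non-JSON values fall back to str()."""
        return json.dumps(asdict(self), default=str)
```

Routing every tool call through trace.call means no call can happen without leaving a record, which is the property the debugging UI and the eval system both depend on.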

Tools should be at the level the human would think about

The single most common mistake in tool design is to expose the underlying API directly to the model. If the underlying API has fifteen endpoints with overlapping responsibilities, the agent now has fifteen confusingly-named tools to choose between. The right move is almost always to wrap the underlying API in a smaller, opinionated tool surface that operates at the level the agent's job actually requires.

A concrete example: a calendar agent does not need a tool for each Google Calendar endpoint. It needs three tools — find available times, book a meeting, cancel or reschedule a meeting — each of which may call multiple underlying endpoints. The translation between the agent's vocabulary and the API's vocabulary is the engineering work of building the tool layer, and that translation is what makes the agent reliable.
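The calendar example might be sketched as below. FakeCalendarAPI is an in-memory stand-in written for this example and is not the Google Calendar client; its method names, the hour-granularity slots, and the 9-to-17 working hours are all assumptions made to keep the sketch short.

```python
from typing import Optional


class FakeCalendarAPI:
    """Stand-in for a low-level calendar API (hypothetical, in-memory)."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}
        self._next_id = 1

    def list_events(self, day: str) -> list[dict]:
        return [e for e in self.events.values() if e["day"] == day]

    def get_event(self, event_id: str) -> Optional[dict]:
        return self.events.get(event_id)

    def insert_event(self, day: str, hour: int, title: str) -> str:
        event_id = f"evt_{self._next_id}"
        self._next_id += 1
        self.events[event_id] = {"id": event_id, "day": day, "hour": hour, "title": title}
        return event_id

    def delete_event(self, event_id: str) -> None:
        del self.events[event_id]


class CalendarTools:
    """The three tools the agent sees; each may make several API calls."""
    WORK_HOURS = range(9, 17)  # assumed working hours

    def __init__(self, api: FakeCalendarAPI) -> None:
        self.api = api

    def find_available_times(self, day: str) -> list[int]:
        busy = {e["hour"] for e in self.api.list_events(day)}
        return [h for h in self.WORK_HOURS if h not in busy]

    def book_meeting(self, day: str, hour: int, title: str) -> dict:
        if hour not in self.find_available_times(day):
            return {"ok": False, "error": "slot_unavailable"}
        return {"ok": True, "event_id": self.api.insert_event(day, hour, title)}

    def cancel_or_reschedule(self, event_id: str, new_day: Optional[str] = None,
                             new_hour: Optional[int] = None) -> dict:
        event = self.api.get_event(event_id)
        if event is None:
            return {"ok": False, "error": "event_not_found"}
        if new_day is None or new_hour is None:
            self.api.delete_event(event_id)      # plain cancellation
            return {"ok": True, "cancelled": event_id}
        booked = self.book_meeting(new_day, new_hour, event["title"])
        if booked["ok"]:
            self.api.delete_event(event_id)      # drop the old slot only on success
        return booked
```

The agent never sees list, get, insert, or delete. A reschedule is one tool call from its side even though it is three API calls underneath.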

The test for whether a tool is at the right level: can a non-engineer who understands the agent's job read the tool list and understand what the agent can do? If yes, the level is right. If the tool list reads like an API reference, the level is too low.

Strategic read

Tool design is leverage. Two agents using the same underlying model can perform very differently because one has well-designed tools and the other does not. For a buyer evaluating agent products, the tool surface is one of the most reliable indicators of engineering quality, more reliable than the prompt or the model choice, which can be copied trivially. Tools take taste and time to get right.

For a team building agents, the rule is to spend disproportionate time on tools relative to other components. The model is a commodity. The prompts are imitable. The tool surface — the names, descriptions, parameter shapes, error contracts, idempotency story, and telemetry — is where engineering excellence shows up and where production reliability is determined.