The Caller You Didn't Design For
In early 2024, almost every API had two kinds of callers: your own frontend, and a small number of partner integrations written by humans who had read your docs. By early 2026, that isn't quite true anymore. Anthropic launched the Model Context Protocol in November 2024 — a standard for how agents discover and call tools. OpenAI's function calling had been GA since June 2023. By late 2025 the list of vendors shipping MCP servers or native tool bindings included GitHub, GitLab, Linear, Stripe, Notion, Cloudflare, Sentry, and hundreds of smaller teams. Somewhere in your infrastructure right now, an LLM is almost certainly calling one of your endpoints.
The shift is quiet because the wire format looks the same. HTTP, JSON, status codes. But the caller on the other end has a different cognitive profile than a human developer. It doesn't read your changelog. It doesn't skim the Authentication page. It takes a tool description, a JSON schema, and an error message as input and decides — on the basis of those alone — whether to call you, what to send, and what to do when you return a 400.
That's the architectural shift that isn't in most decks yet: the fastest-growing class of API consumers in 2026 doesn't have human intuition and won't patiently email support. If your error body says 'Invalid request,' the agent will try again with a slightly different payload, fail in a slightly different way, and burn tokens until it times out. Most backends were not built for that caller. The ones that are — quietly — have a massive advantage.
Tool Descriptions Are the New API Docs
When you expose an endpoint to an agent, three things go into the model's context: the tool name, the description, and the parameter schema. That is the entire surface the model has to work with. Not your OpenAPI page, not the example payloads buried in your SDK readme, not the onboarding tutorial — just those three fields. If the description is vague, the model either avoids the tool or calls it wrong.
The guidance from Anthropic and the MCP spec has been consistent since late 2024: write tool descriptions the way you'd write onboarding notes for a new engineer who's about to be on call. State what the tool does, when to use it, when not to use it, and what each parameter means in plain language. The difference between 'Creates a resource' and 'Creates a new deployment for the specified environment. Use only after the build step has completed successfully' shows up directly in how reliably the model picks and calls the tool.
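To make that concrete, here is a sketch of an MCP-style tool definition where the description and parameter docs carry the what / when / when-not guidance. Every name here — the tool, the parameters, the enum values — is illustrative, not taken from any real product's API:

```python
# Hypothetical MCP-style tool definition. Tool and parameter names are
# illustrative, not from any real API.
CREATE_DEPLOYMENT_TOOL = {
    "name": "create_deployment",
    # What it does, when to use it, when NOT to use it -- the three things
    # the model actually gets to see.
    "description": (
        "Creates a new deployment for the specified environment. "
        "Use only after the build step has completed successfully. "
        "Do not use for rollbacks; this tool cannot target old builds."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "environment": {
                "type": "string",
                "enum": ["staging", "production"],
                "description": "Target environment for the deployment.",
            },
            "build_id": {
                "type": "string",
                "description": "ID of a build that finished with status "
                               "'succeeded'. Builds still running are rejected.",
            },
        },
        "required": ["environment", "build_id"],
    },
}
```

Note that the when-not-to-use clause lives in the description itself, not in external docs — it has to, because the description is the only place the model will ever see it.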
The second-order effect is organizational. Tool descriptions live closer to product copy than to code comments, and they need to be versioned, reviewed, and regression-tested against eval suites the same way you regression-test a UI. Teams taking agents seriously in 2026 have someone whose job is partly to write and maintain those descriptions. That role didn't exist two years ago.
"The tools you hand an agent are part of its prompt. Poor tool design looks, from the outside, exactly like the model being incapable — and it mostly isn't."
— Anthropic Engineering, "Building Effective Agents," December 2024
Error Messages Become Part of the Prompt
When a human hits a 400 in the browser, they see a red toast, read the doc, and fix their code. When an agent hits a 400, it reads your error body and decides what to do next — retry, change approach, or escalate to the user. Your error message is now a prompt. The thing you optimized for terseness, or for 'not leaking internal state,' is the thing that determines whether the agent recovers or loops until someone cancels the run.
The useful shape is specific: name the exact field that failed, state what was expected, and suggest the corrective action. 'amount must be a positive integer; received -12' beats 'validation failed' decisively in agent recovery rate — the first gets a corrected retry on the next call, the second gets a guessing loop. Stripe's API errors have worked this way for a decade, and that's part of why their Agent Toolkit, released in late 2024, works out of the box with most coding agents — the errors already read like instructions.
A practical corollary: stop using 500 as a default. Any error that could plausibly be the client's fault should be a structured 4xx with a machine-parseable field identifier. Any 5xx should be explicitly retryable or explicitly not, and the body should say which. Agents use that distinction to decide whether to back off or abandon the call entirely. Ambiguity here costs real money in token spend and in production incidents where agents retry into overload.
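A minimal sketch of what that contract can look like on the wire — the field names (`type`, `field`, `suggestion`, `retryable`) are assumptions for illustration, not any standard:

```python
def error_body(field, expected, received, retryable):
    """Build a 4xx error body an agent can act on without guessing.
    Field names are illustrative, not a standard."""
    return {
        "error": {
            "type": "invalid_request",
            "field": field,                      # machine-parseable identifier
            "message": f"{field} must be {expected}; received {received!r}",
            "suggestion": f"Resend with a valid value for {field}.",
            "retryable": retryable,              # back off, or abandon the call?
        }
    }

body = error_body("amount", "a positive integer", -12, retryable=False)
```

The `retryable` flag is the piece agents lean on hardest: it collapses the back-off-or-abandon decision into a field lookup instead of a guess.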
Idempotency Stops Being Optional
Agents retry. They explore. They will POST twice if the first response was ambiguous, and three times if the second looked suspicious. In human API usage you could rely on the fact that a user doesn't double-click buttons that hard. In agent usage you cannot, and the cost of getting it wrong is duplicate deployments, duplicate charges, or duplicate tickets in your queue.
The remediation is boring and well-understood: idempotency keys on every write operation, explicit dedup windows, and create-if-not-exists semantics for operations that logically should only happen once. Stripe's idempotency keys, which go back to 2015, are the reference implementation most teams copy. What changed in 2026 is that the same pattern is now load-bearing for internal APIs that never had to worry about it before — because those APIs now have LLM callers.
The subtler failure mode is operations that are technically idempotent but have observable side effects — emails, webhooks, audit log entries. If the agent retries and you re-send the welcome email, the user notices. Idempotency on write endpoints has to be paired with dedup on the side-effect emitters. This is often where the architecture genuinely has to change, not just the contract at the edge.
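Both halves of the pattern — replaying the cached response on a key match, and deduping the side-effect emitter separately — can be sketched together. This toy version is in-memory and purely illustrative; a production implementation would use a shared store (Redis, or the database itself) with a TTL for the dedup window:

```python
class WriteEndpoint:
    """In-memory sketch of idempotent writes plus side-effect dedup.
    Illustrative only: real versions need a shared store and a TTL window."""

    def __init__(self):
        self._responses = {}      # idempotency key -> cached response
        self.emails_sent = []     # stand-in for a side-effect emitter

    def create_user(self, idempotency_key, email):
        # A retry with the same key replays the stored response: no second
        # user row, and crucially no second welcome email.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        user = {"id": len(self._responses) + 1, "email": email}
        if email not in self.emails_sent:   # dedup the emitter, not just the write
            self.emails_sent.append(email)
        self._responses[idempotency_key] = user
        return user
```

The point of keeping the email dedup separate from the response cache is exactly the subtlety above: idempotency at the endpoint doesn't automatically make the emitters idempotent.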
The Long Loop: Agents Don't Do Request/Response
Human API interactions are measured in milliseconds to seconds. Agent runs are measured in minutes to hours. A coding agent fixing a bug might read twenty files, run tests four times, roll back a change, and try a different approach before finishing — one session that the backend sees as dozens of correlated calls across forty minutes. Claude Code, Cursor agents, and GitHub Copilot agent mode all run sessions of this shape in production by early 2026.
What breaks first: token lifetimes, connection timeouts, request-scoped resources, and any state the backend expected to live inside a single HTTP call. The systems that handle this cleanly use durable execution primitives — Temporal, Restate, Inngest, or equivalent — to model agent sessions as checkpointed workflows rather than long-lived connections. The agent's state is persisted between tool calls; the backend's work is decomposed into steps that can resume after a restart. Request/response is not the right unit anymore.
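The shape of that decomposition fits in a few lines. This is a toy checkpoint store standing in for what Temporal-style systems do with a durable event log — a sketch of the idea, not how any of those products actually implement it:

```python
class CheckpointedSession:
    """Toy durable-execution sketch: each step's result is persisted, so a
    restarted session resumes from the last checkpoint instead of redoing
    (and re-side-effecting) completed work. Real systems back the store
    with a durable log, not a dict."""

    def __init__(self, store):
        self.store = store            # survives process restarts

    def step(self, name, fn):
        if name in self.store:        # completed before a crash/restart
            return self.store[name]
        result = fn()
        self.store[name] = result     # checkpoint before the next step
        return result
```

A forty-minute agent session becomes a series of `step()` calls against one persistent store; if the coordinating process dies after the build step, its replacement skips straight past it instead of rebuilding.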
This is the architectural change most teams haven't made yet. They treat agent traffic as an unusually chatty flavor of normal API traffic, and they get paged when a session outlives its token, or when a half-finished deployment gets stuck because the coordinating agent's context was compacted. If you're shipping anything agent-driven at scale in 2026, the orchestration layer is a first-class product concern, not infra plumbing.
"Durable execution is to agents what HTTP was to web apps — the primitive that makes the rest of the stack buildable. Request/response was fine when calls took milliseconds. It isn't fine when they take hours."
— Temporal Engineering Blog, 2025
Auth for Callers That Aren't People
OAuth was built for a human in a browser clicking 'Allow.' Agent auth is messier: the agent is acting on behalf of a user, often through a platform, sometimes with scoped capabilities the user didn't fully read, occasionally across organizations. Handing an agent a user's full-permission token is the equivalent of giving a new intern your root password and hoping for the best. In 2025, a handful of well-publicized incidents — agents leaking private repo contents into logs, agents running destructive operations the user didn't intend — made the cost of that casual approach visible.
The direction the ecosystem settled on by early 2026 is scoped, short-lived tokens with explicit capability grants and human-in-the-loop approval for destructive operations. MCP's auth discussion through 2025 converged on OAuth 2.1 with PKCE as the baseline, plus per-tool scope definitions. In practice this means your backend needs to understand a token not as 'this user' but as 'this user, acting through this agent, with these specific capabilities, for this session.'
The audit trail gets more important, not less. Every tool call an agent makes should be attributable to the underlying human and the agent together, with the request body recorded at a granularity your legal and security teams can actually answer questions from. A log that says 'user X called POST /deployments at time T' was fine in 2022. In 2026 the useful log says 'user X's coding agent session S called deploy_to_production with arguments Y at time T in response to instruction Z.' That shape should fall out of your auth model, not be bolted on afterward.
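One way to make the 'this user, through this agent, with these capabilities, for this session' shape concrete. The token layout below is an assumption for illustration — loosely modeled on OAuth-style scopes, not any specific spec — and the reason strings double as actionable error messages, per the earlier section:

```python
import time

def check_capability(token, tool_name, now=None):
    """Evaluate a scoped, short-lived agent token against one tool call.
    Returns (allowed, reason); token shape is illustrative."""
    now = time.time() if now is None else now
    if now >= token["expires_at"]:
        return False, "token expired; re-authorize this agent session"
    grant = token["capabilities"].get(tool_name)
    if grant is None:
        return False, f"{tool_name} was not granted to this agent session"
    if grant.get("requires_approval"):
        return False, f"{tool_name} requires human approval before it runs"
    return True, "ok"

# A token is not 'this user' -- it binds user + agent + capabilities + session,
# which is also exactly the attribution the audit log needs.
TOKEN = {
    "user": "user-7",
    "agent_session": "sess-a1",
    "expires_at": 1_000,
    "capabilities": {
        "read_build_logs": {},
        "deploy_to_production": {"requires_approval": True},
    },
}
```

Because the session and agent identifiers ride inside the token, the 'user X's coding agent session S called tool T' log entry falls out of the auth check for free.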
What This Looks Like in CI/CD
The place all of this lands hardest in practice is DevOps infrastructure, and it's the example I reach for because it's the one I'm closest to. A modern CI/CD platform in 2026 has agents on it — triaging failed builds, proposing dependency upgrades, restarting jobs that flaked, investigating test regressions, opening remediation PRs. Mendral reported 16,000+ automated CI investigations per month by late 2025. GitLab's Duo Agent Platform went GA in January 2026. GitHub's Copilot coding agent mode hit enterprise GA in early 2025 and has been expanding scope steadily.
Every architectural lesson above becomes a shipping concern in this context. Tool descriptions have to tell the agent which step of a pipeline is safe to retry versus which one would redo a destructive migration. Error messages from a failed build have to be machine-readable enough that the agent can route between 'transient flake, rerun' and 'real failure, open a ticket.' Idempotency matters because a coding agent will happily retry a deploy if it didn't recognize an ambiguous response. Long-running workflows are the default, not the exception — a 'fix the failing pipeline' task routinely involves dozens of tool calls across an hour.
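The 'route between flake and failure' step is where the structured `retryable` field from earlier pays off directly. A sketch — the marker strings and field names are illustrative, and the string-matching fallback is deliberately shown as the brittle path:

```python
TRANSIENT_MARKERS = ("timeout", "connection reset", "rate limit", "503")

def route_build_failure(error):
    """Decide between 'transient flake, rerun' and 'real failure, ticket'.
    Structured errors make this a field lookup; unstructured ones force
    brittle string matching -- the ambiguity the contract removes."""
    if "retryable" in error:                 # the machine-readable path
        return "rerun" if error["retryable"] else "open_ticket"
    message = error.get("message", "").lower()
    if any(marker in message for marker in TRANSIENT_MARKERS):
        return "rerun"
    return "open_ticket"
```

When the error body carries the flag, the routing is a one-line lookup; when it doesn't, the agent is reduced to grepping error prose, which is exactly how retry storms into overloaded systems start.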
The platforms getting this right internally are the ones treating agent ergonomics as a first-class design constraint for their control plane, alongside human UX. The ones getting it wrong are shipping the same REST API they had in 2022 and wondering why their agent pilots plateaued in month two. The delta between those outcomes lives mostly in the boring places — error bodies, retry semantics, token scopes, workflow persistence — not in the part that looked like the product.
Design Moves That Actually Fall Out of All This
If I had to compress everything above into a short checklist for a team about to expose an existing backend to agents, it would look like this. None of these items are individually novel. The shift is treating them as load-bearing rather than nice-to-haves.
- Write tool descriptions the way you'd write onboarding notes for a new on-call engineer. State what, when, when-not, and what the parameters mean in plain language.
- Make every error body actionable without guessing. Name the field, state the expectation, suggest the fix. Distinguish retryable from non-retryable explicitly.
- Default to idempotency on writes, and pair it with dedup on the side-effect emitters. Agents will retry. Assume they will.
- Stop modeling long-running operations as HTTP request lifetimes. Put a durable execution layer behind the tool surface and let sessions outlive individual calls.
- Give agents scoped, short-lived tokens with explicit capability grants. Log every tool call as 'user + agent + capability + session,' not just 'user.'
- Ship an eval harness for your tool surface, not just your product. Regression-test your descriptions and error messages the same way you regression-test the UI.
- Design for the case where 50% of your traffic has no human in the loop. The systems that survive that shift are the ones where agent-readiness was part of the original contract, not a retrofit.
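The eval-harness item can start smaller than it sounds. Before any model-based evals, plain lint-style checks over the tool surface catch a lot in CI. This sketch assumes tools shaped like the MCP-style definitions above; the specific checks are illustrative, not a recommended ruleset:

```python
def lint_tool_surface(tools, checks):
    """Run regression checks over tool definitions; returns failures as
    (tool_name, check_name) pairs, suitable for failing a CI job."""
    return [
        (tool["name"], check_name)
        for tool in tools
        for check_name, check in checks
        if not check(tool)
    ]

CHECKS = [
    # Descriptions should carry when-to-use guidance, not just a noun phrase.
    ("says_when_to_use", lambda t: "use" in t["description"].lower()),
    # Every parameter the model can fill should be described in plain language.
    ("params_described", lambda t: all(
        "description" in p
        for p in t["inputSchema"].get("properties", {}).values())),
]
```

Wire this into the same pipeline that gates UI changes, and 'Creates a resource' descriptions stop reaching production in the first place.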
The Part That Actually Matters
There's a recurring pattern in infrastructure history where a new kind of caller shows up and the systems that adapted their contracts for that caller won, even when the underlying capability looked identical on paper. Mobile traffic in the late 2000s. Programmatic ad bidding in the early 2010s. Serverless event-driven architectures in the late 2010s. Each time, the teams that treated the new caller as a first-class consumer ended up with backends that felt unusually pleasant to build on. The teams that didn't ended up bolting on middleware for the next five years.
Agent-driven usage is the 2026 version of that pattern. The bottleneck here is almost entirely a backend contract question, not a model capability question. Claude, GPT, and Gemini are all capable enough in early 2026 to drive real workflows reliably — the thing that's still rate-limiting adoption is the ergonomics of the surfaces they call. If you're building a backend right now and you ignore this, you'll spend 2027 retrofitting error messages, idempotency keys, and scoped auth into systems that already have users. If you're building with this in mind, the same work goes into the design instead of the cleanup. That's the whole argument.