The Anatomy of a Linguistic AI Agent
From single-turn LLM to long-horizon autonomous AI.

You have used a language model in a chat box. You typed a question, you got an answer, you closed the tab. The whole interaction lasted under a minute. The model did not remember you the next time you opened the page.
You have also seen, or read about, agents that work for hours. A coding agent that ships a feature overnight. A research agent that pulls together a hundred sources before breakfast. They plan, they call tools, they back out of dead ends, they hand you something you can use.
Both are the same model. Same neural network. Same forward pass. The only thing that changed is what's wrapped around it.
This essay is the bridge. The architecture that turns the first thing into the second is not a single insight. It is a stack, a small number of layers, each one added in response to a failure mode of the previous layer. By the end you should be able to point at any agent doing real work in 2026 (coding, research, customer ops) and name which layer is doing the heavy lifting at any given moment.
Some of those layers are old. The fundamental one was published in 2022, before ChatGPT shipped. Some are very new. One was named eighteen months ago and is still settling. None of them, individually, is hard to follow. The trick is seeing them as a sequence, each fix opening the door for the next.
If you want a number to anchor where we start: METR has been measuring the time-horizon of frontier agents, and a language model on its own, with no scaffolding around it, sustains roughly a few minutes of human-equivalent work at 50% reliability. The equivalent of writing a competent meeting summary.
That is the floor.
Every post on the blog this month is on the theme of agent reliability, anchored on the second edition of Mostly Harmless AI, where the engineering details that don't fit a blog post live. You can also read the whole book online for free in a custom reader I built. More at the end.
The base case
Strip everything away first. No agent, no tools, no skills, no harness. Just the model.
A language model, in the strictly minimal sense, is a function from a string to a string. You hand it a sequence of tokens. It hands you back a sequence of tokens. One forward pass through the network. The input goes in at one end, the output comes out the other, one token at a time until the STOP token is generated, and that is the entire interaction. No state is held between calls. The next time you ask the same model the same question, it has no idea you have ever spoken before.
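If it helps to see that contract as code, here is a minimal sketch. The `call_model` function is a stand-in for whatever completion API you use; the point is the shape of the interaction, not any particular provider.

```python
def call_model(prompt: str) -> str:
    """Stand-in for one generation: string in, string out, nothing kept between calls."""
    # In practice this is a single completion request to whatever provider you use.
    return f"<model output for: {prompt[:40]}>"

first = call_model("Summarize this meeting transcript: ...")
second = call_model("What did I just ask you?")
# The second call starts from a blank slate; nothing from the first call survives.
```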
Inside that one shot, the model delivers. It will answer, draft, summarize, translate, brainstorm. Give it a good piece of context and a clear ask, and the response that comes back will, in my experience, often be useful enough to ship as-is. This is the experience that made everyone notice in late 2022. Open a chat, ask anything, get something back you can use. People called it magic at the time. Most of them still do, even though "a useful function with no memory" is the more honest description.
But notice what it cannot do, which is most things you would ever want from an agent.
It cannot verify its own output. The same forward pass that produced the answer is the only one available to check it. There is no second opinion, no quick lookup, no let me try it and see what happens. The model is committed to whatever came out the first time.
It cannot look anything up. Whatever facts it has are baked into the weights from training, frozen at some cutoff date. If you ask about today's news, or your codebase, or an internal company document, the model has nothing. And worse, it will frequently invent something plausible-sounding because completing a confident sentence is what it was trained to do.
It cannot act on the world. It cannot write to a file, send an email, call an API, run a command. It cannot do anything that has a side effect outside the chat window. The only thing it can produce is more text.
Inside the four walls of the context window, the base model is the most capable text engine the field has ever built. A single chat box was enough to launch the largest consumer product of the decade. Outside those walls, it is inert.
METR's measurements of an unaugmented model (no tools, no loop, no scaffolding) put the time horizon at something on the order of minutes of human-equivalent work. Minutes. That is the starting capability. Everything else in this essay is a way of making those minutes compound.
The first leap
The first real agent paradigm is older than ChatGPT.
In October 2022, a team at Princeton and Google published ReAct: Synergizing Reasoning and Acting in Language Models. It went out about six weeks before the ChatGPT launch that made the public notice agents existed at all. Every working agent today, from Claude Code, Codex, and Gemini CLI to the dozens of research agents and customer-ops agents shipping this year, is some refinement of the loop that paper introduced.
Here is the setup. An agent operates in some environment: a Wikipedia API, a household simulator, a web shop, your codebase. The environment offers an action space, the set of things the agent is allowed to do. Call it A. A policy maps the current context to the next action: given everything the agent knows, what does it do next? With nothing else, the policy has to map a long, noisy trajectory of past observations directly to the right next move. This is brittle. The longer the task runs, the more lost the model gets.
ReAct's move is to enlarge the action space. The new action space is A plus L, where L is the space of natural language. A "thought" is an action in L, the agent pausing to write itself a sticky note before reaching for the next tool. It does not change the world, it changes the context. The next action is conditioned on a context that now includes the model's own reasoning about what just happened.
The paper spells out what thoughts are actually for, and the list is concrete, not mystical. Decomposing the goal into a plan. Injecting commonsense the environment does not supply. Extracting the relevant signal from a noisy observation. Tracking progress and noticing when a subgoal is done. Handling exceptions when something breaks. Five jobs.
Why this beats the alternatives is where the paper earns its place. Chain-of-thought prompting, the prior art, has the model reason in a closed loop inside its own head, with no contact with the world. The paper's own ablation on the HotpotQA benchmark is brutal: chain-of-thought hallucinates in 14% of its successes and 56% of its failures. Acting alone, calling tools without thought, is grounded in the world but loses the global plan after a few steps. ReAct synthesizes them. On the same task, ReAct hallucinates in 6% of successes. Less than half. Both halves of the loop have to be there.
One concrete anchor before we move on. ReAct's HotpotQA action space, the entire set of things the agent could do, was exactly three actions: search[entity], lookup[string], finish[answer]. Three. The first working agent paradigm operated on three tools. Hold that number.
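For concreteness, here is a minimal sketch of the loop those three actions live in. The `generate` and `run_action` callables are placeholders for the model and the Wikipedia environment; the shape of the loop, thoughts and actions appended to one growing context, is the point.

```python
def react_loop(question: str, generate, run_action, max_turns: int = 10) -> str:
    """Minimal ReAct loop, assuming the three HotpotQA actions from the paper.

    generate(context) -> a "Thought: ..." line or one of
        search[entity], lookup[string], finish[answer]
    run_action(action) -> the observation string the environment returns
    """
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        step = generate(context)              # the policy: context -> next action
        context += step + "\n"                # every step joins the trajectory
        if step.startswith("Thought:"):
            continue                          # a thought changes the context, not the world
        if step.startswith("finish["):
            return step[len("finish["):-1]    # strip "finish[" and the closing "]"
        observation = run_action(step)        # search[...] / lookup[...] hit the environment
        context += f"Observation: {observation}\n"
    return "no answer within the step budget"
```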
The paper closes with the line that becomes the engine for the rest of this essay. "Complex tasks with large action spaces require more demonstrations to learn well, which unfortunately can easily go beyond the input length limit of in-context learning." In plain English: more capability needs more action descriptions, which need more context, which we do not have. Every layer that follows is the field iteratively solving exactly that bottleneck.
METR step: a model wrapped in this loop moves from minutes to tens of minutes on bounded tasks.
Tools
So how do you fix ReAct's bottleneck, the one the paper named in its own conclusion?
The first, most obvious answer: give the agent more actions to take. If A was the original action space and ReAct enlarged it to A ∪ L, the next move is to make A itself bigger.
That is what a tool is. A tool is a function the model can call. It has a name, a typed schema for its arguments, and a return value. The model writes a tool call into the trajectory the same way it writes a thought. Except this one has a side effect on the world. The harness picks it up, runs the function, drops the return value back into the context. The next turn of the loop sees the result and decides what to do next.
The loop is unchanged. Same thinking, same acting, same context-grows-by-a-turn shape ReAct described. The difference is what the agent is allowed to do.
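Here is roughly what that looks like from the harness side, with one hypothetical `read_file` tool. The dictionary-of-schemas shape is illustrative; every harness has its own wire format, but they all reduce to this: a registry of functions, and a dispatcher that runs whichever one the model names.

```python
import json
from pathlib import Path

def read_file(path: str) -> str:
    """Return the contents of a text file in the workspace."""
    return Path(path).read_text()

TOOLS = {
    "read_file": {
        "function": read_file,
        "description": "Return the contents of a text file in the workspace.",
        "parameters": {  # a JSON-Schema-style description the model reads
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

def dispatch(tool_call: str) -> str:
    """What the harness does when the model emits a tool call: parse it,
    run the function, and hand the result back as the next observation."""
    call = json.loads(tool_call)  # e.g. {"name": "read_file", "arguments": {"path": "README.md"}}
    tool = TOOLS[call["name"]]
    return str(tool["function"](**call["arguments"]))
```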
ReAct, recall, ran on three tools: search, lookup, finish. That was the entire menu. Claude Code in 2026 ships with more than twenty: read a file, edit a file, run a shell command, search the codebase, fetch a URL, spawn a subagent, take a screenshot, schedule a future tick, and so on. Each one is just a function with a schema. Each one expands the set of things the agent can do without changing one line of the underlying loop.
This is the part that surprised me, the first time I sat with it. The chatbot you typed at in 2022 and the agent that wrote your test suite this morning share one loop. What changed is the tool catalog. Same loop. Bigger menu.
That observation is the unsexy version of why tool-building is now a discipline of its own. Every capability you add to an agent (search the web, read a Slack channel, hit your billing API, deploy to staging) is just another function with a schema. The architecture does not change. The leverage is entirely in which tools you build and how you describe them to the model.
The design discipline that emerges is short to state and brutal to follow. Tools should be few, sharp, and self-describing. Few, because every tool you add takes up tokens in the system prompt and a slot in the model's attention. Sharp, because a tool that does seven things is one the model will use wrong six times out of seven. Self-describing, because the model only learns to use a tool from its name, its docstring, and its argument schema. There is no other channel. (More on this on Thursday. Anthropic's recent guidance on writing tools for agents is the cleanest summary of this craft I have read.)
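To make "self-describing" concrete, compare two descriptions of what is nominally the same capability. Both are invented for illustration; the model sees nothing but this text, so only the second one gives it something to work with.

```python
# Vague: the model has no way to know what this does or when to reach for it.
vague = {
    "name": "handle",
    "description": "Handles files.",
    "parameters": {"type": "object", "properties": {"x": {"type": "string"}}},
}

# Sharp and self-describing: what it does, what comes back, and when to use it.
sharp = {
    "name": "search_codebase",
    "description": (
        "Search the current repository for a literal string. "
        "Returns up to 20 matches, each formatted as 'path:line: matching text'. "
        "Use this before editing any file you have not already read."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Literal string to search for."}
        },
        "required": ["query"],
    },
}
```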
METR step: a model with the right toolkit moves from tens of minutes to hours of bounded work.
Skills
Tools fix half of ReAct's bottleneck. They expand the action space.
The other half, recall, is the input-length limit. Every tool you add costs tokens in the system prompt to describe: name, schema, when to use it, what its return value looks like. Add fifty tools that way and the system prompt is a small book. The model is reading every single tool description on every single turn, even when ninety-five turns out of a hundred have nothing to do with that tool.
Skills are the move that fixes this.
Anthropic shipped the idea in late 2025 and the rest of the field has been catching up since. A skill is, mechanically, almost embarrassingly simple. It is a markdown file. It has a name, a one-line description of when it applies, and a body that explains how to do the thing. The agent does not read it on startup. The agent reads it on demand: when, in the middle of a task, it notices a description that matches what it is about to do.
So instead of jamming "and here are seventeen other things you might want to do" into the system prompt, you put each of those things in its own file with a one-liner that names when to consult it. The system prompt stays small. The latent capability of the agent becomes, for practical purposes, unbounded. Every skill you write is one more thing it can do, but only when it actually needs to.
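Mechanically, a sketch of that split looks like the following. The frontmatter fields and file format here are illustrative rather than any particular harness's spec; what matters is which part costs tokens on every turn and which part is read on demand.

```python
SKILL_FILE = """\
---
name: refund-disputes
description: How to handle a customer refund dispute end to end.
---
1. Pull the order and the payment record before replying.
2. If the charge is under $50 and the account is in good standing, approve it.
3. Otherwise draft a reply and escalate to a human with a one-paragraph summary.
"""

def split_skill(text: str) -> tuple[dict, str]:
    """Separate the frontmatter (always in the prompt) from the body (loaded on demand)."""
    _, frontmatter, body = text.split("---", 2)
    meta = dict(line.split(": ", 1) for line in frontmatter.strip().splitlines())
    return meta, body.strip()

meta, body = split_skill(SKILL_FILE)
# Only meta["name"] and meta["description"] are carried on every turn.
# The body enters the context only when the agent decides this skill applies.
```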
I find the deeper shift here more interesting than the engineering. The agent is reading documentation written for it. Not training data ingested months ago and frozen into weights. Documentation. Authored in plain prose. Versioned in git. Like the laminated procedure sheet a mechanic posts above a workbench for a job done once a month. Improvable by the same process that improves any document: someone notices the agent doing the wrong thing, edits the file, the next agent reads the new version and gets it right.
This is self-extension by reading, not by retraining. A new capability used to require a new training run, or at minimum a new fine-tune. Now it requires a markdown file. The cost of teaching an agent to do one more thing has fallen from days of GPU time to the minutes it takes to write a paragraph, and almost nobody outside the people building agentic systems has noticed.
The system prompt stays small. The set of things the agent can do, on demand, grows without bound. The two used to be the same number.
METR step: skills, more than anything else in this list, are what made the time horizon stop being bounded by how cleverly you wrote the system prompt.
MCP
For most of 2024, every agentic harness invented its own way to attach the same set of capabilities. You wrote a tool for Claude Code; it would not work in Codex. You wrote a skill for one harness; another harness could not see it. You hooked your billing API into one agent and had to do the same wiring four more times for the others. Every integration was bespoke. Nothing composed.
The Model Context Protocol (MCP) is the fieldâs answer to that. Anthropic shipped the spec in late 2024. By the end of 2025 every serious agent harness, including the ones not built by Anthropic, had adopted some version of it. Codex talks MCP. So does Claude Desktop, and Cursor, and a long list of others. This is one of those quiet moments where an industry just... agrees on a wire format, and a year later the world is different.
The architecture is three nouns. Hosts are the applications you actually use: Claude Desktop, Codex, Cursor. Clients live inside the host and talk to one server each. Servers are the things that actually expose capability: your codebase, your billing API, the Wikipedia search box from the ReAct paper four years ago.
What a server offers is the second triple: Resources (data the model can read), Prompts (workflow templates the user can invoke), and Tools (functions the model can call). Three nouns, again. The whole protocol is two threes.
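For a feel of the server side, here is a minimal sketch that exposes all three primitives, assuming the `FastMCP` helper from the official MCP Python SDK; treat the decorator details as illustrative and check the SDK documentation for the current interface.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("billing")

@mcp.resource("billing://plans")          # a Resource: data the model can read
def list_plans() -> str:
    return "starter: $10/mo\npro: $40/mo"

@mcp.prompt()                             # a Prompt: a workflow template the user invokes
def refund_review(order_id: str) -> str:
    return f"Review order {order_id} and decide whether a refund is warranted."

@mcp.tool()                               # a Tool: a function the model can call
def issue_refund(order_id: str, amount: float) -> str:
    # Stand-in for the real billing API call.
    return f"Refund of ${amount} queued for order {order_id}."

if __name__ == "__main__":
    mcp.run()  # any MCP-speaking host (Claude Desktop, Codex, Cursor) can now attach
```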
The point is portability. A skill or tool you wrote once, against the protocol instead of against a specific harness, works everywhere. The lock-in moves out from under you. The agent ecosystem starts to compose the way the web did in the late 1990s. Not because someone planned it, but because everyone independently noticed it was cheaper to talk a shared protocol than to keep reinventing the connector layer.
Worth noticing what the spec foregrounds at the top of every chapter on tool calls: user consent. Capability requires permission. The protocol does not assume the model can do whatever a server exposes. It assumes the model has to ask, and the user has to answer. A small design choice with very large downstream consequences, and the reason the rest of this stack does not collapse into something nobody would let near their email.
METR step: not a step on the ladder, but a multiplier. The tools and skills from the last two sections now travel.
Context engineering
Add tools. Add skills. Add MCP. The agent can now do, in principle, almost anything you can describe in a prompt and a function. The trouble is what happens when it actually starts trying.
A long agent run accumulates context. Every observation from a tool call goes in. Every thought goes in. Every error message, every retry, every half-attempted plan that did not work goes in. After a few hours of work the context window is mostly exhaust: the trail of everything the agent tried, the great majority of which is no longer relevant to the next move. The model is searching for signal inside its own attic.
Karpathy named this context engineering in 2025, and the name stuck because the field had been doing it without a name for two years. Simon Willison wrote it up. LangChain made it a category. By 2026 it is a craft of its own: what to put in the context, when to summarize, what to evict, what to keep verbatim because the agent will need its exact wording later.
The central primitive in the discipline is compaction. At some threshold, typically 70% to 85% of the window, the agent stops, reads its own history, and rewrites it into a smaller form. Here is what we were trying to do. Here are the decisions we made. Here is the state we are in. Here is the next move. The compacted summary replaces the noisy trail. The agent keeps going on a fresh, smaller context with the salient bits intact.
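In code, the control flow is small enough to fit in one function. The threshold and the summarization prompt below are illustrative; `count_tokens` and `summarize` stand in for the tokenizer and a model call.

```python
COMPACT_THRESHOLD = 0.80  # illustrative; harnesses typically trigger in the 70-85% range

def maybe_compact(context: str, window_tokens: int, count_tokens, summarize) -> str:
    """If the trajectory is close to filling the window, rewrite it into a smaller form."""
    if count_tokens(context) < COMPACT_THRESHOLD * window_tokens:
        return context  # plenty of room left: keep the full trail
    return summarize(
        "Rewrite the trajectory below into four parts: what we are trying to do, "
        "the decisions made so far, the current state, and the next move. "
        "Keep file paths, identifiers, and error messages verbatim.\n\n" + context
    )
```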
The deeper move is that the agent now owns its own working memory in a way it never did inside a single ReAct loop. ReAct kept the entire history. Compaction lets the agent curate the history. A small change of grammar with a giant change of consequence.
Notice what this fixes. ReAct's authors, in the same 2022 paper, named the dominant failure mode of their own system: "the model repetitively generates the previous thoughts and actions, often failing to reason about what the proper next action to take should be and jump out of the loop." Translation: the agent gets stuck because its context is full of the same noise as the previous turn, so the next turn is the same noise plus a little more. That is a context problem. Context engineering is what stops it.
Without this layer, every previous layer eventually drowns. A hundred tools is useless if the agent's context is so saturated it cannot find the right one. The five-thousand-word skill on how to handle a billing dispute is useless if the agent compacted it away on turn forty. Context engineering is the layer that makes the others compound over a long run instead of degrading into noise.
METR step: this is the layer that turns a few hours of focused agent work into a workday.
The hierarchy of agency
Stack the layers and the picture comes into focus. At 50% reliability on the METR time-horizon scale, a language model alone, with no scaffolding around it, sustains minutes of human-equivalent work. Wrap it in a ReAct loop with no tools, and that becomes tens of minutes. Add tools to ReAct, hours. Add skills and context engineering on top, a workday. Add an external loop above all of that, a fresh agent per turn on a clock with a journal handing state to itself, and the horizon stretches into days and weeks.
Stare at that ladder for a second. Each rung is the same model. What separates a chatbot from a coding agent that finishes a feature overnight is the scaffolding stacked around it. The frontier of what an agent can do in 2026 is set, almost entirely, by where you stop climbing.
Each layer has the same shape, in the abstract. Find the thing that bottlenecks the previous layer. Add a structure that lets the model offload that thing into the world, the way a machinist offloads a measurement into a caliper rather than holding it in memory. Into language, into tools, into files, into a clock. The modelâs per-turn intelligence does not change. What changes is the time horizon over which that intelligence compounds.
The last rung is the one most people have not seen yet, and it is the one I have spent the last few months running on my own infrastructure. The trick is the same one. Take the bottleneck (the agent runs out of context before it runs out of work) and offload it. The new offload target is the file system. The new clock is cron. Past-Claude writes a markdown file at the end of its turn that says what it did and what comes next. A timer fires some hours later. Future-Claude wakes into a fresh context, reads the file, makes the next move, writes the file, exits. The continuity is in the file, not in the model.
That is the entire primitive. A markdown file and a timer. Past self tells future self what to do.
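A sketch of one tick of that outer loop, meant to be fired by a timer. The file name and prompt wording are mine, and `run_agent` is a placeholder for one full agent run in a fresh context; the only load-bearing idea is that the journal file, not the model, carries the state.

```python
# Fired by a scheduler, e.g. a cron entry like: 0 */6 * * * python tick.py
from pathlib import Path

JOURNAL = Path("journal.md")

def tick(run_agent) -> None:
    """One turn of the outer loop. run_agent(prompt) -> str is one full agent run."""
    past = JOURNAL.read_text() if JOURNAL.exists() else "No journal yet. Start from the brief."
    updated = run_agent(
        "You are waking up in a fresh context with no memory of previous turns.\n"
        "Here is the note your past self left you:\n\n" + past + "\n\n"
        "Do the next piece of work, then rewrite this note for your future self: "
        "what you did, what state things are in, and what to do next."
    )
    JOURNAL.write_text(updated)  # the continuity lives in this file, not in the model
```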
What you get from it is hard to describe to someone who has not run one. The agent works on your stuff for weeks at a time. It writes new jobs for itself. It reads the documentation about its own substrate and uses the tools that documentation describes. It makes mistakes (one in five runs produces something I have to throw out) but the mistakes are caught by the same kind of boring engineering that catches mistakes in any other autonomous system. Audit log, lock registry, archive-only deletion, every state change committed to git before the next turn starts.
The point of saying this out loud is that the same trick keeps working. Extend the action space; add a layer that compounds; let the previous layer drop the things it could not hold. The trick does not stop at hours. It does not stop at days. METR's curve has been doubling every four months over the last two years. The 2027 projection is a working day. The 2028 projection is a working week.
Each doubling is one more scaffolding layer.
The frontier is not the model
Step back from all of it.
The architecture you've just walked through is layered. A language model at the core. ReAct around the model, turning tokens into actions. Tools around ReAct, expanding what those actions can be. Skills letting the agent pull capability from the file system instead of carrying it in the system prompt. MCP making everything portable. Context engineering keeping the whole thing from drowning in its own exhaust. An external loop on top of all that, when the work runs longer than a single context window can hold.
Every agent doing real work in 2026 (your coding agent, your research agent, the customer-ops bot answering your refund request, my private-tick agent running once an hour) has this shape. They differ in which tools they ship and which skills they read on demand. They do not differ in the shape of the stack. Once you can see the layers, you can see them everywhere.
So here is the closing claim, the techno-pragmatist version of what the article has been arguing the whole time. The frontier is not the model. It is the layers around it. And the entire stack is the fieldâs three-year answer to a single sentence in a single paper from October 2022 that named its own ceiling and dared the rest of us to climb past it.
One frontier worth flagging before I close. A competent agent can already write its own tools and skills on demand. That part is shipping today. The next move is teaching it, via tools and skills, to detect by itself when its current toolkit doesn't cover what it's trying to do, so it knows when to extend itself without being told. Self-extension that triggers itself. That is the live edge right now, and where the next few posts are headed.
The next post zooms in on the innermost layer the agent touches: the tools themselves, and what makes a tool safe enough to live inside a stack like this. That is a story for another Thursday.
Until next time, stay curious.
If this is the worldview you want to take more seriously, the second edition of Mostly Harmless AI (due May 25th) goes deep on the agentic stack we walked through here. Full chapters on context engineering and on the harness around the model, with the math, the case studies, and the parts that didn't fit a blog post. You can also read the whole book online for free in a custom reader I built that I'm rather proud of: dark mode, font controls, progress tracking, offline support, the works.
If you want the whole catalog of everything I've written, plus everything I'm going to write, that's the Compendium. One purchase, in perpetuity.


