Why AI Agents Need Structure

A 5-step framework for solving any problem with the help of AI

May 27, 2026

*Actual footage of an AI handing off a research spec to another AI—it’ll make sense in the end.* (Crazy the kinds things you can see in the wild, huh?)

Every post this month is on the theme of building AI agents that actually work — anchored on the second edition of Mostly Harmless AI, 50% off during early access, where the five-phase structure this post describes is a full chapter with more failure cases, the artifact design patterns, and the context isolation mechanics in detail. You can also read the whole book online for free. More at the end.

Last night you give your AI agent a clear task. It worked hard for two hours. YOu woke up to a report that is technically correct and completely useless.

I’ve had this experience enough times that I stopped blaming the model years ago. The failure is always structural, and I mean that in a very specific, diagnosable sense.

AI labs fine-tune language models to produce output, not to question the frame. The whole reward cycle behind how these models are trained pushes them toward helpful completions; nobody in that loop is rewarding a model for pausing mid-task and asking “have you considered that you might be framing this wrong?”

Execution is the default. That setting works beautifully for atomic tasks: create a note, write a short email, implement a specific function, compare two clearly-defined options. It fails quietly, and expensively, when the real problem isn’t the task itself but the goal behind it. Here’s three examples:

First: you ask an agent to “implement user authentication” for your web application. Clear task. The agent gets to work, producing a technically sound implementation using JWT tokens, bcrypt password hashing, and session management.

Second: you ask an agent to “write a technical report on renewable energy storage.” Again, clear task. The agent downloads papers, synthesizes findings, produces a well-structured document.

Third: a family asks an agent to “plan our move to a new city.” Thorough research follows — neighborhoods, school districts, moving companies, cost estimates. In every case, the output is technically correct.

In every case, the person who asked ends up with something that doesn’t quite fit. The gap between what was asked and what was needed — that gap is what this article is about.

The structure is the problem, not the model.

Not the model’s intelligence, not the quality of your prompt, not the temperature setting. The structure of the workflow you handed the agent is the thing that determines whether it answers the right question or a reasonable-sounding wrong one. In this article, I’ll show you what the right structure is. But to understand why it must be so, we need to see how we got there.

The first fix the industry reached for was planning. It helped. It wasn’t enough. Strap on for a story.

Plan First, Build Second

The structure is the problem, not the model. Right? So the first fix the industry reached for was adding a planning mode.

If you’ve used any of the major agentic coding tools over the past couple of years, you’ve seen this pattern. There’s a plan mode: read-only, no side effects, meant for thinking through the task. Then a build mode: where the agent executes.

The intuition behind it is sound — separating thinking from doing is genuinely better than collapsing them into a single continuous stream. When you mix planning and execution in the same pass, the agent makes irreversible changes based on its first interpretation of the goal.

A planning mode forces a pause. You get to inspect the plan, argue with it, revise it before anything is written to disk. The plan is the first concrete artifact in the handoff chain. You can hold it in your hands — or at least on your screen — and have a real conversation about it.

Genuine progress. I don’t want to minimize it. But watch what happens with the family move.

The agent produces a thorough plan: compare three target neighborhoods, compare school districts in each, contact five moving companies for quotes, estimate total relocation costs, build a timeline. Excellent plan. The family reviews it, nods along, says “looks good,” and the agent executes.

Six months later: they’re living in a city that made complete sense on paper — except the job that drove the whole move turned out to be fully remote. They never needed to relocate at all, or they could have moved somewhere half the price with better schools.

The plan was correct. The execution was correct. The goal was wrong. And here’s the point: the agent that built the plan never questioned whether “plan a move to City X” was the right goal, because it was never asked to. It took the first reasonable interpretation of the prompt — “they want to move, let me help them move well” — and planned confidently inside that frame.

The software auth example is shorter but sharper. “Implement user authentication”: the agent plans for JWT tokens, bcrypt hashing, session storage. Fine plan. Sound, even. For a single-tenant web app. This is a multi-tenant SaaS product (one where dozens of customers share the same running application.)

The plan was never going to catch that, because nobody told the planning agent that multi-tenancy was a constraint. It did exactly what it was asked.

Now, the objection you might be forming: “just review the damn plan carefully before you approve it.” That objection is right about something, review does matter, bit it misses the fact that the agent that produced the plan has already committed to a frame.

Every question it asks, every tradeoff it surfaces, every option it presents — all of it is already shaped by its first interpretation of the goal. When you review that plan, you’re not reviewing a neutral set of options. You’re reviewing a plan that already selected its own success criteria. The frame is invisible because it was never made explicit.

Planning without exploration locks in the first reasonable goal.

Research Before the Plan

Once more, the structure is the problem, not the model. The next obvious fix is to put a research phase in front of planning.

By the time we hit mid-2025, a third mode was appearing in serious agentic setups. A research phase: read-only, job is to understand the problem space, not to produce a solution. The artifact it creates is a description, not a prescription — a document that maps what is known before anyone decides what to do about it.

The intuition is right again: if the planning agent doesn’t know what it doesn’t know, it can’t plan well. Research is how it finds out. For the renewable energy report, thorough research might surface the fact that the intended audience is policymakers, not engineers — which changes the vocabulary, the technical depth, and the document’s opening frame.

Real progress, again. And again, not enough.

Watch the renewable energy report closely. The agent runs a solid research phase: downloads twenty recent papers, reads industry reports, synthesizes the state of the art on battery storage, hydrogen carriers, pumped hydro, and grid-scale thermal systems. Then it transitions into planning mode — in the same context window. And here is the problem: the same context window.

The planning agent isn’t a fresh mind looking at a research report. It’s the same agent, carrying everything it concluded during research, now deciding how to structure the work. If the research agent concluded “battery storage is the central challenge in the energy transition,” the planning agent will structure the report around battery storage. Not because it made a new decision — because it never had the opportunity to question the prior decision. It just kept going. The research was excellent. The plan followed naturally from the research. The report answered the question the research agent found most interesting. Not necessarily the question the reader needed answered.

The steelman still stands, and research sharpens it. The steelman says: “planning alone is enough, just review the plan.” Research proves it isn’t — but not because planning is the wrong approach. Research proves it because research without context isolation just moves the lock-in one step earlier. The same agent that researched now plans. It cannot escape its own prior conclusions, not because it’s incapable of abstract reasoning, but because those conclusions are literally sitting in its context, shaping every next token it generates.

A fresh planning agent, handed only the research artifact as a clean document, can genuinely question whether the research answered the right question. It can push back. It can say “your research focused heavily on battery chemistry, but I’m not sure that’s what the audience needs.” The same agent that did the research cannot do that. Not really.

But here is the real kicker: research gives you facts; it doesn’t tell you if you’re solving the right problem. Save that thought for a minute, we need one more step.

Review After the Plan

The structure is still the problem, not the model. And to fix it, we added a new, and in hindsight, pretty obvious step. A review phase after implementation.

This is a dedicated pass where a separate agent — or the same agent in a separate context — evaluates the produced artifact against known criteria. Not just “does this code run,” but “does this code do what we intended.” The distinction from implementation is real and it matters. An implementation agent is building; a review agent is hunting for the thing that will break it.

What review actually solves is real. The software auth implementation, evaluated by a review agent, surfaces real questions: Is the JWT expiry window set appropriately for the threat model? Is the bcrypt cost factor tuned for this hardware? Are session tokens actually invalidated on logout, not just expired? These are genuine bugs a fresh pass can find. I’ve seen review agents catch the kind of subtle mistake that a second human reader catches — not because they’re smarter, but because they’re looking for problems rather than building a solution.

But watch what happens when the review agent evaluates the software auth implementation against the plan.

The plan said: “implement JWT-based authentication for the web application.” The review agent confirms: JWT is implemented correctly. Bcrypt is used. Session management is in place. The implementation passes review. It ships.

First enterprise customer tries to log in: there is no tenant isolation. Every user in the system shares a single authentication namespace. The review agent found no bugs. The implementation had no bugs. The plan specified the wrong thing. And the review agent couldn’t catch it — not because it was careless, but because review only checks “did we implement the plan correctly.” Not “was the plan the right plan.” Those are different jobs. And you can’t review your way out of the wrong spec.

This isn’t a failure of intelligence. It’s a consequence of what the review agent is handed. It receives the implementation and the plan. It has no clean access to the original problem statement — what the user actually needed, what constraints were implicit in the product, whether the product served one customer or a hundred who shared a namespace. It’s evaluating the gap between the artifact and the plan, not between the artifact and the goal.

Human code reviewers fail the same way, by the way. Code review finds style violations, off-by-one errors, missing null checks. It rarely questions the architecture decision that was made three sprints ago and embedded in every layer of the codebase. That kind of question requires a different context — a different meeting, a design review, a fresh set of eyes on the spec rather than the implementation.

Review catches errors, but only inside the frame you committed to in the planning phase. You can’t review your way out of the wrong spec.

Name the Problem Before Solving It

Finally, the structure IS the problem, not the model.

Research doesn’t ask it. Planning doesn’t ask it. Implementation doesn’t ask it. Review doesn’t ask it. The question is: what problem are we solving, and how will we know when we’ve solved it?

That’s not a rhetorical question. It has specific, concrete answers. And those answers should be a document — a specification — produced before the plan is written. Not during planning. Not as a side effect of research. Its own phase. Its own artifact.

A specification answers four questions, and it only takes one page to do it. What is the exact output we are trying to produce? What are the hard constraints it must satisfy? What does success look like — specifically, what would we check to confirm it? What does failure look like — what would make us say this didn’t work?

These sound obvious. They are almost never answered before a planning phase begins. In my experience, the reason is that they feel like they slow you down. They don’t. They prevent six months of work in the wrong direction. Think of it as a single page you could tape to the wall — the kind you’d point at during a disagreement about whether the output succeeded or failed.

Go back to the family move. After a thorough research phase — neighborhood data, school ratings, crime statistics, cost of living comparisons — a specification phase asks: what does a successful move look like for your family? The family sits with that question. It turns out they’ve never explicitly answered it.

The husband’s answer: proximity to his aging parents, who live in a specific region of the country. The wife’s answer: their daughter getting into a specific school district that has strong arts programs. The budget answer: keeping total housing costs under a threshold that lets them maintain their current savings rate. Three explicit success criteria.

The research phase found no conflicts because it was never told what it was optimizing for. The specification phase surfaces all three criteria before the plan commits to a single city. The planning phase can now do something useful: find cities that satisfy all three criteria — or, just as importantly, discover that no city satisfies all three and surface that conflict before anyone books a moving truck.

The software auth case is fast. Specification asks: what does correctly-implemented authentication look like for this product, given who its customers are? The answer: it must support multi-tenant isolation with strict data separation, SSO for enterprise customers, and a free tier with email-only login.

Now the plan can be written for the actual product. The research phase’s work on JWT and OAuth is still valid; it just needs to be read through the lens of multi-tenancy, which the specification made explicit.

The full chain, with its five concrete artifacts, looks like this.

Research produces a collection of source materials plus a descriptive state-of-the-art report — what is known about this problem space. Specification produces a success-and-failure criteria document. Planning produces a concrete step-by-step plan — how to get from here to there, given the spec. Implementation produces an evaluable artifact — code, document, report, recommendation. Review produces an evaluation report checked against the specification, not just the plan — a real answer to “did we solve the problem?”

Five documents. Five handoffs. Five chances to catch the wrong frame before it becomes expensive.

We Knew it all Along

The fun part is all of this was known, at least in principle, since before 1987.

IDEO, a by-now ultra famous design consultancy, articulated a five-phase creative process that Stanford’s d.school later codified as Design Thinking.

Tim Brown formalized the diverge-converge logic in Change by Design. The five phases: Empathize, Define, Ideate, Prototype, Test. If those names sound familiar given what you’ve just read, that’s not a coincidence.

Empathize is research: go wide, gather context, talk to users, understand the problem space from the outside rather than the inside. Define is specification: converge on an explicit problem statement with clear success criteria, the “how might we” question that frames everything downstream. Ideate is planning: diverge again, generate candidate solutions, explore the space of possible approaches. Prototype is implementation: produce an evaluable artifact, something you can put in someone’s hands. Test is review: evaluate the prototype against the problem statement, not just against the prototype’s internal logic.

Every phase the agentic world has been bolting on since 2023 was already named, sequenced, and justified in a framework that predates the modern web by a decade and a half. It’s a framework every Silicon Valley startup, incubater, accelerator, and VS knows in and out. It is taught in bussiness majors all over the world. It’s literally the structure of most pitch decks. But it is still missing in most agentic protocols we use every single day.

The piece most clearly missing is the Define phase — the second one, which IDEO put second for a reason: without a clear problem statement, everything downstream answers the wrong question. It’s a very old insight the field keeps rediscovering from scratch — Agile’s Definition of Done, test-driven development’s failing-test-first, specification by example. Each was the same insight under a different name.

Now, here is the strongest version of your initial objection. “You don’t need all this structure. A great prompt specifies the audience, the format, the success criteria, the constraints. Write a better prompt and you get all five phases in one go.”

Let’s take this seriously, because it’s not wrong about prompt quality. A genuinely well-crafted prompt that specifies who the output is for, what format it should take, what it must accomplish, and what would make it fail — that prompt is effectively a specification. You’re right that prompt quality matters.

But a single prompt containing research context, goal specification, a plan, and execution instructions is not a cleaner version of the five-phase process. It’s five phases collapsed into one context window, with no mechanism for each phase to question the prior one’s conclusions.

When research and planning share a context, planning can’t interrogate research. When planning and implementation share a context, implementation can’t push back on the plan. When specification and review share a context, review is already biased toward confirming the specification it helped write. Prompt quality is about what you ask.

Phase independence is about who processes each answer, and whether they can genuinely disagree with the prior step. You can write the world’s best prompt and still hand it to an agent that will execute it inside the same narrowing tunnel, compounding the same assumptions with every step.

Picture someone reading their own manuscript for the fifth time — they no longer see what’s there, only what they meant to write. It is one of the most replicated findings in human psychology: once we form a belief, we interpret subsequent evidence through the lens of that belief. We notice confirming evidence, discount disconfirming evidence, and generate hypotheses that assume the belief is correct.

This is not a weakness of intelligence. Every mind — human or artificial — interprets through the conclusions it has already drawn. The agent that researched your problem already believes things about it. When it transitions to planning, it plans in service of those beliefs. The agent that planned has a solution in mind. When it implements, it makes countless micro-decisions that serve that solution. The agent that implemented defended choices as it worked. When it reviews, it reads its own output charitably.

Context isolation breaks this chain. A fresh context hasn’t seen the prior steps. It cannot be fooled by conclusions it never drew. It reads the artifact cold, which is the only way to genuinely evaluate it.

Design Thinking’s diverge-converge logic is not about what each phase does. It’s about who does it, and whether they can arrive at it without inheriting the prior phase’s commitments.

Start Doing This Yourself Today

The artisanal version of this is simpler than it sounds:

Treat each phase as a distinct conversation. Start a fresh session for each one.
Hand it only the artifact from the prior phase — not the prior conversation, not your running context, not a summary of what you’ve been thinking about. The artifact alone.
And tell the agent explicitly what mode it’s in. “You are in Research mode. Do not propose a plan. Do not suggest solutions. Your only job is to describe the problem space and produce a research report.”

That instruction matters. Not because the model needs to be controlled, but because explicit mode assignment prevents the agent from sliding into execution behavior when it senses a gap to fill. Models are trained to be helpful; helpfulness steers every gap toward a solution. Naming the mode is how you resist it.

The discipline lives in you, not the tool. This works with any agent, any interface. Fresh context, explicit mode, artifact handoff. That’s the whole recipe.

If you want to go further, you can make phases structurally enforced rather than just instructed — agents that literally cannot execute, subagents that receive only the artifact, automated handoffs with no shared context. Programmable harnesses give you this level of control with permission levels per skill. (If you don’t have one, call me, I’ll lend you one for free.)

One small step you can take today: add a specification phase to whatever workflow you already use. Before your planning phase writes a plan, ask for a success-criteria document first. One page. Explicit pass/fail conditions. What would make this output a success? What would make you throw it out? Review that document before planning begins.

This single addition — inserting a define phase between research and planning — catches more failures than adding a review phase after the fact. Because it catches them before the plan commits to the wrong goal.

What not to do: don’t implement all five phases as a mechanical checklist. And don’t add phases as ornamentation — a research phase that shares a context with planning adds conversation turns, not structure. More words in the same window is not more phases. The phases are context boundaries, not steps in a recipe. A phase that doesn’t produce a concrete artifact and doesn’t hand it to a fresh context adds nothing.

One document per phase. Fresh context per phase. That’s it.

Structure Before You Re-Prompt

Picture the artifact chain as a physical thing. A manila folder passed from one desk to the next. The research desk produces a report, closes it, slides it across. The specification desk opens only that folder, reads it, produces a criteria document, closes it, slides it across. The planning desk never opens the research folder — it opens only the criteria document.

And so on down the line. Each desk sees exactly one prior document. Each desk produces exactly one new document. The chain is what makes independence possible. You cannot hand off a vague intention. You cannot slide a feeling across a desk. Only a document.

Context isolation is the move most pipelines skip, and it’s the one that does the most work. Every phase that shares a context with a prior phase inherits its commitments. Not because the model is lazy or wrong — because that’s how cognition works, human or otherwise. We interpret through the lens of what we already concluded.

Context isolation is cheap: start a new session, pass only the artifact. The cognitive science is unambiguous: breaking the confirmation-bias chain requires a structural break, not a better instruction. Context isolation gets skipped because it looks optional. It isn’t.

Remember, the structure is the problem, not the model. Restructure before you re-prompt.

Until next time, stay curious.

This is the core argument of the agentic workflows chapter in the second edition of Mostly Harmless AI — the full chapter walks the failure cascade with more cases, the artifact design problem (a research report can be thorough and still hand the wrong thing forward), and the context isolation mechanics in depth.

This specific article is new content, still not in the book, but it will land there shortly. The book is 50% off while it’s in early access, and also free to read online in a custom reader I built: dark mode, font controls, progress tracking, offline support, the works.

If you want the architecture behind these systems — how they fail, what the harness around them should look like, and what to actually do about it — that is what the book is for.

Get Mostly Harmless AI - 50% off

David Sutherland

Jun 12

Great post Alejandro. I'm going to give it a go.

cqm

May 27

I'm on the fences on Context Isolation as the absolute rule/cure... don't take me wrong, sounds pretty clean, and I see the value but in practice it can either reduce bias/noise or destroy helpful continuity depending on where you draw the boundary.... there is some tension for fast-pacing prototyping tasks for instance... regardless, good read, thanks!

1 reply by Alejandro Piad Morffis

4 more comments...

Discussion about this post

Ready for more?