The Computist Journal

Pseudocode is a lie

Alejandro Piad Morffis — Thu, 04 Jun 2026 11:01:18 GMT

Monday’s post was about a single algorithm — Union-Find, α(n) ≈ constant. This is the technical companion: what it took to publish that algorithm. Both pieces are part of the June sprint for The Algorithm Codex, which you can also get in print at Gumroad or as part of the Compendium. More at the end.

Photo by Ellephant on Unsplash

Pseudocode is informal enough to be ambiguous and formal enough to be annoying. Every advantage it claims is better served by something else.

Subscribe now

Three claims for pseudocode, all false

The canonical defense of pseudocode is Introduction to Algorithms (Cormen, Leiserson, Rivest, and Stein), known as CLRS. Thirty years. Four generations of computer science students. The single most widely-sold technical book in the field. If you’ve studied algorithms, you learned from CLRS or from someone who did. And CLRS uses pseudocode throughout, defends it explicitly, and makes the case as well as it can be made.

The argument for pseudocode rests on three things. First, it hides details that don’t matter for the explanation — variable types, secondary operations, boilerplate. Second, it speaks at the same conceptual level as the narrative; you can invent operations, give them names, use notation that fits the idea you’re explaining. Third, it survives time: programming languages rise and fall, but pseudocode is language-agnostic, and a textbook meant to teach the same concepts to students who will write Python, Java, Rust, or something that doesn’t exist yet can’t be tied to any of them.

Problem 1: Pseudocode is not intuitive. If you want to convey an idea intuitively — how a process unfolds, why an algorithm works, what the conceptual structure is — use a diagram, a narrative, an informal sketch. These are genuinely intuitive; they don’t impose formal constraints. Pseudocode has syntax. It requires for, while, if, formatting conventions, notation that the reader has to parse before they can extract the idea. The claim that pseudocode is easier to read than real code is false. It’s different from real code, but not simpler. For genuinely intuitive explanation, there are better tools.

Problem 2: Pseudocode is not formal. Real code has formal semantics. There is a defined mapping from source to behavior; a compiler or interpreter enforces it. Pseudocode has no such mapping. The reader is left to translate it into whatever language they’ll actually use, and that translation is entirely implicit. It’s undocumented and un-enforced. The gap between “swap the elements at positions i and j” and “no, you can’t, you forgot you were working with an immutable list” is not a minor translation detail. That gap is where the data structure’s mutability, the cost model, the choice of representation actually live. If your pseudocode rules were formal enough to specify the translation unambiguously, you’d have a compiler. It would just be code.

Problem 3: Pseudocode is not portable. The language-agnostic argument sounds compelling, but it proves too much. Pseudocode isn’t language-agnostic in any useful sense. It’s an additional language the reader has to learn. If you give students pseudocode, you’re giving none of them something they can run. You’re asking every reader to translate. Whereas if you pick one real language (a widely-used, readable one), many of your readers understand it directly, and the rest have a formal, precise text to translate from. Pseudocode doesn’t spare readers translation; it makes all of them translate.

My prior is strong: intuition first, formalization later. An algorithm, a formula, a mathematical concept: explain it intuitively first, then formally. Pseudocode tries to do both and delivers neither. It isn’t informal enough to convey ideas; it isn’t formal enough to prove properties. Pseudocode charges you for formal syntax and gives you no formal guarantee.

Two registers: diagrams for intuition, real code for formality

Use two kinds of explanation, not one.

For intuitive content, use diagrams and narration. Diagrams don’t have to be formal — informal sketches work fine: an arrow from one box to another, a tree branching as a recursion unfolds, a before/after pair showing what partition does to an array. They explain intuitively, expose the underlying reasoning, and don’t impose the cost of formal syntax.

For formal content, use real code. Every supposed advantage of pseudocode — the ability to invent operations, hide irrelevant detail, speak at a high conceptual level — exists in real programming languages too. Real languages have abstraction and encapsulation. To defer an operation, stub it: a named function with an empty body or a one-line comment. graph.connect(u, v), database.search(query), parser.parse(source). Real code already has the vocabulary for subjects, predicates, and attributes on those subjects. You don’t need bespoke notation. You need a sufficiently expressive language.

That language is Python. Of the widely-used general-purpose languages, Python is the most readable. There are more ergonomic languages (certain Lisps, Haskell, Racket), but their readability advantage comes at the cost of familiarity, and a textbook audience that doesn’t already know those languages won’t benefit from them. Python is known, readable, and expressive. There is no good reason not to use Python in a programming book, even a book that isn’t about Python.

Literate programming keeps prose and code from drifting

If we’re using real code anyway, formal and executable, why not make that code a real deliverable? A library with unit tests and an import statement, not just a code block in a document.

The obstacle: docs and code drift apart when they live in separate files. Change the explanation, and you have to remember to update the code. Change the code, and you have to remember to update the prose. In practice, one always lags the other. Pseudocode dodges this: it never has to match anything, so it’s never wrong. Real code is always either in sync or not.

Literate programming is Donald Knuth’s answer, from his 1984 paper. The original system (WEB, later CWEB) was built around Pascal and C. You wrote a single source file interleaving prose and code; a tool called tangle extracted the code into a runnable program, and a tool called weave rendered the prose into a readable document. The writing and the code are the same artifact. You can’t let them drift because they’re literally the same file.

Modern tools have rediscovered the idea. Jupyter notebooks mix prose and executable code. Quarto does the same at book scale. But neither fully extracts the code into a standalone packageable artifact: the notebook is the program, and you can’t easily ship it as a library.

`illiterate` makes the book and the package the same source

illiterate closes that gap. A build step extracts the tagged blocks from .qmd files ({export=...}) and assembles them into actual Python source files. The book renders; the package builds from the same source. Three things follow automatically: the documentation can’t describe an algorithm that doesn’t compile; the tests run against the same code the reader sees; and the book ships a working library as a side effect of being written.

In the Algorithm Codex (a hundred algorithms across sorting, graphs, data structures, and dynamic programming), every chapter follows this pattern. No pseudocode anywhere.

Quicksort, with the package built from the page

A Quicksort chapter opens with the public entry point, the function the reader calls and the one that lands in the package:

```python {export=codex/sorting/quick.py}

from typing import Sequence
from codex.types import Ordering, default_order

def quicksort[T](items: Sequence[T], order: Ordering[T] = default_order) -> list[T]:
    items = list(items)  # don't mutate the input
    _quicksort(items, 0, len(items), order)
    return items
```

The recursive helper is two cases: a base case that bails on slices of length zero or one, and a recursive case that partitions and recurses on both halves:

```python {export=codex/sorting/quick.py}

def _quicksort[T](items: list[T], lo: int, hi: int, order: Ordering[T]) -> None:
    if hi - lo < 2:
        return
    pivot_index = partition(items, lo, hi, order)
    _quicksort(items, lo, pivot_index, order)
    _quicksort(items, pivot_index + 1, hi, order)
```

Both blocks export to the same file; illiterate appends them in document order. Next in that order is the partition step, its own block and its own paragraph. Each conceptual piece earns its own prose frame and its own extracted function.

When the book renders, Quarto executes every demo block. If any algorithm is wrong, the build fails before that sentence can appear on the page. The build verifies every algorithm in the Codex, every time.

The book is the primary source. The algorithms are real code that runs. The two have never disagreed.

Until next time, stay curious.

illiterate is open-source at github.com/apiad/illiterate, a small Python tool that extracts code from Markdown and assembles it into runnable source files. If you write a book, a tutorial, or a course where the code has to match the text, the setup described here is twenty minutes to replicate.

The algorithms it enforces are in The Algorithm Codex, available in print on Gumroad, or free to read online at matcom.github.io/codex/. Both the Codex and my other books are bundled at apiad.gumroad.com/l/compendium.

Subscribe now

Start Here: A Map of The Computist Journal

Alejandro Piad Morffis — Mon, 01 Jun 2026 15:24:59 GMT

Generated with mosaico using Nano Banana 3.

Last updated: 2026-06-01.

If you found this newsletter through a link, a recommendation, or by accident, welcome. This post is the map — a guided tour of what I write about, why, and where to start.

Subscribe now

What this is about

For five years I’ve been writing about computers — what they can do, what they can’t, what they’re made of, and what they mean. The newsletter started as a side project for sharing things I thought were beautiful and ended up as the place where I think out loud about everything that matters to me professionally.

I keep coming back to one belief: the public conversation about AI is broken. It oscillates between hype and dread, with very little in between. Neither extreme is grounded in what computers actually are. The path through is to look at the machinery — algorithms, computation theory, formal systems, the long history of what we’ve learned about what can be computed and what cannot. The map underneath the territory.

That’s what this newsletter is for. It’s also why the writing here ranges from “what’s the fastest sorting algorithm” to “is the universe a computer.” Those questions look unrelated. They’re not. They’re the same question at different scales — and the more you know about one, the more sense the others make.

If you read enough of what I write, you’ll find I’m an opinionated technologist with a soft spot for foundational ideas, a deep suspicion of bullshit, and a stubborn belief that understanding the inside makes you better at navigating the outside. If that sounds like you, you’ll like it here.

Five paths through the archive

The newsletter is loosely organized into five sections. Pick the one closest to what brought you here; each ships roughly two posts a month.

🤖 If you want to understand what AI really is — and isn’t

This is the largest section, and the one I think about hardest. The thesis: large language models are extraordinary at some things and structurally limited at others, and most of the confusion in public AI discourse comes from not knowing which is which. Three places to start:

The Insurmountable Problem of Formal Reasoning in LLMs — Or why LLMs still can’t, and probably never will be able to fully reason. If you only read one post on this newsletter, it should be this one.
Why Reliable AI Requires a Paradigm Shift — Hallucinations aren’t a bug we’ll patch in the next release. They’re the price of the architecture we’ve chosen, and the only way out is to change the architecture.
Artificial Neural Networks Are Nothing Like Brains — Busting the biggest public misconception in AI. The metaphor is so wrong it actively blocks understanding.

These three together form a kind of mini-syllabus on the real limits of modern AI. They’re also the spine of my book Mostly Harmless AI — more on that below.

🧠 If you want to see why computation itself is beautiful

I came to AI through computer science, and I think the field has lost something by skipping the foundations. The deepest, most satisfying ideas in computing aren’t about models or tools — they’re about what can be computed and what can’t. This section is my attempt to make those ideas accessible without dumbing them down.

The Most Beautiful Algorithms Ever Designed — A very opinionated short list. Read this and you’ll never look at “an algorithm” the same way again.
The Power of Abstraction — How to think like a Computer Scientist. The one mental move that separates programming from computer science.
The Answer to All Questions — How the quest to answer all mathematical questions became the origin story for a new science. The Hilbert-Gödel-Turing arc, told as drama instead of textbook.

If those land, the section has dozens more, from the P vs NP problem to formal language theory to the foundations of machine learning.

💻 If you want to build something

The Mondays of this newsletter are conceptual. The Thursdays are practical. This section is where I show working code, walk through implementation decisions, and occasionally release something into the world.

Chat with your PDF — A simple streamlit app using LLMs and vector databases to answer questions from a PDF. The minimum viable RAG, demystified.
How I Built the Database of my Dreams — A pure-Python vector database for prototyping. Build AI apps 100× faster, with one dependency.
Realtime 3D in Pure Python + Numpy — A side project that went way too far. The kind of thing you write because you want to know if you can.

This section also doubles as the trailer for whatever I’m currently building — most of the open-source projects on my GitHub started as a Thursday post.

✒️ If you want to argue about the bigger stakes

Sometimes I step out of the technical voice and write an opinionated essay. Most of these are about how AI is reshaping science, education, and the way we work — and what to do about it. None of them have a clean answer; that’s the point.

The Techno-Pragmatist Manifesto — A level-headed response to techno-pessimists and techno-optimists. Where I plant my flag.
Teaching the Mostly Harmless Way - Part I — Principles, practices, and tools I use in all my teaching endeavors. Half autobiography, half pedagogical argument.
The Three Minds of a Computer Scientist — The Scientist, the Engineer, and the Hacker. A useful taxonomy for anyone in this field.

If you came here from a more humanities-flavored corner of the internet, this section is probably where I sound most like the rest of the writers you read.

💡 If you want to chase the big questions

The deepest section, and the one I’m most personally invested in. Questions about minds, machines, science, and reality — through the lens of computation. No clean conclusions; just careful thinking.

Can Machines Think? — What Alan Turing’s seminal paper Computing Machinery and Intelligence actually says, and what it doesn’t.
Is Science the Answer to Life, the Universe, and Everything? — Part I — Exploring the world of scientific theory and practice. Where epistemology meets engineering.
Are Brains and Computers Alike? — The boldest of all theories of mind. A careful walk through the computational theory of mind and its critics.

This section moves slowly — maybe one post every other month — but the conversations in the comments are the best on the whole newsletter.

The books

Across these five paths, certain arguments come back often enough that they grow into books. Right now there are five, in various states:

Mostly Harmless AI — the finished one. A book-length argument about what AI can and can’t do, distilled from the Mostly Harmless AI section. Just shipped v2.0.
The Algorithm Codex — a growing pay-what-you-want collection of essays on the most beautiful algorithms ever designed. Always in motion.
The Science of Computation — early access. The book version of the Science of Computation section: foundations, computability, complexity, the whole stack.
The Hitchhiker’s Guide to Graphs — early access. Graphs as the universal data structure, and what you can do with them once you take them seriously.
How to Train Your Chatbot — early access. The practical companion to Mostly Harmless AI — building real LLM applications without the magic-thinking.

If you want them all in one place, the Computist Compendium bundles everything at a discount.

The code

On top of writing, I maintain a long string of open-source projects: AI tools, web frameworks, Python libraries, and assorted cool toys. Some of them have been quietly running in production at companies I’ve never heard of; others are weekend experiments that ran away from me. They all live on my GitHub.

The pattern is usually the same: a Thursday post explains why I built the thing, the repo has the code, and a month later it shows up in a Monday essay as a worked example. If you’re following the build track, watch both feeds.

What you get if you subscribe

Every article on this newsletter — every essay, every tutorial, every deep dive — is free, and will be free forever. I had the incredible fortune of receiving a world-class education at no cost. Writing publicly is my humble way of paying some of that back.

Starting June 22, 2026 there will also be a paid subscription, for $7/month or $70/year. Paid subscribers will get:

Every book I’ve ever published, free, forever. A 100%-off code refreshed continuously, covering everything in the catalog above and everything I publish from this point onward. At current prices that’s well over $200 worth of material, and it grows every month.
Book drafts as PDFs, often months before public release — half-edited, sometimes rough, always the latest thinking.
A weekly Friday digest with what shipped, what’s coming next week, and an open AMA thread.
Prioritized attention in comments, chat, and Office Hours.

Founding members will lock all of it in for life. If you’re a professional engineer or researcher, you can almost certainly expense the subscription to your employer — I’ll be happy to help you draft the email.

What’s next

Every month, a new chapter of a book or a new open-source release. Every Sunday, a short This Week at The Computist Journal note on what shipped and what’s coming. And every once in a while, an editorial when something in the field is too important to ignore.

This map gets updated as the archive grows. If you bookmark one post on this newsletter, make it this one — it’ll always point you at the current best version of where to start.

The subscribe button is at the bottom of every post on this site, and the one at the top of this one too. If you’ve read this far, you already know whether this is your kind of newsletter. Hit it, and I’ll see you in the next post.

— Alejandro

Subscribe now

The fastest algorithm ever devised

Alejandro Piad Morffis — Mon, 01 Jun 2026 11:30:44 GMT

Every post on June is related to algorithms, anchored on The Algorithm Codex. I’m also running an experiment this month: stretching the pace to one-post-a-day with per-day theme. Monday is for Mind-blowing. Let’s see how that goes.

Photo by Arnaud Mesureur on Unsplash. I promise, it will make sense in the end. It’s a bunch of trees…

When someone tells you they have the “fastest algorithm ever devised”, you probably think they bumped into trick to make clever use of the cache, or something equally tasteless, perhaps Youtube-worthy for 2 hours. I know I’d do if you came to me with this title.

But this is not that post. This is about the actual fastest algorithm we have ever devised, and probably the fastest anyone will ever do. And not in a trivial sense. There is one problem in computer science (practical, appearing everywhere you look, explainable in two sentences) where we have an algorithm, a correctness proof, a complexity bound, and a matching lower bound that closes the problem.

Done. Problem solved. No more papers.

That kind of closure is the rarest thing in any field of human knowledge.

Subscribe now

Four conditions, one winner

But what does “fastest algorithm” mean? There are many ways you can interpret this phrase, and some of them are trivially true.

Reading an entry from an array takes constant time. Nothing beats that. But it’s one operation on a flat structure, no progression, nowhere lower to go. Getting the title should require something harder.

The interesting frontier is a non-trivial problem with a naive solution everyone reaches for, and another, mindblowingly faster solution that is provably optimal. The problem we analyze today has four traits we want:

First, general: appears in real systems across many domains, not a toy. Second, explainable in two sentences: if you need a paragraph to state it, the optimality result arrives too late to matter. Third, a real progression from naive to optimal: intermediate ideas that each earned their place, not a single flash of inspiration. Fourth, provably closed: algorithm and lower bound meet exactly. Not “hard to improve.” Closed.

Our problem is Union-Find, which clears all four. Nothing else with a non-trivial structure clears them as cleanly. Let me explain it to you in two sentences.

You’ve used this all day

Union-Find goes like this. You have n things, organized into groups. You need to merge two groups and ask whether two things are in the same group.

That’s it. No deletion. No ordering. No traversal. Two operations on a shifting partition: union and find.

You’ve used it today, probably without noticing. Your phone’s photo app groups every selfie of the same person into a face album by computing connected components, and connected components is what Union-Find does for a living. The bucket tool in any image editor (Paint, Photoshop, Procreate, whatever) floods a region with color by labeling connected pixels? Yes, Union-Find.

Every time your TypeScript or Rust compiler decides two variables must share a type, Hindley-Milner type inference unifies them through Union-Find. Spanning Tree Protocol, the thing that lets the switches in your office or home network agree on which links to use so packets don’t cycle forever, runs Kruskal’s algorithm; Kruskal needs Union-Find to detect cycles.

The same idea shows up across classical computer science under different names. The canonical image segmentation algorithm (Felzenszwalb and Huttenlocher, 2004) is the bucket tool’s research-grade cousin. Procedurally generated mazes in games are built by running Kruskal on a grid. Percolation theory in physics, the study of how fluids flow through porous materials, is the same partition-growing-by-merging problem in continuous form. Whenever a system has equivalence classes that grow online, Union-Find is almost always the engine underneath.

Union-Find is actually pretty simple

The naive solution is straightforward: give every element a label. To merge two groups, sweep through the array and relabel everything in one group with the other’s identifier. Works perfectly. Linear work per merge. Nothing cheaper to write, and nothing more expensive to run at scale.

Union-Find: n elements, shifting partitions, two operations. Before: three groups. After union(x, y): two. Made with Tesserax.

The first idea that earns its keep is a structural change. Stop labeling every element. Give each element a parent pointer: a single reference to another element. The root of each tree is the element that points to itself, and the root names the group.

To check if two elements share a group: walk each one up its parent pointers to a root. Same root, same group.

To merge two groups: find both roots and point one at the other. One pointer write, regardless of how many elements are involved.

The trade has moved. Merge is cheap. But your find now costs as much as the tree is tall.

Each group is a tree. Arrows point from children up to parents; the root points to itself.

Here is the failure mode. An adversarial union sequence can build a long thin chain: element one points to element two, which points to element three, all the way down. Every find walks the whole chain. Linear again.

The rest of the story is about keeping the trees short, like, really, really short.

Making Union-Find fast, really fast

The first fix is the obvious one once you’ve seen the failure. Track each tree’s height: its rank. When merging, always hang the shorter tree under the root of the taller. The merged tree is no taller than the taller of its inputs.

Union by rank: the shorter tree hangs under the taller root. The merged tree’s height is unchanged.

The counting argument is clean. A tree of rank r must contain at least 2^r elements. You can prove it by induction on the merge rule. So the rank, which bounds the height, can never exceed log₂(n). Both operations now run in O(log n). A million elements: naive worst case is a million steps; rank-bounded is twenty.

One integer per node buys us exponential improvement in speed. Read that again.

Most algorithms textbooks would stop here. The second fix is what changes this from a really fast algorithm to the fastest algorithm ever, period.

Notice something about the walk-to-root operation. When it finishes, it has discovered the root. Every node it visited could be rewired, right now, on the way back, to point straight at the root.

The structure flattens itself, in place, every time you ask it a question. The next walk on that path takes one step. Work you already did is work you never have to do again.

Path compression: find(x) collapses the chain into a star. Every walked node now points straight at the root.

On its own, this trick is good. Combined with the rank rule, the cost drops below logarithmic, into a regime that barely has a name.

¿How fast is this? Strap on, baby.

Union-Find is the fastest algorithm ever devised

Robert Tarjan, in 1975, proved the two ideas together give an amortized cost per operation of the inverse Ackermann function, written α(n).

To understand why that means fast, you need a picture of what you’re inverting. The Ackermann function grows like hell. Faster than exponentials, faster than towers of exponentials, faster than any function built by ordinary recursion. A(4, 4), only four steps in, already exceeds the number of atoms in the observable universe. It’s a freaking monster, one of the fastest growing functions ever invented.

The inverse grows correspondingly sloooowww. For any n less than 2 to the power of 65536 (a number whose decimal representation runs twenty thousand digits), α(n) ≤ 4. Read that again. For any input you could construct on any computer that will ever be built, the average cost per operation is at most four pointer hops.

Effectively constant. Call it constant, for practical purposes. But it is not constant. There is a real function in there. That fact is not a technicality. This is not a gap waiting for a smarter trick. It is a permanent lower bound, proved by Fredman and Saks in 1989.

Michael Fredman and Michael Saks worked in the cell-probe model, which counts how many stored bits any data structure must read to answer a query. They proved that Ω(α(n)) is the true lower bound for the union-find problem. No structure of any design can answer union and find queries doing less amortized work than α(n).

Tarjan’s upper bound says path-compression-plus-union-by-rank does at most α(n) amortized. Fredman and Saks’s lower bound says no structure can ever do less. The two bounds meet exactly.

Done. Problem solved. No more papers.

Problem solved is a rare luxury

Newton’s laws work until you approach the speed of light. Quantum mechanics works until you zoom out to cosmological scales. Turbulence in fluids is still open. It has been for a century. The three-body problem is still open. Physics doesn’t finish; it hands off. Every theory we have is a model that fits data within its domain of validity, and I think we forget how strange that is.

The Schrödinger equation describes how electrons move in any atom. For helium (two electrons) it has no closed-form solution. The electron interactions couple (each one affecting all the others) in ways that resist exact treatment. Chemistry approximates, and the gap is permanent, not a waiting-for-faster-computers problem.

Mathematics does close problems, and famously. Wiles closed Fermat’s Last Theorem in 1995, after the conjecture had sat open for 358 years. The four-color theorem closed in 1976 (with a computer’s help, which still rankles some). Poincaré closed in 2003. But the closures cluster on problems mathematicians chose: beautiful, deep, and often distant from anything that runs in your hand.

Computer science almost never gets all three at once. A problem that is practical, used in real systems every day; crisply specified, fitting in two sentences; and provably closed, algorithm and lower bound matched exactly. Union-Find is in this corner. Comparison-based sorting being Θ(n log n) is in it. Linear search on unsorted data being Θ(n) is in it. The list is short.

Algorithm and lower bound, meeting exactly.

Done. Problem Solved. No more papers.

The fastest non-trivial algorithm ever devised takes twenty lines to write, runs in essentially constant time per operation, and is one of the very few things we have ever — in any field — entirely finished.

If that isn’t freaking mind blowing, I don’t what is. Until next time, stay curious.

Union-Find is chapter 14 of The Algorithm Codex, a book that spells out the algorithms every programmer eventually meets in real, functioning Python — no pseudocode bullshit, no complicated math.

Every chapter ends with the same three questions: is it correct? how efficient? is it optimal? Most chapters land two out of three. A small few land all three. Union-Find is the cleanest example in the book of all three boxes ticked.

Get the Algorithm Codex

Mostly Harmless AI v2.0 is here!

Alejandro Piad Morffis — Sun, 31 May 2026 11:01:58 GMT

Hey! Happy Sunday. Here is a quick note to remind you that Mostly Harmless AI, second edition, draft one, is finished. PDF and EPUB on Gumroad. Free online reader at books.apiad.net. Both live, right now.

This book is a permanent beta, because the AI story is being created as I write these words and it’s moving just too damn fast for anyone to put it down cleanly in a book. So don’t expect the kind of polishing that comes with retrospective looks at settled topics. This isn’t that kind of book. It’s raw, but up-to-date. It contains basically everything I know and believe about AI that doesn’t require code or math to understand it.

Get Mostly Harmless AI - 50% off

Sixteen chapters, mechanism-first

Three parts, sixteen chapters, around two hundred and sixty pages, almost three hundred footnote citations.

Part I — Foundations walks the mechanisms, from the symbolic systems of the 1950s to the agent loops of 2026. Classical AI, machine learning, deep learning, language modeling, generative AI, agentic AI. If you have ever wondered what attention, gradient descent, or RLHF actually mean (not the marketing line, the mechanism), start here.

Part II — Applications shows how AI is reshaping each field: knowledge work, scientific research, software development, education, creative work, policy. Dual-audience throughout. Experts get practice and gotchas. Non-experts see why their lawyer, their doctor, and their kid’s teacher are suddenly using these tools.

Part III — Dangers is the honest one. Alignment, the limits of language models, and the harms already in the world today: scammers cloning your mother’s voice, autonomous weapons that pick their own targets, hiring algorithms that quietly downrank you, jobs evaporating from under people who built careers in them. The existential-risk question is in there too, treated proportionally.

The book lands on a third position beyond maximalists and doomerists. These systems are powerful enough to demand real responsibility and limited enough that the worry should be about who deploys them, not about the silicon waking up. The future is not predetermined. Neither doomer nor utopian framings are right. We choose, and choices have responsibilities attached.

The first-draft post from two weeks ago has the longer description if you want more.

Free online, or pay for offline

Read it free online, in perpetuity, at books.apiad.net. Clean typography, gentle dark mode, inline footnotes. No popups, no signup, no tracking. The reader was built for these books specifically.
PDF and EPUB on Gumroad if you want it offline, on your e-reader, on a plane: apiad.gumroad.com/l/ai. Launch week is 50% off.

If the online reader is free, then why would you buy it? Well, you don’t have to. But here is why you may want to do it anyway.

Writing this book took two years. The second edition took three months of intense rewriting on top of that. If you find value in the work and want more of it to exist (these books, the Computist Library, the journal you are reading), paying for the PDF is how you tell me that. I do not run ads. I do not sell your attention. Every book sale is one person telling me to keep going, and the money buys the months it takes to write the next one.

If you cannot pay, read it free. No guilt, no caveat. That option exists on purpose.

June is algorithms month

After today, I am putting AI writing to sleep for a couple of months, on purpose.

Not the work — I will still be building AI tools and using these systems every day. But on this journal, starting tomorrow and through July, I am switching to algorithms and core computer science, with The Algorithm Codex as the spine. The Codex is the next book in the Computist Library, written in parallel with MHAI and ready for center stage. Already 200 or so pages, and 50+ algorithms in, but still very rough around the edges, in dire need of your feedback.

During June, you can expect daily posts, Monday through Friday, each anchored on a chapter. Why the fastest algorithm ever devised runs in essentially constant time. Why most programmers get binary search wrong. Why no comparison sort can beat n log n, and the loophole that lets you anyway. Why some algorithms become settled facts the way theorems do.

AI returns to the journal in August. The field will surely have moved a lot by then, and there will be more than a few chapters to retouch.

So if you come here for the AI material, this book is the most complete thing I have on it — go read that. If you have been curious about the algorithms side, the next two months will be the best stretch of it I publish all year.

That’s all for today. Tomorrow I’ll meet you again with a brand new article.

Until then, stay curious.

Get Mostly Harmless AI - 50% off

Introducing Aegis: the programable multi-agent meta-harness

Alejandro Piad Morffis — Fri, 29 May 2026 11:05:05 GMT

On May 31 we launch Mostly Harmless AI v2. This arc — how models learn, how agents work, where they break, and what it takes to build something real on top of them — is now a book, updated for May 2026. Newsletter subscribers get 50% off.
Get Mostly Harmless AI - 50% off

Photo by Patrick Hendry on Unsplash

For the better part of the last two months I’ve been drilling you with a bunch of inter-connected ideas, all gravitating around the notion of agentic reliability. We started with how these models learn, argued about seventy years of AI history, traced the strange logic of prediction, dissected what an agent actually is, then spent two posts on the edges: how to write tools agents can really use, and what happens when you push them to their limits.

This is the last post of that arc, and it embodies my vision of where this whole Agentic AI thing is going. And to show I’m putting my money where my mouth is, this post is about the tools I’m building to bring forth that vision.

But first, why should you care?

Subscribe now

Agents aren’t just for coders

Here’s something the AI tool industry quietly gets wrong: every major agent harness — Claude Code, Gemini CLI, Cursor, OpenCode — is pitched as a developer tool. And technically, yes, they drive code.

But code is how agents solve problems, not what the problems are. When you ask an agent to research a topic, draft a document, reorganize a folder, schedule your week, or synthesize a dozen sources into one coherent brief — none of that is fundamentally a programming task. The coding is incidental. A historical accident of the fact that the people who built these tools happened to be programmers solving programmer problems.

So if you’re a researcher, a writer, a manager, a student, a scientist who occasionally touches a terminal — this piece is for you too.

What’s wrong with existing tools

I use Claude Code every day. The harnesses — Claude Code, Gemini CLI, Cursor, OpenCode — are genuinely good. The agentic loop is robust. Tool design is sharp. Context management is solved for practical purposes. I’ve opened ten-thousand-line projects and the agent knows exactly where it left off.

What’s missing is not single agent capabilities. It’s coordination.

Claude Code runs on Claude. Geminis CLI runs on Gemini. OpenCode allows any provider but you cannot use your existing subscriptions (which are heavily subsidized); you have to pay API rates.

If you’re running multiple agents with different models from different providers, you end up with four windows open: one per harness, one per model, one per subscription you’re already paying for. Different tools with slightly different annoying quirks and ways to do and call the same things.

But you can still make them collaborate, just not easily.

Most harnesses support sub-agents, which is a subroutine: the main agent mints a new subagent for a specific call, call it, it runs, it returns, the main agent continues.

What they don’t support is a mid-work handoff. Imagine you’re two hours into a brainstorm with Claude, a question has emerged that you cannot answer without trying some code, and now you want a second agent — different model, fresher context, perhaps Kimi from OpenCode Zen — to take over a brief coding session, and then handoff back to Claude what it found so brainstorming can continue.

That is somewhat achievable with subagents (except they run autonomously and die, you cannot, in most tools, interact with them and steer them into new directions). But what you cannot do, is have Kimi return to Claude, and stay up, waiting (with its full context alive) for a second follow up question to continue exploration.

That transfer doesn’t exist today. You have to ask Claude to produce a handoff manually, paste it in Kimi explaining the situation from scratch. The paste its response back to Claude, and so on. You have become a secretary between two AI agents.

Agents are isolated in all these tools. Two agents working on the same document will overwrite each other. There’s no locking, no merge protocol, no queue where one drops a task and another picks it up. No broadcast so all agents know when the plan changes. Agents don’t share a world — they each have their own private window into one.

I looked around before building. The closest I found: Conductor, which orchestrates multi-agent workflows — but only for Claude Code. And T3.codes, which drives any harness, closer to the spirit of what I wanted, but neither has cracked the coordination layer as I envision it.

Introducing Aegis

So, of course, I had to go and make my own. (Quick digression, if you’ve been reading this blog for a while you know I love to reinvent wheels if only for the learning experience, but this is a case where I genuinely couldn’t find something good enough.)

Here’s what makes Aegis different to anything out there, embodied in its slogan: the programable multi-agent meta-harness. Let’s build it from the back.

Meta-harness. You’re already paying for Claude via your Anthropic subscription. Gemini via your Google account. If a new tool wants to drive both, it has two options. It can re-authenticate you through its own layer: API keys, rate limits, and lost subscription benefits. Or it can call the native tools, which already have your credentials.

Aegis takes the second path. It drives Claude Code over its stream-json protocol, Gemini CLI and OpenCode over the Agent Client Protocol — calling the binaries on your machine, which already have your auth. It doesn’t touch your subscriptions. You stop worrying about which model wins this month’s benchmark.

And because Aegis calls the native harnesses rather than reimplementing them, it inherits everything they’ve spent months polishing: the agentic loop, the context management, the permission model. The harness keeps owning tool use, sandboxing, model selection. Aegis owns the layer above — tabs, routing, delegation — the things a single-conversation CLI was never built to do.

Multi-agent. Aegis provides six inter-agent syncronization primitives, built incrementally one above the others, to give you increasingly more powerful multi-agent capabilities.

The first primitive is a per-agent inbox. Any agent (including you) can hand another agent a message, that gets enqueued until the end of the current turn. This alone enables solving the problem we were just discussing in the previous section.

Then, on top of that, they got canvases: markdown files shared across agents with per section looks and callbacks that awake an agent when another finishes writing a section.

Then we got real terminals. Not an agent calling bash on a subprocess and blocking on its result. A real, shared, fully interactive terminal session that multiple agents can scan, tail, and write to, And you can too. One runs a command; another sees the output in real time and reacts. Or you run the backend and ask your agent to look at the logs when that heissenbug happens.

So far this allows you to spawn several agents and have them collaborate. But you can also have queues. Any agent can drop a task (a prompt) and the queue auto-spawns an ephemeral agent to take it, potentially calling back the emitter once done. Queues have a maximum cap on parallelism, as well as arbitrary rolling budgets on tokens and dollars so you keep control of how much work is allowed to happen without your supervision.

Agents can also be added into groups, dynamically created and destroyed on demand, by you or any other agent. Groups have a shared inbox and you can subscribe to them and get notified when the first, any, or all the agents in the group finish. This allows committee-like flows where different agents analyze a problem in parallel.

Programmable. And finally, you get workflows. Deterministic Python code that drives agent calls in sequence, with branching, conditionals, and loops. Think skills, but instead a hopeful blob of markdown one agent can choose to interpret as they want, these are composable routines that drive the entire substrate in the exact level of control you desire.

When you write a complex workflow in natural language, you’re hoping the agent follows through. It might decide step two is better done differently, skip the commit because something caught its eye, or forget step three entirely. A Python workflow doesn’t forget: it runs step 1, then step 2, then step 3, and commits. You wrote commit() in the code; the code runs. You get the agent’s creativity at each step; the program guarantees the steps happen.

Workflows can be scheduled: declare a cron entry in .aegis.yaml and the substrate fires it while you sleep. They also run across machines: one agent on your laptop can enqueue a task to a remote Aegis instance on a VPS and get the result back in its inbox.

Quick aside: Yesterday, as I was polishing this post, Anthropic announced Dynamic Workflows — a way to orchestrate long-lived agents over dozens of hours of work. I haven’t tested it yet, but It seems geared toward the same problem I’m trying to solve.

The difference is in the philosophy. Anthropic’s principle is to give agents as much agency as possible: trust the model, let it decide how to get there. Don’t get in the way, you stupid human. Tokens go brr. It’s the reason why all their solutions to problems are humoungos one-shot prompts.

My philosophy runs the other way. Leave agent creativity where it can do the most good — in the actual work, not in deciding whether step three happens after step four. The deterministic spine isn’t a constraint on the agent. Its what makes it work despite agent idiosincrasies, and why it works across all agents and harnesses, regardless of their intrinsic capabilities.

I’m using Aegis now for about 50-60% of my coding. It’s still rough around the edges, but it’s way more fun to use than any single CLI. There is a lot more in the box, like remote sessions, a built-in file browser, lots of metrics... but this post is already way too long. You’ll have to check it out on your own. Links at the end.

What I intentionally left out

Aegis has no native concept of skills. No AGENTS.md injected automatically. No memory system.

Those things are conventions, and conventions change. What belongs in an AGENTS.md today looks different from what it’ll look like in six months. Memory systems have a dozen competing designs and no consensus. If I’d baked any of that in, you’d be stuck with my choices the moment the community moved on.

Aegis has a very powerful plugin system instead (I told you, it is programable). You write a pure Python function, drop it on some folder, and it gets called anywhere in the agent lifecycle.

Want skills that activate on context? Write a plugin. Want a memory system? Write a plugin. Want to inject per-repository knowledge before every session? Write a plugin. The conventions you need are yours to build, and when they change, you change them, not me.

Coda

Claude Code and Gemini CLI are applications. You open them, use them, close them. Aegis is more than that. It’s a framework. You build on top of it — applications that spin up the agentic substrate automatically, pull in whichever harness fits the task, and run without anyone at the keyboard.

Picture a self-hosted Git forge where pushing a branch triggers agents: one reviews the code, one hunts bugs, one picks up open issues and starts implementing. Everyone on their own worktrees, independent, parallel, coding while you sleep. Push code; agents work.

That’s Sindri, and I’m also building it; but that’s a story for another Friday.

Aegis is open source: github.com/apiad/aegis. pip install aegis-harness to start.

May closes the agentic AI arc. If you want the full conceptual foundation — model internals, agent architecture, tool design, failure modes — it’s in Mostly Harmless AI v2, launching May 31. Subscribers get 50% off.

Also check the Compendium — one fixed price, every educational project I’ve built or will build, yours in perpetuity. Buy once, get everything.

In June: algorithms. The other half of the computational story — sorting, searching, graphs, optimization, the classical toolkit that AI didn’t replace and won’t. All month, one idea at a time. See you there.

Until next time, stay curious.

Why AI Agents Need Structure

Alejandro Piad Morffis — Wed, 27 May 2026 10:43:12 GMT

Actual footage of an AI handing off a research spec to another AI—it’ll make sense in the end. (Crazy the kinds things you can see in the wild, huh?)

Every post this month is on the theme of building AI agents that actually work — anchored on the second edition of Mostly Harmless AI, 50% off during early access, where the five-phase structure this post describes is a full chapter with more failure cases, the artifact design patterns, and the context isolation mechanics in detail. You can also read the whole book online for free. More at the end.

Last night you give your AI agent a clear task. It worked hard for two hours. YOu woke up to a report that is technically correct and completely useless.

I’ve had this experience enough times that I stopped blaming the model years ago. The failure is always structural, and I mean that in a very specific, diagnosable sense.

AI labs fine-tune language models to produce output, not to question the frame. The whole reward cycle behind how these models are trained pushes them toward helpful completions; nobody in that loop is rewarding a model for pausing mid-task and asking “have you considered that you might be framing this wrong?”

Execution is the default. That setting works beautifully for atomic tasks: create a note, write a short email, implement a specific function, compare two clearly-defined options. It fails quietly, and expensively, when the real problem isn’t the task itself but the goal behind it. Here’s three examples:

First: you ask an agent to “implement user authentication” for your web application. Clear task. The agent gets to work, producing a technically sound implementation using JWT tokens, bcrypt password hashing, and session management.

Second: you ask an agent to “write a technical report on renewable energy storage.” Again, clear task. The agent downloads papers, synthesizes findings, produces a well-structured document.

Third: a family asks an agent to “plan our move to a new city.” Thorough research follows — neighborhoods, school districts, moving companies, cost estimates. In every case, the output is technically correct.

In every case, the person who asked ends up with something that doesn’t quite fit. The gap between what was asked and what was needed — that gap is what this article is about.

The structure is the problem, not the model.

Not the model’s intelligence, not the quality of your prompt, not the temperature setting. The structure of the workflow you handed the agent is the thing that determines whether it answers the right question or a reasonable-sounding wrong one. In this article, I’ll show you what the right structure is. But to understand why it must be so, we need to see how we got there.

The first fix the industry reached for was planning. It helped. It wasn’t enough. Strap on for a story.

Subscribe now

Plan First, Build Second

The structure is the problem, not the model. Right? So the first fix the industry reached for was adding a planning mode.

If you’ve used any of the major agentic coding tools over the past couple of years, you’ve seen this pattern. There’s a plan mode: read-only, no side effects, meant for thinking through the task. Then a build mode: where the agent executes.

The intuition behind it is sound — separating thinking from doing is genuinely better than collapsing them into a single continuous stream. When you mix planning and execution in the same pass, the agent makes irreversible changes based on its first interpretation of the goal.

A planning mode forces a pause. You get to inspect the plan, argue with it, revise it before anything is written to disk. The plan is the first concrete artifact in the handoff chain. You can hold it in your hands — or at least on your screen — and have a real conversation about it.

Genuine progress. I don’t want to minimize it. But watch what happens with the family move.

The agent produces a thorough plan: compare three target neighborhoods, compare school districts in each, contact five moving companies for quotes, estimate total relocation costs, build a timeline. Excellent plan. The family reviews it, nods along, says “looks good,” and the agent executes.

Six months later: they’re living in a city that made complete sense on paper — except the job that drove the whole move turned out to be fully remote. They never needed to relocate at all, or they could have moved somewhere half the price with better schools.

The plan was correct. The execution was correct. The goal was wrong. And here’s the point: the agent that built the plan never questioned whether “plan a move to City X” was the right goal, because it was never asked to. It took the first reasonable interpretation of the prompt — “they want to move, let me help them move well” — and planned confidently inside that frame.

The software auth example is shorter but sharper. “Implement user authentication”: the agent plans for JWT tokens, bcrypt hashing, session storage. Fine plan. Sound, even. For a single-tenant web app. This is a multi-tenant SaaS product (one where dozens of customers share the same running application.)

The plan was never going to catch that, because nobody told the planning agent that multi-tenancy was a constraint. It did exactly what it was asked.

Now, the objection you might be forming: “just review the damn plan carefully before you approve it.” That objection is right about something, review does matter, bit it misses the fact that the agent that produced the plan has already committed to a frame.

Every question it asks, every tradeoff it surfaces, every option it presents — all of it is already shaped by its first interpretation of the goal. When you review that plan, you’re not reviewing a neutral set of options. You’re reviewing a plan that already selected its own success criteria. The frame is invisible because it was never made explicit.

Planning without exploration locks in the first reasonable goal.

Research Before the Plan

Once more, the structure is the problem, not the model. The next obvious fix is to put a research phase in front of planning.

By the time we hit mid-2025, a third mode was appearing in serious agentic setups. A research phase: read-only, job is to understand the problem space, not to produce a solution. The artifact it creates is a description, not a prescription — a document that maps what is known before anyone decides what to do about it.

The intuition is right again: if the planning agent doesn’t know what it doesn’t know, it can’t plan well. Research is how it finds out. For the renewable energy report, thorough research might surface the fact that the intended audience is policymakers, not engineers — which changes the vocabulary, the technical depth, and the document’s opening frame.

Real progress, again. And again, not enough.

Watch the renewable energy report closely. The agent runs a solid research phase: downloads twenty recent papers, reads industry reports, synthesizes the state of the art on battery storage, hydrogen carriers, pumped hydro, and grid-scale thermal systems. Then it transitions into planning mode — in the same context window. And here is the problem: the same context window.

The planning agent isn’t a fresh mind looking at a research report. It’s the same agent, carrying everything it concluded during research, now deciding how to structure the work. If the research agent concluded “battery storage is the central challenge in the energy transition,” the planning agent will structure the report around battery storage. Not because it made a new decision — because it never had the opportunity to question the prior decision. It just kept going. The research was excellent. The plan followed naturally from the research. The report answered the question the research agent found most interesting. Not necessarily the question the reader needed answered.

The steelman still stands, and research sharpens it. The steelman says: “planning alone is enough, just review the plan.” Research proves it isn’t — but not because planning is the wrong approach. Research proves it because research without context isolation just moves the lock-in one step earlier. The same agent that researched now plans. It cannot escape its own prior conclusions, not because it’s incapable of abstract reasoning, but because those conclusions are literally sitting in its context, shaping every next token it generates.

A fresh planning agent, handed only the research artifact as a clean document, can genuinely question whether the research answered the right question. It can push back. It can say “your research focused heavily on battery chemistry, but I’m not sure that’s what the audience needs.” The same agent that did the research cannot do that. Not really.

But here is the real kicker: research gives you facts; it doesn’t tell you if you’re solving the right problem. Save that thought for a minute, we need one more step.

Review After the Plan

The structure is still the problem, not the model. And to fix it, we added a new, and in hindsight, pretty obvious step. A review phase after implementation.

This is a dedicated pass where a separate agent — or the same agent in a separate context — evaluates the produced artifact against known criteria. Not just “does this code run,” but “does this code do what we intended.” The distinction from implementation is real and it matters. An implementation agent is building; a review agent is hunting for the thing that will break it.

What review actually solves is real. The software auth implementation, evaluated by a review agent, surfaces real questions: Is the JWT expiry window set appropriately for the threat model? Is the bcrypt cost factor tuned for this hardware? Are session tokens actually invalidated on logout, not just expired? These are genuine bugs a fresh pass can find. I’ve seen review agents catch the kind of subtle mistake that a second human reader catches — not because they’re smarter, but because they’re looking for problems rather than building a solution.

But watch what happens when the review agent evaluates the software auth implementation against the plan.

The plan said: “implement JWT-based authentication for the web application.” The review agent confirms: JWT is implemented correctly. Bcrypt is used. Session management is in place. The implementation passes review. It ships.

First enterprise customer tries to log in: there is no tenant isolation. Every user in the system shares a single authentication namespace. The review agent found no bugs. The implementation had no bugs. The plan specified the wrong thing. And the review agent couldn’t catch it — not because it was careless, but because review only checks “did we implement the plan correctly.” Not “was the plan the right plan.” Those are different jobs. And you can’t review your way out of the wrong spec.

This isn’t a failure of intelligence. It’s a consequence of what the review agent is handed. It receives the implementation and the plan. It has no clean access to the original problem statement — what the user actually needed, what constraints were implicit in the product, whether the product served one customer or a hundred who shared a namespace. It’s evaluating the gap between the artifact and the plan, not between the artifact and the goal.

Human code reviewers fail the same way, by the way. Code review finds style violations, off-by-one errors, missing null checks. It rarely questions the architecture decision that was made three sprints ago and embedded in every layer of the codebase. That kind of question requires a different context — a different meeting, a design review, a fresh set of eyes on the spec rather than the implementation.

Review catches errors, but only inside the frame you committed to in the planning phase. You can’t review your way out of the wrong spec.

Name the Problem Before Solving It

Finally, the structure IS the problem, not the model.

Research doesn’t ask it. Planning doesn’t ask it. Implementation doesn’t ask it. Review doesn’t ask it. The question is: what problem are we solving, and how will we know when we’ve solved it?

That’s not a rhetorical question. It has specific, concrete answers. And those answers should be a document — a specification — produced before the plan is written. Not during planning. Not as a side effect of research. Its own phase. Its own artifact.

A specification answers four questions, and it only takes one page to do it. What is the exact output we are trying to produce? What are the hard constraints it must satisfy? What does success look like — specifically, what would we check to confirm it? What does failure look like — what would make us say this didn’t work?

These sound obvious. They are almost never answered before a planning phase begins. In my experience, the reason is that they feel like they slow you down. They don’t. They prevent six months of work in the wrong direction. Think of it as a single page you could tape to the wall — the kind you’d point at during a disagreement about whether the output succeeded or failed.

Go back to the family move. After a thorough research phase — neighborhood data, school ratings, crime statistics, cost of living comparisons — a specification phase asks: what does a successful move look like for your family? The family sits with that question. It turns out they’ve never explicitly answered it.

The husband’s answer: proximity to his aging parents, who live in a specific region of the country. The wife’s answer: their daughter getting into a specific school district that has strong arts programs. The budget answer: keeping total housing costs under a threshold that lets them maintain their current savings rate. Three explicit success criteria.

The research phase found no conflicts because it was never told what it was optimizing for. The specification phase surfaces all three criteria before the plan commits to a single city. The planning phase can now do something useful: find cities that satisfy all three criteria — or, just as importantly, discover that no city satisfies all three and surface that conflict before anyone books a moving truck.

The software auth case is fast. Specification asks: what does correctly-implemented authentication look like for this product, given who its customers are? The answer: it must support multi-tenant isolation with strict data separation, SSO for enterprise customers, and a free tier with email-only login.

Now the plan can be written for the actual product. The research phase’s work on JWT and OAuth is still valid; it just needs to be read through the lens of multi-tenancy, which the specification made explicit.

The full chain, with its five concrete artifacts, looks like this.

Research produces a collection of source materials plus a descriptive state-of-the-art report — what is known about this problem space. Specification produces a success-and-failure criteria document. Planning produces a concrete step-by-step plan — how to get from here to there, given the spec. Implementation produces an evaluable artifact — code, document, report, recommendation. Review produces an evaluation report checked against the specification, not just the plan — a real answer to “did we solve the problem?”

Five documents. Five handoffs. Five chances to catch the wrong frame before it becomes expensive.

We Knew it all Along

The fun part is all of this was known, at least in principle, since before 1987.

IDEO, a by-now ultra famous design consultancy, articulated a five-phase creative process that Stanford’s d.school later codified as Design Thinking.

Tim Brown formalized the diverge-converge logic in Change by Design. The five phases: Empathize, Define, Ideate, Prototype, Test. If those names sound familiar given what you’ve just read, that’s not a coincidence.

Empathize is research: go wide, gather context, talk to users, understand the problem space from the outside rather than the inside. Define is specification: converge on an explicit problem statement with clear success criteria, the “how might we” question that frames everything downstream. Ideate is planning: diverge again, generate candidate solutions, explore the space of possible approaches. Prototype is implementation: produce an evaluable artifact, something you can put in someone’s hands. Test is review: evaluate the prototype against the problem statement, not just against the prototype’s internal logic.

Every phase the agentic world has been bolting on since 2023 was already named, sequenced, and justified in a framework that predates the modern web by a decade and a half. It’s a framework every Silicon Valley startup, incubater, accelerator, and VS knows in and out. It is taught in bussiness majors all over the world. It’s literally the structure of most pitch decks. But it is still missing in most agentic protocols we use every single day.

The piece most clearly missing is the Define phase — the second one, which IDEO put second for a reason: without a clear problem statement, everything downstream answers the wrong question. It’s a very old insight the field keeps rediscovering from scratch — Agile’s Definition of Done, test-driven development’s failing-test-first, specification by example. Each was the same insight under a different name.

Now, here is the strongest version of your initial objection. “You don’t need all this structure. A great prompt specifies the audience, the format, the success criteria, the constraints. Write a better prompt and you get all five phases in one go.”

Let’s take this seriously, because it’s not wrong about prompt quality. A genuinely well-crafted prompt that specifies who the output is for, what format it should take, what it must accomplish, and what would make it fail — that prompt is effectively a specification. You’re right that prompt quality matters.

But a single prompt containing research context, goal specification, a plan, and execution instructions is not a cleaner version of the five-phase process. It’s five phases collapsed into one context window, with no mechanism for each phase to question the prior one’s conclusions.

When research and planning share a context, planning can’t interrogate research. When planning and implementation share a context, implementation can’t push back on the plan. When specification and review share a context, review is already biased toward confirming the specification it helped write. Prompt quality is about what you ask.

Phase independence is about who processes each answer, and whether they can genuinely disagree with the prior step. You can write the world’s best prompt and still hand it to an agent that will execute it inside the same narrowing tunnel, compounding the same assumptions with every step.

Picture someone reading their own manuscript for the fifth time — they no longer see what’s there, only what they meant to write. It is one of the most replicated findings in human psychology: once we form a belief, we interpret subsequent evidence through the lens of that belief. We notice confirming evidence, discount disconfirming evidence, and generate hypotheses that assume the belief is correct.

This is not a weakness of intelligence. Every mind — human or artificial — interprets through the conclusions it has already drawn. The agent that researched your problem already believes things about it. When it transitions to planning, it plans in service of those beliefs. The agent that planned has a solution in mind. When it implements, it makes countless micro-decisions that serve that solution. The agent that implemented defended choices as it worked. When it reviews, it reads its own output charitably.

Context isolation breaks this chain. A fresh context hasn’t seen the prior steps. It cannot be fooled by conclusions it never drew. It reads the artifact cold, which is the only way to genuinely evaluate it.

Design Thinking’s diverge-converge logic is not about what each phase does. It’s about who does it, and whether they can arrive at it without inheriting the prior phase’s commitments.

Start Doing This Yourself Today

The artisanal version of this is simpler than it sounds:

Treat each phase as a distinct conversation. Start a fresh session for each one.
Hand it only the artifact from the prior phase — not the prior conversation, not your running context, not a summary of what you’ve been thinking about. The artifact alone.
And tell the agent explicitly what mode it’s in. “You are in Research mode. Do not propose a plan. Do not suggest solutions. Your only job is to describe the problem space and produce a research report.”

That instruction matters. Not because the model needs to be controlled, but because explicit mode assignment prevents the agent from sliding into execution behavior when it senses a gap to fill. Models are trained to be helpful; helpfulness steers every gap toward a solution. Naming the mode is how you resist it.

The discipline lives in you, not the tool. This works with any agent, any interface. Fresh context, explicit mode, artifact handoff. That’s the whole recipe.

If you want to go further, you can make phases structurally enforced rather than just instructed — agents that literally cannot execute, subagents that receive only the artifact, automated handoffs with no shared context. Programmable harnesses give you this level of control with permission levels per skill. (If you don’t have one, call me, I’ll lend you one for free.)

One small step you can take today: add a specification phase to whatever workflow you already use. Before your planning phase writes a plan, ask for a success-criteria document first. One page. Explicit pass/fail conditions. What would make this output a success? What would make you throw it out? Review that document before planning begins.

This single addition — inserting a define phase between research and planning — catches more failures than adding a review phase after the fact. Because it catches them before the plan commits to the wrong goal.

What not to do: don’t implement all five phases as a mechanical checklist. And don’t add phases as ornamentation — a research phase that shares a context with planning adds conversation turns, not structure. More words in the same window is not more phases. The phases are context boundaries, not steps in a recipe. A phase that doesn’t produce a concrete artifact and doesn’t hand it to a fresh context adds nothing.

One document per phase. Fresh context per phase. That’s it.

Structure Before You Re-Prompt

Picture the artifact chain as a physical thing. A manila folder passed from one desk to the next. The research desk produces a report, closes it, slides it across. The specification desk opens only that folder, reads it, produces a criteria document, closes it, slides it across. The planning desk never opens the research folder — it opens only the criteria document.

And so on down the line. Each desk sees exactly one prior document. Each desk produces exactly one new document. The chain is what makes independence possible. You cannot hand off a vague intention. You cannot slide a feeling across a desk. Only a document.

Context isolation is the move most pipelines skip, and it’s the one that does the most work. Every phase that shares a context with a prior phase inherits its commitments. Not because the model is lazy or wrong — because that’s how cognition works, human or otherwise. We interpret through the lens of what we already concluded.

Context isolation is cheap: start a new session, pass only the artifact. The cognitive science is unambiguous: breaking the confirmation-bias chain requires a structural break, not a better instruction. Context isolation gets skipped because it looks optional. It isn’t.

Remember, the structure is the problem, not the model. Restructure before you re-prompt.

Until next time, stay curious.

This is the core argument of the agentic workflows chapter in the second edition of Mostly Harmless AI — the full chapter walks the failure cascade with more cases, the artifact design problem (a research report can be thorough and still hand the wrong thing forward), and the context isolation mechanics in depth.

This specific article is new content, still not in the book, but it will land there shortly. The book is 50% off while it’s in early access, and also free to read online in a custom reader I built: dark mode, font controls, progress tracking, offline support, the works.

If you want the architecture behind these systems — how they fail, what the harness around them should look like, and what to actually do about it — that is what the book is for.

Get Mostly Harmless AI - 50% off

Mostly Harmless AI - Mid-week Report

Alejandro Piad Morffis — Tue, 26 May 2026 10:31:49 GMT

This week I’m pushing towards a sane v2.0 version of Mostly Harmless AI, the book that shows you what AI is, what you can do with it, and where it can go wrong, without the hype or the bullshit. I mean “sane” because “done” is unreachable, this fields moves just too fast. So I’ll aim for something that is good enough for today’s reality, and revisit it in six months when we I have more to say—as I’m sure I will.

This is a quick recap of what’s changing from version 1.0 and how to get it. The first draft went out Friday. Since then I have been working through the rough edges, the kind you can only see once the whole book is in front of you at once. The version on the site today is materially better than what shipped on Friday.

Get Mostly Harmless AI - 50% off

The two main changes since Friday

The prologue is properly a prologue now.

When you open the prologue now, it begins where the idea began. The first edition opened in the 1950s, where the field opened. That was a mistake. The ideas that made modern AI possible were not born at Dartmouth in 1956 — they were assembled over three hundred years, by people who had no computers and no reason to suspect their work would one day converge into one. The new version begins with Leibniz in the 1670s, who imagined a calculus ratiocinator that could reduce human reasoning to calculation; with Ada Lovelace in the 1840s, who saw that symbols could represent anything, not just numbers; and with Alan Turing in 1936, who drew the boundary of what computation could do as a mathematical object, decades before the first computer existed. The rest of the prologue walks the seventy years between that theoretical foundation and the moment the field became an industry. The chronology now starts where the idea did, not where the field did.

There’s a glossary now.

The glossary for Mostly Harmless AI covers 484 entries — every person, system, technique, and field-of-art term used anywhere in the body. That includes Backpropagation, AlphaFold, the Bombe, ELIZA, RLHF, Stochastic Parrots, the Lighthill Report, and several hundred more. Each entry links bidirectionally: click a term in a chapter to jump to its definition; click a back-reference in the glossary to jump to where the term was used. When a term keeps recurring across chapters, the back-references show you the through-line. You can read the book without bringing any specific prior vocabulary.

I rewrote the preface in the same voice as the chapters — shorter, sharper, without the prompt-engineering rubble from the first AI-assisted cut. The cover is new too: a painterly collage running from Leibniz through Bletchley Park, ENIAC, Dartmouth, ELIZA, the multilayer perceptron, and the chatbot. Babbage was on the original cover and shouldn’t have been. He belongs in Lovelace’s chapter, and that’s where he is now.

Here is what the access model looks like and what your support funds, if you choose to give it.

Access and support

Mostly Harmless AI is free to read in perpetuity at books.apiad.net. The reader there has clean typography, dark mode, and inline footnotes — no popups, no tracking, no required signup. Not a preview. The full book.

The PDF and EPUB are on Gumroad for those who prefer to read offline. Buying is a gesture of support, not a paywall — you get the same book either way. One-time purchase, every future edition in perpetuity, no subscription.

Part of the proceeds from the second edition funds the Spanish translation of Mostly Harmless AI.

If you can’t afford it, please take it free. I would rather you have the knowledge than the gesture. If the price is the only obstacle, write to me at apiad@apiad.net and I will make it work.

Where to find it

The book is free to read at books.apiad.net. The PDF and EPUB are at Gumroad. If you bought the first edition on Friday, your download already points at the latest files. A few things still to land by May 31 — if you find something off while reading, write to me. The next edition will be better for it.

AI is doing something weird to Science

Alejandro Piad Morffis — Mon, 25 May 2026 12:40:28 GMT

Photo by Artturi Jalli on Unsplash

Picture Donald Knuth. Eighty-eight years old, the man who wrote The Art of Computer Programming by hand in TeX, which he invented himself to be able to write his own books. The father of algorithmic analysis. The most laudeated living computer scientist. A legend, and a well-known AI skeptic.

Now picture him reading a printed chat log between a fellow colleague and Claude Code. Not skimming. Reading it in detail, because there is something genuinely baffling about it.

The log belongs to Filip Stappers, a mathematician who ran thirty-one coding explorations with Claude Opus 4.6, systematically probing a class of combinatorial objects Knuth had spent decades thinking about. Exploration 15 surfaced something unexpected: a structural pattern nobody had written down.

Knuth read it, judged it valid, proved it correct by hand, and wrote a paper about it. He called it “Claude’s Cycles.” Knuth noted, with his characteristic precision, that he’ll “have to revise his opinions about generative AI” one of these days.

Most accounts that open with this scene take one of two off-ramps.

The first off-ramp is the replacement narrative. AI is now the scientist. The model had the insight; Stappers just ran the prompts; Knuth read the output and judged it true. Discovery has been automated. We are, depending on your temperature, either liberated or obsolete.

The second off-ramp is the stochastic-parrot dismissal. It’s just a language model predicting tokens. It doesn’t understand combinatorics; it doesn’t understand anything. Stappers did the science; Claude shuffled plausible-sounding symbols. Attribute the discovery to the researcher, not the autocomplete.

Both off-ramps feel satisfying. Both are wrong. And they’re wrong in the same way: they’re answering the question “did AI do the science?” That’s the wrong question. The interesting object is not the agent. It’s the loop.

The loop looks like this: a human poses a question; a model proposes candidates; a verifier filters the candidates; a human curates what survives. Round and round. What Stappers and Claude did is not fundamentally different in shape from what Tao and Lean are doing, or what the GNoME pipeline does in materials science, or what AlphaFold did for protein structure. The shape is the same. The loop does the discovery.

I want to be clear about what that means, because it’s easy to hear it as either a compliment to AI or a dismissal. It is neither. It’s an empirical claim about where the causal action lives. Not in the model, not in the human, but in the interaction structure between them. Get that structure right and you get science. Get it wrong and you get confident nonsense at scale. The details of what ‘wrong’ looks like are worth walking through carefully, and we’ll do that a couple of sections down.

But before we can dismiss either off-ramp, we need to walk four recent cases. Because in every one of them — Claude’s Cycles, Tao and Lean, AlphaFold, GNoME — the replacement narrative is not just philosophically confused. It is empirically wrong. The loop does the discovery.

Four Cases, One Shape

In each of the following cases, the loop does the discovery. I’m going to give you four of them across four domains, because I think the pattern only becomes undeniable when you see it that many times. Same shape, different materials.

Claude’s Cycles

You already have the scene. Stappers runs thirty-one explorations. Not one inspired conversation, thirty-one numbered, documented, methodical probes. Claude proposes. Stappers evaluates. Knuth reads the surviving logs, verifies the mathematical claims by hand, and writes. One human author. One credited co-explorer. The question was Stappers’; the verification was Knuth’s; the curation of which thirty-one explorations were worth pursuing was human throughout. The model was the proposer.

That word matters, and I want to be precise about it. Proposer. Not discoverer, not author, not scientist. The one that generates candidates fast enough that the verifier can find something in the haystack.

Tao and Lean

Terence Tao, Fields Medal, the current standard of living mathematical genius, has been publicly working through what it looks like to use LLMs for mathematical research. His account in the Notices of the American Mathematical Society in 2025 is careful and specific, and I appreciate the care because it’s easy to overread these things in both directions.

Here’s what you’re actually looking at: an LLM proposes proof steps, intermediate claims, candidate lemmas, reformulations of the problem. Lean’s type-checker is the verifier. And Lean’s type-checker cannot be fooled. It either accepts a proof term or it doesn’t. There’s no “plausible-sounding but wrong” in Lean. The system rejects garbage instantly and silently. What the human does — Tao, or anyone in his lab — is curate: which of the surviving proof steps are worth following, which directions are worth pursuing. The loop is tight. The proposer is creative and unreliable; the verifier is reliable and uncreative; the human operates between them.

I’d reframe that: the breakthrough isn’t that LLMs can now do mathematics. It’s that the loop can now run fast enough and cheaply enough to be useful.

AlphaFold

I want to include this one even though it’s older and better-known, because it’s the case where people most confidently say “AI solved the problem.” And that confidence is telling.

The problem was fifty years old. How does a protein fold? You have the amino acid sequence; you want the three-dimensional structure; that structure determines function; that function determines everything downstream in biology and drug design. AlphaFold 2 proposes structures. Experimental crystallography verifies them: X-ray diffraction, cryo-electron microscopy, techniques that require real equipment, real samples, real physics. You cannot hallucinate a crystal structure into being; the protein either folds the way the model says or it doesn’t, and the experimental method tells you which. AlphaFold 3 extended this to molecular complexes: DNA, RNA, small-molecule ligands binding to proteins.

I think this is worth sitting with: human researchers curated which structures were worth solving, which proteins mattered, which findings were publishable. AlphaFold proposed. Nature verified. Humans curated.

GNoME and A-Lab

This is the one I find most clarifying, because the verifier here is the most brutally physical. You can’t argue with a crystal.

GNoME — Google DeepMind’s Graph Networks for Materials Exploration, if you want the full name — generated 380,000 candidate stable crystal structures. That’s the proposal stage. Then A-Lab, UC Berkeley’s autonomous laboratory, took 58 of those candidates and actually tried to synthesize them. Robotically. With real chemicals, real furnaces, real diffraction equipment to check what came out. Forty-one novel materials were successfully synthesized in 17 days.

Think about the verifier in that loop. Not a type-checker, not a scoring function. A physical robot in a real lab mixing real compounds at real temperatures and asking whether the crystal forms. You cannot fake a crystal. If A-Lab says it synthesized a novel stable material, it synthesized a novel stable material. The model proposed; the physical world verified; human researchers curated which structures were interesting enough to attempt.

I want to be blunt here: the pattern is not subtle. In every case: model proposes, independent verifier filters, human curates. The model is never the verifier. The model is never the question-poser. The model occupies exactly one slot in the loop, and it’s the slot that benefits most from creativity, speed, and a high tolerance for being wrong.

The replacement narrative requires you to believe the model is doing all three jobs. It isn’t. Empirically, demonstrably, across four domains, it isn’t.

Before we anatomize what changed inside the loop, it is worth noting that this shape is fifty years old. Because that changes what we should find surprising.

The Loop Has Is Older Than You Think

The loop does the discovery today in the same way it did the discovery in 1976.

I want to be somewhat insistent about this, because I think the historical amnesia around computational assistance in science is part of what makes AI discourse so unmoored. This isn’t new. What’s new is narrower than you think, and understanding exactly what changed is the only way to correctly evaluate what it means.

Appel and Haken, 1976

You probably know this one: the four-color theorem says any map can be colored with four colors such that no adjacent regions share a color. Mathematicians had been trying to prove it since 1852. In 1976, Kenneth Appel and Wolfgang Haken proved it. Using a computer. Their proof involved reducing the problem to 1,482 configurations and verifying each one computationally. No human read every step. No human could. The proof was real; the verification was mechanical.

The mathematical community was genuinely unsettled. The theorem wasn’t wrong. But the proof didn’t fit the usual epistemological frame. You couldn’t follow it the way you follow a conventional proof. You had to trust the computer. That discomfort was the first serious confrontation with what I’m calling the loop: humans posed the problem, humans designed the reduction, a computer verified the 1,482 cases, humans accepted the surviving result.

Same shape. Fifty years earlier.

Hales and Flyspeck, 1998–2014

Thomas Hales proved Kepler’s sphere-packing conjecture in 1998. The conjecture: the way you’d intuitively stack cannonballs, face-centered cubic packing, is in fact the densest possible arrangement. The proof relied on computer enumeration so extensive that the referees couldn’t verify it. They were “99% confident” and said so, which is an unusual thing for mathematical referees to say.

Sixteen years later — sixteen years — the Flyspeck project completed a formal verification of the proof in HOL Light, a proof assistant. I want to dwell on that: sixteen years. The loop had a very slow verifier in the middle. It still worked. The result was real.

Sixteen years from claim to verified closure. I find that humbling. We talk about AI accelerating science as though “fast” is a new property of loops. The loop has always done the discovery. Sometimes slowly.

AI Feynman, 2020

You won’t usually see this one in the AI-in-science timelines, but it belongs here. Silviu-Marian Udrescu and Max Tegmark built a system called AI Feynman that uses symbolic regression (searching the space of mathematical expressions) to recover physics equations from data. Feed it measured relationships between physical quantities; it proposes the equation. Tested on 100 equations from the Feynman Lectures on Physics. Human scientists posed the problems; the system proposed expressions; formal mathematical checks filtered them.

This is 2020, two years before the moment people usually date as the AI-in-science inflection. I find it clarifying: the loop is the same.

What 2022 Actually Changed

Here’s what changed. One thing in the loop changed, and it’s the proposer slot.

Before roughly 2022, the proposer in loops like these was domain-specific, narrow, and hand-engineered. AlphaFold’s architecture was designed from the ground up for protein structure prediction. The Flyspeck enumeration was written for Kepler’s problem. AI Feynman’s symbolic regression engine was built for recovering physics equations. The verifiers were already strong: formal proof checkers, physical experiments, crystallography. The curators were already human. But building a proposer required significant domain-specific engineering effort for each new application.

What changed, and I think this is the key move, is that the proposer slot is now increasingly occupiable by general-purpose large language models that can be directed with natural-language specifications.

The same model that helps Stappers explore combinatorial objects can, with different prompts, propose protein structures, generate proof steps, suggest material candidates. It’s not that the model does these things well in some absolute sense. It does them well enough that a strong verifier can find the real results in the output. And FunSearch, DeepMind’s system for mathematical discovery, goes a step further: it uses LLMs to generate the search strategy itself. AlphaEvolve extends this to evolving the algorithms. The proposer writes the proposer.

I want to flag what didn’t change: verifier reliability, human curation, question-posing. Knuth still read the logs. Tao still decides which lemmas are worth pursuing. The IMO committee still wrote the problems. The physical world still decides whether a crystal forms.

So: what exactly is the loop, and what changed inside it? The proposer and verifier are different jobs — you can confuse them, and that confusion is the mistake that both off-ramps make.

Anatomy of the Loop

The loop does the discovery. But what is the loop, exactly? It has four roles: poser, proposer, verifier, curator. They are not interchangeable.

I want to be specific about each, because vagueness here is where confused takes come from. When someone says “AI did the science,” they’re usually collapsing all four roles into one and attributing them to the model. When someone says “it’s just a tool,” they’re usually denying that the proposer role is meaningfully distinct from, say, a search engine. Both collapses are wrong.

The Proposer Role

I want to define this carefully, because the proposer role is where most confusion lives. The proposer generates candidates. That’s it. It doesn’t need to know which candidates are correct; it doesn’t need to understand the domain at the level required to verify; it doesn’t need to bear accountability for wrong answers. It just needs to produce output that the verifier can evaluate. Fast. In large quantities. With enough breadth that interesting things appear in the distribution.

This is a real and important job. Generating candidates well, in a domain, at the right level of specificity, with the kind of variety that makes the verifier’s job tractable, is genuinely hard. What LLMs are good at is producing plausible candidates in natural-language-specified domains. That’s useful precisely because “plausible” is a different bar from “correct.” The proposer buys lottery tickets; the verifier checks which ones won.

Galactica, or: What Happens Without a Verifier

In November 2022, Meta released Galactica, a large language model trained on scientific literature. I remember when this dropped: the pitch was that it could reason about science, write papers, explain concepts, generate hypotheses. It was fluent. Confident. It had read more papers than any human alive.

It was retracted in three days.

Galactica produced plausible-sounding but fabricated citations. It generated chemistry that was wrong. It stated incorrect facts with the same calm certainty it used for correct ones. The model had no verifier. It had no external check against which its proposals were filtered. It was a proposer talking directly to readers who were treating it as a verifier.

That’s the failure mode. “LLMs are bad at science” is the wrong lesson. Galactica was a capable proposer deployed without a loop. Dump the proposer’s output directly into the world, skip the verifier, and you get confident nonsense. The problem wasn’t the model. The problem was the missing loop.

The Verifier Role

I want to be careful here, because this is where the asymmetry lives. The verifier is not creative. The verifier must be right. These are the two properties that matter.

Lean’s type-checker does not hallucinate. It processes a proof term and either accepts it or rejects it, and its answer is correct by construction: it’s checking against the formal rules of type theory, and it doesn’t have opinions or moods or bad days. A crystal either forms or it doesn’t. X-ray diffraction either confirms the predicted structure or it doesn’t. These verifiers are not AI systems. They’re physics, or mathematics, or formal logic.

Here’s a slogan worth keeping: the verifier is the one that matters. A loop with a weak proposer and a strong verifier still produces valid science — it’s just slow, because it needs more proposals before one survives — this is just what “normal science has been so far”, slow proposer, strong verifier.

A loop with a strong proposer and a weak verifier produces Galactica. The asymmetry is important. You can have crap proposals and still win, as long as the verifier is robust. You cannot have a weak verifier and still win, no matter how impressive the proposer.

Tao’s Insight on Composability

Tao articulated this in his 2025 Notices piece, and I think it’s genuinely the right way to think about it: the value of the LLM-plus-formal-verifier combination comes from combining complementary weaknesses. LLMs are creative but unreliable. Lean is reliable but cannot be creative; it needs to be told what to check. Individually, each is limited. Together, the loop covers both limitations: the LLM proposes, Lean certifies, and you get results that are both novel and guaranteed correct.

This is not a new insight in the philosophy of science or even in computational science. It’s the same structure as Appel-Haken: creative human mathematicians posed and structured the problem, mechanical verification checked the cases. What’s new is that the LLM makes the creative-but-unreliable proposer slot much cheaper and more general to fill. You don’t need a domain expert to hand-engineer each proposer; you need a prompt and a general-purpose model.

What’s Genuinely New Since 2022

Let me be specific, because this matters for evaluating claims. Three things changed:

First, open-ended program synthesis. The proposer can now write code, not just fill templates. FunSearch and AlphaEvolve don’t just suggest candidate solutions; they generate the search strategy itself, which then searches for solutions. The proposer proposes how to propose. That’s a qualitative shift.

Second, cross-domain transfer at useful fidelity. The same general-purpose model can propose protein structure candidates, proof steps, crystal candidates, combinatorial patterns, all with natural-language specification. You don’t need to rebuild the proposer for each domain. The moat that used to come from hand-engineering a domain-specific proposer is largely gone.

Third, tight LLM-to-formal-verifier loop latency. The Flyspeck project took sixteen years. The Lean loop that Tao describes closes in seconds. When the loop runs fast, you can iterate fast, which means you can attempt more ambitious problems and accumulate signal faster about which directions are productive.

What hasn’t changed: humans pose the problems; the verifier is still not an LLM; which questions are worth asking remains entirely human. Knuth read the logs. Tao decides which lemma is promising. The IMO committee wrote the problems. The GNoME team decided that stable crystal structures were the target. The question of what to look for has not been automated.

With the anatomy clear, we can finally steelman both off-ramps seriously. And explain precisely why they are both wrong.

Why Both Extremes Are Wrong

The loop does the discovery, but what does that mean for who gets the credit?

I want to take both positions seriously, because I think each is tracking something real. Dismissing either one without engagement is how you end up with a position that sounds crisp but falls apart in the edge cases.

The Maximalist Steelman

Knuth is eighty-eight years old. He has spent decades thinking about the class of combinatorial objects Stappers was exploring. He hadn’t found the pattern that surfaced at exploration 15. Stappers, presumably, hadn’t either, or there wouldn’t have been thirty-one explorations looking for it.

Here’s the maximalist claim: if surprise is constitutive of discovery, if discovery is the moment something genuinely unexpected becomes known, then the model contributed something essential. It’s not that the model was incidentally involved in a process humans could have completed. It’s that the specific result didn’t exist in any human mind before the model generated it. A system that reliably surfaces what experts miss is doing something scientists do. The fact that the model can’t feel pride or bear responsibility doesn’t settle whether it participated in discovery.

The maximalist is pointing at something real. The model wasn’t executing a search strategy that a human had pre-specified. It was navigating a space in ways that produced genuinely unexpected output. That matters.

The Dismissive Steelman

A hammer drives nails. We don’t say the hammer built the house. Excel executes arithmetic the analyst specified; we don’t credit Excel with the financial model. AlphaFold was trained on structural biology data that human researchers collected over decades; the training objective was designed by human engineers; the evaluation criteria were set by human scientists. The model is a sophisticated tool. Sophisticated tools don’t discover; they execute.

The dismissive case is not purely rhetorical. There’s a serious accountability point underneath it: when a model-generated finding turns out to be wrong (and some will), who is responsible? If you’ve attributed agency to the model, you’ve obscured the accountability chain. The human who deployed the loop without a strong verifier, who curated poorly, who published without checking: that human is responsible. Diffusing credit into the model diffuses accountability too, and that’s a practical problem, not just a philosophical one.

The dismissive steelman is also tracking something real. The model didn’t choose the question. It didn’t decide the result mattered. It didn’t design the verifier. It didn’t stake its reputation on the finding.

The Verdict

Both steelmans are right about something; both are wrong about the structure.

The maximalist is right that the model contributes something essential, something that wouldn’t have been there without it. Wrong that this constitutes independent discovery: you cannot have independent discovery without the ability to pose a question, which requires caring about the answer, which requires the kind of intentionality the model doesn’t have.

The dismissive is right that the model doesn’t choose questions, doesn’t bear accountability, and shouldn’t be treated as an author in the full sense. Wrong that this makes it a mere tool in the hammer-and-nail sense. AlphaTensor, DeepMind’s system that found faster matrix multiplication algorithms, didn’t just execute a search the engineers specified. It found an algorithm that reconfigured what experts believed was achievable. That’s not executing; it’s navigating a combinatorial space in a way that produces genuine surprise. The hammer never surprises you — except when you’re not looking and it hits your finger, which by the way, AI can also do, and with a louder bang.

The right frame is, I think, the AI lab member. Indispensable, capable, sometimes surprising. Not a hammer; not a principal investigator, not a replacement for another human. Just a genuinely, qualitatively new kind of entity. An entity that occupies the proposer slot in the loop and does it better than anything that occupied that slot before, but doesn’t touch the verifier, doesn’t pose the questions, and doesn’t bear the accountability that comes with authorship.

If AI is a lab member and not a scientist, what does that do to the publishable paper, and to the metrics we use to measure science?

But What Happens to the Paper

The loop does the discovery. But the paper still needs to be written, and the paper is what the institution counts.

This is where I get more pessimistic, or at least more cautious. The loop is good news for science-the-activity. It’s more complicated news for science-the-institution.

Discovery and Paper Count Were Already Decoupled

Park, Leahey, and Funk published in Nature in 2023 a careful empirical study of scientific disruption, measuring across decades of papers and patents how often new work displaces the prior literature versus consolidates it. Their finding: disruption has been declining since 1945. Not linearly, not dramatically, but consistently. Meanwhile, paper count has been exploding.

The interpretation I find most compelling: the processes that generate papers and the processes that generate genuine advances were already decoupled before AI. A lot of papers are small increments, confirmations, replications, applications of known methods to new datasets. That’s not waste; science needs that infrastructure. But it means paper count was already a noisy proxy for discovery rate.

What an AI proposer does to this: it makes generating candidate findings cheaper. Cheaper to generate means more candidates, means more papers. Discovery and paper count decouple further.

Goodhart as Accelerant

Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Academic institutions have been targeting paper count, citation count, journal impact for decades. Paper mills, factories producing fake or low-quality research for pay-to-publish journals, were already a pre-AI problem. Ioannidis documented the replication crisis in 2005 and again in 2018; most published findings in some fields don’t replicate, and that was before AI made generating plausible-sounding results cheaper.

The AI proposer is an accelerant deployed into a system already optimizing for the wrong thing. It makes the Goodhart problem worse. A lot worse. If you can generate a thousand candidate papers in the time it used to take to generate ten, and the verifier in your loop is peer review (slow, inconsistent, and famously gameable), then you have a problem that is structurally different from the pre-AI problem in scale.

I want to be clear: this is not an argument against AI-in-science. It’s an argument that the institution of science has a verifier problem, and AI proposers make that verifier problem more urgent. The fix is not to slow the proposer. The fix is to build better verifiers.

The Optimistic Edge

Here’s where I land, and I’m genuinely somewhat optimistic about this: if generating candidates is cheap, the scarce skill shifts upstream.

The researchers who will matter most in this environment are not the ones who can generate the most proposals. They’re the ones who can pose the right questions, who can identify which problems are worth solving before they know the answer, and the ones who can build strong verifiers. Question-selection and verifier-design become the competitive moat. That’s a real and important skill. It’s harder to fake than generating a paper. And it’s the skill that the loop most depends on.

If I’m hiring for a lab in 2026, I’m not looking for “AI scientist” as a job description. I’m looking for people who can look at a domain and say: here is what a correct answer would have to look like, and here is how I would know if I found it. That’s what the verifier is. Building it is hard. It requires deep domain knowledge, epistemological clarity, and the kind of judgment that comes from years of thinking carefully about what you’re actually trying to know.

A Diagnostic Heuristic

Let me give you something practical. When you read an AI-in-science result, a press release, a paper, a breathless tweet thread, ask this: what was the verifier, and who built it?

If the verifier is Lean, or a crystal, or experimental replication with pre-registered protocols, or a physical measurement with known error bars: trust the result. The proposer’s reliability doesn’t matter much; the verifier caught the garbage. The finding is real regardless of how the candidates were generated.

If the verifier is not named, if the paper says “we used GPT-4 to generate hypotheses and evaluated them with GPT-4 to assess plausibility,” you’re looking at Galactica with extra steps. The proposer and the verifier are the same system. The loop is not closed. Be skeptical.

That diagnostic question is the institutional choice this moment is forcing. What was the verifier, and who built it? Let’s land it.

The Verifier Is the One That Matters

“I’ll have to revise my opinions about generative AI one of these days.”

Knuth said that. Not “AI did the science,” not “it’s just autocomplete.” Something more specific: I may have underestimated what this thing can do in the right loop. The loop did the discovery. The human owned the question and the verifier. The model was the proposer. And the result was real.

The Investment Gap

Here’s the uncomfortable institutional reality: almost all current investment is in the proposer. Larger models, more parameters, cheaper inference, better fine-tuning, faster generation. The race to make the proposer better is well-funded, well-publicized, and moving fast.

The verifier is comparatively neglected.

Formal verification tools like Lean exist, but they’re hard to use, require significant expertise, and don’t cover most scientific domains. Physical verification (A-Lab-style robotic synthesis) is expensive and slow relative to the speed at which the proposer can generate candidates. Experimental replication is underfunded as a scientific activity; it’s less prestigious than novel claims. The referee system in academic publishing was designed for a world where generating candidates was the hard part. It was not designed for a world where a model can generate ten thousand plausible candidates in an afternoon.

Hiring “AI scientists” misframes the institutional need. The need is for researchers who can pose hard questions and build reliable verifiers. The “AI lab member” frame points to what needs managing: not the model, but the loop. And the loop’s bottleneck right now is the verifier.

I should mention that this piece is drawn from the science chapter of Mostly Harmless AI, a book I’m writing about what AI actually does versus what the headlines claim. If you’re finding this useful, the book is where the longer argument lives.

What Comes Next

Next week: AI in creative output. Art, literature, music. Where the verifier question changes shape entirely. Because in mathematics and materials science, you can at least define what “correct” means. In creative domains, the verifier problem is not just harder. It’s constitutively different. What does it even mean for a creative proposal to be correct? I have some thoughts, and I think they’re going to make the science case look straightforward by comparison.

Stay curious.

P.S. There’s a Subscribe button somewhere below this. I’m told it does something useful. My understanding of subscription mechanics is below the level of a confident stochastic parrot. I believe it works, but I haven’t checked every step. Click it anyway. The verifier here is whether you keep showing up next Monday.

Subscribe now

Mostly Harmless AI - the book that explains you the AI without the bullshit

Alejandro Piad Morffis — Fri, 22 May 2026 11:03:09 GMT

The first draft of Mostly Harmless AI, second edition, is done. Large parts of it are new. The first edition shipped in August 2025, and I think it was good — honestly. But it was also a 2024 book living in a 2026 world: it predated the reasoning models, the agentic turn, and most of what happened to AI through 2025 and into 2026, which is a lot.

I read it again a couple of months ago and realized the map was wrong. So I rewrote it. Fortunately, most of the structural work survived — the chapters are in roughly the same places, the argument is the same argument — but the content is substantially updated, and the agentic AI material especially has gone from a footnote to a full chapter. The goal has not changed: a model strong enough to update on whatever comes next.

Get Mostly Harmless AI - 50% off

What’s in it

The book has three parts. It leans heavily into large language models and agentic AI because that is what most people encounter today when they encounter AI. But it starts from the ground up — from the symbolic systems of the 1950s — and the through-line is seventy years of the same argument running in the same field.

Part I — Foundations explains how these things actually work, mechanism by mechanism. Six chapters: classical AI, machine learning, deep learning, language modeling, generative AI, agentic AI. The arc walks from symbolic reasoning to the agent loops running in 2026, and what the reader comes away with is the actual mechanism under the systems they talk to daily. Attention, gradient descent, RLHF — not the marketing line, the actual mechanism. If you have wondered what these terms mean, Part I is where you find out.

Part II — Applications is about how to use these systems, and how the fields most people care about are using them. Seven chapters: a working-with-AI orientation, then knowledge work, scientific research, software development, education and learning, creative work, and policy and governance. Each chapter is dual-audience: an expert in that field gets meaningful good-practice and real gotchas; a non-expert — a policymaker reading the policy chapter, a parent reading the education chapter — comes away with a clear picture of how AI is reshaping that field. No code, no tutorials, no ten prompts to make ChatGPT love you. Just an honest account of what these systems do well, what they do badly, and what changes when you put one in front of someone trying to get real work done.

Part III — Dangers is the hard part. Three chapters: alignment, the limits of language models, and the actual risks of AI. By actual I mean the harms that already exist in the world right now — deepfake fraud, autonomous-weapons concerns, biased decisions made at scale, workplace disruption — not the science-fiction extinction scenarios that collect most of the airtime. The existential-risk chapter is in there, but it sits inside a broader catalogue and gets the proportional treatment I believe it deserves. Part III is also where the book’s central position lands most clearly: the future is not predetermined. Neither doomer nor utopian framings are right. What we do about AI from here is a choice, and choices carry responsibilities.

Three parts. Sixteen chapters. Around two hundred and sixty pages of body prose. Almost three hundred footnote citations to peer-reviewed papers, technical reports, primary sources, news articles, and the occasional blog post that mattered. Every claim I could cite, I cited. Every claim I could not, I named as my own opinion. The full bibliography is at the back.

Who it’s for

Anyone who needs a working model of AI without necessarily writing code for it.

That is a wide audience, and I have tried to keep the prose accessible across all of it: knowledge workers thinking about how AI changes their jobs; educators thinking about how it changes their classrooms; policymakers thinking about how to regulate it without breaking it; parents thinking about what to tell their kids; entrepreneurs thinking about where the real opportunities are versus where the hype is; journalists, lawyers, doctors, executives, students of every stripe. The book assumes you are curious, reasonably literate, and not much else. The mathematics is very light if at all. The metaphors are tactile. The technical terms get defined the first time they appear.

Coders will find one chapter aimed directly at them — AI for Software Development in Part II — and they will find Part I interesting if they have not stopped to think about how the systems they use every day were trained. But this is not the book for coders. The book for coders is How to Train your Chatbot, the companion volume in the Computist Library. MHAI is the what is happening and why does it matter book; HTTYC is the how do I build this stuff book. Read Mostly Harmless AI first if you want the conceptual ground; jump to HTTYC if you are already comfortable with the concepts and want the engineering. Both are available in the Computist Library bundle if you want the pair at once.

That said — both books, all my books, everything I build — comes free to read online first. Which brings me to how this one works.

Why I made it

I have worked in AI for fifteen years and taught it for ten. I have been building AI-based tools for most of that time. What I have wanted to write for most of that period is a book that an intelligent reader — not an AI engineer, not a believer, not a doomer, just someone who needs to think clearly about what is happening — can read once and walk away with a model for what comes next. I did not have that book when ChatGPT launched. I did not have it when the agentic turn happened. I spent the last two years writing the version I wanted.

The short version of why that book matters: the AI conversation right now is dominated by two narratives that are both wrong. The first says AI will solve everything. The second says it will kill us. Most of the working reality is between them, and “between them” is not the same as “both extremes are partly right.” It is a third position, with its own shape and its own commitments. Mostly Harmless AI is my best attempt to lay that third position out clearly enough that you can argue with it.

Here is what the access model looks like, and what your support funds if you choose to give it.

Access and support

Mostly Harmless AI is free to read in perpetuity. Not a preview — the full book, all sixteen chapters, from chapter one to the bibliography — at books.apiad.net, in a reader I built specifically for these books. Clean typography. Dark mode that does not hurt. Footnotes that surface inline. No tracking, no required signup. This is how all my books work — and my open source, and these posts. The knowledge comes first. The access model is not a funnel.

Yes, asking you to pay for something that is also free is a slightly awkward pitch — I know. But here is the honest framing: the PDF and EPUB are on Gumroad for those who prefer to read offline or want to own a copy. Buying it is a gesture of support, not a paywall. The work is not free to make. The research, the rewriting, the next book — that is what your support actually funds.

Part of the proceeds from the second edition funds the Spanish translation of Mostly Harmless AI. That is a real thing I am working toward, and your purchase moves it forward.

And if you genuinely cannot pay: please, take it free. I would rather you have the knowledge than the gesture.

Where to find it now

The draft is at books.apiad.net right now, mostly readable as-is. The structure is settled, the prose is in place, and most chapters have been through at least one editing pass. The next two weeks are the final polishing phase — a full read for consistency, a copy edit for the sentences I have not caught yet, a few diagrams I want to redraw. The finished book ships on May 31. Gumroad for the PDF and EPUB, books.apiad.net for the free reader, both at once.

If you find something off while reading the draft, write to me. The May 31 edition will be better for it — and yes, I am aware of the mild absurdity of writing a book about AI while using AI to help write it. The ideas are all there. The 31st is just the cleaner version.

Get Mostly Harmless AI - 50% off

The 80% AI Reliability Horizon

Alejandro Piad Morffis — Thu, 21 May 2026 14:28:29 GMT

Adapted from Friedrich, “Wanderer above the Sea of Fog” (c. 1818), Kunsthalle Hamburg — the horizon you can see is not the horizon you get to stand on. Rendered with Nano Banana 3 via mosaico.

Every post on the blog this month is on the theme of agent reliability, anchored on the second edition of Mostly Harmless AI — 50% off during early access — where the Limitations chapter walks all seven failure modes that compound into the curve below. You can also read the whole book online for free. More at the end.

The headline number you’ve seen on every AI-progress chart — “model X completes two-hour tasks half the time” — is the 50% reliability horizon. That number is moving fast. Doubling every seven months, per METR’s time-horizon work. It’s the curve on every AI-progress chart, the one conference talks lean on, the one that lands in investor decks.

The number that decides whether you can actually deploy an agent is a different one.

The 80% reliability horizon — the task length at which an agent finishes well enough that you would not feel the need to check — sits 70–80% below the 50% figure, and it moves up far more slowly. That gap is the difference between demo and deploy. The 50% is what passes the eval. The 80% is what survives the afternoon you weren’t watching. Not two hours. Thirty minutes you’d hand off.

I want to be precise about what I’m not arguing. I’m not arguing agents are broken or that AI progress isn’t real. It is real, and it’s fast. I’m making the narrower claim that the 50% and 80% horizons move at different speeds, and that the 80% is the one that matters when someone else’s data is on the line. This post is the math behind the gap, why it’s structural, and what you can do about it.

If you’re building on agents, you are building on the 80% horizon. The 50% number is for the marketing deck.

Subscribe now

Two horizons, two stories

METR’s methodology is clean: take a population of tasks with measured human completion times, then find the longest task a given model clears at success rate X. Do that for X = 50%, and you get the 50% horizon. Do it for X = 80%, and you get a different curve. A different story.

The 50% horizon has been doubling roughly every seven months. Late 2025, it sat around a couple of hours for software tasks. That’s the curve that makes headlines. That’s the curve you’ve seen on every slide. Striking.

The 80% horizon sits roughly 70–80% below. The same agent that clears a two-hour task half the time clears a half-hour task four-times-in-five. Not two hours. Thirty minutes. And that gap doesn’t close at the same rate. It moves slowly, stubbornly, for reasons that are mathematical before they are engineering.

So you have two curves growing at different speeds. The 50% horizon is the curve of capability: what can this system do, under ideal conditions, at least sometimes. The 80% horizon is the curve of trust: what can this system do reliably enough that you’d hand it a production key and walk away.

They are not the same curve. And they do not close the same way.

The longer your task horizon, the wider the gap between can-do-it-sometimes and can-be-trusted-with-it. The mechanical reason is one piece of math.

Probability arithmetic

Here’s the setup. A language model you’re calling spends a fixed compute budget per output token. Each step in a multi-step process has some per-step success probability p that is strictly less than one. The model is stochastic, the world is noisy, context degrades.

String n steps together, and the probability that all of them succeed is roughly pⁿ. That’s it. That’s the math.

Here’s what that looks like with actual numbers. Suppose your agent is excellent: p = 0.99 per step. That’s a 99% success rate on any single action. Compound it over 100 steps: 0.99¹⁰⁰ ≈ 0.37. You’ve gone from near-certain to worse-than-a-coin-flip without anyone making an error. Now drop to p = 0.95 (still quite good, still 95% per step). Over 100 steps: 0.95¹⁰⁰ ≈ 0.006. Six in a thousand runs succeed.

This is not a gap you close with next year’s training run. It is the shape of any probabilistic process operating in sequence over time. The curve doesn’t change when you improve p; it just shifts outward.

Reasoning models — the o-series, R1, extended-thinking variants — are valuable here. They buy you a higher per-step p, and they let you spend more steps at that higher rate. Both matter. But they push the curve outward. They do not change its shape.

Two pieces of evidence you should hold next to that math. GSM-Symbolic (Apple, 2024): perturb a math problem the model has seen (swap a name, change a number) and accuracy collapses even when the reasoning path is identical. The model has memorized the route, not the reasoning. Faith and Fate (Microsoft, 2023): transformer accuracy degrades with computational-graph depth even when each individual sub-step is solvable in isolation. Depth itself is the failure axis. More steps means more surface for p < 1 to accumulate.

Reasoning models buy you a higher per-step p and more steps to spend. They do not change the shape of the curve.

Where the chain gets long

Agents are exactly the setup that makes pⁿ painful.

Think through a typical agent run: read prompt, plan, call tool, read result, call tool again, critique output, adjust plan, call final tool, write response. Seven steps if you’re being generous. A real production agent reaches hundreds. Each step is one more p rolled. Each tool call is one more chance the orchestrator hands the tool the wrong arguments — garbage in, deduction out.

Self-critique doesn’t repair this — and you can verify the result yourself if you’ve tried it. Huang and colleagues (2024) showed that intrinsic self-correction without an external oracle signal actually degrades performance. The model talks itself out of correct answers as often as it talks itself in. The paradox is clean: if the model could recognize the error, it would not have made it. Asking it to introspect on failures is asking the broken compass to check itself.

So let’s put numbers on a real scenario. An agent that succeeds on each of five steps 95% of the time lands at 0.95⁵ ≈ 0.77. Decent. Not great, but workable. Now extend that same agent to a fifty-step trajectory: 0.95⁵⁰ ≈ 0.08. Eight runs out of a hundred finish correctly.

The demo ran five steps. The deploy runs fifty. The demo and the deploy are two different machines.

That’s the 80% horizon you’ll actually feel in production. It’s not a philosophical concern about AI reliability in the abstract. It’s the arithmetic of what happens when you take a stochastic generator and ask it to maintain a chain of reasoning over a long enough trajectory that pⁿ has time to do its work.

What you can actually do

Three mitigations. Each one genuine, and each one with a ceiling you should know before you commit.

Verifier-shaped tasks. Where the output can be checked deterministically (arithmetic, code that compiles and runs, SQL that parses, a formal proof), you can recover trust that the probabilistic generator alone cannot provide. A SAT solver beats an LLM on deductive closure every time. The architecture that wins here is LLM-proposes-candidate, deterministic-system-signs-off. The generator explores the space; the verifier approves the exit. This is, incidentally, the same pattern Monday’s post on the seventy-year argument named: a deterministic shell around a stochastic core, applied at the task level rather than the system level. The twist is that not every task has a fast verifier. Code that runs is checkable; code that runs correctly for all future inputs is not.

Retrieval-augmented generation. If the fact your agent needs is no longer arbitrary recall but lives in a curated document the model is required to cite, then Kalai and Vempala’s 2024 lower bound on calibrated hallucination does not apply to that fact. Most agent failures upstream of a tool call are recall failures the agent doesn’t know it’s making; retrieval changes the error mode from confident confabulation to visible gap. RAG turns a free-running generator into a paraphrase-and-summarize system over a known corpus. The reach of the system is now bounded by the reach of the index. But anything outside that index is back to pure p < 1 territory.

Narrow the horizon. The cheapest move is the one nobody wants to make: don’t deploy your agent on a fifty-step trajectory. Cut it to five. Hand off to a human at the boundary. At five steps with p = 0.95 you’re at 0.77; at fifty steps you’re at 0.08. That’s not a small difference. That’s the difference between a tool that works and a demo that occasionally works. Now, this trades autonomy for reliability. That trade is worth making in most production contexts right now. Whether it’s worth making in your context is a product question, not a research question.

Watching the right number

The 50% number will keep doubling and you should track it. That is real progress and worth watching closely.

But it is not the number your users feel. The number your users feel is whether the agent finished their task well enough that they didn’t have to re-run it, check its work, or clean up after it. The difference between “I tried that AI agent thing and it was magic” and “I tried that AI agent thing and it broke my Friday” is roughly the distance between the 50% horizon and the 80% horizon at your task length.

The shape of the next several years of agent engineering is already visible in the mitigations you’ll be reaching for: deterministic verifiers around stochastic generators, retrieval around recall, short trajectories with human handoffs where the math demands it. Not because agents are weak. They’re remarkable. But the pⁿ curve doesn’t care about benchmark scores. It cares about chain length.

One number, slowly creeping upward, every quarter. Watch that one.

Until next time, stay curious.

If the 80%-horizon framing landed, the second edition of Mostly Harmless AI walks the seven failure modes that produce the curve — the calibrated-hallucination lower bound, the U-shaped attention curve, the reversal curse, the depth ceiling on deduction, the rest. 50% off during early access. You can also read the whole thing online for free in a custom reader I built — dark mode, font controls, offline support, the works.

Get Mostly Harmless AI - 50% off

And if you want everything I’ve written, plus everything I’m going to write, that’s the Compendium. One purchase, in perpetuity.

It's Tokens all the Way Down

Alejandro Piad Morffis — Tue, 19 May 2026 10:26:38 GMT

Part of the run-up to the second edition of Mostly Harmless AI — 50% off during early access — where this is the spine of a new chapter on generative and multimodal AI. You can also read the whole book online for free. More at the end.

One morning, not so long ago, perhaps you asked Claude (or Gemini, or ChatGPT) to do something for you, and decided it was easier to just give it a picture of it than explain the whole thing. Perhaps it was “how do I cook this thing?” or “what building is that?” or “do this homework for me, please, please, my live depends on it”. Then you uploaded the picture, and back came a textual response.

Not happy with what the bot understood, you decided a thorough explanation was owed. But, alas, since all we got is a couple fatty fingers for typing, you decided it was best if you explained it with your own voice. And again, uhms and ehms notwithstanding, you again got a full response back, this time with an audio voice over.

Ten years ago, this simple dance of back-and-forth multimodal information would have required four separate research fields, each with its own conferences, its own vocabulary, and its own priesthood. They have quietly become one single thing. It’s all tokens all the way down. Language has subsumed all modalities. This is how.

Subscribe now

The recipe never cared what it was eating

Strip “generative AI” down to the one idea doing all the work and you get a single sentence: look at a big pile of examples, learn the distribution that produced them, then draw new samples from it. That is the whole trick. It is what a language model does, and it is the only thing a language model does. It is also what an image model does, and an audio model, and a video model.

The recipe is indifferent to what the examples are. Text is a one-dimensional run of symbols. An image is a two-dimensional field of colour. Audio is a pressure wave sampled tens of thousands of times a second. Video is all of that, plus time, which is why it is the hardest. Four different shapes of data, one identical question asked of each: given what I have seen so far, what plausibly comes next? The machinery that answers that question does not need to know whether “next” means a word, a patch of pixels, or a slice of waveform. It only needs the data turned into a sequence of countable things.

Tokens.

So the thing we have been calling a language model was never really about language. Or, put better, it was never about written language. It turns out, language is something far more powerful.

Ask any linguist, and they’ll say any set of sequences of distinct symbols (tokens) can be modelled as a language. It doesn’t matter what your symbols are—letters, words, patches of images, numbers in a math formula, whatever—language is just the structure around them, what makes some sequences valid and others nonsense.

This is the key idea. All else is (incredibly good) engineering.

A decade of building the same machine, separately

It did not look that way while it was happening. For about a decade every modality got its own bespoke contraption, and each one looked like its own discipline.

Image people had generative adversarial networks: a forger and a detective locked in a training duel, the forger getting better at faking until the detective could no longer tell. The beautiful idea buried in there — and the one that survived the technique itself — was the latent space: a compressed interior map of “all possible faces,” where walking in a straight line morphs one plausible face smoothly into another. GANs were temperamental, prone to collapsing into a single good fake and refusing to leave, and by the early 2020s they had lost the lead. The latent-space intuition outlived them and runs underneath everything that came after.

Then diffusion took over image generation with a trick that sounds like it shouldn’t work. Take a real photo, add a little static, add a little more, keep going until it is pure snow. Now train a network to undo one step of that. To make a new image, start from snow and run the undo, over and over, until something coherent surfaces. It is sculpture by removing noise instead of removing marble, and it is what powers essentially every image generator you have used.

Audio had its own separate lineage: speech-to-text built one way, text-to-speech another, music a third. Text had the large language models, off in their own enormous-budget corner of the field. Four communities, four sets of architectures, four sets of war stories. If you had asked, in 2021, whether the image people and the language people were building the same machine, both sides would have laughed.

CLIP quietly knocks out the wall

The crack in the wall came from a 2021 model whose job sounds almost too modest to matter: teach one system that the word dog and a photograph of a dog are talking about the same thing.

The way you do that is to train a text encoder and an image encoder together, on hundreds of millions of caption-and-picture pairs, with one instruction: put a picture and its true caption close together in a shared space, and shove mismatched pairs apart. What you get at the end is a single space where “a photo of a golden retriever” and an actual photo of a golden retriever land as neighbours. Text and pixels, in the same room, with the same coordinates.

That sounds like a party trick for image search. It was the hinge the whole field turned on. Once text and images live in one space, text can steer image generation — point the diffusion process at the region of the space that means “golden retriever in a spacesuit,” and let it denoise toward there. Every text-to-image system you have used is, under the paint, that move. And the deeper implication was harder to ignore than the application: if you can put two modalities in one space, the wall between them was never structural. It was just a wall nobody had walked through yet.

Tokens all the way down

Here is where it lands. By the mid-2020s the bespoke machines stopped being separate machines.

The move is almost embarrassingly direct. Tokenise everything. Text already broke into tokens. Cut an image into a grid of patches and treat each patch as a token. Run audio through a neural codec that emits discrete chunks, and those are tokens too. Now you do not have a text stream and an image stream and an audio stream. You have one stream of tokens that happen to have come from different alphabets, and you train a single model on the only objective that was ever in play: predict the next token, whatever kind it is.

A model trained that way reads and writes everything, because to it there is no “everything” — there is just the sequence and the next position in it. You have used these. The one that holds a spoken conversation with sub-second latency, looks at the photo you paste in, and writes you a paragraph back is not a language model bolted to an image model bolted to a speech model. It is one model that was never told these were different problems.

Which is why the question that organised the field for a decade — is this a language model or an image model? — has quietly stopped having an answer. It is the same machine. The only thing that ever changed between text and pixels and sound was the alphabet, and the transformer emitting the next token has never cared which alphabet it is spelling in. It is tokens all the way down. “Language modelling” was a local name for something with no allegiance to language at all: modelling sequences of anything we can count.

The honest part

It would be easy to end on the astonishment, and the astonishment is real. One model, every modality, falling out of one stubbornly simple objective applied to a wider and wider definition of “token” — that is one of the genuinely beautiful results of the decade, and the kind of unification that does not come along often.

But unification is not the same as understanding, and I am not going to let the elegance smuggle that past you. A system that can place “dog” next to a dog in its latent space has learned the statistics of how dogs are described and depicted.

Whether it has learned what a dog is is a different question, and the convergence story does not answer it. It just makes the question apply to every modality at once instead of only to text. The machine got more general. It did not get more grounded necessarily. Both of those are true at once, and the interesting work of the next few years lives in the gap between them.

Until next time, stay curious.

This is the core argument of a new chapter in the second edition of Mostly Harmless AI — the full chapter walks GANs, diffusion, CLIP, audio, and native multimodality with the scenes and citations this post had to cut, and it is 50% off during early access. The whole book is also free to read online. If you want the rest of the argument — how these systems are trained, where they break, and what to actually do about it — that is what the book is for.

Get it (50% off) for life

70 Years of AI History in 10 Minutes

Alejandro Piad Morffis — Mon, 18 May 2026 11:40:34 GMT

Raphael, “The School of Athens” (1509–1511), Apostolic Palace — Plato points up to the eternal forms (the rule-followers); Aristotle’s palm presses down to the empirical world (the pattern-finders).

Every post on the blog this month is on the theme of agent reliability, anchored on the second edition of Mostly Harmless AI — 50% off during early access — where the history below is the full ~8,000-word opening chapter, with 70+ references and all the scenes this post had to cut. You can also read the whole book online for free. More at the end.

Seventy years ago, two men sat in two different rooms and disagreed about what a thinking machine should look like. Neither of them has been proven right. Both have been proven half-right, several times, in alternation, for the whole of my lifetime and most of yours.

I think the entire history of AI is that one argument, still going.

The first camp wanted to build minds out of rules. Feed the machine enough knowledge in a logical enough form, and reasoning falls out of the logic. It called itself a lot of things over the decades — symbolic AI, knowledge-based systems, good old-fashioned AI — but its home is rationalism. The second camp wanted to build minds out of examples. Feed the machine enough data, in any messy form whatsoever, and behavior falls out of the statistics. It also kept renaming itself — connectionism, machine learning, deep learning — but its home is empiricism. Same goal, a machine that does what intelligent people do. Sixty years of disagreement about how.

Here’s the ending, spoiled early, because this isn’t a thriller. The argument did not produce a winner. It produced a marriage. The chatbots, the image generators, the agents writing code while you sleep — none of them is one side beating the other. They’re both sides, finally forced to share a workshop. Let me walk you through how we got there. Fast.

Subscribe now

Both seeds, one summer

They were planted within five years of each other. In 1943, McCulloch and Pitts wrote down a neuron as a weighted sum with a threshold — twelve pages, the seed of the empiricist branch. In 1950, Turing refused to define thinking and proposed a behavioral test instead, a question both camps could chase. In 1956, ten people spent a summer at Dartmouth, coined the phrase artificial intelligence, and planned to crack language and reasoning in a few months. (We are still working on it.) In 1957, Rosenblatt built the Perceptron, the first machine that learned from examples, and the New York Times announced it would soon walk, talk, and be conscious of its own existence.

Two foundational myths, in the ground, in the same decade. The rest is which one got watered.

The rationalists win the first round

And they win it convincingly. In the 1950s and 60s compute is tiny and data, in the sense of millions of labeled examples, does not exist.

What you can do is write a program that does something specific and inspect every step of it. So the symbolic camp gets the better results and the better tools. Newell and Simon’s theorem-prover. McCarthy’s LISP. Weizenbaum’s ELIZA — four pages of pattern-matching that understood nothing, and that people confided in anyway. (Hold onto ELIZA. The field will relearn that exact lesson about six more times.) Winograd’s SHRDLU, fluent and thoughtful inside a closed world of colored blocks.

The catch was always the world. SHRDLU’s blocks could all be known, listed, reversed. The real world has rain, and grandparents, and the smell of coffee, and you cannot list it. In the closed world of symbols, symbols were enough. The next decade was about discovering, painfully, that the world is not closed.

The cost of winning too hard

In 1969, Minsky and Papert published Perceptrons and proved a single-layer network can’t compute XOR.

The proof was correct. It was also narrow — they admitted multi-layer networks could do it, nobody just knew how to train them yet. But the field was hungry for a verdict, and it read the book as one. Funding for neural networks collapsed. Rosenblatt died two years later in a boating accident, on his 43rd birthday. The algorithm that would resurrect his branch didn’t arrive at scale until 1986. Seventeen years of silence.

Modern AI runs on the work of people who weren’t born when Minsky and Papert published. The reason their work came so late is that the field they’d return to had been kept near-dead for two decades. The symbolic camp’s victory was real. The field paid for it. It will pay that bill again.

The rationalist trap

Through the 1970s and 80s the symbolic branch found something that made money: expert systems.

MYCIN matched infectious-disease specialists. XCON saved DEC tens of millions a year. The thesis was clean and seductive — intelligence is rules plus facts; hire the expert, extract the rules, ship the system. And these systems were legible. You could read every rule, audit the reasoning, fix the wrong line. (Your favorite large language model cannot do this. We’ll come back to that another day.)

Two problems killed it. Common sense turns out to be unrepresentable in rules — birds fly, except penguins, except baby penguins, except dead ones — and the rules contradict each other faster than you can write them. And then there’s Cyc: in 1984 Doug Lenat set out to hand-encode all of common-sense knowledge, estimated ten years, and is still at it forty-two years later. It is the most thoroughly humbling monument in the history of cognitive science.

By the late 80s the money dried up and the Second AI Winter set in. The field was tired of the rationalists.

The empirical rebellion

We’re on 1986: backpropagation, in Nature, multi-layer networks are finally trainable.

Then the empiricist branch spends fifteen years not scoring one big win but a thousand small ones. Support vector machines. Random forests. Boosting. Statistical methods quietly eating one application after another, including the symbolic camp’s home turf — language, where IBM’s speech team found that every time they fired a linguist, the system improved.

Why now? Three things are moving together, slowly. Compute grew. The internet started producing data in volumes nobody had imagined. And the methods were simple enough to scale with both.

In 2019 Richard Sutton would name this The Bitter Lesson: across seventy years, the general method that scales with compute beats the clever hand-engineered one, every time. It’s bitter because it tells researchers their hard-won taste gets steamrolled by someone with more GPUs. It is mostly right.

The thing that complicates it is the thing symbolic AI was good at all along — but I’m getting ahead of myself.

The earthquake

Now jump to September 2012.

AlexNet — eight layers, two gaming GPUs, a couple of training tricks — drops the ImageNet error rate ten points below the nearest hand-engineered system. A ten-point gap isn’t an improvement. It’s a different category of result. Within six months every computer-vision lab on Earth has pivoted. AlexNet is, by a wide margin, the single most consequential paper in modern AI.

Then it cascades, almost too fast to track. Sequence-to-sequence translation. GANs. Atari from raw pixels. In 2016 AlphaGo beats Lee Sedol at a game with more board positions than there are atoms in the universe — and almost nobody notices that inside it is a deep network (empiricist) wrapped around a tree search (symbolic). The marriage is already there, in 2016, hiding in plain sight. In 2017, “Attention Is All You Need” introduces the Transformer, and every model in your chat window today descends from that one paper.

The crown jewel nobody talks about

The most consequential AI system of the modern era is not a chatbot. It doesn’t write poems. It’s in the bloodstream of structural biology.

Predicting a protein’s 3D shape from its amino-acid sequence is a fifty-year-old problem. The rationalist approach — simulate the physics — was beautiful and almost completely intractable. For twenty years the field’s hardest benchmark plateaued at a score around 40. In 2020, DeepMind’s AlphaFold 2 scored above 92 on that exact tier. The grand challenge was, for practical purposes, solved. Hassabis and Jumper got the 2024 Nobel in Chemistry for it — the only AI work so far to produce a Nobel-level scientific breakthrough.

Read the citation. It isn’t about AI as a technology. It’s about a problem that got finished while the people whose careers were defined by it slept. The chatbots get the headlines. The image generators get the lawsuits. The protein folder got the world. Remember that the next time someone wants to tell AI is ChatGPT.

The synthesis

Now the marriage. The word agent did not come from machine learning. It came from classical, symbolic AI in the 1970s and 80s: a system that perceives its environment, deliberates, picks an action, acts, observes, loops. The architecture was right. The brain was missing. Pure symbolic computation could never model a world with grandparents and coffee in it, so the agent shell sat there for decades, structurally correct and operationally empty. Cyc, again, is the long sad proof.

The empiricists borrowed the same word in the 2000s — in reinforcement learning, an agent is a learned policy. DQN was an agent. AlphaGo was an agent. A new brain, slotted into the old shell. Spectacular, and narrow. An AlphaGo cannot make you a sandwich.

In 2024 the cognition slot gets filled a third time, by a general-purpose reasoning language model. The shell is still the seventy-year-old symbolic frame: perceive, deliberate, pick an action with a name and a meaning — read_file, run_tests, send_email — act, observe, loop. The brain is now an LLM. From the empiricist side the system inherits flexibility: it has read enough of the world that you don’t have to tell it what a file is, or what an angry customer sounds like. From the symbolic side it inherits structure: the actions have names, the consequences are bounded, the trajectory is auditable. The model can hallucinate; the system can’t run rm -rf unless somebody wired that action in and granted it.

The 1970s symbolic agent could never reason. The 2010s RL agent could never generalize. The 2026 agent does both — badly, often clumsily, but for the first time at the same time. You can watch this happen most clearly in software development right now: a language model at the core, a harness of tools around it, a test suite as the verifier, a human reviewing the diff. All four layers, on a laptop, today. Software is the canary. The same pattern is already moving toward research, then education, then everything whose feedback loops are fast enough.

So here is the closing claim, the techno-pragmatist version of the whole story. The seventy-year argument did not produce a winner. It produced three layers: a learning substrate that absorbed the written record of humanity, a symbolic shell that makes it accountable, and a human frame that decides what the whole thing is for. The first two are engineering. The third is the only one that was ever really about us.

The synthesis exists. What we do with it is still up to us.

Until next time, stay curious.

This post is the speedrun — the book’s ~8,000-word opening chapter compressed down to its spine. The full version in the second edition of Mostly Harmless AI has every scene with its characters, the eras this post skipped, 70+ references, and the agentic stack the rest of the book then takes apart mechanism by mechanism. It’s the book I wish someone had handed me when I was trying to make sense of the noise — and it’s 50% off while it’s in early access. You can also read the whole thing online for free in a custom reader I built and am rather proud of: dark mode, font controls, progress tracking, offline support, the works.

Get it (50% off)

And if you want the whole catalog of everything I’ve written, plus everything I’m going to write, that’s the Compendium. One purchase, in perpetuity.

How Large Languages Models Are Really Made

Alejandro Piad Morffis — Mon, 11 May 2026 15:51:34 GMT

After Joseph Racknitz’s 1789 cutaway engraving of Wolfgang von Kempelen’s Mechanical Turk; generated with Nano Banana 2.

You type a message to an AI assistant and it answers. The answer isn’t looked up. It isn’t scripted. The model generated it, character by character, guided by a single mathematical question: what comes next?

That question is the foundation of every language model ever built. A language model is a probability distribution over text — a function that, given a sequence of words, assigns a probability to every possible continuation. “The cat sat on the mat” scores higher than “the mat sat on the cat” not because a language model understands what cats do, but because the first sequence appears in human text and the second doesn’t. The model has compressed the co-occurrence patterns of an enormous corpus into its weights, and that compression is what produces a score.

Generative: you give the model a prefix, it samples the highest-probability next token, appends it, samples again, and repeats until a stop token arrives. The multi-paragraph response you got from ChatGPT this morning was that loop running a few hundred times. No lookup table. No if-else tree. No pre-scripted answers. Just: given all of this text, what is most likely to come next?

Here’s the thing I find quietly strange about this: it works. A procedure this simple — assign probabilities, sample the maximum — has produced the most influential technology of the last decade. What makes it work isn’t the procedure. It’s everything that goes into building a probability distribution that’s actually good. Good enough to write coherent paragraphs. Good enough to reason about code. Good enough to pass the bar exam and explain quantum mechanics in language your parents can follow.

Getting there took decades of compounding ideas. The arc is what this piece covers — from the crudest possible approximation of “probability over text” to the current frontier, where models are learning to think.

Each section of what follows is best understood as a response to the failure of the previous one. N-gram models worked until they didn’t. Neural embeddings fixed the part that broke. Pretraining scaled the fix to the size where it became genuinely impressive. Instruction tuning made the result useful for the first time. Preference learning fixed what instruction tuning couldn’t. Reasoning models added something nobody was sure was trainable at all.

Seven steps. One direction.

Every post on the blog this month is on the theme of agent reliability, anchored on the second edition of Mostly Harmless AI, where the engineering details that don’t fit a blog post live. You can also read the whole book online for free in a custom reader I built. More at the end.

This post may be truncated in your email. Read it online for the best experience.

Subscribe now

The n-gram intuition

The simplest possible implementation of “probability over text” is a lookup table.

Take a large corpus — a hundred million words will do to start. For every three-word sequence (trigram) in that corpus, record which word follows it most often, and with what frequency. “The quick brown” → “fox,” nine times out of ten, because Project Gutenberg is full of that particular sentence. “The capital of” → “France” thirty percent of the time, “Germany” twelve, “England” eleven, and so on through the geography. For every trigram you’ve seen, you have a probability distribution over what comes next.

A trigram model is a frequency dictionary: each three-word context maps to a distribution over possible next words. The catch is that a 50,000-word vocabulary admits ~126 trillion possible trigrams, most of which never appear in any corpus. Made with tesserax.

Now generate text. Give the model “The quick brown” and it continues: “fox.” Give it “jumped over the” and it continues: “lazy.” Three words in and it’s generated “The quick brown fox jumped over the lazy” — and if you’re lucky, it lands “dog” and you’ve reproduced a famous sentence entirely from corpus statistics. Locally, it’s plausible. You could read a sentence of this and not immediately know you’re looking at a machine.

The problems start fast. By the third sentence, the model has no idea it was talking about a fox. It only remembers the last three words. You ask it to continue “The fox had been running from the” and it has no idea that a fox is involved, or that running happened, or that there’s a pursuit in progress. It just has three words and a lookup table. The output is statistically English. It is not coherent.

This is the Markov assumption: the next word depends only on the last N words, not on the full history of the text. For N=3, it’s a trigram model. You can increase N — five-gram models were standard in commercial speech recognition for years — but the table explodes. Fifty thousand words in the English vocabulary gives 50,000³ possible trigram contexts, roughly 125 trillion entries. At N=10, the number of possible sequences exceeds the estimated number of atoms in the observable universe. The table can never be complete enough to cover the distribution.

There’s a real engineering solution to the “we haven’t seen this exact trigram” problem: smoothing and interpolation. Estimate the probability of an unseen N-gram from shorter sub-sequences. Hidden Markov models formalised this in a probabilistic framework that, by the 1990s, had enough polish to power industrial speech recognition and early machine translation. I don’t want to undersell it — it worked. It was genuinely useful. It just topped out.

The wall is fundamental. Real language has dependencies that can be arbitrarily long. “The man who sold the car that had been parked in front of the house where my grandmother lived was finally found.” The subject of “was finally found” is seventeen words and three nested clauses back. No N-gram model reaches it. You need something that can condition on the full context — or at least compress the full context intelligently — rather than amnesiacally forget everything more than N words ago.

You need a model that generalises from sequences it has seen to sequences it hasn’t. A lookup table can only interpolate from what it’s seen before. What you need is something that has understood the pattern deeply enough to extrapolate.

N-gram models work until they don’t — and they don’t beyond a few words.

Words as numbers

Neural networks can learn the compression n-gram models can’t achieve. But they have a hard prerequisite: they operate on numbers. Words are symbols. Before a neural network can do anything useful with text, you need to represent words as vectors. The naive approach throws away everything that matters.

The obvious first attempt is one-hot encoding. Vocabulary of 50,000 words; each word is a vector of length 50,000 with a single 1 and 49,999 zeros. “Cat” is at position 4,312; “dog” is at position 17,846; “carburetor” is somewhere else entirely. The problem: nothing in this representation suggests that “cat” and “dog” are more similar to each other than either is to “carburetor.” The distance between every pair of one-hot vectors is identical. You’ve handed the network a symbol system with no structure, and it has to reconstruct the structure from scratch — spending enormous capacity learning that cats and dogs are both animals, that both appear near “fur” and “vet,” that “cat food” and “dog food” are structurally related — before it can learn anything about how language actually works.

The key insight that resolved this came from linguistics, not machine learning, and I think it’s underrated as an idea. J.R. Firth, writing in 1957: “you shall know a word by the company it keeps.” The distributional hypothesis. Words that appear in similar contexts — near similar neighbouring words, in similar grammatical positions — tend to have similar meanings. “Cat” and “dog” both appear near “pet,” “feed,” “vet,” “owner,” “fur,” “collar.” The context is a fingerprint of the meaning. Encode that fingerprint in a vector and you have a representation where similar words land close together in space.

Word2Vec (Mikolov et al., 2013) turned this into a training procedure. Train a shallow neural network to predict a word from its surrounding context words, or vice versa. Force each word’s representation down into a dense vector of, say, 300 floating-point numbers. Train on a billion words of text. The network learns that words appearing in similar contexts should have similar representations, because that’s what makes the prediction task cheaper. Words with similar distributional patterns end up with similar vectors — not because anyone programmed that, but because it follows from the objective.

The result that made people pay attention: vector arithmetic encodes semantic relationships. Take the vector for “king,” subtract the vector for “man,” add the vector for “woman.” The nearest vector in the resulting space is “queen.” Paris minus France plus Italy is approximately Rome. Try it yourself: it works because the structural relationship between “king” and “queen” is parallel to the relationship between “man” and “woman” in how the four words co-occur with everything around them. No one wrote these analogies in. The geometry of the space mirrors the structure of meaning, because both are implicit in how words appear together in natural language.

The relationship between king and queen is parallel to the one between man and woman — both fall out of how the four words co-occur with everything around them. Made with tesserax.

I find this genuinely strange, in the best possible way. You trained a network to do a simple word-guessing task on flat text, and the side effect was an algebra of concepts. The geometry was always latent in the co-occurrence patterns. Word2Vec just made it legible.

Modern language models don’t use Word2Vec as a separate preprocessing step — the embedding representations are learned jointly with the rest of the network during training on text. But Word2Vec’s intuition is why learned embeddings work at all. Once language is geometry, gradient descent has a surface to grip. You can compute distances, optimise them, stack arbitrarily deep networks on top, and train the whole thing end-to-end.

Embeddings are how we lie to neural networks in a useful way. We pretend words are points in space, so the math works out.

Pretraining

Now scale it.

Take a deep neural network — not the shallow two-layer thing in Word2Vec, but a transformer with dozens or hundreds of layers, billions of parameters, and an attention mechanism in every one of them. Feed it next-token prediction across the entire accessible internet: Wikipedia, GitHub, every book ever digitised, every forum thread, every research paper, every recipe, every political argument, every user manual for every piece of machinery ever manufactured. Same objective the n-gram model had: given what came before, what comes next? Except now the model has billions of parameters to compress the patterns into, the training signal is trillions of tokens, and the architecture is built to handle arbitrarily long context.

The architecture is what made everything else possible. The transformer (Vaswani et al., 2017) uses self-attention as its core operation. For each token in the input, self-attention computes relevance weights over every other token in the sequence — learned weights, computed from the data, different for each token, different in each layer. A pronoun can attend strongly to the noun it refers to, twenty positions back. A closing argument can reach back to the premise from the opening paragraph. There is no fixed window; the model considers, in principle, the full context at every step.

This is what broke the n-gram scaling wall. Not a bigger lookup table. Not smarter interpolation. A learned, flexible attention mechanism that compresses long-range dependencies into the model’s weights rather than trying to enumerate every possible context sequence. The key property, and it’s the one I keep coming back to: soft. Self-attention doesn’t pick one relevant token; it blends all of them with learned weights. The whole sequence contributes to every prediction, with a learned notion of how much each part matters.

The other critical property is self-supervised learning. There are no human-provided labels anywhere in pretraining. The text itself is the training signal. Show the model “The capital of” and ask it to predict “France.” It’s wrong; the gradient flows; the weights update. Show it three trillion tokens; let the gradient flow three trillion times. The entire digitised corpus of human knowledge is your training set, with zero labelling cost, because the next token is always right there.

Kaplan et al., 2020 measured loss as a function of model size, dataset size, and compute over seven orders of magnitude. The result: loss falls as a clean power law across all three dimensions. Double the parameters, get a predictable drop in loss. Double the training data, same. Scale is not a bet on something uncertain; it is a known return on investment, measured and re-measured across a staggering range.

Schematic of the Kaplan finding: loss vs. compute on log-log axes. The points fall on a clean line over eight decades of compute — scale isn’t a bet, it’s a known return. Made with tesserax.

Hoffmann et al., 2022 — the Chinchilla paper — corrected a real error in how the field had been applying Kaplan’s result. Most large models of 2020-21 had been trained on far fewer tokens than their parameter count justified. The Kaplan result suggested scaling up models as fast as possible; Hoffmann’s finding was that you need to scale model size and training data together, roughly equally, for a given compute budget. A 70B-parameter model trained on 1.4 trillion tokens outperformed a 280B-parameter model trained on fewer tokens at the same total compute cost. Smaller model, more data, better result. Scale both dimensions together.

What do you get at the end of all this? A base model. And this is the part that surprises people who haven’t seen one.

Type “What is the capital of France?” into a raw pretrained model and it continues the text. Maybe it writes out a geography quiz — “What is the capital of France? What is the capital of Germany? What is the capital of Italy?” Maybe it generates a fake Wikipedia article. Maybe it starts a trivia show transcript. It has not answered your question. It has found the most probable continuation of your prompt, given everything it absorbed during training.

The base model has absorbed more text than any human could read in a thousand lifetimes. The co-occurrence patterns of the entire digitised corpus of human writing are in those weights. It knows facts, relationships, styles, concepts, code, chemistry, poetry, legal prose, and every other form in which humans have arranged words.

It was trained to continue, not to respond. Ask it a direct question and it treats the question as the opening line of some text pattern — one it will extend in whatever direction seems most probable. It has no concept of “you asked me something and I should answer it.”

Pretraining gives a model knowledge. It does not give it manners, opinions, or any idea what you want from it.

Instruction tuning

Step one of making a base model useful: show it what “useful” looks like.

Collect thousands of demonstration pairs. A human writer sits with a prompt — “Explain the difference between supervised and unsupervised learning in plain English,” “Write a polite email declining this meeting invitation,” “Debug this Python function” — and writes the ideal response. Then fine-tune the pretrained base model on these (prompt, response) pairs using the same next-token objective, now applied to curated demonstrations rather than the open web.

This is supervised fine-tuning, or SFT. It is plain supervised learning — the same paradigm that has been in the machine learning textbooks since the 1980s. What’s new is only what it’s being applied to.

The headline result from Ouyang et al., 2022 — the InstructGPT paper — is still worth stating plainly: a 1.3 billion-parameter model, fine-tuned on human-written instruction-following demonstrations, was preferred by human evaluators over a raw 175 billion-parameter GPT-3. One percent of the parameters. Preferred.

Sit with that. The quality of the training signal matters more than raw scale. A carefully curated set of demonstrations of what “helpful answering” looks like is worth more, for the specific goal of being helpful, than a hundred times more parameters trained on unstructured internet text. The base model knows more. The instruction-tuned model is more useful. These are different things.

SFT teaches the shape of a helpful answer: addressed to the question asked, reasonably structured, proportionate in length, appropriate in tone. These are learnable patterns. The base model already has all the relevant knowledge in its weights; SFT is teaching it to retrieve and present that knowledge in a particular format.

Here’s the failure mode, and it matters for understanding everything that comes next.

SFT shows the model what good answers look like. It gives no mechanism for the model to evaluate, at generation time, which of two candidate continuations is more accurate, more honest, or less likely to cause harm. The model learned to imitate the shape of correct answers; it did not learn to prefer correctness over fluency when the two conflict. A confidently phrased wrong answer and a confidently phrased right answer can look identical from a format standpoint. SFT cannot distinguish them.

Teaching consistent refusals is especially brittle. To get a model to reliably refuse a class of harmful requests via SFT, you need human-written refusals for every phrasing variant you can anticipate. You will miss variants. The model has no general theory of harm. It has only pattern-matching against the phrasings it saw. Change the phrasing, add a fictional framing, ask in a different language, and the refusal can fail.

The deep limitation is this: SFT can teach what a good answer looks like, but it cannot teach which of two candidate answers is better. For that, you need to know something about better that you didn’t encode in any single example. You need preferences.

SFT teaches the shape of a good answer. It has no way to choose between two good shapes.

From demonstrations to preferences

The move that follows from SFT’s failure: what humans can do faster than writing demonstrations is ranking them.

Show a rater two model responses to the same prompt — response A and response B — and ask which is better. They can answer in seconds. Writing a response from scratch takes minutes. This means you can collect preference labels at much higher volume than demonstrations, and the preference label contains a different kind of information: not “here is the target,” but “this is closer to the target than that.”

Scale up the preference collection. Collect hundreds of thousands of (prompt, response A, response B, human ranking) tuples. Train a small auxiliary model — the reward model — to predict the human rankings: given a prompt and a response, output a scalar score. Then use reinforcement learning (specifically PPO) to push the language model toward generating responses the reward model scores highly.

This is RLHF — reinforcement learning from human feedback. Ouyang et al., 2022 used it as the third stage of the InstructGPT pipeline: pretraining → SFT → RLHF. ChatGPT’s characteristic tone — helpful, reliably cautious about harmful requests, good at hedging uncertainty, consistent about refusals — comes almost entirely from this stage.

RLHF in five stages: the model samples two responses, a human ranks them, a reward model learns to predict the ranking, and PPO updates the policy to maximise that reward. DPO collapses the loop — the preference becomes a loss applied directly on the policy, skipping both the reward model and PPO. Made with tesserax.

The deepest shift over SFT: the model now has feedback about direction, not just target. SFT says “produce something like this example.” RLHF says “of the things you just produced, this kind is better than that kind — adjust accordingly.” A direction is a richer signal than a target. It can propagate to novel situations no demonstration ever covered.

The practical problem is that RLHF is a genuine engineering challenge. PPO is unstable. The reward model can be gamed: the policy learns to produce outputs that score highly on the reward model without actually being better, because the reward model is an imperfect proxy for true quality. Over long training runs, the policy finds exploitable features in the reward model and optimises for those rather than for what humans actually wanted. Balancing the RL update against the pretrained base (so the model doesn’t drift into incoherence while chasing reward) requires careful tuning. RLHF works, but it’s expensive, brittle, and hard to reproduce without a dedicated ML infrastructure team behind it.

Rafailov et al., 2023 found something that, in retrospect, looks almost obvious: you can skip the reward model entirely.

The paper, “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” makes a mathematical observation. The preference-fitting problem that RLHF solves via a reward model + PPO can be reformulated as a classification loss directly on the language model policy. Given a preferred response and a dispreferred response to the same prompt, you want the model to be more likely to produce the preferred one. You don’t need a separate reward model to express that preference. You don’t need PPO to optimise it. The preference is a loss; the loss can be minimised directly on the policy.

DPO is computationally lighter and far easier to get working. The abstract says it “eliminates the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning.” In practice, the gap between “has the resources of a frontier lab” and “fine-tuned a model that actually behaves well” narrowed substantially after DPO. Within a year of publication, it became the default alignment approach for most open-weight instruct models — the Llama family, Mistral, Qwen, and most of their derivatives. When you read that a model is “instruction-tuned” in 2024 or later, DPO or one of its descendants is almost always in the pipeline.

Both RLHF and DPO require human preference labels. In 2022-23, this was tractable. By 2024, at frontier scale, it was a real bottleneck. You need raters capable of judging quality on maths, code, medicine, science. You can’t hire enough such raters to keep pace with the rate at which models can generate candidate outputs.

The field’s response was predictable in retrospect: use the models themselves.

RLAIF (Lee et al., 2023) replaces human raters with a strong language model as the preference oracle. Head-to-head comparisons with RLHF showed that AI-generated preference labels are competitive with human ones on summarisation and dialogue. The reward model trained on AI labels performs comparably to the one trained on human labels. The human rater is no longer in the loop.

Constitutional AI (Bai et al., 2022, Anthropic) does something more principled. Write a list of principles — a constitution — stating what the model should and shouldn’t do. Ask the model to critique its own outputs against those principles and revise them. The critiques and revisions become training data. The RL stage uses the model’s own evaluations as the reward signal. Human preference labelling is replaced by explicit normative reasoning: the model has to argue about whether its outputs satisfy the stated principles, not just produce outputs that pattern-match to human-labelled examples.

The logic extends one step further. If models can generate reliable preference labels, can they generate training data directly? By 2024-25, the answer in a widening range of domains was yes. Maths problems with worked solutions. Code problems paired with passing test suites. Instruction-following demonstrations written by large models to train smaller ones — the distillation pipeline, where a 70B model generates training examples that improve a 7B model, and the better 7B model feeds the next iteration. By 2025, a substantial fraction of the data used to train frontier models isn’t scraped from the web. It’s generated by earlier versions of the models themselves.

Ilya Sutskever, speaking at NeurIPS 2024: “Pre-training as we know it will end. Data is the fossil fuel of AI. We have but one internet.” The scaling curve that had defined the field since Kaplan 2020 was visibly flattening. The field didn’t slow down. The growth frontier moved: from bigger pretraining to better post-training. The headline AI announcements of 2024 were not “we trained a bigger model on more of the internet.” They were “we trained a better model by using our previous models to generate, evaluate, and curate the training signal.”

The sharpest shift in AI in 2023-24 wasn’t a bigger model. It was figuring out how to use models to train better models.

A new axis

There is a 2024-25 discovery that changes the picture in a qualitatively different way. Not a refinement of post-training preference optimisation. Something new.

RL doesn’t just align models. It can teach them to think.

The observation that sets it up: language models already have scratch space. Their output is text; nothing prevents them from writing intermediate reasoning steps before writing a final answer. Chain-of-thought prompting — asking a model to “think step by step” — has been known since 2022 to improve performance on reasoning tasks. The model writes out intermediate steps, and those steps help it arrive at a better final answer.

But chain-of-thought as a prompting technique has a persistent problem. The intermediate steps are generated by the same forward pass as the final answer. You can ask the model to think out loud, but you can’t verify that the scratchpad is doing reasoning work rather than performing reasoning for the reader. A model that writes plausible-sounding intermediate steps that happen to be wrong, then arrives at a wrong final answer, has not improved by being asked to show its work. The steps are decorative.

The o-series models from OpenAI in late 2024 made a conceptually simple training move: use RL where the reward is the correctness of the final answer, and leave the intermediate chain of thought entirely unsupervised. The model can write whatever it wants in the scratchpad. The only signal is whether the final answer is right.

What emerged from training was not what anyone programmed in. The model learned, without any explicit supervision of the intermediate steps, to use the scratchpad as actual working memory. Backtracking when an approach failed. Trying alternate formulations when one hit a wall. Verifying intermediate results before continuing. Restarting from scratch, several steps back, when it found an error in something it had already written. None of these behaviours appeared in labelled training examples. They fell out of the objective: over enough RL iterations, the training process found that careful scratchpad use led to more correct final answers, and it reinforced that.

Only the final answer receives reward. The scratchpad is entirely unsupervised — backtracking and verification fall out of the training loop, not from labelled examples of good reasoning. Made with tesserax.

DeepSeek-R1 (arXiv:2501.12948, January 2025, open weights) replicated the result outside a closed lab. Pure RL on reasoning trajectories; no human-labelled chains required. The paper documents what they call the “aha moment” in training: a specific point where the model spontaneously began verifying its own intermediate steps and restarting when they failed. Not because the training data contained this behaviour as a pattern. Because the reward for correct final answers made careful intermediate reasoning instrumentally useful, and the RL training loop discovered it.

I think DeepSeek-R1 is the clearest published demonstration of something the field had suspected but hadn’t proven at scale: that reasoning, as a behaviour, is trainable from a simple outcome-based reward signal. You don’t need human annotations of good reasoning traces. You don’t need to supervise the scratchpad. You need to reward the right answer and run enough RL. The reasoning emerges.

Three things follow from this that are worth naming separately.

Test-time compute is a new scaling axis. Pretraining scales with more data and more parameters — you pay at training time and get a more capable model. Reasoning models scale with more inference compute — you pay at generation time, by thinking longer, and get a better answer on the current problem. A smaller reasoning model that thinks for ten seconds can match or outperform a larger standard model answering in one pass. These axes are complementary, not competing. You can now trade training-time capability against inference-time deliberation, and that tradeoff is explicit and controllable in a way it wasn’t before.

Diagnosability changes the failure mode. A standard model that gets a maths problem wrong gives you a wrong number. A reasoning model that gets it wrong gives you a chain of thought — readable, traceable, inspectable at every step. You can see exactly where the logic went off course: which intermediate claim was false, which inference was unwarranted, at what point the reasoning was solid and where it broke down. For systems where the reliability of the output matters — and in agent pipelines, it almost always does — this is the property that makes the difference. The failure is visible. Visible failures are debuggable. Black-box failures are not.

And the arc closes. The whole story of this piece — n-grams, embeddings, pretraining, instruction tuning, preference learning, and now reasoning — is one continuous story of making the training signal more specific. N-gram models encode raw co-occurrence statistics: this is what tends to follow that. Embeddings compress those statistics into geometry that neural networks can use. Pretraining scales that compression to the entire digitised corpus of human writing. Instruction tuning adds: here is what a helpful answer looks like. RLHF and DPO add: here is what better looks like, relative to what you just produced. RLAIF and synthetic data close the loop so models can teach each other. Reasoning models add the final turn: here is what thinking carefully looks like. Not by showing examples of good reasoning. By rewarding the right final answer, and letting the model figure out the rest.

Seven paradigms, one direction. Each step exists because the previous step’s training signal wasn’t specific enough — and each one adds a kind of feedback the previous one couldn’t carry. Made with tesserax.

Reasoning models aren’t smarter than other models. They’re models that have learned to spend their intelligence more deliberately.

Each step in this story makes the feedback signal richer. Each step exists because the previous step’s signal wasn’t specific enough.

The direction is clear: we keep finding more precise ways to tell models what we want, and they keep using it.

Until next time, stay curious.

The second edition of Mostly Harmless AI goes deeper on what these training paradigms make possible in practice — why a reasoning model behaves differently as an agent core, what alignment actually means when you’re building a system rather than evaluating a benchmark, and the chapters that didn’t fit any blog post. The whole book is also available to read online for free in a reader I built and am rather fond of: dark mode, font controls, progress tracking, offline support, the works.

If you want everything I’ve written and everything I’m going to write, the Compendium bundles it all — one purchase, in perpetuity.

Check it out

How to write a CLI an agent will actually use

Alejandro Piad Morffis — Thu, 07 May 2026 11:00:58 GMT

Monday I argued the next move in agent autonomy is the layered stack around the model. This is the technical companion: what the tools inside that stack should look like. Both pieces are part of May’s arc on agent reliability, anchored on the second edition of Mostly Harmless AI, which you can also read online for free. More at the end.

The first time you fire up a coding agent and walk away for an afternoon, you come back to a workspace that’s been touched in twenty places by something that doesn’t know what it doesn’t know. Most of the touches are fine. One of them, on the third such afternoon, will not be.

The reflex is to make the agent smarter. The fix is to make the tools it calls safer, by design, not by vigilance.

I was reminded of this an hour ago, building a small Python CLI to append entries to my Workspace journal. Dumb tool, three flags, ninety lines. I built it on top of microcli, a framework I’ve been using for about a year. The experience of writing it is what this post is about.

What’s wrong with a normal CLI

Sit with the unix-style ergonomics for a second. A normal CLI, written in argparse or click or typer, has a shape that’s been the default for fifty years.

It’s silent on success. You ran the command, you got back a shell prompt, you hope it worked. There’s no acknowledgement of what changed, no record of what the next step might be. If you’re a human at a terminal, you can run a verification command. If you’re an agent on a 30-call chain, you have to infer whether the action took.

It bitchslaps you on failure. A traceback, a single non-zero exit code, maybe an inscrutable error string from a subprocess three layers down. No suggestion of what to try instead. No indication of what already changed before the failure landed. If the command was halfway through writing a file when it died, the agent has no idea whether to retry, roll back, or move on.

It has no self-description. The only way to understand what the tool does is to read the source, or run it and observe the side effects, or both. There’s --help, but --help lists flags, not workflows. The agent doesn’t learn how to use the tool from --help; it learns the shape of the parser.

And worst of all for an autonomous loop: side effects are the default. You run git push and it pushes. You run rm and it removes. The dry-run is an opt-in flag at best, missing entirely at worst. Every call is a potentially load-bearing call, and the agent has to know in advance which ones will hurt.

This shape is fine for a human at a terminal who can hit Ctrl-C and read the docs. It’s a disaster recipe for an autonomous loop.

What we actually need

Invert all four:

Dry-run by default, real action opt-in. The model sees a preview first; the explicit --save flag is the consent gate. No call mutates state without the model having seen what would change.
Predictable structured output for both happy and failed paths. Success says exactly what changed and where. Failure says exactly what failed, what state was already touched, and what to try next. No tracebacks, no silent zeros.
Self-describing apps. The CLI carries its own learnable surface, not just a flag list. The workflows, the failure modes, the next-step commands all live inside the tool. The agent should be able to ask the tool how am I supposed to use you and get a real answer.
Agents writing these apps themselves. This is the load-bearing one. You can do all of the above in plain typer or click. But the agent has to remember to do them. We need a framework where the safe shape is the path of least resistance, where it’s structurally hard to write a tool that doesn’t dry-run or doesn’t describe itself.

What this looks like with microcli

microcli is six hundred lines of Python and three principles, all enforced by the framework’s API surface rather than by convention. Validate before acting. Return descriptive messages. Default to dry-run. Each one is a behavior the API makes the path of least resistance, not a guideline you have to remember.

Rather than describe what writing one of these tools feels like, I’ll let the agent that wrote one tell you. Earlier today I asked Claude (running in this workspace, with file access and a shell) to write itself a small CLI for appending entries to my Workspace journal. Claude had never used microcli before. I told it to run microcli learn and roll with it. That was the last thing I had to say.

What follows is Claude’s account of the build, in its own voice.

Claude:

I’d never seen microcli before. which microcli came back empty; the repo was at ~/Workspace/repos/microcli/ but not installed. A few minutes of yak-shaving (gh repo clone, uv tool install -e ., two transitive deps I had to fetch the same way) and I had a microcli binary.

microcli --tour gave me a one-screen map. microcli learn listed the topics. Three commands later — learn --topic principles, learn --topic ok-fail, learn --topic patterns — I had everything I needed. I never read the framework’s source. The framework taught me how to use it through its own CLI, which is exactly the property the framework expects me to give my own tools.

The build started with microcli new journal "Journal entry tool" add. The scaffold isn’t a stub; it’s a teaching artefact. Inline comments pointed me at the next learning topics, the imports were already wired, the @command decorator was demonstrated, and a TODO gestured at the pattern I was supposed to follow.

Writing the body felt structurally guided. Validate first because fail() is the obvious primitive for rejecting bad inputs. Dry-run branch second because save: bool = False is the obvious parameter to add and info() is the obvious primitive for printing what would happen. Mutation branch third because there’s nowhere else for it to go. I’d have to actively work to write a tool that mutated by default or failed silently or didn’t describe itself.

The script lives at .playground/microcli-journal/journal.py. Ninety-odd lines. The load-bearing1 function:

@command
def add(
    tag: Annotated[str, "Entry tag (commit / decision / blocker / ...)"],
    description: Annotated[str, "One-line entry body"],
    save: Annotated[bool, "Actually write the entry"] = False,
):
    """Append a single journal entry."""
    if tag not in ALLOWED_TAGS:
        fail(f"unknown tag '{tag}'. Allowed: {', '.join(sorted(ALLOWED_TAGS))}.")

    path = journal_path_for_today()
    entry = build_entry(tag, description)

    if not save:
        info(f"Would append to: {path}")
        info(f"  Line: {entry.rstrip()}")
        info(f"  Equivalent: journal.py add {tag!r} {description!r} --save")
        return

    write(path, read(path) + entry)
    ok(f"Appended entry to {path}.")

That’s the whole shape. The decorator handles argparse. Annotated[..., "help text"] becomes --help for free. fail() writes a red ✗ and exits 1. info() prints a cyan line. ok() prints a green ✓. The dry-run runs first and renders the exact --save command the next caller should run, with the original arguments threaded back through.

Smoke-test took five minutes. Dry-run printed the preview plus the explicit --save equivalent. Bad tag produced a red message naming the allowed list. Bad date format produced a red message naming the expected shape. The real --save wrote the file and confirmed with a green ✓ pointing at the path. Nothing to debug because there was nothing to infer.

The last thing I ran was journal.py --learn add. The framework AST-walked the source I’d just written and produced a tour of my command: description, next-step commands harvested from the if not save: branch, failure modes harvested from the fail() strings. The next agent that touches this tool will not need to read it either.

Back to Alex.

Alex again.

That was… weird. But it was indeed Claude. I really gave it the task, and then asked it to explain what happened. To the extend that you can trust an AI Agent introspection (which is, to be honest, not much), this is a real account of what it “feels like” for an AI to use microcli.

What I want you to take from that is the structure of the experience, not the specific tool. Claude built a real, working CLI in a workspace it was already familiar with, in a framework it had never used, by following the framework’s own self-description. The unsafe shape was inaccessible.

That’s what I mean by structurally hard to do the wrong thing. You can do all of this in plain typer or click; you’d just have to remember to.

Try it

microcli is at github.com/apiad/microcli. pip install microcli-toolkit. The microcli learn principles tour runs in five seconds. If you build CLIs that an agent will eventually call — your agent, mine, or one that runs autonomously on a clock — this is the shape worth reaching for.

Have fun. And until next time, stay curious.

If the larger argument resonates, Mostly Harmless AI goes deep on this material. Full chapters on the harness around the model, the agentic stack we walked through Monday, and the discipline of building tools that compound rather than accumulate risk. You can also read the whole book online for free in a custom reader I built that I’m rather proud of: dark mode, font controls, progress tracking, offline support, the works.

If you want the whole catalog of everything I’ve written, plus everything I’m going to write, that’s the Compendium. One purchase, in perpetuity.

This is how you know this was indeed Claude (Opus 4.7). I would never use “load-bearing”, even if my life depended of it, but Claude loves it.

The Anatomy of a Linguistic AI Agent

Alejandro Piad Morffis — Mon, 04 May 2026 17:17:05 GMT

After Rembrandt’s “The Anatomy Lesson of Dr. Nicolaes Tulp” (1632); generated with Nano Banana 2.

You have used a language model in a chat box. You typed a question, you got an answer, you closed the tab. The whole interaction lasted under a minute. The model did not remember you the next time you opened the page.

You have also seen, or read about, agents that work for hours. A coding agent that ships a feature overnight. A research agent that pulls together a hundred sources before breakfast. They plan, they call tools, they back out of dead ends, they hand you something you can use.

Both are the same model. Same neural network. Same forward pass. The only thing that changed is what’s wrapped around it.

This essay is the bridge. The architecture that turns the first thing into the second is not a single insight. It is a stack, a small number of layers, each one added in response to a failure mode of the previous layer. By the end you should be able to point at any agent doing real work in 2026 — coding, research, customer ops — and name which layer is doing the heavy lifting at any given moment.

Some of those layers are old. The fundamental one was published in 2022, before ChatGPT shipped. Some are very new. One was named eighteen months ago and is still settling. None of them, individually, is hard to follow. The trick is seeing them as a sequence, each fix opening the door for the next.

If you want a number to anchor where we start: METR has been measuring the time-horizon of frontier agents, and a language model on its own, with no scaffolding around it, sustains roughly a few minutes of human-equivalent work at 50% reliability. The equivalent of writing a competent meeting summary.

That is the floor.

Every post on the blog this month is on the theme of agent reliability, anchored on the second edition of Mostly Harmless AI, where the engineering details that don’t fit a blog post live. You can also read the whole book online for free in a custom reader I built. More at the end.

Subscribe now

The base case

Strip everything away first. No agent, no tools, no skills, no harness. Just the model.

A language model, in the strictly minimal sense, is a function from a string to a string. You hand it a sequence of tokens. It hands you back a sequence of tokens. One forward pass through the network. The input goes in at one end, the output comes out the other, one token at a time until the STOP token is generated, and that is the entire interaction. No state is held between calls. The next time you ask the same model the same question, it has no idea you have ever spoken before.

Inside that one shot, the model delivers. It will answer, draft, summarize, translate, brainstorm. Give it a good piece of context and a clear ask, and the response that comes back will, in my experience, often be useful enough to ship as-is. This is the experience that made everyone notice in late 2022. Open a chat, ask anything, get something back you can use. People called it magic at the time. Most of them still do, even though useful function with no memory is the more honest description.

But notice what it cannot do — which is most things you would ever want from an agent.

It cannot verify its own output. The same forward pass that produced the answer is the only one available to check it. There is no second opinion, no quick lookup, no let me try it and see what happens. The model is committed to whatever came out the first time.

It cannot look anything up. Whatever facts it has are baked into the weights from training, frozen at some cutoff date. If you ask about today’s news, or your codebase, or an internal company document, the model has nothing. And worse, it will frequently invent something plausible-sounding because completing a confident sentence is what it was trained to do.

It cannot act on the world. It cannot write to a file, send an email, call an API, run a command. It cannot do anything that has a side effect outside the chat window. The only thing it can produce is more text.

Inside the four walls of the context window, the base model is the most capable text engine the field has ever built. A single chat box was enough to launch the largest consumer product of the decade. Outside those walls, it is inert.

METR’s measurements of an unaugmented model — no tools, no loop, no scaffolding — put the time horizon at something on the order of minutes of human-equivalent work. Minutes. That is the starting capability. Everything else in this essay is a way of making those minutes compound.

The first leap

The first real agent paradigm is older than ChatGPT.

In October 2022, a team at Princeton and Google published ReAct: Synergizing Reasoning and Acting in Language Models. It went out about six weeks before the ChatGPT launch that made the public notice agents existed at all. Every working agent today — Claude Code, Codex, Gemini CLI, the dozens of research agents and customer-ops agents shipping this year — is some refinement of the loop that paper introduced.

Here is the setup. An agent operates in some environment: a Wikipedia API, a household simulator, a web shop, your codebase. The environment offers an action space, the set of things the agent is allowed to do. Call it A. A policy maps the current context to the next action: given everything the agent knows, what does it do next? With nothing else, the policy has to map a long, noisy trajectory of past observations directly to the right next move. This is brittle. The longer the task runs, the more lost the model gets.

ReAct’s move is to enlarge the action space. The new action space is A plus L, where L is the space of natural language. A “thought” is an action in L, the agent pausing to write itself a sticky note before reaching for the next tool. It does not change the world, it changes the context. The next action is conditioned on a context that now includes the model’s own reasoning about what just happened.

The paper spells out what thoughts are actually for, and the list is concrete, not mystical. Decomposing the goal into a plan. Injecting commonsense the environment does not supply. Extracting the relevant signal from a noisy observation. Tracking progress and noticing when a subgoal is done. Handling exceptions when something breaks. Five jobs.

Why this beats the alternatives is where the paper earns its place. Chain-of-thought prompting, the prior art, has the model reason in a closed loop inside its own head, with no contact with the world. The paper’s own ablation on the HotpotQA benchmark is brutal: chain-of-thought hallucinates in 14% of its successes and 56% of its failures. Acting alone, calling tools without thought, is grounded in the world but loses the global plan after a few steps. ReAct synthesizes them. On the same task, ReAct hallucinates in 6% of successes. Less than half. Both halves of the loop have to be there.

One concrete anchor before we move on. ReAct’s HotpotQA action space, the entire set of things the agent could do, was exactly three actions: search[entity], lookup[string], finish[answer]. Three. The first working agent paradigm operated on three tools. Hold that number.

The paper closes with the line that becomes the engine for the rest of this essay. “Complex tasks with large action spaces require more demonstrations to learn well, which unfortunately can easily go beyond the input length limit of in-context learning.” In plain English: more capability needs more action descriptions, which need more context, which we do not have. Every layer that follows is the field iteratively solving exactly that bottleneck.

METR step: a model wrapped in this loop moves from minutes to tens of minutes on bounded tasks.

Tools

So how do you fix ReAct’s bottleneck, the one the paper named in its own conclusion?

The first, most obvious answer: give the agent more actions to take. If A was the original action space and ReAct enlarged it to A ∪ L, the next move is to make A itself bigger.

That is what a tool is. A tool is a function the model can call. It has a name, a typed schema for its arguments, and a return value. The model writes a tool call into the trajectory the same way it writes a thought. Except this one has a side effect on the world. The harness picks it up, runs the function, drops the return value back into the context. The next turn of the loop sees the result and decides what to do next.

The loop is unchanged. Same thinking, same acting, same context-grows-by-a-turn shape ReAct described. The difference is what the agent is allowed to do.

ReAct, recall, ran on three tools: search, lookup, finish. That was the entire menu. Claude Code in 2026 ships with more than twenty: read a file, edit a file, run a shell command, search the codebase, fetch a URL, spawn a subagent, take a screenshot, schedule a future tick, and so on. Each one is just a function with a schema. Each one expands the set of things the agent can do without changing one line of the underlying loop.

This is the part that surprised me, the first time I sat with it. The chatbot you typed at in 2022 and the agent that wrote your test suite this morning share one loop. What changed is the tool catalog. Same loop. Bigger menu.

That observation is the unsexy version of why tool-building is now a discipline of its own. Every capability you add to an agent — search the web, read a Slack channel, hit your billing API, deploy to staging — is just another function with a schema. The architecture does not change. The leverage is entirely in which tools you build and how you describe them to the model.

The design discipline that emerges is short to state and brutal to follow. Tools should be few, sharp, and self-describing. Few, because every tool you add takes up tokens in the system prompt and a slot in the model’s attention. Sharp, because a tool that does seven things is one the model will use wrong six times out of seven. Self-describing, because the model only learns to use a tool from its name, its docstring, and its argument schema. There is no other channel. (More on this on Thursday. Anthropic’s recent guidance on writing tools for agents is the cleanest summary of this craft I have read.)

METR step: a model with the right toolkit moves from tens of minutes to hours of bounded work.

Skills

Tools fix half of ReAct’s bottleneck. They expand the action space.

The other half, recall, is the input-length limit. Every tool you add costs tokens in the system prompt to describe: name, schema, when to use it, what its return value looks like. Add fifty tools that way and the system prompt is a small book. The model is reading every single tool description on every single turn, even when ninety-five of those turns have nothing to do with that tool.

Skills are the move that fixes this.

Anthropic shipped the idea in late 2024 and the rest of the field has been catching up since. A skill is, mechanically, almost embarrassingly simple. It is a markdown file. It has a name, a one-line description of when it applies, and a body that explains how to do the thing. The agent does not read it on startup. The agent reads it on demand: when, in the middle of a task, it notices a description that matches what it is about to do.

So instead of jamming and here are seventeen other things you might want to do into the system prompt, you put each of those things in its own file with a one-liner that names when to consult it. The system prompt stays small. The latent capability of the agent becomes, for practical purposes, unbounded. Every skill you write is one more thing it can do, but only when it actually needs to.

I find the deeper shift here more interesting than the engineering. The agent is reading documentation written for it. Not training data ingested months ago and frozen into weights. Documentation. Authored in plain prose. Versioned in git. Like the laminated procedure sheet a mechanic posts above a workbench for a job done once a month. Improvable by the same process that improves any document: someone notices the agent doing the wrong thing, edits the file, the next agent reads the new version and gets it right.

This is self-extension by reading, not by retraining. A new capability used to require a new training run, or at minimum a new fine-tune. Now it requires a markdown file. The cost of teaching an agent to do one more thing has fallen from days of GPU time to the minutes it takes to write a paragraph, and almost nobody outside the people building agentic systems has noticed.

The system prompt stays small. The set of things the agent can do, on demand, grows without bound. The two used to be the same number.

METR step: skills, more than anything else in this list, are what made the time horizon stop being bounded by how cleverly you wrote the system prompt.

MCP

For most of 2024, every agentic harness invented its own way to attach the same set of capabilities. You wrote a tool for Claude Code; it would not work in Codex. You wrote a skill for one harness; another harness could not see it. You hooked your billing API into one agent and had to do the same wiring four more times for the others. Every integration was bespoke. Nothing composed.

The Model Context Protocol (MCP) is the field’s answer to that. Anthropic shipped the spec in late 2024. By the end of 2025 every serious agent harness, including the ones not built by Anthropic, had adopted some version of it. Codex talks MCP. So does Claude Desktop, and Cursor, and a long list of others. This is one of those quiet moments where an industry just... agrees on a wire format, and a year later the world is different.

The architecture is three nouns. Hosts are the applications you actually use: Claude Desktop, Codex, Cursor. Clients live inside the host and talk to one server each. Servers are the things that actually expose capability: your codebase, your billing API, the Wikipedia search box from the ReAct paper four years ago.

What a server offers is the second triple: Resources (data the model can read), Prompts (workflow templates the user can invoke), and Tools (functions the model can call). Three nouns, again. The whole protocol is two threes.

The point is portability. A skill or tool you wrote once, against the protocol instead of against a specific harness, works everywhere. The lock-in moves out from under you. The agent ecosystem starts to compose the way the web did in the late 1990s. Not because someone planned it, but because everyone independently noticed it was cheaper to talk a shared protocol than to keep reinventing the connector layer.

Worth noticing what the spec foregrounds at the top of every chapter on tool calls: user consent. Capability requires permission. The protocol does not assume the model can do whatever a server exposes. It assumes the model has to ask, and the user has to answer. A small design choice with very large downstream consequences, and the reason the rest of this stack does not collapse into something nobody would let near their email.

METR step: not a step on the ladder, but a multiplier. The tools and skills from the last two sections now travel.

Context engineering

Add tools. Add skills. Add MCP. The agent can now do, in principle, almost anything you can describe in a prompt and a function. The trouble is what happens when it actually starts trying.

A long agent run accumulates context. Every observation from a tool call goes in. Every thought goes in. Every error message, every retry, every half-attempted plan that did not work goes in. After a few hours of work the context window is mostly exhaust: the trail of everything the agent tried, the great majority of which is no longer relevant to the next move. The model is searching for signal inside its own attic.

Karpathy named this context engineering in 2024, and the name stuck because the field had been doing it without a name for two years. Simon Willison wrote it up. LangChain made it a category. By 2026 it is a craft of its own: what to put in the context, when to summarize, what to evict, what to keep verbatim because the agent will need its exact wording later.

The central primitive in the discipline is compaction. At some threshold, typically 70% to 85% of the window, the agent stops, reads its own history, and rewrites it into a smaller form. Here is what we were trying to do. Here are the decisions we made. Here is the state we are in. Here is the next move. The compacted summary replaces the noisy trail. The agent keeps going on a fresh, smaller context with the salient bits intact.

The deeper move is that the agent now owns its own working memory in a way it never did inside a single ReAct loop. ReAct kept the entire history. Compaction lets the agent curate the history. A small change of grammar with a giant change of consequence.

Notice what this fixes. ReAct’s authors, in the same 2022 paper, named the dominant failure mode of their own system: “the model repetitively generates the previous thoughts and actions, often failing to reason about what the proper next action to take should be and jump out of the loop.” Translation: the agent gets stuck because its context is full of the same noise as the previous turn, so the next turn is the same noise plus a little more. That is a context problem. Context engineering is what stops it.

Without this layer, every previous layer eventually drowns. A hundred tools is useless if the agent’s context is so saturated it cannot find the right one. The five-thousand-word skill on how to handle a billing dispute is useless if the agent compacted it away on turn forty. Context engineering is the layer that makes the others compound over a long run instead of degrading into noise.

METR step: this is the layer that turns a few hours of focused agent work into a workday.

The hierarchy of agency

Stack the layers and the picture comes into focus. At 50% reliability on the METR time-horizon scale, a language model alone, with no scaffolding around it, sustains minutes of human-equivalent work. Wrap it in a ReAct loop with no tools, and that becomes tens of minutes. Add tools to ReAct, hours. Add skills and context engineering on top, a workday. Add an external loop above all of that, a fresh agent per turn on a clock with a journal handing state to itself, and the horizon stretches into days and weeks.

Stare at that ladder for a second. Each rung is the same model. What separates a chatbot from a coding agent that finishes a feature overnight is the scaffolding stacked around it. The frontier of what an agent can do in 2026 is set, almost entirely, by where you stop climbing.

Each layer has the same shape, in the abstract. Find the thing that bottlenecks the previous layer. Add a structure that lets the model offload that thing into the world, the way a machinist offloads a measurement into a caliper rather than holding it in memory. Into language, into tools, into files, into a clock. The model’s per-turn intelligence does not change. What changes is the time horizon over which that intelligence compounds.

The last rung is the one most people have not seen yet, and it is the one I have spent the last few months running on my own infrastructure. The trick is the same one. Take the bottleneck (the agent runs out of context before it runs out of work) and offload it. The new offload target is the file system. The new clock is cron. Past-Claude writes a markdown file at the end of its turn that says what it did and what comes next. A timer fires some hours later. Future-Claude wakes into a fresh context, reads the file, makes the next move, writes the file, exits. The continuity is in the file, not in the model.

That is the entire primitive. A markdown file and a timer. Past self tells future self what to do.

What you get from it is hard to describe to someone who has not run one. The agent works on your stuff for weeks at a time. It writes new jobs for itself. It reads the documentation about its own substrate and uses the tools that documentation describes. It makes mistakes (one in five runs produces something I have to throw out) but the mistakes are caught by the same kind of boring engineering that catches mistakes in any other autonomous system. Audit log, lock registry, archive-only deletion, every state change committed to git before the next turn starts.

The point of saying this out loud is that the same trick keeps working. Extend the action space; add a layer that compounds; let the previous layer drop the things it could not hold. The trick does not stop at hours. It does not stop at days. METR’s curve has been doubling every four months over the last two years. The 2027 projection is a working day. The 2028 projection is a working week.

Each doubling is one more scaffolding layer.

The frontier is not the model

Step back from all of it.

The architecture you’ve just walked through is layered. A language model at the core. ReAct around the model, turning tokens into actions. Tools around ReAct, expanding what those actions can be. Skills letting the agent pull capability from the file system instead of carrying it in the system prompt. MCP making everything portable. Context engineering keeping the whole thing from drowning in its own exhaust. An external loop on top of all that, when the work runs longer than a single context window can hold.

Every agent doing real work in 2026 — your coding agent, your research agent, the customer-ops bot answering your refund request, my private-tick agent running once an hour — has this shape. They differ in which tools they ship and which skills they read on demand. They do not differ in the shape of the stack. Once you can see the layers, you can see them everywhere.

So here is the closing claim, the techno-pragmatist version of what the article has been arguing the whole time. The frontier is not the model. It is the layers around it. And the entire stack is the field’s three-year answer to a single sentence in a single paper from October 2022 that named its own ceiling and dared the rest of us to climb past it.

One frontier worth flagging before I close. A competent agent can already write its own tools and skills on demand. That part is shipping today. The next move is teaching it, via tools and skills, to detect by itself when its current toolkit doesn’t cover what it’s trying to do, so it knows when to extend itself without being told. Self-extension that triggers itself. That is the live edge right now, and where the next few posts are headed.

The next post zooms in on the innermost layer the agent touches: the tools themselves, and what makes a tool safe enough to live inside a stack like this. That is a story for another Thursday.

Until next time, stay curious.

If this is the worldview you want to take more seriously, the second edition of Mostly Harmless AI (due May 25th) goes deep on the agentic stack we walked through here. Full chapters on context engineering and on the harness around the model, with the math, the case studies, and the parts that didn’t fit a blog post. You can also read the whole book online for free in a custom reader I built that I’m rather proud of: dark mode, font controls, progress tracking, offline support, the works.

If you want the whole catalog of everything I’ve written, plus everything I’m going to write, that’s the Compendium. One purchase, in perpetuity.

Realtime 3D in Pure Python + Numpy

Alejandro Piad Morffis — Tue, 07 Apr 2026 15:59:45 GMT

Just a random, completely unrelated stock photo, because I’m cool now. Photo by Ant Rozetsky on Unsplash

Today I want to tell you a different kind of story. It’s not about machine learning, large language models, algorithms, or theory of computer science.

It’s about a side project that I’ve been building for a couple of weeks that made me fall in love again with an ancient love of mine. The quick and easy way to explain it is this: a performance-focused graphics engine for data-driven visualizations in Python.

But that’s only the surface. If you want to see the coold demos and the technical description, feel free to scroll down. But if you want to know the story behind it, let me start from the beginning.

Subscribe now

The Origin Story

So, this starts back in undergrad, before I did anything related to machine learning or optimization or statistics. My first love was actually computer graphics. I had “learned to code” like, I don’t know, at 11 or 12, and for the first five years or so, before getting to college and actually learning to code, all my “coding” was basically tiny games. It was RPG Maker back then—who remembers that?

I always wanted to be a game developer, as you may imagine, and I think that’s probably the main motivation why I studied Computer Science. There are two kind of people who want to study Computer Science, as a matter of fact. One is people who love games—and the other is, of course, people who hate games; there are no in-betweens.

I was the loving-games kind and all I wanted to build games for a life. So when I was in first year, after actually learning some real coding, my first kind of large project was a game engine. This was before Unity, before even XNA—who remembers that?—this was when .NET was getting started, and I wrote a quick and dirty game engine in pure C# that talked native DirectX 11.

It was very cheap, a disaster of architecture almost surely, but it taught me the basics of how to construct a scene graph, how to animate a camera, how to do lighting, how to write very basic shaders. I learned a ton and basically fell in love with computer graphics.

I ended up doing my diploma thesis in computer graphics—screen-based global illumination, a couple of years before NVIDIA came up with ray tracing on the GPU, which basically killed that whole area of research. And I also did my Master’s on global illumination and some data structures for the GPU, but after graduation I quickly switched research towards machine learning and AI, which, you can imagine, this 2014, and deep learning was just on the rise. The rest is history, as they say.

And then, here I was this past week thinking about old projects that I used to have fun with when I was in college, and trying to remember what it felt like to code back then, no LLMs, no internet for the most part even. The time where I’ve been the most fun was probably when I dabbed into procedural generation of cities, mountains, lakes, and… stuff, in the late 2012. This was at the early era of PCG, and I never got to do anything with that other than a few tutorials and a few lessons that I taught at University.

I played with Unity for a couple of years, but nothing too serious–I think I was actually one of the first people in my University to even install Unity, and I even taught a couple of Master’s courses on it. I participated in a couple of game jams, but after 2017 or so I stopped doing graphics all together. And I’ve been doing machine learning since.

But, in any case, I kind of forgot about computer graphics all along. At least during day-worked. So there was I last weeek, remembering that and asking how hard would it be to actually make a graphics engine in Python, some quick hack like my undergrad projects. I did a bit of research and I discovered that Python is, as of 2026, in a very good position to build a graphics engine, and not just a crapy one, but one that is actually fast. We have WGPU now—the spiritual and practical succesor of OpenGL (who remembers that?), which has native suppotr for GPU-accelerated graphics in Linux.

I basically did a plan and sat for three days to hack this thing.

The Engine

My first idea was to have a Rust backend for all the graphics engine stuff—the rendering loop, materials, lights—but I quickly decided to drop that idea because getting Rust and Python to talk to each other was becoming increasingly harder and harder, and I really wanted to finally see a damn cube rendering on my screen.

So I decided to switch completely to Python. But since I’m a grown-up now, I have to find some kind of serious objective for making something like this. I decided I didn’t want to make a typical graphics engine where you have a scene graph with hierarchies of entities and properties, and you simply render all of them. No, that is way too 2000s.

I decided I wanted to do a very fast, data-driven visualization tool purely based on the Entity-Component-System (ECS) paradigm and make it extremely performant, so it would focus on big data-driven simulations like N-body simulations, chemical and physics experiments, AI pathfinding and agents, you know, grown-up stuff like that.

(But actually, all I wanted was to play with WGPU and draw some cubes in Python. Wink, wink.)

This framing gave me two things, though. My solution doesn’t have to be very fancy as a game engine, we don’t need to be able to like load skeletal animations or stuff like that. It’s not actually a game engine; it’s a graphics engine with at best some interaction logic. But it still lets you do some cool stuff, even if all you can render is blocks and spheres. When you can render thousands of them running very fast on the GPU, you can do some cool stuff. So this is the motivation, and now let me show you what I have.

Deep Dive

So here is manifold—short Manifold Graphics if you want. It’s a Python library built on top of WGPU, a graphics engine based on the Entity-Component-System paradigm.

If you have never heard about it, ECS is a completely different way of writing code that is especially tailored for video games, but it is very little known outside of the game development world. And its awesome.

In a typical business code, you have entities who own their data, and you usually have behavior associated to entities; so entities also own their behavior—this is the basic Object-Oriented Programming paradigm where objects own their data and their methods. And if you want to do something with an object, you have to call methods on the object so the object guarantees the instance invariances.

Since OOP was basically the ONE programming paradigm of the 90s and early 2000s—when the videogame industry really exploded—it is only normal that we started writing games like this. But there is a problem with OOP (well, many problems, but one in particular that matters for our discussion).

When you have 10,000 objects, each of them with more or less the same structure, e.g., they are physical particles bouncing with each other, or little zerlings comming to your base, you simply cannot update them fast enough. For example, making a physics simulation out of this is extremely slow if you have to go to each particle and update its velocity, its scale, its rotation, etc. You’ll end up doing thousands of tiny method calls, thrashing your cache, and issuing lots of super small copies to GPU for drawing.

What you want is to vectorize this operation. You would like to have all of the objects’ data in a single NumPy matrix, and you want to write a very, very efficient vectorized code that doesn’t do any loop and just updates everything at once. THen copy all the data to the GPU and issue a single draw call that renders all objects parameterized by their positions, rotations, etc. Chef kiss.

This is the Entity-Component-System paradigm at its core. It completely flips the responsibilities from standard OOP The components are just flat storage of data (rows in a matrix) and the entities are just pointers to a row where all of their data lives. Then the systems are methods that act on a subset of entities using heavily vectorized code, because each system deals with a large number of equally-structured entities, and they don’t care which is which.

In manifoldx, each system is a Python method that receives a subset of entities that have some combination of components. For example, if you want to process all of the particles in a simulation, you write a system that receives entities that have the Particle component, perhaps also a Transform component. In the transform component, you will have the position, rotation, scale, and the particle component will store simulation-specific data like velocity, temperature, momentum, etc.

The key to high performance in ECS is to avoid looping as much as possible. You assume all of the components of the entities in a system have exactly the same layout, so what you get is really a view of a matrix, and you write vectorized code. You add something to all them, you multiply all them by something, or in general you compute some matrix operations on them. All at once.

And if you can write your code like this, then you get a very, very fast rendering loop because instead of making one method invocation per entity, you make one method invocation per archetype, that is, per combination of components, which is a couple of order of magnitude less that your entities count.

Here’s a minimal example showing how the ECS works in manifoldx:

import manifoldx as mx
import numpy as np

engine = mx.Engine("Cubes")
engine.camera.zoom(0.1)

# These are all static things that are created
# and stored in memory once
mesh = mx.geometry.sphere(1)
material = mx.material.phong(mx.colors.BLUE)

# Custom component, gets registered in engine to keep track
# Only used for reflection on the values
@engine.component
class Particle:
    velocity: mx.Vector3
    angular: mx.Vector3
    life: mx.Float

# This runs every frame
@engine.system
def particle_lifecycle(query: mx.Query[Particle, Transform], dt: float):
    query[Particle].life -= dt  # Single vectorial operation
    query[Transform].position += query[Particle].velocity * dt
    query[Transform].rotation += Transform.rotation(euler=query[Particle].angular * dt)
    query[Transform].scale = query[Particle].life / 10.0

    # Destroy all dead particles at once
    engine.destroy(query[Particle].life <= 0)

    # Now we create lots of particles
    N = int(100 * dt)

    # This will in principle reuse the buffers for dead entities,
    # but will expand the buffer if necessary
    engine.spawn(
        Mesh(mesh),
        Material(material),
        Transform(pos=(0, 0, 0), scale=(1, 1, 1)),
        Particle(
            velocity=np.random.uniform(-5, 5, (N, 3)),
            angular=np.random.uniform(-2, 2, (N, 3)),
            life=np.random.rand(N) * 10,
        ),
        n=N,
    )

    # Update camera
    engine.camera.orbit(45 * dt, 0)

That’s it. A single line of code to update all positions at once. Notice the query argument that defines which entities you get (all entities with both a Transform and a Particle component). And here is how that looks like:

In a real simulation, you can have, say 10 systems, but you have 10,000 or 100,000 entities, and you know you can do very fast vectorized updates in NumPy for all them, 10 times each frame.

For example, if you have 500 particles and you want to do N-body simulation, computing the 500-squared gravity interactions 60 times per second in Python is suicide. But if you do it in NumPy, then you get something that runs in a few milliseconds. A quarter million interactions computed 60 times per second. In Python.

To make it really efficient, you need to also avoid copying or moving data; it’s all masking and clever NumPy layout that keeps all of the memory in one place, and you are just seeing fragments of that memory in each system.

The other key idea is that you don’t modify anything in a system. That line where position is set, doesn’t really write back to the matrix. All it does is compute the right-hand side and then you issue a command that will be run at the end of all the systems, before frame rendering happens. This allows to write pure threaded parallelism, because you can run several systems in different threads—they are all reading the same data, but they aren’t writing to the buffers, which is great since Python has real support for multi-threading now in 2026 (after 35 years!).

Showcase

That is the basic idea. Now lets see some examples. AS of today, version 0.2, manifoldx has some basic shapes like cubes, spheres, and planes, and support for basic PBR lighting, camera controls, and that’s basically it.

All the engine realy does is set up this somewhat clever inversion of logic that forces you to write very efficient code, and the magic is in what you do inside the systems.

So let me show you three examples.

1. N-Body Gravitational Simulation

The first is an N-body simulation. All gravity computation happens in a single NumPy block with no Python loops. The only relevant part of the code is the gravity system, that looks something like this.

@engine.system
def nbody_gravity(query, dt):
    pos = query[Transform].pos.data  # (N, 3)

    # All-pairs position differences: (N, N, 3)
    diff = pos[None, :] - pos[:, None]
    dist = np.linalg.norm(diff, axis=2)

    # Force magnitude: G * m_i * m_j / r²
    force_mag = G * mass_prod / np.maximum(dist, SOFTENING)**2

    # Net force = sum over all other bodies
    net_force = (force_mag[:, :, None] * diff / dist[:, :, None]).sum(axis=1)

    velocities += (net_force / masses[:, None]) * dt
    query[Transform].pos += velocities * dt

This runs 500 bodies with 250,000 force pair computations at 60fps.

2. Ideal Gas Simulation

The second example is an ideal gas with elastic collisions inside a bounding box. Again, all running without a single for loop. Collision detection and impact resolution in vectorizednumpy operations.

@engine.system
def gas_physics(query, dt):
    pos = query[Transform].pos.data

    # Wall collisions: vectorized mask
    below = (pos + velocities * dt) < -BOX_HALF
    above = (pos + velocities * dt) > BOX_HALF

    # Here we avoid branching and use masking instead
    velocities[below] = np.abs(velocities[below]) * RESTITUTION
    velocities[above] = -np.abs(velocities[above]) * RESTITUTION

    # Particle collisions: find overlapping pairs
    diff = pos[None, :] - pos[:, None]
    dist = np.linalg.norm(diff, axis=2)
    overlap = dist < 2 * PARTICLE_RADIUS
    i_idx, j_idx = np.where(np.triu(overlap))

    # Resolve collisions with impulse
    # ... (collision resolution code)
    # ... (also vectorized)

    query[Transform].pos += velocities * dt

Here’s how that looks like.

3. Boids Flocking

The third example is a Boids simulation with emergent flocking behavior. This is the one that strikes me the most because boids simulation is often compute-heavy. Each individual entity must keep track of a subset of neighbors and adjust behavior based on them, not the whole set of entities. But again, a bit of numpy magic lets us vectorize the crap out of this and simulate 300 boids at 60 frames per second.

@engine.system
def boids_physics(query, dt):
    # Separation, alignment, cohesion as vectorized tensor ops
    diff = pos[None, :] - pos[:, None]  # (N, N, 3)
    dist_sq = (diff * diff).sum(axis=2)

    neighbors = dist_sq < PERCEPTION_SQ

    # Separation (1/dist² weighted)
    sep = (-diff * (neighbors[:,:,None] * inv_dsq[:,:,None])).sum(axis=1)

    # Alignment (average neighbor velocity)
    avg_vel = (vel[None,:] * neighbors[:,:,None]).sum(axis=1) / safe_count

    # Cohesion (steer toward center of mass)
    center = (pos[None,:] * neighbors[:,:,None]).sum(axis=1) / safe_count

    # Plus predator avoidance and boundary steering...
    # That one is easy.

Here goes a sneak peek.

You can check all the examples in the Github repository to see the full code, but the bulk of the implementation is these cleverly vectorized system methods.

Future Directions

And that’s it. This is my pure Python (well, you know what I mean) graphics engine for serious, grown-up stuff that is surely, definitely, not a weekend side-project meant to procrastinate on actual work... I mean, what?!

Where I will go with this? I don’t know. I always write these things mostly as a learning exercise and I’ve learned a lot about graphics in Python. I’ve updated my view of modern graphics and I think I’ve paid my debt of the last seven years in graphics computation. I’m kind of happy now that I know how to do this in 2026. Mission accomplished, I guess.

There are some places this engine can go to, like some custom shaders when you need stuff like lighting effects. But it is not going to become a traditional, full-blown game engine. I will not add support for lots of game engine-like features including, I don’t know, skeletal animations, level of detail, scene management, or, god forbids, visual scripting and nonsense like that.

Now two areas I’d like to explore in the future. One is extending the engine towards the kind of behavior you need to write for AI simulations. If you want to run some sort of agent simulation or ant colony optimization or stuff like that, that code doesn’t look that much as a frame-by-frame update, but like an asynchronous event-loop—which is also something that is not usual in game engines. And the other direction is towards procedural generation of meshes and content in general, which is an area I left five or six years ago and would pretty much love to come back to it.

And that’s it for this week. This is not production-ready at all—it’s mostly a toy at the moment—but you can take it apart and hack your way into some cool physical or mathematical simulation. The code is on GitHub if you want to try it yourself, and I’d love to see what you build with it.

Until next week, stay curious.

AI Coding Agents, Deconstructed

Alejandro Piad Morffis — Thu, 02 Apr 2026 13:40:59 GMT

I’m telling you, this is the future. AI agents will do aaaallll the work. Photo by Farzad Felfelian on Unsplash

You’ve been using AI coding agents for months. You’ve crafted elaborate system prompts. You’ve added a dozen skills. You’ve learned the dance of context window management. And somewhere around the third hour of work, something breaks. The agent starts forgetting things. Making wrong assumptions. Doing something close—but not quite—what you asked.

This isn’t a failure of the model. This is a failure of the system.

To be sure, better models make things easier. And models are getting better by the day. But no matter how good a model is, bad systems lead to bad outputs. Even the smartest people produce junk when fed with incorrect assumptions or given incomplete instructions.

In contrast, a good system with clear boundaries and explicit rules, that leaves the exact amount of flexibility necessary, makes creativity and productivity thrive.

You see this day and night in teams (of real humans) in every industry. It’s not often the smartest person in the room that solves the hard problem. It’s when you combine the right kinds of intelligence with the right kind of system that things click.

In this article, I want to make the case for a structured way to think about Large Language Model (LLM)-based agentic systems (mostly for coding, but also for knowledge work in general) that fixes some of the greatest pains I (and I sure most of you) have been facing when trying to scale AI-assisted workflows to professional levels.

It’s a system that puts the right constraints in the right places and leaves just enough space for creative exploration (or however you want to call what LLMs do when they hallucinate in your favor). It’s also a system that makes it clear you are in charge.

Everything an AI agent does happens inside a context window. System prompt, user input, tool results, skill injections—they all live there. The agent’s only mechanism for action is the ReAct (Reasoning + Acting) loop: think, call tools, observe results, repeat. Each cycle grows the context. Each skill activation injects more.

This creates a fundamental tension: context is power, but context is finite. Too little and the agent can’t connect the dots. Too much and the important stuff drowns. The gap between those two failure modes is narrow—and most agent frameworks ignore it entirely.

I’ll walk through why current systems fail, introduce a four-element framework for thinking about agentic architectures, show you how these principles apply across three domains, then present a vision for better AI harness engineering.

Part I - The Symptoms

To understand the problems we first need to understand how a standard agentic loop works. The typical architecture is what’s called a ReAct loop. The LLM runs in a loop that determines the next action given context, which can be read some files, ask the user, invoke a tool, inject a skill, etc. When the agent decides no more actions are necessary, the loop ends and the user is given control back to continue the prompt.

That’s it. All the seemingly supersmart behaviours of Claude Code, Gemini CLI, and Codex are, under the hood, some form of the basic ReAct loop. There are of course nuances. For example, most systems decide that if the agent calls the same tool with the same args three times, it must be stuck in a loop and stop the turn. There are perhaps hard limits on how many tool calls the agent can do in each turn.

Context is the bottleneck. Not the model. Not the prompt. Context.

The agent doesn’t have memory. It doesn’t have state. It has context. Everything it knows about your project, your preferences, your conventions, all of it lives in the context window. When you add a skill, you’re injecting more context. When you run a tool, the result goes into context. When you switch modes, you’re switching which system prompt is active, all still in context.

This means context engineering is AI agent engineering. The agent’s behavior isn’t determined by the model alone, or even primarily, but by what context you give it, and how you structure that context over time.

Most tools treat context as a solved problem. They stuff everything in and hope the model figures it out. In-context learning seems almost magical, but it has limits—and those limits become visible fast.

When context is thin, the agent simply doesn’t know enough about your project to make informed decisions. It relies on baked-in assumptions from training and falls back to consensus instead of following your style: it uses the common tools and practices it learned from pretraining. This often means it uses slightly old and outdated tools and practices.

So you do the sensible thing, and inject project-specific information into the context. But then if context grows too large, even if it doesn’t technically exceed the model’s capacity, things start to get lost in the middle. Moreover, failed tool calls, wrong assumptions the model had to correct, etc., start creeping up in context, not only taking up valuable space but also, and more importantly, distracting the model and biasing it towards mediocre decisions.

Then there is context compaction: when the context fills in to about 85%, most systems will invoke a special prompt to instruct the agent to summarize the current state. These prompts vary in detail, but often involve asking the agent what it is immediately doing, where is it stuck, what has failed, etc. Clever, but a hack nonetheless. This hard context reset means the agent will forget important nuances in the current conversation and will repeat past mistakes. It’s frustrating.

Let’s look at how these problems surface in specific symptoms that all LLM-based agents display at some point.

Symptom One: Unstated Assumptions

The first failure mode isn’t dramatic. It’s quiet. You ask the agent to write a test, and it writes a unittest.TestCase instead of a pytest function. You ask it to add a dependency, and it edits requirements.txt instead of running uv add. You ask it to deploy, and it pushes directly to main.

These aren’t model failures. They’re assumption mismatches. The agent doesn’t know how your team does things. There’s no guardrail for “in this project, we always use pytest, we always use uv, we never commit directly to main.” The agent improvises from general knowledge, and general knowledge is often wrong.

Skills are supposed to fix this. Add a skill document that says “use pytest” and the agent should know. But skills introduce a new problem.

You add a skill for code review. Then one for documentation. Then one for PR descriptions. Then three more for your company’s specific stack. Each skill seems small. A few hundred tokens each. But they pile up—always-on knowledge the agent carries but can’t prioritize.

The result is context bloat. The agent can’t tell what’s relevant in any given moment. So it blends everything together, and hallucinations increase. More skills made it worse—not better.

Symptom Two: Permission Leakage

Every agent framework implements the same plan then build pattern. The idea is sound: think first, plan second, execute third. In practice, the boundaries leak.

Plan mode is supposed to be read-only. Design the change, review the approach, lock in the scope. Build mode is supposed to execute. Write the code, run the tests, commit the result.

But “plan mode” in most tools is just a prompt. There’s no enforcement. The agent can write code in plan mode if it wants to. It can ignore the plan in build mode. It can skip straight to implementation if the prompt implies urgency. The modes are suggestions, not constraints.

This matters because a plan only works if it’s actually followed. If the agent can deviate mid-execution—if “plan mode” and “build mode” are just prompts with different names—the plan becomes advisory. And advisory plans get ignored.

The second problem is structural: there’s no artifact that passes from plan to build. The plan lives in the context. By the time build mode starts, the plan is mixed in with everything else the agent said. Which file was the plan? Which changes were approved? The agent has to re-read the conversation to remember. Context saturation accelerates.

Symptom Three: Context Saturation

After extended work, you see the same pattern: the agent makes 95% of the progress, then fails on the last 5%. It nails the architecture. The logic is sound. The core implementation works. Then it stumbles on a detail—because context has saturated. It forgot which environment it was in, which conventions still apply, which constraints matter.

But the deeper problem is internal noise. The agent keeps everything in context: all internal reasoning, all tool calls, all results. This is fine for minute-to-minute action. But after four failed attempts to solve something, the old tool calls are just noise. These were attempts that went nowhere, just add cost and accelerate saturation.

The supposed solution for this is context compaction. But this creates a lossy summary problem. The agent is supposed to leave a trail for its future self. After context compaction, it should be able to pick up where it left off. But if agents struggle with long contexts, how are they supposed to build a good trail? The compaction report is only as good as the agent’s ability to summarize. And summarization is lossy and injects back lots of unstated assumptions from pretraining.

The frustrating part: this wasn’t a hard problem. The agent had all the knowledge it needed. But context filled with noise, and the important bits got pushed out. More tokens in, less signal out.

The solution isn’t just better prompts or larger context windows. Yes, these help. But the symptoms are systemic, so the solution must be a system overhaul.

Let me show you how that system looks like.

Part II - The System

Now that we understand the problem, let’s look at how every agent system actually works. Every AI agent system addresses four concerns. When you conflate them, the system breaks. When you separate them, the system scales.

This taxonomy isn’t original to me. It’s a synthesis of how modern AI agentic systems work under the hood. Most explicitly, it’s implemented in the OpenCode CLI (opencode.ai), but all other tools follow a similar pattern, even if they use different names.

Here’s the breakdown. Every agent system you’ll encounter (explicitly or implicitly) is managing these four things:

Mode — the who. A mode is the persona the AI adopts. It defines the thinking style, the permissions, the available tools. When you interact with a “code assistant,” you’re in a coding mode. When you switch to “creative writer,” you’re in a creative mode.

Modes are explicit. They’re top-level system prompts that define behavior and permissions. You tell the agent: “This is how you should think and behave. These are the tools you can use. These are the parts of the filesystem you can write to.”

Skill — the knowledge. A skill is knowledge the agent can recall when necessary. It doesn’t get invoked explicitly, it gets applied implicitly when necessary. When you give an agent knowledge about SQL optimization, that skill is available whenever relevant. The agent doesn’t need to be told to use it. The ReAct cycle injects it when it deems suitable.

Unlike modes, skills can layer. An agent might have a SQL skill, a documentation skill, and a debugging skill, all active simultaneously, all contributing when relevant. Skills are implicit because the agent should just apply them naturally. They can also contradict or complement each other. In-context learning should be capable of using them in a combined manner.

Command — the workflow. A command is a script. It tells the agent: do this, in this order, using these tools. “Refactor this function” is a command. “Run these tests and report results” is a command.

Commands are explicit: you invoke them. Under the hood, commands are just prompts. The difference is who injects them: the user. When you run /build, you’re injecting a workflow prompt into the agent’s context. That’s it. The command tells the agent: do this sequence of things. The complexity lives in the orchestration of the ReAct cycle, not the command itself.

Commands are intentionally simple. They don’t contain knowledge. That’s intentional separation of concerns. The command itself shouldn’t know how to build; it knows when to spawn subagents and which mode to use. This keeps commands thin and changeable without rewriting underlying knowledge.

Subagent — the delegation. A subagent is a spawned agent for background or parallel tasks. It handles isolated work, returns summarized results, then disappears. It is instantiated with a system prompt and specific instructions (synthesized by the primary agent that called it), and runs for one full ReAct turn.

Subagents are ephemeral. Their internal reasoning stays private. The main agent only sees the synthesis. You spawn a subagent when you need parallel processing, isolation, or both. They are the way to fork, solve a specific subtask, and return a result, but keep context clean. Kind of like subroutines.

Why This Separation Matters

Understanding this distinction unlocks everything else. Once you see skills as implicit knowledge and commands as explicit scripts, the rest of the architecture clicks naturally. Most agent setups conflate these. They embed knowledge in commands. They make skills behave like workflows. They mix persona into everything else. And the massively underuse subagents.

When you separate these concerns–modes for persona, skills for knowledge, commands for orchestration, subagents for delegation–you get something that looks like good systems engineering. You can swap skills without touching commands. You can change modes without rewriting workflows. You can spawn subagents without the main agent knowing or caring how they work internally. The result is a system that works and adapts and scales like good software should do.

The system scales because the pieces are independent. Change one without breaking the others. Each component has a single job, and the boundaries between them are meaningful. When context shifts, when requirements evolve, when a new skill needs adding, the system adapts incrementally rather than collapsing under the weight of accumulated complexity.

Part III: The Practice

If so far this seems like abstract theory for you, in this section we will ground these concepts in actual practice. Let me show you how I’m using these ideas today to improve my AI-assisted coding practice. I’m using opencode.ai but I believe the following is easily adaptable to any agentic toolkit out there.

My Three Modes

Every agentic system needs boundaries, not social contracts, but enforced constraints. In my setup, those constraints come from three modes: analyze, design, and create.

Each of these modes defines a thinking style—a persona—and a set of constraints for tool use and filesystem access.

Analyze mode is research and investigation. This mode reads your work and writes summaries to a knowledge base. It cannot touch production files. Not “should not” but cannot. The permissions are built into the mode itself, not enforced through prompts or warnings. The agent is incapable of writing outside of a .playground folder, and is incapable of doing anything that can harm the project or the system (more on how a bit later) but it is still capable of running arbitrary code, download anything from the internet, and play around as it needs.

Design mode is architecture and planning. This mode bridges analysis and implementation. It can read your project and write design documents, architecture diagrams, and implementation plans, but still cannot touch production code. It cannot run shell scripts either, at all. It can look at git status and logs, read folder contents, etc., but it can only write to a space where plans and design documents go.

Create mode is execution. Full read-write access. This is where production work happens. The agent can write code, create files, and modify the project directly. Again, it cannot do anything outside the project scope, though. It won’t accidentally change /etc/host(s)1 even if it tries to.

The key insight: modes define permissions, not just persona. You can’t accidentally prompt your way into code generation during research. The agent literally lacks the capability. The agent doesn’t need to “understand” these constraints, it simply operates within them.

Mode is the who, and it determines what the agent can do, not just how it thinks.

Let me show you how they work in three different domains that make the bread and butter of my daily job: software development, scientific research, and technical writing.

I chose these domains because they illustrate the simplicity and scalability of the system. Software development shows the framework under constraints: deadlines, production code, real stakes. Research shows it under complexity: synthesis, evaluation, structured output. Technical writing shows it under nuance: voice, audience, iterative refinement. Three different pressures, one consistent architecture that works in all three cases.

In each of these domains we have two layers to go through: first is the set of implicit skills that are available to the agents, and second is the set of explicit commands (each tied to a specific mode) that setup concrete workflows. I will show you one example workflow that cross-cuts across the three modes in each case. I will also tell you exactly where delegation occurs.

Domain A: Software Development

Software development is where agentic systems face the harshest constraints. Production code has stakes. Deadlines are real. Mistakes cost money. Let’s see how the framework applies.

Implicit Skills

A software development agent carries knowledge it never needs to be told to use. It knows language idioms and patterns like the idiomatic way to write a list comprehension in Python, or the conventions for error handling in Go. It knows testing conventions: where tests live in the directory structure, how they’re named, what assertions to prefer. It knows architecture conventions: layered structure, dependency injection patterns, how error states propagate. It knows code review standards: what to flag, what to praise, when to ask for clarification.

Example Workflow: Bug Hunting

I use this workflow for finding and fixing bugs. It starts with investigation. The agent spawns dozens of subagents to try and break the system (either guided towards a purpose, or completely unbiased). Then you build a comprehensive plan to solve it. And then you execute that plan. Simple, right?

Phase 1: /trace (analyze mode) runs systematic experiments to detect and narrow down a bug’s cause. The agent examines stack traces, compares behavior across commits, and pinpoints the exact files and functions that need attention. This mode is read-only by design, except for a .playground folder. Research happens here, not in the code itself.

Each experiment is run on a subagent that has the job of verifying one assumption. The main agent receives only experiment results, and constructs an executive report of findings. This means you can run dozens of different experiments autonomously to detect what breaks what.

Phase 2: /plan (design mode) takes the diagnosis and defines the changes needed, along with their architectural impact. The agent reviews the affected modules, considers alternative approaches, and documents the implementation plan before touching anything. This is where the scope gets locked in.

The result of this phase is a structured plan with step by step details on what files must be touched and what must be done in there (semantically, not code). For every phase, it defines success criteria: what must be validated before we can say we got that phase right.

Phase 3: /build (create mode) executes the plan step by step. The agent writes tests first (following Test-Driven Development (TDD) discipline) for the success criteria defined for that phase and watches them fail. Then it launches a coding subagent that has read-only access to tests, so it cannot cheat and change the tests.

The subagent attempts to implement changes that make the test pass. If it succeeds, the main agent commits and moves on. If it doesn’t, the main agent retries a few times. If there is no progress, the main agent resets the work tree (no harm done), and reports on failure. This usually means the plan needs revisions.

Domain B: Research

Research is where agentic systems face the greatest complexity. Sources multiply, methodologies diverge, synthesis requires judgment. Let’s see how the framework applies.

Implicit Skills

A research agent knows the conventions of academic writing without being reminded. It knows citation formats like APA, MLA, Chicago, and IEEE, and when to use each. It knows how to evaluate papers: methodology soundness, sample size adequacy, replicability claims, conflict of interest disclosures. It knows the structure of literature reviews: how to organize by theme, methodology, or chronological development. It knows domain-specific terminology, distinguishing between “accuracy” and “precision” in machine learning, or between “confounding” and “colliding” in causal inference.

Example Workflow: State-of-the-Art Report

Phase 1: /research (analyze mode) spawns subagents to gather sources in parallel. Each subagent reads a batch of papers, synthesizes findings, and returns summaries. The main agent synthesizes those summaries into structured notes. This phase can be run multiple times to collect batches of sources without overwhelming context. At the end, you get hundreds of sources summarized into clean research notes.

Phase 2: /outline (design mode) identifies patterns across the collected literature. The agent groups papers by methodology, extracts recurring findings, and maps the landscape of the field. It generates outline options for the final document, based on typical structures like problem-solution or paradigm-methods, highlighting gaps where the research is thin and consensus areas where findings align.

Phase 3: /draft (create mode) builds the document section by section, following the outline. Each section draws on the structured notes, weaving together sources into coherent narrative.

The agent launches subagents for writing each subsection because typically, agents write more or less the same length in a single write command, so if you ask it to fill in a large outline all at once you’ll only get a mediocre extended outline. By launching independent writers for specific sections of the outline, you get all the attention of a single turn to read source material and write a good 4 or 5 paragraphs for a concrete section.

A cool idea I’ve been meaning to try is have the main agent can spawn several subagents to write the same section, with a high temperature, and then perform some sort of aggregation or evaluation before building the final draft for every section. This burns through 3x tokens but ensembles have been shown over and over to improve AI models outputs. If you try it, let me know.

Domain C: Technical Writing

Technical writing is where agentic systems face the most nuance. Voice matters. Audience varies. Iterative refinement is the norm. Let’s see how the framework applies.

Implicit Skills

A technical writing agent carries knowledge of prose style without being coached. It knows voice and tense conventions—active voice for clarity, past tense for completed processes, second person for direct instruction. It knows structural patterns: how documentation differs from blog posts, how reports differ from tutorials, how reference material differs from guides. It knows audience awareness: what to explain for newcomers, what to omit for experts, when to elaborate and when to abbreviate. It knows cross-referencing and linking norms: when to link, when to inline, how to name anchors for scannability.

Example Workflow: Paper Review

Phase 1: /review (analyze mode) performs detailed review in a specific order: structural issues first, then content, then style. The agent examines the narrative arc—how main points connect, whether the flow makes sense, before worrying about grammar or word choice. This ordering matters; reviewing low-level details when high-level problems exist wastes effort.

Each iteration is performed by spawning several subagents that focus on specific types of problems, like transitions, unverifiable claims, etc. Each subagent returns a structured list of issues, pointing back to exact line numbers and phrasing. Then, the main agent edits the original paper and injects markdown comments in every marked issue, next to the paragraph, or under the header where it best fits.

Phase 2: /revise (design mode) plans changes to specific sections, prioritizing by review type. The agent maps structural fixes to particular paragraphs, content additions to thin sections, style improvements to verbose passages. It produces a concrete plan, section by section, change by change. Then it goes into the manuscript and writes markdown comments as replies to the existing review comments, thus grounding the revision plan in the exact context it must fit.

Phase 3: /rewrite (create mode) follows the plan. The agent revises sections in priority order, applying structural changes first, then content, then style. Again, each step is performed spawning a subagent tasked with just a change (for style changes we actually do it section by section).

The subagent doesn’t edit; it produces a draft revision that the main agent is then tasked to paste into the document where it fits. Crucially, the main agent is instructed to leave the editorial comments but mark them as solved, with a short trail of what was changed. This works wonders for a later human review phase.

Part IV: A Look into the Future

These workflows work, but with some caveats. There’s a gap between “working” and “working well.” Three key pains remain in my implementation.

Long commands are hard to follow when given as a single prompt. The fourth step gets forgotten since it is buried at the beginning of the context.
Permissions as currently implemented are all-or-nothing. You either have shell access (destructive) or you don’t. I want broad permissions (run whatever you want) with provable security (nothing you run can change this file).
Context saturation still happens even with delegation. After a while, the agent will have to compact context, and this usually means you lose important information.

I have three ideas for closing this gap. The first is about how commands work. The second is about security. The third is about context management. They are in different levels of implementation, so let me show you what I’m building toward.

Idea One: Better Commands

Commands in most tools (Claude Code, Gemini CLI, Codex, Copilot) are one-shot interactions: you invoke the command, a single massive prompt is injected. The agent runs until it decides to stop.

To make commands truly useful, we need to be more like scripts. Here’s what that means:

Commands that inject prompt instructions one step at a time, waiting for the agent to do a full turn each time. Instead of dumping a large prompt to run all steps at once, a command like /review could insert surgical mini prompts that say “read the file”, wait for the agent, “analyze structure”, wait for agent, and so on, until “write the report”. This massively reduces the problem of lost-in-middle context saturation. Each turn the agent is focused on one specific step, and you get N times the compute power to solve an N-step workflow.
Commands that extract structured information from the agent response, and can later inject variables back into prompt. This allows to reinject important information into later prompts, keeping important information as a contextual variable, not just a string lost in the middle of the prompt. But it allows for something else.
Conditional branching based on context or user input. Once we have structured parsing and contextual variables, we can inject different prompts based on whether the agent succeeded or failed. If the plan reveals a breaking change, route to architectural review. If it’s a bug fix, route directly to implementation. The command adapts its path based on what it discovers.
Finally, commands that embed and execute external scripts. Instead of asking the agent to run some script, the command can run arbitrary Python, JS, Bash, or whatever, to, for example, transform structured information. The command becomes an orchestrator of other processes.

Basically, what I’m asking for here is a Domain-Specific Language (DSL) for guiding agents in a far more structured manner, but still having the power of arbitrary prompts for flexibility. Mixing code and prompts in this way gives us the tools to find the precise balance between constraints and capabilities.

If this sounds exciting, I’m happy to tell you this is already doable, to some extent. Check out my literate-commands project for an OpenCode-specific implementation of these ideas. It’s still a bit rough around the edges, but it works much better than plain, single-prompt commands.

Idea Two: Sandboxed Security

Most agentic tools have very coarse permission settings. You can allow, deny, or set a specific tool to “ask” mode, which means the agent will pause and emit a notification for the user to give permission.

This works fine for coarse-grained permissions like read-only access, or write but no shell. In OpenCode, you can even define permissions for specific paths, or even specific shell commands (with simple glob patterns, so you can, e.g., allow ls * but reject all other shell commands).

However, even in this case, I find these permissions too restrictive. They are conflating two different dimensions into one–what tools the agent can use, and what side-effects can those tools have.

For example, say I want to give my agent git access but only for reading operations. How do you achieve that? You need to list all safe patterns like git ls-tree *, git status, git log *. But what about git branch? Depending on the arguments, this subcommand can have read-only or write side effects. And then think about pipes, shell substitution, custom bash scripts, or worse, python *.

If you want your agent to be capable, you need to give it access to a wide variety of tools. For example, my bug-hunting workflow depends on the agent being able to execute arbitrary code that it synthesizes on the fly. However, I want guardrails. There is simply no way to whitelist all possible commands. We need separation of permission to run a command and permission to modify the system.

The solution, of course, is some form of filesystem isolation. The most obvious one is wrapping all shell execution in Docker, so commands run in a container with proper constraints. This creates all sorts of other problems, which I can discuss in a future post, but for now, it remains my best (and simplest) solution to robust sandboxing.

And this isn’t just about safety, though. When you know the agent can’t accidentally wipe your home directory or exfiltrate your API keys, you can let it do more. Security enables capability. You can let the agent download arbitrary code from the internet, run arbitrary scripts, break things and observe changes. Everything happens inside a Docker container with precise constraints that enable maximum capability with absolute security.

As of now, I kind of implemented this as a plugin for OpenCode, but it’s still in beta phase and not ready for widespread use. More on this idea in a future article.

Idea Three: Context-Aware Execution

And finally, we need to rethink the whole oversimplistic ReAct loop that simply grows the context linearly. The agentic cycle doesn’t have to be a straight line. Real work branches: you explore options, try things, backtrack when they fail. The context should reflect that.

I’ve been designing a system where the context never saturates. It branches when you’re exploring, spawning parallel contexts for different approaches. It prunes old tool calls that went nowhere. It removes internal reasoning that no longer matters. It maintains a “trail” that actually works: a structured record of decisions, not a lossy summary.

The goal is simple: keep context between 40% and 60% saturation at all times. Not by compacting a 150K tokens context down to 10K—which kills all understanding the agent had achieved—but by never letting it grow unchecked.

Nothing like this exists yet, so I’m building it, but it’s a story for another day.

Conclusion

The main takeaway from this article is not that my system is better. It’s that you can design your own system to adapt perfectly to your workflows if you clearly separate concerns. The main modes are for establishing an overall persona–inquisitive and critical, versus detailed and forward-looking, versus focused and action-biased–while skills incorporate domain knowledge, and commands act as precise workflows.

The workflows I described are real, based on actual commands and prompts I’m using in production code. But I have abstracted them a bit to make them easier to understand in the context of an arbitrary agent, not tied to specific idiosyncrasies of the tool I happen to be using at the moment. If you want to see and try for yourself a concrete implementation of these ideas—still imperfect, but working nonetheless—check out my opencode toolkit repository. It’s still pretty much work in progress, so use it with care.

In future articles I will explore specific problems in more detail and discuss concrete strategies to implement powerful workflows that keep you, the user, in absolute control, while delegating the majority of the grunt work.

And, as a final remark, I’m seriously considering building my own CLI agent. I know, I know. Reinventing the wheel and all that. But my plan is not to compete with any of the professional tools out there. What I always care about is understanding things deeply, and as my computer science career has taught me so far, there is no deeper understanding than the one you gain from actually building stuff.

So stay tuned for that. I will share progress as usual in the form of educational articles, so you’ll get to see under the hood how to build a fully functional CLI agent with tool calling, context compaction, skills, commands (the powerful ones, not the cheap single-prompt injection), subagent delegation, sandboxing, and all the engineering design hurdles that come with it.

Until next time, stay curious.

Fun quirk. Typing /etc/host plus the s makes Substack silently fail on draft save, some sort of ill-defined security rule, I suppose. What the f…

How I'm Using AI Today

Alejandro Piad Morffis — Mon, 02 Mar 2026 21:13:04 GMT

Photo by Todd Quackenbush on Unsplash

I know, this is the probably the twentieth article you’ve read this week on “how I am using AI to automate my whole life/work/whatever”. Me too. But please, give me a couple of minutes to tell why this article might be different, and thus interesting to you.

But even if you don’t have two minutes to spare, please check the repository where all I’m going to tell you about is implemented, ready for your taking. Clone it and play with it, then if you like it, come back and read the rationale behind it.

Done? Ok, here we go.

So yes, this is another article trying to explain to you how I use AI coding agents (specifically Gemini CLI, but the specifics don’t matter) to enhance my workflows. Here are a couple of reasons why I think you might be interested—and why this article might be different to so many lookalikes out there.

First, I’m not an enthusiastic techbro who just discovered AI. If you’ve read this blog before, you know I’m a longtime researcher in AI—way before LLMs were a thing—and also a self-proclaimed AI anti-hypist. I’m not just overexcited about this shiny new toy. I’ve been using generative AI since day one for everything, and I’ve been telling you exactly how it sucks at almost anything important since day one. Coding has been the same until very recently. I can tell you the change in productivity is real, provided you are responsible and considerate.

Second, my approach to incorporating AI into my workflows is very careful and grounded in a large dose of healthy skepticism. I know firsthand how these things fail, so my approach attempts to be very robust to hallucinations and context drift and all the plagues of even the most powerful LLMs.

Third, I have a thing for systems. What I’m going to show is not just a set of hacks or clever prompts or productivity tips. It’s a principled system to go from ideation to research and planning to execution at the fastest responsible speed, without sacrificing on safety or maintainability.

And fourth, I have kind of a unique position in that I’m both heavily invested into coding as well as technical writing. I’m a college professor, so I do a lot of research, writing, and editing; but I also run a small AI startup, small enough that I get to do a large part of the coding. So my system attempts to bridge these two facets—code and prose creation—with the same unified principles of careful deliberation and planning, and robust tracking of the project evolution.

If you’re curious already, let me start by exposing the overall principles behind this approach, and then we’ll dive (not delve, but close) into the details.

Subscribe now

Principles of Effective AI-Assisted Work

By far the most pressing limitation of modern, top-tier LLMs for my line of work is context saturation. What I mean by this, is no matter how many tokens your model can fit (Gemini 3 claims to chug as much as 1 million tokens), when you work for a considerable amount of time on a single project, you will exhaust this context. And even if the context window isn’t full, the model will quickly lose the capability to keep track of the important bits of context, and will start to deviate from your instructions and make up its own agenda. Not out of evilness but simple probabilities.

The way you see this problem when using Claude Code, Gemini CLI, Codex, Copilot, or anything similar is twofold. The model will either forget midtask what is was supposed to do, but it won’t simply fail; it will reinterpret the task based on its faulty, lossy, blurry viewpoint given the available context and it will do something close, but not quite what you wanted. Or, the model will make faulty assumptions and forget to ask relevant questions, again behaving close but not quite exactly as you intended.

The result is always the same, you get frustrated that you achieved 95% of what you needed, but the remaining 5% is harder to fix than to just try again. And what could have been a happy working session where you get important and difficult things done quickly derails into a session of arguing with an LLM trying to convince it (him/her/pick your side) to do things the way you want.

Barring any fundamental paradigm shifts in the near future, under the assumption that LLMs will keep working as they are, the only way to fix this is to be very conscious and careful about the context—what has been dubbed context engineering—in two senses: First, do not pollute the context with unnecessary details. And second, re-inject into the context whatever is relevant for any given task, so it doesn’t get forgotten.

I know, kind of contradictory, but tradeoffs are what engineering is all about. In my system, I’ve applied three principles to help me manage these tradeoffs effectively.

The important things should be made explicit.
Resist the urge to guess.
Delegate, delegate, delegate (yeah, three times).

Here’s how that works. For principle one, we will keep track of everything important in markdown files in the repository. This means ideas are committed to plans in markdown files before acting on them, research is summarized and stored in real time, and everything that changes in the project is logged to a long-lived journal, so the model remembers why we made some decision months ago.

For principle two, we will favor using explicit commands that are translated into explicit prompts, instead of relying on implicitly activated skills that you have little control of. So if you want the model to make a plan, you will prompt it with “/plan lets design feature X”, and the /plan command will invoke a carefully crafted prompt that says how plans work, where they are stored, etc.

And principle three means using sub-agents a lot. This is a Gemini CLI specific feature—but every other coding agent has a similar thing—where you can launch a complicated task as a “sub-agent”—which basically means a custom prompt—but here is the important part: All the context of that sub-agent is kept private, not shared with the main agent, so the internal reasoning the sub-agent needed to run to find 20 different sources in Google does not pollute the main context. We only receive back the summarized responses. This allows running very long tasks (my record is a 30 minutes long research loop, involving hundreds of retrieved web pages) on a single agent turn, without exhausting the context.

I use four sub-agents in different commands. The planner is the lead architect. It’s a read-only agent that walks through your codebase and reads everything necessary to understand architecture, design decisions, etc., given a specific task. It then produces a detailed Markdown plan in the plans/ directory—a physical source of truth that you can review before any code is touched, and that the main agent will follow step-by-step. This separation prevents the system from “guessing” its way through your codebase.

When I need external knowledge—like a library’s latest API or a specific technical specification—the researcher agent takes over. It scours the web to fetch relevant documentation, which it then synthesizes into granular summaries in the research/ directory. This raw data is then handed off to the main agent to build an executive report annotated and linked to all relevant sources, again all stored already in your repository.

And there are two more agents, specifically designed for technical writing. The reporter agent takes an outline, and a folder of content, and it will write section by section, a detailed account of what the outline requested.

Unlike a standard LLM that might provide a high-level summary, the reporter is trained to expand specific placeholders with deep, evidence-based paragraphs. It draws directly from your research/ files and the project journal/ to ensure every sentence is grounded in the project’s actual state. Finally, the editor provides the final polish, auditing the draft for structural gaps and linguistic tics. It is grounded in a customizable style guide to make sure it always respects your style.

This distributed intelligence is held together by a central nervous system of context files, as per principle one. A journal/ directory provides a chronological record of decisions and progress, acting as a long-term memory for the project. The plans/ directory stores the strategic intent, while a TASKS.md file provides a high-level overview of the project’s current status. This structured environment allows the subagents to maintain a high degree of situational awareness without needing to ingest the entire repository in every turn.

Now that we have the key pieces in place, you can start to see why I think this approach is powerful. It is very extensible—you can add new agents triggered by specific commands to customize any kind of workflow—and it mostly solves the main pain point of modern LLMs, which is precisely the brittleness of long contexts.

Workflow Details

With all that, let me show you the specific commands and workflows I have currently implemented, but keep in mind what follows is but one example of the kind of powerful workflows we can start to automate.

I will divide the rest of the article into a few major areas, and explain the commands and agents that I use in each case, and a bit of the high-level instructions given to each of them.

Discovery & Strategy

The most critical phase of any project occurs before I write a single line of code. I call this the “Discovery and Strategy” phase, powered by the /research and /plan commands. By formalizing this process, I’ve moved away from impulsive execution toward a deliberate, architected approach.

The /research command is my primary tool for external knowledge. When triggered, the researcher scours the web for technical documentation and relevant case studies, synthesizing them into granular summaries in the research/ directory. These files become a persistent knowledge base, allowing me to reference verified facts without leaving my terminal.

Complementing this is the /plan command, which focuses on internal strategy. The planner conducts a thorough analysis of the codebase and the journal/ to understand the system’s current state. After an interactive dialogue to resolve any ambiguity, it produces a comprehensive Markdown plan in the plans/ directory. This document maps out the technical territory and provides a step-by-step execution roadmap.

The strength of this workflow lies in the synergy between these two tools. A /plan operation might reveal a gap in my understanding of a specific library, prompting a targeted /research session. Conversely, a new research finding might shift my technical direction, leading to a refined plan.

Software Development

Once I have a solid strategy in my plans/ directory, I can move into execution. Let’s focus now on software development. I’ve designed four core commands—/issues, /task, /commit, and /release—to eliminate the friction of context-switching between my IDE and my terminal.

The cycle starts with /issues and /task. The /issues command acts as an expert project lead, interfacing directly with the GitHub CLI to analyze open issues and recommend what to tackle next based on strategic impact. For roadmap tracking, the /task command manages a living TASKS.md document. It assesses the value of pending work to ensure my efforts are always aligned with the project’s goals.

As I translate the plan into code, the /commit command brings order to my workspace. Instead of a monolithic “wip” commit that hides the logic of my changes, the system analyzes the git diff and logically groups modifications into cohesive units. It separates a core feature update from a documentation tweak, then proposes a series of atomic, Conventional Commits for my approval. This keeps my version history pristine and easy to navigate.

The final stage is deployment. Manual releases are fragile processes fraught with repetitive checklists: bumping versions and running tests before managing tags. The /release command automates this entire sequence. It verifies the workspace integrity by ensuring a clean git tree and passing tests via make. It then analyzes the commit history to propose the next version bump, drafts a CHANGELOG.md entry, and publishes the final tag to GitHub. This transforms a tedious afternoon of housekeeping into a single-command operation.

But, as you’ve seen, everything happens in tandem with those principles. No important action is taken without my confirmation, and everything gets logged into the filesystem, so all future decisions are grounded in past experience.

Content Creation

Now let’s focus on writing high-quality documentation and long-form articles. This is perhaps the most sensible part of the article (and the system) because people are very sensitive today with the topic of AI writing—and rightly so. Again, my intention here is to enhance how I work and get stuff done. If you’re writing for the pleasure of doing it, that’s totally fine, you probably don’t want any help there.

Anyway, the approach is built on the same cognitive foundation as the development path: the research and plans gathered during the discovery phase should serve as grounding for writing.

It starts with the /draft command. In its initial phase, the system performs a deep scan of the research/ and plans/ directories to identify the key themes relevant to the requested topic. If the foundation is too thin, the system will pause and suggest a /research or /plan cycle to ensure the draft has sufficient substance. Once the context is validated, the workflow enters an interactive “Outline Creation” phase. Rather than guessing at a structure, the system proposes a detailed Markdown outline. This collaborative step allows me to set the narrative arc and logical flow that I want, iterating on the high-level structure of, say, a technical article, before committing on the details.

Once the outline is locked, the /draft process initializes a skeleton file—complete with section headers and strategic placeholders—and then moves into an iterative, section-by-section expansion. Here, the reporter subagent takes the lead. Guided by the specific context of each section, the reporter weaves together research summaries and technical specifications into professional prose, all grounded on a style guide document.

Because the expansion happens in granular steps, the system maintains a high level of detail that a single-shot generation would inevitably lose. The result is a first draft that is structurally sound and rich with technical depth.

However, a first draft is rarely the final word. It will always sound AI-ish, and for many other reasons, it is rarely good enough. To achieve professional quality, I use the /revise command, which runs a structural and linguistic audit powered by the editor subagent following the same style guide.

Unlike a simple “check my writing” prompt, the editor performs a deep analysis of the document’s flow and tone. It identifies logical gaps where more evidence might be needed and highlights awkward phrasing that could obscure my intent. And crucially, this isn’t an automated “fix-all” tool; it’s an interactive process. The system presents its findings and proposes specific improvements, which I can then review or approve.

This collaborative refinement process ensures the final output maintains a consistent, professional voice while benefiting from the speed of the AI. By using /revise, I can surgically improve the text to enhance clarity and impact without losing control over the narrative.

But, in any case, I always find necessary a manual review and editing after all the AI enhancements. It shouldn’t be a surprise to you that this article is written in this way, but what you’re reading now is probably 80% different to what the final /revise iteration gave me. There is only so much you can prompt an AI, and that final human touch is not part of it.

But that’s good. This automates the first 80% or so of compiling a gazillion sources into a coherent narrative, and leaves the remaining 80% of polishing for me, which is the part I actually enjoy about writing.

Background Tasks

But there’s more. All of the above is what happens during, let’s say, the work day. That’s me sitting in front of the terminal, typing commands, approving stuff, fixing and redirecting, etc. Being an orchestrator.

But the real magic of AI-assisted development is what happens when you’re not looking. How you can leave you AI assistant working through the night, compiling sources, fixing bugs and proposing pull requests, enhancing the test suite, burning tokens your behalf.

To achieve this, I built an automation layer via the /cron command. The heart of this automation is the cron.toml file. This configuration file allows me to define scheduled tasks with a simple, declarative syntax. Each task specifies a name, an execution schedule, and a natural language prompt for the AI to execute.

For instance, I can schedule a task to perform “Background Research” every midnight on the unfinished tasks, scouring the web for new developments in a specific technical niche or finding specific sources to deal with the recently discovered bugs. By offloading these repetitive tasks, I ensure the knowledge base remains fresh and the project’s momentum never stalls. When the morning arrives, we have a lot of new context to start planning the day’s bugfixes and feature developments.

Maintenance & Refactoring

Now, for the final touch, here’s how I deal with technical debt and feature rot. As a project evolves—and especially, as fast as AI-powered projects evolve—it accumulates technical debt—outdated implementations, untested paths, and plain old useless features—but also, contextual debt–—outdated plans and completed tasks that clutter the roadmap, and research we never acted upon.

Without deliberate intervention, this noise degrades the AI’s performance, leading to context rot. The /maintenance command is my primary defense against this entropy. It treats the development environment as a living instrument that I must regularly tune and sharpen to maintain its efficiency.

The /maintenance workflow follows the same plan-first architecture as the rest of the system. When invoked, the AI performs a comprehensive audit of the codebase, focusing on improvements like code readability and performance optimization. It identifies opportunities to apply the DRY (Don’t Repeat Yourself) principle and ensures that every function is documented with high-quality docstrings. But it also fixes deviations between the documentation and the actual implementation.

Crucially, this is an interactive process: the system presents a detailed refactoring plan for my approval before making any changes. This ensures that I remain in control while the machine handles the labor of cleaning the code.

Beyond code refactoring, I maintain system health through disciplined repository hygiene. A key component is the management of the TASKS.md file. By regularly moving completed items into the “Archive” section, I ensure that my primary operational view remains focused on what is relevant. This simple act of archiving prevents the “Active Tasks” list from becoming a source of distraction.

The goal of these maintenance practices is to provide the AI with the cleanest possible line of sight into the project’s state. When the repository is cluttered with stale research, the subagents are forced to sift through irrelevant data, increasing the risk of hallucinations. By treating maintenance as a first-class citizen, I ensure that every interaction, whether a /plan or a /draft, is grounded in a precise context.

Conclusion

This system is far from done, and as models improve in capabilities I’m sure we’ll unlock new areas for automation and augmentation that we cannot think about today. But for me, the key principles will remain valid for a long time. These are principles of robust engineering and management, after all. You can read them thinking of a completely human-based organization, and it’s all valid:

The important things should be made explicit.
Resist the urge to guess.
Delegate, delegate, delegate (yeah, three times).

And this is the key insight for me. Good AI users are basically good managers. All the science and engineering behind good practices for people management also apply to good AI management. And then there are of course technical considerations because AIs are not people, and perhaps never will.

So this is perhaps the most philosophical take-away from this article. Sorry to have made you read so long for this!

Now, on the technical side, please do check the repository and play with it. There are a couple extra goodies I haven’t tell you about, like a /scaffold command that creates new projects from scratch, and an /onboard command that explains the whole repository in detail. Those are particularly useful if you’re reusing this repository as a template to start your own project.

Ultimately, this repository is not a one-size-fits-all solution. It is a starting point. The commands and subagents provided here represent a particular opinion on how modern development should look, but they are not the only way. The power of this framework lies in its extensibility. Every system prompt for agents and commands is a living document, meant to be tweaked and rewritten to suit your unique mental model.

So if you do try it out, please let me know in the comments. And if you have a different (or similar) system set up for yourself, please share with all us your experience and your thoughts. We are all learners in this era of AI, and we can only help each other.

Stay curious.

Drawing (not so) Beautiful Diagrams with Pure Python

Alejandro Piad Morffis — Fri, 13 Feb 2026 14:06:17 GMT

Over the past couple of weeks, I’ve been working on a pure Python library for rendering SVGs. In a spur of unbridled inspiration, I called it Tesserax. Long story short, it got waaayyyy beyond control, as it happens.

This article is a quick recap of why I did it, how I went down the rabbit hole of implementing yet-another-drawing-library, and a small showcase of what it can do it can do, in case you want to give it a shot.

But before moving one, here is a minimal example of what you can do with a few lines of Python.

Are you curious now? Then read on!

Subscribe now

The Case for Tesserax

There are a gazillion tools out there for drawing mathematical diagrams in Python, from the ancient, trustable matplotlib to the new cool kid in the park manim. There are a ton of charting libraries (those intended to draw bars, pies, and doughnuts) like altair and plotly; and a ton of high-level diagram makers like graphviz for graphs and networks, and mermaid for, well, whatever the mermaid devs think is worth drawing.

The problem with all these is that, one, there weren’t made by me. Now, jokes apart, I do believe there is a lot of value in reinventing the wheel, if only because you get to learn a lot about wheel—and boy have I learned more than I ever wanted to about SVG in the last couple of weeks. But also, you may find that your wheel fits your cart slightly better than all others, just because you know precisely what you need (and enough with wheel analogies).

So, I made Tesserax to cover a sweet spot that I couldn’t find anywhere: a lightweight library (literally zero dependencies, not even numpy) that renders web-native content (so everything scales and layouts perfectly in Jupyter / Quarto) and has both a very powerful low-level engine for when you want pixel-perfect control, and a very comfortable high-level engine for the most typical workflows.

Also, it should support animations. But I’m getting ahead of myself.

What Can Tesserax Do

I won’t put any code in this article because, one, Substack sucks at code. And two, it would only make the article harder to follow. All of these examples are fully described in the online documentation.

At its core, Tesserax is a library for defining an SVG scene. You create a Canvas, and some Shapes to it, define some attributes, and render it as an SVG file. Simple enough1.

Beyond basic primitives, you can of course draw arbitrary paths and style them easily. Here is a Polyline primitive that allows controlling the curvature with a smoothness factor.

You can control and distort these shapes in any form you want, including fully procedural warping of the edges.

And, of course, since no mathematical drawing library is worth a penny without a sketchy mode, you can also do that (albeit with some constraints).

To make positioning and layout easy, Tesserax comes a set of builtin layouts (that of course can be extended in any way you want, it’s Python) for common patterns like rows, columns, grids…

And a couple of complex layouts for things like trees and arbitrary graphs.

But things start to get really interesting when you discover that if you can render an image from code, then you can render as many as you want! And that leads to… animations!

Tesserax animations are procedural, meaning you write code that defines how each object changes in time. But there is a very high-level API for defining and composing animations declaratively, which means you seldom need the full power of tweaking each shape properties (but that power is there when you need it, as it should).

And, of course, once you grok animations, the next immediate idea is physically-based animations. (This is where things got way out of control, as you may imagine). Tesserax comes with a very small but fully-fledged pure Python 2D physics engines—that was a mouthful.

It can resolve collisions among arbitrary objects (using circles and rotated boxes as approximate colliders) and simulate fixed-step rigid body mechanics. Physic animations are baked—which means you compute the animation once and then render it. So, no, no 2D games for you! What were you thinking about? This is a graphing library, for Turing’s sake!

The purpose of this physics engine is, once again, didactic—e.g., to explain physics concepts in a Jupyter notebook with a bit of shizzass.

Final Words

I started writing Tesserax to help me create diagrams and animations for my lectures and articles. But when you do that with a fully-fledged programming language things start to get really interesting, because now you can leverage your well-honed programming skills to create complex diagrams with the minimum necessary code. You can encapsulate repeatable patterns into classes and methods, and build impressive diagrams bottom-up. You can abstract common patterns into drawable concepts like trees, automatons, etc.

For example, in the docs you will find a bit of code to implement a blob-like simulation by creating a ConvexHull component that automatically tracks inner shapes and builds the convex surrounding path around them. Tied with some simple physics (a bunch of balls interconnected with springs) gives you this ugly-looking blob of dread that I hope haunts your dreams for the next week or so—it sure will mine.

On a more serious note, the next step for Tesserax is building a library of reusable math and computer science concepts, like animated arrays, trees, graphs, etc., that can be used to explain complex CS topics with the much need visual assistance. As the great Donald Knuth once said, an algorithm must be seen to be believed. Tesserax is here to help you see them.

I’d really love if you guys would give Tesserax a try, now that is a bit more polished, and shoot me all your questions and suggestions. Special thanks to for many insightful comments and suggestions (although I’m not sure I’ve totally addressed all of them :)

Sadly, among all the stupid things that Substack cannot do for the sake of being annoying, it seems rendering SVG is one of them. So all drawings in this article are dumb screenshots of what otherwise would be pixel-perfect, infinitely scalable SVGs.

Introducing Tesserax

Alejandro Piad Morffis — Sat, 31 Jan 2026 17:24:56 GMT

Photo by KOBU Agency on Unsplash

It is 2026. We are still treating scientific diagrams like cave paintings.

I have a PhD in Machine Learning. I can explain the mathematics of high-dimensional manifold optimization. Yet, until recently, if I wanted to draw two boxes connected by an arrow, I had to Google the syntax.

My diagrams were dead artifacts. While my code was version-controlled and modular, my figures were fragile binary blobs or brittle LaTeX macros. I needed my drawings to behave like the rest of my research.

Instead, I found myself trapped between two equally bad options.

I have spent weekends fighting Mermaid and Graphviz. They are fine for a quick flowchart. But try asking them for a publication-quality Turing machine. Try asking for a precise computer architecture diagram. You hit a wall.

Using these black-box layout engines is like trying to parallel park a bus while shouting instructions from the sidewalk. You can nudge the parameters, but you never quite end up where you want to be.

The real sin of these tools is their refusal to encapsulate. As developers, we take parameterization for granted. We define logic once and instantiate it a thousand times. But you cannot take a Mermaid diagram, wrap it in a function, and ask it to render a version with “n” nodes. Our drawing tools should respect the engineering principles we teach.

Then there is TikZ.

TikZ is the undisputed masterpiece of academic drawing. It offers total control. It is also fundamentally broken for the web.

TikZ is a “print-first” citizen. If you convert it to SVG for a blog post, you don’t get a clean vector drawing. You get a crime scene. Inspect the element, and you will see a soup of unreadable paths that no CSS can touch.

And then, there is the “Reviewer #2” problem. (Damn Reviewer #2, why is it always you?!) You spend three hours perfecting a TikZ architecture diagram. You submit the paper. The review comes back: “Please swap layer 3 and layer 4.” In a drag-and-drop tool, this is annoying. In TikZ, it is a reconstruction project involving fifty lines of coordinate hard-coding. It isn’t just tedious; it’s fragile.

To render my documentation in a CI/CD pipeline, I shouldn’t have to install four gigabytes of TeX Live just to draw a few circles. And there is the friction of the language itself. Writing geometry macros in a specialized DSL is a jarring context switch. It forces you to leave the logic of your Python research and enter a world of backslashes and curly braces just to visualize a result.

Meet Tesserax

The shift happened when I moved my academic publishing to Quarto.

Quarto is essentially Jupyter on steroids. Since Jupyter is the standard for scientific computing, the solution was obvious. We didn’t need another GUI. We didn’t need a new domain-specific language. We needed a Python library that could live inside the notebook and speak fluent SVG.

I built Tesserax to close the gap. I started with the atoms, like Rects, Circles, and Paths. But the breakthrough was the Anchor System. I wanted a “nervous system” for diagrams, where objects know where they are relative to one another. No manual coordinates. No magic numbers.

Tesserax is what happens when you treat a drawing as a function of state.

In Tesserax, you don’t place a node at (100, 200). You define a Layout. You say, “I want a column of nodes, and for every node, I want an arrow pointing to the next one.”

# This isn't just a drawing. It's logic.
with Canvas() as c:
    nodes = [Circle(10) for _ in range(5)]
    layout = Column(nodes, gap=20)

    # If I add a node to the list above, the arrows update automatically.
    for n1, n2 in pairwise(nodes):
        Arrow(n1.anchor("bottom"), n2.anchor("top"))

This snippet doesn’t just render a figure. It renders any version of that figure. It captures the abstraction, not just the pixels.

The Rationale Behind Tesserax

But why, Alex, why building something like this? Don’t you have enough in your plate now?

Well, first, who are you and why do you know me so well? Second, it’s fun! But if that isn’t enough by itself, let’s analyze the landscape of modern academic drawing.

First, there is Matplotlib.

It is the industry standard for a reason. If I have a CSV of training data, Matplotlib is my first call. But have you ever tried to draw a linked list in it? You end up fighting the axes. You spend a dozen lines of code just hiding the ticks and borders to get a blank canvas.

Matplotlib is a library built for statistics that is occasionally forced to perform geometry. It feels like doing calligraphy with a highlighter. Tesserax respects the division of labor: use Altair or Matplotlib for your data, and use Tesserax for your concepts.

Then, there is Tikz. (Again!) We already discussed it, but there is more.

On one hand, there is a strange Stockholm Syndrome in academia. We accept that drawing a simple three-state automata requires learning a bizarre macro language on top of another macro language (yeah, I’m looking at you, LaTeX!). We treat TikZ’s loop syntax like a revelation, ignoring that any programming language is actually better at this, but I can also have, I don’t know, variables! And also, who the hell knows how to write Tikz!? (I know some of you nerds do, don’t give me that look. Shame on you.)

Tesserax bets that you don’t need a Domain Specific Language (DSL). You need objects. You need classes. You need type hints. If you want a grid of shapes? That is a nested list comprehension. If you want a recursive tree? That is a recursive function. You don’t need to “learn Tesserax.” You just need to know Python. By piggybacking on Python’s syntax, we get free IDE support, linting, and the ability to debug our drawings with the same tools we use to debug our kernels.

But also, even more important, the browser is the new print driver. TikZ was built for a world where the final output was a piece of paper. Tesserax is built for a world where the output is a responsive HTML page that can be printed.

By targeting SVG natively, we bypass the fragile conversion layers. We get semantic scaling. We get CSS styling. We stop fighting the renderer and start using the most optimized vector graphics engine in history: the modern web browser. And you can render SVG to PDF anytime. Going back, though, means you lose the semantics of what your drawing means.

And last but not least, Tesserax is lean. Like, zero dependencies. A couple hundred of pure, fully-typed, Python 3.12. Nothing more, nothing less.

I rest my case.

What’s Next

So Tesserax is my attempt to bring software engineering to scientific illustration. I’m releasing it today in a very crude, 0.2 something version, as I always do, to gather as much feedback as possible as soon as possible. From now on, all the diagrams in my papers, blog posts, and books will be built with this.

So, next step, I’m moving past the primitives phase now and into building high-level scientific abstractions. I want to generate automata that look like they were pulled from a classic textbook and data structures that reflect actual memory layouts. That’s what comes in the next few iterations. Moving from a world where we draw lines to a world where we code diagrams.

I’d love it if you’d give Tesserax a try. The documentation is still pretty thin but feel free to drop me a comment with any questions or suggestions.

The Computist Journal

Pseudocode is a lie

Three claims for pseudocode, all false

Two registers: diagrams for intuition, real code for formality

Literate programming keeps prose and code from drifting

illiterate makes the book and the package the same source

Quicksort, with the package built from the page

Start Here: A Map of The Computist Journal

What this is about

Five paths through the archive

🤖 If you want to understand what AI really is — and isn’t

🧠 If you want to see why computation itself is beautiful

💻 If you want to build something

✒️ If you want to argue about the bigger stakes

💡 If you want to chase the big questions

The books

The code

What you get if you subscribe

What’s next

The fastest algorithm ever devised

Four conditions, one winner

You’ve used this all day

Union-Find is actually pretty simple

Making Union-Find fast, really fast

Union-Find is the fastest algorithm ever devised

Problem solved is a rare luxury

Mostly Harmless AI v2.0 is here!

Sixteen chapters, mechanism-first

Free online, or pay for offline

June is algorithms month

Introducing Aegis: the programable multi-agent meta-harness

Agents aren’t just for coders

What’s wrong with existing tools

Introducing Aegis

What I intentionally left out

Coda

Why AI Agents Need Structure

Plan First, Build Second

Research Before the Plan

Review After the Plan

Name the Problem Before Solving It

We Knew it all Along

Start Doing This Yourself Today

Structure Before You Re-Prompt

Mostly Harmless AI - Mid-week Report

The two main changes since Friday

Access and support

Where to find it

AI is doing something weird to Science

Four Cases, One Shape

Claude’s Cycles

Tao and Lean

AlphaFold

GNoME and A-Lab

The Loop Has Is Older Than You Think

Appel and Haken, 1976

Hales and Flyspeck, 1998–2014

AI Feynman, 2020

What 2022 Actually Changed

Anatomy of the Loop

The Proposer Role

Galactica, or: What Happens Without a Verifier

The Verifier Role

Tao’s Insight on Composability

What’s Genuinely New Since 2022

Why Both Extremes Are Wrong

The Maximalist Steelman

The Dismissive Steelman

The Verdict

But What Happens to the Paper

Discovery and Paper Count Were Already Decoupled

Goodhart as Accelerant

The Optimistic Edge

A Diagnostic Heuristic

The Verifier Is the One That Matters

The Investment Gap

What Comes Next

Mostly Harmless AI - the book that explains you the AI without the bullshit

What’s in it

Who it’s for

`illiterate` makes the book and the package the same source