The 80% AI Reliability Horizon

The real number every AI engineer should be tracking

May 21, 2026

*Adapted from Friedrich, “Wanderer above the Sea of Fog” (c. 1818), Kunsthalle Hamburg — the horizon you can see is not the horizon you get to stand on. Rendered with Nano Banana 3 via mosaico.*

Every post on the blog this month is on the theme of agent reliability, anchored on the second edition of Mostly Harmless AI — 50% off during early access — where the Limitations chapter walks all seven failure modes that compound into the curve below. You can also read the whole book online for free. More at the end.

The headline number you’ve seen on every AI-progress chart — “model X completes two-hour tasks half the time” — is the 50% reliability horizon. That number is moving fast. Doubling every seven months, per METR’s time-horizon work. It’s the curve on every AI-progress chart, the one conference talks lean on, the one that lands in investor decks.

The number that decides whether you can actually deploy an agent is a different one.

The 80% reliability horizon — the task length at which an agent finishes well enough that you would not feel the need to check — sits 70–80% below the 50% figure, and it moves up far more slowly. That gap is the difference between demo and deploy. The 50% is what passes the eval. The 80% is what survives the afternoon you weren’t watching. Not two hours. Thirty minutes you’d hand off.

I want to be precise about what I’m not arguing. I’m not arguing agents are broken or that AI progress isn’t real. It is real, and it’s fast. I’m making the narrower claim that the 50% and 80% horizons move at different speeds, and that the 80% is the one that matters when someone else’s data is on the line. This post is the math behind the gap, why it’s structural, and what you can do about it.

If you’re building on agents, you are building on the 80% horizon. The 50% number is for the marketing deck.

Two horizons, two stories

METR’s methodology is clean: take a population of tasks with measured human completion times, then find the longest task a given model clears at success rate X. Do that for X = 50%, and you get the 50% horizon. Do it for X = 80%, and you get a different curve. A different story.
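To make "find the longest task a model clears at success rate X" concrete, here is a minimal sketch. It assumes success probability follows a logistic curve in the log of task length; my understanding is that METR's analysis fits a curve of this kind, but treat that as an assumption here. The function name `horizon`, the parameter `slope`, and the value `math.log(2)` are illustrative, and that slope was picked only because it reproduces the two-hour and thirty-minute figures quoted in this post.

```python
import math


def horizon(h50_minutes: float, slope: float, target: float) -> float:
    """Task length (minutes) at which modeled success equals `target`.

    Assumed model (an illustration, not taken from the post):
        P(success | t) = 1 / (1 + exp(slope * (log2(t) - log2(h50))))
    so P = 0.5 at t = h50 and falls as tasks get longer.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if slope <= 0:
        raise ValueError("slope must be positive")
    # Solve the logistic for log2(t) at the requested success rate.
    shift = math.log((1.0 - target) / target) / slope
    return 2.0 ** (math.log2(h50_minutes) + shift)


if __name__ == "__main__":
    slope = math.log(2)  # hypothetical steepness, chosen for illustration
    for target in (0.5, 0.8):
        print(f"{target:.0%} horizon: {horizon(120, slope, target):.0f} min")
```

With these illustrative values the sketch gives 120 minutes at 50% and 30 minutes at 80%. A steeper fitted slope would pull the two horizons closer together; a shallower one would push them apart.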

The 50% horizon has been doubling roughly every seven months. Late 2025, it sat around a couple of hours for software tasks. That’s the curve that makes headlines. That’s the curve you’ve seen on every slide. Striking.

The 80% horizon sits roughly 70–80% below. The same agent that clears a two-hour task half the time clears a half-hour task four-times-in-five. Not two hours. Thirty minutes. And that gap doesn’t close at the same rate. It moves slowly, stubbornly, for reasons that are mathematical before they are engineering.

So you have two curves growing at different speeds. The 50% horizon is the curve of capability: what can this system do, under ideal conditions, at least sometimes. The 80% horizon is the curve of trust: what can this system do reliably enough that you’d hand it a production key and walk away.

They are not the same curve. And they do not grow at the same rate.

The longer your task horizon, the wider the gap between can-do-it-sometimes and can-be-trusted-with-it. The mechanical reason is one piece of math.

Probability arithmetic

Here’s the setup. A language model you’re calling spends a fixed compute budget per output token. Each step in a multi-step process has some per-step success probability p that is strictly less than one. The model is stochastic, the world is noisy, context degrades.

String n steps together and, if each step fails independently of the others, the probability that all of them succeed is roughly pⁿ. That’s it. That’s the math.

Here’s what that looks like with actual numbers. Suppose your agent is excellent: p = 0.99 per step, a 99% success rate on any single action. Compound it over 100 steps: 0.99¹⁰⁰ ≈ 0.37. You’ve gone from near-certain to worse than a coin flip without any single step looking unreliable. Now drop to p = 0.95 (still quite good, still 95% per step). Over 100 steps: 0.95¹⁰⁰ ≈ 0.006. Six in a thousand runs succeed.
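If you want to check the compounding arithmetic yourself, a few lines are enough. This is a sketch under the same simplifying assumption as the text: steps are independent and share one per-step success probability. The name `chain_success` is mine.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent steps, each succeeding with probability p, all succeed."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    for p in (0.99, 0.95):
        for n in (1, 10, 50, 100):
            print(f"p={p:.2f}  n={n:>3}  all-steps success={chain_success(p, n):.3f}")
```

Real agent steps are rarely independent or equally hard, so read the output as the shape of the decay rather than a prediction for any particular system.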

This is not a gap you close with next year’s training run. It is the shape of any probabilistic process operating in sequence over time. The shape of the curve doesn’t change when you improve p; the whole curve just shifts outward.
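One way to see "shifts outward, same shape" is to invert pⁿ and ask how many steps you can afford at a given reliability target. The sketch below assumes independent steps with a constant p; `max_steps` is an illustrative name, not something from METR's work.

```python
import math


def max_steps(p: float, target: float) -> int:
    """Largest n with p**n >= target, assuming independent steps with constant p."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must be strictly between 0 and 1")
    # p**n >= target  <=>  n <= ln(target) / ln(p), since ln(p) is negative.
    # Floating-point rounding could matter if the ratio is an exact integer.
    return math.floor(math.log(target) / math.log(p))


if __name__ == "__main__":
    for p in (0.99, 0.995):
        n50, n80 = max_steps(p, 0.5), max_steps(p, 0.8)
        print(f"p={p}: 50% horizon = {n50} steps, 80% horizon = {n80} steps")
```

Halving the per-step error rate from 1% to 0.5% roughly doubles both horizons. In this simplified model the ratio between them is ln 0.8 / ln 0.5 ≈ 0.32 whatever p is, so the 80% horizon lands about two thirds below the 50% one. Real agents need not follow the model exactly.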

Reasoning models — the o-series, R1, extended-thinking variants — are valuable here. They buy you a higher per-step p, and they let you spend more steps at that higher rate. Both matter. But they push the curve outward. They do not change its shape.

Two pieces of evidence you should hold next to that math. GSM-Symbolic (Apple, 2024): perturb a grade-school math problem (swap a name, change a number) and accuracy drops even when the reasoning path is identical. The model has memorized the route, not the reasoning. Faith and Fate (Dziri et al., 2023): transformer accuracy degrades with computational-graph depth even when each individual sub-step is solvable in isolation. Depth itself is the failure axis. More steps means more surface for p < 1 to accumulate.

Reasoning models buy you a higher per-step p and more steps to spend. They do not change the shape of the curve.

Where the chain gets long

Agents are exactly the setup that makes pⁿ painful.

Think through a typical agent run: read prompt, plan, call tool, read result, call tool again, critique output, adjust plan, call final tool, write response. That’s nine steps, and it’s a short run. A real production agent reaches hundreds. Each step is one more p rolled. Each tool call is one more chance the orchestrator hands the tool the wrong arguments: garbage in, deduction out.

Self-critique doesn’t repair this, as you may already have seen if you’ve tried it. Huang and colleagues (2024) showed that intrinsic self-correction, without an external oracle signal, can actually degrade performance on reasoning tasks. The model talks itself out of correct answers as often as it talks itself into them. The paradox is clean: if the model could recognize the error, it would not have made it. Asking it to introspect on failures is asking the broken compass to check itself.

So let’s put numbers on a real scenario. An agent that succeeds on each of five steps 95% of the time lands at 0.95⁵ ≈ 0.77. Decent. Not great, but workable. Now extend that same agent to a fifty-step trajectory: 0.95⁵⁰ ≈ 0.08. Eight runs out of a hundred finish correctly.
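The closed form is easy to distrust, so here is a small Monte Carlo sketch that rolls each step explicitly. It assumes independent steps at a fixed p and uses a seeded generator so runs are repeatable. `simulate_runs` and its parameters are hypothetical names for illustration.

```python
import random


def simulate_runs(p: float, steps: int, runs: int, seed: int = 0) -> float:
    """Fraction of simulated trajectories in which every step succeeds."""
    rng = random.Random(seed)  # seeded for repeatable results
    successes = 0
    for _ in range(runs):
        # A trajectory succeeds only if no step fails; all() stops at the first failure.
        if all(rng.random() < p for _ in range(steps)):
            successes += 1
    return successes / runs


if __name__ == "__main__":
    for steps in (5, 50):
        rate = simulate_runs(p=0.95, steps=steps, runs=20_000)
        print(f"{steps:>2} steps: simulated {rate:.3f}, closed form {0.95 ** steps:.3f}")
```

The simulated rates should land close to 0.77 and 0.08. The same loop is a convenient place to try non-uniform per-step probabilities if your agent has a few steps that are much riskier than the rest.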

The demo ran five steps. The deploy runs fifty. The demo and the deploy are two different machines.

That’s the 80% horizon you’ll actually feel in production. It’s not a philosophical concern about AI reliability in the abstract. It’s the arithmetic of what happens when you take a stochastic generator and ask it to maintain a chain of reasoning over a long enough trajectory that pⁿ has time to do its work.

What you can actually do

Three mitigations. Each one genuine, and each one with a ceiling you should know before you commit.

Verifier-shaped tasks. Where the output can be checked deterministically (arithmetic, code that compiles and runs, SQL that parses, a formal proof), you can recover trust that the probabilistic generator alone cannot provide. A SAT solver beats an LLM on deductive closure every time. The architecture that wins here is LLM-proposes-candidate, deterministic-system-signs-off. The generator explores the space; the verifier approves the exit. This is, incidentally, the same pattern Monday’s post on the seventy-year argument named: a deterministic shell around a stochastic core, applied at the task level rather than the system level. The twist is that not every task has a fast verifier. Code that runs is checkable; code that runs correctly for all future inputs is not.
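The propose-then-verify pattern can be sketched in a few lines. Everything here is a stand-in: the "generator" is a hard-coded list of candidate answers playing the role of model output, the task is sorting, and `is_valid_sort` and `first_verified` are names I made up for the illustration.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Sequence


def is_valid_sort(original: Sequence[int], candidate: Sequence[int]) -> bool:
    """Deterministic check: same multiset of items, in non-decreasing order."""
    same_items = Counter(original) == Counter(candidate)
    ordered = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return same_items and ordered


def first_verified(
    candidates: Iterable[Sequence[int]],
    verify: Callable[[Sequence[int]], bool],
    max_attempts: int,
) -> Optional[Sequence[int]]:
    """Return the first candidate the verifier accepts, or None if the budget runs out."""
    for attempt, candidate in enumerate(candidates, start=1):
        if attempt > max_attempts:
            break
        if verify(candidate):
            return candidate
    return None  # nothing passed: surface the failure instead of guessing


if __name__ == "__main__":
    task = [3, 1, 2]
    # Hypothetical generator output: the first proposal is wrong, the second is right.
    proposals = [[1, 3, 2], [1, 2, 3]]
    print(first_verified(proposals, lambda c: is_valid_sort(task, c), max_attempts=3))
```

If each proposal passes independently with probability q, k attempts succeed with probability 1 − (1 − q)ᵏ, which is why a cheap verifier buys back so much reliability. The verifier only certifies what it checks, though, so the pattern is as strong as the check.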

Retrieval-augmented generation. If the fact your agent needs is no longer arbitrary recall but lives in a curated document the model is required to cite, then Kalai and Vempala’s 2024 lower bound on calibrated hallucination does not apply to that fact. Most agent failures upstream of a tool call are recall failures the agent doesn’t know it’s making; retrieval changes the error mode from confident confabulation to visible gap. RAG turns a free-running generator into a paraphrase-and-summarize system over a known corpus. The reach of the system is now bounded by the reach of the index. But anything outside that index is back to pure p < 1 territory.
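The shift from confident confabulation to visible gap is mostly a matter of what the system does on a miss. The sketch below uses an exact-match dictionary as a stand-in for a real retriever; the index contents, the source path, and the names `CitedFact`, `INDEX`, `lookup`, and `answer` are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CitedFact:
    text: str
    source: str


# Hypothetical curated index; the entry is a placeholder, not real policy text.
INDEX = {
    "refund window": CitedFact("Refunds are accepted within 30 days.", "policy.md#refunds"),
}


def lookup(query: str, index: dict[str, CitedFact]) -> Optional[CitedFact]:
    """Exact-match retrieval; a real system would use search or embeddings."""
    return index.get(query.strip().lower())


def answer(query: str, index: dict[str, CitedFact]) -> str:
    fact = lookup(query, index)
    if fact is None:
        # The miss is reported, not papered over with a guess.
        return f"NOT IN INDEX: {query!r}"
    return f"{fact.text} [source: {fact.source}]"


if __name__ == "__main__":
    print(answer("Refund window", INDEX))
    print(answer("shipping time", INDEX))
```

The design choice that matters is the `None` branch: the agent downstream gets an explicit gap it can escalate rather than a fluent answer it has no reason to doubt.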

Narrow the horizon. The cheapest move is the one nobody wants to make: don’t deploy your agent on a fifty-step trajectory. Cut it to five. Hand off to a human at the boundary. At five steps with p = 0.95 you’re at 0.77; at fifty steps you’re at 0.08. That’s not a small difference. That’s the difference between a tool that works and a demo that occasionally works. Now, this trades autonomy for reliability. That trade is worth making in most production contexts right now. Whether it’s worth making in your context is a product question, not a research question.
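To put a price on that trade, here is a rough sketch of splitting a long trajectory into short segments with a human review at each boundary. It assumes independent steps at a fixed p, a reviewer who catches every failed segment, and a failed segment that is simply rerun from its start. Those assumptions are mine, as is the name `segment_plan`.

```python
def segment_plan(p: float, total_steps: int, segment_len: int) -> dict:
    """Compare one long unsupervised run with reviewed, fixed-length segments."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if segment_len <= 0 or total_steps % segment_len != 0:
        raise ValueError("total_steps must be a positive multiple of segment_len")
    segments = total_steps // segment_len
    per_segment = p ** segment_len
    return {
        "segments": segments,
        "per_segment_success": per_segment,
        "unsupervised_success": p ** total_steps,
        # Each segment is retried until it passes review: mean 1 / per_segment tries.
        "expected_segment_attempts": segments / per_segment,
    }


if __name__ == "__main__":
    for key, value in segment_plan(p=0.95, total_steps=50, segment_len=5).items():
        print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

Under these assumptions, ten reviewed five-step segments cost about thirteen segment runs and ten human check-ins, against a single unsupervised fifty-step run that finishes correctly about 8% of the time. Whether ten check-ins is affordable is the product question.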

Watching the right number

The 50% number will keep doubling and you should track it. That is real progress and worth watching closely.

But it is not the number your users feel. The number your users feel is whether the agent finished their task well enough that they didn’t have to re-run it, check its work, or clean up after it. The difference between “I tried that AI agent thing and it was magic” and “I tried that AI agent thing and it broke my Friday” is roughly the distance between the 50% horizon and the 80% horizon at your task length.

The shape of the next several years of agent engineering is already visible in the mitigations you’ll be reaching for: deterministic verifiers around stochastic generators, retrieval around recall, short trajectories with human handoffs where the math demands it. Not because agents are weak. They’re remarkable. But the pⁿ curve doesn’t care about benchmark scores. It cares about chain length.

One number, slowly creeping upward, every quarter. Watch that one.

Until next time, stay curious.

If the 80%-horizon framing landed, the second edition of Mostly Harmless AI walks the seven failure modes that produce the curve — the calibrated-hallucination lower bound, the U-shaped attention curve, the reversal curse, the depth ceiling on deduction, the rest. 50% off during early access. You can also read the whole thing online for free in a custom reader I built — dark mode, font controls, offline support, the works.


And if you want everything I’ve written, plus everything I’m going to write, that’s the Compendium. One purchase, in perpetuity.

A student

May 21

Excellent article that I wish everyone in the corporate ranks would read. It's tiring listening to everyone assume that agents will perfectly swap out human tasks. The same degradation of process happens when human efforts are strung together, so why is everyone not recognizing that this is the case with agents as well? The assumption that AI will just never experience this is biased thinking, and I see it from anyone who hasn't actually tried this on a real process rather than a toy example published to a reel somewhere.

What I've experienced as I've done these automations is that I wind up being the Oracle. So I showcase my new automated process but hide the fact that I spent a large amount of time guiding the process to the result. I wish more people would be honest about this.

I actually shifted to the self-critique method early on and always used a different LLM to critique the result of another LLM. This produced much better results; however, it takes a really long time and jacked up my token usage. For example, in image generation I got from maybe 40% acceptable to almost 90% acceptable, but this took on average 10 extra loops through the image gen because of the extra LLM critiques. So I basically did 10 gens to get from 40 to 90, but it still was not good enough. It still required the oracle to guide it from 90 to 99!

1 reply by Alejandro Piad Morffis

Florent Michel

May 22

Very interesting article, which highlights a crucial point too often neglected in discussions of AI trajectory. I wonder if one should not even go one step further: in many areas, a failure probability of 20% is still way too large to be useful. (Although this does not invalidate your arguments!)
