The 80% AI Reliability Horizon
The real number every AI engineer should be tracking

Every post on the blog this month is on the theme of agent reliability, anchored on the second edition of Mostly Harmless AI (50% off during early access), where the Limitations chapter walks through all seven failure modes that compound into the curve below. You can also read the whole book online for free. More at the end.
The headline number you've seen on every AI-progress chart, "model X completes two-hour tasks half the time," is the 50% reliability horizon. That number is moving fast: doubling every seven months, per METR's time-horizon work. It's the curve conference talks lean on, the one that lands in investor decks.
The number that decides whether you can actually deploy an agent is a different one.
The 80% reliability horizon, the task length at which an agent finishes well enough that you would not feel the need to check, sits 70-80% below the 50% figure, and it moves up far more slowly. That gap is the difference between demo and deploy. The 50% is what passes the eval. The 80% is what survives the afternoon you weren't watching. Not two hours. Thirty minutes you'd hand off.
I want to be precise about what I'm not arguing. I'm not arguing agents are broken or that AI progress isn't real. It is real, and it's fast. I'm making the narrower claim that the 50% and 80% horizons move at different speeds, and that the 80% is the one that matters when someone else's data is on the line. This post is the math behind the gap, why it's structural, and what you can do about it.
If you're building on agents, you are building on the 80% horizon. The 50% number is for the marketing deck.
Two horizons, two stories
METR's methodology is clean: take a population of tasks with measured human completion times, then find the longest task a given model clears at success rate X. Do that for X = 50%, and you get the 50% horizon. Do it for X = 80%, and you get a different curve. A different story.
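To make the relationship between the two horizons concrete, here is a minimal sketch. It assumes success probability follows a logistic curve in log task length; that curve shape, the slope value, and the function names are illustrative assumptions, not a reproduction of METR's pipeline. The slope is picked by hand so that a two-hour 50% horizon lines up with the thirty-minute 80% horizon used in this post.

```python
import math

def success_probability(task_minutes: float, h50_minutes: float, slope: float) -> float:
    """Assumed logistic success curve in log2(task length); equals 0.5 at h50_minutes."""
    x = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))

def horizon(target: float, h50_minutes: float, slope: float) -> float:
    """Task length (in minutes) at which the modelled success rate equals `target`."""
    logit = math.log(target / (1.0 - target))
    return h50_minutes * 2.0 ** (-logit / slope)

H50 = 120.0              # two-hour 50% horizon, as in the text
SLOPE = math.log(4) / 2  # hypothetical slope, chosen so the 80% horizon is 30 minutes

h80 = horizon(0.80, H50, SLOPE)
print(f"50% horizon: {horizon(0.50, H50, SLOPE):.0f} min, 80% horizon: {h80:.0f} min")
```

Under this assumed shape, a flatter slope widens the gap between the two horizons and a steeper slope narrows it, so the gap is a property of how quickly success falls off with task length, not of the 50% number itself.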
The 50% horizon has been doubling roughly every seven months. In late 2025, it sat around a couple of hours for software tasks. That's the curve that makes headlines. That's the curve you've seen on every slide. Striking.
The 80% horizon sits roughly 70-80% below. The same agent that clears a two-hour task half the time clears a half-hour task four times in five. Not two hours. Thirty minutes. And that gap doesn't close at the same rate. It moves slowly, stubbornly, for reasons that are mathematical before they are engineering.
So you have two curves growing at different speeds. The 50% horizon is the curve of capability: what can this system do, under ideal conditions, at least sometimes. The 80% horizon is the curve of trust: what can this system do reliably enough that you'd hand it a production key and walk away.
They are not the same curve. And they do not close the same way.
The longer your task horizon, the wider the gap between can-do-it-sometimes and can-be-trusted-with-it. The mechanical reason is one piece of math.
Probability arithmetic
Here's the setup. A language model you're calling spends a fixed compute budget per output token. Each step in a multi-step process has some per-step success probability p that is strictly less than one. The model is stochastic, the world is noisy, context degrades.
String n steps together, and the probability that all of them succeed is roughly $p^n$ (exactly $p^n$ if the steps fail independently). That's it. That's the math.
Here's what that looks like with actual numbers. Suppose your agent is excellent: p = 0.99 per step. That's a 99% success rate on any single action. Compound it over 100 steps: $0.99^{100} \approx 0.37$. You've gone from near-certain to worse than a coin flip without any single step looking unreliable. Now drop to p = 0.95 (still quite good, still 95% per step). Over 100 steps: $0.95^{100} \approx 0.006$. Six in a thousand runs succeed.
This is not a gap you close with next year's training run. It is the shape of any probabilistic process operating in sequence over time. Improving p doesn't change the shape of the curve; it just shifts the curve outward.
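The arithmetic is easy to reproduce. The sketch below assumes every step succeeds independently with the same probability p, which is the idealisation used in this section; the function names are illustrative.

```python
import math

def chain_success(p: float, n: int) -> float:
    """Probability that all n steps succeed, assuming independent steps."""
    return p ** n

def max_steps(p: float, target: float) -> int:
    """Largest n for which p**n is still at least `target`.

    Uses n <= ln(target) / ln(p); both logs are negative for 0 < p, target < 1.
    """
    return math.floor(math.log(target) / math.log(p))

for p in (0.99, 0.95):
    print(
        f"p={p}: 100-step success={chain_success(p, 100):.3f}, "
        f"steps at 50%={max_steps(p, 0.5)}, steps at 80%={max_steps(p, 0.8)}"
    )
```

The second function turns the question around: instead of asking how likely a fixed-length chain is to succeed, it asks how long a chain you can afford at a given reliability target.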
Reasoning models (the o-series, R1, extended-thinking variants) are valuable here. They buy you a higher per-step p, and they let you spend more steps at that higher rate. Both matter. But they push the curve outward. They do not change its shape.
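Under the idealised independent-step model (an assumption; real steps are correlated), "outward but not reshaped" can be written down directly. Let $n_q$ be the chain length at which the success rate falls to q:

```latex
p^{\,n_q} = q
\quad\Longrightarrow\quad
n_q = \frac{\ln q}{\ln p},
\qquad
\frac{n_{0.8}}{n_{0.5}} = \frac{\ln 0.8}{\ln 0.5} \approx 0.32
```

The ratio does not depend on p. In this model, raising p lengthens both horizons by the same factor, and the 80% horizon stays about 68% shorter than the 50% horizon. That is close to the 70-80% gap quoted earlier, though the measured curves need not follow this simplified model.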
Two pieces of evidence you should hold next to that math. GSM-Symbolic (Apple, 2024): perturb a math problem the model has seen (swap a name, change a number) and accuracy drops even when the reasoning path is identical. The model has memorized the route, not the reasoning. Faith and Fate (Dziri et al., 2023): transformer accuracy degrades with computational-graph depth even when each individual sub-step is solvable in isolation. Depth itself is the failure axis. More steps means more surface for p < 1 to accumulate.
Reasoning models buy you a higher per-step p and more steps to spend. They do not change the shape of the curve.
Where the chain gets long
Agents are exactly the setup that makes $p^n$ painful.
Think through a typical agent run: read prompt, plan, call tool, read result, call tool again, critique output, adjust plan, call final tool, write response. Nine steps if you're being generous. A real production agent reaches hundreds. Each step is one more p rolled. Each tool call is one more chance the orchestrator hands the tool the wrong arguments: garbage in, deduction out.
Self-critique doesn't repair this, and you can verify the result yourself if you've tried it. Huang and colleagues (2024) showed that intrinsic self-correction without an external oracle signal can actually degrade performance on reasoning tasks. The model talks itself out of correct answers as often as it talks itself in. The paradox is clean: if the model could recognize the error, it would not have made it. Asking it to introspect on failures is asking the broken compass to check itself.
So let's put numbers on a real scenario. An agent that succeeds on each of five steps 95% of the time lands at $0.95^{5} \approx 0.77$. Decent. Not great, but workable. Now extend that same agent to a fifty-step trajectory: $0.95^{50} \approx 0.08$. Eight runs out of a hundred finish correctly.
The demo ran five steps. The deploy runs fifty. The demo and the deploy are two different machines.
That's the 80% horizon you'll actually feel in production. It's not a philosophical concern about AI reliability in the abstract. It's the arithmetic of what happens when you take a stochastic generator and ask it to maintain a chain of reasoning over a long enough trajectory that $p^n$ has time to do its work.
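The five-step versus fifty-step comparison can also be checked by simulation instead of by formula. This is a toy Monte Carlo sketch: each step is an independent coin flip with success probability p, which is a simplifying assumption and not a model of any particular agent framework. The function names and the run count are illustrative.

```python
import random
from typing import Optional

def first_failure_step(p: float, steps: int, rng: random.Random) -> Optional[int]:
    """Return the 1-based step at which a run first fails, or None if every step succeeds."""
    for step in range(1, steps + 1):
        if rng.random() >= p:  # this step fails with probability 1 - p
            return step
    return None

def clean_run_rate(p: float, steps: int, runs: int, seed: int = 0) -> float:
    """Fraction of simulated runs that finish with no failed step."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    clean = sum(first_failure_step(p, steps, rng) is None for _ in range(runs))
    return clean / runs

demo = clean_run_rate(0.95, steps=5, runs=20_000)
deploy = clean_run_rate(0.95, steps=50, runs=20_000)
print(f"5-step clean runs: {demo:.2f}, 50-step clean runs: {deploy:.2f}")
```

The simulated rates should land close to the closed-form values, with small sampling noise. Swapping the coin flip for logged outcomes from real runs would show how far the independence assumption is from your own system.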
What you can actually do
Three mitigations. Each one genuine, and each one with a ceiling you should know before you commit.
Verifier-shaped tasks. Where the output can be checked deterministically (arithmetic, code that compiles and runs, SQL that parses, a formal proof), you can recover trust that the probabilistic generator alone cannot provide. A SAT solver beats an LLM on deductive closure every time. The architecture that wins here is LLM-proposes-candidate, deterministic-system-signs-off. The generator explores the space; the verifier approves the exit. This is, incidentally, the same pattern Monday's post on the seventy-year argument named: a deterministic shell around a stochastic core, applied at the task level rather than the system level. The twist is that not every task has a fast verifier. Code that runs is checkable; code that runs correctly for all future inputs is not.
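A minimal sketch of the propose-then-verify pattern follows. The candidate list stands in for a stochastic generator, and the factorisation check stands in for a deterministic verifier; both are hypothetical placeholders chosen so the example runs without any model or external service.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")

def propose_and_verify(
    candidates: Iterable[T],
    verify: Callable[[T], bool],
    max_attempts: int,
) -> Optional[T]:
    """Return the first candidate the deterministic verifier accepts, else None."""
    for attempt, candidate in enumerate(candidates):
        if attempt >= max_attempts:
            break  # attempt budget exhausted: fail visibly, never guess
        if verify(candidate):
            return candidate
    return None

# Hypothetical stand-in for generator output: proposed factor pairs for 91.
proposals = [(9, 10), (7, 12), (7, 13)]

def is_factorisation_of_91(pair: Tuple[int, int]) -> bool:
    """Deterministic check: both factors non-trivial and their product is 91."""
    a, b = pair
    return a > 1 and b > 1 and a * b == 91

print(propose_and_verify(proposals, is_factorisation_of_91, max_attempts=5))
```

If each proposal is independently correct with probability q and the verifier is exact, k attempts succeed with probability 1 - (1 - q)^k. Retries improve the odds here only because something other than the generator decides which attempt counts.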
Retrieval-augmented generation. If the fact your agent needs is no longer arbitrary recall but lives in a curated document the model is required to cite, then Kalai and Vempala's 2024 lower bound on calibrated hallucination does not apply to that fact. Most agent failures upstream of a tool call are recall failures the agent doesn't know it's making; retrieval changes the error mode from confident confabulation to visible gap. RAG turns a free-running generator into a paraphrase-and-summarize system over a known corpus. The reach of the system is now bounded by the reach of the index. But anything outside that index is back to pure p < 1 territory.
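The "visible gap" behaviour can be sketched in a few lines. Everything here is a toy assumption: the two-document corpus, the keyword-overlap retriever, and the function names are hypothetical, and a real system would use a proper index and a model to phrase the answer. The point is only the control flow: cite a source or say that none was found.

```python
from typing import Optional, Tuple

# Hypothetical curated corpus: document id -> text.
CORPUS = {
    "refund-policy": "Refunds are available within 30 days of purchase.",
    "shipping": "Standard shipping takes 5 business days.",
}

def retrieve(query: str) -> Optional[Tuple[str, str]]:
    """Toy keyword-overlap retrieval; returns (doc_id, text), or None if nothing overlaps."""
    terms = set(query.lower().split())
    best_id, best_score = None, 0
    for doc_id, text in CORPUS.items():
        words = set(text.lower().rstrip(".").split())
        score = len(terms & words)
        if score > best_score:
            best_id, best_score = doc_id, score
    return (best_id, CORPUS[best_id]) if best_id is not None else None

def answer(query: str) -> str:
    """Answer only from a cited document; otherwise surface the gap explicitly."""
    hit = retrieve(query)
    if hit is None:
        return "NO SOURCE FOUND"
    doc_id, text = hit
    return f"{text} [source: {doc_id}]"

print(answer("how many days for refunds"))
print(answer("warranty terms"))
```

The second query falls outside the corpus, so the caller gets an explicit miss it can route to a human or another tool, which is the error mode the paragraph above is arguing for.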
Narrow the horizon. The cheapest move is the one nobody wants to make: don't deploy your agent on a fifty-step trajectory. Cut it to five. Hand off to a human at the boundary. At five steps with p = 0.95 you're at 0.77; at fifty steps you're at 0.08. That's not a small difference. That's the difference between a tool that works and a demo that occasionally works. Now, this trades autonomy for reliability. That trade is worth making in most production contexts right now. Whether it's worth making in your context is a product question, not a research question.
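The cost of that trade can be estimated with the same arithmetic. The sketch below assumes independent steps and assumes the human check at each boundary reliably catches a failed segment; both are simplifications, and the function name is illustrative.

```python
from typing import Tuple

def segmented_plan(p: float, total_steps: int, segment_len: int) -> Tuple[int, float, float]:
    """Return (segments, per-segment success rate, expected segments flagged at handoff)."""
    if total_steps % segment_len != 0:
        raise ValueError("this sketch assumes total_steps is a multiple of segment_len")
    segments = total_steps // segment_len
    per_segment = p ** segment_len
    # Expected number of segments that fail and get caught at their checkpoint,
    # assuming independent failures and a reviewer who catches every one.
    expected_flagged = segments * (1.0 - per_segment)
    return segments, per_segment, expected_flagged

segments, per_segment, flagged = segmented_plan(p=0.95, total_steps=50, segment_len=5)
print(f"{segments} handoffs, {per_segment:.2f} success per segment, "
      f"about {flagged:.1f} segments sent back on average")
```

With these numbers the fifty-step job becomes ten handoffs, of which roughly two need rework, instead of one unattended run that finishes cleanly about 8% of the time. Whether ten review points is affordable is the product question.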
Watching the right number
The 50% number will keep doubling and you should track it. That is real progress and worth watching closely.
But it is not the number your users feel. The number your users feel is whether the agent finished their task well enough that they didn't have to re-run it, check its work, or clean up after it. The difference between "I tried that AI agent thing and it was magic" and "I tried that AI agent thing and it broke my Friday" is roughly the distance between the 50% horizon and the 80% horizon at your task length.
The shape of the next several years of agent engineering is already visible in the mitigations you'll be reaching for: deterministic verifiers around stochastic generators, retrieval around recall, short trajectories with human handoffs where the math demands it. Not because agents are weak. They're remarkable. But the $p^n$ curve doesn't care about benchmark scores. It cares about chain length.
One number, slowly creeping upward, every quarter. Watch that one.
Until next time, stay curious.
If the 80%-horizon framing landed, the second edition of Mostly Harmless AI walks through the seven failure modes that produce the curve: the calibrated-hallucination lower bound, the U-shaped attention curve, the reversal curse, the depth ceiling on deduction, the rest. 50% off during early access. You can also read the whole thing online for free in a custom reader I built: dark mode, font controls, offline support, the works.
And if you want everything I've written, plus everything I'm going to write, that's the Compendium. One purchase, in perpetuity.


