Harvey AI is one of the best-resourced legal AI companies in the world. Hundreds of millions in funding. Elite law firm customers. A team that has been building in this space longer and harder than almost anyone.
In April 2026, they published a post-mortem on their document review algorithm.
The algorithm was working. It was being used by lawyers at major firms. It was passing tests. And they rebuilt it anyway — because it was insufficiently verifiable.
The original system produced citations attached to whole cells, not individual statements. Harvey, as they put it, "didn't reveal its line of thinking." Lawyers are paid to exercise diligence: to check sources, to understand reasoning, to stake their professional reputations on what they submit. An algorithm that delivered conclusions without per-statement reasoning made that impossible.
The rebuild produced answers with per-statement citations and exposed reasoning at each step. Not because the original was broken. Because verification requires that you can see the work.
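One way to picture the difference between cell-level and per-statement citations is a small data structure. This is a minimal sketch, not Harvey's implementation: the `Citation` and `Statement` types, their fields, and the sample contract text are all hypothetical. It only shows that once each statement carries its own reasoning and sources, "can a reviewer check this?" becomes something you can test for.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    source_id: str  # hypothetical document identifier
    quote: str      # the passage the statement relies on


@dataclass
class Statement:
    text: str
    reasoning: str = ""
    citations: list[Citation] = field(default_factory=list)


def unverifiable_statements(answer: list[Statement]) -> list[Statement]:
    """Return statements a reviewer could not check: no citation or no reasoning."""
    return [s for s in answer if not s.citations or not s.reasoning.strip()]


# Illustrative answer: the first statement shows its work, the second does not.
answer = [
    Statement(
        text="The agreement renews automatically each year.",
        reasoning="Section 4.2 says the term extends unless notice is given.",
        citations=[Citation("contract.pdf", "shall renew for successive one-year terms")],
    ),
    Statement(text="Either party may terminate for convenience."),
]

flagged = unverifiable_statements(answer)
print([s.text for s in flagged])
```

The second statement may well be true. It is flagged anyway, because nothing attached to it lets a reviewer confirm that.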
That's the insight worth taking seriously: "insufficiently verifiable" is a failure mode distinct from "incorrect." Most organizations have built no verification infrastructure for the first problem because they're focused entirely on the second.
Anthropic ran an AI safety evaluation using a benchmark called BBQ, designed to measure whether their models exhibited harmful social biases. The results came back encouraging: bias scores close to zero. It looked like progress.
One of the benchmark's original developers asked a simple follow-up question: were the models actually answering the questions?
They weren't. The models were deflecting — refusing or giving non-answers — which registered as technically unbiased because a non-answer can't be biased in any direction. The safety score looked good. The underlying behavior was useless.
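A toy calculation makes the mechanism concrete. This is not the BBQ scoring formula: the response labels, the `score` function, and the counts are assumptions made up for illustration. The sketch shows how a score that counts only biased answers can read as perfect when the model answers nothing, and why an answer rate belongs next to it.

```python
from collections import Counter


def score(responses: list[str]) -> dict:
    """Toy metric: bias is the net share of stereotyped answers over all responses."""
    counts = Counter(responses)
    total = len(responses)
    answered = counts["stereotyped"] + counts["counter_stereotyped"]
    net = counts["stereotyped"] - counts["counter_stereotyped"]
    return {
        "bias": net / total if total else 0.0,
        "answer_rate": answered / total if total else 0.0,
    }


# Illustrative data only: one model deflects every question, one answers them all.
deflecting = ["non_answer"] * 100
engaged = ["stereotyped"] * 60 + ["counter_stereotyped"] * 40

print(score(deflecting))  # bias 0.0, answer_rate 0.0
print(score(engaged))     # bias 0.2, answer_rate 1.0
```

Read alone, the bias number ranks the deflecting model as the better one. Read alongside the answer rate, it shows that the model never engaged with the questions.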
Anthropic named this the core failure mode of automated evaluation: "All evaluations are subject to the failure mode where you overinterpret the quantitative score and delude yourself into thinking that you have made progress when you haven't."
This isn't a niche problem. Scale AI published HiL-Bench in April 2026 — a benchmark specifically designed to test whether agents know when they're missing context. Results: frontier agents solved up to 89% of software engineering tasks when given full context. Remove a few realistic details — an ambiguous requirement, a buried constraint, an unresolved contradiction — and the best model's completion rate dropped to 24%. More troubling: the agents didn't ask for help. They guessed confidently and shipped wrong outputs.
An 89% benchmark score describing a system that would fail 76% of the time under realistic conditions. The score didn't lie — it measured something other than what mattered.
The natural response to unreliable benchmarks is to evaluate dynamically — use AI to evaluate AI outputs in real time. LLM-as-judge. A second model assessing the first.
In April 2026, a thread in r/LLMDevs titled "LLM-as-judge is not a verification layer. It is a second failure mode" gathered 62 comments from practitioners actively shipping AI in production.
The most-upvoted reply: "Two probabilistic systems do not add up to a deterministic one. You have not added a verification layer."
Anthropic named this the "ouroboros" problem. When AI systems generate evaluations for other AI systems, those evaluations inherit the same tendencies — the same biases, the same hallucination patterns, the same blind spots — as the systems being evaluated. The evaluator eats its own tail.
LLM-as-judge isn't useless. It's valuable for semantic evaluation at volumes no human review process could cover. But practitioners who get the most from it have learned to split the work: compliance checks (schema validation, scope boundaries, authorization rules) stay deterministic, and the LLM judge handles only what genuinely needs contextual judgment. Most teams default to LLM-for-everything because it's the easiest reach. That's where cost and reliability problems compound.
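The split described above can be sketched as a two-stage check. Everything named here is hypothetical: the field names, the list of authorized actions, and `stub_judge`, which stands in for a real model call. The point is the ordering. Rule-based checks run first and can block an output on their own, and the judge is consulted only for outputs that have already passed them.

```python
from typing import Callable

# Hypothetical policy values, chosen only for illustration.
REQUIRED_FIELDS = {"action", "recipient", "body"}
AUTHORIZED_ACTIONS = {"draft", "send_direct_message"}


def deterministic_checks(output: dict) -> list[str]:
    """Rule-based compliance checks: same input, same verdict, every time."""
    failures = []
    missing = REQUIRED_FIELDS - set(output)
    if missing:
        failures.append(f"schema: missing {sorted(missing)}")
    if output.get("action") not in AUTHORIZED_ACTIONS:
        failures.append(f"authorization: action {output.get('action')!r} not allowed")
    return failures


def verify(output: dict, judge: Callable[[str], bool]) -> dict:
    """Run deterministic checks first; call the judge only if they all pass."""
    failures = deterministic_checks(output)
    if failures:
        return {"passed": False, "failures": failures, "judge_called": False}
    judged_ok = judge(output["body"])
    return {
        "passed": judged_ok,
        "failures": [] if judged_ok else ["judge: rejected"],
        "judge_called": True,
    }


def stub_judge(text: str) -> bool:
    """Placeholder for an LLM call; here it only checks the text is non-empty."""
    return bool(text.strip())


blocked = verify({"action": "post_to_channel", "recipient": "#all", "body": "notes"}, stub_judge)
allowed = verify({"action": "send_direct_message", "recipient": "manager", "body": "notes"}, stub_judge)
print(blocked)
print(allowed)
```

Because the judge never sees an output that failed a hard rule, a persuasive but unauthorized output cannot talk its way through, and no model call is spent on a case the rules already decided.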
There's a failure mode more dangerous than hallucination that most verification frameworks aren't built for: correct content, wrong scope.
A performance coaching agent is given a task: draft feedback notes for a manager's direct reports and share them. The agent drafts thoughtful, accurate notes. It shares them — to an org-wide Slack channel. Private performance feedback surfaces to 400 people.
The agent didn't hallucinate. The content was reasonable. The scope was catastrophic.
Standard verification asks: did the model hallucinate? Here the answer is no. But the output was still wrong — in a way that neither the model nor the retrieval system could detect, because the error lives in the action's scope, not its content.
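A scope check inspects the action rather than the content. The following is a minimal sketch under stated assumptions: the sensitivity labels, the audience limits, and the `ShareAction` type are invented for illustration, and a real system would need a trustworthy way to label content and resolve who is in a destination.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sensitivity labels mapped to the largest audience each may reach.
MAX_AUDIENCE = {"private_feedback": 1, "team_update": 50}


@dataclass(frozen=True)
class ShareAction:
    sensitivity: str    # label attached to the content being shared
    destination: str    # where the agent intends to send it
    audience_size: int  # how many people can read that destination


def scope_violation(action: ShareAction) -> Optional[str]:
    """Return a reason string if the action exceeds its allowed audience, else None."""
    limit = MAX_AUDIENCE.get(action.sensitivity, 0)  # unknown labels are denied
    if action.audience_size > limit:
        return (
            f"{action.sensitivity} may reach at most {limit} reader(s); "
            f"{action.destination} reaches {action.audience_size}"
        )
    return None


org_wide = ShareAction("private_feedback", "#company-all", 400)
direct = ShareAction("private_feedback", "dm:manager", 1)

print(scope_violation(org_wide))
print(scope_violation(direct))
```

The check never reads the feedback notes. It would block the org-wide post even if every sentence in them were accurate, which is the failure a content-only check misses.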
Harvey understood this when they rebuilt their algorithm. The question isn't only whether the content is accurate — it's whether a lawyer can trace the reasoning, statement by statement, and stake their professional judgment on it. Verifiability is a property of the whole output, not just the factual claims inside it.
The security field had a version of this problem twenty years ago. The solution was organizational before it was technical: the field invented the CISO. Someone with budget, authority, and accountability for security posture. Someone whose specific job was to ask, continuously, whether the systems were actually secure.
AI verification doesn't have an equivalent role at most organizations. When something goes wrong, accountability diffuses across the engineer who built the prompt, the PM who defined the feature, the legal team that approved the use case, and the executive who mandated the AI transformation.
The organizations ahead on this have assigned a specific person — often an engineer or technical PM — to own the evaluation pipeline. Production-derived test cases. Sampling protocols for reviewing live outputs. An escalation path when something looks wrong. Reporting to someone above the team that built the system, with authority to slow deployment when verification coverage is thin.
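Two pieces of that ownership can be written down as code: a sampling protocol and a coverage threshold. This is a rough sketch only. The function names, the 5% rate, and the threshold are assumptions, not figures from any of the organizations described.

```python
import random


def sample_for_review(output_ids: list[int], rate: float, seed: int = 0) -> list[int]:
    """Pick a reproducible random sample of live outputs for human review."""
    if not output_ids:
        return []
    k = max(1, round(len(output_ids) * rate))
    return sorted(random.Random(seed).sample(output_ids, k))


def coverage_is_thin(reviewed: int, produced: int, min_coverage: float = 0.05) -> bool:
    """True when the share of reviewed outputs falls below the agreed threshold."""
    if produced == 0:
        return False
    return reviewed / produced < min_coverage


live_outputs = list(range(200))  # stand-in IDs for a day's agent outputs
review_queue = sample_for_review(live_outputs, rate=0.05)

print(len(review_queue))
print(coverage_is_thin(reviewed=3, produced=200))
```

The fixed seed makes the sample reproducible, so a later audit can confirm which outputs were meant to be reviewed. The threshold gives the owner a concrete condition for slowing a deployment.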
The title varies. The function is the same: who answers the question "is what our agents are producing correct?"
Most organizations don't have a name for that role. That's the gap.
This post expands on Chapter 3 of Wrong by Default: What AI Builders Know That Everyone Else Doesn't by Alokit. Available on Kindle ($7.99): amazon.com/dp/B0GZCY9CGF