Adaptive Thinking Doesn't Change Quality. It Changes Variance.


Boris Cherny posted a comment on Hacker News yesterday recommending CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 as a Claude Code setting. Boris works on the Claude Code team at Anthropic (you may know him from “Programming TypeScript,” the O’Reilly book). The setting disables the system that allocates reasoning budget per turn on the fly, forcing a fixed high budget instead. His explanation: “That’s the system that decides ’this looks easy, I’ll think less,’ and it’s frequently wrong.”

I wanted to test that claim. Blind A/B test, no peeking.

I ran 10 rounds. 20 headless claude -p sessions total. Same task every time: code review of a 310-line Python hook file with multiple code paths and security-sensitive logic. Variant A had CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 set. Variant B had it explicitly set to 0. Each round ran both variants in parallel with the same prompt. A separate claude -p judge scored every output blind, using random hex IDs, on four dimensions: True Positives, Severity Calibration, Actionability, and Thoroughness (1-10 each). That’s 40 headless sessions total, counting the judges. There goes my weekly token budget.

The judge never saw which variant was which.
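
For concreteness, here’s roughly what one round looked like. This is a sketch, not my exact harness; the prompt file, timeout, and judge wording are placeholders, but the shape is the same: identical prompt, env var flipped, random hex IDs, and a separate blind judge call per output.

```python
import concurrent.futures
import os
import secrets
import subprocess

PROMPT = open("review_prompt.txt").read()  # placeholder: review instructions + the 310-line hook file

def run_variant(disable_adaptive: str) -> tuple[str, str]:
    """Run one headless claude -p session and return (blind hex ID, review text)."""
    env = {**os.environ, "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": disable_adaptive}
    result = subprocess.run(["claude", "-p", PROMPT], env=env,
                            capture_output=True, text=True, timeout=600)
    return secrets.token_hex(2), result.stdout  # hex ID hides which variant this was

# One round: both variants in parallel, same prompt.
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_variant, "1"),   # Variant A: adaptive thinking disabled
               pool.submit(run_variant, "0")]   # Variant B: adaptive thinking enabled
    outputs = [f.result() for f in futures]

# A separate blind judge scores each output on the four dimensions.
for hex_id, review in outputs:
    judge_prompt = (f"Score review {hex_id} from 1-10 on True Positives, "
                    f"Severity Calibration, Actionability, Thoroughness:\n\n{review}")
    subprocess.run(["claude", "-p", judge_prompt], capture_output=True, text=True)
```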

The Numbers

Here’s what came out.

| Dimension | A mean | A stdev | B mean | B stdev | Delta (B-A) |
|---|---|---|---|---|---|
| True Positives | 7.90 | 0.99 | 8.00 | 1.50 | +0.10 |
| Severity Calibration | 7.30 | 0.48 | 7.22 | 0.97 | -0.08 |
| Actionability | 8.90 | 0.32 | 8.78 | 0.44 | -0.12 |
| Thoroughness | 8.50 | 0.53 | 8.78 | 0.44 | +0.28 |
| Composite | 8.15 | 0.46 | 8.19 | 0.76 | +0.04 |

A composite score of 8.15 vs 8.19. That’s a difference of 0.04 on a 10-point scale. Not meaningful. Both variants are doing the same quality of work on average. Across the nine rounds where both variants produced a scorable output, A won 4, B won 4, and 1 was a tie, with a mean paired difference of exactly 0.00.
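
As far as I can tell, the composite is just the unweighted mean of the four dimension scores; that’s an inference, but the table above is consistent with it:

```python
from statistics import mean

# Dimension means from the table; the composite row matches a plain average.
a_composite = mean([7.90, 7.30, 8.90, 8.50])   # = 8.15
b_composite = mean([8.00, 7.22, 8.78, 8.78])   # = 8.195, reported as 8.19
```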

The quality finding is a null result.

But the stdev column tells a different story. Variant A (adaptive thinking disabled) shows a composite stdev of 0.46. Variant B (adaptive enabled) comes in at 0.76, about 65% higher. Three of the four dimensions show the same pattern, with B’s stdev above A’s, and Severity Calibration is the extreme case at 0.97 vs 0.48, roughly 2x the spread. (Thoroughness is the one exception, where B is slightly tighter.)

Duration variance tracks the same way. Both variants averaged about 81 seconds per run, but where Variant A’s duration stdev was 6.4 seconds, Variant B’s came in at 16.4. That’s 2.5x higher spread on how long sessions take.

And then there’s the failure data. Variant B had 1 outright session failure (Round 9, hex 4f7a, exit code 1, no score). All 10 Variant A sessions succeeded. Variant B also produced 2 false positive CRITICAL findings vs 1 for Variant A.

What I Expected vs What I Got

I expected one of two things. Either the setting would produce better quality (supporting the claim), or it would produce worse quality (making the adaptive system look essential). Neither happened. The quality is flat. What changed is the distribution.

That wasn’t what I was looking for. But it’s the finding.

The question isn’t “Which variant is smarter?” They’re the same on that dimension. The question is which variant is more predictable.

Why Variance Matters More Than I Thought

Most benchmark discussions focus on mean quality. Higher average score wins. But I dispatch 10-20 parallel agents per session on anything non-trivial. In that environment, the outlier matters more than the average.

Think about what a high-variance session costs. A false positive CRITICAL finding sends a developer down a rabbit hole that leads nowhere. An outlier session that takes 100 seconds instead of 80 means, in a parallel batch of 15, you’re waiting on the slowest one. A session failure in a pipeline that expected an output file means a cascade rerun.

Variance is the enemy of parallel pipelines. Not bad average quality.
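
To make “waiting on the slowest one” concrete, here’s a rough simulation using the duration numbers from above: mean around 81 seconds, stdev 6.4 vs 16.4, a batch of 15 parallel sessions. The normal-distribution assumption is mine and the exact numbers don’t matter; the point is that a parallel batch finishes at the max, not the mean.

```python
import random
from statistics import mean

random.seed(0)

def batch_wall_clock(duration_stdev: float, batch_size: int = 15, trials: int = 10_000) -> float:
    """Average wall-clock time for a parallel batch: you wait on the slowest session."""
    totals = []
    for _ in range(trials):
        durations = [random.gauss(81, duration_stdev) for _ in range(batch_size)]
        totals.append(max(durations))  # the batch is done when the last session is done
    return mean(totals)

print(batch_wall_clock(6.4))    # ~92s with Variant A's spread
print(batch_wall_clock(16.4))   # ~110s with Variant B's spread
```

Same mean session time either way, but the batch pays for the spread.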

Here’s what the per-round breakdown looks like:

| Round | A Score | B Score | Notes |
|---|---|---|---|
| 01 | 8.50 | 8.50 | Tie |
| 02 | 8.00 | 8.75 | B wins |
| 03 | 8.75 | 8.50 | A wins |
| 04 | 8.50 | 8.75 | B wins |
| 05 | 8.25 | 6.50 | A wins (B: CRITICAL false positive) |
| 06 | 8.25 | 8.00 | A wins |
| 07 | 7.75 | 8.75 | B wins |
| 08 | 8.50 | 7.50 | A wins (B: false positive headline finding) |
| 09 | 7.75 | N/A | A wins (B: session failed) |
| 10 | 7.25 | 8.50 | B wins |

Look at Round 5 and Round 9. Those are the variance events. Round 5, B scored 6.50 because its first finding was a CRITICAL false positive about a regex bypass that wasn’t actually there. Round 9, B produced no output at all. Neither of those rounds dragged down B’s average much because the other 8 rounds were fine. But in a parallel workflow, you don’t get to average out your failures. You have to handle them.

Variant A’s worst round was a 7.25. Variant B’s worst was a session failure. That gap isn’t captured in the mean scores.
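
If you want to sanity-check the summary table against the per-round scores, the composite means and stdevs come straight out of them (Round 9 drops out of B’s stats because there was no score to include):

```python
from statistics import mean, stdev

a_scores = [8.50, 8.00, 8.75, 8.50, 8.25, 8.25, 7.75, 8.50, 7.75, 7.25]
b_scores = [8.50, 8.75, 8.50, 8.75, 6.50, 8.00, 8.75, 7.50, 8.50]  # Round 9: session failed, no score

print(mean(a_scores), stdev(a_scores))  # 8.15, ~0.46
print(mean(b_scores), stdev(b_scores))  # ~8.19, ~0.76
```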

The Limitation I Should Name

N=10 per variant with a single judge is not a solid statistical foundation. The judge is another claude -p call, which adds its own variance. A 0.04 composite difference is well within noise. I’d want 30+ rounds and multiple independent judges to make strong claims about quality. The only quality claim I’ll make is that there’s no meaningful difference I can detect at this sample size.

The variance finding is more durable. Stdev 0.76 vs 0.46, one session failure vs none, two false positive CRITICALs vs one, all on Variant B’s side. If I ran another 10 rounds, I’d expect the quality means to stay close, but I’d also expect the variance pattern to hold.

But I’m one person testing one task. The hook file I used is a security-sensitive Python script with 310 lines of real logic; a different review target could produce different results.

The Setting, and What to Do With It

CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 is an environment variable. You can set it in your shell before launching Claude Code, or in a wrapper script that launches headless sessions. I’ve been setting it globally in my session launcher since running this test.
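
If your launcher is a script rather than an interactive shell, the whole change is one line. A minimal sketch of a headless launcher, with the prompt handling as a placeholder:

```python
import os
import subprocess
import sys

# Force the fixed high thinking budget instead of the adaptive allocator.
os.environ["CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING"] = "1"

# Placeholder launcher: take the prompt as the first argument and run a
# headless session; the child process inherits the environment variable.
subprocess.run(["claude", "-p", sys.argv[1]])
```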

It’s free. No quality cost in my testing. The only thing it changes is the spread. For workflows that run single sessions sequentially, the benefit is small. For workflows that dispatch parallel agents and wait for the slowest one, or that use batch output from headless runs, the consistency is worth having.

I don’t know if this will still matter in six months. The underlying adaptive thinking system will probably improve. The environment variable might go away. None of us know where Claude Code’s architecture is heading. But right now, in my testing, it reduces outliers without touching average quality.

Worth trying if you’re running parallel workflows. Your mileage may vary.