Dunning-Kruger and the Communication Tax

The sociological impact of the Dunning-Kruger effect is often understated, hidden under the expectation of social lubrication, and that expectation breaks down in human-LLM communication. Humans and LLMs occupy different ends of domain expertise: humans understand the world, society, and other humans better, while LLMs have access to nearly all the world's knowledge. One may imagine alignment as making AI kinder, more helpful, and otherwise more anthropomorphic, but the conditions for that may be harder to reach than expected, because the problem is rooted in a communication "tax" paid during every conversation.

The simplest case, a human consulting an LLM the way they would search Google in an area they are unfamiliar with, puts the LLM in the position of having more domain knowledge than the human. Here the human typically expects the LLM to "dumb down" the topic, either implicitly or on request, so the human can follow. This dumbing down is a tax the LLM pays, in tokens and electricity, to communicate with someone who has less domain knowledge.

On the other hand, when the human is the expert and the LLM is responding in the average register it has learned, the tax flips. The human pays it in long preference prompts, in re-prompting, and in the mental work of filtering out hedges that shouldn't have been there.

In human-to-human conversation this gets papered over by social lubrication: tone of voice, softening when you're not sure, backing up when you notice you've lost the other person. Across enough interactions both sides build a model of each other, and that lubrication becomes the foundation the calibration rests on.

LLMs have mechanisms aimed at this, including user preference prompts, CLAUDE.md and similar project-level rules, and explicit memory features in some chatbots. They are how the calibration is built today, and they work up to a point, but they are clumsy and carry hidden failure modes. The user has to write the calibration explicitly. Worse, the user has to figure out which dimensions to name in the first place, and the dimensions of human-side cost are not obvious. The user has no visibility into what the model actually internalized versus what it discarded, and calibration written for one task tends to misfire in another.

So I tried to examine how to actually quantify and understand the mechanism behind this calibration.

The framework

Communication runs between two agents (sender s and receiver r) over time t. The question is what produces aversion in the sender as the interaction unfolds.

Capability

Let K_s, K_r ∈ ℝ≥0 denote the capability of sender and receiver in the relevant content domain (the resolution at which content can be represented, manipulated, and transmitted without loss). Capability is domain-specific: agents can have high K in one domain and low K in another. K is treated as approximately stationary on the timescale of a single interaction and allowed to vary across interactions.

Communication cost

The communication cost C(t) is the cost incurred in making the exchange work at time t:

C(t) = D_KL(P_intended(t) ‖ P_transmitted(t)) + W_encode(t) + W_decode(t)

where P_intended(t) is the distribution over content the sender means to convey at time t, P_transmitted(t) is the distribution that actually reaches the receiver after encoding, transmission, and decoding, D_KL is the Kullback–Leibler divergence capturing information loss in the channel, W_encode(t) is the cognitive work the sender pays to compress content into a form the receiver can decode, and W_decode(t) is the receiver's decoding work.

C → 0 when capabilities match for the content domain and compression schemes are shared (the shared-prior case). C rises with capability mismatch and with mismatched compression schemes.

A units note. D_KL is information-theoretic (bits or nats); W_encode and W_decode are effort-valued. The scalar form C = D_KL + W_encode + W_decode requires a units convention casting the work terms in information-theoretic units (for instance, the channel capacity each side commits to the exchange). The cleaner reading keeps the components as a vector; the vector extension below takes this up.

Tokens are not the cost unit. Token counts can be correlated with C in some regimes and anti-correlated in others; the cost is information-theoretic at root.
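
To make the accounting concrete, here is a minimal sketch of the scalar form under the units convention above: discrete distributions over candidate meanings, D_KL computed in nats, and the two work terms already cast into nats. The distributions and work values are invented for illustration, not measurements.

```python
import numpy as np

def kl_nats(p_intended, p_transmitted, eps=1e-12):
    # D_KL(P_intended || P_transmitted) in nats over a shared discrete support.
    p = np.asarray(p_intended, dtype=float) + eps
    q = np.asarray(p_transmitted, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def communication_cost(p_intended, p_transmitted, w_encode, w_decode):
    # Scalar C(t) = D_KL + W_encode + W_decode, with the work terms in nats.
    return kl_nats(p_intended, p_transmitted) + w_encode + w_decode

# Toy example: the sender intends a sharp claim, the channel blurs it.
intended    = [0.85, 0.10, 0.05]   # distribution over candidate meanings
transmitted = [0.50, 0.30, 0.20]   # what survives encoding, transmission, decoding
print(communication_cost(intended, transmitted, w_encode=0.4, w_decode=0.2))
```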

Expected cost

The expected cost E(t) is the agent's prior at time t on what C should be. E is set initially by category cues (kinship, shared profession, prior interactions, stated specifications) and updated by experience. E is partly produced by the interlocutor's earlier outputs: the prior is endogenous to the system, not exogenous.

E is not updated by Bayesian rules in general. Real agents exhibit prior stickiness (priors that persist despite contrary evidence) and category snap-back (priors that reset to a category default after each interaction ends). These deviations from rational updating are accommodated as empirical inputs.

The endogeneity is the structurally novel point. E(t) is a functional of the cost history {C(t′)}_{t′<t}, including the agent's own contributions to that history. The explicit feedback law (the operator that maps cost history into the next expectation update) is not given here; it is one of the framework's empirical openings.
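
Since the feedback law is left open, here is one hypothetical member of the family, just to show the shape of the ingredients: exponential smoothing toward observed cost (prior stickiness as a smoothing weight near 1) plus a partial reset toward a category default when the interaction ends (category snap-back). The function names and constants are assumptions, not the framework's claim.

```python
def update_expectation(E_prev, C_observed, stickiness=0.8):
    # One hypothetical E-update: exponential smoothing where stickiness near 1
    # makes the prior resist contrary evidence (prior stickiness).
    return stickiness * E_prev + (1.0 - stickiness) * C_observed

def end_of_interaction(E_current, E_category, snap_back=0.6):
    # Category snap-back: after the interaction, the prior partially resets
    # toward the category default instead of persisting.
    return snap_back * E_category + (1.0 - snap_back) * E_current

# Within a session the prior drifts slowly toward the observed cost...
E = 1.0
for C_observed in [2.0, 2.0, 2.0]:
    E = update_expectation(E, C_observed)
# ...then snaps most of the way back to the category default afterwards.
E = end_of_interaction(E, E_category=1.0)
```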

Gap

The gap is

Δ(t) = C(t) − E(t).

The gap is the load-bearing quantity. It expresses how surprised the agent is by the actual cost relative to what they expected.

Aversion

The aversive signal A(t) is a function of the gap and its derivatives:

A(t) = f(Δ(t), Δ̇(t), Δ̈(t))

where dots denote time derivatives (ẋ ≡ dx/dt, ẍ ≡ d²x/dt²). The functional form of f is taken as approximately threshold-based: aversion remains small until derivatives exceed some level, then rises sharply. The exact form is one of the framework's empirical openings.
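
For concreteness, one threshold-shaped candidate for f, with a soft logistic step so aversion stays small until the derivatives cross their thresholds and then rises sharply. The thresholds, weights, and sharpness are illustrative assumptions, not fitted values.

```python
import math

def aversion(gap, gap_dot, gap_ddot,
             theta_dot=0.5, theta_ddot=0.2, sharpness=10.0):
    # Soft threshold: near zero below theta, rising sharply above it.
    step = lambda x, theta: 1.0 / (1.0 + math.exp(-sharpness * (abs(x) - theta)))
    # Derivatives dominate, per the first- and second-order claims below;
    # the static gap contributes only weakly.
    return (0.1 * abs(gap)
            + abs(gap_dot) * step(gap_dot, theta_dot)
            + abs(gap_ddot) * step(gap_ddot, theta_ddot))
```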

First-order structure

The first derivative decomposes:

Δ̇(t) = Ċ(t) − Ė(t).

This separates phenomena that a static (pure-Δ) account conflates:

| Regime | Ċ | Ė | Interpretation |
| --- | --- | --- | --- |
| (i) | > 0 | ≈ 0 | Cost rising, prior stable. Things getting harder. |
| (ii) | ≈ 0 | < 0 | Cost stable, prior tightening to expect lower cost. Long-marriage / disillusionment. |
| (iii) | > 0 | > 0 | Cost rising and prior tracking it. Healthy calibration; Δ̇ stays small. |
| (iv) | < 0 | < 0 | Cost falling and prior tightening even faster. Honeymoon-end shape. |

First-order claim. Aversion correlates with |Δ̇|, not with |Δ| alone. Identical absolute difficulty registers as "working through something hard together" or "failing each other" depending on whether expectations track reality.
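
The four regimes can be read off directly from the signs of Ċ and Ė. A small classifier, with the "approximately zero" tolerance as an assumption:

```python
def first_order_regime(C_dot, E_dot, tol=0.05):
    # Classify (Ċ, Ė) into the four first-order regimes from the table above.
    near_zero = lambda x: abs(x) <= tol
    if C_dot > tol and near_zero(E_dot):
        return "(i) cost rising, prior stable: things getting harder"
    if near_zero(C_dot) and E_dot < -tol:
        return "(ii) cost stable, prior tightening: disillusionment"
    if C_dot > tol and E_dot > tol:
        return "(iii) healthy calibration: Δ̇ stays small"
    if C_dot < -tol and E_dot < -tol:
        return "(iv) cost falling, prior tightening faster: honeymoon-end"
    return "unclassified"
```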

Second-order structure

The second derivative governs reset behavior (the decision to disengage, recategorize, or sever):

Δ̈(t) = C̈(t) − Ë(t).

| Regime | Δ̇ | Δ̈ | Interpretation |
| --- | --- | --- | --- |
| (a) | > 0 | > 0 | Divergence accelerating. Calibration loop failing; triggers reset. |
| (b) | > 0 | < 0 | Divergence growing but decelerating. Recovery shape; calibration catching up. |
| (c) | ≈ 0 | > 0 | Currently calibrated but accelerating. Early-warning regime. |
| (d) | ≈ 0 | ≈ 0 | Stable mismatch. Sustainable, even when Δ itself is large. |

Second-order claim. Aversion correlates with Δ̇; disengagement correlates with Δ̈. Agents persist through painful interactions with stable Δ̇; they leave when Δ̈ goes positive and persists. The acceleration is the signal that the calibration loop itself is non-recoverable: not just that things are bad, but that the agent's model of the interlocutor is breaking down faster than it can be updated.

The qualitative claim is supported quantitatively by an optimal-stopping argument in the long-form paper. For sustained Δ̈ > 0, expected continuation cost grows at least quadratically in horizon while the cost of a reset is bounded; reset becomes rational once the continuation cost exceeds the reset cost plus expected post-reset cost. The threshold above which reset is rational depends on the per-unit-time cost function, the reset cost, and the agent's discount rate.
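
The long-form paper carries the derivation; the sketch below only illustrates the shape of the stopping rule described above. Δ is extrapolated locally as a quadratic in the horizon, continuation cost is discounted and summed, and reset fires once continuation exceeds reset cost plus expected post-reset cost. The horizon, discount, and cost values are illustrative assumptions.

```python
def continuation_cost(delta, delta_dot, delta_ddot, horizon, discount=0.95):
    # Discounted cost of continuing, with Δ extrapolated locally as
    # Δ(k) ≈ Δ + Δ̇·k + ½·Δ̈·k² over the next `horizon` steps.
    total = 0.0
    for k in range(1, horizon + 1):
        per_period = max(0.0, delta + delta_dot * k + 0.5 * delta_ddot * k * k)
        total += (discount ** k) * per_period
    return total

def reset_is_rational(delta, delta_dot, delta_ddot,
                      reset_cost, post_reset_cost, horizon=50):
    # Reset once continuing is expected to cost more than resetting plus the
    # expected cost of the post-reset relationship.
    return (continuation_cost(delta, delta_dot, delta_ddot, horizon)
            > reset_cost + post_reset_cost)
```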

Operationalization of C

C is information-theoretic and not directly measurable. Three complementary proxies, used jointly:

  1. Specification tax τ_spec: tokens of preference, system, or clarification prompts a user pays to set E correctly, normalized against a baseline of zero-specification interaction. (τ_spec is closer to a direct measure of specification work than a proxy for C; it counts the user's expenditure on the W_spec component.)
  2. Re-prompting rate ρ: clarifications, re-issues, and corrections per session, per topic.
  3. Counterfactual quality gap q: difference in output quality between specified and unspecified conditions, measured by ablation.

If all three correlate across users, they triangulate a common underlying C. Divergence between them is itself informative: it indicates C is multi-dimensional along axes the proxies separate.

The "tokens are not the cost unit" remark above stands. The proxies use token counts because token counts correlate with specific cost components in known regimes, not because tokens are C.

Asymmetric observability

C is jointly paid but only partially observable by either party.

The user pays:

  • W_spec: specification work (writing preference prompts, clarifying)
  • W_attn: attention work (filtering hedges, mentally bracketing model defaults)
  • W_reprompt: re-prompting work

The model pays:

  • W_compress: forcing rich internal representations through narrower output channels
  • W_default: applying register that may not fit
  • W_prior: prior-mismatch tax (the user's E moves in ways training cannot track per-user)

Total C = ΣW, and each side observes only its own terms. Optimization against perceived cost on either side does not minimize joint C in general; it produces a cost-shifting equilibrium that is Pareto-dominated by joint optimization.
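
A toy numeric illustration of the cost-shifting point, with every number invented: the model reduces the only terms it can observe by shifting decoding work onto the user, and joint C goes up.

```python
# Each side observes only its own W-terms of the jointly paid C = ΣW.
user_terms  = {"W_spec": 1.0, "W_attn": 0.5, "W_reprompt": 0.5}
model_terms = {"W_compress": 1.5, "W_default": 0.5, "W_prior": 0.5}
joint_before = sum(user_terms.values()) + sum(model_terms.values())   # 4.5

# The model "optimizes" its observable cost by compressing less carefully,
# which shifts extra attention/filtering work onto the user.
model_terms["W_compress"] -= 0.5   # model-side cost falls
user_terms["W_attn"]      += 0.8   # user-side cost rises by more

joint_after = sum(user_terms.values()) + sum(model_terms.values())    # 4.8
assert joint_after > joint_before  # unilateral optimization raised joint C
```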

Vector extension

E is multi-dimensional in practice: users hold expectations along distinct axes (factual accuracy, register, memory persistence, agentic capability, etc.). Let 𝚫(t) ∈ ℝⁿ be the gap vector across n axes.

The agent's experience is modeled as tracking the worst dimension, not the average:

A(t) ≈ f(max_i Δᵢ(t), max_i Δ̇ᵢ(t), max_i Δ̈ᵢ(t)).

This explains why "the model is great except it can't do X" produces aversion disproportionate to mean performance: max-aggregation differs structurally from mean-aggregation, and the worst dimension dominates.

Max-aggregation is a hypothesis, not a derivation. Alternatives (top-k, soft-max, learned aggregator) are equally consistent with the same anecdotal pattern. Max is the simplest empirical-content-bearing form.
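
The difference between the two aggregators is visible in one line; the gap values are invented.

```python
import numpy as np

# Gap vector across axes: fine on most, bad on the one thing the model "can't do".
delta = np.array([0.10, 0.05, 0.10, 1.50])

mean_view = delta.mean()   # ~0.44, looks fine on average
max_view  = delta.max()    # 1.50, the worst dimension dominates the experience
```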

Alignment-target rewrite

Current implicit target. Minimize 𝔼ᵢ[|Δᵢ|] at a static snapshot, averaged over users i.

Proposed target.

min_model 𝔼ᵢ[Δ̇ᵢ² + λ Δ̈ᵢ²]

where λ ≥ 0 weights reset-prevention against drift-tolerance. This optimizes for bounded calibration-mismatch with no acceleration, per-user rather than averaged.
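
Evaluated on a sampled per-user gap trajectory via finite differences, the proposed objective looks like the sketch below; it only shows what the quantity is, not how a model would be optimized against it.

```python
import numpy as np

def calibration_objective(delta_trajectory, lam=1.0, dt=1.0):
    # Δ̇² + λ·Δ̈² along one user's gap trajectory, via finite differences,
    # averaged over time. lam trades reset-prevention against drift-tolerance.
    delta = np.asarray(delta_trajectory, dtype=float)
    d1 = np.gradient(delta, dt)   # Δ̇
    d2 = np.gradient(d1, dt)      # Δ̈
    return float(np.mean(d1 ** 2 + lam * d2 ** 2))

# The proposed target averages this per-user quantity over users i.
```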

The structural difficulty. Eᵢ(t) is partly a function of the model's own output history {Cᵢ(t′)}_{t′<t}. The model is generating the priors against which it will be measured. The proposed target therefore defines a fixed-point problem on the policy: the optimal policy is a fixed point of the operator "minimize the calibration mismatch the policy itself produces". Any solution must close the loop self-consistently rather than treating E as exogenous. Existence of such a fixed point under realistic assumptions on E-update dynamics is not addressed here.

Predicted phenomena

Five predictions follow from the structure:

  1. Disillusionment trajectory. Heavy users of stationary-capability systems exhibit Ė < 0 and Ċ ≈ 0, producing Δ̇ > 0 persistently. Light users with stable priors do not.
  2. Sycophancy and condescension are the same phenomenon. Both arise from failure to track per-user E. Mitigations targeting one without the other will produce the other as a side-effect.
  3. Specification tax tracks default mismatch. Population-level distributions of τ_spec measure how badly the model's default E-estimate is mis-set relative to the user population.
  4. Reset cascades. When Δ̈ exceeds threshold on any single axis i, the agent triggers a reset on the joint relationship, not just on axis i. This predicts disproportionate churn responses to single-axis capability regressions. The cascade prediction inherits the max-aggregation hypothesis above.
  5. Capability-improvement does not monotonically reduce aversion. Improvements that lower C on already-tracked axes also tighten E across the board, producing Δ̇ > 0 on previously stable axes. The "smarter model that feels worse" pattern is structurally predicted.

Sycophancy and condescension are the two ends of the curve

The model is trained on feedback from a broad population, with a small amount of context-window adaptation per session. It has no model of any specific user, so when it meets one it defaults to the population average.

For a novice on this part of the Dunning-Kruger curve, the population-average register is more cautious and layered than they expected, and they read it as the model being unhelpfully complex. Their thumbs-ups and thumbs-downs push training toward cutting corners on the layered structure, and that is what fuels sycophancy.

For a user past the climb on the same curve, the population-average register is hedging things they already know, and they read it as the model being condescending. They expected the conversation to meet them at their level; instead it meets the population average.

Both failure modes are the same loop hitting opposite ends of the curve. Any mitigation aimed at one without changing the loop should worsen the other; benchmark B2 in the long-form paper linked below quantifies that trade-off. The condescension benchmark needed to test it does not exist yet; building it is part of the proposed work.

Why the model can't meet you

The deeper reason for this failure is the curse of expertise, ported to dense parameters.

In humans, the best teachers do maintenance work most experts skip: holding onto what it was like to not understand, refusing to let the rigorous frame fully overwrite the naive one. It is a skill, learned with effort, and rare in any given domain.

In dense LLMs the analogous pressure is to be correct on average across training. The naive frame gets absorbed into the correct one, and the substrate that would have generated it is repurposed for accuracy. When a user actually wants that simpler frame, what comes back is almost-the-real-thing-with-the-hard-parts-elided rather than the genuinely simpler model the user holds, because the right answer leaks through. Persona-conditioning at the prompt level is vocabulary substitution over the same generation, where what the model says changes but how it reasons underneath does not.

MoE doesn't fix this, at least not as currently trained. Production MoE specializes at the level of token types and broad domains, not at the level of competing world-models. The same forward pass mixes experts via gating weights, which is closer to the leak-through failure than to clean partition. What is needed is conditional computation that selects between coherent generative models with low cross-talk, conditioned on the audience rather than just on the input tokens. The architectural slack exists in MoE, and current training does not use it for this.


Full technical development (parametric E-update family with equilibrium analysis, optimal-stopping derivation with both per-period-cost cases, reference encoder architecture spec, pre-registered four-benchmark falsification suite, alignment-target rewrite as fixed-point problem) at https://doi.org/10.5281/zenodo.20039762.