Attention Is Bayesian Inference



My journey from building cricket bots to building “wind tunnels” — and finding the answer hidden in geometry.

By Vishal Misra

A few years ago, I was trying to solve a specific problem: I wanted to build a natural language interface for ESPNcricinfo’s StatsGuru, something I helped create with my buddy Travis Basevi. I wanted cricket fans to ask complex questions like “What is the record of Dhoni against wrist spinners in the middle overs?” and get an accurate answer, not a hallucination.

This was early days for LLMs. I tried GPT-3, thinking it might just “know” how to answer. It didn’t. It solved maybe 5% of the problem; for the rest, it confidently made things up. Great storyteller, terrible statistician.

So I tried something else. I used the LLM not to answer the question, but to translate it into a structured query, fetch precise data from our database, then process that data back into a response. Without knowing it, I had stumbled onto an early form of RAG — this was Fall 2020, more than two years before ChatGPT.
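In pseudocode, the loop looked roughly like this. Every helper name below is a stand-in I made up for illustration, not the actual AskCricinfo implementation:

```python
# A minimal sketch of the translate -> fetch -> summarize loop described above.
# Every helper here is a stand-in stub; none of this is the actual AskCricinfo code.

def llm_translate(question: str) -> dict:
    """Stand-in for an LLM call that turns a question into a structured query."""
    return {"batsman": "MS Dhoni", "bowling_style": "wrist spin", "overs": "7-15"}

def run_stats_query(query: dict) -> dict:
    """Stand-in for a lookup against the statistics database."""
    return {"rows": "<exact figures returned by the database>"}

def llm_summarize(question: str, data: dict) -> str:
    """Stand-in for an LLM call that phrases the retrieved facts as prose."""
    return f"Answer to {question!r}, grounded in: {data['rows']}"

def answer(question: str) -> str:
    # The LLM never invents numbers: it only translates the question in,
    # and phrases the retrieved data out.
    structured = llm_translate(question)
    facts = run_stats_query(structured)
    return llm_summarize(question, facts)

print(answer("What is Dhoni's record against wrist spinners in the middle overs?"))
```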

It worked. AskCricinfo saw usage jump by three orders of magnitude compared to StatsGuru. But something nagged at me. While I was building around the model — treating it as a black box — I couldn’t shake the question: what is actually happening inside?

When the model gets it right — when it learns a pattern from three examples in a prompt — is it doing genuine probabilistic reasoning? Or just sophisticated pattern matching?

I didn’t just want to build better tools with LLMs. I wanted to understand the machine. So over the past year, my collaborators (Sid Dalal and Naman Agarwal) and I went down a rabbit hole.

The result is three papers — a trilogy — that together prove a single theorem about how these models work: Bayesian inference isn’t something transformers do in special cases. It’s what the architecture is.

The three papers form a complete argument: Paper I shows which architectures can implement exact Bayesian inference. Paper II shows how gradient descent creates this capability. Paper III shows that this geometry persists at scale in production LLMs.

But here’s the twist — it’s not about attention specifically. We found that Mamba, a state-space model with no attention mechanism at all, performs just as well. The key isn’t attention. It’s content-based value routing: the ability to select and weight information based on input content rather than fixed structural patterns.

The brute-force training process sculpts these models into inference engines. They build a physical geometry — orthogonal hypothesis frames and entropy-ordered manifolds — that implements Bayesian updating as a mechanical process.

They aren’t Bayesian by design. They become Bayesian by geometry.

The problem: you can’t test what you can’t measure

The biggest hurdle in interpretability research is lack of ground truth. When we ask GPT-5 to write an essay, there’s no single “correct” probability distribution to compare against.

To test if these models think like Bayesians, we needed a controlled environment. Aerospace engineers use wind tunnels to isolate variables. We built Bayesian Wind Tunnels.


A Bayesian Wind Tunnel: We built controlled environments where the true posterior is known analytically, memorization is impossible, and every prediction can be measured against ground truth.

These are tasks where we know the exact analytic posterior at every step, and the hypothesis space is vast enough (random bijections, HMMs) that memorization is impossible. If the model gets it right, it must be reasoning in-context.

Paper I: the “20 Questions” game inside the machine

We started with two tasks: Bijection Learning (eliminating possibilities in a random mapping) and HMM State Tracking (filtering noise to track a hidden state).
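To make "exact analytic posterior" concrete, here is a toy version of the bijection task (my own miniature, not the paper's exact setup): the hypothesis space is every bijection of a small alphabet, and each in-context example eliminates the candidates that disagree.

```python
from itertools import permutations

# Toy version of the bijection wind tunnel (a simplification, tiny alphabet):
# a hidden random bijection generates the in-context examples, and the exact
# posterior is uniform over every bijection still consistent with the evidence.

alphabet = "ABCD"
observed = [("A", "C"), ("B", "A")]           # (input, output) pairs seen so far

consistent = []
for perm in permutations(alphabet):           # enumerate all candidate bijections
    mapping = dict(zip(alphabet, perm))
    if all(mapping[x] == y for x, y in observed):
        consistent.append(mapping)            # survives elimination

# Exact posterior over what "C" maps to, given the evidence so far.
query = "C"
posterior = {}
for mapping in consistent:
    posterior[mapping[query]] = posterior.get(mapping[query], 0) + 1 / len(consistent)

print(posterior)   # {'B': 0.5, 'D': 0.5} -- each new pair eliminates hypotheses
```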

Small transformers reproduced the analytic Bayesian posterior with error margins of 10⁻³ bits. But so did Mamba — a completely different architecture with no attention.

To explain how, I use the analogy of 20 Questions.

To win 20 Questions, you organize the space of possibilities in your head, then ask questions that maximally eliminate wrong answers. Each “yes” or “no” cuts the hypothesis space. A good player converges efficiently.

We found that both Transformers and Mamba build this game inside their weights:

  1. Setting Up the Hypothesis Space (Layer 0): The model builds a “hypothesis frame” using orthogonal keys — a distinct slot for every possibility.
  2. Asking Questions (Middle Layers): As the model processes tokens, it routes information to eliminate hypotheses that don’t match evidence. Each layer is another question cutting the space.
  3. Reading the Uncertainty (Late Layers): The model builds a curve (a value manifold) where one end is “no idea” and the other is “certain.” The final answer sits at a precise coordinate on this curve.
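One way to see why such a geometry is sufficient (a cartoon of my own, not the circuits extracted in the paper): give every hypothesis an orthogonal key slot, let the query accumulate log-evidence, and a single softmax readout over the slots is exactly Bayes' rule.

```python
import numpy as np

# Cartoon of the geometry above (an illustration, not the paper's extracted circuit):
# orthogonal keys give one slot per hypothesis, the query accumulates log-evidence,
# and softmax attention over the slots reads out the Bayesian posterior.

n_hypotheses = 4
keys = np.eye(n_hypotheses)                              # orthogonal hypothesis frame

log_prior = np.log(np.full(n_hypotheses, 1 / n_hypotheses))
log_lik_1 = np.log(np.array([0.70, 0.10, 0.10, 0.10]))   # evidence from token 1
log_lik_2 = np.log(np.array([0.60, 0.30, 0.05, 0.05]))   # evidence from token 2

# "Asking questions": each token adds its log-likelihood to the query.
query = log_prior + log_lik_1 + log_lik_2

# "Reading the uncertainty": softmax over the key slots is Bayes' rule in log space.
scores = keys @ query
attention = np.exp(scores) / np.exp(scores).sum()

analytic = np.exp(query) / np.exp(query).sum()           # direct Bayesian update
print(np.allclose(attention, analytic))                  # True
```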


Mamba’s final layer representations on the HMM task. Left: points colored by most likely hidden state reveal 5 distinct clusters. Right: the same points colored by posterior entropy show confidence variation within each cluster. The model has learned that HMM belief space has 5 corners (one per state) and organized its representations accordingly.

On HMM tracking (5 hidden states), Mamba’s final layer shows five distinct clusters, one for each state. Points within each cluster vary by color (entropy), showing how confident the model is about being in that state. The model literally learned that the belief space has 5 corners and organized its representations accordingly.


The three primitives of Bayesian inference and which architectures can realize them. Transformers and Mamba handle all three; LSTM only manages Belief Accumulation (counting). This explains why LSTM fails on HMM: it can’t do Belief Transport.

The LSTM result is telling. It handles bijection fine (where a static sufficient statistic works; you just count which mappings you’ve seen) but fails on HMM, which needs a dynamic belief state that evolves under transition dynamics. LSTMs can accumulate fixed-dimensional statistics, but they can’t propagate beliefs through stochastic dynamics because their gates don’t perform content-based routing across positions.
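To see what belief transport demands, here is the exact computation the HMM task requires, the forward filter (the transition and emission matrices below are made up for illustration). The belief vector is pushed through the transition matrix before every Bayes update, which is precisely the step a static, fixed-dimensional counter never performs.

```python
import numpy as np

# Belief transport, concretely: the HMM forward filter. (The transition and
# emission matrices here are made up for illustration.)

n_states, n_obs = 5, 3
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=n_states)  # T[i, j] = P(next=j | current=i)
E = rng.dirichlet(np.ones(n_obs), size=n_states)     # E[i, k] = P(obs=k | state=i)

belief = np.full(n_states, 1 / n_states)              # uniform prior over states
for obs in [0, 2, 1, 1, 0]:                           # an example observation stream
    belief = belief @ T                                # transport: predict step
    belief = belief * E[:, obs]                        # Bayes update with the evidence
    belief = belief / belief.sum()                     # renormalize to a posterior

print(belief)   # the exact posterior a wind-tunnel model must reproduce at this step
```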

When we looked at the LSTM’s internal geometry on the HMM task, we found it also compresses representations onto a 1D manifold, but the wrong one: its coordinate is uncorrelated with entropy. The LSTM finds low-dimensional structure. Just not the right one.

Paper II: the sculptor (gradient descent)

Finding the phenomenon is one thing. The deeper question: why does standard cross-entropy training force a model to build this machinery?

We derived the math. The gradient dynamics have no choice but to create this geometry.

Gradient descent creates a feedback loop we call advantage-based routing:

  • Scores: Attention scores increase for positions where the value vector is “better than average” at lowering the loss.
  • Values: Value vectors get pulled toward the queries that use them.

This feedback loop provably reduces loss. Under the dominant-query approximation (when one query dominates usage of a key), the first-order loss change is ΔL ≈ −η·α²·‖u‖² ≤ 0, where η is the learning rate, α the attention weight on that key, and u the error signal at the attention output. The stronger the attention and the larger the error, the bigger the improvement.
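Here is a small numerical sanity check of the value-update half of that loop, under my reading of the symbols (u as the gradient of the loss with respect to the attention output): one SGD step on a value vector that receives attention weight α lowers the loss by roughly η·α²·‖u‖².

```python
import numpy as np

# Sanity check of the first-order loss change for the value-update half of the
# loop, assuming u is the gradient of the loss w.r.t. the attention output.

rng = np.random.default_rng(1)
d, n, eta = 8, 5, 1e-3
V = rng.normal(size=(n, d))                       # value vectors
alpha = np.array([0.6, 0.2, 0.1, 0.05, 0.05])     # attention weights (held fixed here)
target = rng.normal(size=d)

def loss(values):
    out = alpha @ values                           # attention output
    return 0.5 * np.sum((out - target) ** 2)       # simple squared-error head

u = alpha @ V - target                             # dL/d(output), the "error" vector
j = 0                                              # the dominant value position
grad_vj = alpha[j] * u                             # dL/dV[j]

V_new = V.copy()
V_new[j] -= eta * grad_vj                          # one SGD step on that value only

predicted = -eta * alpha[j] ** 2 * np.sum(u ** 2)  # -eta * alpha^2 * ||u||^2
actual = loss(V_new) - loss(V)
print(predicted, actual)                           # nearly identical for small eta
```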

This exhibits a mechanistic analogy to the EM algorithm. Attention weights act like “soft responsibilities” (E-step); value vectors update like “prototypes” (M-step).

When we explicitly trained using this EM-like decomposition instead of standard SGD, we got faster and more accurate convergence. The model found the Bayesian geometry more directly. This suggests the EM structure captures something real about the optimization landscape — not that transformers literally implement EM, but that the dynamics share the same two-timescale character.
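For readers who want the analogy spelled out, here is the generic E/M skeleton in its simplest form, soft clustering. This is an illustration of the two-timescale structure only, not the paper's actual training procedure.

```python
import numpy as np

# The generic E/M skeleton behind the analogy, in its simplest form (soft
# clustering). An illustration of the two-timescale structure, nothing more.

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)),      # two well-separated clusters
               rng.normal(+2, 0.5, (50, 2))])
prototypes = X[[0, 99]].copy()                    # initialize from two data points

for _ in range(20):
    # E-step: soft responsibilities -- the role attention weights play.
    sq_dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    resp = np.exp(-sq_dists)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: each prototype moves toward the points that claimed it -- the role
    # value vectors play when they are pulled toward the queries that use them.
    prototypes = (resp.T @ X) / resp.sum(axis=0)[:, None]

print(prototypes)   # approximately the two cluster centers, (-2, -2) and (+2, +2)
```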


From chaos to order: Gradient descent doesn’t just minimize loss — it sculpts value vectors into a low-dimensional manifold where position encodes uncertainty. The geometry of optimization is the geometry of inference.

The model isn’t “trying” to be Bayesian. The geometry required to minimize cross-entropy loss is the geometry of Bayesian inference. There’s no other stable solution.

Paper III: does this survive in the wild?

Wind tunnels are convincing, but they aren’t the real world. We analyzed four production model families — Pythia, Phi-2, Llama-3.2, and Mistral — to see if this geometry survived contact with natural language and billions of parameters.

It does.


Same Bayesian core, different plumbing: Across Pythia, Llama, and Mistral, we find the same entropy-ordered manifold (golden core) despite vastly different architectural choices.

Mixed prompts (math, coding, fiction jumbled together) look messy. But restrict to a single domain, and the noise collapses into the same low-dimensional, entropy-ordered manifolds we saw in wind tunnels. LLMs don’t have one Bayesian manifold. They have a repertoire, one per domain.

(A caveat: domain restriction simultaneously reduces task heterogeneity and lexical variability. We can’t fully disentangle these effects. But the collapse is real, and it matches the wind-tunnel geometry too closely to be coincidence.)

We also ran inference-time experiments. In one (we called it SULA), we fed models explicit probabilistic evidence in prompts. As the model read more evidence, its internal state moved along the manifold’s “Bayesian axis,” updating beliefs in real-time.

Then we tried to break it. We cut out the axis aligned with uncertainty. Local geometry broke, but overall behavior didn’t collapse. The geometry is distributed throughout the network, not a single fragile wire.

The physics of “reasoning” models

This geometric view explains why Chain of Thought works.

In our wind tunnels, Bayesian inference is a physical process: a sequence of eliminations, narrowing possibilities layer by layer. But a transformer has fixed depth. When a problem is too hard, the model runs out of layers before finishing the filtering. It has to guess from whatever pile remains.

Chain of Thought acts as a geometric extender.

When a model thinks step-by-step, it buys itself more rounds of the game. Each reasoning step is another forward pass, another chance to eliminate possibilities.

But there’s something deeper. Breaking a complex problem into simpler steps lets the model stay on low-entropy paths: regions where it has high confidence because the intermediate steps are closer to training distribution.

A model might have low confidence on “What is 347 × 892?” as a single leap. But it has high confidence on “What is 7 × 2?” and “How do I add partial products?” Each of those is a well-trodden path. Chain of Thought lets the model hop between high-confidence islands rather than leaping across a low-confidence chasm.
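Spelled out, the decomposition replaces one long leap with three short, familiar hops:

```python
# 347 x 892, decomposed into steps that are each well-trodden paths.
partial_800 = 347 * 800      # 277,600
partial_90  = 347 * 90       #  31,230
partial_2   = 347 * 2        #     694

total = partial_800 + partial_90 + partial_2
assert total == 347 * 892
print(total)                 # 309,524
```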

Reasoning models aren’t smarter in some abstract sense. They’ve learned to decompose problems into steps where the Bayesian machinery can do its job.

What we found

Let me be specific about the claim: content-based value routing is the computational primitive that enables Bayesian inference. Attention provides it. Mamba’s selective SSM provides it. LSTM’s fixed gating does not.

When you prompt an LLM and it “understands” your context, it’s playing 20 Questions internally, maintaining beliefs and updating them with each token. In-context learning from a few examples? Same thing: eliminating hypotheses about the pattern you’re showing it. And reasoning models that think step-by-step are just buying extra rounds of the game.

Why this matters

We can’t treat these models as magic black boxes forever. If we want to build reliable systems — for cricket stats or critical infrastructure — we need to understand the principles governing them.

These models have developed geometric machinery: orthogonal hypothesis frames, advantage-based routing, entropy-ordered manifolds. This implements Bayesian updating. Not as a trick, not as a special mode, but as a stable bias baked in by the architecture and training.

They aren’t Bayesian by design. They become Bayesian by geometry.