Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

¹UC Berkeley ²Technical University of Munich, MCML ³University of Tübingen, Tübingen AI Center ⁴TTIC

The Platonic Representation Hypothesis, explained

The idea is appealing: as neural networks get bigger and train on more data, they all converge toward the same representation of reality, regardless of whether they process images, text, or audio. Huh et al. (2024) called this the Platonic Representation Hypothesis, drawing on Plato's idea that behind the world of appearances lies a single set of ideal forms.

If true, this would mean there's one "correct" way to represent the world, and all our models will eventually find it. But we examined the experimental evidence closely and found it to be surprisingly fragile. The alignment was measured on just 1,024 samples with one-to-one pairings. Under more realistic conditions, cross-modal representational alignment decreases dramatically.

Did you check the experimental evidence?

TL;DR: The evidence was measured on 1,024 samples with one-to-one pairings. That is not how most real data looks. When we test under realistic conditions, cross-modal representational alignment decreases significantly.

The experimental evidence comes from a highly constrained setting: only 1,024 samples, one-to-one image-text pairings, and a metric evaluated at a small scale. These conditions are far removed from how real-world multi-modal data looks: it is large-scale and inherently many-to-many.

When we relax the experimental setting (scaling to millions of samples, allowing multiple captions per image, testing with newer models), the alignment trends used to support the hypothesis are no longer observed. This suggests that rather than finding a shared representation, each modality may yield a representation specific to its own Umwelt.

You're not alone. We stress-tested the experimental evidence

TL;DR: We use the same metric and the same models as the original paper, and show that its experimental evidence breaks down at scale, with more realistic data, and with newer models. Others have found problems with the hypothesis from different angles.¹

Specifically, we take the original experimental setup and show four things: alignment degrades when scaling from 1,024 to millions of samples; it breaks down when the data is not bijective; it captures only coarse categorical agreement rather than fine-grained structure; and the predicted trend that stronger models align more fails for newer LLMs. This suggests that rather than finding a shared representation, modalities may yield representations specific to their own Umwelten.

Why Question Platonic Convergence?

Large language models keep solving tasks we thought required vision, like visual question answering, spatial reasoning, even robotic manipulation. This has led some to ask: do we even need pixels, or is language all you need?

The Platonic Representation Hypothesis says it doesn't matter. As neural networks scale, their representations converge toward a single shared representation, regardless of modality. If true, text alone might suffice, since it is a convenient source of data.

But there is a reason we visit art museums rather than just read descriptions of paintings in a catalogue. And when we looked closely at the experimental evidence for convergence in the Platonic Representation Hypothesis, we found that it came from a surprisingly limited setting. The evaluations were done on just 1,024 samples with bijective pairings. Under more realistic conditions, what looked like alignment turns out to be shallow: the vision and language models sometimes agree on broad categories but organize the finer details differently.

How is Alignment Measured?

The core metric used by Huh et al. to measure cross-modal alignment is mutual k-nearest neighbors (mutual k-NN). Given paired image-text data, find the k nearest neighbors for each sample in both the vision and the language embedding space. The alignment score is the fraction of those neighbors that the two spaces share, averaged over all samples.
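As a minimal sketch of this metric, the NumPy snippet below computes, for every sample, the overlap between its k nearest neighbors in the two embedding spaces and averages it. Cosine similarity, the function names, and the dense N×N similarity matrix are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np

def knn_indices(emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of every row (cosine similarity, self excluded)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn(img_emb: np.ndarray, txt_emb: np.ndarray, k: int) -> float:
    """Average fraction of the k neighbors that both spaces have in common.

    Row i of img_emb and row i of txt_emb must belong to the same image-text pair.
    """
    nn_img = knn_indices(img_emb, k)
    nn_txt = knn_indices(txt_emb, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_img, nn_txt)]
    return float(np.mean(overlap)) / k

# Usage on random stand-in embeddings (not real model features).
rng = np.random.default_rng(0)
img = rng.normal(size=(200, 16))
txt = rng.normal(size=(200, 32))   # the two spaces may have different widths
print(mutual_knn(img, img, k=10))  # identical spaces share every neighbor
print(mutual_knn(img, txt, k=10))  # unrelated spaces share only chance overlap
```

The dense similarity matrix keeps the sketch short but grows quadratically with the dataset, so at millions of samples it would have to be replaced by a batched or approximate nearest-neighbor search.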

Figure: Illustration of the mutual nearest neighbor metric (k=1) as the dataset grows from sparse (1,024 samples) to dense. Each dot represents an image-text pair in the dataset, shown in image embedding space (left, DINOv2) and text embedding space (right, OpenLlama3b). The blue dot is the query; the other colored dots are the nearest neighbors in each space. On a small dataset (1,024 samples), both models agree on the same nearest neighbor (green). As the dataset grows denser, each model finds a closer match in its own space, but the two models stop agreeing on which one.

Alignment Degrades at Scale

The original experimental evidence for the Platonic Representation Hypothesis used a dataset of just 1,024 samples. We systematically scaled the evaluation up to 15 million samples and found that alignment degrades.

13.5%: alignment on WIT-1024 (k=10)
0.8%: alignment on LAION-15M (k=10)
16×: drop in alignment when scaling up

Why does this happen? In a sparse dataset, both modalities tend to retrieve the same neighbors, not necessarily because they agree, but because the pool is too small to reveal their differences. As the dataset gets denser, each modality can find neighbors that are closer in its own space, and the overlap vanishes.
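The sparsity argument can be illustrated with a toy simulation. It is not the paper's experiment: the shared 2-D latent, the Gaussian noise level, the Euclidean distance, and all names are assumptions chosen only to show the mechanism. Two "modalities" see the same coarse latent position plus independent modality-specific noise. While samples are far apart, the noise rarely changes who the nearest neighbor is; once samples are packed closer than the noise scale, each view picks its own neighbor.

```python
import numpy as np

def nearest_neighbor(emb: np.ndarray) -> np.ndarray:
    """Index of the single nearest neighbor (Euclidean, self excluded) of each row."""
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)
    return d2.argmin(axis=1)

def toy_alignment(n: int, noise: float = 0.01, seed: int = 0) -> float:
    """Mutual 1-NN agreement between two noisy views of one shared 2-D latent."""
    rng = np.random.default_rng(seed)
    latent = rng.uniform(size=(n, 2))                  # shared coarse structure
    view_a = latent + noise * rng.normal(size=(n, 2))  # stand-in for the image space
    view_b = latent + noise * rng.normal(size=(n, 2))  # stand-in for the text space
    return float(np.mean(nearest_neighbor(view_a) == nearest_neighbor(view_b)))

# Agreement shrinks as the same unit square is sampled more densely.
for n in (64, 256, 1024, 2048):
    print(n, round(toy_alignment(n), 3))
```

The noise level stays fixed while the typical nearest-neighbor distance shrinks with the number of samples, so the agreement measured on the small set says little about how well the fine structure of the two views matches.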

Huh et al. themselves ask whether the obtained mutual kNN score is “indicative of strong alignment with the remaining gap being ‘noise’ or does it signify poor alignment with major differences left to explain?” When scaling from 1,024 to 15 million samples, the alignment score drops from 13.5% to just 0.81%, leaving very little room for a convergence narrative.

Figure: Alignment scores on WIT across dataset sizes. Scaling the dataset to 1M samples shows a large drop in mutual k-NN alignment for both k=1 and k=10.

Figure: Alignment scores on LAION-15M across dataset sizes. We see a similar degradation in alignment when scaling to 15M samples.

Measured Alignment Doesn't Hold Up for Real Data

The original experiments used one-to-one image-text pairings. But real data is many-to-many: a single image can be described in countless ways, and a single caption can match many different images. When we progressively add more captions per image or more images per caption, mutual k-NN alignment drops consistently.

Figure: Alignment when adding more images per caption, using CycleReward data. Mutual k-NN alignment decreases consistently for both k=1 and k=10.

Figure: Alignment when adding more captions per image. The pattern is the same: alignment drops as the one-to-one setting is relaxed.

Coarse Agreement, Not Fine-Grained Convergence

Both models might retrieve a "stone wall", but the vision model finds one with a similar texture, while the language model finds interlocking concrete blocks. Same category, different items.

Figure: Each dot represents an image-text pair, shown in both the image embedding space (left, DINOv2) and text embedding space (right, OpenLlama3b). As the number of images per class grows, both models find closer neighbors in their own space, but they no longer agree on the same item.

To quantify this phenomenon, we turn to ImageNet, where every image has a class label. We decompose the mutual k-NN metric into three questions:

Does each model individually retrieve a correct-class neighbor? Mostly, yes! (red and blue lines)
Do both models retrieve a correct-class neighbor? At increasing rates. (orange line)
Do they agree on the exact same item? Mostly, no! (dark green line)
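The three questions above can be written down as a small sketch for k=1. The function names, the cosine-similarity choice, and the six-item toy data are hypothetical and only illustrate how category-level agreement and item-level agreement come apart; they are not the paper's evaluation code.

```python
import numpy as np

def top1_neighbor(emb: np.ndarray) -> np.ndarray:
    """Index of the nearest neighbor (cosine similarity, self excluded) of each row."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)
    return sim.argmax(axis=1)

def decompose_alignment(img_emb, txt_emb, labels) -> dict:
    """Split mutual 1-NN into per-modality class accuracy and exact-item agreement."""
    labels = np.asarray(labels)
    nn_img = top1_neighbor(np.asarray(img_emb, dtype=float))
    nn_txt = top1_neighbor(np.asarray(txt_emb, dtype=float))
    img_ok = labels[nn_img] == labels  # image model retrieves the right class
    txt_ok = labels[nn_txt] == labels  # text model retrieves the right class
    return {
        "image_correct_class": float(img_ok.mean()),
        "text_correct_class": float(txt_ok.mean()),
        "both_correct_class": float((img_ok & txt_ok).mean()),
        "same_item": float((nn_img == nn_txt).mean()),  # mutual 1-NN alignment
    }

def on_circle(degrees):
    """Toy 2-D embeddings placed on the unit circle at the given angles."""
    rad = np.deg2rad(degrees)
    return np.stack([np.cos(rad), np.sin(rad)], axis=1)

# Two classes with three items each. Both toy spaces keep the classes apart,
# but they order the items inside each class differently.
labels = [0, 0, 0, 1, 1, 1]
img = on_circle([0, 10, 30, 90, 100, 120])
txt = on_circle([0, 30, 10, 90, 120, 100])
print(decompose_alignment(img, txt, labels))
```

In this toy case every retrieved neighbor has the correct class in both spaces, yet the two spaces never pick the same item, which is the pattern described above in its most extreme form.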

Figure: Per-modality retrieval accuracy and cross-modal mutual k-NN alignment (k=1) as images per class increase. As the dataset densifies, both DINOv2 and OpenLlama3b individually retrieve correct-class neighbors at rising rates, but cross-modal alignment remains flat: the models agree on the category but not on the specific instance.

The limited alignment we do observe is just coarse categorical agreement. Both models know what a stone wall is, but they have fundamentally different ideas about which walls are most similar to each other.

Stronger Unimodal Models Do Not Seem More Aligned

One argument for convergence was that stronger language models align better with vision models. We tested this across 55 language models and multiple benchmarks.

The trend (shown as dashed lines) from the models used in the original Platonic Representation Hypothesis experiments does not seem to hold for recent models (diamonds that are off the lines). Indeed, newer models appear to be specializing in their own modality.

Figure (panels: GSM8K, ARC, MMLU, LogiQA2): Alignment vs. language model capability across four benchmarks. Dashed lines show the original trend from the Platonic Representation Hypothesis; diamonds represent recent models that fall off the predicted scaling curve. Stronger LLMs don't appear to align better with vision.
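One way to make "falls off the trend" concrete is to fit a trend on the original models only and look at how far newer models land from it. The sketch below assumes a linear trend, which the page does not state, and every number in it is a made-up placeholder rather than a measurement.

```python
import numpy as np

def fit_trend(scores, alignment):
    """Least-squares line through (benchmark score, alignment) points."""
    slope, intercept = np.polyfit(scores, alignment, deg=1)
    return slope, intercept

def residuals(scores, alignment, slope, intercept):
    """Observed alignment minus the alignment predicted by the fitted line."""
    return np.asarray(alignment) - (slope * np.asarray(scores) + intercept)

# Placeholder values for illustration only, NOT results from the paper.
old_scores = np.linspace(0.2, 0.6, 5)           # hypothetical benchmark accuracies
old_alignment = 0.10 + 0.20 * old_scores        # hypothetical points lying on a line
new_scores = np.array([0.7, 0.8])               # hypothetical stronger, newer models
new_alignment = np.array([0.15, 0.14])          # hypothetical alignment scores

slope, intercept = fit_trend(old_scores, old_alignment)
print(residuals(new_scores, new_alignment, slope, intercept))
```

Negative residuals mean a model is less aligned with vision than the trend fitted on the earlier models would predict.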

Back in the Cave: Models in Their Own Umwelten

Nearly a century ago, biologist Jakob von Uexküll argued that every organism inhabits its own perceptual environment, or Umwelt: a tick lives in a world of thermal gradients, a bat in a world of echoes. Different Umwelten may overlap only slightly.² The same, we believe, might hold for our models: each constructs its own representational structure, determined by its modality and training data, rather than converging toward a shared model of reality. Though it is still early days, we suspect future evidence will favor von Uexküll over Plato.

BibTeX

@article{koepke2026cave,
  title   = {Back into Plato's Cave: Examining the Evidence for Cross-modal Representational Convergence},
  author  = {Koepke, A. Sophia and Zverev, Daniil and Ginosar, Shiry and Efros, Alexei A.},
  journal = {arXiv preprint arXiv:2604.18572},
  year    = {2026}
}

🙏 This work was in part supported by the BMFTR (FKZ: 16IS24060), the DFG (SFB 1233, project number: 276693517), NSF IIS-2403305, and ONR MURI. This research utilized compute resources at the Tübingen Machine Learning Cloud. The authors thank all Efros group members for valuable discussions that shaped this work, and particularly Tyler Bonnen and Amil Dravid for proofreading the draft. Lastly, we thank Phillip Isola for feedback and for sparking this conversation by inviting us out of the cave, which ended up motivating us to go back in to examine the shadows in a new light.