Mirage Reasoning: The Illusion of Visual Understanding
It's fascinating how heavily all the models rely on text; it's like they have severe (but not total) aphantasia:
"We hypothesize that this phenomenon emerges predominantly from a misassumption about how these systems are trained. Modern multimodal models are developed on web-scale corpora and are commonly built on top of pretrained large language models, which makes them extraordinarily strong at language modeling, retrieval of statistical regularities, and reconstruction of likely contexts from sparse cues.[48, 25, 24] During the multimodal training, the models are presented with the image, a textual question, and are expected to reconstruct the correct answer. Lacking access to an entire text corpora, a human would intuitively answer the question based on the image in that setup; but we should not infer that this would be the default approach for an AI model. Incentivized to generate the correct next tokens, models might learn to easily ignore the visual information and rely only on their vast prior knowledge, taking the shortest route to the correct answer.[36, 5, 48]"
The crazy thing is that, based on the text of the questions alone, a model was able to "guess" the answers:
"When fine-tuned on the public training set of this dataset with images removed (i.e., trained in mirage-mode), our 3-billion-parameter, text-only super-guesser outperformed all frontier multimodal models, including those exceeding hundreds of billions of parameters, on the held-out test benchmark (Figure 3c). It also surpassed human radiologists by more than 10% on average, relying entirely on hidden textual cues in the questions and the structural patterns of the benchmark. In addition, our super-guesser was able to create reasoning traces comparable to, and in some cases indistinguishable from, those of the ground-truth or those generated by frontier multi-modal AI models."
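The "mirage-mode" setup described above amounts to a simple data-preparation step: drop the image from each training record so the model can only exploit textual cues. A minimal sketch, with made-up field names and records (not from the paper):

```python
# Hypothetical "mirage-mode" data prep: strip the image from each
# (image, question, answer) record so a text-only model must rely on
# textual cues alone. Field names and sample data are illustrative.

def to_mirage_mode(records):
    """Drop the image field, keeping only question-answer pairs."""
    return [
        {"question": r["question"], "answer": r["answer"]}
        for r in records
    ]

sample = [
    {"image": "xray_001.png",
     "question": "Is there a pleural effusion?", "answer": "yes"},
    {"image": "xray_002.png",
     "question": "Any signs of cardiomegaly?", "answer": "no"},
]

text_only = to_mirage_mode(sample)
```

A text-only model fine-tuned on `text_only` never sees a pixel; any accuracy it achieves comes entirely from patterns in the questions and the benchmark structure.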
From the Conclusion section of the PDF[1]:
"Multimodal AI systems are increasingly deployed on the assumption that their benchmark performance reflects genuine visual understanding. Our results fundamentally challenge these assumptions. Across every model-benchmark pair tested, the accuracy that frontier models achieved without any access to images exceeded the additional accuracy they gained when images were provided. Moreover, a text-only 3-billion-parameter model, trained solely on question-answer pairs stripped of images, outperformed all frontier multimodal systems and human radiologists on a held-out chest radiology benchmark. Taken together, these results demonstrate that high benchmark accuracy does not reliably indicate visual understanding."
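The comparison in that conclusion is easy to state precisely: blind (image-free) accuracy is compared against the *additional* accuracy gained when images are provided. A small sketch with invented numbers (not the paper's results):

```python
# Illustrative version of the paper's comparison for one model-benchmark
# pair: does accuracy without images exceed the *additional* accuracy
# gained when images are provided? The numbers below are made up.

def image_gain(acc_with_images, acc_blind):
    """Additional accuracy attributable to actually seeing the images."""
    return acc_with_images - acc_blind

acc_blind = 0.62   # hypothetical accuracy with images removed
acc_with = 0.71    # hypothetical accuracy with images provided

gain = image_gain(acc_with, acc_blind)
blind_dominates = acc_blind > gain  # the pattern the paper reports
```

In this toy example the model gets 62 of its 71 accuracy points without looking at anything, so most of its score reflects text priors rather than visual understanding.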
Basically, the models are so good at extracting clues from the text of the questions, and extrapolating from them, that they proceed to answer _as if_ they had viewed the image. With confidence, of course.
This is just insane! Why hasn't this bubbled up to the top?
tl;dr: AI scores highly on a medical imaging test even though it hasn't seen the images.
Details: https://x.com/euanashley/status/2037993596956328108