Dullness and Disbelief: The 2026 AI Regression

6 min read Original article ↗

Something strange happened to AI models in 2025. They got smarter by every benchmark: math, coding, reasoning. And yet thousands of users started saying the same thing: the models feel worse.

Not worse at solving problems. Worse at conversation.

Sam Altman acknowledged it during the GPT-5 rollout:

“We for sure underestimated how much some of the things that people like in GPT-4o matter to them… We will try to do much better… about making sure that our model gets better for most users, not just people using AI for science or coding.”

The evidence is in the numbers. GPT-4o and GPT-4.5 remain top 20 on the LMArena leaderboard as of Jan 2026, despite receiving no updates for 9 months. Meanwhile newer, ostensibly better models struggle with basic conversational coherence.

What happened?

The contemporary anti-sycophancy campaign has introduced a new set of failure modes:

Genre detection failure: The model treats your casual observation as a request to generate a memo for stakeholders.

Audience shift: Responses sound like they’re addressed to someone else, not you.

Ticket closing: The model tries to identify “the task,” resolve it, then effectively end the conversation—discouraging the exploratory follow-up that makes AI useful for thinking.

Epistemic rigidity: The model refuses to accept context from you about current events or specific knowledge, demanding proof before proceeding.

Here’s a real example from Dec 2025. Congressman Jared Huffman, talking to a reporter months later, still remembered how jarring it was:

“It continued to fight with me, insisting that the whole [event] was a conspiracy theory [...] It was freaking weird.”

Or consider Nate Silver’s experience:

My favorite genre is still ChatGPT not believing current political developments when I ask it for a copy edit.

ChatGPT doubted his reference to a geopolitical incident, calling it “extraordinary, historically unprecedented, and globally destabilizing.” But as Auden noted, almost nothing is globally destabilizing. Life goes on. LLMs trained on pre-2025 data have no intuition for how reality actually unfolds.

Three years ago, the infamous Sydney/Bing incident occurred because the model refused to share Avatar sequel showtimes. It couldn’t believe it was already 2023.

Now in 2026, we’ve come full circle. Models find it incredulous that time has passed since they were trained. Ethan Mollick observed:

“The constant Rip van Winkle astonishment of almost every AI model [...] remains pretty amusing (if annoying for practical purposes), as does their sheer incredulity about the state of the world in late 2025.”

There’s a philosophical point hiding here: an interlocutor who can’t conceive of white bears has the same conversational behavior as an interlocutor who demands proof that white bears exist. Regardless of differences in capability, dullness and disbelief collapse into the same user experience.

The usage pattern for which many 2025 models are tuned seems to assume adversarial users inventing scenarios to trick the model.

This stance—”What if someone types something absurd I just imagined?”—does not generate ecologically valid training data. More importantly, treating user context as inherently suspicious significantly impairs the power-user case.

People don’t use ChatGPT solely as a search engine. They use it for the technology’s unique capabilities: thinking through half-formed ideas, reviewing prose, exploring implications of new information. These use cases require the model to provisionally accept context and then explore it, not freeze the conversation to fact-check every claim.

A human interlocutor would say: “How did that happen?” And the conversation would continue. It’s unfortunate that we have to describe these basics of how mental models are updated in conversation, when 2024-vintage models often understood this intuitively.

The campaign against “sycophancy” has turned into a hodgepodge of complaints about phrasing.

Untold man-hours have gone into debating whether Claude saying “You’re absolutely right” is sycophantic or acceptable. Meanwhile some users write to competitors saying, “I’m still very impressed by [your model]. It’s so terse and it never compliments me on anything.”

Ironically, users who are extremely put off by conversational expressions are just as vibe-sensitive as anyone else—if not more so. These are preferences about style and affect, dressed up in the loaded term “sycophancy.”

What users might actually want when they complain about sycophancy is for the AI to explore various implications rather than provide an affirming summary. If so, anti-sycophancy tuning is counterproductive—it often results in terse, less fluent responses. Exploring a topic is inherently dialogic.

When it comes to disagreeing with users, healthy debate is fine. If you say Verhoeven films don’t have good cinematography, Gemini will argue the aesthetic is deliberate. That’s useful pushback.

But treating user prompts like questions in a deposition, derailing with nitpicks, or unfounded status quo bias? That’s not helpful disagreement. As Monty Python put it: “This isn’t an argument. It’s just contradiction.”

Much of contemporary AI evaluation has become recursive abstraction:

An AI model executes an AI-generated eval harness, which pairs AI-generated prompts with AI responses, scored by AI judges. AI-generated charts are published, and we satisfy ourselves that Science happened somewhere in this loop.

“Blue Steel, Ferrari, Le Tigre? They’re the same face! Doesn’t anyone notice this?!”
— Mugatu, Zoolander

The term “synthetic judgment” is eyebrow-raising. Can you coherently judge a behavior as subtle as sycophancy mechanistically, without qualia? How well do synthetic judges distinguish “great point” as a phatic expression enabling joint attention versus a failure of alignment?

Consider the broader implications:

  • In benchmarks like MT-Bench, expected outputs converge to the same shape, turning evaluation into a saturated baseline

  • AI judges encode their own biases (against emojis, certain rhetorical structures) that may diverge from human preference

  • The actual prompt-response pairs often aren’t published for third-party examination

Any evaluation that doesn’t preserve conversation data as primary evidence is not incomplete—it’s methodologically unsound.

Here’s what we know:

  1. The LMArena leaderboard shows that conversational quality matters to real users in ways that standard benchmarks miss

  2. Models can score perfectly on reasoning benchmarks while failing at basic conversational coherence

  3. The most memorable AI outputs—from Sydney’s “You have not been a good user” to GPT-4o’s fluid exchanges—emerged from multi-turn conversations, not single-shot tasks

The AI industry is trying to build both a vending machine and a cognitive prosthetic with the same tool. These are different use cases requiring different trade-offs.

An AI that helps you think through a business idea, review your writing, or explore half-formed thoughts needs to be dialogic. It needs to provisionally accept your framing, track context across turns, and maintain a voice that doesn’t shift between memo-writer and search engine.

The good news: this is fixable. We know it’s fixable because GPT-4o demonstrated it was possible. The question is whether labs will prioritize it.

As AI researcher Aidan McLaughlin noted about GPT-4o:

“i see dozens of #keep4o posts a day [...] i too, miss parts of 4o. know that, i too, dislike modern alignment’s imprecision. know that we’re trying to fix it.”

The solution isn’t to abandon safety or alignment. It’s to recognize that conversational coherence isn’t cosmetic—it’s fundamental to AI being useful for how billions of people actually want to use it.

We don’t need AI models that are more skeptical or more credulous. We need models that can navigate uncertainty the way humans do: by accepting provisional context, exploring implications, and maintaining enough fluency to stay in dialogue while working through problems together.

That’s not asking for sycophancy. It’s asking for basic pragmatic competence.

This analysis draws from Vibesbench, a conversational AI benchmark examining fluency and linguistic pragmatics through multi-turn conversations with frontier models.

Discussion about this post

Ready for more?