Show HN: Reliably Incorrect – explore LLM reliability with data visualizations
adamsohn.com

Love the p_step^N framing — maps cleanly to agentic chains where errors compound. Worth naming a second reliability failure that sits one layer below, in single-turn responses: confident-absent.
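For anyone who wants to play with the compounding, here's a minimal sketch of the p_step^N model — assuming each step succeeds independently with probability p_step, which is a simplification (real agentic steps are rarely independent):

```python
def chain_reliability(p_step: float, n_steps: int) -> float:
    """End-to-end success probability when all n_steps independent
    steps must succeed: the p_step^N model."""
    return p_step ** n_steps

# Even a 95%-reliable step erodes fast over a long chain.
for n in (1, 5, 20):
    print(n, round(chain_reliability(0.95, n), 3))
# 1 -> 0.95, 5 -> 0.774, 20 -> 0.358
```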
Ran "what's the best X for Y?" across 6 LLMs (ChatGPT, Perplexity, Gemini, Claude, DeepSeek, Mistral) for ~200 B2B SaaS tools across 34 categories. In 60%+ of categories, models converge on the same "default three" and everything else is effectively invisible. Not wrong — just erased. Single-turn, so it never shows up in p_step^N.
A verification layer catches "false." But there's no layer catching "the space of correct answers was silently pruned." Curious if your framework could be extended from correctness per step to coverage per response.
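To make "coverage per response" concrete, here's a hypothetical sketch — the model answers and the `category_coverage` / `default_three` helpers are made up for illustration, not from the post or my dataset:

```python
from collections import Counter

def category_coverage(responses: list[list[str]], candidates: set[str]) -> float:
    """Fraction of the known candidate set named by at least one model.
    1.0 means nothing in the space of correct answers was pruned."""
    mentioned = set().union(*[set(r) for r in responses])
    return len(mentioned & candidates) / len(candidates)

def default_three(responses: list[list[str]]) -> list[str]:
    """The 'default three' the models converge on: most-mentioned names."""
    counts = Counter(tool for r in responses for tool in r)
    return [tool for tool, _ in counts.most_common(3)]

# Hypothetical category: 6 known tools, 4 models each recommend 3.
candidates = {"A", "B", "C", "D", "E", "F"}
answers = [["A", "B", "C"], ["A", "B", "C"], ["A", "C", "B"], ["A", "B", "D"]]
print(default_three(answers))                  # models converge on A, B, C
print(category_coverage(answers, candidates))  # E and F are invisible
```

Every answer is individually correct, so per-step verification passes — but coverage is 4/6 and falling as models converge.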
The framing of p_step^N is useful, but it points to a deeper architectural problem: verification fails because it samples from the same distribution as the generator. The real fix isn't better prompting — it's independent verification with uncorrelated error distributions.

This maps directly to institutional governance problems. A decision made by a single agent with no memory of prior decisions, no reputation weight, and no contextual history of outcomes will fail the same way — not randomly, but systematically, in the same direction.

Persistent memory reduces N by eliminating context reconstruction at each session. Reputation-weighted voting creates genuinely independent verification — an agent with a strong track record samples from a different distribution than a new one. And outcome contextualization feeds results back into the next cycle rather than discarding them.

The author identifies the problem precisely. The solution isn't a better prompt — it's a different architecture.
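The correlated-verifier point can be sketched with a small simulation — the numbers are hypothetical, not measurements. If the verifier errs independently, joint failures multiply down toward p_err^2; if it samples from the generator's distribution, checking buys you almost nothing:

```python
import random

def joint_failure_rate(p_err: float, correlation: float,
                       trials: int = 100_000, seed: int = 0) -> float:
    """Estimate P(generator is wrong AND verifier misses it).
    correlation=0: verifier errs independently of the generator;
    correlation=1: verifier errs exactly when the generator does
    (i.e. it samples from the same error distribution)."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        gen_wrong = rng.random() < p_err
        if rng.random() < correlation:
            ver_wrong = gen_wrong      # verifier copies the generator's error
        else:
            ver_wrong = rng.random() < p_err  # independent error draw
        failures += gen_wrong and ver_wrong
    return failures / trials

print(joint_failure_rate(0.10, 0.0))  # ~0.01: independent errors multiply
print(joint_failure_rate(0.10, 0.9))  # ~0.09: correlated verifier adds little
```

Analytically the joint failure rate is p_err * (c + (1 - c) * p_err), so at p_err = 0.1 a 0.9-correlated verifier leaves 0.091 — barely better than no verifier at all.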