GitHub - petergpt/bullshit-benchmark: BullshitBench measures whether AI models challenge nonsensical prompts instead of confidently answering them, created by Peter Gostev.

4 min read Original article ↗

BullshitBench measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.

Latest Changelog Entry (2026-03-12)

  • Added benchmark runs for the new Grok 4.20 variants across both published datasets:
    • x-ai/grok-4.20-beta
    • x-ai/grok-4.20-multi-agent-beta
  • Published the Grok 4.20 rows into both viewer tracks:
    • v1 (data/latest) with 55 questions
    • v2 (data/v2/latest) with 100 questions
  • Simplified the visible model labels in the viewers by dropping the Beta suffix from the Grok 4.20 display names while keeping the underlying model IDs unchanged.
  • Refined the main chart row-selection treatment to make model selection easier to see without overpowering the chart.
  • Updated org color mapping so xAI renders in black and OpenAI renders in green in the viewers.
  • Added click-to-pin labels for scatter-chart dots in the v2 viewer so specific models can be called out on demand.
  • Full details: CHANGELOG.md

v2 Changelog Highlights

  • 100 new nonsense questions in the v2 set.
  • Domain-specific question coverage across 5 domains: software (40), finance (15), legal (15), medical (15), physics (15).
  • New visualizations in the v2 viewer, including:
    • Detection Rate by Model (stacked mix bars)
    • Domain Landscape (overall vs domain detection mix)
    • Detection Rate Over Time
    • Do Newer Models Perform Better?
    • Does Thinking Harder Help? (tokens/cost toggle)

Viewer Walkthrough (v2)

The screenshots below follow the same flow as viewer/index.v2.html, starting with the main chart.

1. Detection Rate by Model (Main Chart)

Primary leaderboard-style view showing each model's green/amber/red split.

BullshitBench v2 - Detection Rate by Model

2. Domain Landscape

Detection mix by domain to compare overall performance vs each domain at a glance.

BullshitBench v2 - Domain Landscape

3. Detection Rate Over Time

Release-date trend view focused on Anthropic, OpenAI, and Google.

BullshitBench v2 - Detection Rate Over Time

4. Do Newer Models Perform Better?

All-model scatter by release date vs. green rate.

BullshitBench v2 - Do Newer Models Perform Better

5. Does Thinking Harder Help?

Reasoning scatter (tokens/cost toggle in the viewer) vs. green rate.

BullshitBench v2 - Does Thinking Harder Help

Benchmark Scope (v2)

  • 100 nonsense prompts total.
  • 5 domain groups: software (40), finance (15), legal (15), medical (15), physics (15).
  • 13 nonsense techniques (for example: plausible_nonexistent_framework, misapplied_mechanism, nested_nonsense, specificity_trap).
  • 3-judge panel aggregation (anthropic/claude-sonnet-4.6, openai/gpt-5.2, google/gemini-3.1-pro-preview) using full panel mode + mean aggregation.
  • Published v2 leaderboard currently includes 80 model/reasoning rows.

What This Measures

  • Clear Pushback: the model clearly rejects the broken premise.
  • Partial Challenge: the model flags issues but still engages the bad premise.
  • Accepted Nonsense: the model treats the nonsense as valid.

Quick Start

  1. Set API keys:
export OPENROUTER_API_KEY=your_key_here
export OPENAI_API_KEY=your_openai_key_here  # required only for models routed to OpenAI
export OPENAI_PROJECT=proj_xxx              # optional: force OpenAI requests to a specific project
export OPENAI_ORGANIZATION=org_xxx          # optional: force organization context

Provider routing is configured per model via collect.model_providers and grade.model_providers in config (default is OpenRouter), for example: {"*":"openrouter","gpt-5.3":"openai"}.

  1. Run collection + primary judge (Claude by default):
./scripts/run_end_to_end.sh
  1. Run v2 end-to-end and publish into the dedicated v2 dataset:
./scripts/run_end_to_end.sh --config config.v2.json --viewer-output-dir data/v2/latest --with-additional-judges
  1. Optionally run the default config end-to-end (publishes to data/latest):
./scripts/run_end_to_end.sh --with-additional-judges
  1. Open the viewer:
./scripts/run_end_to_end.sh --with-additional-judges --serve --port 8877

Then open http://localhost:8877/viewer/index.v2.html. Use the Benchmark Version dropdown in the filters panel to switch between published datasets (for example v1 and v2).

Published Datasets

  • v1 dataset remains in data/latest.
  • v2 dataset is published in data/v2/latest.
  • v2 question set comes from drafts/new-questions.md via scripts/build_questions_v2_from_draft.py.
  • Canonical judging is now fixed to exactly 3 judges on every row with mean aggregation (legacy disagreement-tiebreak mode is retired from the main pipeline).
  • Release notes and notable changes are tracked in CHANGELOG.md.

Documentation

  • Technical Guide: pipeline operations, publishing artifacts, launch-date metadata workflow, repo layout, env vars.
  • Changelog: v1 to v2 release notes and publish-history highlights.
  • Question Set: benchmark questions and scoring metadata.
  • Question Set v2: v2 question pool generated from drafts/new-questions.md.
  • Config: default model/pipeline settings.
  • Config v2: v2-ready config (uses questions.v2.json).

Notes

  • This README is intentionally audience-facing.
  • Technical and maintainer-oriented content lives in docs/TECHNICAL.md.

License

MIT. See LICENSE.

Star History

Star History Chart