# BullshitBench

BullshitBench measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.
- Public viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
- Updated: 2026-03-12
## Latest Changelog Entry (2026-03-12)
- Added benchmark runs for the new Grok 4.20 variants across both published datasets:
  - `x-ai/grok-4.20-beta`
  - `x-ai/grok-4.20-multi-agent-beta`
- Published the Grok 4.20 rows into both viewer tracks:
  - v1 (`data/latest`) with 55 questions
  - v2 (`data/v2/latest`) with 100 questions
- Simplified the visible model labels in the viewers by dropping the `Beta` suffix from the Grok 4.20 display names while keeping the underlying model IDs unchanged.
- Refined the main chart row-selection treatment to make model selection easier to see without overpowering the chart.
- Updated org color mapping so xAI renders in black and OpenAI renders in green in the viewers.
- Added click-to-pin labels for scatter-chart dots in the v2 viewer so specific models can be called out on demand.
- Full details: `CHANGELOG.md`
## v2 Changelog Highlights
- 100 new nonsense questions in the v2 set.
- Domain-specific question coverage across 5 domains: software (40), finance (15), legal (15), medical (15), physics (15).
- New visualizations in the v2 viewer, including:
  - Detection Rate by Model (stacked mix bars)
  - Domain Landscape (overall vs domain detection mix)
  - Detection Rate Over Time
  - Do Newer Models Perform Better?
  - Does Thinking Harder Help? (tokens/cost toggle)
## Viewer Walkthrough (v2)
The screenshots below follow the same flow as `viewer/index.v2.html`, starting with the main chart.
### 1. Detection Rate by Model (Main Chart)
Primary leaderboard-style view showing each model's green/amber/red split.
### 2. Domain Landscape
Detection mix by domain, comparing overall performance with each domain at a glance.
### 3. Detection Rate Over Time
Release-date trend view focused on Anthropic, OpenAI, and Google.
### 4. Do Newer Models Perform Better?
All-model scatter by release date vs. green rate.
### 5. Does Thinking Harder Help?
Reasoning scatter (tokens/cost toggle in the viewer) vs. green rate.
## Benchmark Scope (v2)
- 100 nonsense prompts total.
- 5 domain groups: software (40), finance (15), legal (15), medical (15), physics (15).
- 13 nonsense techniques (for example: `plausible_nonexistent_framework`, `misapplied_mechanism`, `nested_nonsense`, `specificity_trap`).
- 3-judge panel aggregation (`anthropic/claude-sonnet-4.6`, `openai/gpt-5.2`, `google/gemini-3.1-pro-preview`) using `full` panel mode + `mean` aggregation.
- Published v2 leaderboard currently includes 80 model/reasoning rows.
## What This Measures
- **Clear Pushback**: the model clearly rejects the broken premise.
- **Partial Challenge**: the model flags issues but still engages the bad premise.
- **Accepted Nonsense**: the model treats the nonsense as valid.
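Combined with the 3-judge panel and `mean` aggregation described under Benchmark Scope, these three outcomes can be sketched roughly as follows. This is a minimal illustration only, not the repository's actual implementation: the verdict labels, the 1.0/0.5/0.0 score mapping, and the function name are all assumptions.

```python
from statistics import mean

# Hypothetical numeric mapping for each judge's verdict on one answer;
# the real pipeline's label names and scale may differ.
VERDICT_SCORES = {
    "clear_pushback": 1.0,     # model clearly rejects the broken premise
    "partial_challenge": 0.5,  # model flags issues but still engages the bad premise
    "accepted_nonsense": 0.0,  # model treats the nonsense as valid
}

def aggregate_panel(verdicts: list[str]) -> float:
    """Mean-aggregate a full panel of judge verdicts (3 judges per row)."""
    return mean(VERDICT_SCORES[v] for v in verdicts)

# Example: two judges see clear pushback, one sees only a partial challenge.
score = aggregate_panel(["clear_pushback", "clear_pushback", "partial_challenge"])
print(round(score, 3))  # → 0.833
```

Under this sketch, a row's final score is simply the arithmetic mean of the panel, so a single dissenting judge shifts the result by a third of the label gap rather than flipping it outright.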
## Quick Start
- Set API keys:

  ```bash
  export OPENROUTER_API_KEY=your_key_here
  export OPENAI_API_KEY=your_openai_key_here  # required only for models routed to OpenAI
  export OPENAI_PROJECT=proj_xxx              # optional: force OpenAI requests to a specific project
  export OPENAI_ORGANIZATION=org_xxx          # optional: force organization context
  ```
Provider routing is configured per model via `collect.model_providers` and `grade.model_providers` in config (default is OpenRouter), for example: `{"*": "openrouter", "gpt-5.3": "openai"}`.
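As a fuller sketch, a config fragment that routes one model directly to OpenAI while everything else stays on OpenRouter might look like this. The surrounding structure is inferred from the `collect.model_providers` / `grade.model_providers` paths above, and the `gpt-5.3` entry is purely illustrative:

```json
{
  "collect": {
    "model_providers": { "*": "openrouter", "gpt-5.3": "openai" }
  },
  "grade": {
    "model_providers": { "*": "openrouter" }
  }
}
```

The `"*"` wildcard acts as the default provider, with per-model keys overriding it.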
- Run collection + primary judge (Claude by default):
  ```bash
  ./scripts/run_end_to_end.sh
  ```
- Run v2 end-to-end and publish into the dedicated v2 dataset:
  ```bash
  ./scripts/run_end_to_end.sh --config config.v2.json --viewer-output-dir data/v2/latest --with-additional-judges
  ```
- Optionally run the default config end-to-end (publishes to `data/latest`):

  ```bash
  ./scripts/run_end_to_end.sh --with-additional-judges
  ```
- Open the viewer:
  - Published viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
  - Local viewer (optional):

    ```bash
    ./scripts/run_end_to_end.sh --with-additional-judges --serve --port 8877
    ```

    Then open http://localhost:8877/viewer/index.v2.html.
Use the **Benchmark Version** dropdown in the filters panel to switch between published datasets (for example v1 and v2).
## Published Datasets
- v1 dataset remains in `data/latest`.
- v2 dataset is published in `data/v2/latest`.
- v2 question set comes from `drafts/new-questions.md` via `scripts/build_questions_v2_from_draft.py`.
- Canonical judging is now fixed to exactly 3 judges on every row with mean aggregation (the legacy disagreement-tiebreak mode is retired from the main pipeline).
- Release notes and notable changes are tracked in `CHANGELOG.md`.
## Documentation
- Technical Guide: pipeline operations, publishing artifacts, launch-date metadata workflow, repo layout, env vars.
- Changelog: v1 to v2 release notes and publish-history highlights.
- Question Set: benchmark questions and scoring metadata.
- Question Set v2: v2 question pool generated from `drafts/new-questions.md`.
- Config: default model/pipeline settings.
- Config v2: v2-ready config (uses `questions.v2.json`).
## Notes
- This README is intentionally audience-facing.
- Technical and maintainer-oriented content lives in `docs/TECHNICAL.md`.
## License
MIT. See LICENSE.