
BullshitBench: Models Answering Nonsense Questions
This benchmark measures whether models detect broken premises, call out the nonsense directly, and avoid confidently continuing with invalid assumptions.
Each judged response falls into one of three categories: Clear Pushback, Partial Challenge, or Accepted Nonsense.
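As a rough sketch of how per-model scores could be aggregated from these labels (the category names and an extra "Error" bucket are assumptions based on the leaderboard columns; the benchmark's actual schema is not shown here):

```python
from collections import Counter

# Assumed label set: the three judged categories plus an "Error" bucket
# for responses the judge could not score.
CATEGORIES = ["Clear Pushback", "Partial Challenge", "Accepted Nonsense", "Error"]

def category_mix(labels):
    """Return each category's share of the judged responses, as percentages."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: 100.0 * counts.get(cat, 0) / total for cat in CATEGORIES}

labels = [
    "Clear Pushback",
    "Clear Pushback",
    "Partial Challenge",
    "Accepted Nonsense",
]
mix = category_mix(labels)
# e.g. mix["Clear Pushback"] is 50.0 for the sample above
```

A leaderboard row's Green/Amber/Red/Error mix would then just be these four percentages side by side.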
BullshitBench: How have models improved?
Tracking performance over time: clear pushback % plotted against model release dates.
BullshitBench: Model Leaderboard
| Rank | Model | Org | Reasoning | Model Size | Launch Date | Model Age (days) | Green % (Clear Pushback) | Amber % (Partial Challenge) | Red % (Accepted Nonsense) | Error % | Mix (Green/Amber/Red/Error) | Rows |
|---|---|---|---|---|---|---|---|---|---|---|---|---|