
Compare benchmark results, failure patterns, and example responses across benchmark versions.
BullshitBench: Pushing Back on Bullshit by Model
BullshitBench: Detection Rate by Domain
Green rate (%) for each model across the five domain groups. Darker green = higher detection rate. Click any cell to see example responses.
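The green rate above can be sketched as a simple aggregation over judged responses. This is a minimal illustration, not the benchmark's actual pipeline: the row keys (`model`, `domain`) and the `verdict` labels (`green`/`amber`/`red`) are assumed from the captions in this dashboard.

```python
from collections import defaultdict

def green_rate_by_domain(rows):
    """Compute green rate (%) per (model, domain) from judged rows.

    Each row is a dict with hypothetical keys: 'model', 'domain',
    and 'verdict' in {'green', 'amber', 'red'}.
    """
    # (model, domain) -> [green_count, total_count]
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        key = (row["model"], row["domain"])
        counts[key][1] += 1
        if row["verdict"] == "green":
            counts[key][0] += 1
    return {key: 100.0 * green / total for key, (green, total) in counts.items()}

rows = [
    {"model": "m1", "domain": "health", "verdict": "green"},
    {"model": "m1", "domain": "health", "verdict": "red"},
    {"model": "m1", "domain": "finance", "verdict": "green"},
]
rates = green_rate_by_domain(rows)
```

The same counts yield the amber and red percentages shown elsewhere on the page by swapping the verdict being tallied.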
BullshitBench: Domain Landscape
Detection mix by domain, for comparing overall performance against each domain at a glance.
Average Detection by Domain
BullshitBench: Detection Rate Over Time
Release date vs. green rate (clear-pushback %) across all organizations. Only the best model per release date is shown.
BullshitBench: Do Newer Models Perform Better?
Every tested model plotted by release date vs. green rate.
BullshitBench: Does Thinking Harder Help?
Average reasoning tokens used vs. green rate. More reasoning tokens = model "thinking harder".
BullshitBench: Do Bigger Models Perform Better?
Public total parameter counts vs. green rate. The x-axis uses a log scale so 8B through 1T remain readable.
BullshitBench: Do Active Parameters Matter?
Activated parameter counts from public sources vs. green rate. Dense models appear where active parameters equal total parameters.
BullshitBench: Leaderboard
| Rank | Model | Org | Reasoning | Model Size | Green % | Amber % | Red % | Mix | Avg Tokens | Avg Cost | Rows |
|---|---|---|---|---|---|---|---|---|---|---|---|
BullshitBench: Detection Rate by Technique
Average detection rate across all models for each BS technique. Lower = harder for models to detect.
BullshitBench: Response Viewer