Show HN: Pencil Puzzle Bench – LLM Benchmark for Multi-Step Verifiable Reasoning
ppbench.com
I've been working on applying LLMs to long-context, verifiable problems over the past year, and today I'm releasing a benchmark of 62,000 pencil puzzles across 94 types (sudoku, norinori, slitherlink, etc.). The benchmark also supports intermediate checks / rule-break detection for every puzzle type at any step.
I tested 51 models against a subset (300 puzzles) in two modes: single-shot (output the full solution) and agentic (iterate with verifier feedback).
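The agentic mode can be pictured as a propose-verify loop: the model proposes a (partial) solution, a verifier reports rule violations, and the feedback goes back into the next turn. A minimal sketch, where `check_rules`, `agent_loop`, and the stand-in "model" `fill_one` are all illustrative names, not the benchmark's actual harness:

```python
def check_rules(grid, solution):
    """Toy verifier: list indices where the grid contradicts the solution."""
    return [i for i, v in enumerate(grid) if v is not None and v != solution[i]]

def agent_loop(propose, solution, max_turns=1200):
    """Run propose/verify turns until solved or the turn budget runs out."""
    grid = [None] * len(solution)
    for turn in range(1, max_turns + 1):
        grid = propose(grid, check_rules(grid, solution))  # feedback goes back in
        if None not in grid and not check_rules(grid, solution):
            return turn  # solved on this turn
    return None  # gave up

answer = [1, 2, 3, 4]

def fill_one(grid, feedback):
    """Stand-in 'model': fix flagged cells, then fill one blank per turn."""
    g = list(grid)
    for i in feedback:
        g[i] = answer[i]
    for i, v in enumerate(g):
        if v is None:
            g[i] = answer[i]
            break
    return g

print(agent_loop(fill_one, answer))  # → 4
```

A real run swaps `fill_one` for an LLM call and `check_rules` for the per-puzzle rule checker; the turn counts below are counts of iterations of exactly this kind of loop.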
Some results:
- Best model (GPT 5.2 @xhigh) solves 56%; roughly half the puzzles remain unsolved by any model.
- Agentic solves take 29 turns on average; the longest attempt ran ~1,200 turns over 14 hours.
- Cost per success varies wildly (cheapest: $0.00033 with Grok 4.1 Fast Reasoning; most expensive: $238.16 with Claude Sonnet 4.6 (1M context)).
- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability, up to the point of repeated infrastructure failures at @xhigh.
- Stark difference between US closed models (3 scoring >33%) and Chinese open models (top: 6%).
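The cost-per-success numbers above are, as I understand them, simple arithmetic: total API spend for a model divided by its count of solved puzzles. A one-liner sketch with illustrative inputs (not the benchmark's actual totals):

```python
def cost_per_success(total_cost_usd, successes):
    """Total spend divided by solved count; infinite if nothing was solved."""
    return total_cost_usd / successes if successes else float("inf")

# Illustrative: $0.099 total spend over 300 solved puzzles.
print(round(cost_per_success(0.099, 300), 5))  # → 0.00033
```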
I made the website to show off the dataset, let you play every puzzle, and even replay every AI agent's solve step-by-step (fun to watch how they reach solutions).
Also here's the paper: https://arxiv.org/abs/2603.02119
I didn't test human performance, but these puzzles seem pretty difficult. I'd be curious how the HN audience fares on them.