Show HN: Pencil Puzzle Bench – LLM Benchmark for Multi-Step Verifiable Reasoning
ppbench.com
I've been working on applying LLMs to long-context, verifiable problems over the past year, and today I'm releasing a benchmark of 62,000 pencil puzzles across 94 types (sudoku, norinori, slitherlink, etc.). The benchmark also supports intermediate checks / rule-break detection for every puzzle type at any step.
I tested 51 models against a subset (300 puzzles) in two modes: single-shot (output the full solution) and agentic (iterate with verifier feedback).
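The agentic mode can be pictured as a propose-verify loop: the model proposes a (partial) solution, a verifier reports rule violations, and the feedback goes back into the next turn. A minimal sketch, where `check_rules`, `agent_loop`, and the stand-in "model" `fill_one` are all illustrative names, not the benchmark's actual harness:

```python
def check_rules(grid, solution):
    """Toy verifier: list indices where the grid contradicts the solution."""
    return [i for i, v in enumerate(grid) if v is not None and v != solution[i]]

def agent_loop(propose, solution, max_turns=1200):
    """Run propose/verify turns until solved or the turn budget runs out."""
    grid = [None] * len(solution)
    for turn in range(1, max_turns + 1):
        grid = propose(grid, check_rules(grid, solution))  # feedback goes back in
        if None not in grid and not check_rules(grid, solution):
            return turn  # solved on this turn
    return None  # gave up

answer = [1, 2, 3, 4]

def fill_one(grid, feedback):
    """Stand-in 'model': fix flagged cells, then fill one blank per turn."""
    g = list(grid)
    for i in feedback:
        g[i] = answer[i]
    for i, v in enumerate(g):
        if v is None:
            g[i] = answer[i]
            break
    return g

print(agent_loop(fill_one, answer))  # → 4
```

A real run swaps `fill_one` for an LLM call and `check_rules` for the per-puzzle rule checker; the turn counts below are counts of iterations of exactly this kind of loop.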
Some results:
- Best model (GPT 5.2 @xhigh) solves 56%; roughly half the puzzles remain unsolved by any model.
- Agentic solves take 29 turns on average; the longest attempt ran ~1,200 turns over 14 hours.
- Cost per success varies wildly (cheapest: $0.00033 with Grok 4.1 Fast Reasoning; most expensive: $238.16 with Claude Sonnet 4.6 (1M context)).
- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability, up to the point of repeated infrastructure failures at @xhigh.
- Stark difference between US closed models (3 scoring >33%) and Chinese open models (top: 6%).
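The cost-per-success numbers above are, as I understand them, simple arithmetic: total API spend for a model divided by its count of solved puzzles. A one-liner sketch with illustrative inputs (not the benchmark's actual totals):

```python
def cost_per_success(total_cost_usd, successes):
    """Total spend divided by solved count; infinite if nothing was solved."""
    return total_cost_usd / successes if successes else float("inf")

# Illustrative: $0.099 total spend over 300 solved puzzles.
print(round(cost_per_success(0.099, 300), 5))  # → 0.00033
```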
I made the website to show off the dataset, let you play every puzzle, and even replay every AI agent's solve step-by-step (fun to watch how they reach solutions).
Also here's the paper: https://arxiv.org/abs/2603.02119
I didn't test human performance, but these puzzles seem pretty difficult. I'd be curious how the HN audience fares on them.