I appreciate Anthropic's AI safety research but fair is fair and refusals count as failures: Opus 4.7 (high) scores 41.0 on the Extended NYT Connections Benchmark. Opus 4.7 (no reasoning) scores 15.3. GLM-5.1 scores 84.3. Qwen3.5-27B scores 60.7. Step 3.5 Flash scores 39.9. https://t.co/9oieu3Agsq

So I was digging into my API-calling code (manually, like a caveman, because Codex wasn't finding any issues), thinking something was wrong, but Opus 4.7 actually blocks almost all NYT Connections requests. WTF!

2:44 AM · Apr 17, 2026 · 13.4K Views