Opus 4.7 isn't dumb, it's just lazy

Anthropic’s release of Claude Opus 4.7 this past week has been polarizing, to say the least. Anthropic claims their latest model is better at long-context tasks, follows instructions more closely, and scores higher on the benchmarks. Some folks on Bluesky report substantial performance jumps on their internal evals. Yet a cursory glance at r/ClaudeCode turns up users frustrated with context collapse, hallucinations, and runaway token usage. One user even asked Opus 4.7 itself why everyone hates it.

My own experience has been similar. The initial excitement at 4.7’s fast responses festered into disappointment once I realized they were only fast because 4.7 wasn’t bothering to make any tool calls. That gut feel turned into puzzlement when I saw Simon Willison’s post about Qwen 3.6 beating Opus 4.7 at the Pelican on a Bicycle benchmark.

So naturally, I ran some experiments. I didn’t have Qwen 3.6 set up locally, so I pulled it via OpenRouter for an apples-to-apples comparison, and baselined against Opus 4.6 to see whether 4.7 is actually a step up.

Disclaimer: these are one-shot examples and might not generalize at all outside of the niche SVG generation case.

Base Prompt

First up, the classic Pelican prompt:

Generate an SVG of a pelican riding a bicycle.

Qwen 3.6 did a great job, with Opus 4.6 second and 4.7 dead last. The picture shifts once you look at the completion, reasoning, and cost numbers: 4.7 barely reasoned at all (20 tokens!) and produced far fewer completion tokens than the other two. That’s even more impressive once you account for 4.7’s new, more token-hungry tokenizer. Net: a 4.7 call runs at 1/6 the price of the equivalent 4.6 call.

Just Think Harder

Next attempt: my lazy version of ‘let’s think step by step’ for SVG generation.

Generate an SVG of a pelican riding a bicycle, do your best, think hard and plan before generating

Gotta say, Qwen’s work was impressive. Look at the grass next to the road! Even 4.6 added clouds. 4.7? Still can’t close the bike frame. That said, Opus 4.7 only used 95 tokens to ‘think hard’ (a 40x jump from the base prompt), compared to ~500 for the other two. Its output is still the shortest, and still 1/4 the price of Opus 4.6.

AI, Write My Prompt

Maybe ‘it’s a skill issue’ and 4.7 just needed moar prompts. So I asked 4.7 to write a detailed prompt for me:

Generate an SVG of a pelican riding a bicycle. Goal: the most lifelike pelican and the most geometrically correct bicycle you can produce.

Plan thoroughly before drawing. Work through these explicitly in your reasoning, not as visible commentary:

Pelican anatomy — silhouette, proportions, the gular pouch hanging from the lower mandible, the S-curve of the neck, wing folding, leg articulation (a bird’s visible “knee” is actually the ankle and bends backward), and the angle at which the upper beak attaches to the skull. Decide concrete coordinates for the head, eye, beak tip, neck base, body center, wing root, leg joints, and feet.

Bicycle geometry — a real diamond frame is fully triangulated. Pick wheel centres first, then derive the bottom bracket, seat tube, down tube, top tube, head tube, fork, chain stays, and seat stays. Every tube must meet at a shared joint with another tube — no floating endpoints. Verify the rear triangle and front triangle each close.

Composition and contact — the pelican must actually sit on the seat with weight on it, legs must reach the pedals at a plausible crank angle (one pedal up, one pedal down), and wings should interact with the handlebars (gripping, resting on, or steering). No floating limbs. No detached beaks. No wheels off the ground.

Self-check pass — before emitting the SVG, walk back through your coordinates and confirm each pelican part connects to the next, the bicycle frame triangulates, and every contact point lines up. Fix anything that doesn’t.

Then output a single complete SVG inside one svg code fence. No commentary after.

This time all three produced reasonable outputs, with 4.7’s handlebar slightly off and Qwen’s frame geometry slightly unsound — chalk it up to the RNG gods. Worth noting: the sun, clouds, and grass from the earlier runs have disappeared. All three models now generate the pelican and the bike, nothing more. I’m still giving 4.6 and Qwen a slight edge on pelican detail.

Cost-wise, I’m a bit shocked at how efficient 4.7 is on both output and reasoning tokens. It used 1/10 the reasoning tokens of Qwen 3.6 and 1/30 that of Opus 4.6, with only 1/9 the completion tokens of 4.6, at 1/10 the price. On a tokens-per-pelican basis, 4.7 is comparable to Qwen 3.6 and clearly better value than 4.6.

That’s also when I noticed that Opus 4.7 consistently produces smaller SVGs than the other two. Time to fix that.

Still Wordy, But 1200x1200

Same prompt as version 3, plus:

Required canvas: viewBox="0 0 1200 1200" with width="1200" and height="1200". Compose for a square frame and use the full 1200×1200 area.

Looking at the resized outputs, I like Qwen’s pelican the most, but 4.7’s bike frame edges out the other two — despite 4.7 still producing the smallest SVG of the three.

Looking at token usage, this is where Larry Wall’s old ‘laziness is a programmer’s virtue’ line earns it. Opus 4.7 costs 1/7 as much as 4.6 and uses about 1/10 the reasoning tokens of Qwen.

The Scoreboard

Prompt	Qwen 3.6	Opus 4.6	Opus 4.7
1 · minimal	$0.00797.0 KB	$0.24155.9 KB	$0.04423.0 KB
2 · think hard	$0.011810.8 KB	$0.26577.5 KB	$0.06023.9 KB
3 · structured	$0.02129.5 KB	$1.45859.6 KB	$0.18976.2 KB
4 · structured, 1200×1200	$0.024413.5 KB	$2.059411.5 KB	$0.35079.4 KB
Totals	$0.065340.7 KB	$4.025034.5 KB	$0.644722.6 KB

Each cell: cost above, output SVG size below. Reasoning effort high, max_tokens 100k via OpenRouter.

So what does all this mean? To quote Simon Willison:

…there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models… Today, even that loose connection to utility has been broken.

Opus 4.7 is rated by other frontier models as clearly the best in my AI-on-AI Arena, but to get the most out of it, you need to write detailed prompts that tell it exactly what you want. What does that mean for vibe coding? Maybe start by asking 4.7 to enhance your prompt before it writes any code. Or maybe laziness really is a virtue — now shared by programmers and LLMs alike.

As for me, my daily driver will continue to be the latest Anthropic flagship model at the highest reasoning setting my subscription will pay for. But I’m adopting the following new rules for Opus 4.7:

Closer reading for hallucinations. In my experience, 4.7 makes more obvious hallucinations than 4.6 (possibly due to laziness).
Care less about the token usage. It does a pretty good job of conserving tokens even when I don’t always want it to.
Be more explicit in my prompts. And check with Opus more often on refining prompts for missing details.
Use it for tasks that require reasoning. When its full capacity is used it can do some pretty amazing things (see the AI-on-AI Arena for examples).