Token usage is effectively your AI bill. We refactored a codebase with the Strategy and Factory patterns, ran the same feature-addition prompt before and after, and compared total tokens. Result: after the structural refactor, the same instruction consumed significantly fewer tokens on average.
Introduction
While continuing “vibe coding” with AI, I discovered that prompts aimed at similar goals produce different quality, accuracy, error rates, and time-to-output depending on the structure of the source code fed as input.
To demonstrate this, I’m sharing the tests I ran, the tokens consumed per run, and the one-sentence “magic” prompt that created this difference.
AI is Currently Hitting the Wall of Code Structure
Demo
Source Code Used
Tag Definitions
- before: The initial source code, written 100% with gpt-5. 27,160 characters in total (excluding whitespace).
- after: The refactored, structurally improved source code produced from before by a single prompt, also written 100% with gpt-5. 24,426 characters in total (excluding whitespace).
- mod_b(1–5): gpt-5. Results produced from the before tag via the feature-addition prompt (5 runs in total).
- mod_a(1–5): gpt-5. Results produced from the after tag via the same feature-addition prompt (5 runs in total).
- mod_b(1–5)_sonnet: claude-4-sonnet. Results produced from the before tag via the feature-addition prompt (5 runs in total).
- mod_a(1–5)_sonnet: claude-4-sonnet. Results produced from the after tag via the same feature-addition prompt (5 runs in total).
How to Reproduce
Anyone can reproduce this experiment.
The prompt used to transform the structure from before to after is:
- Refactor the falling objects’ behaviors to use the Strategy pattern and their creation to use the Factory pattern, split the implementation into separate files, and update .cursorrules to reflect the new file structure.
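For orientation, here is a condensed sketch of the kind of structure that prompt asks for; in the actual repository each piece would live in its own file. Every name here (FallStrategy, BouncingFall, FallingObjectFactory, and so on) is hypothetical and chosen for illustration, not taken from the real codebase.

```typescript
// fallStrategy.ts — hypothetical Strategy interface: one falling behavior per implementation.
export interface FallStrategy {
  // Returns the updated vertical velocity for one simulation step.
  updateVelocity(velocityY: number, gravity: number, dt: number): number;
  // Whether the object should be removed when it touches the ground.
  destroyOnGround(): boolean;
}

// bouncingFall.ts — e.g. the behavior behind superBouncingBall.
export class BouncingFall implements FallStrategy {
  updateVelocity(velocityY: number, gravity: number, dt: number): number {
    return velocityY + gravity * dt; // plain gravity; the bounce itself happens on impact
  }
  destroyOnGround(): boolean {
    return false; // survives ground contact and bounces back up
  }
}

// fallingObjectFactory.ts — hypothetical Factory: maps a type name to a configured object.
export interface FallingObject {
  strategy: FallStrategy;
  size: number;
  color: string;
}

export class FallingObjectFactory {
  private registry = new Map<string, () => FallingObject>();

  register(type: string, create: () => FallingObject): void {
    this.registry.set(type, create);
  }

  create(type: string): FallingObject {
    const create = this.registry.get(type);
    if (!create) throw new Error(`Unknown falling object type: ${type}`);
    return create();
  }
}
```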
The feature-addition prompt (run repeatedly from either the before or after tag with the same text) is:
- Add a new falling object. It should progressively gain additional acceleration independent of gravity. It must be destroyed immediately upon hitting the ground. Set its size to 1.2× that of superBouncingBall, and its color to yellow.
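Under a structure like the sketch above, this request becomes roughly one new strategy file plus one factory registration. The snippet below is again only an illustration of that shape, with invented names and an arbitrary tuning value, not the code the models actually produced.

```typescript
// acceleratingFall.ts — hypothetical strategy for the new object:
// extra acceleration on top of gravity, destroyed on first ground contact.
import { FallStrategy } from "./fallStrategy";
import { FallingObjectFactory } from "./fallingObjectFactory";

export class AcceleratingFall implements FallStrategy {
  private extraAcceleration = 0;

  // Illustrative growth rate; the prompt fixes no exact number.
  constructor(private accelerationGrowth = 50) {}

  updateVelocity(velocityY: number, gravity: number, dt: number): number {
    // The extra acceleration keeps growing over time, independent of gravity.
    this.extraAcceleration += this.accelerationGrowth * dt;
    return velocityY + (gravity + this.extraAcceleration) * dt;
  }

  destroyOnGround(): boolean {
    return true; // destroyed immediately on hitting the ground
  }
}

// Registration: 1.2× the size of superBouncingBall, colored yellow.
const SUPER_BOUNCING_BALL_SIZE = 40; // illustrative; the real code would reuse the existing constant

export function registerAcceleratingYellow(factory: FallingObjectFactory): void {
  factory.register("acceleratingYellow", () => ({
    strategy: new AcceleratingFall(),
    size: SUPER_BOUNCING_BALL_SIZE * 1.2,
    color: "yellow",
  }));
}
```

The point of the sketch is locality: in the after codebase the change touches one new file and one registration call, so the model has much less surrounding code to read and regenerate.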
Experimental Setup
- IDE: Cursor 1.6.6; Models: gpt-5, claude-4-sonnet
- To ensure fairness, every prompt is executed in a new chat with zero context.
- Before each prompt, Cursor is fully closed and restarted.
- We alternate between the “before” and “after” versions of the source code in our tests.
- The default context includes only project-structure information in .cursorrules (an illustrative sketch appears at the end of this section).
- A run is counted only if the output meets the prompt’s requirements.
- Only one prompt per run (single-shot); multi-turn runs are excluded.
- If the output is functionally correct but a tuning value such as the weight differs (the prompt specifies no exact number), the run still counts as successful.
Example cycle: record results for gpt-5 on before → close chat and fully quit Cursor → restart → switch tag to after, run gpt-5 and record results → close and fully quit → switch back to before … and so on.
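.cursorrules, mentioned above, is a plain-text instruction file that Cursor reads into context by default. The sketch below only illustrates the kind of project-structure information such a file might contain; the paths and wording are assumptions, not the repository's actual contents.

```
# .cursorrules (illustrative sketch)
Project structure:
- src/strategies/   one FallStrategy implementation per file
- src/factory/      FallingObjectFactory and type registrations
- src/main.ts       game loop, rendering, ground collision

Add new falling-object behaviors as separate strategy files and register them in the factory.
```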
Results
Below are the total tokens consumed per run; each model's results are shown in a separate table (see the notes beneath the tables).
| Run | Before (tokens) | After (tokens) |
|---|---|---|
| Round 1 | 421,176 | 137,981 |
| Round 2 | 420,181 | 265,863 |
| Round 3 | 431,366 | 264,413 |
| Round 4 | 418,646 | 267,864 |
| Round 5 | 259,424 | 275,203 |
| AVG | 390,159 | 242,265 |
| Run | Before (tokens) | After (tokens) |
|---|---|---|
| Round 1 | 349,271 | 304,062 |
| Round 2 | 272,796 | 261,235 |
| Round 3 | 283,179 | 248,894 |
| Round 4 | 354,579 | 119,790 |
| Round 5 | 319,368 | 234,189 |
| AVG | 315,839 | 233,634 |
*All figures are based on Cursor’s Total Tokens. Because detailed accounting differs slightly by model, we interpret only the Before → After change within each model.
*Absolute comparisons between models are not meaningful.
*The before source is a goal-oriented codebase written 100% with gpt-5.
*The after source is a structured, refactored codebase derived from before, also written 100% with gpt-5.
*Contrary to expectations (and fortunately), every run in this experiment produced a correct result from a single prompt.
The tables show that, even with the same feature-addition prompt, a well-organized, well-structured codebase consumes significantly fewer tokens. In Sonnet's case, some runs show differences approaching 2×.
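Concretely, the averages drop from 390,159 to 242,265 tokens in the first table (roughly a 38% reduction) and from 315,839 to 233,634 in the second (roughly 26%).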
For gpt-5 the variability is quite irregular, which is interesting, but it does not overturn the overall conclusion. In informal pretests, before often exceeded 400k tokens per run; with more repetitions, the token gap would likely widen.
Conclusions
Source code is more than bare logic; it carries the author’s thinking and direction. This experiment suggests that good structure helps AI compute more efficiently, reducing token usage—and thus AI cost.
As an individual user, I couldn’t afford a larger sample size; that’s a limitation. However, the repository and prompts used here are public, so anyone can reproduce the experiment.
I hope this serves as a meaningful reference for the many vibe coders who will work with ever-advancing AI.