Chippery - Benchmarks


How to read these results

These benchmarks show what we measured, not what you should expect. Token usage varies significantly based on model behavior, codebase structure, prompt phrasing, and run-to-run variance. We show Claude Code (Sonnet 4.5) as the baseline for comparison.

Each benchmark shows raw token counts and the delta versus Claude Code; negative percentages mean fewer tokens used. Results are averaged over repeated runs, with the run count noted under each table; see the individual task notes for run-to-run variance data.
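
For clarity, the delta column is just the relative difference against the Claude Code total. The snippet below is an illustrative sketch of that arithmetic (not part of our harness), using numbers from the exploration table further down:

```python
def delta_vs_baseline(agent_tokens: int, baseline_tokens: int) -> str:
    """Relative token difference against the Claude Code baseline, as a signed percentage."""
    pct = (agent_tokens - baseline_tokens) / baseline_tokens * 100
    return f"{pct:+.0f}%"

# Numbers from the exploration table below: a 45k-token baseline.
print(delta_vs_baseline(70_000, 45_000))  # +56%  (more tokens than baseline)
print(delta_vs_baseline(21_000, 45_000))  # -53%  (fewer tokens than baseline)
```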

Methodology

How we ran these benchmarks:

  • Model: Claude Sonnet 4.5 for all runs (via Anthropic API)
  • Baseline: Claude Code with default settings
  • Measurement: Total input + output tokens for task completion, including all delegated subagent tasks
  • Success criteria: Task must complete correctly; failed runs excluded
  • Environment: Clean clone of repository, no prior context

Token counts include everything: when an agent spawns subagents or delegates work, those tokens are summed into the total. We're working on publishing the benchmark harness so you can run these tasks yourself. For now, treat these results as indicative, not definitive.
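
As a sketch of that accounting (illustrative only; the published harness may differ, and the types below are hypothetical), the total for a run is the sum of input and output tokens across the main session and every delegated subagent task:

```python
from dataclasses import dataclass

@dataclass
class ApiCall:
    """Token usage reported for a single model call."""
    input_tokens: int
    output_tokens: int

def run_total(main_calls: list[ApiCall], subagent_calls: list[ApiCall]) -> int:
    """Total tokens for one benchmark run: main session plus all delegated subagent work."""
    return sum(c.input_tokens + c.output_tokens for c in main_calls + subagent_calls)

# Example: a main session of two calls plus one delegated subagent call.
total = run_total(
    main_calls=[ApiCall(12_000, 800), ApiCall(9_500, 600)],
    subagent_calls=[ApiCall(4_000, 300)],
)
print(total)  # 27200
```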

Results by task type

We run many more scenarios internally; the excerpt below covers typical real-world use cases where our approach has shown consistent strengths.

Exploration tasks

Navigating unfamiliar codebases to answer questions about structure or behavior.

Agent                      Tokens   vs Baseline
Claude Code (Sonnet 4.5)   45k      baseline
OpenCode (Sonnet 4.5)      70k      +56%
Chippery /balanced         31k      −31%
Chippery /balanced-pro     27k      −40%
Chippery /frugal           21k      −53%

Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded

Debugging tasks

Finding and fixing bugs in existing code.

Agent                      Tokens   vs Baseline
Claude Code (Sonnet 4.5)   35k      baseline
OpenCode (Sonnet 4.5)      28k      −20%
Chippery /balanced         21k      −40%
Chippery /balanced-pro     19k      −46%
Chippery /frugal           17k      −51%

Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded

Refactoring tasks

Restructuring code, fixing issues, and cleaning up without changing behavior.

Agent                      Tokens   vs Baseline
Claude Code (Sonnet 4.5)   64k      baseline
OpenCode (Sonnet 4.5)      42k      −34%
Chippery /balanced         36k      −44%
Chippery /balanced-pro     34k      −47%
Chippery /frugal           31k      −52%

Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded

Multi-turn conversations

Real-world sessions that shift focus (start in one area, move elsewhere, then return). These test how well the agent maintains context and navigates changing requirements.

Agent                      Tokens   vs Baseline
Claude Code (Sonnet 4.5)   98k      baseline
OpenCode (Sonnet 4.5)      95k      −3%
Chippery /balanced         42k      −57%
Chippery /balanced-pro     36k      −63%
Chippery /frugal           32k      −67%

Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded

Where optimization didn't help

Not every benchmark favors our approach. Here's one, a task against the Lucky web framework, where Chippery performed worse than baseline.

Agent                      Tokens   vs Baseline
Claude Code (Sonnet 4.5)   110k     baseline
OpenCode (Sonnet 4.5)      137k     +25%
Chippery /balanced         142k     +29%
Chippery /balanced-pro     130k     +18%
Chippery /frugal           132k     +20%

Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded

Why this happened: Chippery's optimization relies on understanding codebase structure and navigating it efficiently. Lucky is written in Crystal, which uses patterns that our indexer couldn't rank effectively. When Chippery can't build a reliable mental model of the code, its navigation heuristics work against it, causing more exploration rather than less. For projects like this, /all-in mode (no optimization) is a good fallback.

Where Chippery does well

We've seen consistent improvements in these scenarios:

  • Codebase navigation: When Chippery understands the project structure and can navigate it quickly, token savings are typically strong.
  • Changing conversations: Sessions that shift topics (start on one problem, explore something else, then return) tend to benefit from our context management.
  • Popular languages and frameworks: We've tested extensively on JavaScript, TypeScript, Python, Go, Rust, PHP, Swift, Java, C, C++, Ruby, and Kotlin. Projects using common frameworks in these languages generally work well.

Where Chippery struggles

We've observed minimal or no improvement, or occasional regressions, in these scenarios:

  • Unfamiliar concepts: If the codebase uses patterns or abstractions that the underlying model doesn't understand well, Chippery can't compensate and may rank irrelevant code higher in the index.
  • Poor relevance ranking: Some project structures make it hard for our indexer to determine what's important. When this happens, the agent may read more code than necessary, reducing or eliminating token savings.

Languages tested

Our benchmarks have used these languages. Chippery likely works on others; we just haven't measured them yet.

JavaScript, TypeScript, Python, Go, Rust, PHP, Swift, Java, C, C++, Ruby, Kotlin

Reproduce these benchmarks

We're working on publishing our benchmark harness. When ready, you'll be able to run these same tasks on your own machine and compare results. Watch the GitHub repo for updates.

In the meantime, if you run your own comparisons and want to share results, we'd be interested: [email protected]