How to read these results
These benchmarks show what we measured, not what you should expect. Token usage varies significantly based on model behavior, codebase structure, prompt phrasing, and run-to-run variance. We show Claude Code (Sonnet 4.5) as the baseline for comparison.
Each benchmark shows raw token counts and the delta versus Claude Code; negative percentages mean fewer tokens were used. The figures are averages over repeated runs (15 per task); for variance data across those runs, see the individual task notes.
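If you want the delta column spelled out, it is just the relative change from the baseline's token count. The small sketch below shows the arithmetic; the helper name is ours for illustration, not part of any published tooling.

```python
def delta_vs_baseline(agent_tokens: int, baseline_tokens: int) -> str:
    """Relative token change versus the baseline, as a signed percentage."""
    pct = (agent_tokens - baseline_tokens) / baseline_tokens * 100
    return f"{pct:+.0f}%"

print(delta_vs_baseline(31_000, 45_000))  # -31%  (31k vs a 45k baseline)
print(delta_vs_baseline(70_000, 45_000))  # +56%  (70k vs a 45k baseline)
```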
Methodology
How we ran these benchmarks:
- Model: Claude Sonnet 4.5 for all runs (via Anthropic API)
- Baseline: Claude Code with default settings
- Measurement: Total input + output tokens for task completion, including all delegated subagent tasks
- Success criteria: Task must complete correctly; failed runs excluded
- Environment: Clean clone of repository, no prior context
Token counts include everything: when an agent spawns subagents or delegates work, those tokens are summed into the total. We're working on publishing the benchmark harness so you can run these tasks yourself. For now, treat these numbers as indicative, not definitive.
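Since the harness isn't published yet, here is a minimal sketch of how that aggregation works in principle. The `RunLog` structure and its field names are our illustration, not the harness's actual code; the input/output token fields mirror what the Anthropic API reports per request.

```python
from dataclasses import dataclass, field

@dataclass
class RunLog:
    """Token usage recorded for one agent, plus any subagents it spawned (illustrative only)."""
    input_tokens: int = 0
    output_tokens: int = 0
    subagent_logs: list["RunLog"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Sum this agent's input + output tokens, then recurse into every
        # delegated subagent so nothing is excluded from the reported total.
        own = self.input_tokens + self.output_tokens
        return own + sum(sub.total_tokens() for sub in self.subagent_logs)

# Example: a main run that delegated two subtasks.
main = RunLog(input_tokens=18_000, output_tokens=4_000, subagent_logs=[
    RunLog(input_tokens=6_000, output_tokens=1_500),
    RunLog(input_tokens=2_500, output_tokens=500),
])
print(main.total_tokens())  # 32500
```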
Results by task type
We run many more scenarios internally; this excerpt covers typical real-world use cases where we've seen consistent strengths in our approach.
Exploration tasks
Navigating unfamiliar codebases to answer questions about structure or behavior.
| Agent | Tokens | vs Baseline |
|---|---|---|
| Claude Code (Sonnet 4.5) — baseline | 45k | — |
| OpenCode (Sonnet 4.5) | 70k | +56% |
| Chippery /balanced | 31k | −31% |
| Chippery /balanced-pro | 27k | −40% |
| Chippery /frugal | 21k | −53% |
Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded
Debugging tasks
Finding and fixing bugs in existing code.
| Agent | Tokens | vs Baseline |
|---|---|---|
| Claude Code (Sonnet 4.5) — baseline | 35k | — |
| OpenCode (Sonnet 4.5) | 28k | −20% |
| Chippery /balanced | 21k | −40% |
| Chippery /balanced-pro | 19k | −46% |
| Chippery /frugal | 17k | −51% |
Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded
Refactoring tasks
Restructuring code, fixing issues, and cleaning up without changing behavior.
| Agent | Tokens | vs Baseline |
|---|---|---|
| Claude Code (Sonnet 4.5) — baseline | 64k | — |
| OpenCode (Sonnet 4.5) | 42k | −34% |
| Chippery /balanced | 36k | −44% |
| Chippery /balanced-pro | 34k | −47% |
| Chippery /frugal | 31k | −52% |
Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded
Multi-turn conversations
Real-world sessions that shift focus: starting in one area, moving elsewhere, then returning. These test how well the agent maintains context and adapts to changing requirements.
| Agent | Tokens | vs Baseline |
|---|---|---|
| Claude Code (Sonnet 4.5) — baseline | 98k | — |
| OpenCode (Sonnet 4.5) | 95k | −3% |
| Chippery /balanced | 42k | −57% |
| Chippery /balanced-pro | 36k | −63% |
| Chippery /frugal | 32k | −67% |
Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded
Where optimization didn't help
Not every benchmark favors our approach. Here's one, run against the Lucky codebase (written in Crystal), where Chippery performed worse than the baseline.
| Agent | Tokens | vs Baseline |
|---|---|---|
| Claude Code (Sonnet 4.5) — baseline | 110k | — |
| OpenCode (Sonnet 4.5) | 137k | +25% |
| Chippery /balanced | 142k | +29% |
| Chippery /balanced-pro | 130k | +18% |
| Chippery /frugal | 132k | +20% |
Model: Sonnet 4.5 · Average of 15 runs · All runs succeeded
Why this happened: Chippery's optimization relies on understanding codebase structure and navigating it efficiently. Lucky is written in Crystal and uses patterns that our indexer couldn't rank effectively. When Chippery can't build a reliable mental model of the code, its navigation heuristics work against it, causing more exploration rather than less. This is the kind of project where /all-in mode (no optimization) is a good fallback.
Where Chippery does well
We've seen consistent improvements in these scenarios:
- Codebase navigation: When Chippery understands the project structure and can navigate it quickly, token savings are typically strong.
- Changing conversations: Sessions that shift topics (starting on one problem, exploring something else, then returning) tend to benefit from our context management.
- Popular languages and frameworks: We've tested extensively on JavaScript, TypeScript, Python, Go, Rust, PHP, Swift, Java, C, C++, Ruby, and Kotlin. Projects using common frameworks in these languages generally work well.
Where Chippery struggles
We've observed minimal or no improvement, or occasional regressions, in these scenarios:
- Unfamiliar concepts: If the codebase uses patterns or abstractions that the underlying model doesn't understand well, Chippery can't compensate and may rank irrelevant code higher in the index.
- Poor relevance ranking: Some project structures make it hard for our indexer to determine what's important. When this happens, the agent may read more code than necessary, reducing or eliminating token savings.
Languages tested
Our benchmarks have used these languages. Chippery likely works on others; we just haven't measured them yet.
JavaScript, TypeScript, Python, Go, Rust, PHP, Swift, Java, C, C++, Ruby, Kotlin
Reproduce these benchmarks
We're working on publishing our benchmark harness. When ready, you'll be able to run these same tasks on your own machine and compare results. Watch the GitHub repo for updates.
In the meantime, if you run your own comparisons and want to share results, we'd be interested: [email protected]