What is the best way to give tools to our agents? Over the past couple of weeks, the debate has heated up.
Peter Steinberger, the creator of OpenClaw, has been a big proponent of CLIs over MCP. He even built a tool called MCPorter to help people convert MCP servers into CLIs.
“MCP sucks honestly,” wrote Garry Tan, CEO of Y Combinator. “It eats too much context window… I vibe coded a CLI wrapper in 30 minutes and it worked 100x better.”
At the same time, Anthropic believes MCP is fixable through techniques like Code Mode and Tool Search. OpenAI introduced native support for tool search in GPT 5.4 as well, to combat the inefficiency.
But do they actually work?
Let’s run some benchmarks
There have been a few benchmarks on this topic already. Notably:
- https://mariozechner.at/posts/2025-08-15-mcp-vs-cli, with the limitation that it benchmarked a niche tool that doesn't quite represent mainstream usage
- https://smithery.ai/blog/mcp-vs-cli-is-the-wrong-fight found MCP outperforming CLI in success rate, but the CLI there was auto-generated from an API spec
I also couldn’t find a comprehensive analysis covering the real-world performance of Code Mode and Tool Search.
To bring some more data points to the table, I decided to:
- Benchmark a mainstream tool that many people use (GitHub) through a mainstream agent (Claude Code)
- Work through real, practical use cases across different complexity levels
- Cover raw CLI, MCP, Tool Search and Code Mode
- Test the hypothesis that none of today’s techniques is optimal and that we can do better
Results
Without further ado, here are the final results.
I compiled a more detailed report at https://axi.md, but here are some key takeaways.
Finding 1 — GitHub MCP is 2–3x more expensive to use than CLI
According to the results, GitHub’s MCP server eats your tokens 2–3x faster than their CLI.
There’s almost no practical reason to use the MCP server except for some differences in security handling.
The gap is largely due to the upfront tool schemas consuming a large number of tokens, with Tool Search and Code Mode often adding extra turns for tool discovery on top.
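A rough sketch of why upfront schemas dominate: with chat APIs, the full context is re-sent (and billed) every turn, so a schema preamble is paid again and again. In the model below, only the 42K figure comes from this post’s measurements; the CLI upfront size and per-turn growth are illustrative assumptions.

```python
# Back-of-envelope model: each turn re-sends the whole context, so the
# upfront cost is billed once per turn. RAW_MCP_UPFRONT matches the 42K
# starting context measured in this post; CLI_UPFRONT and GROWTH_PER_TURN
# are assumed, illustrative numbers.

RAW_MCP_UPFRONT = 42_000  # all tool schemas loaded up front
CLI_UPFRONT = 2_000       # assumption: brief system-prompt note, --help on demand
GROWTH_PER_TURN = 8_000   # assumption: conversation history added per turn

def total_input_tokens(upfront: int, turns: int, growth_per_turn: int) -> int:
    """Total input tokens billed across a conversation of `turns` turns."""
    return sum(upfront + t * growth_per_turn for t in range(turns))

mcp = total_input_tokens(RAW_MCP_UPFRONT, 6, GROWTH_PER_TURN)
cli = total_input_tokens(CLI_UPFRONT, 6, GROWTH_PER_TURN)
print(f"MCP ≈ {mcp:,} input tokens, CLI ≈ {cli:,} ({mcp / cli:.1f}x)")
# → MCP ≈ 372,000 input tokens, CLI ≈ 132,000 (2.8x)
```

Even with generous assumptions for the CLI side, the re-billed schema preamble alone lands in the observed 2–3x range.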
Finding 2 — Tool Search saves upfront tokens but spends them on extra turns
ToolSearch starts with a smaller context but needs extra turns for tool discovery. Whether that trade-off pays off depends on task complexity. Tracking per-turn context size across four tasks reveals the crossover:
Both conditions start at a fixed context size — 16K for ToolSearch (no schemas), 42K for raw MCP (all schemas loaded) — and grow as conversation history accumulates. On a simple view_pr, ToolSearch finishes at 18K (still well under raw’s 42K), so the upfront savings carry through to a lower cost despite one extra turn.
But as tasks grow, the contexts converge. By weekly_catchup, ToolSearch’s last turn has reached 43K—matching raw’s starting context—yet it still needs 2 extra turns, costing 1.5× more. On merged_pr_ci_audit, both end at ~85K tokens, but ToolSearch takes 24 turns to get there vs. 18—the upfront savings are completely gone, and the extra turns make it 1.8× more expensive.
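The crossover can be captured with a hedged back-of-envelope model: ToolSearch avoids re-sending roughly 26K of schema tokens each turn (42K minus 16K), but pays the full current context for every extra discovery turn. The turn counts and average context sizes below are illustrative assumptions, not measured values:

```python
# Simplified model of the Finding-2 trade-off. SCHEMA_SAVINGS is derived
# from the measured starting contexts (42K with schemas vs 16K without);
# the example turn counts and average contexts are assumptions.

SCHEMA_SAVINGS = 42_000 - 16_000  # schema tokens ToolSearch avoids per turn

def tool_search_wins(shared_turns: int, extra_turns: int, avg_ctx: int) -> bool:
    """True when the avoided schema tokens outweigh the extra turns' cost."""
    saved = SCHEMA_SAVINGS * shared_turns  # tokens not re-sent on shared turns
    paid = avg_ctx * extra_turns           # full-context cost of discovery turns
    return saved > paid

# Simple view_pr-like task: small context, one extra discovery turn.
print(tool_search_wins(shared_turns=3, extra_turns=1, avg_ctx=18_000))   # True
# Complex merged_pr_ci_audit-like task: context near 85K, several extra turns.
print(tool_search_wins(shared_turns=18, extra_turns=6, avg_ctx=85_000))  # False
```

The qualitative conclusion matches the data: the schema savings dominate on short tasks and evaporate once the accumulated context dwarfs the schemas.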
ToolSearch also introduces a new failure mode: on run_then_jobs, the agent cannot find the right tool via search (0/5 success), while raw MCP succeeds 4/5.
Finding 3 — Code Mode is the cheapest way to use MCP, but still 2x more expensive than CLI, and it’s very slow
The mcp-with-code-mode condition, where the agent writes TypeScript scripts calling typed wrapper functions, achieves the lowest cost among MCP conditions ($0.101/task). Writing code amortizes schema costs by batching multiple API calls per script. But the code-generation overhead makes it the slowest condition (43.4s avg, the longest of all five).
On ci_failure_investigation, mcp-with-code-mode costs $0.194/task—3× AXI’s $0.065—but achieves 5/5 success, outperforming both direct MCP conditions (2/5 each). Code-mode also introduces a unique failure mode: unhandled runtime errors cause 4/5 failures on nonexistent_repo, where the agent writes scripts that throw instead of gracefully reporting the error.
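The batching idea is easy to see in a sketch. The benchmark’s Code Mode emits TypeScript against generated wrappers; the Python analogue below uses hypothetical stub functions (fetch_pr, fetch_check_runs) in place of real MCP bindings, and wraps the calls in a try/except to avoid the unhandled-throw failure mode seen on nonexistent_repo:

```python
# Hypothetical stubs standing in for generated MCP wrapper functions; in
# real Code Mode these would call the GitHub MCP server.
def fetch_pr(number: int) -> dict:
    return {"number": number, "head_sha": "abc123", "title": "Fix flaky test"}

def fetch_check_runs(sha: str) -> list[dict]:
    return [
        {"name": "lint", "conclusion": "success"},
        {"name": "unit", "conclusion": "failure"},
    ]

def ci_audit(pr_number: int) -> dict:
    """Batch several tool calls in one script; only this compact summary
    (not the raw payloads or schemas) re-enters the model's context."""
    try:
        pr = fetch_pr(pr_number)
        runs = fetch_check_runs(pr["head_sha"])
    except Exception as exc:
        # Report gracefully instead of throwing (the nonexistent_repo lesson).
        return {"error": str(exc)}
    failed = [r["name"] for r in runs if r["conclusion"] == "failure"]
    return {"pr": pr["number"], "failed_checks": failed}

print(ci_audit(42))  # → {'pr': 42, 'failed_checks': ['unit']}
```

One script replaces what would otherwise be several model turns, which is exactly where the amortization comes from, at the price of code-generation latency.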
Finding 4 — AXI achieves 100% success rate at lowest cost and latency
Wait… what is AXI?
A new era of designing for agent ergonomics
Let’s take a step back and think about how we as humans use digital products every day.
When we turn on our phones, we see a few frequently used apps on the home screen. Some have notification badges telling us what needs attention. When we open Gmail, we see a few recent email threads, each with a preview of the content inside. We can then tap a thread to see more.
Every screen in that journey was carefully designed to minimize cognitive overload and allow us to accomplish what we need to do in the most ergonomic way possible.
Now imagine that when we turn on our phone, instead of a home screen we see a full 300+ page manual: every single button we could possibly tap and every setting we could possibly configure, all laid out with equal importance in front of our eyes.
That’s what we’re giving to our agents today.
AXI — Agent eXperience Interface — is a set of 10 design principles for building agent-ergonomic CLI tools, so our agents can accomplish tasks for us efficiently and reliably.
The 10 Principles
- Token-efficient output — Use TOON format for ~40% token savings over JSON
- Minimal default schemas — 3–4 fields per list item, not 10+
- Content truncation — Truncate large text fields with size hints and escape hatches
- Content first — Prefer showing actual data, not a wall of help text
- Contextual disclosure — Append relevant next-step commands after output, not all upfront
- Consistent way to get help — Concise per-subcommand reference for when agents need it
- Pre-computed fields — Include aggregated statuses that eliminate round trips
- Definitive empty states — Explicit “0 results” rather than ambiguous empty output
- Graceful error handling — Idempotent mutations, structured errors, no interactive prompts
- Output discipline — stdout for data, stderr for debug; clean exit codes
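To make a few of these principles concrete, here is a minimal sketch of a PR listing that applies minimal default schemas, content truncation, contextual disclosure, and definitive empty states. The format is illustrative plain text (not TOON), and the suggested view_pr follow-up is hypothetical; neither is gh-axi’s actual interface.

```python
# Illustrative sketch of AXI-style list output. The exact layout and the
# `view_pr` follow-up command are hypothetical, not gh-axi's real interface.

def list_prs(prs: list[dict]) -> str:
    if not prs:
        return "0 open PRs"  # definitive empty state, not silent emptiness
    lines = [f"{len(prs)} open PR(s)  (number, title, status)"]  # minimal schema
    for pr in prs:
        title = pr["title"]
        if len(title) > 40:  # truncate with a size hint as the escape hatch
            title = title[:40] + f"… [{len(pr['title'])} chars total]"
        lines.append(f"#{pr['number']}  {title}  {pr['status']}")
    lines.append("next: view_pr <number>")  # contextual disclosure
    return "\n".join(lines)

print(list_prs([{"number": 7, "title": "Add retry logic", "status": "ci:failing"}]))
print(list_prs([]))
```

The agent gets actual data first, a single relevant next step last, and an unambiguous answer when there is nothing to show.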
I shared more details at https://axi.md, along with a skill you can add to your agent to teach it how to build a good AXI, and a reference implementation, gh-axi, that you can use today.
Conclusion
The debate between CLI and MCP as agent-tool interfaces misses the deeper question: what design principles make any interface effective for agents?
This evaluation shows that by taking a principled design approach, AXI outperforms both raw CLI and MCP on every metric — success, cost, duration, and turns.
It’s time to start building apps agents love — it’s not a CLI; it’s not an MCP. It’s an AXI.