Empirium, our in-house experiment-tracking platform at ZAR, exposes 42 MCP tools. Every time Claude Code connects, all 42 tool definitions would get loaded into its context window. That’s about 18,000 tokens of JSON schema — 427 tokens per tool — burned before the model does a single useful thing.
Cloudflare faced the same problem at a much larger scale — 2,500 API endpoints, over 2 million tokens of tool definitions. Their solution was elegant: replace all those tools with just two. One to search, one to execute. They called it “Code Mode.”
I implemented it for Empirium last night, using Rails and ActionMCP. Opus 4.6 mostly one-shotted the change after being pointed at the Cloudflare blog post and given a few tips from me on how to approach the problem in Ruby.
So without further ado, here’s how my implementation of Code Mode works, and why every MCP server with more than a handful of tools should consider it.
The Problem with Many Tools
MCP (Model Context Protocol) is straightforward. You define tools with JSON Schema, the AI model reads those schemas, and calls the tools by name with arguments. It works beautifully when you have five tools. It works okay with fifteen. At forty-five, you’re paying a serious tax. At hundreds or thousands of tools, it just doesn’t work.
Every MCP tool definition includes its name, description, and full input schema with property types, descriptions, and required fields. The model has to parse and retain all of this context to decide which tool to call. Most of it is wasted — a given interaction might use three or four tools at most, if any at all.
Two Tools to Rule Them All
The pattern is dead simple. You expose two meta-tools:
code_search — accepts a Ruby code string, executes it in a read-only sandbox that exposes a tools method returning the full tool catalog. The AI writes Ruby to filter, search, and explore.
code_execute — accepts a Ruby code string, executes it in a sandbox that can invoke any MCP tool via call_tool(name, **args). The AI writes Ruby to chain calls, transform results, and build workflows.
The key insight is that LLMs are very good at writing code. Better, in fact, than they are at navigating large JSON schemas to pick the right tool and assemble the right arguments. Give them a programming environment and they figure it out.
The Implementation
The Sandbox
The foundation is a BasicObject clean room. BasicObject in Ruby gives you almost nothing — no Kernel, no Object methods, no file access, no shell access. You build up only what you need:
```ruby
require "timeout"

module CodeMode
  class Sandbox < BasicObject
    FORBIDDEN_PATTERNS = [
      /\bsystem\b/, /\bexec\b/, /\b`/, /\bFile\b/, /\bDir\b/,
      /\brequire\b/, /\beval\b/, /\bProcess\b/, /\bKernel\b/,
      /\bsend\b/, /\bconst_get\b/, /\bObjectSpace\b/, /\bENV\b/
    ].freeze

    TIMEOUT_SECONDS = 5

    def evaluate(code)
      if (violation = check_forbidden(code))
        return { error: "Forbidden pattern detected: #{violation}" }
      end
      result = ::Timeout.timeout(TIMEOUT_SECONDS) { _eval(code) }
      { result: result }
    rescue ::Timeout::Error
      { error: "Code execution timed out after #{TIMEOUT_SECONDS} seconds" }
    rescue ::StandardError => e
      { error: "#{e.class}: #{e.message}" }
    end

    private

    # Returns the first forbidden pattern the code matches (inspected
    # for a readable error message), or nil if the code is clean.
    def check_forbidden(code)
      FORBIDDEN_PATTERNS.find { |pattern| pattern.match?(code) }&.inspect
    end

    # BasicObject#instance_eval runs the code with access to only the
    # methods this class (and its subclasses) define.
    def _eval(code)
      instance_eval(code)
    end
  end
end
```

Regex guards catch obvious escape attempts before evaluation. The timeout catches infinite loops. BasicObject blocks access to the rest of Ruby's standard library. It's not an airtight security boundary; it's a practical one. Empirium is for internal use only. Your mileage may vary. (But keep reading to learn about an additional layer of security using "LLM as judge".)
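To see the clean-room idea in isolation, here is a trimmed-down, self-contained sketch. MiniSandbox and its shortened pattern list are illustrative stand-ins, not Empirium's actual class:

```ruby
require "timeout"

# Minimal standalone version of the clean-room pattern: a BasicObject host,
# a few regex guards, and a wall-clock timeout. Illustrative only.
class MiniSandbox < BasicObject
  FORBIDDEN = [/\bsystem\b/, /\bFile\b/, /\beval\b/, /\bENV\b/].freeze

  def evaluate(code)
    if (bad = FORBIDDEN.find { |re| re.match?(code) })
      return { error: "Forbidden pattern detected: #{bad.inspect}" }
    end
    # instance_eval runs the code with only what BasicObject provides,
    # so Kernel helpers like `puts`, `require`, and `open` are absent.
    { result: ::Timeout.timeout(1) { instance_eval(code) } }
  rescue ::Timeout::Error
    { error: "timed out" }
  rescue ::Exception => e
    { error: "#{e.class}: #{e.message}" }
  end
end

sandbox = MiniSandbox.new
sandbox.evaluate("(1..5).map { |n| n * n }") # result: [1, 4, 9, 16, 25]
sandbox.evaluate("File.read('/etc/passwd')") # rejected by the regex guard
sandbox.evaluate("while true; end")          # stopped by the timeout
```

Plain value-producing Ruby sails through; anything touching the filesystem or spinning forever comes back as an error hash instead of taking the process down.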
The Search Sandbox
The search sandbox inherits from Sandbox and exposes the tool catalog:
```ruby
class SearchSandbox < Sandbox
  def tools
    ::ActionMCP::ToolsRegistry.non_abstract.map do |item|
      item.klass.to_h
    end
  end
end
```

The ToolsRegistry here picks up the same 42 tool definitions I mentioned earlier.
Using the SearchSandbox, the AI can now write things like:

```ruby
tools.select { |t| t[:name].include?("experiment") }.map { |t| t[:name] }
```

And get back a filtered view of just the tools it needs, without loading all 42 definitions.
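Since `tools` returns a plain array of hashes, the snippet is easy to reason about with a stub catalog. The three entries below are invented for illustration:

```ruby
# A stub catalog (entries invented for illustration) showing what the
# search snippet operates on: plain hashes, filtered with ordinary Ruby.
tools = [
  { name: "experiments_list",   description: "List experiments for a team" },
  { name: "experiments_create", description: "Create a new experiment" },
  { name: "teams_list",         description: "List teams" }
]

matches = tools.select { |t| t[:name].include?("experiment") }
               .map { |t| t[:name] }
# matches == ["experiments_list", "experiments_create"]
```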
The Execute Sandbox
The execute sandbox adds the ability to call tools with user context:
```ruby
class ExecuteSandbox < Sandbox
  def initialize(user)
    @user = user
  end

  def call_tool(name, **args)
    ::ActionMCP::Current.set(gateway: ::OpenStruct.new(user: @user)) do
      response = ::ActionMCP::ToolsRegistry.tool_call(name, args.stringify_keys)
      parse_response(response) # not shown: unwraps the MCP response into a plain hash
    end
  end
end
```

The user context flows through exactly as it would with a normal MCP tool call. Authentication, audit trails, everything works. The AI can now write:
```ruby
teams = call_tool("teams_list")
alpha = teams["teams"].find { |t| t["slug"] == "alpha" }
exps = call_tool("experiments_list", team_slug: alpha["slug"])
{ team: alpha["name"], total: exps["count"],
  running: exps["experiments"].count { |e| e["status"] == "running" } }
```

Three MCP tool calls, chained together with data transformation, in a single round trip. Without code mode, that's three separate tool invocations with three LLM reasoning steps in between.
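To try that chained workflow without a live server, a stubbed `call_tool` returning made-up data is enough:

```ruby
# Hypothetical stand-in for the MCP dispatcher, returning canned data so
# the chained snippet can run anywhere.
def call_tool(name, **args)
  case name
  when "teams_list"
    { "teams" => [{ "slug" => "alpha", "name" => "Team Alpha" }] }
  when "experiments_list"
    { "count" => 3,
      "experiments" => [{ "status" => "running" },
                        { "status" => "running" },
                        { "status" => "done" }] }
  end
end

teams = call_tool("teams_list")
alpha = teams["teams"].find { |t| t["slug"] == "alpha" }
exps  = call_tool("experiments_list", team_slug: alpha["slug"])
summary = { team: alpha["name"], total: exps["count"],
            running: exps["experiments"].count { |e| e["status"] == "running" } }
# summary == { team: "Team Alpha", total: 3, running: 2 }
```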
The LLM Pre-Scan
Before executing code, I run a quick safety check through Google’s Gemini 3 Flash (using my own Raix gem and OpenRouter). This approach is cheap, fast, and adds a semantic layer on top of the regex guards:
```ruby
class CodeScanner
  include Raix::ChatCompletion

  SYSTEM_PROMPT = <<~PROMPT
    You are a code safety scanner for an MCP (Model Context Protocol) sandbox environment.
    The sandbox allows Ruby code that calls `tools` (to list available MCP tools) and
    `call_tool(name, **args)` (to invoke them).

    Your job: determine if the submitted code is SAFE or UNSAFE.

    SAFE code:
    - Calls `tools` to discover available MCP tools
    - Calls `call_tool` to invoke MCP tools with arguments
    - Uses basic Ruby (arrays, hashes, strings, iteration, filtering)
    - Chains multiple `call_tool` invocations together

    UNSAFE code:
    - Attempts to access the filesystem, network, or shell
    - Tries to break out of the sandbox (eval, send, const_get, ObjectSpace, etc.)
    - Accesses environment variables or credentials
    - Does anything unrelated to Empirium data operations

    Respond with exactly one word: SAFE or UNSAFE
  PROMPT

  def initialize(code)
    @code = code
    self.model = "google/gemini-3-flash-preview"
  end

  def safe?
    transcript << { system: SYSTEM_PROMPT }
    transcript << { user: @code }
    response = chat_completion
    response.to_s.strip.upcase.start_with?("SAFE")
  rescue StandardError
    true # Fail open — scanner unavailable means skip
  end
end
```
For us, the scanner is a nice-to-have, not a gate. It fails open: if the scanner is unavailable, the code still runs, protected by just the regex guards and the BasicObject sandbox. Your use case might warrant the opposite, especially if your MCP tools are exposed to consumers outside your company walls.
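The layering can be summarized in a few lines. Everything below (run_code_mode, FakeScanner, FakeSandbox) is an illustrative stand-in for the real classes, not Empirium's code:

```ruby
# Sketch of how the layers compose. The scanner is consulted first and fails
# open; the sandbox's own regex guards and BasicObject isolation always apply.
def run_code_mode(code, sandbox:, scanner: nil)
  if scanner
    verdict = begin
      scanner.safe?
    rescue StandardError
      true # fail open: a scanner outage skips the semantic check only
    end
    return { error: "Rejected by safety scanner" } unless verdict
  end
  sandbox.evaluate(code)
end

# Minimal stand-ins to exercise the flow:
class FakeScanner
  def initialize(verdict)
    @verdict = verdict
  end

  def safe?
    @verdict
  end
end

class FakeSandbox
  def evaluate(code)
    { result: "ran: #{code}" }
  end
end

run_code_mode("1 + 1", sandbox: FakeSandbox.new, scanner: FakeScanner.new(true))
# => { result: "ran: 1 + 1" }
run_code_mode("File.read('x')", sandbox: FakeSandbox.new, scanner: FakeScanner.new(false))
# => { error: "Rejected by safety scanner" }
```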
Profile-Based Routing
I left the original 42 tools on /mcp untouched, mostly because removing them would have forced me to find a different way of documenting and exercising my API.
Code mode lives at /mcp_cm as a separate concurrent endpoint. A Rack middleware switches ActionMCP's thread-local profile:
```ruby
class CodeModeProfile
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"]&.start_with?("/mcp_cm")
      ActionMCP.with_profile(:code_mode) { @app.call(env) }
    else
      @app.call(env)
    end
  end
end
```

The code_mode profile in config/mcp.yml exposes only the two tools:
```yaml
profiles:
  primary:
    tools: [all]
  code_mode:
    tools: [code_search, code_execute]
```

Both endpoints share the same authentication, the same tool implementations, the same database. The only difference is what the AI client sees when it connects.
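You can exercise the routing in isolation by stubbing `ActionMCP.with_profile` with a thread-local. The stub below mimics only the method's shape; the real implementation lives in the actionmcp gem:

```ruby
# Stub of ActionMCP.with_profile so the middleware can run outside Rails.
module ActionMCP
  def self.with_profile(name)
    previous = Thread.current[:mcp_profile]
    Thread.current[:mcp_profile] = name
    yield
  ensure
    Thread.current[:mcp_profile] = previous
  end
end

class CodeModeProfile
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"]&.start_with?("/mcp_cm")
      ActionMCP.with_profile(:code_mode) { @app.call(env) }
    else
      @app.call(env)
    end
  end
end

# A bare Rack app that echoes the active profile back in the body.
echo  = ->(env) { [200, {}, [Thread.current[:mcp_profile].inspect]] }
stack = CodeModeProfile.new(echo)

stack.call("PATH_INFO" => "/mcp_cm") # body: [":code_mode"]
stack.call("PATH_INFO" => "/mcp")    # body: ["nil"]
```

Requests under /mcp_cm see the code_mode profile for exactly the duration of the request; everything else passes through untouched.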
What It Looks Like in Practice
Here’s a real interaction. I asked Claude Code to add a random emoji to every assumption title in Empirium using the code mode endpoint. One tool call:
```ruby
emojis = %w[🚀 🔥 💡 🎯 ⚡ 🌟 🎲 🧪 🔬 🏆 💎 🌈]
all = call_tool("assumptions_list")
assumptions = all["assumptions"]

results = assumptions.map do |a|
  emoji = emojis.sample
  new_statement = "#{emoji} #{a['statement']}"
  call_tool("assumptions_update", id: a["id"], statement: new_statement)
  { id: a["id"], emoji: emoji }
end

{ updated: results.size, details: results }
```

Twenty assumptions updated in a single round trip. I watched the web page update in real time as it worked. Blazing fast, mind-blowingly so.
Without code mode, that would have been 21 separate tool calls (1 list + 20 updates), each requiring the model to reason about the next step. With code mode, the model writes the loop once and the server executes it.
Undoing it was equally trivial:
```ruby
all = call_tool("assumptions_list")
all["assumptions"].map do |a|
  clean = a["statement"].sub(/\A\p{So}\s*/, "")
  call_tool("assumptions_update", id: a["id"], statement: clean)
end
```

The savings you gain with Code Mode compound.
Every conversation turn that would have listed all 42 tools now lists two. Every multi-step workflow that would have required multiple tool calls and LLM reasoning steps collapses into a single code execution.
Should You Do This?
If your MCP server has fewer than ten tools, maybe not; the token savings aren't worth the added complexity. But if you're north of twenty tools, or if your users routinely chain multiple tools together, code mode pays for itself immediately.
The implementation is small. My entire code mode implementation is under 200 lines of Ruby across four files, plus a middleware and some config. It took me about an hour, including testing. The sandbox pattern is reusable. The profile-based routing means I can offer both endpoints simultaneously and let clients choose.
The deeper principle here is one that keeps showing up in AI application development: don’t make the model navigate complexity when you can give it tools to manage that complexity itself. LLMs write code better than they do almost anything else. Set them free.