Empirium, our in-house experiment-tracking platform at ZAR, exposes 42 MCP tools. Every time Claude Code connects, all 42 tool definitions would get loaded into its context window. That’s about 18,000 tokens of JSON schema — 427 tokens per tool — burned before the model does a single useful thing.
Cloudflare faced the same problem at a much larger scale — 2,500 API endpoints, over 2 million tokens of tool definitions. Their solution was elegant: replace all those tools with just two. One to search, one to execute. They called it “Code Mode.”
I implemented it for Empirium last night, using Rails and ActionMCP. Opus 4.6 mostly one-shotted the change after being pointed at the Cloudflare blog post and given a few tips from me on how to approach the problem in Ruby.
So without further ado, here’s how my implementation of Code Mode works, and why every MCP server with more than a handful of tools should consider it.
The Problem with Many Tools
MCP (Model Context Protocol) is straightforward. You define tools with JSON Schema, the AI model reads those schemas, and calls the tools by name with arguments. It works beautifully when you have five tools. It works okay with fifteen. At forty-five, you’re paying a serious tax. At hundreds or thousands of tools, it just doesn’t work.
Every MCP tool definition includes its name, description, and full input schema with property types, descriptions, and required fields. The model has to parse and retain all of this context to decide which tool to call. Most of it is wasted — a given interaction might use three or four tools at most, if any at all.
Two Tools to Rule Them All
The pattern is dead simple. You expose two meta-tools:
code_search — accepts a Ruby code string, executes it in a read-only sandbox that exposes a tools method returning the full tool catalog. The AI writes Ruby to filter, search, and explore.
code_execute — accepts a Ruby code string, executes it in a sandbox that can invoke any MCP tool via call_tool(name, **args). The AI writes Ruby to chain calls, transform results, and build workflows.
The key insight is that LLMs are very good at writing code. Better, in fact, than they are at navigating large JSON schemas to pick the right tool and assemble the right arguments. Give them a programming environment and they figure it out.
The Implementation
The Sandbox
The foundation is a BasicObject clean room. BasicObject in Ruby gives you almost nothing — no Kernel, no Object methods, no file access, no shell access. You build up only what you need:
```ruby
require "timeout"

module CodeMode
  class Sandbox < BasicObject
    FORBIDDEN_PATTERNS = [
      /\bsystem\b/, /\bexec\b/, /\b`/, /\bFile\b/, /\bDir\b/,
      /\brequire\b/, /\beval\b/, /\bProcess\b/, /\bKernel\b/,
      /\bsend\b/, /\bconst_get\b/, /\bObjectSpace\b/, /\bENV\b/
    ].freeze

    TIMEOUT_SECONDS = 5

    def evaluate(code)
      if (violation = check_forbidden(code))
        return { error: "Forbidden pattern detected: #{violation}" }
      end
      result = ::Timeout.timeout(TIMEOUT_SECONDS) { _eval(code) }
      { result: result }
    rescue ::Timeout::Error
      { error: "Code execution timed out after #{TIMEOUT_SECONDS} seconds" }
    rescue ::StandardError => e
      { error: "#{e.class}: #{e.message}" }
    end

    private

    # Returns the first forbidden pattern the code matches (inspected
    # for a readable error message), or nil if the code is clean.
    def check_forbidden(code)
      FORBIDDEN_PATTERNS.find { |pattern| pattern.match?(code) }&.inspect
    end

    # BasicObject#instance_eval runs the code with access to only the
    # methods this class (and its subclasses) define.
    def _eval(code)
      instance_eval(code)
    end
  end
end
```

Regex guards catch obvious escape attempts before evaluation. The timeout catches infinite loops. BasicObject blocks access to the rest of Ruby's standard library. It's not an airtight security boundary; it's a practical one. Empirium is for internal use only. Your mileage may vary. (But keep reading to learn about an additional layer of security using "LLM as judge".)
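To see the clean-room idea in isolation, here is a trimmed-down, self-contained sketch. MiniSandbox and its shortened pattern list are illustrative stand-ins, not Empirium's actual class:

```ruby
require "timeout"

# Minimal standalone version of the clean-room pattern: a BasicObject host,
# a few regex guards, and a wall-clock timeout. Illustrative only.
class MiniSandbox < BasicObject
  FORBIDDEN = [/\bsystem\b/, /\bFile\b/, /\beval\b/, /\bENV\b/].freeze

  def evaluate(code)
    if (bad = FORBIDDEN.find { |re| re.match?(code) })
      return { error: "Forbidden pattern detected: #{bad.inspect}" }
    end
    # instance_eval runs the code with only what BasicObject provides,
    # so Kernel helpers like `puts`, `require`, and `open` are absent.
    { result: ::Timeout.timeout(1) { instance_eval(code) } }
  rescue ::Timeout::Error
    { error: "timed out" }
  rescue ::Exception => e
    { error: "#{e.class}: #{e.message}" }
  end
end

sandbox = MiniSandbox.new
sandbox.evaluate("(1..5).map { |n| n * n }") # result: [1, 4, 9, 16, 25]
sandbox.evaluate("File.read('/etc/passwd')") # rejected by the regex guard
sandbox.evaluate("while true; end")          # stopped by the timeout
```

Plain value-producing Ruby sails through; anything touching the filesystem or spinning forever comes back as an error hash instead of taking the process down.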
The Search Sandbox
The search sandbox inherits from Sandbox and exposes the tool catalog:
```ruby
class SearchSandbox < Sandbox
  def tools
    ::ActionMCP::ToolsRegistry.non_abstract.map do |item|
      item.klass.to_h
    end
  end
end
```

The ToolsRegistry here picks up the same 42 tool definitions I mentioned earlier.
Using the SearchSandbox, the AI can now write things like:

```ruby
tools.select { |t| t[:name].include?("experiment") }.map { |t| t[:name] }
```

And get back a filtered view of just the tools it needs, without loading all 42 definitions.
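Since `tools` returns a plain array of hashes, the snippet is easy to reason about with a stub catalog. The three entries below are invented for illustration:

```ruby
# A stub catalog (entries invented for illustration) showing what the
# search snippet operates on: plain hashes, filtered with ordinary Ruby.
tools = [
  { name: "experiments_list",   description: "List experiments for a team" },
  { name: "experiments_create", description: "Create a new experiment" },
  { name: "teams_list",         description: "List teams" }
]

matches = tools.select { |t| t[:name].include?("experiment") }
               .map { |t| t[:name] }
# matches == ["experiments_list", "experiments_create"]
```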
The Execute Sandbox
The execute sandbox adds the ability to call tools with user context:
```ruby
class ExecuteSandbox < Sandbox
  def initialize(user)
    @user = user
  end

  def call_tool(name, **args)
    ::ActionMCP::Current.set(gateway: ::OpenStruct.new(user: @user)) do
      response = ::ActionMCP::ToolsRegistry.tool_call(name, args.stringify_keys)
      parse_response(response) # not shown: unwraps the MCP response into a plain hash
    end
  end
end
```

The user context flows through exactly as it would with a normal MCP tool call. Authentication, audit trails, everything works. The AI can now write:
```ruby
teams = call_tool("teams_list")
alpha = teams["teams"].find { |t| t["slug"] == "alpha" }
exps = call_tool("experiments_list", team_slug: alpha["slug"])
{ team: alpha["name"], total: exps["count"],
  running: exps["experiments"].count { |e| e["status"] == "running" } }
```

Three MCP tool calls, chained together with data transformation, in a single round trip. Without code mode, that's three separate tool invocations with three LLM reasoning steps in between.
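To try that chained workflow without a live server, a stubbed `call_tool` returning made-up data is enough:

```ruby
# Hypothetical stand-in for the MCP dispatcher, returning canned data so
# the chained snippet can run anywhere.
def call_tool(name, **args)
  case name
  when "teams_list"
    { "teams" => [{ "slug" => "alpha", "name" => "Team Alpha" }] }
  when "experiments_list"
    { "count" => 3,
      "experiments" => [{ "status" => "running" },
                        { "status" => "running" },
                        { "status" => "done" }] }
  end
end

teams = call_tool("teams_list")
alpha = teams["teams"].find { |t| t["slug"] == "alpha" }
exps  = call_tool("experiments_list", team_slug: alpha["slug"])
summary = { team: alpha["name"], total: exps["count"],
            running: exps["experiments"].count { |e| e["status"] == "running" } }
# summary == { team: "Team Alpha", total: 3, running: 2 }
```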
The LLM Pre-Scan
Before executing code, I run a quick safety check through Google’s Gemini 3 Flash (using my own Raix gem and OpenRouter). This approach is cheap, fast, and adds a semantic layer on top of the regex guards:
```ruby
class CodeScanner
  include Raix::ChatCompletion

  SYSTEM_PROMPT = <<~PROMPT
    You are a code safety scanner for an MCP (Model Context Protocol) sandbox environment.
    The sandbox allows Ruby code that calls `tools` (to list available MCP tools) and
    `call_tool(name, **args)` (to invoke them).

    Your job: determine if the submitted code is SAFE or UNSAFE.

    SAFE code:
    - Calls `tools` to discover available MCP tools
    - Calls `call_tool` to invoke MCP tools with arguments
    - Uses basic Ruby (arrays, hashes, strings, iteration, filtering)
    - Chains multiple `call_tool` invocations together

    UNSAFE code:
    - Attempts to access the filesystem, network, or shell
    - Tries to break out of the sandbox (eval, send, const_get, ObjectSpace, etc.)
    - Accesses environment variables or credentials
    - Does anything unrelated to Empirium data operations

    Respond with exactly one word: SAFE or UNSAFE
  PROMPT

  def initialize(code)
    @code = code
    self.model = "google/gemini-3-flash-preview"
  end

  def safe?
    transcript << { system: SYSTEM_PROMPT }
    transcript << { user: @code }
    response = chat_completion
    response.to_s.strip.upcase.start_with?("SAFE")
  rescue StandardError
    true # Fail open — scanner unavailable means skip
  end
end
```
For us, the scanner is a nice-to-have, not a gate. It fails open: if the scanner is unavailable, the code still runs, protected by just the regex guards and the BasicObject sandbox. Your use case might warrant the opposite, especially if your MCP tools are exposed to consumers outside your company walls.
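The layering can be summarized in a few lines. Everything below (run_code_mode, FakeScanner, FakeSandbox) is an illustrative stand-in for the real classes, not Empirium's code:

```ruby
# Sketch of how the layers compose. The scanner is consulted first and fails
# open; the sandbox's own regex guards and BasicObject isolation always apply.
def run_code_mode(code, sandbox:, scanner: nil)
  if scanner
    verdict = begin
      scanner.safe?
    rescue StandardError
      true # fail open: a scanner outage skips the semantic check only
    end
    return { error: "Rejected by safety scanner" } unless verdict
  end
  sandbox.evaluate(code)
end

# Minimal stand-ins to exercise the flow:
class FakeScanner
  def initialize(verdict)
    @verdict = verdict
  end

  def safe?
    @verdict
  end
end

class FakeSandbox
  def evaluate(code)
    { result: "ran: #{code}" }
  end
end

run_code_mode("1 + 1", sandbox: FakeSandbox.new, scanner: FakeScanner.new(true))
# => { result: "ran: 1 + 1" }
run_code_mode("File.read('x')", sandbox: FakeSandbox.new, scanner: FakeScanner.new(false))
# => { error: "Rejected by safety scanner" }
```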
Profile-Based Routing
I left the original 42 tools on /mcp untouched, mostly because removing them would have forced me to find a different way of documenting and exercising my API.
Code mode lives at /mcp_cm as a separate concurrent endpoint. A Rack middleware switches ActionMCP's thread-local profile:
```ruby
class CodeModeProfile
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"]&.start_with?("/mcp_cm")
      ActionMCP.with_profile(:code_mode) { @app.call(env) }
    else
      @app.call(env)
    end
  end
end
```

The code_mode profile in config/mcp.yml exposes only the two tools:
```yaml
profiles:
  primary:
    tools: [all]
  code_mode:
    tools: [code_search, code_execute]
```

Both endpoints share the same authentication, the same tool implementations, the same database. The only difference is what the AI client sees when it connects.
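You can exercise the routing in isolation by stubbing `ActionMCP.with_profile` with a thread-local. The stub below mimics only the method's shape; the real implementation lives in the actionmcp gem:

```ruby
# Stub of ActionMCP.with_profile so the middleware can run outside Rails.
module ActionMCP
  def self.with_profile(name)
    previous = Thread.current[:mcp_profile]
    Thread.current[:mcp_profile] = name
    yield
  ensure
    Thread.current[:mcp_profile] = previous
  end
end

class CodeModeProfile
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"]&.start_with?("/mcp_cm")
      ActionMCP.with_profile(:code_mode) { @app.call(env) }
    else
      @app.call(env)
    end
  end
end

# A bare Rack app that echoes the active profile back in the body.
echo  = ->(env) { [200, {}, [Thread.current[:mcp_profile].inspect]] }
stack = CodeModeProfile.new(echo)

stack.call("PATH_INFO" => "/mcp_cm") # body: [":code_mode"]
stack.call("PATH_INFO" => "/mcp")    # body: ["nil"]
```

Requests under /mcp_cm see the code_mode profile for exactly the duration of the request; everything else passes through untouched.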
What It Looks Like in Practice
Here’s a real interaction. I asked Claude Code to add a random emoji to every assumption title in Empirium using the code mode endpoint. One tool call:
```ruby
emojis = %w[🚀 🔥 💡 🎯 ⚡ 🌟 🎲 🧪 🔬 🏆 💎 🌈]
all = call_tool("assumptions_list")
assumptions = all["assumptions"]

results = assumptions.map do |a|
  emoji = emojis.sample
  new_statement = "#{emoji} #{a['statement']}"
  call_tool("assumptions_update", id: a["id"], statement: new_statement)
  { id: a["id"], emoji: emoji }
end

{ updated: results.size, details: results }
```

Twenty assumptions updated in a single round trip. I watched the web page update in real time as it worked. Blazing fast, mind-blowingly so.
Without code mode, that would have been 21 separate tool calls (1 list + 20 updates), each requiring the model to reason about the next step. With code mode, the model writes the loop once and the server executes it.
Undoing it was equally trivial:
```ruby
all = call_tool("assumptions_list")
all["assumptions"].map do |a|
  clean = a["statement"].sub(/\A\p{So}\s*/, "")
  call_tool("assumptions_update", id: a["id"], statement: clean)
end
```

The savings you gain with Code Mode compound.
Every conversation turn that would have listed all 42 tools now lists two. Every multi-step workflow that would have required multiple tool calls and LLM reasoning steps collapses into a single code execution.
Should You Do This?
If your MCP server has fewer than ten tools, maybe not; the token savings aren't worth the added complexity. But if you're north of twenty tools, or if your users routinely chain multiple tools together, code mode pays for itself immediately.
The implementation is small. My entire code mode implementation is under 200 lines of Ruby across four files, plus a middleware and some config. It took me about an hour, including testing. The sandbox pattern is reusable. The profile-based routing means I can offer both endpoints simultaneously and let clients choose.
The deeper principle here is one that keeps showing up in AI application development: don’t make the model navigate complexity when you can give it tools to manage that complexity itself. LLMs write code better than they do almost anything else. Set them free.