Everything That Can Be Deterministic, Should Be: My Claude Code Setup


Andrej Karpathy wrote this:

I’ve never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue.

There’s a new programmable layer of abstraction to master (in addition to the usual layers below) involving agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations, and a need to build an all-encompassing mental model for strengths and pitfalls of fundamentally stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering.

Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession. Roll up your sleeves to not fall behind.

I’ve spent the last year building a different architecture on top of Claude Code. 35 agents. 68 skills. 234MB of context. This is what I’ve learned.


The Wrong Default

The dominant pattern in AI engineering is the Master Prompt. A single system prompt, a sparse “You are a helpful assistant,” and raw access to everything.

This is the pattern people start with because it’s easy. Google Cloud’s architecture guide explicitly recommends starting simple. Anthropic’s own documentation says to find the simplest solution and only add complexity when needed.

The problem is that “simple” breaks down faster than you’d expect. A single agent tasked with too many responsibilities becomes a “Jack of all trades, master of none.” As instruction complexity increases, adherence to specific rules degrades. Error rates compound. Hallucinations multiply.

Claude Code demonstrates the alternative. When you ask it to search your codebase, it doesn’t run grep -r "pattern" with guessed flags. It calls a Grep tool that wraps ripgrep with optimized parameters. The tool handles encoding. It has sensible defaults.

The execution is deterministic.

Claude Code’s job is to decide what to search for. The Grep tool’s job is to search.

This is what I mean by “skill” throughout this article: a deterministic program that the LLM invokes rather than simulates. File search is the simplest example, but the pattern extends to everything.


The Division

There are two kinds of operations:

Solved problems: Operations we know how to implement reliably.

  • File search. Ripgrep exists. It’s fast, it handles encoding, it has optimized defaults.
  • Test execution. pytest and go test run deterministically.
  • Build validation. The compiler either accepts the code or it doesn’t.
  • YAML parsing. It’s structured data extraction.

Unsolved problems: Operations that require contextual understanding.

  • Understanding WHAT is wrong with a failing Kubernetes pod.
  • Deciding WHICH files are relevant to a bug.
  • Interpreting an error message in the context of a codebase.
  • Connecting symptoms to root causes across multiple subsystems.

You can’t write a program to diagnose every Kubernetes pod failure. ImagePullBackOff because a registry secret expired. OOMKilled because the memory limit was 256Mi and the JVM wanted 512Mi. CrashLoopBackOff because the liveness probe path changed and nobody updated the deployment.

The list is infinite. The variations are contextual. The diagnosis requires understanding.

The Master Prompt asks the LLM to do both. It asks the LLM to simulate file search AND understand what it finds. It asks the LLM to guess at kubectl commands AND interpret Kubernetes events.

This is wrong.

The variance is in the wrong place. The LLM is varying its execution when it should only vary its decisions.

Division of labor: Programs do search, tests, build, parsing. LLMs do understanding, decisions, interpretation, connections.
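
To make the division concrete, here is a minimal sketch (a hypothetical helper, not part of my actual setup): a deterministic test runner the LLM can only point at targets.

```python
import subprocess

def run_tests(test_paths: list[str]) -> dict:
    """Deterministic: run pytest on the given paths with fixed flags, return structured results."""
    result = subprocess.run(
        ["pytest", "--tb=short", "-q", *test_paths],
        capture_output=True, text=True,
    )
    return {
        "passed": result.returncode == 0,
        "exit_code": result.returncode,
        "output": result.stdout[-4000:],  # bounded report so the context stays small
    }

# The LLM's only contribution is the decision: WHICH tests are relevant to the change.
report = run_tests(["tests/test_auth.py"])
```

The program never varies in how it runs tests. The model only varies in what it asks for.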


The New Layer

The architecture that works has four layers. Each layer constrains the one below it.

Layer 1: The Router

The Router doesn’t solve problems. It prevents context pollution.

If I ask to “debug the auth service,” my Router identifies “auth service” as a Go project and “debug” as a task type. It selects the appropriate specialist before any work begins.

The Router’s job is to pick the right domain and methodology. It’s a switchboard, not a solver.

This concept exists in every multi-agent framework. LangChain calls it routing. Azure calls it a coordinator pattern. The idea is the same: classify input and direct it to a specialized follow-up task.
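
As a sketch (hypothetical names, not the actual /do implementation), the dispatch side of a Router can be a plain lookup table. The classification is stochastic; the dispatch is not.

```python
# Hypothetical routing table: a classified request maps to an (Agent, Skill) pair.
ROUTES = {
    ("go", "debug"):         ("golang-engineer", "systematic-debugging"),
    ("kubernetes", "debug"):  ("kubernetes-engineer", "systematic-debugging"),
    ("go", "review"):        ("golang-engineer", "code-review"),
}

def route(domain: str, task: str) -> tuple[str, str]:
    """Return (agent, skill) for a classified request, falling back to a generalist."""
    return ROUTES.get((domain, task), ("general-engineer", "default-workflow"))

agent, skill = route("go", "debug")  # -> ("golang-engineer", "systematic-debugging")
```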

For the full mechanism, I wrote about this in The /do Router.

Layer 2: The Agent

The Agent is not a persona. It’s a Dense Context.

In a Master Prompt system, the agent is defined by “You are helpful.”

In my system, the Agent is defined by its constraints. A golang-engineer Agent contains:

  • Go 1.22+ idioms and standard library patterns.
  • Project-specific architecture decisions.
  • Concurrency anti-patterns to avoid.
  • Error wrapping conventions.

The Agent provides the knowledge required to solve the problem. But not the method of solving it.
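
One way to picture it (a hypothetical representation, not Claude Code’s actual agent file format): the Agent is just a named bundle of knowledge, constraints, and the Skills it can be paired with.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """An Agent is a dense context, not a persona."""
    name: str
    context: list[str]         # knowledge: idioms, architecture decisions, conventions
    anti_patterns: list[str]   # things the model must not do in this domain
    skills: list[str] = field(default_factory=list)  # methodologies it can be paired with

golang_engineer = Agent(
    name="golang-engineer",
    context=[
        "Go 1.22+ idioms and standard library patterns",
        "Project-specific architecture decisions",
        "Error wrapping conventions (fmt.Errorf with %w)",
    ],
    anti_patterns=["goroutine leaks", "unbuffered channel deadlocks"],
    skills=["systematic-debugging", "code-review"],
)
```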

LangChain introduced the concept of Skills as “prompt-driven specializations that an agent can invoke on-demand.” The insight is correct: general-purpose agents like Claude Code use remarkably few tools. Claude Code uses about a dozen. Manus uses fewer than 20. The power isn’t in tool count. It’s in context density.

Layer 3: The Skill

This is where most implementations fail.

They ask the Agent (knowledge) to also derive the process. But domain knowledge and methodology are orthogonal concerns.

The Skill is the Methodology.

A systematic-debugging Skill is a deterministic workflow:

  1. Reproduce. Create a minimal reproduction case.
  2. Isolate. Narrow to the specific component.
  3. Identify. Determine root cause with evidence.
  4. Verify. Confirm the fix without side effects.

This Skill applies to any Agent.

If I attach systematic-debugging to the golang-engineer Agent, the LLM applies Go knowledge through the debugging process. It can’t skip steps. It can’t jump to conclusions.

Phase gates enforce this: “Do NOT proceed to IDENTIFY until you have demonstrated reliable reproduction.”
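
Here is a sketch of what a phase gate looks like when it’s mechanical rather than aspirational (hypothetical structure; in practice the Skill is a prompt, but the gating logic is the same):

```python
from enum import Enum

class Phase(Enum):
    REPRODUCE = 1
    ISOLATE = 2
    IDENTIFY = 3
    VERIFY = 4

class DebuggingSkill:
    """Phase-gated workflow: each step must record evidence before the next unlocks."""
    def __init__(self):
        self.phase = Phase.REPRODUCE
        self.evidence: dict[Phase, str] = {}

    def advance(self, evidence: str) -> Phase:
        if not evidence.strip():
            raise ValueError(f"Cannot leave {self.phase.name}: no evidence provided")
        self.evidence[self.phase] = evidence
        if self.phase is not Phase.VERIFY:
            self.phase = Phase(self.phase.value + 1)
        return self.phase
```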

The industry is catching up to this. A recent GraphBit article describes nearly the same architecture: “deterministic tools, validated execution graphs, and optional LLM orchestration.” The execution is deterministic. The orchestration is stochastic.

Layer 4: Deterministic Programs

This is the foundational rule.

The LLM should not interact with the environment directly.

Yes, Claude Code already does this. These are examples to illustrate the pattern, not novel inventions:

Don’t let the LLM use grep. Give it a code_search() function that wraps ripgrep with optimized flags.

Don’t let the LLM use cat. Give it a read_file() function that handles encoding and truncation.

Don’t let the LLM run kubectl with guessed arguments. Give it structured functions that execute the commands and return parsed output.

The point isn’t that these specific tools are new. The point is that this pattern should be applied to everything. The LLM selects the tool. The tool executes the logic.
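
A minimal sketch of the wrapper idea, assuming ripgrep (rg) is installed (a hypothetical function, not Claude Code’s internal Grep tool):

```python
import subprocess

def code_search(pattern: str, path: str = ".", max_results: int = 50) -> list[str]:
    """Deterministic search: fixed flags, bounded output. The LLM only supplies the pattern."""
    result = subprocess.run(
        ["rg", "--line-number", "--no-heading", "--max-count", str(max_results), pattern, path],
        capture_output=True, text=True,
    )
    # rg exits 1 when nothing matched; that is a valid empty result, not an error.
    return result.stdout.splitlines()[:max_results]
```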

The execution is deterministic. The variance is confined to the selection, not the runtime behavior.

The hierarchy: Router → Agent → Skill → Program


An Example

Walk through: “debug this failing pod.”

Layer 1 (Router): Identifies this as a Kubernetes task with debugging methodology. Routes to kubernetes-engineer Agent with systematic-debugging Skill.

Layer 2 (Agent): Loads K8s context. Pod lifecycle states. Common failure patterns. The relationship between Events, Pod Status, and container logs.

Layer 3 (Skill): Enforces the debugging process. First: reproduce. What is the current state? The Skill won’t let the LLM guess.

Layer 4 (Programs): Deterministic functions execute:

  • get_pod_description() runs kubectl describe pod and parses output
  • get_pod_events() extracts Events from the description
  • get_container_logs() retrieves logs with proper flags
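
A sketch of what one of these could look like (a hypothetical helper; it trades kubectl describe for kubectl get -o json, because JSON is easier to parse deterministically):

```python
import json
import subprocess

def get_pod_status(pod: str, namespace: str = "default") -> dict:
    """Deterministic: fetch the pod as JSON and return only the fields the Agent needs."""
    result = subprocess.run(
        ["kubectl", "get", "pod", pod, "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    manifest = json.loads(result.stdout)
    statuses = manifest["status"].get("containerStatuses", [])
    return {
        "phase": manifest["status"].get("phase"),
        "container_states": [s.get("state", {}) for s in statuses],
    }
```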

The LLM receives structured data:

status: ImagePullBackOff
events:
  - type: Warning
    reason: Failed
    message: "Failed to pull image: unauthorized"
state:
  waiting:
    reason: ImagePullBackOff

Now the LLM does what LLMs do well. It connects the dots.

“ImagePullBackOff because the registry secret expired. The secret registry-creds was last rotated 91 days ago. The registry requires rotation every 90 days.”

The diagnosis requires understanding. The data gathering was mechanical.

Raw data transformed into diagnosis through LLM interpretation


Learning to Hold It

The alien tool came with no manual. Here’s what I’ve figured out.

Everything that can be a program, should be. We know how to write code that searches files reliably. We know how to write code that runs tests. We know how to parse YAML. Don’t ask an LLM to simulate those capabilities. Ask it to orchestrate them.

The question is not “Can the LLM do this?”

The question is “Should the LLM do this?”

If the process is deterministic, write a program.

The LLM doesn’t run tests. It decides which tests to run.

It doesn’t search files. It decides what to search for.

It doesn’t parse YAML. It interprets what the parsed YAML means.

This is where stochastic systems belong: in the decisions that deterministic programs can’t make. The diagnosis. The interpretation. The connection across contexts.

Not the execution.

The magnitude 9 earthquake is still rocking the profession. But the pattern is becoming clear: Router → Agent → Skill → Program. Context density over tool breadth. Deterministic execution wrapped in stochastic orchestration.

This is how you hold the alien tool.