Everything That Can Be Deterministic, Should Be: My Claude Code Setup

Andrej Karpathy wrote this:

I’ve never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue.
There’s a new programmable layer of abstraction to master (in addition to the usual layers below) involving agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations, and a need to build an all-encompassing mental model for strengths and pitfalls of fundamentally stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering.
Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession. Roll up your sleeves to not fall behind.

I’ve spent a year building on top of Claude Code. 35 agents. 68 skills. 234MB of context. Here’s what I’ve figured out so far.

The Wrong Default

I keep seeing the same setup. One system prompt. “You are a helpful assistant.” Raw access to everything.

People start here because it’s easy. Google Cloud’s architecture guide says start simple. Anthropic’s own documentation says find the simplest solution and add complexity only when you need it. So everyone does.

And it works, for a while. But I found that “simple” breaks down faster than you’d expect. One agent with too many responsibilities starts confusing things. More instructions mean less adherence to any specific rule. Errors compound.

Here’s what got me thinking differently. Claude Code already does the right thing in miniature. When you ask it to search your codebase, it doesn’t shell out to grep -r "pattern" with guessed flags. It calls a Grep tool that wraps ripgrep with optimized parameters. The tool handles encoding. It has sensible defaults.

The execution is deterministic.

Claude Code’s job is deciding what to search for. The Grep tool’s job is searching.

I use the word “skill” throughout this article to mean a deterministic program that the LLM calls rather than tries to simulate. File search is the simplest example, but the pattern extends to everything I do.

The Division

I think about operations in two buckets.

Solved problems. Things we know how to do reliably with code.

File search. Ripgrep exists. It’s fast, it handles encoding, it has optimized defaults.
Test execution. pytest and go test run deterministically.
Build validation. The compiler either accepts the code or it doesn’t.
YAML parsing. It’s structured data extraction.

Unsolved problems. Things that need contextual understanding.

Understanding WHAT is wrong with a failing Kubernetes pod.
Deciding WHICH files are relevant to a bug.
Interpreting an error message in the context of a codebase.
Connecting symptoms to root causes across multiple subsystems.

You can’t write a program to diagnose every Kubernetes pod failure. ImagePullBackOff because a registry secret expired. OOMKilled because the memory limit was 256Mi and the JVM wanted 512Mi. CrashLoopBackOff because the liveness probe path changed and nobody updated the deployment.

The list is infinite. The variations are contextual.

A single master prompt asks the LLM to handle both. Simulate file search AND understand what it finds. Guess at kubectl commands AND interpret Kubernetes events.

The variance ends up in the wrong place. The LLM is varying its execution when it should only vary its decisions.

Division of labor: Programs do search, tests, build, parsing. LLMs do understanding, decisions, interpretation, connections.
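To make the "programs" side of that division concrete, here is a minimal sketch of test execution as a deterministic function. The names run_tests, build_pytest_command, and PYTEST_EXIT are hypothetical, and it assumes pytest is installed in the current interpreter. The caller (the LLM) only supplies which test IDs to run; the command line and the interpretation of the exit code never vary.

```python
import subprocess
import sys

# pytest's documented exit codes, mapped to stable labels.
PYTEST_EXIT = {
    0: "passed",
    1: "failed",
    2: "interrupted",
    3: "internal_error",
    4: "usage_error",
    5: "no_tests_collected",
}


def build_pytest_command(test_ids):
    """Build the exact command line; the caller only chooses the test IDs."""
    return [sys.executable, "-m", "pytest", "-q", *test_ids]


def run_tests(test_ids):
    """Run the selected tests and return a small structured result."""
    proc = subprocess.run(
        build_pytest_command(test_ids),
        capture_output=True,
        text=True,
        check=False,  # a failing test run is a result, not an exception
    )
    return {
        "outcome": PYTEST_EXIT.get(proc.returncode, "unknown"),
        "exit_code": proc.returncode,
        "output_tail": proc.stdout[-2000:],  # keep the context small
    }
```

The decision (which tests) stays with the model; everything after that is ordinary code.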

The Four Layers

I ended up building four layers. Each one constrains the one below it.

Layer 1: The Router

The Router doesn’t solve problems. It prevents context pollution.

If I ask to “debug the auth service,” my Router identifies “auth service” as a Go project and “debug” as a task type. It picks the right specialist before any work starts.

Think of it like a switchboard. You call in, someone figures out which department can help, and connects you. The switchboard operator doesn’t fix your dishwasher.

This concept exists everywhere. LangChain calls it routing. Azure calls it a coordinator pattern. I wrote about how mine works in The /do Router.
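A rough sketch of the switchboard idea, under some assumptions: the classification of a request into a project type and a task type has already happened (that part needs understanding), and the lookup tables below are hypothetical. It is an illustration of the shape, not the /do Router itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    agent: str
    skill: str


# Hypothetical lookup tables; a real setup would have many more entries.
PROJECT_AGENTS = {
    "go": "golang-engineer",
    "kubernetes": "kubernetes-engineer",
}
TASK_SKILLS = {
    "debug": "systematic-debugging",
}


def route(project_type: str, task_type: str) -> Route:
    """Map an already-classified request to a specialist and a methodology."""
    try:
        return Route(PROJECT_AGENTS[project_type], TASK_SKILLS[task_type])
    except KeyError as exc:
        # Fail loudly instead of falling back to a generic agent.
        raise ValueError(f"no route for {exc.args[0]!r}") from exc
```

For "debug the auth service", classified as a Go project and a debug task, route("go", "debug") returns the golang-engineer agent paired with systematic-debugging. The router does no work beyond that.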

Layer 2: The Agent

In a master prompt system, the agent is basically “You are helpful.” In my system, the Agent is defined by its constraints. A golang-engineer Agent contains:

Go 1.22+ idioms and standard library patterns.
Project-specific architecture decisions.
Concurrency anti-patterns to avoid.
Error wrapping conventions.

The Agent provides the knowledge to solve the problem. Not the method.
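One way to picture "an Agent is defined by its constraints" is as plain data that renders into a prompt. This is a sketch under assumptions: the Agent class and its system_prompt method are hypothetical, and the knowledge entries are just the bullet points above.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Domain knowledge only; no process or methodology lives here."""

    name: str
    knowledge: list[str] = field(default_factory=list)

    def system_prompt(self) -> str:
        # Render the constraints as a short bulleted prompt.
        lines = [f"You are the {self.name} agent. Apply this domain knowledge:"]
        lines += [f"- {item}" for item in self.knowledge]
        return "\n".join(lines)


golang_engineer = Agent(
    name="golang-engineer",
    knowledge=[
        "Go 1.22+ idioms and standard library patterns.",
        "Project-specific architecture decisions.",
        "Concurrency anti-patterns to avoid.",
        "Error wrapping conventions.",
    ],
)
```

Nothing in this structure says how to debug or how to review. That is deliberate.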

LangChain introduced the concept of Skills as “prompt-driven specializations that an agent can invoke on-demand.” The insight is right: general-purpose agents like Claude Code use remarkably few tools. Claude Code uses about a dozen. Manus uses fewer than 20. The power isn’t in tool count. It’s in how dense the context is.

Layer 3: The Skill

Most implementations fail here.

They ask the Agent (knowledge) to also figure out the process. But domain knowledge and methodology are different concerns. I kept hitting this wall myself until I separated them.

A systematic-debugging Skill is a deterministic workflow:

Reproduce. Create a minimal reproduction case.
Isolate. Narrow to the specific component.
Identify. Determine root cause with evidence.
Verify. Confirm the fix without side effects.

This Skill applies to any Agent.

If I attach systematic-debugging to the golang-engineer Agent, the LLM applies Go knowledge through the debugging process. It can’t skip steps. It can’t jump to conclusions. Phase gates enforce this: “Do NOT proceed to IDENTIFY until you have demonstrated reliable reproduction.”
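Phase gates can live in prompt text, as in the quoted instruction, but the same idea can also be enforced in code. The sketch below is an assumption about how that might look, not the actual Skill: the class, the exception, and the evidence requirement are all hypothetical. The phases must be completed in order, and each one needs non-empty evidence.

```python
from typing import Optional


class PhaseGateError(RuntimeError):
    """Raised when a phase is attempted out of order or without evidence."""


PHASES = ("REPRODUCE", "ISOLATE", "IDENTIFY", "VERIFY")


class SystematicDebugging:
    def __init__(self) -> None:
        self._evidence: dict[str, str] = {}

    @property
    def current_phase(self) -> Optional[str]:
        """The first phase without recorded evidence, or None when done."""
        for phase in PHASES:
            if phase not in self._evidence:
                return phase
        return None

    def complete(self, phase: str, evidence: str) -> None:
        expected = self.current_phase
        if expected is None:
            raise PhaseGateError("workflow already complete")
        if phase != expected:
            raise PhaseGateError(
                f"Do NOT proceed to {phase} until {expected} is complete"
            )
        if not evidence.strip():
            raise PhaseGateError(f"{phase} requires evidence")
        self._evidence[phase] = evidence
```

Because the class knows nothing about Go or Kubernetes, the same workflow can be attached to any Agent.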

The industry is moving toward the same idea. A recent GraphBit article describes nearly the same architecture: “deterministic tools, validated execution graphs, and optional LLM orchestration.” The execution is deterministic. The orchestration is stochastic.

Layer 4: Deterministic Programs

The foundational rule. The LLM should not interact with the environment directly.

Claude Code already does this for some things. I’m arguing it should be applied to everything.

Don’t let the LLM use grep. Give it a code_search() function that wraps ripgrep with optimized flags.
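A minimal sketch of such a code_search() function, assuming ripgrep is installed as rg. The flag choices here are illustrative assumptions, not the flags Claude Code's Grep tool uses, and build_rg_command and parse_rg_json are hypothetical helper names. The sketch relies on ripgrep's --json output, which emits one JSON event per line.

```python
import json
import subprocess


def build_rg_command(pattern, path="."):
    """Fixed flags; the caller only chooses the pattern and the path."""
    return ["rg", "--json", "--smart-case", "--max-count", "50", "-e", pattern, path]


def parse_rg_json(output):
    """Keep only 'match' events from ripgrep's JSON Lines output."""
    matches = []
    for line in output.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") != "match":
            continue  # skip begin/end/summary events
        data = event["data"]
        matches.append({
            # "text" is absent when ripgrep reports non-UTF-8 bytes instead
            "path": data["path"].get("text"),
            "line": data["line_number"],
            "text": data["lines"].get("text", "").rstrip("\n"),
        })
    return matches


def code_search(pattern, path="."):
    proc = subprocess.run(
        build_rg_command(pattern, path),
        capture_output=True,
        text=True,
        check=False,
    )
    # ripgrep exits 0 on matches, 1 on no matches, 2 on error.
    if proc.returncode not in (0, 1):
        raise RuntimeError(proc.stderr.strip())
    return parse_rg_json(proc.stdout)
```

The model never sees or edits the flags. It gets back a list of path, line, and text records.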

Don’t let the LLM use cat. Give it a read_file() function that handles encoding and truncation.
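Here is one possible shape for read_file(), as a sketch. The UTF-8-then-Latin-1 fallback and the 20,000 character default are assumptions chosen for illustration; a real tool might detect encodings differently or truncate by lines or tokens.

```python
from pathlib import Path


def read_file(path, max_chars=20_000):
    """Read a file as text with a predictable encoding and size policy."""
    raw = Path(path).read_bytes()
    try:
        text = raw.decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        text = raw.decode("latin-1")
        encoding = "latin-1"
    return {
        "path": str(path),
        "encoding": encoding,
        "truncated": len(text) > max_chars,  # tell the caller content was cut
        "content": text[:max_chars],
    }
```

The "truncated" flag matters: the model can decide to ask for more, instead of silently reasoning over half a file.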

Don’t let the LLM run kubectl with guessed arguments. Give it structured functions that execute the commands and return parsed output.

The LLM selects the tool. The tool does the work. The variance is confined to the selection, not the runtime behavior.
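"The LLM selects the tool" can be sketched as a registry plus a dispatcher. Everything named here is hypothetical: the tool decorator, the dispatch function, the shape of the call dictionary, and the placeholder word_count tool. The point is only that the model's output is a name and arguments, and unknown names are rejected rather than improvised.

```python
from typing import Any, Callable

# Registry of deterministic functions the model is allowed to select.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def word_count(text: str) -> int:
    # Placeholder tool so the sketch is self-contained.
    return len(text.split())


def dispatch(call: dict) -> Any:
    """Execute a selection of the form {"tool": name, "args": {...}}."""
    name = call.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**call.get("args", {}))
```

With this shape, dispatch({"tool": "word_count", "args": {"text": "a b c"}}) runs the registered function, and a request for an unregistered tool fails before anything touches the environment.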

The hierarchy: Router → Agent → Skill → Program

An Example

Walk through: “debug this failing pod.”

Layer 1 (Router): Identifies this as a Kubernetes task with debugging methodology. Routes to kubernetes-engineer Agent with systematic-debugging Skill.

Layer 2 (Agent): Loads K8s context. Pod lifecycle states. Common failure patterns. The relationship between Events, Pod Status, and container logs.

Layer 3 (Skill): Enforces the debugging process. First: reproduce. What is the current state? The Skill won’t let the LLM guess.

Layer 4 (Programs): Deterministic functions execute:

get_pod_description() runs kubectl describe pod and parses output
get_pod_events() extracts Events from the description
get_container_logs() retrieves logs with proper flags
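A sketch of what one of these functions could look like. Note the swap: instead of parsing the text output of kubectl describe pod, this version reads kubectl get pod -o json, which is easier to parse reliably. The names get_pod_status and summarize_pod are hypothetical, and the summary fields are an assumption about what is worth passing to the model.

```python
import json
import subprocess


def summarize_pod(pod: dict) -> dict:
    """Reduce a Pod object to the few fields a diagnosis usually needs."""
    status = pod.get("status", {})
    containers = []
    for cs in status.get("containerStatuses", []):
        state = cs.get("state", {})
        # The state object holds one key: waiting, running, or terminated.
        kind = next(iter(state), "unknown")
        containers.append({
            "name": cs.get("name"),
            "state": kind,
            "reason": state.get(kind, {}).get("reason"),
            "restarts": cs.get("restartCount", 0),
        })
    return {"phase": status.get("phase"), "containers": containers}


def get_pod_status(name: str, namespace: str = "default") -> dict:
    """Run kubectl with fixed arguments and return the parsed summary."""
    proc = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,  # a kubectl failure is an error, not data
    )
    return summarize_pod(json.loads(proc.stdout))
```

Keeping the parsing in summarize_pod, separate from the subprocess call, means the parsing can be tested against saved Pod JSON without a cluster.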

The LLM receives structured data:

status: ImagePullBackOff
events:
  - type: Warning
    reason: Failed
    message: "Failed to pull image: unauthorized"
state:
  waiting:
    reason: ImagePullBackOff

Now the LLM does what LLMs are actually good at. It connects the dots.

“ImagePullBackOff because the registry secret expired. The secret registry-creds was last rotated 91 days ago. The registry requires rotation every 90 days.”

The diagnosis requires understanding. The data gathering was mechanical.

Raw data transformed into diagnosis through LLM interpretation

What I’ve Figured Out So Far

I don’t have this all solved. I’m still iterating on everything, and half of what I’ve built might be obsolete in three months.

But the core pattern keeps holding up. We know how to write code that searches files reliably. We know how to run tests. So I stopped asking the LLM to simulate those things and started asking it to orchestrate them instead.

The question is not “Can the LLM do this?”

The question is “Should the LLM do this?”

If the process is deterministic, write a program.

The LLM doesn’t run tests. It decides which tests to run.

It doesn’t search files. It decides what to search for.

I keep the stochastic stuff where it belongs: in the decisions that deterministic programs can’t make. The diagnosis and the interpretation.

Not the execution.

I’ve settled on Router, Agent, Skill, Program. Context density over tool breadth. Deterministic execution wrapped in stochastic orchestration. Whether that’s the right pattern for everyone, I genuinely don’t know. It’s been the right pattern for me, working across Go and Python daily, and it saves me from re-thinking the same problems over and over.

But we’re all in the dark searching around for ideas. None of us know what the answers will be.