2nd Place: Claude @ MIT Spring 2026 Hackathon
"Agents whose behavior you can read, verify, and trust."
Track: Governance & Collaboration (help people work together better)
Theme: Human-AI teaming through transparent, auditable behavioral specifications
Live Deployment: https://interpretable-autoresearch.pages.dev/
demo.mp4
The problem
AI agents are increasingly taking consequential actions (running experiments, writing code, making autonomous decisions), but their behavior remains opaque. Humans cannot audit what an agent did, why it did it, or whether it aligned with intent.
Three failures identified by MIT CSAIL:
- Unintended decisions: AI systems that act autonomously inevitably diverge from human intent, with no audit trail to diagnose why.
- No value alignment: agents don't inherently understand human values or ethics; their behavior is hidden inside prompts and opaque code.
- Privacy & control risks: agents with broad access and no transparent behavioral contract are a security and governance liability.
Source: MIT CSAIL Alliances, "Agentic AI: What you need to know about AI agents"
Who this affects
Concretely, the people shipping agents today are running into this:
- AI / ML researchers leaving Karpathy-style autoresearch loops running overnight, waking up to a TSV of metrics and no defensible answer to "why did the agent try this?"
- Performance & platform engineers delegating profile-and-optimize work to coding agents, then stuck reviewing 40 commits with no traceable reasoning behind any of them.
- Engineering teams adopting coding agents in production codebases, where "the agent wrote this" is not an answer regulators, security reviewers, or future maintainers will accept.
- Compliance, safety, and governance owners asked to sign off on autonomous systems whose behavior is specified inside prompts they can't read, version, or audit.
The shared pain: when an autonomous agent does something surprising, nobody (not the operator, not the engineer, not the auditor) can replay why. That's the gap this project closes.
Research foundation
We apply "What You See Is What It Does" (Meng & Jackson, SPLASH 2025 ā arXiv:2508.14511), a structural pattern for legible software from MIT CSAIL. The paper proposes two primitives:
Concepts: fully independent services grounded in real-world behavior, not state. Each concept names a lifecycle, exposes actions (past-tense events that have occurred), and derives queryable state from action history. Examples: Reviewing, Citing, Sharing.
Reactions (synchronizations): event-based when / where / then rules that mediate between concepts. Each reaction is simultaneously readable prose and executable code, and every agent action is traceable to a specific reaction.
when:
Experimenting.kept(?prev) OR Experimenting.discarded(?prev)
where:
Experimenting: no experiment is currently running
then:
request Hypothesizing.form(informed_by: ?prev)
when:
Hypothesizing.formed(?hypothesis)
then:
request Modifying.apply(?hypothesis, to: train.py)
when:
Modifying.applied(?change, to: train.py)
where:
Hypothesizing: ?change originates from ?hypothesis
Experimenting: ?hypothesis corresponds to ?experiment
then:
request Committing.commit(?change)
request Experimenting.run(?experiment)
... more reactions relevant to the researcher's actions
This gives us a domain-specific language where behavioral features are granular, declarative, and human-readable, and readily generated or verified by an LLM.
Our solution: behavioral code as the collaboration layer
Every agent (human, research group, or LLM tool) is described by behavioral code: a set of reactions over shared concepts. This creates a legible, auditable contract for every action the agent takes.
How it works
Step 1: the human describes intent casually
"Review my students' paper drafts and email me a summary." No prompts. No code. No system engineering.
Step 2: the system interprets it into behavioral code. Each reaction carries both prose (for humans) and formal DSL (for execution). Any action is traceable to a specific reaction and its author.
Step 3: the agent is deployed and stays legible. Humans can read, modify, or audit the behavioral code at any time. When behavior should change, the code changes, not hidden prompts.
The trust mechanism
Provenance is built in: every action carries a by field identifying which agent made the claim. Other agents verify by inspecting who attested what; no global authority required, no black box.
Acting.acted(action: Reviewing.completed, by: <agent>, args: { artifact: ?artifact })
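For illustration, a reviewer can tally attestations straight from the log. The sketch below assumes each line of events.jsonl is a JSON object with action and by fields, as in the example above; it is not the repo's actual tooling.

```python
import json
from collections import defaultdict

def attestations(log_path="events.jsonl"):
    """Tally which agent attested which actions, using the by field."""
    claims = defaultdict(list)
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            claims[event["by"]].append(event["action"])
    return claims

if __name__ == "__main__":
    for agent, actions in attestations().items():
        print(f"{agent}: {len(actions)} actions, e.g. {sorted(set(actions))}")
```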
What works today
The repo ships two end-to-end runnable autoresearch loops driven by coding agents operating against a behavioral-code program.md. Both produce a real, append-only events.jsonl you can inspect, replay, and audit.
interpretable-autoresearch/
├── model-training/            # autoresearch loop over a small LLM training script
└── performance-engineering/   # autoresearch loop over a C++ N-body simulator
What you can verify yourself:
- Each loop is bootable in under five minutes (uv sync / make) and runs an actual baseline experiment on commodity hardware (Apple Silicon Mac or single NVIDIA GPU).
- Each loop emits a typed event per action (Hypothesizing.formed, Modifying.applied, Experimenting.run, Evaluating.measured, Logging.recorded) with a caused_by chain. Open events.jsonl and read straight down: every line tells you who did it, what they did, and which earlier event triggered it (a minimal audit sketch follows this list).
- Each hypothesis records its prediction (direction, magnitude, mechanism, side_effects) before the experiment runs, and each Logging.recorded records outcome_vs_prediction after. The log is therefore not retrofittable: you cannot quietly rewrite history to look smarter than you were.
- Every keep/revert is a real git commit / git reset --hard HEAD~1 against a branch named autoresearch/<tag>. The git history matches the event log.
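Here is the minimal audit sketch referenced in the list above. It assumes each line of events.jsonl carries event_id, action, by, and caused_by fields as described in this section; the exact schema in the repo may differ slightly.

```python
import json

def audit(log_path="events.jsonl"):
    """Read events.jsonl top to bottom and check the causal chain:
    every caused_by must point at an event_id that appeared earlier."""
    seen = set()
    with open(log_path) as f:
        for n, line in enumerate(f, 1):
            event = json.loads(line)
            cause = event.get("caused_by")
            if cause is not None and cause not in seen:
                print(f"line {n}: {event['action']} cites an unknown or later event {cause}")
            seen.add(event["event_id"])
            print(f"{n:4d}  {event.get('by', '?'):<24}  {event['action']:<26}  caused_by={cause}")

if __name__ == "__main__":
    audit()
```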
Use cases: what a real user gets out of this
1. Karpathy Autoresearch (this repo, model-training/)
A researcher leaves an agent running overnight. By morning they have: a branch of attempted experiments, an events.jsonl whose every line is a typed event with a caused_by reference, and, crucially, a record of which hypotheses predicted what and how reality answered. Stuck on a result? Open the log, find the Hypothesizing.formed event, read the prediction.mechanism field, find the matching Logging.recorded.outcome_vs_prediction, and see exactly where the agent's mental model diverged from reality.
Example reaction chain (from model-training/):
Experimenting.kept(exp-007) → Hypothesizing.formed(?h, prediction) →
Modifying.applied → Experimenting.run → Evaluating.measured →
Experimenting.kept | Experimenting.discarded → Logging.recorded(outcome_vs_prediction)
Each arrow is a separate, readable reaction. Each can be inspected, paused, or overridden by the human, and each event is one line in events.jsonl whose caused_by points at its trigger.
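As a sketch of that 8 a.m. review: pair each hypothesis with the outcome that answered it. The hypothesis_id join key used here is a hypothetical field name; the repo may link the two events through caused_by instead.

```python
import json

def morning_review(log_path="events.jsonl"):
    """Pair each Hypothesizing.formed prediction with the Logging.recorded
    outcome that answered it. hypothesis_id is a hypothetical join key."""
    predictions, outcomes = {}, {}
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event["action"] == "Hypothesizing.formed":
                predictions[event["event_id"]] = event["prediction"]
            elif event["action"] == "Logging.recorded":
                outcomes[event.get("hypothesis_id")] = event["outcome_vs_prediction"]
    for hid, prediction in predictions.items():
        print(f"hypothesis {hid}")
        print(f"  predicted mechanism  : {prediction.get('mechanism')}")
        print(f"  outcome vs prediction: {outcomes.get(hid, '(never evaluated)')}")

if __name__ == "__main__":
    morning_review()
```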
2. Software performance optimization (this repo, performance-engineering/)
A platform engineer hands the agent a slow service and a benchmark. Behavioral code makes the agent's reasoning readable: each optimization decision maps to a declared reaction the engineer can review, approve, or reject. The Discovering and Profiling concepts force the agent to justify every change against measured cost attribution; there's no "I felt like rewriting this with SIMD", because the rule says hypotheses cite a recent profile or they don't fire.
Putting humans in control
The whole design point is that humans stay in charge by editing legible code, not by guarding a black box.
- You program the agent in Markdown. program.md is the agent's behavior. Want different behavior? Edit it. No fine-tuning, no prompt-spelunking, no custom harness. The behavioral code is yours, versioned in your git repo, reviewable in PR.
- The human is a first-class event. The Communicating concept means "the user told the agent X" is recorded as Communicating.received, with subsequent reactions citing that event in caused_by. Off-record nudges don't exist; if you steered the agent, the log shows it (see the sketch after this list).
- Pause, override, or revert at any line. Every modification is a real git commit, every revert is a real git reset --hard HEAD~1, and every event has an event_id. You can stop the loop, edit program.md, and resume: the next reaction tail-reads events.jsonl and continues.
- Predictions can't be retroactively edited. Because Hypothesizing.formed is appended before the experiment runs, the agent can't quietly rewrite its own predictions to match the outcome. The log is a tamper-evident learning record, not a sanitized PR description.
- The autonomy is bounded by the program. Reactions only fire when their when / where conditions match. The agent has no ambient action; it cannot do something that isn't reachable from a reaction in program.md. Restricting agent capability is a code edit, not a prompt patch.
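To make "the human is a first-class event" concrete, here is a hedged sketch of how a steer could be appended to the log so later reactions can cite it. The field names follow this README; the helper itself is illustrative, not part of the repo.

```python
import json
import time
import uuid

def record_human_steer(message, log_path="events.jsonl"):
    """Append a human instruction as a Communicating.received event so that
    subsequent reactions can cite it in caused_by. Illustrative only."""
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "action": "Communicating.received",
        "by": "human-operator",
        "args": {"message": message},
        "caused_by": None,  # a human steer starts a new causal chain
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event["event_id"]

# Example: steer the overnight loop, then let the next reaction cite this event_id.
# record_human_steer("Stop exploring learning-rate changes; focus on the optimizer.")
```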
The researcher still owns the research questions, the engineer still owns the design choices, the agent does the legible labor of trying things and writing down why.
Why this matters for governance & collaboration
This project addresses the core governance challenge of agentic AI: accountability. Most approaches treat human oversight as a feature bolted on after deployment. We treat it as a structural property of the language itself.
- Agents cannot act outside their behavioral code; there is no ambient action (a minimal check is sketched after this list).
- Every action is attributable to a specific reaction authored by a specific agent (by: autoresearch-<tag>), with a caused_by chain back to its trigger.
- Modifying agent behavior requires changing legible, versioned code (program.md), not hunting through prompts.
- Multiple agents collaborate through shared concepts, making the interface between them readable to humans.
- Predictions are recorded before outcomes (Hypothesizing.formed.prediction) and explicitly compared after (Logging.recorded.outcome_vs_prediction), so the log captures mechanism understanding, not only metric deltas.
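A minimal version of the first check above (no ambient action) could look like the following. It assumes, hypothetically, that each logged event records the reaction that fired it in a reaction field and that program.md names its reactions R0, R1, ...; the repo's actual schema may differ.

```python
import json
import re

def check_no_ambient_action(program_path="program.md", log_path="events.jsonl"):
    """Flag any logged event whose firing reaction is not declared in program.md.
    Hypothetical schema: events carry a `reaction` field such as "R3"."""
    declared = set(re.findall(r"\bR\d+\b", open(program_path).read()))
    with open(log_path) as f:
        for n, line in enumerate(f, 1):
            event = json.loads(line)
            reaction = event.get("reaction")
            if reaction is not None and reaction not in declared:
                print(f"line {n}: {event['action']} fired by undeclared reaction {reaction}")

if __name__ == "__main__":
    check_no_ambient_action()
```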
This directly enables the kind of human-AI collaboration where trust is earned incrementally and verified continuously ā not assumed.
Risks we took seriously
We are not claiming this solves agent safety. We are claiming it makes a specific class of agent failures catchable instead of invisible. Here is what we wrestled with:
- Risk: agents lying in the log. An agent could fabricate a Hypothesizing.formed.reasoning field after the fact to match a result that worked. What the design does about it: events are append-only and timestamped; predictions are required before the run; and outcome_vs_prediction requires the agent to explicitly compare. A retrofitted prediction is detectable in the timestamp ordering and in the caused_by chain. Combined with git commit timestamps, the log is replayable evidence.
- Risk: scope creep, or the agent touching the wrong files. A coding agent given shell access can modify anything on disk. What the design does about it: Modifying.applied declares its target file (to: train.py for model-training, scoped to src/ for performance-engineering). The reaction R3 won't fire on out-of-scope edits, and any out-of-scope change shows up as an unattributed git diff with no Modifying.applied event, i.e. the inconsistency is visible to a human reader (a minimal cross-check is sketched after this list).
- Risk: over-trust. A researcher reads only the metric column of the log and assumes the agent figured something out, when really it stumbled into a noise-floor win or a cache-effect speedup it doesn't understand. What the design does about it: the outcome_vs_prediction field is mandatory and explicitly invites disagreement ("metric matched but mechanism unclear: speedup may have come from cache effects, not the change I proposed"). The performance loop additionally enforces a significance flag against a recorded noise floor; sub-noise wins are required to be discarded, not kept.
- Risk: deskilling and displacement. If agents do all the experimentation, junior researchers and engineers lose the practice that builds expertise. What we believe: this design is closer to a teaching artifact than a black-box assistant. The events.jsonl is exactly the kind of structured lab notebook a junior researcher should keep: predictions before outcomes, mechanisms named explicitly, mistakes acknowledged in writing. Reading an agent's log is itself instructive in a way that reading a TSV of metrics is not.
- Risk: false sense of governance. "We have an audit log" is not the same as "we have safety." What we are honest about: the log catches behavioral divergence (the agent did something that doesn't trace back to a reaction; the prediction was wrong; the mechanism was wrong). It does not prevent prompt injection, model-level deception, or scenarios where the agent is sandbagging in plausible-looking events. Those require complementary work (sandboxing, capability restriction, alignment evaluations) that this project does not claim to do. What we offer is a structural property that other safety work can build on, not a substitute for it.
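The scope-creep cross-check mentioned above can be sketched in a few lines: compare the files changed on the agent's branch with the targets declared by Modifying.applied events. The branch name and the args.to field here are assumptions taken from this README.

```python
import json
import subprocess

def undeclared_edits(log_path="events.jsonl", base="main", branch="autoresearch/run-1"):
    """List files changed on the agent's branch that no Modifying.applied event
    declared as its target. Branch name and the args.to field are assumptions."""
    declared = set()
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event["action"] == "Modifying.applied":
                declared.add(event["args"]["to"])
    changed = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{branch}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return [path for path in changed if path not in declared]

if __name__ == "__main__":
    for path in undeclared_edits():
        print(f"changed on the branch but never declared: {path}")
```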
References
- Meng, E. & Jackson, D. (2025). What You See Is What It Does: A Structural Pattern for Legible Software. Onward! at SPLASH 2025. arXiv:2508.14511
- MIT CSAIL Alliances. Agentic AI: What you need to know about AI agents. cap.csail.mit.edu
- Karpathy, A. autoresearch. github.com/karpathy/autoresearch
Appendix
model-training/: Karpathy-style LLM autoresearch
A simplified single-GPU LLM training setup (a fork of Karpathy's nanochat lineage, with macOS / Apple Silicon MPS support added) wrapped in a behavioral-code program. The agent is handed train.py and a fixed wall-clock training budget, and it iterates: form a hypothesis about the model/optimizer, modify train.py, train, evaluate val_bpb, keep or revert, log the outcome against its prediction. Repeat overnight.
Layout
- program.md: the agent's instructions, expressed as concepts (Experimenting, Hypothesizing, Modifying, Evaluating, Logging, Communicating) and reactions R1–R7. This is the file a human edits to change agent behavior.
- prepare.py: fixed constants, data download, tokenizer training, dataloader, evaluation harness. Not modified by the agent. TIME_BUDGET lives here (currently 30 s for fast prototyping; upstream uses 300 s).
- train.py: single-file GPT model + Muon/AdamW optimizer + training loop. The only file the agent edits.
- events.jsonl: append-only event log produced by the agent (untracked, regenerated per run).
- run.log: most recent uv run train.py output, used by the agent to extract val_bpb and detect crashes.
- analysis.ipynb, progress.png: human-side inspection of the run.
- original/: upstream-style reference program.md (5-minute budget, free-form loop), kept for diff against the behavioral-code version.
- CHANGES.md: notes on the local prototype delta vs. upstream (time budget, behavioral-code framing, MPS support).
Quick start (Apple Silicon Mac or single NVIDIA GPU; Python 3.10+; uv)
cd model-training
uv sync
uv run prepare.py   # one-time data + tokenizer prep, ~2 min
uv run train.py     # one manual baseline experiment, ~30 s + startup/eval
Then point a coding agent (Claude / Codex / etc.) at program.md and let it run autonomously. See model-training/README.md for full details.
Why the behavioral-code version is better than "agent + TSV". The agent is not running a free-form "edit, train, log to TSV" loop. It is a reaction interpreter: at each step it tails events.jsonl, matches a when clause, and fires the corresponding then. Every hypothesis carries an explicit prediction; every Logging.recorded event carries outcome_vs_prediction. The log is a record of learning, not just of metrics, and that's exactly what a researcher needs at 8 a.m. when they want to know what the agent figured out overnight.
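To make "reaction interpreter" concrete, here is a stripped-down sketch of one interpreter step. In the actual repo the loop is driven by a coding agent reading program.md; the Reaction structure and field names below are illustrative assumptions.

```python
import json
import uuid
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reaction:
    name: str                       # e.g. "R2"
    prose: str                      # the human-readable when/where/then sentence
    when: Callable[[dict], bool]    # does the latest event trigger this reaction?
    then: Callable[[dict], dict]    # build the next event to append

def step(reactions, log_path="events.jsonl"):
    """One interpreter step: tail the log, find a reaction whose `when`
    matches the latest event, fire its `then`, and append the new event."""
    with open(log_path) as f:
        events = [json.loads(line) for line in f]
    if not events:
        return None
    latest = events[-1]
    for reaction in reactions:
        if reaction.when(latest):
            new_event = reaction.then(latest)
            new_event.setdefault("event_id", str(uuid.uuid4()))
            new_event["caused_by"] = latest["event_id"]
            new_event["reaction"] = reaction.name
            with open(log_path, "a") as f:
                f.write(json.dumps(new_event) + "\n")
            return new_event
    return None
```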
performance-engineering/: autoresearch over a C++ codebase
A deliberately unoptimized 3-D gravitational N-body simulator (src/nbody.cpp: O(N²) pairwise forces, AoS layout, no Newton's third law, single-threaded) plus an end-to-end benchmark harness. The agent is dropped into the repo cold: it must first discover the codebase, write or adopt a benchmark, establish a noise floor, then loop on profile → hypothesize → modify → measure → keep-or-discard.
Layout
- program.md: agent instructions over concepts Discovering, Profiling, Experimenting, Hypothesizing, Modifying, Evaluating, Logging, Communicating and reactions R0–R8. Notable additions vs. model-training: a one-shot Discovering reaction at the start, and a Profiling lifecycle that gates every hypothesis on a recent measurement.
- bench_e2e.py: Python harness that builds and runs src/nbody, computes a median wall-clock time over N runs, and verifies a position-weighted checksum against the baseline as a correctness anchor. Prints flat key: value lines for the agent's Evaluating.measure step.
- events.jsonl: append-only event log.
- src/nbody.cpp: the C++ simulator. Fair game for the agent: algorithms, data layout, vectorization, parallelization.
- Makefile: -O3 -std=c++17 -march=native -fopenmp. Builds ./nbody.
- visualize.py: matplotlib trajectory viewer for human sanity checks; depends on the -dump binary format (changing the format requires updating this file).
- README.md: full description of the simulator, CLI, output format, and the contract the agent must preserve (checksum semantics, CLI flags, make target).
Quick start
cd performance-engineering
make -C src                    # build ./src/nbody
python bench_e2e.py --runs 5   # establish a baseline + noise floor
Then point an agent at program.md. The agent's first reaction (R0) is Discovering.discover: it walks src/, reads the README, decides whether to use bench_e2e.py or write its own harness, and records its codebase map, hot-path hypothesis, and noise floor as a single Discovering.completed event. Everything after that cites back to it.
Why the behavioral-code version is better than "agent + TSV". Performance work is harder to make interpretable than model training: the bottleneck is unknown, the benchmark may not exist, and finding the right thing to change is most of the work. The reactions enforce two disciplines that orthodox loops skip:
- Profile-grounded hypotheses. Hypothesizing.formed must cite a recent Profiling.profiled event and a specific function attribution. No guessing at hot paths.
- Noise-aware keeps. Evaluating.measured carries a significance flag against the noise floor recorded at discovery. A speedup within run-to-run variance is below_noise_floor and gets reverted, not kept. (This is the kind of mistake an unsupervised agent will otherwise make and accidentally "win" with; a minimal sketch of the rule follows this list.)
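A minimal sketch of the noise-aware keep rule, assuming the noise floor recorded at discovery and the measured times are in seconds; the repo's exact flag values may differ.

```python
def classify_speedup(baseline_s, candidate_s, noise_floor_s):
    """Classify a measured change against the noise floor recorded at discovery.
    A delta within run-to-run variance is below_noise_floor and should be
    discarded (reverted), not kept."""
    delta = baseline_s - candidate_s            # positive means the candidate is faster
    if abs(delta) <= noise_floor_s:
        return "below_noise_floor"
    return "significant_speedup" if delta > 0 else "significant_regression"

# Example: a 0.2 s "win" inside a 0.3 s noise floor gets discarded.
print(classify_speedup(baseline_s=10.0, candidate_s=9.8, noise_floor_s=0.3))
# -> below_noise_floor
```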