Evolutionary code discovery for Claude Code. No API keys, no cloud infrastructure, no extra services — just the claude CLI you already use.
claude-evolve is a native Claude Code plugin that reimplements the ShinkaEvolve algorithm (ICLR 2026) using Claude's built-in model tiers (opus, sonnet, haiku) at different thinking-effort levels as the LLM ensemble. It maintains populations of programs across islands, generates mutations via a UCB1 bandit over (model × effort) arms, and evolves code autonomously toward any fitness function you can evaluate with real code execution.
On the 26-circle packing benchmark from the ShinkaEvolve paper, claude-evolve reaches 99.94% of the reference state-of-the-art score (2.6343 vs 2.6360) in 50 generations with 3 islands — starting from a naive initial solution scoring 1.8378.
Table of Contents
- What It Does
- Requirements
- Installation
- Quick Start (3 commands)
- Core Concepts
- The Four Skills
- Writing initial.py
- Writing evaluate.py
- Writing config.json
- Running an Evolution
- Monitoring and Visualization
- Configuration Reference
- Architecture
- Benchmark: Circle Packing
- Troubleshooting
- FAQ
What It Does
You give claude-evolve three things:
- An initial program with markers around the part you want to optimize
- An evaluator script that runs the program and returns a fitness score
- A config specifying the evolutionary budget
It runs an autonomous loop that:
- Asks Claude models to mutate your code (diff patches, full rewrites, crossover)
- Runs your evaluator on each mutation to measure fitness
- Keeps the best programs in an archive across multiple island populations
- Migrates elites between islands periodically
- Rejects near-duplicate proposals via AST-based novelty detection
- Uses a UCB1 bandit to learn which (model × effort) combinations produce the biggest improvements
- Periodically generates meta-recommendations from top programs and injects them into future prompts
After N generations, the best program in the archive is your optimized solution.
Requirements
- Claude Code CLI (the claude command) with an authenticated account
- Python 3.10 or newer on PATH
- pip to install the claude-evolve Python package
- Unix-like environment (Linux, macOS, WSL) or native Windows. Native Windows follows a different setup path (PowerShell commands and the direct Python CLI), covered in the Windows-specific instructions in this README.
- ~200 MB disk space for dependencies (numpy, scipy, aiosqlite, mcp)
You do not need:
- An Anthropic API key, OpenAI key, or any other external service credential
- A GPU
- Docker, SLURM, or other orchestration
- A network connection beyond what Claude Code already needs
Installation
Option 1: Claude Code plugin marketplace (recommended)
Run each command below in its own Claude Code message — do not paste them as a single block, since Claude Code interprets only the first slash command on a line.
1. Add the marketplace:
/plugin marketplace add https://github.com/samuelzxu/claude-evolve
2. Install the plugin:
/plugin install claude-evolve@evolve
3. Reload plugins so the new skills and MCP server register:
(Alternatively, restart Claude Code. Either way is required after any plugin install before its skills become callable in the current session.)
4. Run the setup skill to install Python dependencies and verify everything works:
This will:
- Check your Python version (≥ 3.10 required)
- Create a venv in core/.venv/ (or use system Python with --global)
- Install the claude_evolve Python package in editable mode
- Verify the Claude CLI is available
- Confirm the MCP server exposes its five tools
- Run a smoke test that calls Claude once to verify the bridge
Option 2: Manual install from source
git clone https://github.com/samuelzxu/claude-evolve
cd claude-evolve
python3 -m venv core/.venv
source core/.venv/bin/activate
pip install -e core/
claude mcp add -s user claude-evolve -- python3 -m claude_evolve.server
Then open any Claude Code session and confirm the MCP server is connected:
claude mcp list
# Should show: claude-evolve: python3 -m claude_evolve.server - ✓ Connected
Or, in Windows PowerShell:
git clone https://github.com/samuelzxu/claude-evolve
cd claude-evolve
py -3 -m venv core\.venv
core\.venv\Scripts\python.exe -m pip install --upgrade pip setuptools wheel
core\.venv\Scripts\python.exe -m pip install -e .\core
Quick Start (3 commands)
Once installed, there are two paths depending on what you already have.
Path A: You have a vague optimization goal
If you don't yet have initial.py and evaluate.py, run the guided interview:
/evolve-interview I want to improve the sorting algorithm in src/solver.py to minimize total wall-clock time on my benchmark dataset
The interview asks you questions about:
- What exactly are we optimizing? (fitness metric)
- Which code is mutable? (helps you place # EVOLVE-BLOCK-START/END markers)
- How do we measure fitness without asking an LLM? (test suite, custom evaluator, etc.)
- What must stay unchanged? (non-goals, time budgets, correctness constraints)
It scores your answers on four weighted dimensions, combines them into a single ambiguity score, and refuses to proceed until that score is at or below 20%. Once the spec is clear, it crystallizes the specification into three files (initial.py, evaluate.py, config.json) and offers to start the run immediately.
Path B: You already have initial.py and evaluate.py
Just launch the evolutionary loop:
Or with inline parameters if you don't have a config file yet:
/evolve --init initial.py --eval evaluate.py --generations 50
Then check on it anytime:
Core Concepts
EVOLVE-BLOCK markers
claude-evolve only mutates code between these markers:
# EVOLVE-BLOCK-START
def my_algorithm(input):
    # The LLM can rewrite anything in here
    return input
# EVOLVE-BLOCK-END

# Code outside the markers is preserved verbatim
def run_experiment(**kwargs):
    return my_algorithm(kwargs['data'])
Put the part you want optimized inside the markers. Put fixed infrastructure (imports, the stable interface function, helpers you don't want touched) outside. The interface run_experiment(**kwargs) is conventionally kept outside so the evaluator has a stable entry point.
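To make the marker convention concrete, here is a minimal sketch of how a tool could isolate the mutable region and splice a new body back in. The function names split_evolve_block and replace_evolve_block are hypothetical and are not taken from the claude-evolve source; the sketch also assumes exactly one marker pair per file.

```python
START = "# EVOLVE-BLOCK-START"
END = "# EVOLVE-BLOCK-END"


def split_evolve_block(source: str) -> tuple[str, str, str]:
    """Return (prefix, mutable, suffix); prefix and suffix keep the marker lines."""
    lines = source.splitlines(keepends=True)
    starts = [i for i, line in enumerate(lines) if line.strip() == START]
    ends = [i for i, line in enumerate(lines) if line.strip() == END]
    if len(starts) != 1 or len(ends) != 1 or ends[0] <= starts[0]:
        raise ValueError("expected exactly one START marker followed by one END marker")
    start, end = starts[0], ends[0]
    prefix = "".join(lines[: start + 1])
    mutable = "".join(lines[start + 1 : end])
    suffix = "".join(lines[end:])
    return prefix, mutable, suffix


def replace_evolve_block(source: str, new_body: str) -> str:
    """Swap the mutable region for new_body, leaving everything else verbatim."""
    prefix, _, suffix = split_evolve_block(source)
    if not new_body.endswith("\n"):
        new_body += "\n"
    return prefix + new_body + suffix
```

Because the prefix and suffix are returned untouched, anything outside the markers (imports, run_experiment) survives a mutation byte for byte.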
The evolutionary loop
Each generation:
- Select an island (round-robin across populations)
- Select a parent from that island using weighted sampling (sigmoid of fitness)
- Get inspirations: top-K elites and recent archive members from the island
- Select a bandit arm: UCB1 picks a (model, effort) pair like sonnet/medium or haiku/high
- Build a prompt: parent code + inspirations + task description + a randomly-chosen persona
- Query Claude: subprocess call to claude --model X --effort Y -p <prompt>
- Apply the patch: diff (SEARCH/REPLACE blocks), full rewrite, crossover (two parents), or fix (for broken mutations)
- Check novelty: AST fingerprint + MinHash similarity rejects near-duplicates
- Evaluate: run python3 evaluate.py --program_path <candidate> as a subprocess
- Update database: store the new program, update bandit reward based on improvement over parent
- Periodic work: meta-scratchpad recommendations (every 10 gens), island migration (every 10 gens)
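The parent-selection step ("weighted sampling, sigmoid of fitness") can be sketched as follows. This is an illustration under assumptions, not the project's implementation: centring the sigmoid on the island's median score and the temperature parameter are both choices made for this example, and sigmoid_weights and select_parent are hypothetical names.

```python
import math
import random


def sigmoid_weights(scores, temperature=1.0):
    """Map fitness scores to sampling weights with a sigmoid centred on the median.

    Assumption: the centre and temperature are illustrative choices.
    """
    ordered = sorted(scores)
    median = ordered[len(ordered) // 2]
    return [1.0 / (1.0 + math.exp(-(s - median) / temperature)) for s in scores]


def select_parent(programs, rng=random):
    """Pick one (program_id, combined_score) tuple, favouring fitter programs."""
    weights = sigmoid_weights([score for _, score in programs])
    return rng.choices(programs, weights=weights, k=1)[0]
```

Every program keeps a non-zero weight, so weaker programs are still occasionally chosen as parents, which helps preserve diversity within an island.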
Islands
Multiple independent populations evolve in parallel. Elites migrate between islands every N generations. This prevents premature convergence — one island might find a good local optimum while another explores a different region of the search space. Migration then spreads good ideas.
The (model, effort) bandit ensemble
claude-evolve treats each combination of (model, effort_level) as a distinct "arm" for a UCB1 multi-armed bandit:
sonnet/max, sonnet/high, sonnet/medium, sonnet/low
haiku/high, haiku/medium, haiku/low
opus/max, opus/high (slow; used sparingly)
The bandit learns from actual improvement scores which arm produces the biggest fitness gains and prioritizes it. In practice, on fast-iteration tasks, haiku/high or sonnet/low usually win — they're fast enough to iterate quickly and smart enough to produce quality mutations. sonnet/high and above use extended thinking which can take 10–30 minutes per call and is usually not worth it.
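For readers unfamiliar with UCB1, the sketch below shows the textbook selection rule applied to arm names like the ones above. It is a simplified stand-in: the class name UCB1Bandit is hypothetical, and the reward shaping options that appear in the configuration reference (baseline shifting, adaptive scaling, epsilon) are deliberately left out.

```python
import math


class UCB1Bandit:
    """Textbook UCB1 over string-named arms; reward is any float, higher is better."""

    def __init__(self, arms, exploration_coef=1.0):
        self.c = exploration_coef
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}

    def select(self):
        # Play every arm once before trusting the estimates.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        total = sum(self.counts.values())

        def ucb(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(2.0 * math.log(total) / self.counts[arm])
            return mean + bonus

        return max(self.counts, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward
```

A typical loop calls select(), runs one mutation with that arm, then calls update() with the fitness improvement over the parent. Arms that are rarely tried keep a large exploration bonus, so they are revisited from time to time even when their mean reward is low.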
Patch types
- Diff (50% default): LLM returns <<<<<<< SEARCH / ======= / >>>>>>> REPLACE blocks. Fast, preserves most of the code, good for targeted improvements.
- Full rewrite (25%): LLM returns a complete new version inside a fenced code block. Good for substantial rewrites when the current approach is wrong.
- Crossover (10%): LLM receives two parents and is asked to combine their best ideas. Good for discovering hybrid strategies.
- Fix (15%): When a diff or full patch produces broken code (syntax error, crash), the fix patch sends the broken code + error message back to the LLM for correction. Salvages otherwise-wasted generations.
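As an illustration of the diff patch type, the following sketch applies SEARCH/REPLACE blocks to a source string. It assumes each marker sits on its own line and that every SEARCH text must match exactly once; the real patcher may be more lenient. apply_search_replace is a hypothetical name.

```python
import re

# One block: SEARCH marker, old text, divider, new text, REPLACE marker.
BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def apply_search_replace(code: str, patch: str) -> str:
    """Apply every SEARCH/REPLACE block in patch to code, in order."""
    blocks = BLOCK_RE.findall(patch)
    if not blocks:
        raise ValueError("no SEARCH/REPLACE blocks found")
    for search, replace in blocks:
        if code.count(search) != 1:
            raise ValueError("SEARCH text must match exactly once")
        code = code.replace(search, replace)
    return code
```

Raising on a missing or ambiguous match gives the loop a clear failure signal, which is the kind of case the fix patch type is meant to recover from.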
Novelty rejection
Before evaluating a new program, claude-evolve computes an AST-based fingerprint (structural features + MinHash) and compares it against existing programs in the same island. If similarity exceeds 0.95, the proposal is rejected as a near-duplicate. This prevents the bandit from wasting LLM calls on mutations that would produce essentially the same program.
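The idea behind structural novelty checks can be shown with a much simpler stand-in. The sketch below fingerprints a program by counting AST node types and compares fingerprints with exact weighted Jaccard similarity; it does not use MinHash, which approximates the same kind of set similarity more cheaply on large inputs. All function names here are hypothetical.

```python
import ast
from collections import Counter


def ast_fingerprint(code: str) -> Counter:
    """Count AST node types; identifier names and literal values are ignored."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(code)))


def similarity(a: str, b: str) -> float:
    """Weighted Jaccard similarity between two fingerprints, in [0, 1]."""
    fa, fb = ast_fingerprint(a), ast_fingerprint(b)
    union = sum((fa | fb).values())
    return sum((fa & fb).values()) / union if union else 1.0


def is_near_duplicate(candidate: str, existing: list, threshold: float = 0.95) -> bool:
    """True if candidate is structurally too close to any existing program."""
    return any(similarity(candidate, other) > threshold for other in existing)
```

Two programs that differ only in variable names or constants score 1.0 under this fingerprint, so a rename-only mutation would be rejected before any evaluation time is spent on it.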
Meta-scratchpad
Every 10 generations, a 3-step LLM pipeline analyzes top programs and writes a short "scratchpad" of recommendations. These get injected into future mutation prompts, giving later generations high-level insights extracted from earlier successes.
The Four Skills
| Skill | Purpose | When to use |
|---|---|---|
| /evolve-install | Install plugin, create venv, verify dependencies, check MCP | First time, or after a plugin update |
| /evolve-interview | Socratic interview to define a task and generate spec files | You have a goal but no initial.py/evaluate.py yet |
| /evolve | Start an autonomous evolution run | You have spec files and want to run |
| /evolve-status | Check a running evolution's progress | Anytime during a run |
/evolve-install
/evolve-install # auto-detect: install or verify
/evolve-install --local # create core/.venv/ and install there (default)
/evolve-install --global # use system Python, no venv
/evolve-install --verify # only run the health check phase
/evolve-install --force # reinstall from scratch
/evolve-install --help # show options
The first time you run this, it walks through five phases and saves state to state/install-state.json. On subsequent runs, it detects the existing install and offers a quick verify instead of repeating the wizard.
/evolve-interview
/evolve-interview "I want to optimize the ranking function in src/ranker.py to maximize NDCG@10 on my validation set"
The interview asks one question at a time targeting the weakest of four clarity dimensions:
| Dimension | Weight | Question style |
|---|---|---|
| Goal | 30% | "What single number defines success?" |
| Program | 25% | "Which exact function/region is mutable?" |
| Evaluation | 30% | "How do we measure that without asking an LLM?" |
| Constraints | 15% | "What must remain unchanged or bounded?" |
After each answer, it computes an ambiguity score:
ambiguity = 1 - (goal × 0.30 + program × 0.25 + eval × 0.30 + constraints × 0.15)
It will not crystallize artifacts until ambiguity ≤ 0.20. At rounds 4, 6, and 8, special "challenge agent" modes activate: contrarian (tests whether your hard constraints are actually hard), simplifier (probes whether the spec can be reduced), and ontologist (forces you to name the core entity when things feel fuzzy).
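The ambiguity formula is easy to check by hand. The sketch below encodes the weights from the table and works one example; treating the "weakest" dimension as the one with the lowest raw clarity score is an assumption, and the function names are hypothetical.

```python
# Weights from the clarity-dimension table.
WEIGHTS = {"goal": 0.30, "program": 0.25, "evaluation": 0.30, "constraints": 0.15}
THRESHOLD = 0.20


def ambiguity(clarity: dict) -> float:
    """clarity maps each dimension to a score in [0, 1]."""
    return 1.0 - sum(WEIGHTS[d] * clarity[d] for d in WEIGHTS)


def is_clear(clarity: dict) -> bool:
    return ambiguity(clarity) <= THRESHOLD


def weakest_dimension(clarity: dict) -> str:
    """Assumption: the next question targets the lowest raw clarity score."""
    return min(WEIGHTS, key=lambda d: clarity[d])


example = {"goal": 0.9, "program": 0.8, "evaluation": 0.7, "constraints": 0.6}
# 1 - (0.27 + 0.20 + 0.21 + 0.09) = 0.23, still above the 0.20 threshold
print(round(ambiguity(example), 2), weakest_dimension(example))
```

With these scores the interview would continue, and the next question would target constraints.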
When the interview completes, it writes:
- initial.py with # EVOLVE-BLOCK-START / # EVOLVE-BLOCK-END markers placed based on your answers
- evaluate.py as a standalone script that runs real code (LLM-as-judge is explicitly forbidden)
- config.json with sensible defaults tuned to your task
Then it runs the baseline evaluation and reports your starting score before handing off to /evolve.
/evolve
/evolve config.json # use a pre-written config
/evolve --init initial.py --eval evaluate.py # quick start with defaults
/evolve --init initial.py --eval evaluate.py --generations 100
Starts the evolution loop. Runs autonomously — you don't need to interact with it. Progress is logged to state/evolve.log (JSONL format) and state is checkpointed after every generation so you can inspect progress with /evolve-status or resume after an interruption.
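Because the log is plain JSONL, it can be inspected with a few lines of Python. The sketch below finds the best correct event so far; the field names (score, correct) follow the sample log events shown in this README, and best_event is a hypothetical helper, not part of the package.

```python
import json


def best_event(lines):
    """Return the log event with the highest score among correct programs, or None."""
    best = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("correct") and "score" in event:
            if best is None or event["score"] > best["score"]:
                best = event
    return best
```

Usage, assuming the default log location: open state/evolve.log and pass the file object to best_event, which iterates over it line by line.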
/evolve-status
Reports:
- Current generation / total generations
- Best combined score and which program achieved it
- Bandit arm selection counts and mean rewards
- Active islands and their sizes
- Recent generation events from the JSONL log
- Run PID (so you can kill it with kill -TERM <pid> to trigger graceful shutdown)
Writing initial.py
Your seed program has three parts:
1. Imports (outside EVOLVE-BLOCK)
import numpy as np
from scipy.optimize import minimize
Put all imports outside the markers — they're part of the stable infrastructure.
2. The mutable algorithm (inside EVOLVE-BLOCK)
# EVOLVE-BLOCK-START
def my_algorithm(input_data):
    """This is what the LLM will mutate.

    Keep the function signature stable: evaluate.py calls this.
    Everything else can change.
    """
    # Naive initial implementation
    return sum(input_data) / len(input_data)
# EVOLVE-BLOCK-END
Write the simplest version that could possibly work. The LLM is better at improving something that runs than at generating from scratch.
3. The stable interface (outside EVOLVE-BLOCK)
def run_experiment(**kwargs):
    """Called by evaluate.py. Must not change."""
    data = kwargs.get('data', [1, 2, 3, 4, 5])
    result = my_algorithm(data)
    return result
Important: run_experiment(**kwargs) is the conventional entry point. The evaluator imports the candidate module and calls this function. Keep it outside the markers so the LLM can't accidentally break the interface.
Real example: circle packing
# EVOLVE-BLOCK-START
"""Constructor-based circle packing for n=26 circles"""
import numpy as np


def construct_packing():
    n = 26
    centers = np.zeros((n, 2))
    centers[0] = [0.5, 0.5]
    for i in range(8):
        angle = 2 * np.pi * i / 8
        centers[i + 1] = [0.5 + 0.3 * np.cos(angle), 0.5 + 0.3 * np.sin(angle)]
    for i in range(16):
        angle = 2 * np.pi * i / 16
        centers[i + 9] = [0.5 + 0.45 * np.cos(angle), 0.5 + 0.45 * np.sin(angle)]
    centers = np.clip(centers, 0.01, 0.99)
    radii = compute_max_radii(centers)
    return centers, radii


def compute_max_radii(centers):
    n = centers.shape[0]
    radii = np.ones(n)
    for i in range(n):
        x, y = centers[i]
        radii[i] = min(x, y, 1.0 - x, 1.0 - y)
    for _ in range(n * n):
        changed = False
        for i in range(n):
            for j in range(i + 1, n):
                dist = float(np.linalg.norm(centers[i] - centers[j]))
                if radii[i] + radii[j] > dist:
                    scale = dist / (radii[i] + radii[j])
                    radii[i] *= scale
                    radii[j] *= scale
                    changed = True
        if not changed:
            break
    return np.maximum(radii, 0.0)
# EVOLVE-BLOCK-END


def run_experiment(**kwargs):
    """Fixed interface; not evolved."""
    centers, radii = construct_packing()
    sum_radii = float(np.sum(radii))
    return centers, radii, sum_radii
See core/examples/circle_packing/initial.py for the working version.
Writing evaluate.py
The evaluator is a standalone Python script that:
- Accepts --program_path <path> as a CLI argument
- Dynamically imports the candidate module at that path
- Calls run_experiment() (or whatever entry point you defined)
- Validates correctness if needed
- Computes a fitness score
- Prints JSON to stdout with at minimum combined_score (float) and correct (bool)
Minimal template
#!/usr/bin/env python3
"""Evaluator for <your task>."""
import argparse
import importlib.util
import json
import sys


def load_program(path):
    spec = importlib.util.spec_from_file_location("candidate", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--program_path", required=True)
    args = p.parse_args()
    try:
        mod = load_program(args.program_path)
        result = mod.run_experiment()
        score = compute_score(result)
        correct = validate(result)
    except Exception as e:
        print(json.dumps({
            "combined_score": 0.0,
            "correct": False,
            "error": str(e),
        }))
        sys.exit(0)
    print(json.dumps({
        "combined_score": float(score),
        "correct": bool(correct),
    }))


def compute_score(result):
    """Return a scalar to maximize. Higher is better."""
    return result  # replace with your fitness function


def validate(result):
    """Return True if the result meets correctness requirements."""
    return True


if __name__ == "__main__":
    main()
Critical rules
- Always print JSON to stdout, never raise uncaught exceptions. Failed mutations should return combined_score: 0.0, correct: False with an error message so the loop can continue.
- Higher scores must be better. claude-evolve maximizes combined_score. If your natural metric is "error rate" or "latency", return its negation or reciprocal.
- Never use an LLM to judge fitness. Fitness must come from real code execution. If your problem truly has no objective metric, claude-evolve is not the right tool.
- Keep evaluations fast. Each generation runs one evaluation. If each takes 30 minutes, 50 generations takes 25 hours. Aim for under 60 seconds per evaluation.
- Exit with code 0 even on failure. Non-zero exit codes make the runner treat the evaluator itself as broken, not the candidate.
- Be deterministic or well-averaged. If your evaluation is noisy, average over N seeds so the score reflects true quality, not RNG luck.
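For the last rule, one simple way to tame a noisy metric is to run it under several fixed seeds and report the mean. The sketch below is illustrative only: averaged_score and noisy_metric are hypothetical names, and the five seeds are an arbitrary choice that should be traded off against evaluation time.

```python
import random
import statistics


def averaged_score(run_once, seeds=(0, 1, 2, 3, 4)):
    """Call run_once(seed) for each fixed seed and return the mean score."""
    return statistics.fmean(run_once(seed) for seed in seeds)


def noisy_metric(seed):
    """Stand-in for a stochastic evaluation: true value 1.0 plus Gaussian noise."""
    rng = random.Random(seed)
    return 1.0 + rng.gauss(0.0, 0.1)


print(averaged_score(noisy_metric))
```

Because the seeds are fixed, the same candidate always receives the same score, so an apparent improvement between generations cannot come from a lucky draw.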
Example: testing with pytest
def main():
    p = argparse.ArgumentParser()
    p.add_argument("--program_path", required=True)
    args = p.parse_args()

    # Copy candidate to a fixture location that the test suite imports
    import shutil
    shutil.copy(args.program_path, "src/candidate.py")

    # Run pytest and parse results
    import subprocess
    result = subprocess.run(
        ["pytest", "tests/", "--json-report", "--json-report-file=/tmp/report.json"],
        capture_output=True,
        timeout=60,
    )
    with open("/tmp/report.json") as f:
        report = json.load(f)
    passed = report["summary"].get("passed", 0)
    total = report["summary"]["total"]
    print(json.dumps({
        "combined_score": passed / total if total else 0.0,
        "correct": passed == total,
        "passed": passed,
        "total": total,
    }))
Example: benchmark with runtime measurement
def main():
    p = argparse.ArgumentParser()
    p.add_argument("--program_path", required=True)
    args = p.parse_args()
    try:
        mod = load_program(args.program_path)
        import time
        start = time.perf_counter()
        result = mod.run_experiment(data=BENCHMARK_INPUT)
        elapsed = time.perf_counter() - start
        # Verify correctness
        correct = result == EXPECTED_OUTPUT
        # Fitness: reciprocal of runtime, zero if incorrect
        score = 1.0 / elapsed if correct else 0.0
    except Exception as e:
        print(json.dumps({"combined_score": 0.0, "correct": False, "error": str(e)}))
        return
    print(json.dumps({
        "combined_score": score,
        "correct": correct,
        "runtime_ms": elapsed * 1000,
    }))
Writing config.json
A minimal config:
{
"task_description": "Optimize the sort algorithm for speed on random integer arrays",
"language": "python",
"init_program_path": "initial.py",
"eval_program_path": "evaluate.py",
"num_generations": 50,
"ensemble": {
"arms": [
"sonnet/medium",
"sonnet/low",
"haiku/high",
"haiku/medium",
"haiku/low"
]
},
"islands": {
"num_islands": 2,
"migration_interval": 10
},
"llm_timeout": 300,
"eval_timeout": 120
}
That's enough to run. All other fields have sensible defaults. See Configuration Reference for the full list.
Running an Evolution
Start it
Or inline:
/evolve --init initial.py --eval evaluate.py --generations 50
The runner spawns as a background subprocess and writes state to ./state/ relative to the config file's directory. It prints pid and returns immediately — the evolution runs on its own.
Check progress
This reads state/run_state.json and reports the current generation, best score, bandit stats, and recent log events. Run it as often as you want — it doesn't disturb the running evolution.
You can also tail the JSONL log directly:
Each line is a JSON event like:
{"ts": "2026-04-10T23:27:41Z", "gen": 0, "event": "init", "score": 1.8378, "correct": true}
{"ts": "2026-04-10T23:30:12Z", "gen": 1, "event": "generation_complete", "arm": "haiku/high", "patch_type": "full", "score": 2.0837, "correct": true}
Stop it gracefully
Or directly:
kill -TERM <pid> # pid is shown in /evolve-status
The runner catches SIGTERM and SIGINT, saves state, waits up to 30 seconds for in-flight work, and exits cleanly. You can resume later by launching /evolve with the same config — it detects existing state and continues from the last completed generation.
After it finishes
The database at state/programs.db contains every program the loop generated. Query it to find the winner:
sqlite3 state/programs.db "
SELECT id, generation, combined_score, correct, patch_type
FROM programs
WHERE correct = 1
ORDER BY combined_score DESC
LIMIT 10
"
Or use the visualization tool:
/evolve-visualize progress # score-over-time chart
/evolve-visualize genealogy # parent-child tree of top programs
/evolve-visualize bandit # arm selection stats
Extract the winning program's code:
sqlite3 state/programs.db "SELECT code FROM programs WHERE id = <best_id>" > best.py
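The same lookup can be done from Python with the standard sqlite3 module. This sketch assumes only the table and column names that appear in the queries above (programs, id, combined_score, correct, code); best_program is a hypothetical helper.

```python
import sqlite3


def best_program(db_path):
    """Return (id, combined_score, code) of the top correct program, or None."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT id, combined_score, code FROM programs "
            "WHERE correct = 1 ORDER BY combined_score DESC LIMIT 1"
        ).fetchone()
    finally:
        conn.close()
```

Calling best_program("state/programs.db") after a run returns the winner's id, score, and source in one step, so the code can be written to best.py without copying an id by hand.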
Monitoring and Visualization
claude-evolve includes three built-in visualizations:
Progress chart
Shows the best score per generation as a terminal sparkline plus a table:
Generation: 1 5 10 15 20 25 30 35 40 45 50
Best score: ▁▂▃▄▅▅▆▆▇▇█ (1.8378 → 2.6343)
Gen Best Mean Improvement
1 1.8378 1.8378 +0.00%
5 1.9215 1.8803 +4.55%
10 2.4103 2.1844 +31.16%
15 2.5943 2.3871 +41.17%
...
50 2.6343 2.5102 +43.34%
Genealogy tree
Shows the parent-child relationships of the top-K programs:
└─ id=23 gen=40 score=2.6343 full
└─ id=16 gen=22 score=2.6296 fix
└─ id=10 gen=14 score=2.6236 full
└─ id=9 gen=13 score=2.5943 diff
└─ id=1 gen=0 score=1.8378 seed
Bandit arm stats
Shows which (model × effort) combinations were selected and how they performed:
Arm Selected Mean reward UCB1 score
sonnet/low 15 +0.398 2.14
sonnet/medium 9 +0.262 1.97
haiku/high 8 +0.180 1.88
haiku/medium 9 +0.155 1.83
haiku/low 9 +0.088 1.72
Configuration Reference
Full config.json schema with all fields and defaults:
{
"task_description": "You are an expert optimization and algorithm design assistant...",
"language": "python",
"init_program_path": "initial.py",
"eval_program_path": "evaluate.py",
"num_generations": 100,
"ensemble": {
"arms": [
"opus/max", "opus/high",
"sonnet/max", "sonnet/high", "sonnet/medium", "sonnet/low",
"haiku/high", "haiku/medium", "haiku/low"
],
"selection": "ucb1",
"exploration_coef": 1.0,
"epsilon": 0.2,
"shift_by_baseline": true,
"shift_by_parent": true,
"adaptive_scale": true,
"asymmetric_scaling": true
},
"patches": {
"types": ["diff", "full", "cross", "fix"],
"probs": [0.5, 0.25, 0.1, 0.15],
"max_resamples": 3,
"max_attempts": 1
},
"islands": {
"num_islands": 2,
"migration_interval": 10,
"migration_rate": 0.1,
"archive_size": 40,
"archive_selection_strategy": "fitness",
"parent_selection_strategy": "weighted",
"enable_dynamic_islands": false,
"stagnation_threshold": 100,
"elite_selection_ratio": 0.3
},
"meta": {
"rec_interval": 10,
"max_recommendations": 5,
"sample_single": true
},
"novelty": {
"similarity_threshold": 0.95,
"max_attempts": 3
},
"prompt_evo": {
"enabled": false,
"evolution_interval": null,
"archive_size": 10
},
"logging": {
"format": "jsonl",
"path": "state/evolve.log",
"level": "INFO"
},
"eval_timeout": 120,
"llm_timeout": 300,
"use_text_feedback": false,
"inspiration_sort_order": "ascending",
"results_dir": "state/"
}
Key knobs
| Field | Default | Description |
|---|---|---|
| num_generations | 100 | How many mutation→evaluate cycles to run |
| ensemble.arms | 9 arms | Which (model, effort) combinations the UCB1 bandit can pick from. Drop slow arms like sonnet/high if you want fast iteration. |
| ensemble.exploration_coef | 1.0 | UCB1 exploration weight. Higher = more exploration, lower = more exploitation |
| patches.types | 4 types | Which mutation styles to use. Can drop cross if you only have 1 island or want simpler mutations. |
| patches.probs | [0.5, 0.25, 0.1, 0.15] | Sampling probabilities for each patch type. Must sum to 1.0 and match length of patches.types. |
| islands.num_islands | 2 | Number of independent populations. Use 1 for small tasks, 3+ for larger exploration. |
| islands.migration_interval | 10 | Migrate elites every N generations |
| novelty.similarity_threshold | 0.95 | Reject proposals with AST similarity above this. Lower = more novelty enforcement |
| llm_timeout | 300 | Max seconds per claude subprocess call. Set to 900+ if using sonnet/max or opus/* arms |
| eval_timeout | 120 | Max seconds per evaluator subprocess. Set based on your evaluator's worst-case runtime |
| meta.rec_interval | 10 | Generate meta-scratchpad recommendations every N generations |
| prompt_evo.enabled | false | Enable system-prompt co-evolution (experimental) |
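The constraint on patches.probs (same length as patches.types, summing to 1.0) is easy to break when dropping a patch type by hand. The sketch below checks the constraint and renormalises after a removal; both helper names are hypothetical, and whether the runner performs such a check itself is not stated here.

```python
import math


def validate_patches(patches: dict) -> None:
    """Raise ValueError if types/probs are inconsistent."""
    types, probs = patches["types"], patches["probs"]
    if len(types) != len(probs):
        raise ValueError("patches.probs must have one entry per patch type")
    if any(p < 0 for p in probs):
        raise ValueError("patches.probs must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("patches.probs must sum to 1.0")


def drop_patch_type(patches: dict, name: str) -> dict:
    """Remove one patch type and rescale the remaining probabilities to sum to 1."""
    kept = [(t, p) for t, p in zip(patches["types"], patches["probs"]) if t != name]
    total = sum(p for _, p in kept)
    return {
        **patches,
        "types": [t for t, _ in kept],
        "probs": [p / total for _, p in kept],
    }
```

For example, dropping cross from the defaults leaves diff, full, and fix with probabilities 0.5/0.9, 0.25/0.9, and 0.15/0.9, which again sum to 1.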
Recommended arm sets by task type
- Fast iteration (< 30s per generation): ["haiku/high", "haiku/medium", "haiku/low"]
- Balanced (default): ["sonnet/medium", "sonnet/low", "haiku/high", "haiku/medium", "haiku/low"]
- Deep thinking (several minutes per generation): ["sonnet/max", "sonnet/high", "opus/max"] + llm_timeout: 1800
- Bedrock multi-model (see below): ["bedrock/deepseek.v3.2/medium", "bedrock/qwen.qwen3-coder-next/high", "bedrock/us.anthropic.claude-sonnet-4-6/low"]
Using AWS Bedrock Models
claude-evolve can use any model available through AWS Bedrock's Converse API as bandit arms — Claude, DeepSeek, Qwen, Kimi, Nemotron, GLM, MiniMax, and more. This gives you model-family diversity in a single evolution run.
Prerequisites
- An AWS account with Bedrock model access enabled for the models you want to use
- AWS credentials configured locally (via ~/.aws/credentials or environment variables)
- Install the bedrock extras:
pip install 'claude-evolve[bedrock]'
Configuration
Add a bedrock section to your config.json and prefix arm names with bedrock/:
{
"bedrock": {
"enabled": true,
"profile": "your-aws-profile",
"region": "us-east-1"
},
"ensemble": {
"arms": [
"bedrock/deepseek.v3.2/medium",
"bedrock/us.anthropic.claude-sonnet-4-6/low",
"bedrock/moonshotai.kimi-k2.5/medium",
"bedrock/qwen.qwen3-coder-next/high",
"bedrock/nvidia.nemotron-super-3-120b/medium",
"bedrock/openai.gpt-oss-120b-1:0/medium",
"bedrock/zai.glm-5/medium"
]
}
}
Arm format
bedrock/<model-id>/<effort>
- model-id: The Bedrock model ID or inference profile ID. Claude models require the inference profile form (e.g., us.anthropic.claude-sonnet-4-6, not anthropic.claude-sonnet-4-6).
- effort: low, medium, high, or max.
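Both arm formats can be parsed with one small function, sketched below. It relies on the effort always being the last slash-separated component, which holds for every arm listed in this README; Arm and parse_arm are hypothetical names, and the "cli" label for non-Bedrock arms is an assumption made for the example.

```python
from typing import NamedTuple

EFFORTS = {"low", "medium", "high", "max"}


class Arm(NamedTuple):
    backend: str  # "cli" for Claude Code arms, "bedrock" for Bedrock arms
    model: str
    effort: str


def parse_arm(name: str) -> Arm:
    """Parse 'model/effort' or 'bedrock/<model-id>/<effort>'."""
    head, sep, effort = name.rpartition("/")
    if not sep or not head or effort not in EFFORTS:
        raise ValueError(f"bad arm name: {name!r}")
    if head.startswith("bedrock/"):
        return Arm("bedrock", head[len("bedrock/"):], effort)
    return Arm("cli", head, effort)
```

Splitting from the right keeps model IDs that contain dots or colons, such as openai.gpt-oss-120b-1:0, intact.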
Effort mapping
| Model family | low | medium | high | max |
|---|---|---|---|---|
| Claude (anthropic) | thinking 4K tokens | thinking 16K | thinking 32K | thinking 64K |
| All others | temp 0.3 | temp 0.7 | temp 0.9 | temp 1.0 |
Available models (tested)
bedrock/deepseek.v3.2/medium
bedrock/us.anthropic.claude-sonnet-4-6/low
bedrock/us.anthropic.claude-opus-4-6-v1/low
bedrock/us.anthropic.claude-opus-4-5-20251101-v1:0/low
bedrock/moonshotai.kimi-k2.5/medium
bedrock/moonshot.kimi-k2-thinking/medium
bedrock/nvidia.nemotron-super-3-120b/medium
bedrock/openai.gpt-oss-120b-1:0/medium
bedrock/qwen.qwen3-vl-235b-a22b/medium
bedrock/qwen.qwen3-coder-next/high
bedrock/zai.glm-5/medium
bedrock/minimax.minimax-m2.5/medium
bedrock/ai21.jamba-1-5-large-v1:0/medium
Hybrid ensembles
You can mix Claude CLI arms and Bedrock arms in the same run:
{
"bedrock": { "enabled": true, "profile": "myprofile", "region": "us-east-1" },
"ensemble": {
"arms": [
"sonnet/medium",
"haiku/high",
"bedrock/deepseek.v3.2/medium",
"bedrock/qwen.qwen3-coder-next/high"
]
}
}
Arms without the bedrock/ prefix use the Claude Code CLI as before.
Troubleshooting Bedrock
- "Invocation of model ID X with on-demand throughput isn't supported": Use the inference profile ID (e.g., us.anthropic.claude-sonnet-4-6 instead of anthropic.claude-sonnet-4-6). Run aws bedrock list-inference-profiles to find valid IDs.
- "The maximum tokens you requested exceeds the model limit": This is handled automatically for known models. If you hit it with a new model, file an issue.
- Throttling: The backend retries with exponential backoff automatically.
- Credentials: Set profile in the bedrock config to match your ~/.aws/credentials profile name. Or omit it to use the default profile / environment variables.
Architecture
claude-evolve/
├── CLAUDE.md # Plugin registration + skill index
├── .claude-plugin/
│ ├── marketplace.json # Claude Code marketplace manifest
│ └── plugin.json # Plugin metadata
├── .mcp.json # MCP server registration
├── package.json # Thin npm wrapper for plugin install
├── bridge/
│ └── mcp-wrapper.mjs # Node wrapper that spawns the Python MCP server
├── skills/ # Slash command definitions
│ ├── evolve-install.md
│ ├── evolve-interview.md
│ ├── evolve.md
│ └── evolve-status.md
├── core/ # Python core
│ ├── pyproject.toml
│ └── claude_evolve/
│ ├── server.py # Python MCP server (5 tools)
│ ├── cli.py # `claude-evolve` command-line entry point
│ ├── runner.py # Main evolution orchestrator (sync-first)
│ ├── evaluation.py # Evaluator subprocess wrapper
│ ├── config.py # EvolveConfig dataclass + JSON I/O
│ ├── ensemble/
│ │ ├── bandit.py # UCB1 bandit over (model, effort) arms
│ │ ├── bridge.py # `claude --model X --effort Y -p ...` wrapper
│ │ └── personas.py # 5 prompt personas (replaces temperature)
│ ├── database/
│ │ ├── db.py # SQLite program database
│ │ ├── islands.py # Island management + migration
│ │ ├── parents.py # Weighted/power-law parent selection
│ │ ├── archive.py # Elite archive maintenance
│ │ └── models.py # Program dataclass
│ ├── mutations/
│ │ ├── apply_diff.py # SEARCH/REPLACE patching
│ │ ├── apply_full.py # Full rewrite patching
│ │ ├── crossover.py # Crossover patch type
│ │ ├── apply_fix.py # Fix patch for broken mutations
│ │ └── sampler.py # PromptSampler (builds mutation prompts)
│ ├── prompts/ # Prompt templates (ported from ShinkaEvolve)
│ ├── meta/
│ │ ├── summarizer.py # Meta-scratchpad 3-step pipeline
│ │ └── prompt_evolver.py # Prompt co-evolution
│ ├── novelty/
│ │ ├── embeddings.py # AST fingerprint + MinHash
│ │ └── judge.py # Novelty rejection logic
│ └── visualization/
│ ├── progress.py # Score-over-time chart
│ ├── genealogy.py # Parent-child tree
│ └── bandit.py # Arm selection stats
└── examples/
└── circle_packing/
├── initial.py # Naive 26-circle packing
├── evaluate.py # Circle packing evaluator
└── config.json # Benchmark config
Benchmark: Circle Packing
The reference benchmark is the 26-circle packing task from the ShinkaEvolve paper: place 26 non-overlapping circles inside the unit square [0,1]² to maximize the sum of their radii.
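A correctness check for this task has to confirm two things: every circle lies inside the unit square, and no two circles overlap. The sketch below shows one way to write that check in pure Python; it is not the evaluator shipped in the examples directory, and the tolerance value is an assumption.

```python
import math


def validate_packing(centers, radii, tol=1e-9):
    """Return True if all circles lie in [0, 1]^2 and no two circles overlap."""
    for (x, y), r in zip(centers, radii):
        inside = (
            x - r >= -tol and x + r <= 1 + tol
            and y - r >= -tol and y + r <= 1 + tol
        )
        if r < 0 or not inside:
            return False
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            # Two circles overlap when their centres are closer than the radius sum.
            if math.dist(centers[i], centers[j]) < radii[i] + radii[j] - tol:
                return False
    return True


def packing_score(centers, radii):
    """Sum of radii for a valid packing, 0.0 otherwise."""
    return float(sum(radii)) if validate_packing(centers, radii) else 0.0
```

Returning 0.0 for an invalid packing keeps the "higher is better" convention while ensuring that a candidate cannot gain fitness by letting circles overlap.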
Results
| System | Best score | Improvement over baseline |
|---|---|---|
| Naive initial (ring pattern) | 1.8378 | — |
| claude-evolve (50 gen, 3 islands, ~3 hours) | 2.6343 | +43.3% |
| ShinkaEvolve (reference, 150 gens) | 2.6360 | +43.4% |
claude-evolve reached 99.94% of ShinkaEvolve's reference score using only Claude Code models — no GPT, no Gemini, no cost-tracked API calls.
How to reproduce
cp -r core/examples/circle_packing ./my-circle-run
cd my-circle-run
/evolve config.json
Native Windows PowerShell
WSL is recommended for long benchmark runs. If you run on native Windows, this path uses the direct Python CLI, not the Claude Code plugin slash commands. You do not need to run claude mcp add, /evolve, or /evolve-status for this path; those are only needed if you want the plugin/MCP workflow.
First verify the Claude CLI works from PowerShell:
"Reply with OK only" | claude -p --model claude-haiku-4-5-20251001 --effort low --output-format json
If native Windows fails with [WinError 206] The filename or extension is too long, apply the stdin bridge fix in Troubleshooting, then verify the Python bridge:
core\.venv\Scripts\python.exe -c "import asyncio; from claude_evolve.ensemble.bridge import query_claude_async; r=asyncio.run(query_claude_async('haiku/low','Reply with OK only',timeout=120,max_retries=1)); print(r.content)"
Run the benchmark:
cd core
$env:CLAUDE_EVOLVE_EVAL_PYTHON = (Resolve-Path .\.venv\Scripts\python.exe).Path
$env:CLAUDE_CODE_MAX_OUTPUT_TOKENS = "64000"
.\.venv\Scripts\python.exe -m claude_evolve.cli run --config .\examples\circle_packing\config.json
Monitor from a second PowerShell window:
cd claude-evolve\core
Get-Content .\state\evolve.log -Wait -Tail 20
Check state:
Get-Content .\state\run_state.json
The default benchmark can still take a long time because it samples high-effort arms such as opus/high, sonnet/max, and sonnet/high.
The discovered solution (in the database after the run completes) is a hybrid approach:
- Multiple initialization strategies: hexagonal grid, golden-angle sunflower, ring pattern
- Augmented Lagrangian method (PHR) for non-overlap constraints
- L-BFGS-B optimizer for local minimization
- Basin hopping for escaping local optima
- Adaptive temperature scheduling for multi-start
This is the textbook approach for packing problems — claude-evolve discovered it entirely through evolutionary search, starting from a naive initial solution that knew nothing about optimization theory.
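As an illustration of one of the initialization strategies named above, here is a minimal sketch of a golden-angle sunflower layout for circle centers. It is not the discovered program; the function name `sunflower_centers` and the `margin` parameter are assumptions made for this example.

```python
import math

# Golden angle in radians, about 2.39996.
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))

def sunflower_centers(n, margin=0.05):
    """Place n points in a sunflower spiral centered in the unit square."""
    max_radius = 0.5 - margin  # keep every point at least `margin` from the border
    centers = []
    for k in range(n):
        # The square-root radial law spreads points with roughly uniform density.
        rho = max_radius * math.sqrt((k + 0.5) / n)
        theta = k * GOLDEN_ANGLE
        centers.append((0.5 + rho * math.cos(theta), 0.5 + rho * math.sin(theta)))
    return centers

centers = sunflower_centers(26)
```

A layout like this only supplies starting centers; radii and final positions would still come from the constrained optimization steps listed above.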
Troubleshooting
Native Windows: [WinError 206] The filename or extension is too long
Native Windows can fail if the Claude prompt is passed as a command-line argument. The fix is to keep -p for non-interactive mode, but send the prompt through stdin.
First verify the Claude CLI itself works:
"Reply with OK only" | claude -p --model claude-haiku-4-5-20251001 --effort low --output-format json
In core\claude_evolve\ensemble\bridge.py, change _build_claude_cmd from:
"-p", prompt, "--output-format", "json",
to:
"-p", "--output-format", "json",
In the same file, find asyncio.create_subprocess_exec(...) and add stdin while keeping the existing stdout, stderr, cwd, and env arguments:
stdin=asyncio.subprocess.PIPE,
Then change the existing proc.communicate(...) call so that it passes the prompt on stdin:
proc.communicate(input=prompt.encode("utf-8"))
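The pattern behind this fix can be sketched in isolation. The example below is not the contents of `bridge.py`: `run_with_stdin` is a hypothetical helper, and a small Python child process stands in for the `claude` CLI so the snippet runs anywhere.

```python
import asyncio
import sys

async def run_with_stdin(cmd, prompt, timeout=120):
    """Run cmd, send prompt on stdin, and return the decoded stdout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdin=asyncio.subprocess.PIPE,   # the prompt travels here, not in argv
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _stderr = await asyncio.wait_for(
        proc.communicate(input=prompt.encode("utf-8")), timeout=timeout
    )
    return stdout.decode("utf-8")

# Stand-in child process that echoes stdin back (assumption: replaces the claude CLI).
echo_cmd = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
result = asyncio.run(run_with_stdin(echo_cmd, "Reply with OK only"))
print(result)
```

Because the prompt never appears on the command line, its length no longer counts against the Windows command-line limit.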
Verify the Python bridge after the change:
cd core
.\.venv\Scripts\python.exe -c "import asyncio; from claude_evolve.ensemble.bridge import query_claude_async; r=asyncio.run(query_claude_async('haiku/low','Reply with OK only',timeout=120,max_retries=1)); print(r.content)"
Expected output: the model's reply, which should be OK.
/evolve-install fails at Phase 2: "scipy build error"
Upgrade pip and try again:
source core/.venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -e core/
On older Linux, you may need to install system libraries:
sudo apt install python3-dev libopenblas-dev liblapack-dev gfortran
/evolve-install says "MCP tools list empty"
The Python import failed. Debug directly:
python3 -m claude_evolve.server
# Should not raise any errors. If it does, the traceback shows what's missing.
Common causes: mcp package not installed, numpy/scipy not installed, Python version < 3.10.
claude mcp list doesn't show claude-evolve
Close and re-open Claude Code. If still missing, manually register:
claude mcp add -s user claude-evolve -- python3 -m claude_evolve.server
Evolution runs but every generation has correct=False
Your evaluator is rejecting all candidates. Run it directly on the seed to see what's happening:
python3 evaluate.py --program_path initial.py
If the seed itself scores correct=False, the evaluator's validation is wrong — fix it before running evolution. The seed must be correct for the loop to have any parent to mutate from.
sonnet/high or opus/* arms always time out
This is expected with the default llm_timeout: 300. Extended thinking on complex tasks can take 10–30 minutes per call and may even exceed the default 32,000 output token limit.
Options:
- Drop slow arms from the ensemble (recommended for most tasks):
  "arms": ["sonnet/medium", "sonnet/low", "haiku/high", "haiku/medium", "haiku/low"]
- Raise llm_timeout to 1800 (30 min) if you want to keep sonnet/max or opus/* and accept slow iteration.
- Raise the CLAUDE_CODE_MAX_OUTPUT_TOKENS env var if you see "Response exceeded 32000 output tokens" errors in logs. claude-evolve already sets this to 64000 internally, but you can go higher:
  export CLAUDE_CODE_MAX_OUTPUT_TOKENS=128000
Evolution is slow even with haiku arms
Each claude call has subprocess startup overhead. To minimize:
- claude-evolve already uses --strict-mcp-config with an empty MCP config to skip loading other plugins per call. This is built in.
- Make sure your evaluator is fast. If each evaluation takes 60 seconds, the LLM call is not the bottleneck.
- Reduce patches.probs for cross (which needs two parents and builds longer prompts).
- Consider num_islands: 1 and migration_interval: 999 to skip migration overhead for small runs.
"Not logged in · Please run /login"
This happens if your Claude CLI auth has expired. Run claude interactively once to re-authenticate, then retry.
Runs use too much of my Claude quota
Set num_generations lower and experiment with a short run first. 10 generations is usually enough to see whether your setup is working. A 50-generation run with sonnet/medium dominant typically costs ~$5–15 in Claude API credits if you're on a usage-based plan.
FAQ
How is this different from ShinkaEvolve?
Same algorithm, different LLM layer. ShinkaEvolve uses external API keys (OpenAI, Gemini, DeepSeek, Anthropic) through multi-provider routing. claude-evolve replaces all of that with a single claude --model X --effort Y -p ... subprocess call and treats (model × effort_level) as distinct arms in the UCB1 bandit. The core algorithm — islands, meta-scratchpads, novelty rejection, patch types, prompt co-evolution — is faithfully ported.
The practical difference: you don't manage API keys, you can't use GPT/Gemini, and the ensemble diversity comes from mixing Claude models at different thinking-effort levels plus 5 prompt personas (instead of temperature sampling, which the Claude CLI doesn't expose).
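For readers unfamiliar with UCB1, here is a minimal textbook sketch of arm selection over (model × effort) arms. The class name, the reward values, and the exploration constant are assumptions for illustration; claude-evolve's actual reward definition and bandit code may differ.

```python
import math

class UCB1:
    """Textbook UCB1: pick the arm with the highest mean reward plus exploration bonus."""

    def __init__(self, arms, c=math.sqrt(2.0)):
        self.arms = list(arms)
        self.c = c
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Try every arm once before trusting any estimate.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts.values())

        def ucb(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(math.log(t) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward

bandit = UCB1(["sonnet/medium", "haiku/low"])
first = bandit.select()
bandit.update(first, 1.0)   # hypothetical reward, e.g. a fitness improvement
second = bandit.select()
bandit.update(second, 0.0)
third = bandit.select()
```

With equal pull counts the exploration bonus is identical for both arms, so the third pick goes to the arm with the higher observed mean.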
Can I use this without Claude Code?
No. claude-evolve is specifically designed to use the claude CLI that ships with Claude Code. It's not a standalone library — its entire value proposition is "use the Claude Code authentication and model access you already have."
If you want a standalone Python library for LLM-driven code evolution, use ShinkaEvolve directly.
Can I evolve code in languages other than Python?
The prompts and patch logic support multi-language output, but the evaluator subprocess is always launched as python3 evaluate.py. Your evaluator can be Python that compiles/runs other languages — e.g. it can subprocess.run(["cargo", "bench"]) and parse the output. The candidate code itself can be Rust, Go, Julia, etc., as long as you set language in the config so the mutation prompts generate the right code.
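A Python evaluator that delegates to another toolchain might look roughly like the sketch below. The helper `run_and_parse_score` and the `score: <number>` output format are assumptions for illustration; a real `cargo bench` run prints a different format that you would need to parse accordingly.

```python
import re
import subprocess
import sys

# Hypothetical output format: a line such as "score: 2.5" on stdout.
SCORE_RE = re.compile(r"score:\s*([-+]?\d+(?:\.\d+)?)")

def run_and_parse_score(cmd, timeout=60):
    """Run an external command and return the parsed score, or None on failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None
    match = SCORE_RE.search(proc.stdout)
    return float(match.group(1)) if match else None

# Stand-in for a compiled benchmark binary (assumption: replaces e.g. cargo bench).
demo_cmd = [sys.executable, "-c", "print('score: 2.5')"]
demo_score = run_and_parse_score(demo_cmd)
failed_score = run_and_parse_score([sys.executable, "-c", "import sys; sys.exit(1)"])
```

Returning None on a build failure or timeout lets the evaluator mark the candidate as incorrect instead of crashing the run.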
How do I know when to stop a run?
Watch /evolve-status for a few generations. If the best score plateaus for 10+ generations and the bandit has converged (one arm dominating), further iteration probably won't help much. Stop with /evolve-stop or let it run to num_generations.
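The plateau rule of thumb can be written down as a small check. This is a sketch under the assumption that you have collected the best-so-far score per generation yourself; `has_plateaued`, `window`, and `min_delta` are hypothetical names, not part of claude-evolve.

```python
def has_plateaued(best_scores, window=10, min_delta=1e-6):
    """True if the best-so-far score gained at most min_delta over the last `window` generations."""
    if len(best_scores) <= window:
        return False  # not enough history to judge
    return best_scores[-1] - best_scores[-1 - window] <= min_delta

still_improving = [1.8, 2.0] + [2.6] * 10   # last jump was 10 generations ago
flat = [1.8, 2.0] + [2.6] * 11              # no gain across the full window
print(has_plateaued(still_improving), has_plateaued(flat))  # False True
```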
What if my task genuinely has no objective fitness function?
claude-evolve is not the right tool. Evolutionary optimization requires a scalar signal to climb. If you need subjective quality judgment, use a standard code-review workflow with Claude Code, not evolutionary search.
Can I use this for non-code optimization (e.g. prompt engineering)?
Yes, indirectly. Put your prompt text inside a Python function that returns it, wrap that function in EVOLVE-BLOCK markers, and have your evaluator run the prompt against a benchmark and return the accuracy/score. claude-evolve will evolve the prompt string. This is how ShinkaEvolve's AIME mathematical reasoning benchmark worked in the paper.
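A prompt-evolution setup along these lines could be sketched as follows. The marker comment syntax, the function names, and the toy dataset are assumptions for illustration; check the sections on writing initial.py and evaluate.py for the exact conventions.

```python
# EVOLVE-BLOCK-START  (assumed marker syntax)
def build_prompt():
    """Return the prompt text that the evolutionary loop is allowed to rewrite."""
    return "Solve the problem step by step, then give only the final answer."
# EVOLVE-BLOCK-END

def score_prompt(prompt, answer_fn, dataset):
    """Fraction of (question, expected) pairs that answer_fn gets right with this prompt."""
    correct = sum(1 for question, expected in dataset if answer_fn(prompt, question) == expected)
    return correct / len(dataset)

def fake_answer_fn(prompt, question):
    # Stand-in for a real model call: evaluates simple arithmetic directly.
    a, b = question.split("+")
    return str(int(a) + int(b))

toy_dataset = [("1+1", "2"), ("2+3", "5"), ("4+4", "9")]  # last label is deliberately wrong
accuracy = score_prompt(build_prompt(), fake_answer_fn, toy_dataset)
```

In a real evaluator, `fake_answer_fn` would be replaced by a call to the model under test, and `accuracy` would be reported as the fitness score.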
How does this compare to AlphaEvolve / DeepMind's approach?
Same core idea (LLM-as-mutator in an evolutionary loop), but claude-evolve is open-source, runs locally, and uses only Claude Code models. ShinkaEvolve's paper demonstrated it matches or beats AlphaEvolve on several benchmarks while being far more sample-efficient (150 evaluations for the circle packing SOTA vs thousands for AlphaEvolve). claude-evolve inherits that efficiency.
Is this officially supported by Anthropic?
No. claude-evolve is an independent community plugin built on top of Claude Code's plugin infrastructure. It uses only public APIs (the claude CLI and the MCP protocol). There is no special access or coordination with Anthropic.
License
MIT. See LICENSE for details.
Credits
- ShinkaEvolve (Sakana AI, ICLR 2026) — the algorithm this plugin reimplements. Paper: arxiv.org/abs/2509.19349.
- oh-my-claudecode — the reference for skill structure, setup patterns, and the deep-interview concept.
- Claude Code team at Anthropic — for building the plugin infrastructure this runs on.