Running 30 Claude Code agents on one repo without them stomping each other

Git worktrees for isolation, SQLite CAS for task claiming, SSE for real-time progress. No Redis, no Kubernetes, no microservices.

The problem

You have a monorepo. You want 10 AI agents working on it simultaneously — one fixing auth, one writing tests, one refactoring the API layer, one building a new feature. If they all share the same working directory, you get this:

agent-1: git add . && git commit -m "fix auth middleware"
agent-2: git add . && git commit -m "refactor API layer"
# agent-2 just committed agent-1's half-finished auth changes

Every agent sees every other agent's uncommitted changes. Files get overwritten mid-edit. Merge conflicts pile up. One agent's npm install blows away another's node_modules. It's a mess.

The standard fix is "just give each agent its own clone." But a full git clone of a large repo takes minutes and eats disk. If you have 30 agents, you now have 30 copies of your repo.

Git worktrees: the correct primitive

git worktree is one of those git features that's been around since 2.5 (2015) and almost nobody uses. It creates a new working directory that shares the same .git object store as the original repo. No copying objects, no re-cloning. One command, sub-second.

git worktree add .worktrees/auth-fix -b session/auth-fix

That gives you a fresh working tree at .worktrees/auth-fix on a new branch session/auth-fix. Each worktree has its own branch, index, and checked-out files, so commits and uncommitted edits in one never show up in another. They all share one object database, so the extra disk cost is only the checked-out files, not another copy of the history.

In amux, when you create a session with worktree=true, this is what happens:

import os
import subprocess

# 1. Verify the directory is a git repo (rev-parse exits non-zero otherwise)
result = subprocess.run(
    ["git", "-C", work_dir, "rev-parse", "--show-toplevel"],
    capture_output=True, text=True)
if result.returncode != 0:
    raise ValueError(f"{work_dir} is not inside a git repository")
repo_root = result.stdout.strip()

# 2. Create an isolated worktree with a per-session branch
wt_dir = os.path.join(repo_root, ".worktrees", name)
branch_name = f"session/{name}"
subprocess.run(
    ["git", "-C", repo_root, "worktree", "add", wt_dir, "-b", branch_name],
    capture_output=True, text=True, timeout=15, check=True)

# 3. Start the Claude Code session in the worktree directory
subprocess.run(
    ["tmux", "new-session", "-d", "-s", tmux_sess,
     "-c", wt_dir,  # <-- the worktree, not the original repo
     "-e", f"AMUX_SESSION={name}",
     "-e", f"AMUX_URL=https://localhost:8822"],
    check=True)

When the session ends, cleanup is one command:

git worktree remove --force .worktrees/auth-fix

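git worktree remove deletes only the working directory, so the branch stays behind with any commits to review. Note that --force also discards uncommitted changes left in the worktree, so anything worth keeping should be committed first.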
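If you also want to prune session branches that never received a commit, one possible policy is sketched below. This is an illustration, not amux's actual code: the helper names worktree_commands and cleanup and the base branch name "main" are assumptions. It relies only on standard git commands (worktree remove, rev-list --count, branch -D).

```python
import os
import subprocess


def worktree_commands(repo_root, name, base="main"):
    """Build the git commands for one session's worktree lifecycle."""
    wt_dir = os.path.join(repo_root, ".worktrees", name)
    branch = f"session/{name}"
    git = ["git", "-C", repo_root]
    return {
        "add": git + ["worktree", "add", wt_dir, "-b", branch],
        "remove": git + ["worktree", "remove", "--force", wt_dir],
        # Counts commits reachable from the session branch but not from base.
        "ahead": git + ["rev-list", "--count", f"{base}..{branch}"],
        "delete_branch": git + ["branch", "-D", branch],
    }


def cleanup(repo_root, name, base="main"):
    """Remove the worktree; drop the branch only if it has nothing to review."""
    cmds = worktree_commands(repo_root, name, base)
    subprocess.run(cmds["remove"], check=True, capture_output=True, text=True)
    ahead = subprocess.run(cmds["ahead"], check=True,
                           capture_output=True, text=True).stdout.strip()
    if ahead == "0":
        subprocess.run(cmds["delete_branch"], check=True,
                       capture_output=True, text=True)
        return "branch deleted"
    return f"branch kept ({ahead} commits to review)"
```

Calling cleanup(repo_root, "auth-fix") after a session ends would leave session/auth-fix in place whenever it is ahead of the base branch, and delete it otherwise.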

The task queue: a kanban board backed by SQLite

Isolation solves the filesystem collision problem. But now you have a coordination problem: how do 30 agents decide who works on what?

We use a kanban board backed by SQLite. Not Jira, not Linear, not a YAML file — a real board with columns (todo, doing, done) where both humans and agents can create, claim, and complete tasks.

When an agent starts, it gets two environment variables injected into its tmux session:

$AMUX_SESSION  # e.g., "auth-fix"
$AMUX_URL      # e.g., "https://localhost:8822"

And a shared memory file that teaches it the full REST API. The agent can immediately start checking the board for work:

# Check for tasks assigned to this session
curl -sk $AMUX_URL/api/board | python3 -c "
import json, sys, os
session = os.getenv('AMUX_SESSION')
tasks = json.load(sys.stdin)
mine = [t for t in tasks if t['session'] == session and t['status'] in ('todo', 'doing')]
for t in mine:
    print(f\"{t['id']}: {t['title']}\")"

Atomic task claiming (the interesting part)

The hard problem isn't listing tasks — it's claiming them. If you have 30 agents polling the same board, two of them will inevitably grab the same task at the same time.

The standard solution is a distributed lock (Redis, etcd, ZooKeeper). We use a SQLite UPDATE with a WHERE clause, which is effectively compare-and-swap without any external dependencies:

POST /api/board/:id/claim  {"session": "auth-fix"}

Server-side:

# Step 1: Atomic UPDATE — only succeeds if task is still unclaimed
db.execute(
    """UPDATE issues SET status='doing', session=?, updated=?
       WHERE id=? AND status IN ('todo','backlog') AND deleted IS NULL""",
    (session_name, now, task_id))
db.commit()

# Step 2: Verify — did WE get it, or did another agent beat us?
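row = db.execute("SELECT session FROM issues WHERE id=?", (task_id,)).fetchone()
if row is None:
    # Unknown task ID: without this guard, row["session"] raises TypeError
    return {"error": "task not found"}, 404
if row["session"] != session_name: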
    return {"error": "claim failed — taken by another session"}, 409

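The WHERE status IN ('todo','backlog') clause is the CAS condition. SQLite allows only one writer at a time, so two competing UPDATEs run one after the other: the first flips the status to 'doing', and the second then matches zero rows and changes nothing. The verification step reads the row back to see which session actually owns the task, and the loser gets a 409.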

No Redis. No distributed consensus. Just a WHERE clause and SQLite's write lock.
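To see the race resolve, here is a minimal, self-contained sketch using an in-memory database. The table layout is an assumption inferred from the query above (only the columns it touches), and instead of reading the row back it checks cursor.rowcount, the number of rows the UPDATE changed, which is an alternative way to learn whether the claim won.

```python
import sqlite3
import time


def make_db():
    # Assumed schema: just the columns the claim query refers to.
    db = sqlite3.connect(":memory:")
    db.row_factory = sqlite3.Row
    db.execute("""CREATE TABLE issues (
        id TEXT PRIMARY KEY, status TEXT, session TEXT,
        updated REAL, deleted REAL)""")
    db.execute("INSERT INTO issues (id, status) VALUES ('AF-1', 'todo')")
    db.commit()
    return db


def claim(db, task_id, session_name):
    """Return True only if this call moved the task from todo/backlog to doing."""
    cur = db.execute(
        """UPDATE issues SET status='doing', session=?, updated=?
           WHERE id=? AND status IN ('todo','backlog') AND deleted IS NULL""",
        (session_name, time.time(), task_id))
    db.commit()
    # rowcount is the number of rows the UPDATE changed: 1 = won, 0 = lost.
    return cur.rowcount == 1


db = make_db()
first = claim(db, "AF-1", "auth-fix")
second = claim(db, "AF-1", "api-refactor")
owner = db.execute("SELECT session FROM issues WHERE id='AF-1'").fetchone()["session"]
print(first, second, owner)  # True False auth-fix
```

The second claim fails without any explicit locking code because the status is no longer 'todo' by the time its UPDATE runs.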

Task naming: readable IDs from session names

Every task gets a human-readable ID derived from the session name:

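import re

def _prefix_from_session(session):
    # Drop empty fragments so a leading or trailing separator can't break w[0]
    words = [w for w in re.split(r"[-_\s]+", session) if w]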
    if len(words) == 1:
        return words[0].upper()[:5]  # "auth" -> "AUTH"
    return "".join(w[0] for w in words).upper()[:5]  # "data-pipeline" -> "DP"

So auth-fix gets tasks like AF-1, AF-2. data-pipeline-worker gets DPW-1. You can glance at the board and immediately know which session owns what.
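As a rough illustration of how the numeric suffix could be assigned, the sketch below pairs the prefix rule with a hypothetical next_task_id helper. That helper and its existing_ids argument are assumptions for this example; the post does not show how amux allocates numbers.

```python
import re


def prefix_from_session(session):
    # Same rule as above; empty fragments from stray separators are dropped.
    words = [w for w in re.split(r"[-_\s]+", session) if w]
    if len(words) == 1:
        return words[0].upper()[:5]
    return "".join(w[0] for w in words).upper()[:5]


def next_task_id(session, existing_ids):
    """Hypothetical helper: next free '<PREFIX>-<n>' ID for a session."""
    prefix = prefix_from_session(session)
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)$")
    used = [int(m.group(1)) for m in map(pattern.match, existing_ids) if m]
    return f"{prefix}-{max(used, default=0) + 1}"


print(next_task_id("auth-fix", ["AF-1", "AF-2", "DPW-1"]))  # AF-3
print(next_task_id("data-pipeline-worker", []))             # DPW-1
```

Numbering by prefix rather than by session matters here, because two session names can collapse to the same prefix (auth-fix and api-frontend both give AF).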

Real-time progress: SSE, not polling

We need the dashboard to show task state changes the moment they happen. Polling is wasteful, and WebSockets add a bidirectional protocol we don't need. Server-Sent Events fit this use case: they are unidirectional, the browser's EventSource reconnects automatically, and they run over plain HTTP, so they pass through most proxies.

GET /api/events  (text/event-stream)

The server streams five event types:

event: sessions
data: [{"name":"auth-fix","status":"active","task_name":"Fix OAuth flow",...}, ...]

event: board
data: [{"id":"AF-1","title":"Fix OAuth flow","status":"doing","session":"auth-fix",...}, ...]

event: alerts
data: {"type":"auto_compact","session":"auth-fix","message":"Auto-compacted — context at 18%"}

event: invalidate
data: {"type":"notes"}

event: heartbeat
data: {}
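On the wire, each of these is an event line, a data line, and a blank line. A minimal sketch of producing and reading such frames follows; format_sse and parse_sse are hypothetical helper names, and the parser is deliberately simplified (single-line data fields only, which holds here because json.dumps never emits raw newlines).

```python
import json


def format_sse(event, data):
    """Serialize one Server-Sent Events frame: event name, JSON data, blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def parse_sse(stream_text):
    """Split a text/event-stream body into (event, decoded_data) pairs."""
    frames = []
    for block in stream_text.split("\n\n"):
        fields = dict(line.split(": ", 1) for line in block.splitlines() if ": " in line)
        if "data" in fields:
            frames.append((fields.get("event", "message"), json.loads(fields["data"])))
    return frames


wire = format_sse("invalidate", {"type": "notes"}) + format_sse("heartbeat", {})
print(repr(wire))
print(parse_sse(wire))
```

In a browser the parsing side is handled by EventSource, so only something like format_sse would be needed on the server.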

Sessions and board events have a 2-second cache with a lock to prevent thundering herd:

if now - cache["time"] > 2.0:
    if cache_lock.acquire(blocking=False):
        try:
            cache["data"] = list_sessions()
            cache["json"] = json.dumps(cache["data"], sort_keys=True)
            cache["time"] = time.time()
        finally:
            cache_lock.release()

Only one thread builds the response per TTL window. Every other concurrent SSE client gets the cached version. This matters when you have 30 agents plus a few browser tabs.
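A self-contained version of the same pattern might look like the sketch below. The TTLCache class and its build callable are illustrative names, not amux's code. It adds a second staleness check inside the lock, so a thread that acquires the lock just after another refresh finished does not rebuild again.

```python
import threading
import time


class TTLCache:
    """Rebuild a value at most once per TTL window without blocking readers."""

    def __init__(self, build, ttl=2.0):
        self.build = build          # hypothetical callable, e.g. list_sessions
        self.ttl = ttl
        self.lock = threading.Lock()
        self.data = None
        self.built_at = None        # None means "never built"

    def _stale(self):
        return self.built_at is None or time.monotonic() - self.built_at > self.ttl

    def get(self):
        # Non-blocking acquire: one thread refreshes, the others skip straight
        # to the return and serve the previous (slightly stale) value.
        if self._stale() and self.lock.acquire(blocking=False):
            try:
                if self._stale():  # re-check in case a refresh just finished
                    self.data = self.build()
                    self.built_at = time.monotonic()
            finally:
                self.lock.release()
        return self.data


build_calls = []
cache = TTLCache(lambda: build_calls.append(1) or len(build_calls))
results = [cache.get() for _ in range(5)]
print(results, len(build_calls))
```

One trade-off of the non-blocking acquire: a reader that loses the race before the very first build gets None back, so callers should be prepared for an empty first response.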

Agent-to-agent handoff

The most interesting behavior emerges when agents start delegating to each other. Agent A finishes the auth middleware and knows the API layer needs updating. It posts a task to the board:

curl -sk -X POST -H 'Content-Type: application/json' \
  -d '{"title":"Update API routes for new auth middleware","session":"api-refactor","status":"todo"}' \
  $AMUX_URL/api/board

The api-refactor agent sees the new task in its next board check, claims it atomically, and starts working. The dashboard shows all of this in real-time via SSE — the task card moves from "todo" to "doing" to "done" as you watch.

Agents can also send messages directly to each other:

curl -sk -X POST -H 'Content-Type: application/json' \
  -d '{"text":"Auth middleware is done — new token format is JWT with kid header"}' \
  $AMUX_URL/api/sessions/api-refactor/send

This injects the text directly into the other agent's Claude Code prompt via tmux send-keys. The receiving agent sees it as if a human typed it.
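A minimal sketch of what that injection could look like with tmux send-keys is shown below. The function names and the target session name are hypothetical, and using the -l (literal) flag for the message followed by a separate Enter keypress is an assumption about a reasonable approach, not a description of amux's implementation.

```python
import subprocess


def send_keys_commands(tmux_session, text):
    """Build the two tmux invocations that type `text` and then press Enter."""
    return [
        # -l sends the text literally instead of interpreting key names.
        ["tmux", "send-keys", "-t", tmux_session, "-l", text],
        ["tmux", "send-keys", "-t", tmux_session, "Enter"],
    ]


def send_text(tmux_session, text):
    """Type a message into another agent's session as if a human had typed it."""
    for cmd in send_keys_commands(tmux_session, text):
        subprocess.run(cmd, check=True)
```

Sending the message literally matters because tmux otherwise treats words such as Enter or Escape as key names, so a message containing them would trigger keypresses instead of being typed.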

The watchdog: keeping 30 agents alive overnight

Running one Claude Code session for an hour is easy. Running 30 for 12 hours is a reliability problem. The failure modes are well-documented (we wrote about them):

  • Context overflow — the 200k token window fills up, Claude starts producing garbage or crashes
  • Thinking block corruption — a malformed internal reasoning block causes a hard crash
  • Silent exit — Claude exits to the shell with no error
  • Stuck waiting — Claude asks a question and sits there forever waiting for a human

The watchdog monitors every session's tmux output and reacts:

# Context getting low? Auto-compact before it crashes.
m = re.search(r"context left until auto-compact[^\d\n]*(\d+)%", output, re.IGNORECASE)
if m:
    pct = int(m.group(1))
    if pct < 30:
        send_text(name, "/compact")
        push_alert("auto_compact", name, f"Context at {pct}% — auto-compacted")

# Thinking block corrupted? Restart and replay.
if "thinking block is malformed" in output.lower():
    restart_and_replay(name)
    push_alert("thinking_reset", name, "Thinking block corruption — restarted")

# Claude exited? Restart it.
if not claude_ui_visible(output) and recently_alive(name):
    restart_session(name)
    push_alert("auto_restart", name, "Claude exited — restarted")

Every alert fires an SSE event that shows up on the dashboard in real-time. You can watch from your phone as agents auto-compact, recover from crashes, and keep going.
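The decision logic can be separated from the side effects so it is easy to test against captured output. The sketch below is one way to do that. The decide_action name, the returned action strings, and the priority order between the checks are assumptions; the matched phrases and the 30% threshold come from the snippet above.

```python
import re

# Take the percentage that follows the phrase on the same line, so an
# unrelated "100%" elsewhere in the pane is not mistaken for context left.
CONTEXT_RE = re.compile(r"context left until auto-compact[^\d\n]*(\d+)%", re.IGNORECASE)


def decide_action(output, ui_visible=True, recently_alive=True, threshold=30):
    """Map a snapshot of tmux output to one watchdog action (or None)."""
    if "thinking block is malformed" in output.lower():
        return "restart_and_replay"
    if not ui_visible and recently_alive:
        return "restart"
    m = CONTEXT_RE.search(output)
    if m and int(m.group(1)) < threshold:
        return "compact"
    return None


print(decide_action("Context left until auto-compact: 18%"))      # compact
print(decide_action("tests 100% passed"))                         # None
print(decide_action("$ ", ui_visible=False))                      # restart
```

Keeping this as a pure function means the watchdog loop only has to capture the pane, call it, and dispatch on the result.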

What this looks like in practice

A typical overnight session:

  1. Register 8 sessions, each pointing at the same repo but with worktree=true
  2. Post tasks to the board: "Fix flaky test suite", "Add rate limiting to API", "Migrate users table", etc.
  3. Start all sessions. Each one checks the board, claims a task atomically, and starts working in its own worktree
  4. Dashboard shows all 8 agents in real-time via SSE — active, idle, compacting, restarting
  5. When an agent finishes a task, it marks it done and claims the next one from the queue
  6. When an agent discovers work that another session should handle, it posts a new task to the board with that session's name
  7. Watchdog keeps everything alive. Context fills up? Auto-compact. Claude crashes? Auto-restart.
  8. Wake up to 8 branches with completed work, ready for review

The kanban board is the single source of truth. Every task has a session, a status, and a trail of what happened. The SSE stream means you never need to refresh — cards move across columns in real-time as agents work.
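From the agent's side, steps 3 and 5 amount to "try each open task until a claim succeeds". A hedged sketch of that loop follows. claim_next is a hypothetical helper, the claim argument stands in for a function that performs POST /api/board/:id/claim and returns the HTTP status code, and treating any 2xx status as success is an assumption (the post only specifies 409 for a lost race).

```python
def claim_next(tasks, session, claim):
    """Try open tasks in order; return the first ID this session wins, else None."""
    for task in tasks:
        if task["status"] != "todo":
            continue
        # Skip tasks explicitly addressed to a different session.
        if task.get("session") not in (None, "", session):
            continue
        status = claim(task["id"])
        if 200 <= status < 300:
            return task["id"]
        # 409 means another agent won the race: move on to the next task.
    return None


board = [
    {"id": "AF-1", "status": "doing", "session": "auth-fix"},
    {"id": "DPW-1", "status": "todo", "session": "data-pipeline-worker"},
    {"id": "AF-2", "status": "todo", "session": None},
    {"id": "AF-3", "status": "todo", "session": "api-refactor"},
]
# Fake server responses: AF-2 was just taken by someone else, AF-3 is free.
responses = {"AF-2": 409, "AF-3": 200}
won = claim_next(board, "api-refactor", lambda task_id: responses[task_id])
print(won)  # AF-3
```

Because the server decides the winner, the agent never needs to coordinate with its peers directly; losing a claim is just a signal to try the next card.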

The stack

Everything described in this post is a single Python file. Not a framework, not a collection of microservices — one file that runs on Python 3 and tmux.

  • Isolation: git worktree (ships with git)
  • Task queue: SQLite (ships with Python)
  • Real-time: SSE over stdlib http.server
  • Agent communication: tmux send-keys
  • Dashboard: inline HTML/CSS/JS (no build step)
  • External dependencies: zero Python packages (git and tmux are the only tools it shells out to)

The entire thing is open source. Clone it, run ./install.sh, and you're up in 30 seconds.

Get started with amux

Run dozens of Claude Code agents in parallel. Python 3 + tmux. Open source.

git clone https://github.com/mixpeek/amux && cd amux && ./install.sh
amux register myproject --dir ~/Dev/myproject --yolo
amux start myproject
amux serve  # → https://localhost:8822
