Auto-Restart AI Agents
A watchdog that screen-scrapes tmux to detect crashed agents and restart them. It's crude and it works.
When you run AI coding agents in long-running sessions, they eventually hit a wall. Claude Code exhausts its context window, prints a farewell message, and exits to a bare shell prompt. If you're running one agent interactively, you notice immediately. If you're running 20 overnight, you wake up to find half of them have been sitting at a $ prompt for 6 hours.
amux solves this with a watchdog that periodically reads the terminal output, detects the shell prompt pattern, and auto-restarts the agent. The detection is a screen scrape. It's the kind of thing that feels wrong but works surprisingly well.
The detection problem
You need to distinguish four states from terminal output alone:
1. Claude Code is running. The TUI is visible: you can see the ❯ prompt, tool output, status indicators.
2. Claude Code exited to a shell. The terminal shows a bare $ or % prompt. This is what we want to detect and restart.
3. The user is working in the shell. They intentionally exited Claude and are doing something else. Don't restart.
4. Claude was never started. Fresh shell session. Don't restart.
States 2, 3, and 4 all look the same on screen: a shell prompt. The difference is context — was Claude running recently? Did it exit on its own? You can't tell from a single snapshot. You need history.
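To make the "you need history" point concrete, here is a minimal sketch that names the four states and classifies a pane from one snapshot plus one remembered timestamp. It is illustrative only, not amux's code: `PaneState` and `classify` are hypothetical names, and the sketch assumes the pane shows a shell prompt whenever the TUI is absent.

```python
from enum import Enum
from typing import Optional


class PaneState(Enum):
    RUNNING = 1         # Claude's TUI is on screen
    EXITED = 2          # Claude quit on its own; restart candidate
    USER_IN_SHELL = 3   # user left Claude on purpose
    NEVER_STARTED = 4   # fresh shell, Claude never ran


def classify(ui_visible: bool, last_alive: Optional[float], now: float,
             window: float = 600.0) -> PaneState:
    """Classify a pane from one snapshot plus the last-seen-alive time.

    Assumes a shell prompt is showing whenever the TUI is not. Timing
    stands in for intent: a prompt seen soon after Claude was alive is
    treated as an unplanned exit, an older one as the user's choice.
    """
    if ui_visible:
        return PaneState.RUNNING
    if last_alive is None:
        return PaneState.NEVER_STARTED
    if now - last_alive <= window:
        return PaneState.EXITED
    return PaneState.USER_IN_SHELL
```

The snapshot alone only decides state 1 versus everything else. The timestamp is what separates the remaining three.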
Step 1: Capture the terminal
tmux gives you capture-pane, which dumps the visible terminal content as plain text:
tmux capture-pane -t mysession -p
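By default this returns what is on screen as plain text; escape sequences for colors and attributes are included only if you pass -e. Stripping ANSI codes anyway is a cheap safeguard: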
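import re
import subprocess

def strip_ansi(text):
    # Remove CSI sequences such as color codes (ESC [ ... letter)
    return re.sub(r'\x1b\[[0-9;]*[a-zA-Z]', '', text)

# target is the tmux session or pane name, e.g. "mysession"
output = subprocess.check_output(
    ["tmux", "capture-pane", "-t", target, "-p"],
    text=True, timeout=5
)
clean = strip_ansi(output)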
Now you have the terminal content as plain text. This is your "API response."
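The capture can fail: the session may have been killed, tmux may hang, or tmux may not be installed. Below is one possible way to wrap the call so the monitor sees an empty screen instead of an exception. This `capture_pane` is a sketch, not necessarily the helper amux ships, and the `run` parameter is an assumption added so the function can be exercised without tmux.

```python
import re
import subprocess

ANSI_CSI = re.compile(r'\x1b\[[0-9;]*[a-zA-Z]')


def capture_pane(target, run=subprocess.check_output):
    """Return the visible pane text with ANSI codes stripped, or "" on failure.

    `run` is injectable so the function can be tested without tmux.
    """
    try:
        raw = run(
            ["tmux", "capture-pane", "-t", target, "-p"],
            text=True, timeout=5,
        )
    except (subprocess.CalledProcessError,
            subprocess.TimeoutExpired,
            FileNotFoundError):
        # Session gone, tmux hung, or tmux missing: report "nothing seen".
        return ""
    return ANSI_CSI.sub("", raw)
```

An empty string contains neither Claude's prompt nor a shell prompt, so a failed capture leads to no action rather than a spurious restart.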
Step 2: Detect Claude's TUI
When Claude Code is running, it draws a distinctive terminal UI. The most reliable indicator is the ❯ character (Unicode U+276F, "heavy right-pointing angle quotation mark ornament") — this is Claude's input prompt. It also has tool-use patterns, status bars, and other TUI elements.
def claude_ui_visible(clean_output):
    """Return True if Claude Code's TUI elements are visible."""
    if "\u276f" in clean_output:  # ❯ prompt
        return True
    # Other heuristics: tool output patterns, status indicators
    return False
If the Claude UI is visible, the agent is alive. No action needed.
Step 3: Detect the shell prompt
If Claude's UI isn't visible, check if we're looking at a bare shell prompt:
def at_shell_prompt(clean_output):
    """Return True if the terminal looks like a bare shell prompt."""
    if claude_ui_visible(clean_output):
        return False
    lines = [l for l in clean_output.splitlines() if l.strip()]
    for line in lines[-5:]:  # check last 5 non-empty lines
        stripped = line.strip()
        # Bash ends with $, zsh ends with %
        if re.search(r'[$%]\s*$', stripped):
            # Make sure it's not Claude's prompt character
            if "\u276f" not in stripped:
                return True
    return False
We check the last 5 lines rather than just the last line because the terminal might have trailing blank lines, or the prompt might span multiple lines (common with fancy PS1 configurations like powerline or starship).
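To see how the two detectors behave, here is a small self-contained run against a few screens. The captures are invented for illustration, not real Claude Code output, and the functions are condensed versions of the ones above.

```python
import re

CLAUDE_PROMPT = "\u276f"  # ❯


def claude_ui_visible(clean_output):
    return CLAUDE_PROMPT in clean_output


def at_shell_prompt(clean_output):
    if claude_ui_visible(clean_output):
        return False
    lines = [l for l in clean_output.splitlines() if l.strip()]
    return any(re.search(r'[$%]\s*$', l.strip()) for l in lines[-5:])


# Hypothetical captures, invented for illustration.
bash_screen = "Goodbye!\nuser@host:~/proj$ \n\n"
zsh_screen = "user@host ~/proj % \n"
claude_screen = "Reading src/app.py\n\u276f \n"
progress_screen = "Downloading model weights... 100%\n"

print(at_shell_prompt(bash_screen))      # True
print(at_shell_prompt(zsh_screen))       # True
print(at_shell_prompt(claude_screen))    # False
print(at_shell_prompt(progress_screen))  # True: a known weak spot
```

The last case shows the limit of the regex: any line ending in % or $ counts as a prompt. On its own that would be too loose, which is one reason the check is combined with the guards in the following steps.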
Step 4: The "was it ever alive?" guard
This is the critical insight that separates states 2, 3, and 4. We track a timestamp of the last time we saw Claude's UI in each session:
# During regular monitoring (runs every 30 seconds)
if claude_ui_visible(clean_output):
    session_state["last_claude_alive"] = time.time()
The auto-restart logic only fires if Claude was seen alive within the last 10 minutes:
last_alive = session_state.get("last_claude_alive", 0)
if time.time() - last_alive > 600:  # 10 minutes
    # Claude wasn't alive recently: this is a fresh shell
    # or the user exited a long time ago. Don't restart.
    return
This handles two cases:
- State 3 (user working in shell): If the user exited Claude 30 minutes ago and has been typing shell commands, last_claude_alive is more than 10 minutes old. No restart.
- State 4 (never started): last_claude_alive is 0 or doesn't exist. No restart.
State 2 (just exited) has a last_claude_alive within the last few minutes. Restart.
The 10-minute window is generous. In practice, Claude's context exhaustion happens suddenly — it's running, then it exits. The time between "last seen alive" and "detected at shell prompt" is typically under 60 seconds (one or two monitoring cycles).
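The guard can be pulled into a small pure function, which makes the three shell-prompt states easy to check by hand. `recently_alive` is a hypothetical name, and `now` is passed in rather than read from the clock so the example is deterministic.

```python
def recently_alive(session_state, now, window=600):
    """True if Claude's UI was seen within `window` seconds before `now`."""
    last_alive = session_state.get("last_claude_alive", 0)
    return now - last_alive <= window


now = 1_700_000_000  # stand-in for time.time()

just_exited = {"last_claude_alive": now - 45}       # state 2
user_in_shell = {"last_claude_alive": now - 1800}   # state 3
never_started = {}                                  # state 4

print(recently_alive(just_exited, now))    # True
print(recently_alive(user_in_shell, now))  # False
print(recently_alive(never_started, now))  # False
```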
Step 5: Rate limiting
Without rate limiting, a crash loop kills your system. Imagine Claude starts, immediately hits an error, and exits. The watchdog detects the exit, restarts Claude, Claude crashes again, watchdog restarts, ad infinitum.
last_restart = session_state.get("last_auto_restart", 0)
if time.time() - last_restart < 90:  # 90 second cooldown
    return  # too soon, skip
90 seconds between restarts means the worst case is ~40 restarts per hour, which is annoying but not catastrophic. In practice, crash loops are rare — the typical failure mode (context exhaustion) doesn't repeat immediately because the fresh Claude instance starts with a full context window.
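The ~40 figure can be checked with a short simulation of the worst case, where every monitoring pass finds a dead agent. This is a sketch with a simulated clock. `may_restart` and `simulate_crash_loop` are hypothetical names, and `None` stands in for "never restarted" because the simulated clock starts at 0.

```python
COOLDOWN = 90        # seconds between restarts, as in the text
CHECK_INTERVAL = 30  # seconds between monitoring passes


def may_restart(session_state, now, cooldown=COOLDOWN):
    """True if the cooldown since the last auto-restart has elapsed."""
    last_restart = session_state.get("last_auto_restart")
    return last_restart is None or now - last_restart >= cooldown


def simulate_crash_loop(duration=3600):
    """Count restarts when every monitoring pass finds a dead agent."""
    state, restarts = {}, 0
    for now in range(0, duration, CHECK_INTERVAL):
        if may_restart(state, now):
            state["last_auto_restart"] = now
            restarts += 1
    return restarts


print(simulate_crash_loop())  # 40
```

The count comes out to exactly 3600 / 90 because the cooldown is a multiple of the check interval. A cooldown of, say, 100 seconds would effectively round up to 120, since restarts can only happen on a monitoring pass.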
Step 6: The restart sequence
import threading
import time

def auto_restart(session_name, session_state):
    # Prevent concurrent restart attempts
    if session_state.get("restarting"):
        return
    session_state["restarting"] = True
    session_state["last_auto_restart"] = time.time()

    def do_restart():
        try:
            time.sleep(3)  # let any exit animation finish
            start_session(session_name)  # relaunch claude in the pane
            push_alert(f"Claude exited in '{session_name}': auto-restarted")
        finally:
            # Clear the flag even if the relaunch fails, so later passes can retry
            session_state.pop("restarting", None)

    threading.Thread(target=do_restart, daemon=True).start()
The 3-second delay before restart is important. When Claude exits, it sometimes prints a summary or cleanup message. If you restart immediately, the new Claude instance's TUI might collide with the tail end of the old output. Three seconds lets the terminal settle.
The restart itself is just launching Claude Code in the tmux pane — the same command that created the session originally. Claude starts fresh with no memory of the previous conversation.
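The source does not show `start_session`, so the following is only a guess at its shape: type the launch command into the pane with tmux send-keys and press Enter. The `claude` default for `launch_cmd`, the `restart_argv` helper, and the injectable `run` parameter are all assumptions made for this sketch.

```python
import subprocess


def restart_argv(session_name, launch_cmd="claude"):
    """Build the tmux command that types `launch_cmd` into the pane and presses Enter."""
    return ["tmux", "send-keys", "-t", session_name, launch_cmd, "Enter"]


def start_session(session_name, launch_cmd="claude", run=subprocess.run):
    """Relaunch the agent by sending its launch command to the session's pane."""
    # check=True surfaces a missing session as an exception instead of silence.
    run(restart_argv(session_name, launch_cmd), check=True, timeout=5)
```

Because this types into whatever is in the foreground, it is only safe when the pane really is at a shell prompt, which is what the earlier checks are meant to establish.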
The "no memory" question
When Claude restarts after context exhaustion, it starts with a clean context window. The previous conversation is gone. This sounds like a problem, but in practice it works well for two reasons:
- CLAUDE.md provides persistent context. The project instructions file is loaded on every start. If the agent's task is defined there (or on the shared board), the new instance picks it up automatically.
- The codebase is the ground truth. For batch operations (write tests, refactor, fix bugs), the agent examines the code, finds what needs doing, and does it. It doesn't need to remember that it already fixed 3 out of 5 files — it reads the codebase and finds the remaining 2.
For interactive sessions where conversation history matters, auto-restart is less useful. That's why it's opt-in.
Configuration: opt-in per session
Auto-restart is controlled by a per-session environment variable:
# In the session's .env file
CC_AUTO_CONTINUE=1
This is deliberately opt-in. Reasons:
- Some users want to review the context state before restarting
- Interactive sessions shouldn't auto-restart (the user might want to inspect the terminal)
- Debugging sessions might intentionally exit Claude to run shell commands
For batch/overnight sessions, you enable it once and forget about it.
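Reading the flag takes only a few lines. The parser below is a minimal sketch of what a `parse_env_file` helper might look like; amux's real one may differ. It handles plain KEY=VALUE lines only, with no `export` prefix and no variable expansion. `auto_restart_enabled` is a hypothetical name.

```python
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=VALUE lines into a dict; blank lines and # comments are skipped."""
    env = {}
    p = Path(path)
    if not p.is_file():
        return env  # a missing file means nothing is opted in
    for raw in p.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def auto_restart_enabled(env):
    """Opt-in check: anything other than an explicit yes means no."""
    return env.get("CC_AUTO_CONTINUE") in ("1", "true", "yes")
```

Defaulting to "off" when the file or key is missing keeps the behavior opt-in even if the configuration is broken.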
The broader pattern: terminal as API
What we're really doing here is treating tmux capture-pane as a read endpoint. The terminal is the interface, and we're screen-scraping it for state. This feels hacky, but consider the alternative approaches:
- Process monitoring (check if Claude's PID is alive): Doesn't tell you if Claude is stuck, and Claude might spawn subprocesses that outlive it.
- Log file parsing: Claude Code doesn't write structured logs to a file you can monitor.
- Exit code detection: You'd need to wrap Claude in a script that reports exit codes. Works, but adds a layer.
- Claude's own status API: Doesn't exist for the CLI tool.
Screen-scraping the terminal works because Claude Code is a TUI application. Its visual state is its state. If you can see the ❯ prompt, it's alive. If you see a $ prompt, it's dead. The terminal is an unintentional but, in practice, very reliable status API.
The heuristics are simple, but they've been running in production across hundreds of agent sessions with zero false positives (restarting when Claude was still running) and near-zero false negatives (missing an exit). The 10-minute alive window and the TUI detection are sufficient to disambiguate every real-world case we've encountered.
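For comparison, the exit-code alternative from the list above might look roughly like this: a wrapper that runs the agent and writes its exit code to a status file a monitor can poll. This is a sketch of the approach amux did not take. `run_and_report` and the JSON layout are invented for illustration.

```python
import json
import subprocess
import sys
import time


def run_and_report(argv, status_path):
    """Run `argv` to completion, then record its exit code in a JSON file."""
    started = time.time()
    # stdin/stdout are inherited, so a TUI still draws in the pane as usual.
    code = subprocess.run(argv).returncode
    with open(status_path, "w") as fh:
        json.dump({"argv": argv, "exit_code": code,
                   "started": started, "ended": time.time()}, fh)
    return code
```

Every session would then have to be launched through the wrapper, which is the extra layer the screen scrape avoids.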
The full monitoring loop
For completeness, here's how it fits together:
def monitor_sessions():
    """Run every 30 seconds."""
    for session in get_all_sessions():
        output = capture_pane(session.name)
        clean = strip_ansi(output)
        state = session_states[session.name]

        # Track when Claude was last seen alive
        if claude_ui_visible(clean):
            state["last_claude_alive"] = time.time()
            state.pop("restarting", None)
            continue

        # Check for auto-restart conditions
        if not at_shell_prompt(clean):
            continue
        if state.get("restarting"):
            continue

        env = parse_env_file(session.env_path)
        if env.get("CC_AUTO_CONTINUE") not in ("1", "true", "yes"):
            continue

        last_alive = state.get("last_claude_alive", 0)
        if time.time() - last_alive > 600:
            continue  # wasn't alive recently

        last_restart = state.get("last_auto_restart", 0)
        if time.time() - last_restart < 90:
            continue  # rate limited

        auto_restart(session.name, state)
One pass every 30 seconds across N sessions. Each check is one tmux capture-pane call (a short-lived subprocess, typically a few milliseconds) plus some regex matching. The overhead is negligible.
The result: overnight batch runs that used to stall when one agent hit context limits now self-heal. The dashboard shows an alert ("auto-restarted"), and the agent keeps working. Crude, simple, effective.
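The "run every 30 seconds" part needs a scheduler. One simple option, sketched here with the hypothetical name `run_periodically`, is a loop on a `threading.Event` that swallows per-pass errors so one bad session cannot stop monitoring for the rest.

```python
import threading


def run_periodically(fn, interval, stop):
    """Call `fn` every `interval` seconds until `stop` is set.

    An exception from one pass is logged and swallowed so that a single
    bad session cannot end monitoring for all the others.
    """
    while not stop.is_set():
        try:
            fn()
        except Exception as exc:
            print(f"monitor pass failed: {exc!r}")
        stop.wait(interval)  # sleeps, but wakes immediately on stop.set()


# Example wiring (monitor_sessions is the function shown above):
#   stop = threading.Event()
#   threading.Thread(target=run_periodically,
#                    args=(monitor_sessions, 30, stop), daemon=True).start()
```

Using `stop.wait` instead of `time.sleep` means shutdown does not have to wait out the remainder of a 30-second interval.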
Get started with amux
Run dozens of Claude Code agents in parallel. Python 3 + tmux. Open source.
git clone https://github.com/mixpeek/amux && cd amux && ./install.sh
amux register myproject --dir ~/Dev/myproject --yolo
amux start myproject
amux serve # → https://localhost:8822