Detecting When Your AI Agent Dies (By Reading the Terminal)

8 min read Original article ↗

amux Blog Auto-Restart AI Agents

A watchdog that screen-scrapes tmux to detect crashed agents and restart them. It's crude and it works.

When you run AI coding agents in long-running sessions, they eventually hit a wall. Claude Code exhausts its context window, prints a farewell message, and exits to a bare shell prompt. If you're running one agent interactively, you notice immediately. If you're running 20 overnight, you wake up to find half of them have been sitting at a $ prompt for 6 hours.

amux solves this with a watchdog that periodically reads the terminal output, detects the shell prompt pattern, and auto-restarts the agent. The detection is a screen scrape. It's the kind of thing that feels wrong but works surprisingly well.

The detection problem

You need to distinguish four states from terminal output alone:

  1. Claude Code is running. The TUI is visible — you can see the prompt, tool output, status indicators.
  2. Claude Code exited to a shell. The terminal shows a bare $ or % prompt. This is what we want to detect and restart.
  3. The user is working in the shell. They intentionally exited Claude and are doing something else. Don't restart.
  4. Claude was never started. Fresh shell session. Don't restart.

States 2, 3, and 4 all look the same on screen: a shell prompt. The difference is context — was Claude running recently? Did it exit on its own? You can't tell from a single snapshot. You need history.

Step 1: Capture the terminal

tmux gives you capture-pane, which dumps the visible terminal content as plain text:

tmux capture-pane -t mysession -p

This returns whatever is on screen, including ANSI escape codes. Strip them:

import re

def strip_ansi(text):
    return re.sub(r'\x1b\[[0-9;]*[a-zA-Z]', '', text)

output = subprocess.check_output(
    ["tmux", "capture-pane", "-t", target, "-p"],
    text=True, timeout=5
)
clean = strip_ansi(output)

Now you have the terminal content as plain text. This is your "API response."

Step 2: Detect Claude's TUI

When Claude Code is running, it draws a distinctive terminal UI. The most reliable indicator is the character (Unicode U+276F, "heavy right-pointing angle quotation mark ornament") — this is Claude's input prompt. It also has tool-use patterns, status bars, and other TUI elements.

def claude_ui_visible(clean_output):
    """Return True if Claude Code's TUI elements are visible."""
    if "\u276f" in clean_output:  # ❯ prompt
        return True
    # Other heuristics: tool output patterns, status indicators
    return False

If the Claude UI is visible, the agent is alive. No action needed.

Step 3: Detect the shell prompt

If Claude's UI isn't visible, check if we're looking at a bare shell prompt:

def at_shell_prompt(clean_output):
    """Return True if the terminal looks like a bare shell prompt."""
    if claude_ui_visible(clean_output):
        return False

    lines = [l for l in clean_output.splitlines() if l.strip()]
    for line in lines[-5:]:  # check last 5 non-empty lines
        stripped = line.strip()
        # Bash ends with $, zsh ends with %
        if re.search(r'[$%]\s*$', stripped):
            # Make sure it's not Claude's prompt character
            if "\u276f" not in stripped:
                return True
    return False

We check the last 5 lines rather than just the last line because the terminal might have trailing blank lines, or the prompt might span multiple lines (common with fancy PS1 configurations like powerline or starship).

Step 4: The "was it ever alive?" guard

This is the critical insight that separates states 2, 3, and 4. We track a timestamp of the last time we saw Claude's UI in each session:

# During regular monitoring (runs every 30 seconds)
if claude_ui_visible(clean_output):
    session_state["last_claude_alive"] = time.time()

The auto-restart logic only fires if Claude was seen alive within the last 10 minutes:

last_alive = session_state.get("last_claude_alive", 0)
if time.time() - last_alive > 600:  # 10 minutes
    # Claude wasn't alive recently — this is a fresh shell
    # or the user exited a long time ago. Don't restart.
    return

This handles two cases:

  • State 3 (user working in shell): If the user exited Claude 30 minutes ago and has been typing shell commands, last_claude_alive is >10 minutes old. No restart.
  • State 4 (never started): last_claude_alive is 0 or doesn't exist. No restart.

State 2 (just exited) has a last_claude_alive within the last few minutes. Restart.

The 10-minute window is generous. In practice, Claude's context exhaustion happens suddenly — it's running, then it exits. The time between "last seen alive" and "detected at shell prompt" is typically under 60 seconds (one or two monitoring cycles).

Step 5: Rate limiting

Without rate limiting, a crash loop kills your system. Imagine Claude starts, immediately hits an error, and exits. The watchdog detects the exit, restarts Claude, Claude crashes again, watchdog restarts, ad infinitum.

last_restart = session_state.get("last_auto_restart", 0)
if time.time() - last_restart < 90:  # 90 second cooldown
    return  # too soon, skip

90 seconds between restarts means the worst case is ~40 restarts per hour, which is annoying but not catastrophic. In practice, crash loops are rare — the typical failure mode (context exhaustion) doesn't repeat immediately because the fresh Claude instance starts with a full context window.

Step 6: The restart sequence

def auto_restart(session_name, session_state):
    # Prevent concurrent restart attempts
    if session_state.get("restarting"):
        return
    session_state["restarting"] = True
    session_state["last_auto_restart"] = time.time()

    def do_restart():
        time.sleep(3)  # let any exit animation finish
        start_session(session_name)  # relaunch claude in the pane
        session_state.pop("restarting", None)
        push_alert(f"Claude exited in '{session_name}' — auto-restarted")

    threading.Thread(target=do_restart, daemon=True).start()

The 3-second delay before restart is important. When Claude exits, it sometimes prints a summary or cleanup message. If you restart immediately, the new Claude instance's TUI might collide with the tail end of the old output. Three seconds lets the terminal settle.

The restart itself is just launching Claude Code in the tmux pane — the same command that created the session originally. Claude starts fresh with no memory of the previous conversation.

The "no memory" question

When Claude restarts after context exhaustion, it starts with a clean context window. The previous conversation is gone. This sounds like a problem, but in practice it works well for two reasons:

  1. CLAUDE.md provides persistent context. The project instructions file is loaded on every start. If the agent's task is defined there (or on the shared board), the new instance picks it up automatically.
  2. The codebase is the ground truth. For batch operations (write tests, refactor, fix bugs), the agent examines the code, finds what needs doing, and does it. It doesn't need to remember that it already fixed 3 out of 5 files — it reads the codebase and finds the remaining 2.

For interactive sessions where conversation history matters, auto-restart is less useful. That's why it's opt-in.

Configuration: opt-in per session

Auto-restart is controlled by a per-session environment variable:

# In the session's .env file
CC_AUTO_CONTINUE=1

This is deliberately opt-in. Reasons:

  • Some users want to review the context state before restarting
  • Interactive sessions shouldn't auto-restart (the user might want to inspect the terminal)
  • Debugging sessions might intentionally exit Claude to run shell commands

For batch/overnight sessions, you enable it once and forget about it.

The broader pattern: terminal as API

What we're really doing here is treating tmux capture-pane as a read endpoint. The terminal is the interface, and we're screen-scraping it for state. This feels hacky, but consider the alternative approaches:

  • Process monitoring (check if Claude's PID is alive): Doesn't tell you if Claude is stuck, and Claude might spawn subprocesses that outlive it.
  • Log file parsing: Claude Code doesn't write structured logs to a file you can monitor.
  • Exit code detection: You'd need to wrap Claude in a script that reports exit codes. Works, but adds a layer.
  • Claude's own status API: Doesn't exist for the CLI tool.

Screen-scraping the terminal works because Claude Code is a TUI application. Its visual state is its state. If you can see the prompt, it's alive. If you see a $ prompt, it's dead. The terminal is an unintentional but perfectly reliable status API.

The heuristics are simple, but they've been running in production across hundreds of agent sessions with zero false positives (restarting when Claude was still running) and near-zero false negatives (missing an exit). The 10-minute alive window and the TUI detection are sufficient to disambiguate every real-world case we've encountered.

The full monitoring loop

For completeness, here's how it fits together:

def monitor_sessions():
    """Run every 30 seconds."""
    for session in get_all_sessions():
        output = capture_pane(session.name)
        clean = strip_ansi(output)
        state = session_states[session.name]

        # Track when Claude was last seen alive
        if claude_ui_visible(clean):
            state["last_claude_alive"] = time.time()
            state.pop("restarting", None)
            continue

        # Check for auto-restart conditions
        if not at_shell_prompt(clean):
            continue
        if state.get("restarting"):
            continue

        env = parse_env_file(session.env_path)
        if env.get("CC_AUTO_CONTINUE") not in ("1", "true", "yes"):
            continue

        last_alive = state.get("last_claude_alive", 0)
        if time.time() - last_alive > 600:
            continue  # wasn't alive recently

        last_restart = state.get("last_auto_restart", 0)
        if time.time() - last_restart < 90:
            continue  # rate limited

        auto_restart(session.name, state)

30 seconds × N sessions. Each check is a tmux capture-pane (sub-millisecond) plus some regex matching. The overhead is negligible.

The result: overnight batch runs that used to stall when one agent hit context limits now self-heal. The dashboard shows an alert ("auto-restarted"), and the agent keeps working. Crude, simple, effective.

Get started with amux

Run dozens of Claude Code agents in parallel. Python 3 + tmux. Open source.

git clone https://github.com/mixpeek/amux && cd amux && ./install.sh
amux register myproject --dir ~/Dev/myproject --yolo
amux start myproject
amux serve  # → https://localhost:8822

View on GitHub