GitHub - whitepaper27/Flight-Recorder: Replay debugger for AI agents — fix failures and replay from the exact failure point, skipping what already worked

Fix one bug. Don't rerun everything.

Debugging AI agents today means re-running slow APIs, repeating LLM calls, and restarting full workflows — just to test a one-line fix.

Flight Recorder replays from the exact failure point. Everything that already worked is cached.

  FAIL score_lead
    |
  WHY? get_contact returned None
    |
  FIX  your code
    |
  REPLAY from failure (cached steps skipped)
    |
  SUCCESS

$ flight-recorder debug last

Probable root cause (direct match):
  agent://crm/lookup(...) -> None
  agent://crm/scorer(...) -> AssertionError

Hint: agent://crm/lookup returned None, but agent://crm/scorer expects a dict

$ flight-recorder replay last --yes

Replay plan:
  Re-running: agent://crm/lookup (root cause)
  Re-running: agent://crm/scorer (downstream)

SUCCESS

Quick Start

pip install flight-recorder
python examples/basic_demo.py          # watch it fail
flight-recorder debug last             # see root cause
# fix the bug in basic_demo.py
flight-recorder replay last --yes      # verify fix works

How It Plugs Into Your Code

Two ways to use it:

fr.trace() — zero code changes, instant debug

from flight_recorder import FlightRecorder
fr = FlightRecorder()

with fr.trace():
    result = your_pipeline(input_data)    # all calls recorded automatically
# then: flight-recorder debug last

Records everything. Shows root cause. No decorators needed. Great for a first try.

@fr.register — add decorators, unlock replay

@fr.register("get_contact")
def get_contact(email):
    return db.lookup(email)               # your code, unchanged

@fr.register("send_email", replayable=False)
def send_email(to, body):
    return smtp.send(to, body)            # never re-run on replay

fr.run(your_pipeline, input_data)
# then: flight-recorder debug last  AND  flight-recorder replay last

Same recording + debug, plus replay from failure point with cached results.

Feature               fr.trace()   @fr.register
Record events         yes          yes
Debug (root cause)    yes          yes
Replay from failure   no           yes
Code changes needed   none         add decorators

Under the hood, Flight Recorder:

  • Records inputs and outputs of each step
  • Records errors with full tracebacks
  • Tracks who called whom (parent-child relationships)
  • Stores everything in SQLite (zero config, no external services)

What it does not do: modify your function's behavior, swallow your exceptions, or block your pipeline on storage writes (events are written asynchronously). If Flight Recorder itself crashes, your code still runs normally.

Mental model: @fr.register = step-level instrumentation + failure tracing + selective replay.
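
As a rough illustration of the recording behavior described above, here is a minimal sketch of a step-recording decorator. It is not Flight Recorder's implementation: the names register, _safe_record, and EVENTS are hypothetical, and an in-memory list stands in for the SQLite store.

```python
import functools
import traceback

EVENTS = []  # hypothetical in-memory stand-in for the SQLite event store


def _safe_record(event):
    """Append an event; swallow recorder errors so user code is unaffected."""
    try:
        EVENTS.append(event)
    except Exception:
        pass  # a recorder bug must never break the wrapped function


def register(name):
    """Hypothetical decorator: records inputs, outputs, and errors of a step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            _safe_record({"step": name, "type": "STARTED",
                          "args": args, "kwargs": kwargs})
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                _safe_record({"step": name, "type": "FAILED",
                              "error": repr(exc),
                              "traceback": traceback.format_exc()})
                raise  # the exception is recorded, then propagates unchanged
            _safe_record({"step": name, "type": "COMPLETED", "result": result})
            return result
        return wrapper
    return decorator


@register("get_contact")
def get_contact(email):
    return {"email": email, "score": 80}


@register("score_lead")
def score_lead(contact):
    assert contact is not None, "Contact required"
    return contact["score"]


score = score_lead(get_contact("sarah@acmecorp.com"))
print(score, [e["type"] for e in EVENTS])
```

The wrapper re-raises after recording, so callers see exactly the exception they would have seen without the decorator.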


30-Second Demo

from flight_recorder import FlightRecorder

fr = FlightRecorder()

@fr.register("agent://crm/lookup")
def crm_lookup(email):
    return None  # Bug: returns None for unknown contacts

@fr.register("agent://crm/scorer")
def score_lead(contact):
    assert contact is not None, "Contact required"
    return contact["score"]

@fr.register("agent://crm/qualifier")
def qualify_lead(email):
    contact = crm_lookup(email)
    score = score_lead(contact)
    return {"qualified": score > 70}

fr.run(qualify_lead, "unknown@example.com")

It crashes. Now debug it:

$ flight-recorder debug last

Probable root cause (direct match):
  agent://crm/lookup(...) -> None
  agent://crm/scorer(...) -> AssertionError

Timeline:
  [0000.000] [OK]   agent://crm/lookup -> None
  [0000.001] [FAIL] agent://crm/scorer -> AssertionError: Contact required

Hint:
  agent://crm/lookup returned None, but agent://crm/scorer expects a dict

Fix the bug (return a default record instead of None), then replay:

$ flight-recorder replay last --yes

Replay plan:
  Re-running: agent://crm/lookup (root cause -- code changed)
  Re-running: agent://crm/scorer (downstream of root cause)

SUCCESS
Final result: {"qualified": false}

No full re-run. Just the fixed agent and its downstream dependencies.
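
The fix described above (return a default record instead of None) could look like the sketch below. It uses plain functions without the decorators so that it runs on its own, and the fields of the default record are an assumption.

```python
def crm_lookup(email):
    # Hypothetical fix: unknown contacts get a default record instead of None.
    return {"email": email, "score": 0}


def score_lead(contact):
    assert contact is not None, "Contact required"
    return contact["score"]


def qualify_lead(email):
    contact = crm_lookup(email)
    score = score_lead(contact)
    return {"qualified": score > 70}


result = qualify_lead("unknown@example.com")
print(result)  # {'qualified': False}
```

A default score of 0 keeps the scorer's contract (it always receives a dict) and yields the unqualified result shown in the replay output.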


Real-World Example: CRM Pipeline with LLM Calls

Flight Recorder isn't just for toy demos. Here it is running on a real 5-agent pipeline with GPT-4o-mini and a SQLite CRM database:

Qualifier (orchestrator)
  |
  +---> CRM Lookup (SQLite database)
  +---> Web Enricher (enrichment API)
  |         |
  +-------->+
  |
  +---> LLM Scorer (GPT-4o-mini) ---> Email Drafter (GPT-4o-mini)

# Run with a known contact -- full pipeline succeeds
$ python -m examples.langgraph_crm.run sarah@acmecorp.com
Lead QUALIFIED (score: 75)
Draft email: "Hi Sarah, I hope this message finds you well..."

# Run with an unknown contact -- crashes at scorer
$ python -m examples.langgraph_crm.run unknown@nowhere.com
Agent FAILED: AssertionError: Contact data required for scoring

# Debug -- Flight Recorder identifies the root cause
$ flight-recorder debug last
Probable root cause (direct match):
  agent://crm/lookup returned None
  agent://crm/scorer expects a dict

# Fix crm_lookup, then replay -- the enricher's results are cached
$ flight-recorder replay last --yes
SUCCESS (score: 10, qualified: false)

Calls that already succeeded (the web enricher) are cached and reused on replay, so their API credits are not spent again.


Architecture

YOUR CODE                           FLIGHT RECORDER
+------------------+                +---------------------------+
|                  |                |                           |
| @fr.register()   |---record--->  | Event Store (SQLite WAL)  |
| your agent funcs |                | - STARTED events          |
|                  |                | - COMPLETED + return vals |
| fr.run(top_func) |                | - FAILED + tracebacks    |
|                  |                |                           |
+------------------+                +---------------------------+
                                              |
                                    +---------+---------+
                                    |                   |
                              +-----v-----+     +------v------+
                              | Causal    |     | Replay      |
                              | Analysis  |     | Engine      |
                              |           |     |             |
                              | - DAG     |     | - Cache     |
                              | - Root    |     | - Subprocess|
                              |   cause   |     | - Rerun set |
                              | - Hints   |     | - Dry-run   |
                              +-----------+     +-------------+
                                    |                   |
                              +-----v-----+     +------v------+
                              | CLI:      |     | CLI:        |
                              | debug     |     | replay      |
                              +-----------+     +-------------+

How It Works

Recording (@fr.register decorator):

  • Wraps each agent function (sync or async)
  • Records STARTED, COMPLETED, and FAILED events with full payloads
  • Tracks parent-child relationships via ContextVars (who called whom)
  • Stores everything in SQLite with WAL mode (concurrent reads, async writes)
  • Error-isolated: if Flight Recorder fails internally, your code still runs normally

Debugging (flight-recorder debug):

  • Builds a DAG of agent calls from recorded events
  • Identifies the deepest failure (the crash site)
  • Traces the root cause by matching sibling outputs to failed inputs
  • Generates hints: "agent X returned None, but agent Y expects a dict"
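
The sibling-matching heuristic above can be sketched as follows. The event fields (parent, depth, inputs, output) and the function name find_root_cause are assumptions made for this example; the actual event schema may differ.

```python
# Hypothetical recorded events for the failing CRM demo.
EVENTS = [
    {"id": 1, "parent": None, "depth": 0, "step": "agent://crm/qualifier",
     "status": "FAILED", "inputs": ["unknown@example.com"]},
    {"id": 2, "parent": 1, "depth": 1, "step": "agent://crm/lookup",
     "status": "COMPLETED", "inputs": ["unknown@example.com"], "output": None},
    {"id": 3, "parent": 1, "depth": 1, "step": "agent://crm/scorer",
     "status": "FAILED", "inputs": [None]},
]


def find_root_cause(events):
    """Return (root_cause, crash_site), or None if nothing failed."""
    failed = [e for e in events if e["status"] == "FAILED"]
    if not failed:
        return None
    crash = max(failed, key=lambda e: e["depth"])  # deepest failure = crash site
    for sibling in events:
        same_parent = sibling["parent"] == crash["parent"]
        if same_parent and sibling["status"] == "COMPLETED":
            # Direct match: a sibling's output was passed into the failed call.
            if any(sibling["output"] == arg for arg in crash["inputs"]):
                return sibling, crash
    return crash, crash  # no match: treat the crash site as the root cause


root, crash = find_root_cause(EVENTS)
hint = (f"{root['step']} returned {root.get('output')!r}, "
        f"which was passed to {crash['step']}")
print(hint)
```

Matching by equality is only a heuristic: two unrelated steps can produce equal values, which is one reason the debug output is labeled a probable root cause.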

Replaying (flight-recorder replay):

  • Computes a rerun set: everything at or after the root cause gets re-run
  • Caches everything before the root cause (uses stored return values)
  • Spawns a fresh subprocess (no stale state, no duplicate registrations)
  • Re-imports your fixed code, calls the top-level function
  • Cached agents return instantly; re-run agents execute your new code
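
The rerun-set and cache steps above can be sketched like this. The timeline representation and the names plan_replay and call_step are hypothetical, as is the load_settings step; the sketch only shows the split between cached and re-run steps.

```python
def plan_replay(timeline, root_cause):
    """Split recorded steps into a cached set and an ordered rerun list.

    timeline: step names in recorded execution order (assumed representation).
    """
    start = timeline.index(root_cause)
    cached = set(timeline[:start])  # everything before the root cause
    rerun = list(timeline[start:])  # the root cause and everything after it
    return cached, rerun


def call_step(name, func, args, cached, stored_results):
    """Serve cached steps from stored results; execute new code otherwise."""
    if name in cached:
        return stored_results[name]  # func is not executed at all
    return func(*args)


timeline = ["load_settings", "get_contact", "score_lead"]
cached, rerun = plan_replay(timeline, "get_contact")
print(sorted(cached), rerun)

stored = {"load_settings": {"region": "us"}}
settings = call_step("load_settings", lambda: 1 / 0, (), cached, stored)
contact = call_step("get_contact", lambda email: {"email": email, "score": 0},
                    ("unknown@example.com",), cached, stored)
print(settings, contact)
```

The cached call never touches its function (the lambda that would divide by zero is not run), which is what lets slow or paid calls be skipped on replay.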

Key Design Decisions

Decision                           Why
SQLite WAL for storage             Zero setup, concurrent reads, survives crashes
Subprocess for replay              Clean process = no stale refs, no duplicate decorators
JSON-only for cached values        Strict boundary. No pickle. Replay-safe or display-only.
ContextVars for session tracking   Works with sync and async, propagates through await
Bounded write queue (10k max)      Prevents OOM under bursty workloads
fr.run() entry point               Explicit session metadata capture. No inspect.stack() hacks.
Error isolation                    FR bugs never crash your code. All internals wrapped in try/except.
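
To make the storage decisions more concrete, here is a minimal sketch that combines SQLite in WAL mode with a bounded write queue drained by a background thread. The EventStore class, its table layout, and the choice to drop events when the queue is full are assumptions; the source only states the 10k bound.

```python
import os
import queue
import sqlite3
import tempfile
import threading

MAX_QUEUE = 10_000  # bound taken from the table above


class EventStore:
    """Hypothetical store: a writer thread drains a bounded queue into SQLite."""

    def __init__(self, path):
        self.path = path
        self.dropped = 0
        self._queue = queue.Queue(maxsize=MAX_QUEUE)
        conn = sqlite3.connect(path)
        # WAL mode lets readers proceed while the writer appends events.
        self.journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE IF NOT EXISTS events (step TEXT, status TEXT)")
        conn.commit()
        conn.close()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def record(self, step, status):
        try:
            self._queue.put_nowait((step, status))  # never blocks the caller
        except queue.Full:
            self.dropped += 1  # assumed policy: shed load instead of growing memory

    def _drain(self):
        conn = sqlite3.connect(self.path)  # connection owned by the writer thread
        while True:
            item = self._queue.get()
            if item is None:  # sentinel pushed by close()
                break
            conn.execute("INSERT INTO events VALUES (?, ?)", item)
            conn.commit()
        conn.close()

    def close(self):
        self._queue.put(None)
        self._thread.join()


db_path = os.path.join(tempfile.mkdtemp(), "events.db")
store = EventStore(db_path)
store.record("agent://crm/lookup", "COMPLETED")
store.record("agent://crm/scorer", "FAILED")
store.close()

reader = sqlite3.connect(db_path)
rows = reader.execute("SELECT step, status FROM events ORDER BY rowid").fetchall()
reader.close()
print(store.journal_mode, rows)
```

Because record() only enqueues, the instrumented function never waits on disk I/O; the cost is that events may be lost under sustained overload in this sketch.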

CLI Reference

# Debug the most recent failed session
flight-recorder debug last

# Debug a specific session by ID
flight-recorder debug <session-id>

# Replay with fixed code (prompts for confirmation)
flight-recorder replay last

# Replay without confirmation
flight-recorder replay last --yes

# See what replay would do without executing
flight-recorder replay last --dry-run

# Manually specify where to replay from
flight-recorder replay last --from agent://crm/scorer

# List recent sessions
flight-recorder list

# Initialize Flight Recorder in current directory
flight-recorder init

API Reference

from flight_recorder import FlightRecorder

fr = FlightRecorder()

# Register an agent
@fr.register("agent://my-app/my-agent")
def my_agent(arg1, arg2):
    return {"args": [arg1, arg2]}         # placeholder: your agent logic goes here

# Register with replay protection (never re-runs on replay, always cached)
@fr.register("agent://my-app/send-email", replayable=False)
def send_email(to, subject, body):
    # This agent has side effects -- don't re-run on replay
    ...

# Run with session tracking (required for replay)
result = fr.run(my_agent, "arg1", "arg2")

# Clean up
fr.close()

Installation

pip install flight-recorder

Dependencies: typer, rich. No heavy frameworks.


Serialization

Flight Recorder stores agent inputs and outputs for debugging and replay. The serialization chain:

  1. JSON native types (dict, list, str, int, float, bool, None) -- replay-safe
  2. dataclasses (dataclasses.asdict()) -- replay-safe
  3. Pydantic models (.model_dump()) -- replay-safe
  4. Everything else (repr()) -- display-only, not replay-safe

If an agent returns something that isn't JSON-serializable, debug still works (you see the repr), but replay will re-run that agent instead of using the cached value.
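
The four-step chain above can be sketched as a single function that returns both the stored payload and whether it is replay-safe. The function name serialize is hypothetical, and Pydantic models are detected by duck typing on model_dump so the sketch needs only the standard library.

```python
import dataclasses
import json


def serialize(value):
    """Return (payload, replay_safe) following the chain described above."""
    try:
        return json.dumps(value), True  # 1. JSON-native types
    except (TypeError, ValueError):
        pass
    try:
        if dataclasses.is_dataclass(value) and not isinstance(value, type):
            return json.dumps(dataclasses.asdict(value)), True  # 2. dataclasses
        if callable(getattr(value, "model_dump", None)):
            return json.dumps(value.model_dump()), True  # 3. Pydantic-style models
    except (TypeError, ValueError):
        pass  # nested non-JSON fields fall through to repr()
    return repr(value), False  # 4. display-only, not replay-safe


@dataclasses.dataclass
class Contact:
    email: str
    score: int


print(serialize({"score": 75}))
print(serialize(Contact("sarah@acmecorp.com", 75)))
print(serialize({1, 2})[1])
```

A step whose return value comes back with replay_safe set to False would be re-run on replay rather than served from the cache.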


Limitations (v0.1)

  • Linear/tree call chains only. Complex mesh topologies with implicit shared state are not fully traced. Root cause detection is a heuristic, not universal causal inference.
  • Replay assumes determinism. If an agent has side effects (API calls, DB writes), re-running it on replay triggers them again. Use replayable=False for side-effectful agents.
  • Single-file scripts for replay. runpy.run_path() works for self-contained scripts. Package-based entry points are supported via runpy.run_module() but are less tested.
  • No schema migration. If you upgrade Flight Recorder and the event schema changes, old sessions may not be readable. This is a known v0.1 gap that is planned to be fixed.
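
For the single-file case mentioned above, running a script in a fresh interpreter via runpy.run_path() can be sketched as follows. The script contents and the inline worker command are hypothetical; Flight Recorder's own replay worker is not shown in this document.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical user script written to a temp directory for the example.
script = os.path.join(tempfile.mkdtemp(), "pipeline.py")
with open(script, "w") as f:
    f.write('if __name__ == "__main__":\n    print("pipeline ran")\n')

# A fresh interpreter re-reads the (possibly edited) script from disk,
# so no stale modules or earlier registrations carry over.
worker = "import runpy, sys; runpy.run_path(sys.argv[1], run_name='__main__')"
proc = subprocess.run(
    [sys.executable, "-c", worker, script],
    capture_output=True, text=True, check=True,
)
print(proc.stdout.strip())  # pipeline ran
```

Passing run_name='__main__' makes the script's main guard fire, which matches how the script behaves when launched directly.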

Project Structure

src/flight_recorder/
  __init__.py          # Public API: FlightRecorder
  core.py              # FlightRecorder facade, fr.run() entry point
  events.py            # Event dataclass, serialization helpers
  store.py             # SQLite WAL event store, bounded write queue
  recorder.py          # Decorator engine, ContextVar session tracking
  causal.py            # DAG builder, root cause heuristics
  replay.py            # Replay plan computation, subprocess orchestration
  _replay_worker.py    # Subprocess entry point for replay
  _registry.py         # Global agent registry for replay discovery
  cli/
    app.py             # Typer CLI entry point
    debug_cmd.py       # flight-recorder debug
    replay_cmd.py      # flight-recorder replay
    list_cmd.py        # flight-recorder list
    formatters.py      # Rich output formatting

License

MIT