GitHub - whitepaper27/Flight-Recorder: Replay debugger for AI agents — fix failures and replay from the exact failure point, skipping what already worked

Fix one bug. Don't rerun everything.

Debugging AI agents today means re-running slow APIs, repeating LLM calls, and restarting full workflows — just to test a one-line fix.

Flight Recorder replays from the exact failure point. Everything that already worked is cached.

  FAIL score_lead
    |
  WHY? get_contact returned None
    |
  FIX  your code
    |
  REPLAY from failure (cached steps skipped)
    |
  SUCCESS

$ flight-recorder debug last

Probable root cause (direct match):
  agent://crm/lookup(...) -> None
  agent://crm/scorer(...) -> AssertionError

Hint: agent://crm/lookup returned None, but agent://crm/scorer expects a dict

$ flight-recorder replay last --yes

Replay plan:
  Re-running: agent://crm/lookup (root cause)
  Re-running: agent://crm/scorer (downstream)

SUCCESS

Quick Start

pip install flight-recorder
python examples/basic_demo.py          # watch it fail
flight-recorder debug last             # see root cause
# fix the bug in basic_demo.py
flight-recorder replay last --yes      # verify fix works

How It Plugs Into Your Code

Two ways to use it:

fr.trace() — zero code changes, instant debug

from flight_recorder import FlightRecorder
fr = FlightRecorder()

with fr.trace():
    result = your_pipeline(input_data)    # all calls recorded automatically
# then: flight-recorder debug last

Records everything. Shows root cause. No decorators needed. Great for a first try.

@fr.register — add decorators, unlock replay

@fr.register("get_contact")
def get_contact(email):
    return db.lookup(email)               # your code, unchanged

@fr.register("send_email", replayable=False)
def send_email(to, body):
    return smtp.send(to, body)            # never re-run on replay

fr.run(your_pipeline, input_data)
# then: flight-recorder debug last  AND  flight-recorder replay last

Same recording + debug, plus replay from failure point with cached results.

Feature	`fr.trace()`	`@fr.register`
Record events	yes	yes
Debug (root cause)	yes	yes
Replay from failure	no	yes
Code changes needed	none	add decorators

Under the hood, Flight Recorder:

Records inputs and outputs of each step
Records errors with full tracebacks
Tracks who called whom (parent-child relationships)
Stores everything in SQLite (zero config, no external services)

What it does not do: modify your function's behavior, catch your exceptions, or add latency to your pipeline. If Flight Recorder itself crashes, your code still runs normally.

Mental model: @fr.register = step-level instrumentation + failure tracing + selective replay.
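
To make the "records but never interferes" idea concrete, here is a minimal sketch of a recording decorator. It is not Flight Recorder's actual implementation: make_recorder, safe_record, and the events table layout are hypothetical names chosen for illustration. The sketch assumes that recording errors are swallowed while the wrapped function's own exceptions are re-raised untouched.

```python
import functools
import json
import sqlite3
import traceback

def make_recorder(db_path=":memory:"):
    """Hypothetical helper: returns a register() decorator and its SQLite connection."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (step TEXT, status TEXT, payload TEXT)")

    def safe_record(step, status, payload):
        # Any failure while recording is swallowed so the wrapped code is unaffected.
        try:
            conn.execute(
                "INSERT INTO events VALUES (?, ?, ?)",
                (step, status, json.dumps(payload, default=repr)),
            )
            conn.commit()
        except Exception:
            pass

    def register(step):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                safe_record(step, "STARTED", {"args": args, "kwargs": kwargs})
                try:
                    result = func(*args, **kwargs)
                except Exception:
                    safe_record(step, "FAILED", {"traceback": traceback.format_exc()})
                    raise  # the caller still sees the original exception
                safe_record(step, "COMPLETED", {"result": result})
                return result
            return wrapper
        return decorator

    return register, conn
```

The wrapped function returns the same value and raises the same exceptions as before; the only addition is the rows written to the events table.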

30-Second Demo

from flight_recorder import FlightRecorder

fr = FlightRecorder()

@fr.register("agent://crm/lookup")
def crm_lookup(email):
    return None  # Bug: returns None for unknown contacts

@fr.register("agent://crm/scorer")
def score_lead(contact):
    assert contact is not None, "Contact required"
    return contact["score"]

@fr.register("agent://crm/qualifier")
def qualify_lead(email):
    contact = crm_lookup(email)
    score = score_lead(contact)
    return {"qualified": score > 70}

fr.run(qualify_lead, "unknown@example.com")

It crashes. Now debug it:

$ flight-recorder debug last

Probable root cause (direct match):
  agent://crm/lookup(...) -> None
  agent://crm/scorer(...) -> AssertionError

Timeline:
  [0000.000] [OK]   agent://crm/lookup -> None
  [0000.001] [FAIL] agent://crm/scorer -> AssertionError: Contact required

Hint:
  agent://crm/lookup returned None, but agent://crm/scorer expects a dict

Fix the bug (return a default record instead of None), then replay:
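
One possible shape for that fix is sketched below. The @fr.register decorators are left out so the snippet runs on its own, and the default record (a score of 0 for unknown contacts) is an assumption made for illustration, not something the demo prescribes.

```python
def crm_lookup(email):
    # Assumed fix: unknown contacts get a default record instead of None.
    return {"email": email, "score": 0}

def score_lead(contact):
    assert contact is not None, "Contact required"
    return contact["score"]

def qualify_lead(email):
    contact = crm_lookup(email)
    score = score_lead(contact)
    return {"qualified": score > 70}

result = qualify_lead("unknown@example.com")
```

With this default, the assertion in score_lead passes and the lead is simply not qualified.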

$ flight-recorder replay last --yes

Replay plan:
  Re-running: agent://crm/lookup (root cause -- code changed)
  Re-running: agent://crm/scorer (downstream of root cause)

SUCCESS
Final result: {"qualified": false}

No full re-run. Just the fixed agent and its downstream dependencies.

Real-World Example: CRM Pipeline with LLM Calls

Flight Recorder isn't just for toy demos. Here it is running on a real 5-agent pipeline with GPT-4o-mini and a SQLite CRM database:

Qualifier (orchestrator)
  |
  +---> CRM Lookup (SQLite database)
  +---> Web Enricher (enrichment API)
  |         |
  +-------->+
  |
  +---> LLM Scorer (GPT-4o-mini) ---> Email Drafter (GPT-4o-mini)

# Run with a known contact -- full pipeline succeeds
$ python -m examples.langgraph_crm.run sarah@acmecorp.com
Lead QUALIFIED (score: 75)
Draft email: "Hi Sarah, I hope this message finds you well..."

# Run with an unknown contact -- crashes at scorer
$ python -m examples.langgraph_crm.run unknown@nowhere.com
Agent FAILED: AssertionError: Contact data required for scoring

# Debug -- Flight Recorder identifies the root cause
$ flight-recorder debug last
Probable root cause (direct match):
  agent://crm/lookup returned None
  agent://crm/scorer expects a dict

# Fix crm_lookup, then replay -- results from the web enricher are cached
$ flight-recorder replay last --yes
SUCCESS (score: 10, qualified: false)

The web enricher calls that already succeeded are cached and reused on replay. No wasted API credits.

Architecture

YOUR CODE                           FLIGHT RECORDER
+------------------+                +---------------------------+
|                  |                |                           |
| @fr.register()   |---record--->   | Event Store (SQLite WAL)  |
| your agent funcs |                | - STARTED events          |
|                  |                | - COMPLETED + return vals |
| fr.run(top_func) |                | - FAILED + tracebacks     |
|                  |                |                           |
+------------------+                +---------------------------+
                                              |
                                    +---------+---------+
                                    |                   |
                              +-----v-----+     +------v------+
                              | Causal    |     | Replay      |
                              | Analysis  |     | Engine      |
                              |           |     |             |
                              | - DAG     |     | - Cache     |
                              | - Root    |     | - Subprocess|
                              |   cause   |     | - Rerun set |
                              | - Hints   |     | - Dry-run   |
                              +-----------+     +-------------+
                                    |                   |
                              +-----v-----+     +------v------+
                              | CLI:      |     | CLI:        |
                              | debug     |     | replay      |
                              +-----------+     +-------------+

How It Works

Recording (@fr.register decorator):

Wraps each agent function (sync or async)
Records STARTED, COMPLETED, and FAILED events with full payloads
Tracks parent-child relationships via ContextVars (who called whom)
Stores everything in SQLite with WAL mode (concurrent reads, async writes)
Error-isolated: if Flight Recorder fails internally, your code still runs normally

Debugging (flight-recorder debug):

Builds a DAG of agent calls from recorded events
Identifies the deepest failure (the crash site)
Traces the root cause by matching sibling outputs to failed inputs
Generates hints: "agent X returned None, but agent Y expects a dict"
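
The debugging steps above can be sketched as a small heuristic over recorded calls. The record layout (id, parent, status, inputs, output) and the function names are assumptions for illustration; the real analysis in causal.py may differ.

```python
def depth(call, by_id):
    """Number of ancestors between this call and the top-level call."""
    d = 0
    while call["parent"] is not None:
        call = by_id[call["parent"]]
        d += 1
    return d

def probable_root_cause(calls):
    """Return (cause, crash): the deepest failure and the sibling that fed it."""
    by_id = {c["id"]: c for c in calls}
    failed = [c for c in calls if c["status"] == "FAILED"]
    if not failed:
        return None
    crash = max(failed, key=lambda c: depth(c, by_id))
    for sibling in calls:
        if (sibling["parent"] == crash["parent"]
                and sibling["id"] != crash["id"]
                and sibling["status"] == "COMPLETED"
                and any(sibling["output"] == arg for arg in crash["inputs"])):
            return sibling, crash
    return crash, crash  # no matching sibling: blame the crash site itself

def hint(cause, crash):
    if cause is not crash and cause["output"] is None:
        return f"{cause['name']} returned None, which {crash['name']} received as input"
    return None

calls = [
    {"id": 1, "parent": None, "name": "agent://crm/qualifier", "status": "FAILED",
     "inputs": ["unknown@example.com"], "output": None},
    {"id": 2, "parent": 1, "name": "agent://crm/lookup", "status": "COMPLETED",
     "inputs": ["unknown@example.com"], "output": None},
    {"id": 3, "parent": 1, "name": "agent://crm/scorer", "status": "FAILED",
     "inputs": [None], "output": None},
]
cause, crash = probable_root_cause(calls)
```

On the demo data, the qualifier also fails because the exception propagates, but the scorer is deeper, so it is treated as the crash site and the lookup as the probable cause.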

Replaying (flight-recorder replay):

Computes a rerun set: everything at or after the root cause gets re-run
Caches everything before the root cause (uses stored return values)
Spawns a fresh subprocess (no stale state, no duplicate registrations)
Re-imports your fixed code, calls the top-level function
Cached agents return instantly; re-run agents execute your new code
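
A rough sketch of the rerun-set idea follows, assuming a simple ordered timeline of step names and a dict of stored return values. compute_rerun_set and run_step are hypothetical helpers, and the subprocess and re-import steps are omitted.

```python
def compute_rerun_set(timeline, root_cause):
    """Everything at or after the root cause gets re-run."""
    start = timeline.index(root_cause)
    return set(timeline[start:])

def run_step(name, func, args, rerun_set, cache):
    """Return the stored value for cached steps; execute the new code otherwise."""
    if name not in rerun_set and name in cache:
        return cache[name]
    return func(*args)

timeline = ["enricher", "lookup", "scorer"]
cache = {"enricher": {"company": "Acme"}}  # stored return values from the failed run
executed = []

def enricher(email):
    executed.append("enricher")
    return {"company": "fresh"}

def lookup(email):
    executed.append("lookup")
    return {"email": email, "score": 10}

rerun = compute_rerun_set(timeline, "lookup")
enriched = run_step("enricher", enricher, ("a@example.com",), rerun, cache)
contact = run_step("lookup", lookup, ("a@example.com",), rerun, cache)
```

Here the enricher is never executed: its stored value is returned, while lookup runs the new code because it sits in the rerun set.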

Key Design Decisions

Decision	Why
SQLite WAL for storage	Zero setup, concurrent reads, survives crashes
Subprocess for replay	Clean process = no stale refs, no duplicate decorators
JSON-only for cached values	Strict boundary. No pickle. Replay-safe or display-only.
ContextVars for session tracking	Works with sync and async, propagates through `await`
Bounded write queue (10k max)	Prevents OOM under bursty workloads
`fr.run()` entry point	Explicit session metadata capture. No `inspect.stack()` hacks.
Error isolation	FR bugs never crash your code. All internals wrapped in try/except.
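
As an illustration of the storage decisions in the table, the sketch below combines a bounded queue with a background writer that puts SQLite into WAL mode. BoundedEventWriter is a hypothetical class, and dropping events when the queue is full is an assumption; the table only states that the queue is bounded, not what happens at the limit.

```python
import queue
import sqlite3
import threading

class BoundedEventWriter:
    """Hypothetical writer: callers enqueue events, one thread writes them."""

    def __init__(self, db_path, max_pending=10_000):
        self._queue = queue.Queue(maxsize=max_pending)
        self._db_path = db_path
        self.dropped = 0
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, name, status):
        try:
            self._queue.put_nowait((name, status))
        except queue.Full:
            self.dropped += 1  # assumption: drop rather than block the caller

    def _drain(self):
        # The connection lives entirely on the writer thread.
        conn = sqlite3.connect(self._db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE IF NOT EXISTS events (name TEXT, status TEXT)")
        while True:
            item = self._queue.get()
            if item is None:  # sentinel from close()
                break
            conn.execute("INSERT INTO events VALUES (?, ?)", item)
            conn.commit()
        conn.close()

    def close(self):
        self._queue.put(None)
        self._thread.join()
```

Because the queue has a fixed capacity, a burst of events cannot grow memory without limit, and the pipeline thread only pays for a non-blocking enqueue.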

CLI Reference

# Debug the most recent failed session
flight-recorder debug last

# Debug a specific session by ID
flight-recorder debug <session-id>

# Replay with fixed code (prompts for confirmation)
flight-recorder replay last

# Replay without confirmation
flight-recorder replay last --yes

# See what replay would do without executing
flight-recorder replay last --dry-run

# Manually specify where to replay from
flight-recorder replay last --from agent://crm/scorer

# List recent sessions
flight-recorder list

# Initialize Flight Recorder in current directory
flight-recorder init

API Reference

from flight_recorder import FlightRecorder

fr = FlightRecorder()

# Register an agent
@fr.register("agent://my-app/my-agent")
def my_agent(arg1, arg2):
    return result

# Register with replay protection (never re-runs on replay, always cached)
@fr.register("agent://my-app/send-email", replayable=False)
def send_email(to, subject, body):
    # This agent has side effects -- don't re-run on replay
    ...

# Run with session tracking (required for replay)
result = fr.run(my_agent, "arg1", "arg2")

# Clean up
fr.close()

Installation

pip install flight-recorder

Dependencies: typer, rich. No heavy frameworks.

Serialization

Flight Recorder stores agent inputs and outputs for debugging and replay. The serialization chain:

JSON native types (dict, list, str, int, float, bool, None) -- replay-safe
dataclasses (dataclasses.asdict()) -- replay-safe
Pydantic models (.model_dump()) -- replay-safe
Everything else (repr()) -- display-only, not replay-safe

If an agent returns something that falls through to repr() (not JSON-native, not a dataclass, not a Pydantic model), debug still works (you see the repr), but replay will re-run that agent instead of using the cached value.
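
The chain can be sketched as a single function that returns both the stored payload and whether it is replay-safe. The function name serialize and the (payload, replay_safe) return shape are assumptions for illustration; Pydantic models are detected by duck typing on model_dump so the sketch has no third-party imports.

```python
import dataclasses
import json

def serialize(value):
    """Return (payload, replay_safe) following the chain described above."""
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        value = dataclasses.asdict(value)
    elif callable(getattr(value, "model_dump", None)):
        value = value.model_dump()  # Pydantic-style models
    try:
        return json.dumps(value), True
    except (TypeError, ValueError):
        # Not JSON-serializable: keep a repr for display only.
        return repr(value), False

@dataclasses.dataclass
class Contact:
    email: str
    score: int

safe_payload, safe = serialize(Contact("a@example.com", 5))
unsafe_payload, unsafe = serialize({1, 2})  # sets are not JSON-serializable
```

A replay engine built on this would reuse safe_payload as a cached value, and would have to re-run any agent whose output came back with the replay-safe flag set to False.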

Limitations (v0.1)

Linear/tree call chains only. Complex mesh topologies with implicit shared state are not fully traced. Root cause detection is a heuristic, not universal causal inference.
Replay assumes determinism. If a re-run agent has side effects (API calls, DB writes), replay will trigger them again. Mark side-effectful agents with replayable=False so they are never re-run.
Single-file scripts for replay. runpy.run_path() works for self-contained scripts. Package-based entry points are supported via run_module() but are less tested.
No schema migration. If you upgrade Flight Recorder and the event schema changes, old sessions may not be readable. This is a known v0.1 gap and will be fixed.

Project Structure

src/flight_recorder/
  __init__.py          # Public API: FlightRecorder
  core.py              # FlightRecorder facade, fr.run() entry point
  events.py            # Event dataclass, serialization helpers
  store.py             # SQLite WAL event store, bounded write queue
  recorder.py          # Decorator engine, ContextVar session tracking
  causal.py            # DAG builder, root cause heuristics
  replay.py            # Replay plan computation, subprocess orchestration
  _replay_worker.py    # Subprocess entry point for replay
  _registry.py         # Global agent registry for replay discovery
  cli/
    app.py             # Typer CLI entry point
    debug_cmd.py       # flight-recorder debug
    replay_cmd.py      # flight-recorder replay
    list_cmd.py        # flight-recorder list
    formatters.py      # Rich output formatting

License

MIT