Fix one bug. Don't rerun everything.
Debugging AI agents today means re-running slow APIs, repeating LLM calls, and restarting full workflows — just to test a one-line fix.
Flight Recorder replays from the exact failure point. Everything that already worked is cached.
FAIL score_lead
|
WHY? get_contact returned None
|
FIX your code
|
REPLAY from failure (cached steps skipped)
|
SUCCESS
$ flight-recorder debug last
Probable root cause (direct match):
agent://crm/lookup(...) -> None
agent://crm/scorer(...) -> AssertionError
Hint: agent://crm/lookup returned None, but agent://crm/scorer expects a dict

$ flight-recorder replay last --yes
Replay plan:
Re-running: agent://crm/lookup (root cause)
Re-running: agent://crm/scorer (downstream)
SUCCESS
Quick Start
pip install flight-recorder
python examples/basic_demo.py       # watch it fail
flight-recorder debug last          # see root cause
# fix the bug in basic_demo.py
flight-recorder replay last --yes   # verify fix works
How It Plugs Into Your Code
Two ways to use it:
fr.trace() — zero code changes, instant debug
from flight_recorder import FlightRecorder

fr = FlightRecorder()

with fr.trace():
    result = your_pipeline(input_data)  # all calls recorded automatically

# then: flight-recorder debug last
Records everything. Shows root cause. No decorators needed. A good first step.
@fr.register — add decorators, unlock replay
@fr.register("get_contact")
def get_contact(email):
    return db.lookup(email)  # your code, unchanged

@fr.register("send_email", replayable=False)
def send_email(to, body):
    return smtp.send(to, body)  # never re-run on replay

fr.run(your_pipeline, input_data)

# then: flight-recorder debug last AND flight-recorder replay last
Same recording + debug, plus replay from failure point with cached results.
| Feature | fr.trace() | @fr.register |
|---|---|---|
| Record events | yes | yes |
| Debug (root cause) | yes | yes |
| Replay from failure | no | yes |
| Code changes needed | none | add decorators |
Under the hood, Flight Recorder:
- Records inputs and outputs of each step
- Records errors with full tracebacks
- Tracks who called whom (parent-child relationships)
- Stores everything in SQLite (zero config, no external services)
What it does not do: modify your function's behavior, catch your exceptions, or add latency to your pipeline. If Flight Recorder itself crashes, your code still runs normally.
Mental model: @fr.register = step-level instrumentation + failure tracing + selective replay.
30-Second Demo
from flight_recorder import FlightRecorder

fr = FlightRecorder()

@fr.register("agent://crm/lookup")
def crm_lookup(email):
    return None  # Bug: returns None for unknown contacts

@fr.register("agent://crm/scorer")
def score_lead(contact):
    assert contact is not None, "Contact required"
    return contact["score"]

@fr.register("agent://crm/qualifier")
def qualify_lead(email):
    contact = crm_lookup(email)
    score = score_lead(contact)
    return {"qualified": score > 70}

fr.run(qualify_lead, "unknown@example.com")
It crashes. Now debug it:
$ flight-recorder debug last
Probable root cause (direct match):
agent://crm/lookup(...) -> None
agent://crm/scorer(...) -> AssertionError
Timeline:
[0000.000] [OK] agent://crm/lookup -> None
[0000.001] [FAIL] agent://crm/scorer -> AssertionError: Contact required
Hint: agent://crm/lookup returned None, but agent://crm/scorer expects a dict
Fix the bug (return a default record instead of None), then replay:
$ flight-recorder replay last --yes
Replay plan:
Re-running: agent://crm/lookup (root cause -- code changed)
Re-running: agent://crm/scorer (downstream of root cause)
SUCCESS
Final result: {"qualified": false}

No full re-run. Just the fixed agent and its downstream dependencies.
Real-World Example: CRM Pipeline with LLM Calls
Flight Recorder isn't just for toy demos. Here it is running on a real 5-agent pipeline with GPT-4o-mini and a SQLite CRM database:
Qualifier (orchestrator)
|
+---> CRM Lookup (SQLite database)
+---> Web Enricher (enrichment API)
| |
+-------->+
|
+---> LLM Scorer (GPT-4o-mini) ---> Email Drafter (GPT-4o-mini)
# Run with a known contact -- full pipeline succeeds
$ python -m examples.langgraph_crm.run sarah@acmecorp.com
Lead QUALIFIED (score: 75)
Draft email: "Hi Sarah, I hope this message finds you well..."

# Run with an unknown contact -- crashes at scorer
$ python -m examples.langgraph_crm.run unknown@nowhere.com
Agent FAILED: AssertionError: Contact data required for scoring

# Debug -- Flight Recorder identifies the root cause
$ flight-recorder debug last
Probable root cause (direct match):
agent://crm/lookup returned None
agent://crm/scorer expects a dict

# Fix crm_lookup, then replay -- LLM calls from enricher are cached
$ flight-recorder replay last --yes
SUCCESS (score: 10, qualified: false)
The LLM calls that already succeeded (web enricher) are cached and reused on replay. No wasted API credits.
Architecture
YOUR CODE FLIGHT RECORDER
+------------------+ +---------------------------+
| | | |
| @fr.register() |---record---> | Event Store (SQLite WAL) |
| your agent funcs | | - STARTED events |
| | | - COMPLETED + return vals |
| fr.run(top_func) | | - FAILED + tracebacks |
| | | |
+------------------+ +---------------------------+
|
+---------+---------+
| |
+-----v-----+ +------v------+
| Causal | | Replay |
| Analysis | | Engine |
| | | |
| - DAG | | - Cache |
| - Root | | - Subprocess|
| cause | | - Rerun set |
| - Hints | | - Dry-run |
+-----------+ +-------------+
| |
+-----v-----+ +------v------+
| CLI: | | CLI: |
| debug | | replay |
+-----------+ +-------------+
How It Works
Recording (@fr.register decorator):
- Wraps each agent function (sync or async)
- Records STARTED, COMPLETED, and FAILED events with full payloads
- Tracks parent-child relationships via ContextVars (who called whom)
- Stores everything in SQLite with WAL mode (concurrent reads, async writes)
- Error-isolated: if Flight Recorder fails internally, your code still runs normally
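The recording steps above can be sketched with nothing but the standard library. This is an illustrative stand-in, not Flight Recorder's actual implementation: the `register` function, the `EVENTS` list, and the event field names are all assumptions made for the example, and it handles sync functions only.

```python
import contextvars
import functools
import itertools
import traceback

# Hypothetical in-memory event log (the real store is SQLite).
EVENTS = []
_ids = itertools.count(1)
# Call id of the step currently executing; None at the top level.
_current_call = contextvars.ContextVar("current_call", default=None)


def _record(kind, call_id, name, parent, payload):
    """Append an event without ever letting a recorder bug reach user code."""
    try:
        EVENTS.append({"kind": kind, "id": call_id, "name": name,
                       "parent": parent, "payload": payload})
    except Exception:
        pass  # error isolation: recording failures are swallowed


def register(name):
    """Illustrative step decorator (assumed shape, sync functions only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            call_id = next(_ids)
            parent = _current_call.get()  # who called us
            _record("STARTED", call_id, name, parent,
                    {"args": args, "kwargs": kwargs})
            token = _current_call.set(call_id)
            try:
                result = func(*args, **kwargs)
            except Exception:
                _record("FAILED", call_id, name, parent, traceback.format_exc())
                raise  # the user's exception propagates unchanged
            finally:
                _current_call.reset(token)
            _record("COMPLETED", call_id, name, parent, result)
            return result
        return wrapper
    return decorator


@register("lookup")
def lookup(email):
    return None


@register("scorer")
def scorer(contact):
    assert contact is not None, "Contact required"
    return contact["score"]


@register("qualifier")
def qualifier(email):
    return scorer(lookup(email))


try:
    qualifier("unknown@example.com")
except AssertionError:
    pass

summary = [(e["kind"], e["name"], e["parent"]) for e in EVENTS]
```

Each event carries the id of its parent call, which is the information a DAG builder needs later. The `ContextVar` is reset in `finally`, so a failing step does not leave a stale parent behind.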
Debugging (flight-recorder debug):
- Builds a DAG of agent calls from recorded events
- Identifies the deepest failure (the crash site)
- Traces the root cause by matching sibling outputs to failed inputs
- Generates hints: "agent X returned None, but agent Y expects a dict"
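A rough sketch of the heuristic described above: pick the deepest failed call, then look for an earlier sibling whose output equals one of the failed call's inputs. The `Call` record, its field names, and the hint wording are hypothetical, chosen for this example; the real event schema and matching rules may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Call:
    """One recorded step. Field names are illustrative, not the real schema."""
    id: int
    name: str
    parent: Optional[int]
    status: str          # "COMPLETED" or "FAILED"
    args: tuple = ()
    result: Any = None


def depth(call, by_id):
    """Number of ancestors between this call and the top-level call."""
    d = 0
    while call.parent is not None:
        call = by_id[call.parent]
        d += 1
    return d


def find_root_cause(calls):
    """Return (crash_site, probable_cause) using a sibling-output match."""
    by_id = {c.id: c for c in calls}
    failures = [c for c in calls if c.status == "FAILED"]
    if not failures:
        return None, None
    crash = max(failures, key=lambda c: depth(c, by_id))  # deepest failure
    for sibling in calls:
        if (sibling.parent == crash.parent
                and sibling.id < crash.id
                and sibling.status == "COMPLETED"
                and any(sibling.result == arg for arg in crash.args)):
            return crash, sibling  # its output fed the failing call
    return crash, crash  # no upstream match: blame the crash site itself


calls = [
    Call(1, "agent://crm/qualifier", None, "FAILED", ("unknown@example.com",)),
    Call(2, "agent://crm/lookup", 1, "COMPLETED", ("unknown@example.com",), None),
    Call(3, "agent://crm/scorer", 1, "FAILED", (None,)),
]
crash, cause = find_root_cause(calls)
hint = f"{cause.name} returned {cause.result!r}, which {crash.name} received as input"
```

Matching by equality is deliberately naive: two siblings returning the same value would be indistinguishable, which is one reason to treat the result as a probable cause and not a proof.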
Replaying (flight-recorder replay):
- Computes a rerun set: everything at or after the root cause gets re-run
- Caches everything before the root cause (uses stored return values)
- Spawns a fresh subprocess (no stale state, no duplicate registrations)
- Re-imports your fixed code, calls the top-level function
- Cached agents return instantly; re-run agents execute your new code
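The cache-or-rerun split can be illustrated in a few lines. This sketch assumes each step runs once per session and is keyed by name, and it skips the subprocess and re-import steps entirely; `plan_replay` and `replaying` are hypothetical helpers, not part of the Flight Recorder API.

```python
def plan_replay(timeline, root_cause):
    """Split recorded steps into (cached, rerun) around the root cause."""
    idx = timeline.index(root_cause)
    return timeline[:idx], timeline[idx:]


def replaying(name, func, stored_results, rerun):
    """Wrap a step so it either re-executes or returns its stored result."""
    def wrapper(*args, **kwargs):
        if name in rerun:
            return func(*args, **kwargs)   # new code runs
        return stored_results[name]        # cached: returns instantly
    return wrapper


calls_made = []


def enrich(email):
    calls_made.append("enrich")            # would hit a slow API
    return {"company": "fresh"}


def lookup(email):
    calls_made.append("lookup")
    return {"score": 10}                   # the fix: default record, not None


def scorer(contact):
    calls_made.append("scorer")
    return contact["score"]


timeline = ["enrich", "lookup", "scorer"]  # recorded execution order
stored = {"enrich": {"company": "Acme"}, "lookup": None}
cached, rerun = plan_replay(timeline, "lookup")

enrich_r = replaying("enrich", enrich, stored, rerun)
lookup_r = replaying("lookup", lookup, stored, rerun)
scorer_r = replaying("scorer", scorer, stored, rerun)


def qualify(email):
    company = enrich_r(email)
    score = scorer_r(lookup_r(email))
    return {"qualified": score > 70, "company": company["company"]}


outcome = qualify("unknown@example.com")
```

Here `enrich` is never invoked during the replay: its stored value is returned instead, while `lookup` and `scorer` execute the fixed code.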
Key Design Decisions
| Decision | Why |
|---|---|
| SQLite WAL for storage | Zero setup, concurrent reads, survives crashes |
| Subprocess for replay | Clean process = no stale refs, no duplicate decorators |
| JSON-only for cached values | Strict boundary. No pickle. Replay-safe or display-only. |
| ContextVars for session tracking | Works with sync and async, propagates through await |
| Bounded write queue (10k max) | Prevents OOM under bursty workloads |
| fr.run() entry point | Explicit session metadata capture. No inspect.stack() hacks. |
| Error isolation | FR bugs never crash your code. All internals wrapped in try/except. |
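To make the first and fifth rows concrete, here is a minimal sketch of a WAL-mode SQLite store fed by a bounded queue. The table layout and the `BoundedWriter` class are assumptions for illustration. In particular, dropping events when the queue is full is one possible policy; the table above does not say whether Flight Recorder drops or blocks. The sketch also flushes synchronously to stay deterministic.

```python
import os
import queue
import sqlite3
import tempfile


class BoundedWriter:
    """Hypothetical writer: buffers events in a capped queue, then flushes."""

    def __init__(self, conn, maxsize=10_000):
        self.conn = conn
        self.pending = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def enqueue(self, kind, name, payload):
        try:
            self.pending.put_nowait((kind, name, payload))
        except queue.Full:
            self.dropped += 1  # assumed policy: shed load, never block the caller

    def flush(self):
        rows = []
        while not self.pending.empty():
            rows.append(self.pending.get_nowait())
        self.conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        self.conn.commit()
        return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "events.db"))
    # WAL needs a file-backed database; the pragma returns the active mode.
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("CREATE TABLE events (kind TEXT, name TEXT, payload TEXT)")

    writer = BoundedWriter(conn, maxsize=2)  # tiny cap to show the bound
    for step in ("lookup", "scorer", "qualifier"):
        writer.enqueue("STARTED", step, "{}")
    written = writer.flush()
    stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.close()
```

With a cap of 2, the third event is counted in `dropped` and never reaches the database, so memory use stays bounded however fast events arrive.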
CLI Reference
# Debug the most recent failed session
flight-recorder debug last

# Debug a specific session by ID
flight-recorder debug <session-id>

# Replay with fixed code (prompts for confirmation)
flight-recorder replay last

# Replay without confirmation
flight-recorder replay last --yes

# See what replay would do without executing
flight-recorder replay last --dry-run

# Manually specify where to replay from
flight-recorder replay last --from agent://crm/scorer

# List recent sessions
flight-recorder list

# Initialize Flight Recorder in current directory
flight-recorder init
API Reference
from flight_recorder import FlightRecorder

fr = FlightRecorder()

# Register an agent
@fr.register("agent://my-app/my-agent")
def my_agent(arg1, arg2):
    result = f"{arg1}:{arg2}"  # placeholder for your own logic
    return result

# Register with replay protection (never re-runs on replay, always cached)
@fr.register("agent://my-app/send-email", replayable=False)
def send_email(to, subject, body):
    # This agent has side effects -- don't re-run on replay
    ...

# Run with session tracking (required for replay)
result = fr.run(my_agent, "arg1", "arg2")

# Clean up
fr.close()
Installation
pip install flight-recorder
Dependencies: typer, rich. No heavy frameworks.
Serialization
Flight Recorder stores agent inputs and outputs for debugging and replay. The serialization chain:
- JSON native types (dict, list, str, int, float, bool, None) -- replay-safe
- dataclasses (dataclasses.asdict()) -- replay-safe
- Pydantic models (.model_dump()) -- replay-safe
- Everything else (repr()) -- display-only, not replay-safe
If an agent returns something that isn't JSON-serializable, debug still works (you see the repr), but replay will re-run that agent instead of using the cached value.
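The chain above can be approximated as a single function that returns both the payload and a replay-safe flag. This is a sketch under assumptions: the `serialize` name and its return shape are made up for the example, and `FakeModel` stands in for a Pydantic model so the snippet runs without Pydantic installed.

```python
import dataclasses
import json


def _is_json(value):
    """True if the standard json module can encode the value."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


def serialize(value):
    """Return (payload, replay_safe), trying each step of the chain in order."""
    if _is_json(value):
        return value, True
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        as_dict = dataclasses.asdict(value)
        if _is_json(as_dict):
            return as_dict, True
    model_dump = getattr(value, "model_dump", None)  # Pydantic-style models
    if callable(model_dump):
        dumped = model_dump()
        if _is_json(dumped):
            return dumped, True
    return repr(value), False  # display-only fallback


@dataclasses.dataclass
class Contact:
    email: str
    score: int


class FakeModel:
    """Stand-in exposing a Pydantic-like model_dump()."""

    def model_dump(self):
        return {"qualified": False}


native = serialize({"score": 75})
from_dataclass = serialize(Contact("sarah@acmecorp.com", 75))
from_model = serialize(FakeModel())
fallback = serialize(3 + 4j)
```

A value that ends up on the `repr()` path is still visible in debug output, but the `False` flag is what would tell a replay engine to re-run that step instead of trusting the stored value.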
Limitations (v0.1)
- Linear/tree call chains only. Complex mesh topologies with implicit shared state are not fully traced. Root cause detection is a heuristic, not universal causal inference.
- Replay assumes determinism. If your agent has side effects (API calls, DB writes), replay will re-trigger them. Use replayable=False for side-effectful agents.
- Single-file scripts for replay. runpy.run_path() works for self-contained scripts. Package-based entry points supported via run_module() but less tested.
- No schema migration. If you upgrade Flight Recorder and the event schema changes, old sessions may not be readable. v0.1 -- this will be fixed.
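For context on the single-file limitation, the snippet below shows what the standard library's `runpy.run_path()` does with a self-contained script: it executes the file from source on each call and returns the resulting module globals, so an edited file takes effect on the next run. The script contents and the `RESULT` variable are invented for the example, and the replay worker's real invocation (in a fresh subprocess) is not shown.

```python
import os
import runpy
import tempfile

# Two versions of the same hypothetical single-file pipeline.
BUGGY = (
    "def crm_lookup(email):\n"
    "    return None\n"
    "\n"
    "RESULT = crm_lookup('unknown@example.com')\n"
)
FIXED = (
    "def crm_lookup(email):\n"
    "    return {'score': 10}\n"
    "\n"
    "RESULT = crm_lookup('unknown@example.com')\n"
)

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "pipeline.py")

    with open(script, "w") as fh:
        fh.write(BUGGY)
    before = runpy.run_path(script, run_name="__main__")["RESULT"]

    with open(script, "w") as fh:
        fh.write(FIXED)  # simulate fixing the bug on disk
    after = runpy.run_path(script, run_name="__main__")["RESULT"]
```

This works because everything the script needs lives in one file. Code spread across a package goes through the import system, which is the less-tested `run_module()` path mentioned above.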
Project Structure
src/flight_recorder/
__init__.py # Public API: FlightRecorder
core.py # FlightRecorder facade, fr.run() entry point
events.py # Event dataclass, serialization helpers
store.py # SQLite WAL event store, bounded write queue
recorder.py # Decorator engine, ContextVar session tracking
causal.py # DAG builder, root cause heuristics
replay.py # Replay plan computation, subprocess orchestration
_replay_worker.py # Subprocess entry point for replay
_registry.py # Global agent registry for replay discovery
cli/
app.py # Typer CLI entry point
debug_cmd.py # flight-recorder debug
replay_cmd.py # flight-recorder replay
list_cmd.py # flight-recorder list
formatters.py # Rich output formatting
License
MIT
