Server-side flight recorder for PostgreSQL. Answers "what was happening in my database?"
pg_flight_recorder continuously samples PostgreSQL system state in the background via pg_cron -- no external agents, sidecars, or polling required. It captures wait events, active sessions, locks, WAL activity, checkpoints, I/O, table and index stats, query performance, replication state, and configuration changes. When something goes wrong, the data is already there.
Architecture
Flight Recorder collects two types of data:
| System | What it captures | Frequency | Retention |
|---|---|---|---|
| Sampled Activity | Wait events, sessions, locks | 1 min | Ring buffer: 2h, Archives: 7d |
| Snapshots | WAL, checkpoints, I/O, tables, indexes | 1 min | 30 days |
Sampled activity flows through ring buffers (hot, low-overhead) into durable archives and aggregates (cold, long-retention). Snapshots are written directly into retention tables -- no ring buffer in the path. Safety mechanisms -- circuit breaker, load shedding, per-section timeouts, and pg_cron job timeouts -- prevent the recorder from impacting production workloads.
Extensions
Two extensions, each published as a separate dbdev package:
| Extension | Schema | Purpose | README |
|---|---|---|---|
| pgfr_record | pgfr_record |
Core: tables, collection, scheduling, ring buffers | pgfr_record/README.md |
| pgfr_analyze | pgfr_analyze |
Optional: reporting, anomaly detection, time travel | pgfr_analyze/README.md |
Requirements
- PostgreSQL 15, 16, 17, or 18
pg_cronextension- Optional:
pg_stat_statementsfor query-level analysis
Quick start
Download from GitHub Releases or clone the repo, cd into the project root, then:
# Install core + optional analysis extension
psql --single-transaction -f pgfr_record/install.sql
psql --single-transaction -f pgfr_analyze/install.sql-- Enable collection SELECT pgfr_record.enable(); -- Verify the recorder is running SELECT * FROM pgfr_record.health_check(); -- Generate a diagnostic report on the database SELECT pgfr_analyze.report('1 hour');
Common workflows
Verifying the recorder
Confirms that collection is running, pg_cron jobs are active, the circuit breaker isn't tripping, schema size is in range, and pg_stat_statements (if installed) isn't churning. Run after install, after upgrades, or whenever a report looks thin -- not on a daily cadence.
SELECT * FROM pgfr_record.health_check();
Daily monitoring
Returns a markdown report covering anomalies, wait events, top queries, and other activity over the given window. Suitable for a daily glance, or pasted into a chat with an LLM for triage.
SELECT pgfr_analyze.report('1 hour');
Incident response
-- Switch to high-frequency collection SELECT * FROM pgfr_record.apply_profile('troubleshooting'); -- What was happening at a specific time? SELECT * FROM pgfr_analyze.what_happened_at('2024-01-15 14:32'); -- Reconstruct an incident timeline SELECT * FROM pgfr_analyze.incident_timeline( '2024-01-15 14:00', '2024-01-15 15:00' ); -- Return to normal after incident SELECT * FROM pgfr_record.apply_profile('default');
XID / MultiXID wraparound monitoring
-- Current XID and MultiXID ages at database level (from the latest snapshot) SELECT datfrozenxid_age, datminmxid_age FROM pgfr_record.snapshots ORDER BY captured_at DESC LIMIT 1; -- Top 10 tables by XID or MultiXID age SELECT relid::regclass, relfrozenxid_age, relminmxid_age FROM pgfr_record.table_snapshots WHERE snapshot_id = (SELECT max(id) FROM pgfr_record.snapshots) ORDER BY greatest(relfrozenxid_age, relminmxid_age) DESC NULLS LAST LIMIT 10; -- Wraparound anomalies (XID + MultiXID, cluster + per-table) SELECT anomaly_type, severity, metric_value, recommendation FROM pgfr_analyze.anomaly_report(now() - interval '1 hour', now()) WHERE anomaly_type LIKE '%WRAPAROUND%'; -- Tune thresholds (lower warning ratio to alert earlier on busy clusters) INSERT INTO pgfr_record.config (key, value) VALUES ('xid_warning_ratio', '0.25'), ('mxid_warning_ratio', '0.25') ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;
See REFERENCE.md for the full config key list and rationale (postgres-howto #0044 guidance).
Performance analysis
-- Find performance regressions SELECT * FROM pgfr_analyze.detect_regressions('1 day'); -- Find query storms SELECT * FROM pgfr_analyze.detect_query_storms('1 hour'); -- Table hotspots SELECT * FROM pgfr_analyze.table_hotspots(now() - interval '1 day', now()); -- Unused indexes SELECT * FROM pgfr_analyze.unused_indexes('7 days');
Capacity planning
SELECT * FROM pgfr_analyze.capacity_summary('7 days'); SELECT * FROM pgfr_analyze.quarterly_review(); SELECT * FROM pgfr_analyze.capacity_dashboard;
Configuration profiles
Profiles are pre-configured settings for different environments:
| Profile | Sample Interval | Use Case |
|---|---|---|
default |
60s | General purpose monitoring |
production_safe |
300s | Production with maximum safety margins |
development |
60s | Staging and development |
troubleshooting |
60s | Active incident response |
minimal_overhead |
300s | Resource-constrained systems |
SELECT * FROM pgfr_record.list_profiles(); SELECT * FROM pgfr_record.explain_profile('production_safe'); SELECT * FROM pgfr_record.apply_profile('production_safe');
Safety
Flight Recorder includes automatic protections:
| Protection | Description |
|---|---|
| Circuit Breaker | Skips collection if recent runs averaged > 1s |
| Load Shedding | Skips collection when > 70% connections active |
| Section Timeouts | Per-query timeout (250ms) prevents catalog lock hangs |
| Job Timeouts | Outer statement_timeout on pg_cron collector jobs (500ms–60s) |
Collection modes provide manual control: normal, light, emergency.
-- Drop progress tracking but keep 60s sampling (lower overhead, same cadence) SELECT pgfr_record.set_mode('light'); -- Reduce to minimum collection (300s sampling, locks/progress off) SELECT pgfr_record.set_mode('emergency'); -- Return to full collection SELECT pgfr_record.set_mode('normal'); -- Full stop: unschedule all pg_cron jobs SELECT pgfr_record.disable(); -- Resume SELECT pgfr_record.enable();
pg_cron run history
pg_cron logs every job execution to cron.job_run_details with no built-in purge. pgfr_record schedules roughly ten jobs (four of them every minute), so at default cadence this adds about 5,000 rows/day of unbounded growth on top of any other pg_cron jobs you run.
Recommended: disable cron.log_run. Errors from failed jobs still appear in the Postgres server log (cron.log_min_messages defaults to WARNING) — you lose nothing important, only the job_run_details table entries.
ALTER SYSTEM SET cron.log_run = off; -- requires Postgres restart (postmaster context)
If you need successful-run history for other pg_cron jobs (as of pg_cron 1.6 there is no per-job logging toggle), schedule a periodic purge instead:
SELECT cron.schedule( 'pgfr_purge_cron_log', '0 * * * *', $$DELETE FROM cron.job_run_details WHERE end_time < now() - interval '1 day'$$ );
pgfr_record.enable() raises a WARNING if cron.log_run is left on.
Export
With default retention: ~2.5GB uncompressed, ~150MB compressed.
# Without compression pg_dump -d your_database -n pgfr_record --data-only -f pgfr_data.sql # With compression (PostgreSQL 16+) pg_dump -d your_database -n pgfr_record --data-only --compress=gzip:9 -f pgfr_data.sql.gz # With compression (PostgreSQL 15) pg_dump -d your_database -n pgfr_record --data-only | gzip > pgfr_data.sql.gz
Upgrade
Re-running install scripts is safe -- they use CREATE OR REPLACE and IF NOT EXISTS, updating functions and views while preserving all data.
psql --single-transaction -f pgfr_record/install.sql psql --single-transaction -f pgfr_analyze/install.sql
Uninstall
# Remove everything (stops jobs, drops all schemas and data) psql --single-transaction -f pgfr_record/uninstall.sql # Remove only reporting functions (keeps core + data) psql --single-transaction -f pgfr_analyze/uninstall.sql
Testing
./test.sh # Test all PostgreSQL versions in parallel (requires Docker) ./test.sh 17 # Test a specific PostgreSQL version (15, 16, 17, or 18)
Reference
See REFERENCE.md for the full function reference, table schemas, configuration settings, and detailed documentation.
License
Licensed under the Apache License, Version 2.0. See NOTICE for attribution.