GitHub - dventimisupabase/pg_flight_recorder: Server-side performance telemetry for PostgreSQL 15+ - extracted from SA-Toolkit

6 min read Original article ↗

GitHub release Test Suite Lint

Server-side flight recorder for PostgreSQL. Answers "what was happening in my database?"

View the project website

pg_flight_recorder continuously samples PostgreSQL system state in the background via pg_cron -- no external agents, sidecars, or polling required. It captures wait events, active sessions, locks, WAL activity, checkpoints, I/O, table and index stats, query performance, replication state, and configuration changes. When something goes wrong, the data is already there.

Architecture

Flight Recorder collects two types of data:

System What it captures Frequency Retention
Sampled Activity Wait events, sessions, locks 1 min Ring buffer: 2h, Archives: 7d
Snapshots WAL, checkpoints, I/O, tables, indexes 1 min 30 days

Sampled activity flows through ring buffers (hot, low-overhead) into durable archives and aggregates (cold, long-retention). Snapshots are written directly into retention tables -- no ring buffer in the path. Safety mechanisms -- circuit breaker, load shedding, per-section timeouts, and pg_cron job timeouts -- prevent the recorder from impacting production workloads.

Extensions

Two extensions, each published as a separate dbdev package:

Extension Schema Purpose README
pgfr_record pgfr_record Core: tables, collection, scheduling, ring buffers pgfr_record/README.md
pgfr_analyze pgfr_analyze Optional: reporting, anomaly detection, time travel pgfr_analyze/README.md

Requirements

  • PostgreSQL 15, 16, 17, or 18
  • pg_cron extension
  • Optional: pg_stat_statements for query-level analysis

Quick start

Download from GitHub Releases or clone the repo, cd into the project root, then:

# Install core + optional analysis extension
psql --single-transaction -f pgfr_record/install.sql
psql --single-transaction -f pgfr_analyze/install.sql
-- Enable collection
SELECT pgfr_record.enable();

-- Verify the recorder is running
SELECT * FROM pgfr_record.health_check();

-- Generate a diagnostic report on the database
SELECT pgfr_analyze.report('1 hour');

Common workflows

Verifying the recorder

Confirms that collection is running, pg_cron jobs are active, the circuit breaker isn't tripping, schema size is in range, and pg_stat_statements (if installed) isn't churning. Run after install, after upgrades, or whenever a report looks thin -- not on a daily cadence.

SELECT * FROM pgfr_record.health_check();

Daily monitoring

Returns a markdown report covering anomalies, wait events, top queries, and other activity over the given window. Suitable for a daily glance, or pasted into a chat with an LLM for triage.

SELECT pgfr_analyze.report('1 hour');

Incident response

-- Switch to high-frequency collection
SELECT * FROM pgfr_record.apply_profile('troubleshooting');

-- What was happening at a specific time?
SELECT * FROM pgfr_analyze.what_happened_at('2024-01-15 14:32');

-- Reconstruct an incident timeline
SELECT * FROM pgfr_analyze.incident_timeline(
    '2024-01-15 14:00',
    '2024-01-15 15:00'
);

-- Return to normal after incident
SELECT * FROM pgfr_record.apply_profile('default');

XID / MultiXID wraparound monitoring

-- Current XID and MultiXID ages at database level (from the latest snapshot)
SELECT datfrozenxid_age, datminmxid_age
FROM pgfr_record.snapshots
ORDER BY captured_at DESC LIMIT 1;

-- Top 10 tables by XID or MultiXID age
SELECT relid::regclass, relfrozenxid_age, relminmxid_age
FROM pgfr_record.table_snapshots
WHERE snapshot_id = (SELECT max(id) FROM pgfr_record.snapshots)
ORDER BY greatest(relfrozenxid_age, relminmxid_age) DESC NULLS LAST
LIMIT 10;

-- Wraparound anomalies (XID + MultiXID, cluster + per-table)
SELECT anomaly_type, severity, metric_value, recommendation
FROM pgfr_analyze.anomaly_report(now() - interval '1 hour', now())
WHERE anomaly_type LIKE '%WRAPAROUND%';

-- Tune thresholds (lower warning ratio to alert earlier on busy clusters)
INSERT INTO pgfr_record.config (key, value) VALUES
    ('xid_warning_ratio',  '0.25'),
    ('mxid_warning_ratio', '0.25')
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;

See REFERENCE.md for the full config key list and rationale (postgres-howto #0044 guidance).

Performance analysis

-- Find performance regressions
SELECT * FROM pgfr_analyze.detect_regressions('1 day');

-- Find query storms
SELECT * FROM pgfr_analyze.detect_query_storms('1 hour');

-- Table hotspots
SELECT * FROM pgfr_analyze.table_hotspots(now() - interval '1 day', now());

-- Unused indexes
SELECT * FROM pgfr_analyze.unused_indexes('7 days');

Capacity planning

SELECT * FROM pgfr_analyze.capacity_summary('7 days');
SELECT * FROM pgfr_analyze.quarterly_review();
SELECT * FROM pgfr_analyze.capacity_dashboard;

Configuration profiles

Profiles are pre-configured settings for different environments:

Profile Sample Interval Use Case
default 60s General purpose monitoring
production_safe 300s Production with maximum safety margins
development 60s Staging and development
troubleshooting 60s Active incident response
minimal_overhead 300s Resource-constrained systems
SELECT * FROM pgfr_record.list_profiles();
SELECT * FROM pgfr_record.explain_profile('production_safe');
SELECT * FROM pgfr_record.apply_profile('production_safe');

Safety

Flight Recorder includes automatic protections:

Protection Description
Circuit Breaker Skips collection if recent runs averaged > 1s
Load Shedding Skips collection when > 70% connections active
Section Timeouts Per-query timeout (250ms) prevents catalog lock hangs
Job Timeouts Outer statement_timeout on pg_cron collector jobs (500ms–60s)

Collection modes provide manual control: normal, light, emergency.

-- Drop progress tracking but keep 60s sampling (lower overhead, same cadence)
SELECT pgfr_record.set_mode('light');

-- Reduce to minimum collection (300s sampling, locks/progress off)
SELECT pgfr_record.set_mode('emergency');

-- Return to full collection
SELECT pgfr_record.set_mode('normal');

-- Full stop: unschedule all pg_cron jobs
SELECT pgfr_record.disable();

-- Resume
SELECT pgfr_record.enable();

pg_cron run history

pg_cron logs every job execution to cron.job_run_details with no built-in purge. pgfr_record schedules roughly ten jobs (four of them every minute), so at default cadence this adds about 5,000 rows/day of unbounded growth on top of any other pg_cron jobs you run.

Recommended: disable cron.log_run. Errors from failed jobs still appear in the Postgres server log (cron.log_min_messages defaults to WARNING) — you lose nothing important, only the job_run_details table entries.

ALTER SYSTEM SET cron.log_run = off;
-- requires Postgres restart (postmaster context)

If you need successful-run history for other pg_cron jobs (as of pg_cron 1.6 there is no per-job logging toggle), schedule a periodic purge instead:

SELECT cron.schedule(
  'pgfr_purge_cron_log',
  '0 * * * *',
  $$DELETE FROM cron.job_run_details WHERE end_time < now() - interval '1 day'$$
);

pgfr_record.enable() raises a WARNING if cron.log_run is left on.

Export

With default retention: ~2.5GB uncompressed, ~150MB compressed.

# Without compression
pg_dump -d your_database -n pgfr_record --data-only -f pgfr_data.sql

# With compression (PostgreSQL 16+)
pg_dump -d your_database -n pgfr_record --data-only --compress=gzip:9 -f pgfr_data.sql.gz

# With compression (PostgreSQL 15)
pg_dump -d your_database -n pgfr_record --data-only | gzip > pgfr_data.sql.gz

Upgrade

Re-running install scripts is safe -- they use CREATE OR REPLACE and IF NOT EXISTS, updating functions and views while preserving all data.

psql --single-transaction -f pgfr_record/install.sql
psql --single-transaction -f pgfr_analyze/install.sql

Uninstall

# Remove everything (stops jobs, drops all schemas and data)
psql --single-transaction -f pgfr_record/uninstall.sql

# Remove only reporting functions (keeps core + data)
psql --single-transaction -f pgfr_analyze/uninstall.sql

Testing

./test.sh           # Test all PostgreSQL versions in parallel (requires Docker)
./test.sh 17        # Test a specific PostgreSQL version (15, 16, 17, or 18)

Reference

See REFERENCE.md for the full function reference, table schemas, configuration settings, and detailed documentation.

License

Licensed under the Apache License, Version 2.0. See NOTICE for attribution.