The Post-Mortem We Never Had to Write - ScryData Blog

Staging passes. CI is green. A teammate adds a region column to the orders table, backfills 25,000 rows, and opens a PR. Everything looks clean.

The migration introduces a 92x query regression that nobody will catch until production.

Except this time, Scry caught it first.

The Migration

Picture a typical e-commerce app. The orders table has ~25,000 rows, a handful of indexes, and a steady workload of filters and joins. A teammate adds a region column for a new geo-filtering feature:

-- Add the region column
ALTER TABLE orders ADD COLUMN region VARCHAR(50);

-- Backfill existing orders with region data
UPDATE orders SET region = CASE
    WHEN (id % 5) = 0 THEN 'west'
    WHEN (id % 5) = 1 THEN 'east'
    WHEN (id % 5) = 2 THEN 'central'
    WHEN (id % 5) = 3 THEN 'south'
    ELSE 'north'
END;

-- NOTE: No index added!

In staging, with 100 rows, the filter query returns in under a millisecond. CI is green. The PR gets approved.

Why Staging Alone Missed It

Staging has two structural blind spots:

Wrong data volume. 100 rows vs. 25,000 (or 25 million). PostgreSQL’s query planner makes different choices at different scales – a sequential scan on 100 rows is faster than an index lookup. On 25,000 rows, it’s a performance cliff.

Wrong query patterns. Test suites run a handful of known queries. Production runs hundreds of distinct patterns with real concurrency and real data skew. The interaction between a new column and existing queries only surfaces when you replay actual traffic.

Staging tests correctness, not performance at scale. You need both. (See 100 Migrations Later for the longer argument.)

But this migration was also running through Scry.

What Scry Found

Here’s what happens when this migration runs through Scry’s pipeline:

scry-proxy is already capturing production queries transparently – no application changes needed.
The migration is applied to a shadow database, a CDC-replicated copy of production that maintains real data volume and distribution.
Scry replays the captured query workload against the shadow, comparing latency before and after the migration.

The replay report:

Scry replay report showing 3 regressed query patterns with 92x, 42x, and 20x slowdowns caused by missing index

The EXPLAIN ANALYZE confirms the root cause:

-- Without index: Seq Scan (184ms)
Seq Scan on orders  (cost=0.00..1250.00 rows=5000 width=120)
                    (actual time=0.045..184.23 rows=5000 loops=1)
  Filter: (region = 'west'::text)
  Rows Removed by Filter: 20000
Planning Time: 0.089 ms
Execution Time: 184.67 ms

The Fix

Two indexes:

CREATE INDEX CONCURRENTLY idx_orders_region ON orders(region);

-- For queries that filter by region AND status, a composite index is even better:
CREATE INDEX CONCURRENTLY idx_orders_region_status ON orders(region, status);

After applying the fix to the shadow and re-running the replay, every pattern is back to baseline:

Before vs after latency comparison showing queries dropping from 184ms, 210ms, 165ms down to 3ms, 5ms, 8ms after adding indexes

-- With index: Index Scan (3ms)
Index Scan using idx_orders_region on orders
                    (cost=0.29..125.40 rows=5000 width=120)
                    (actual time=0.032..2.89 rows=5000 loops=1)
  Index Cond: (region = 'west'::text)
Planning Time: 0.112 ms
Execution Time: 3.14 ms

184ms to 3ms. Regression eliminated before it ever touched production – no pages, no customer impact, no incident channel. In CI, the whole cycle – apply migration, replay traffic, detect regression, apply fix, re-validate – fits in a single command:

scry ci test-migration prod-db/ci-main -- alembic upgrade head

Exit code 0: safe to ship. Exit code 5: regressions detected, pipeline fails before production.

Lessons from Missing-Index Regressions

This class of bug — schema change adds a new query pattern, nobody adds the index — is probably the most common source of unexpected production slowdowns. It shows up constantly, for a simple reason: the person writing the migration is thinking about the data model, not the query planner.

The danger zone is the ALTER TABLE + UPDATE backfill pattern. Adding a column is cheap. The backfill UPDATE is where teams are focused on getting the data right. What they don’t do is ask "which existing queries now touch this column?" If the answer is "any query that filters, sorts, or joins on it," and there’s no index, you’ve just silently degraded those queries — at production data volume, not staging volume.

Reading EXPLAIN ANALYZE output is the fastest way to spot the problem after the fact. The signal is unambiguous:

-- Bad: Seq Scan + large Rows Removed = missing index.
Seq Scan on orders  (cost=0.00..1250.00 rows=5000 width=120)
  Filter: (region = ‘west’::text)
  Rows Removed by Filter: 20000

-- Good: Index Scan, no wasted rows.
Index Scan using idx_orders_region on orders
  Index Cond: (region = ‘west’::text)

On deciding between a single-column index and a composite index: if you see a query that always pairs region with another high-selectivity column — say status or customer_id — a composite index on (region, status) will outperform two single-column indexes. PostgreSQL can use a composite index to satisfy both predicates in a single tree walk. The rule of thumb: lead with the equality predicate, follow with the range predicate. If queries filter by region = ‘west’ AND status = ‘pending’, index on (region, status). If they filter by region = ‘west’ AND created_at > now() - interval ‘7 days’, index on (region, created_at).

A single-column index on just region is the right starting point when you don’t yet have data on query patterns — it eliminates the sequential scan immediately, and you can layer in composite indexes once you’ve observed which predicates co-occur in production traffic.

The broader pattern to watch for: any migration that adds a column and then immediately runs a backfill UPDATE is a candidate for a missing-index bug. Make it a checklist item in your PR template: "If this migration adds a column that will be used in WHERE, ORDER BY, or JOIN clauses, does the migration also add the index?"

Configuring ScryData for This Scenario

Here’s the practical setup. You need three things: scry-proxy in front of your PostgreSQL instance, a shadow database populated via CDC, and a scry.yaml config that wires them together.

Step 1: Run scry-proxy. scry-proxy sits between your application and PostgreSQL. It forwards all queries transparently and writes a capture log. No application changes — just update your database URL to point at the proxy:

# Install the CLI
curl -sSfL https://install.scrydata.com | sh

# Start scry-proxy in front of your production database
scry proxy start \
  --upstream postgres://prod-db.internal:5432/myapp \
  --listen 0.0.0.0:5433 \
  --capture-dir /var/scry/captures

Your app connects to port 5433. Queries go through to production unchanged, and scry-proxy writes a rolling capture log that Scry uses for replay.

Step 2: Point the shadow at a CDC replica. The shadow database should be a logical replication target of production — same data volume, same distribution. If you’re already running a read replica, you can use that. If not, scry shadow init will set up logical replication from your primary:

scry shadow init \
  --source postgres://prod-db.internal:5432/myapp \
  --target postgres://shadow-db.internal:5432/myapp_shadow \
  --slot scry_shadow_slot

Step 3: Configure scry.yaml. Drop this file at the root of your repo:

version: 1

capture:
  proxy_url: postgres://localhost:5433/myapp
  capture_dir: /var/scry/captures
  window: 30m          # replay the last 30 minutes of traffic

shadow:
  url: postgres://shadow-db.internal:5432/myapp_shadow
  migration_cmd: alembic upgrade head

thresholds:
  latency_regression_pct: 20    # flag any query 20% slower than baseline
  latency_regression_abs_ms: 5  # ignore regressions smaller than 5ms absolute

report:
  format: json
  output: scry-report.json

Step 4: Wire it into GitHub Actions. Add a job that runs after your migration step:

name: Migration Regression Check

on:
  pull_request:
    paths:
      - ‘migrations/**’
      - ‘alembic/**’

jobs:
  scry-replay:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Scry CLI
        run: curl -sSfL https://install.scrydata.com | sh

      - name: Run migration regression test
        env:
          SCRY_SHADOW_URL: ${{ secrets.SCRY_SHADOW_URL }}
          SCRY_CAPTURE_DIR: ${{ secrets.SCRY_CAPTURE_MOUNT }}
        run: |
          scry ci test-migration prod-db/ci-main -- alembic upgrade head

      - name: Upload regression report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: scry-regression-report
          path: scry-report.json

scry ci applies the migration to the shadow, replays the captured workload, compares latency distributions, and exits with code 0 if everything is within thresholds or code 5 if regressions are detected. The GitHub Actions job fails on exit code 5, blocking the merge. The uploaded artifact gives you the full per-query breakdown to debug from.

The full cycle — apply migration, capture baseline, replay workload, diff latencies — runs in about 90 seconds for a workload of a few thousand captured queries. For larger capture windows, you can narrow the replay to a random sample with --sample-pct 10.

A 92x regression on a core query path would have meant pages firing, customers timing out, and an engineer reverse-engineering what changed. Instead, it was a three-line fix before the PR merged.

You can replay this exact scenario locally in under two minutes:

scry demo

Run the full end-to-end test yourself

We’re looking for design partners – teams shipping migrations against production PostgreSQL who want to close the gap between staging and prod. Request early access.