The infinitely patient vacuum - a case study of what happens when the PostgreSQL VACUUM never completes

Timeouts? What timeouts?

I was recently called into a situation where the same query was being run on three different environments, but one environment was considerably slower than the other two. I requested an extended EXPLAIN of the query on all three environments, along with the output of pg_locks and pg_stat_activity on the troublesome environment, for starters.
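
For the pg_locks side of that snapshot, a query along these lines lists each lock together with the session holding or awaiting it. This is a sketch of my own, not a transcript of what was actually run at the time:

-- Locks currently held or awaited, joined to the sessions involved
select l.locktype,
       l.relation::regclass as relation,
       l.mode,
       l.granted,
       a.pid,
       a.state,
       left(a.query, 60) as query
from pg_locks l
join pg_stat_activity a on a.pid = l.pid
order by l.granted, a.query_start;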

For reference, the original query was:

select count(*) from <redacted>;

Which was then run as:

EXPLAIN (ANALYZE, BUFFERS, SETTINGS) select count(*) from <redacted>;

…on all three environments.

Here’s the output from the three EXPLAINs:

Environment 1:

Aggregate  (cost=9439.45..9439.46 rows=1 width=8) (actual time=60.239..60.240 rows=1 loops=1)
  Buffers: shared hit=8823
  ->  Seq Scan on <redacted>  (cost=0.00..9316.16 rows=49316 width=0) (actual time=0.026..57.171 rows=49330 loops=1)
        Buffers: shared hit=8823
Settings: effective_cache_size = '8GB', max_parallel_workers_per_gather = '0'
Planning Time: 0.041 ms
Execution Time: 60.263 ms

Environment 2:

Aggregate  (cost=9123.14..9123.15 rows=1 width=8) (actual time=45.715..45.717 rows=1 loops=1)
  Buffers: shared hit=8462 dirtied=166
  ->  Seq Scan on <redacted>  (cost=0.00..8990.91 rows=52891 width=0) (actual time=0.015..43.124 rows=52882 loops=1)
        Buffers: shared hit=8462 dirtied=166
Settings: effective_cache_size = '8GB', max_parallel_workers_per_gather = '0'
Planning:
  Buffers: shared hit=67
Planning Time: 0.298 ms
Execution Time: 45.772 ms

Environment 3:

Aggregate  (cost=393885.57..393885.58 rows=1 width=8) (actual time=1112333.054..1112333.057 rows=1 loops=1)
  Buffers: shared hit=1014858 read=2211856 dirtied=1 written=39493
  ->  Bitmap Heap Scan on <redacted>  (cost=333899.49..393846.01 rows=15825 width=0) (actual time=37140.459..1112330.910 rows=22097 loops=1)
        Heap Blocks: lossy=7219
        Buffers: shared hit=1014858 read=2211856 dirtied=1 written=39493
        ->  Bitmap Index Scan on <redacted>  (cost=0.00..333895.53 rows=15825 width=0) (actual time=36836.658..36836.659 rows=22806599 loops=1)
              Buffers: shared hit=226 read=82463 written=1273
Settings: effective_cache_size = '8GB', max_parallel_workers_per_gather = '0'
Planning:
  Buffers: shared hit=1013 read=747 written=12
Planning Time: 555.159 ms
Execution Time: 1112337.830 ms

Setting aside the fact that the third environment, where the slowdown occurs, goes through an index, we can see two things:

1. Environment 3 runs for close to 20 minutes, in contrast to less than 100 ms for the other two.
2. Environment 3 hits more than a million shared buffers, compared to less than 10 000 for the other two.
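
To put the buffer counts in perspective: by my arithmetic, and assuming PostgreSQL’s default 8 kB block size, environment 3 touched 1,014,858 hit + 2,211,856 read ≈ 3.2 million buffers, roughly 25 GB worth of pages, just to produce a count of about 22 thousand rows, while the other two environments got by on fewer than 9,000 buffers, or around 70 MB.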

Looking at this, my hunch was that the query on environment 3 had to trawl through a lot of irrelevant data to find the rows it needed. The fact that this happened at all for a plain count(*) of every row pointed strongly to dead rows as the culprit.

An excess of dead rows would mean that VACUUM isn’t running or isn’t completing as it should, which is usually caused by long-running queries preventing VACUUM from doing its job: VACUUM cannot remove row versions that a still-open transaction might need to see. Looking at pg_stat_activity, I saw one query that had been running for a week and another for two weeks. I could also see that the automated VACUUMs on the tables involved in the original query were hitting timeouts.
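
A quick way to spot such sessions is to sort pg_stat_activity by how long the current query (or transaction) has been going. A minimal sketch; the column choice here is mine rather than a record of what I ran at the time:

-- Sessions ordered by how long their current query has been running
select pid, usename, state,
       now() - xact_start  as transaction_age,
       now() - query_start as query_age,
       left(query, 80)     as query
from pg_stat_activity
where state <> 'idle'
order by query_start;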

The final confirmation came with a look at the output from pg_stat_all_tables¹, after running:

select n_dead_tup, relname from pg_stat_all_tables order by n_dead_tup desc;

Resulting in the following:

 n_dead_tup   relname
203,494,741   <redacted>
111,624,190   <redacted>
 25,415,790   <redacted>  – This was the table in the original query
 22,728,996   <redacted>
 14,406,220   <redacted>
  5,391,193   <redacted>
  4,137,013   <redacted>
  3,615,973   <redacted>
  3,134,223   <redacted>
  3,111,695   <redacted>
  3,030,976   <redacted>
  1,315,229   <redacted>
    771,827   <redacted>
    690,015   <redacted>
    249,232   <redacted>

Millions and millions of dead tuples. 
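
The same view also records when each table was last vacuumed, whether manually or by autovacuum, which makes for a handy cross-check in this kind of situation. A sketch of such a follow-up query (mine, not one from the incident):

-- Dead vs. live tuples and the last (auto)vacuum per table, worst offenders first
select relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       autovacuum_count
from pg_stat_all_tables
order by n_dead_tup desc
limit 15;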

I recommended that the two long-running queries (long forgotten, and no longer needed) be terminated to allow the automated VACUUM to do its job. After leaving it be for an evening, things looked much better in the morning. All in all, a fairly minor scare as far as reasons to instate query timeouts go.
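
For completeness, terminating a session and capping query runtime look roughly like this; the pid, role name and timeout values below are placeholders for illustration, not the ones from the incident:

-- Terminate a forgotten long-running session by its pid (taken from pg_stat_activity)
select pg_terminate_backend(12345);  -- placeholder pid

-- Cap statement runtime for an application role so this can't quietly happen again
alter role app_user set statement_timeout = '15min';                    -- placeholder role and value
-- Idle-in-transaction sessions hold back VACUUM just as effectively, so cap those too
alter role app_user set idle_in_transaction_session_timeout = '15min';  -- placeholder value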


It’s worth adding that the application suffered a performance hit on most queries; the one I was given as an example was simply especially egregious, and happened to be a perfect case for pinpointing the issue.

Do you have any similar stories of your own to share? I’d love to hear about them in the comments below.

  1. This is quite a useful view, with many more columns that can serve you well if a performance issue crops up in PostgreSQL ↩︎