PostgreSQL Recovery Internals


Modern databases must handle failures gracefully, whether they are system failures, power failures, or software bugs, while also ensuring that committed data is not lost. PostgreSQL achieves this with its recovery mechanism, which recreates a valid, functioning system state from a failed one. The core component that makes this possible is Write-Ahead Logging (WAL): PostgreSQL records every change before it is applied to the data files, which makes recovery smooth and robust.

In this article, we are going to look at the under-the-hood mechanism of how PostgreSQL performs recovery and stays consistent, and how the same mechanism powers different parts of the database. We will cover the recovery lifecycle, recovery type selection, initialization and execution, how consistent states are determined, and how WAL segment files are read for replay.

We will show how PostgreSQL achieves durability (the "D" in ACID), as database recovery and the WAL mechanism together ensure that all the committed transactions are preserved. This plays a fundamental role in making PostgreSQL fully ACID compliant so that users can trust that their data is safe at all times.

Note: The recovery internals described in this article are based on PostgreSQL version 18.1.

Overview

PostgreSQL recovery involves replaying the WAL records on the server to restore the database to a consistent state. This process ensures data integrity and protects against data loss in the event of system failures. In such scenarios, PostgreSQL efficiently manages its recovery processes, returning the system to a healthy operational state. Furthermore, in addition to addressing system failures and crashes, PostgreSQL's core recovery mechanism performs several other critical functions.

The recovery mechanism, powered by WAL and involving the replay of records until a consistent state is achieved (WAL → Redo → Consistency), facilitates several advanced database capabilities:

  • Replication & Standby Servers: Applying WAL records sent by the primary on a standby is effectively a continuous form of recovery.
  • Restoring from Backups: A backup serves as the starting point for the recovery mechanism, which then replays records until a consistent final state is attained.
  • Point-in-Time Recovery (PITR): PITR employs the same recovery mechanism, with the added capability of halting the recovery process at a user-defined point in time (recovery_target).
  • Crash Recovery: Recovery begins at the redo point (taken from the last successful checkpoint) and continues until the end of the available WAL is reached.

Lifecycle

Now let's walk through the lifecycle and flow of a simple crash recovery. Here, we will focus solely on the main processes that facilitate recovery. Of course, many other checks are performed, but we will limit the details to those directly related to the recovery mechanism.

PostgreSQL crash recovery lifecycle
  1. Everything starts from StartupProcessMain, which calls StartupXLOG(), coordinating with the server startup sequence. Before doing anything further, it performs a couple of checks to determine whether the server crashed or was shut down properly; this information is gathered by examining the control file. If a crash occurred, two actions are taken. First, any temporary WAL segments under the pg_wal directory are removed. Second, the entire data directory is synced, as there may be written data still waiting for an fsync.
  2. InitWalRecovery() then determines whether recovery is needed: it analyzes the control file and the backup label file, if present, and sets InRecovery accordingly. If recovery is needed, the control file is updated and recovery is initiated; if not, the server proceeds with the normal XLOG startup and related checks.
  3. If a decision has been made to perform recovery, we will first update the control file. This will reflect the server's recovery state and the selected checkpoint from which recovery is starting. At this point, a hot standby is initialized if requested, allowing connections and queries during recovery.
  4. PerformWalRecovery() handles the core loop of the recovery, which replays the WAL.
  5. Now, some cleanup tasks are initiated to complete the recovery.
    • In the case of archive recovery, a special WAL record is inserted to mark the end of recovery.
    • Otherwise, an end-of-recovery checkpoint is requested.
  6. After a successful recovery, the control file is updated again to reflect that the server can resume normal read and write operations.

In the case of a promotion, if there are cascading standby servers connected to us, any WAL sender processes are notified that we have been promoted, and an (online) checkpoint is requested.
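To make the flow above more concrete, here is a heavily condensed sketch of the crash-recovery path through StartupXLOG(). The function names (RemoveTempXlogFiles, SyncDataDirectory, InitWalRecovery, PerformWalRecovery, UpdateControlFile) are the real ones from xlog.c and xlogrecovery.c, but the body is a simplified paraphrase for illustration, not the actual server code; error handling, standby handling, and most of the surrounding logic are omitted.

```c
/*
 * Condensed, illustrative sketch of the crash-recovery path through
 * StartupXLOG() (src/backend/access/transam/xlog.c). Real function
 * names, but a simplified paraphrase of the flow, not the actual
 * server code; it assumes the xlog.c context (ControlFile, InRecovery).
 */
void
StartupXLOG(void)
{
    bool        wasShutdown;
    bool        haveBackupLabel;
    bool        haveTblspcMap;

    /* Step 1: was the previous shutdown clean? Inspect pg_control. */
    if (ControlFile->state != DB_SHUTDOWNED &&
        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        RemoveTempXlogFiles();      /* drop temporary WAL segments in pg_wal */
        SyncDataDirectory();        /* fsync data that may still be pending */
    }

    /* Step 2: decide whether recovery is needed (control file, backup_label). */
    InitWalRecovery(ControlFile, &wasShutdown, &haveBackupLabel, &haveTblspcMap);

    if (InRecovery)
    {
        /* Step 3: record the recovery state and the selected checkpoint. */
        UpdateControlFile();

        /* Step 4: the core redo loop. */
        PerformWalRecovery();

        /*
         * Step 5: insert an end-of-recovery record (archive recovery) or
         * request an end-of-recovery checkpoint (crash recovery).
         */
    }

    /*
     * Step 6: update the control file again; the server can now accept
     * normal read and write traffic.
     */
    UpdateControlFile();
}
```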

Recovery Initialization

At this stage, we determine whether to initiate a recovery. The recovery initialization conducts a series of checks by analyzing the control file, backup label file, and any recovery signal files (recovery.signal and standby.signal) to decide whether to perform crash recovery or archive recovery. Based on the results, it sets InRecovery to true or false. If the standby signal file is present, it takes precedence. If neither recovery.signal nor standby.signal is available, we will not enter archive recovery. Some important files that we check at this stage include:

PostgreSQL core recovery initialization

Based on the data collected from the available files, the checkpoint is identified, determining how far we need to replay to reach a consistent state. Checkpoint validation is also performed at this stage. If a tablespace_map file (created during the backup process) exists, symlinks are established based on the data from the map file.

At the end of this initialization, we are left with some useful global values that can influence the server's future actions.

  • ArchiveRecoveryRequested
  • StandbyModeRequested
  • InRecovery
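As a rough illustration of the precedence described above, the signal-file check has approximately the following shape. This is a self-contained sketch with a made-up helper name; the real logic lives in readRecoverySignalFile() in xlogrecovery.c, which also validates recovery parameters and interacts with the backup_label and tablespace_map handling.

```c
/*
 * Self-contained sketch of the signal-file precedence applied during
 * recovery initialization (cf. readRecoverySignalFile() in
 * src/backend/access/transam/xlogrecovery.c). check_recovery_signal_files()
 * is a hypothetical helper; the paths are relative to the data directory.
 */
#include <stdbool.h>
#include <unistd.h>             /* access() */

#define STANDBY_SIGNAL_FILE   "standby.signal"
#define RECOVERY_SIGNAL_FILE  "recovery.signal"

static bool ArchiveRecoveryRequested = false;
static bool StandbyModeRequested = false;

static void
check_recovery_signal_files(void)
{
    bool    have_standby = (access(STANDBY_SIGNAL_FILE, F_OK) == 0);
    bool    have_recovery = (access(RECOVERY_SIGNAL_FILE, F_OK) == 0);

    if (have_standby)
    {
        /* standby.signal takes precedence: archive recovery in standby mode. */
        ArchiveRecoveryRequested = true;
        StandbyModeRequested = true;
    }
    else if (have_recovery)
    {
        /* recovery.signal alone: targeted archive recovery (e.g. PITR). */
        ArchiveRecoveryRequested = true;
        StandbyModeRequested = false;
    }
    /* Neither file present: no archive recovery; plain crash recovery only. */
}
```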

Perform WAL Recovery (Core)

This is the core part where the actual recovery happens. After initialization is done, PerformWalRecovery() is called, which starts the redo.

PostgreSQL core redo apply loop
  1. Notifies the postmaster that recovery has started so it can begin the archiver if necessary.
  2. Finds and reads the first record; this will serve as the recovery starting point, and the redo loop will begin from here.
  3. Starts the redo loop and runs the resource managers to track resources and time consumed while replaying the WAL.
  4. Since this loop can continue for an extended time depending on the recovery targets, to prevent the system from being stuck in the loop, it is designed to respond to interrupt signals.
  5. Before applying the record, we check whether the recovery target has been reached using the recovery target parameters. If the target is reached, we stop the recovery. This is done to accommodate the parameter recovery_target_inclusive=off.
    • recovery_target_time
    • recovery_target_lsn
    • recovery_target_xid
    • recovery_target_name
    PITR (Point-in-Time Recovery) works by stopping at an exact point. It ensures that you don’t accidentally replay more WAL than required.
  6. If the standby is configured to lag (e.g., recovery_min_apply_delay = '2min'), we will wait until the age of the WAL record exceeds the configured delay.
  7. At this point, we reach the heart of recovery. The ApplyWalRecord() function handles applying the decoded WAL record. Each access method has its own resource manager that implements how to redo its records, and ApplyWalRecord() internally dispatches to the dedicated resource manager based on the record type:
    • heap → change tuples/pages
    • btree → update indexes
    • xlog → update timeline metadata
    • smgr → extend/create files
    • clog → update transaction commit/abort status

For example, in the case of heap, we may see the following functions carrying out the redo:

  • heap_xlog_insert()
  • heap_xlog_update()
  8. Check again if the target has been reached. This is used for inclusive recovery targets.
  9. Calls ReadRecord() to fetch the next WAL record. This will continue the redo loop and advance the recovery process.
  10. After replaying all the records, the recovery may shut down or pause, depending on the setting defined by the user in recovery_target_action.
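Putting the steps above together, the redo loop looks roughly like the sketch below. The function names (ReadRecord, HandleStartupProcInterrupts, recoveryStopsBefore, recoveryApplyDelay, ApplyWalRecord, recoveryStopsAfter) are the real ones used by PerformWalRecovery() in xlogrecovery.c, but this is a paraphrase, not the actual loop: timeline switching, prefetching details, hot-standby bookkeeping, and error handling are all omitted.

```c
/*
 * Paraphrased sketch of the core redo loop in PerformWalRecovery()
 * (src/backend/access/transam/xlogrecovery.c). Real function names,
 * heavily simplified flow; xlogreader and xlogprefetcher are set up
 * during recovery initialization.
 */
XLogRecord *record;
TimeLineID  replayTLI;      /* timeline being replayed, set during initialization */

/* Find and read the first record at the redo start point (step 2). */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);

while (record != NULL)
{
    /* React to shutdown/promotion requests instead of looping blindly (step 4). */
    HandleStartupProcInterrupts();

    /*
     * Stop *before* applying when the target is already reached and
     * recovery_target_inclusive = off (step 5).
     */
    if (recoveryStopsBefore(xlogreader))
        break;

    /* Honour recovery_min_apply_delay on a delayed standby (step 6). */
    recoveryApplyDelay(xlogreader);

    /*
     * Dispatch to the record's resource manager (heap, btree, xlog, smgr,
     * clog, ...), which performs the actual redo (step 7).
     */
    ApplyWalRecord(xlogreader, record, &replayTLI);

    /* Stop *after* applying for inclusive recovery targets (step 8). */
    if (recoveryStopsAfter(xlogreader))
        break;

    /* Fetch the next record and continue the loop (step 9). */
    record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
}
```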

When will consistency be reached?

Now the main thing to consider here is when the actual recovery loop will stop. It depends on the consistent state, which further depends on the type of recovery being executed.

A consistent state in PostgreSQL recovery is a point where all the data blocks represent a valid and correct database state, with all required WAL records replayed sufficiently to reflect all committed transactions up to that moment. The consistency point can vary depending on the type of recovery being performed.

In crash recovery, consistency is reached once PostgreSQL has replayed enough WAL (all the available WAL) to safely complete any interrupted operations and return the database to a state where it can accept normal read and write traffic. Crash recovery duration is indirectly influenced by checkpoint-related settings such as:

  • checkpoint_timeout
  • max_wal_size
  • checkpoint_completion_target

In archive recovery (including PITR), the consistent state is defined not only by correctness but also by the configured recovery target, meaning recovery may intentionally stop at an earlier point in time or WAL location. The following parameters can help set the consistent state:

  • recovery_target_lsn
  • recovery_target_time
  • recovery_target_xid
  • recovery_target_name
  • recovery_target_inclusive
  • recovery_target_action
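To connect these parameters with the mechanism, a recovery-target check conceptually looks like the sketch below. The state names mirror the recovery-target variables in xlogrecovery.c, but reached_recovery_target() itself is a made-up simplification; the real checks are recoveryStopsBefore() and recoveryStopsAfter(), which also handle named restore points, recovery_target = 'immediate', and recovery_target_inclusive.

```c
/*
 * Illustrative sketch only: how a recovery-target check conceptually
 * works. The variables mirror the recovery-target state kept in
 * xlogrecovery.c, but reached_recovery_target() is hypothetical; the
 * real checks are recoveryStopsBefore() and recoveryStopsAfter().
 */
#include "postgres.h"
#include "access/xlogdefs.h"        /* XLogRecPtr */
#include "datatype/timestamp.h"     /* TimestampTz */

typedef enum
{
    RECOVERY_TARGET_UNSET,
    RECOVERY_TARGET_XID,
    RECOVERY_TARGET_TIME,
    RECOVERY_TARGET_LSN
} RecoveryTargetType;               /* trimmed: name/immediate omitted */

static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
static XLogRecPtr    recoveryTargetLSN;
static TransactionId recoveryTargetXid;
static TimestampTz   recoveryTargetTime;

static bool
reached_recovery_target(XLogRecPtr record_lsn, TransactionId record_xid,
                        TimestampTz record_time)
{
    switch (recoveryTarget)
    {
        case RECOVERY_TARGET_LSN:
            /* Stop once replay reaches or passes the requested LSN. */
            return record_lsn >= recoveryTargetLSN;
        case RECOVERY_TARGET_XID:
            /* Stop at the commit/abort record of the requested transaction. */
            return record_xid == recoveryTargetXid;
        case RECOVERY_TARGET_TIME:
            /* Stop once a commit/abort timestamp passes the requested time. */
            return record_time >= recoveryTargetTime;
        default:
            return false;           /* no explicit target: replay all available WAL */
    }
}
```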

In standby recovery, PostgreSQL can reach a consistency point that is sufficient to allow read-only queries while WAL replay continues in the background, even though the system is not yet writable. These parameters influence consistency and query behavior while WAL replay continues:

  • max_standby_streaming_delay, max_standby_archive_delay – control query conflicts during replay
  • hot_standby – allows read-only queries once a consistent standby state is reached
  • recovery_min_apply_delay – intentionally delays WAL application

You can also find detailed configuration settings for each type of recovery in the PostgreSQL docs.

WAL Reading Internals

Keep in mind that to ensure a smooth recovery, we need to read the WAL efficiently. PostgreSQL achieves this using the xlogprefetcher, which decodes records ahead of the replay position and prefetches the blocks they reference, keeping recovery smooth. Consider the following parameters to improve prefetching and, consequently, recovery:

  • wal_decode_buffer_size: determines how far to prefetch
  • wal_buffers
  • recovery_prefetch

There are various implementations for reading WAL segment files, each serving a specialized purpose; the recovery module has its own implementation for reading the segment files. If you ever need to read WAL segment files while hacking on PostgreSQL, you have to choose one of these implementations or write your own. For extension development, read_local_xlog_page_no_wait is most commonly used (see the sketch after the list below).

  • XLogPageRead(): used during recovery
  • read_local_xlog_page(): performs a simple read
  • read_local_xlog_page_no_wait(): performs a simple read without waiting
  • summarizer_read_local_xlog_page(): used by the WAL summarizer
  • SimpleXLogPageRead(): used by pg_rewind
  • logical_read_xlog_page(): used by the walsender
  • WALDumpReadPage(): used by pg_waldump
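For example, here is a minimal sketch of how an extension might walk WAL records using the xlogreader API with read_local_xlog_page_no_wait as the page-read callback. The helper walk_wal_from() and its logging are made up for illustration, and start_lsn is assumed to already point at a valid record boundary.

```c
/*
 * Minimal sketch: walking WAL records from inside an extension using
 * the xlogreader API with the non-blocking page-read callback.
 * walk_wal_from() is a hypothetical helper; start_lsn is assumed to
 * already point at a valid record boundary.
 */
#include "postgres.h"
#include "access/xlog.h"        /* wal_segment_size */
#include "access/xlogreader.h"
#include "access/xlogutils.h"   /* read_local_xlog_page_no_wait, wal_segment_open/close */

static void
walk_wal_from(XLogRecPtr start_lsn)
{
    XLogReaderState *reader;
    XLogRecord *record;
    char       *errormsg = NULL;

    reader = XLogReaderAllocate(wal_segment_size, NULL,
                                XL_ROUTINE(.page_read = &read_local_xlog_page_no_wait,
                                           .segment_open = &wal_segment_open,
                                           .segment_close = &wal_segment_close),
                                NULL);
    if (reader == NULL)
        elog(ERROR, "out of memory while allocating a WAL reader");

    /* Position the reader at the first record to decode. */
    XLogBeginRead(reader, start_lsn);

    /*
     * Read until no more WAL is available; the _no_wait callback returns
     * instead of blocking for WAL that has not been flushed yet.
     */
    while ((record = XLogReadRecord(reader, &errormsg)) != NULL)
    {
        elog(LOG, "record rmid=%u len=%u at %X/%X",
             (unsigned int) XLogRecGetRmid(reader),
             XLogRecGetTotalLen(reader),
             LSN_FORMAT_ARGS(reader->ReadRecPtr));
    }

    if (errormsg)
        elog(LOG, "stopped reading WAL: %s", errormsg);

    XLogReaderFree(reader);
}
```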

Final Thoughts

PostgreSQL’s recovery is built on a simple idea: replay the WAL until the system reaches a consistent point. However, behind this simplicity there is a carefully engineered system that coordinates checkpoints, timelines, and resource managers to ensure correctness under crashes, replication, and point-in-time recovery.

What makes recovery particularly powerful is that the same core mechanism supports crash recovery, PITR, replication, and hot standby. Understanding recovery internals is crucial for those working on replication, storage, WAL, or extension development.