Learn how to achieve highly available PostgreSQL clusters and reliable transactions in geo-distributed databases
EDB Postgres Distributed (PGD) provides best-in-class high availability for always-on, enterprise-grade Postgres performance across fully managed or self-managed deployments in any cloud. With EDB’s Postgres technology supporting robust, globally distributed applications that process thousands of transactions per second, customers regard EDB Postgres Distributed as their go-to for delivering up to 99.999% uptime and improving data replication performance by up to 5X.
As part of internal activities leading up to our latest EDB Postgres Distributed release, our Engineering team focused on delivering more robust releases that enhance customer confidence in PGD. Setting correct expectations for customers prompted us to improve our internal testing processes, an effort influenced by the Jepsen database testing framework.
Validating EDB Postgres Distributed for Safety, Correctness, and Consistency
In the Jepsen framework, a Jepsen control node hosts the Jepsen libraries, the PGD plugin, and the tests. This control node talks to the PGD cluster, as shown below.

This blog is the first in a series of posts, as EDB extends our commitment to validate PGD’s availability and consistency in geo-distributed databases.
In the following subsections, we’ll walk through our internal testing and how it validates EDB Postgres Distributed’s availability and consistency in active/active architectures.
Background and Terminology for Database Transactions
Before describing our internal PGD testing enhancements, let’s provide some background and terminology for database transactions.
Operation: An operation (e.g., read or write) is carried out by a client. This operation can take some time to complete. A modifying operation such as a write takes effect sometime between invocation and completion. Multiple operations can be in progress simultaneously.

Process: An operation is performed by a logical process. A process can do only one operation at a time.
History: A history is a set of operation invocations and completions with their timestamps. An example is provided below.
invocation process 0:
{:process 0, :type :invoke, :f :read, :value nil}
invocation process 1:
{:process 1, :type :invoke, :f :write, :value 3}
completion process 1:
{:process 1, :type :info, :f :write, :value 3}
completion process 0:
{:process 0, :type :ok, :f :read, :value 3}
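Jepsen records histories as Clojure maps like the ones above. To make the structure concrete, here is a small Python sketch (field names mirror the example; this is an illustration, not Jepsen's actual representation) that pairs each invocation with its completion. Note that the read by process 0 overlaps the write by process 1, so a read returning 3 is consistent:

```python
# A Python sketch of the Jepsen-style history above. Jepsen itself uses
# Clojure maps; the keys here mirror the example in the text.
history = [
    {"process": 0, "type": "invoke", "f": "read", "value": None},
    {"process": 1, "type": "invoke", "f": "write", "value": 3},
    {"process": 1, "type": "info", "f": "write", "value": 3},  # uncertain completion
    {"process": 0, "type": "ok", "f": "read", "value": 3},     # read observed 3
]

def pair_ops(history):
    """Match each invocation with its completion, preserving event order."""
    pending, ops = {}, []
    for i, event in enumerate(history):
        if event["type"] == "invoke":
            pending[event["process"]] = i
        else:  # :ok, :fail, or :info completes the pending op for this process
            start = pending.pop(event["process"])
            ops.append({"process": event["process"], "f": event["f"],
                        "invoked": start, "completed": i,
                        "value": event["value"]})
    return ops

ops = pair_ops(history)
# The read (process 0) overlaps the write (process 1): the write may have
# taken effect before the read returned, so observing 3 is linearizable.
```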
Testing Linearizability in Database Transactions
We used the Knossos checker to verify that a history generated from a set of concurrent operations is linearizable. It models a register with operations read, write, and compare-and-swap (CAS). Given a set of concurrent operations, the system is validated as linearizable for that set if there is at least one possible path in the history that is linearizable.
Linearizable system
A linearizable implementation implies that the following properties are met:
- The system should support these operations:
- read (key)
- write (key, value)
- compare_and_swap (key, old_value, new_value)
- In the presence of multiple concurrent reads and writes from multiple nodes, a read should never return a stale value. A stale read returns some value V even though at least one successful modification that set the key to a newer value V2 completed before the read began.
- As the system moves from one state to another, the state seen by clients should always progress forward. Returning a stale value is like going backward.
- There should be a total order on all the operations.
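The properties above can be made concrete with a toy model. The sketch below (an illustration in Python, assuming nothing about Knossos internals) implements the three register operations and a brute-force check in the spirit of Knossos: a history is linearizable if at least one total order of the operations both respects real-time order and replays correctly against the register:

```python
from itertools import permutations

# A toy single-key register supporting the three operations listed above.
class Register:
    def __init__(self, value=None):
        self.value = value

    def read(self):
        return self.value

    def write(self, v):
        self.value = v

    def cas(self, old, new):
        if self.value == old:
            self.value = new
            return True
        return False

# Brute-force linearizability check over completed operations, each tagged
# with invocation/completion timestamps, as in the Jepsen history format.
def linearizable(ops, initial=None):
    def respects_realtime(order):
        # If a completed before b was invoked, a must precede b in the order.
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                if b["completed"] < a["invoked"]:
                    return False
        return True

    def replays(order):
        reg = Register(initial)
        for op in order:
            if op["f"] == "read":
                if reg.read() != op["value"]:
                    return False
            elif op["f"] == "write":
                reg.write(op["value"])
            elif op["f"] == "cas":
                old, new = op["value"]
                if reg.cas(old, new) != op["ok"]:
                    return False
        return True

    # Linearizable if SOME valid total order replays correctly.
    return any(respects_realtime(o) and replays(o) for o in permutations(ops))
```

Checking every permutation is factorially expensive, which is tolerable for a sketch but is exactly why real checkers like Knossos use much smarter search, and why these tests are costly to run at scale.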
In a distributed system, an operation is applied to the system at a certain time. It may take some time to complete, and if it is a modifying operation, the modification takes effect sometime between the start time and the completion time. The diagrams that follow show how operations proceed concurrently on a distributed system.
The first diagram below provides an example of non-linearizable behavior. After a write has been applied and completed, a subsequent read returns an older value. A linearizable system cannot go “backward”.

The second diagram below shows the behavior of a linearizable system. There is a total order on the operations because the system moves through the states A = 2, A = 4, A = 8, and so on, and clients see the system moving forward. An operation takes some time to take effect, but once it does, every node sees the effect of that operation. Thus, the system behaves as one system.

A look at EDB’s implementation of Raft in PGD
PGD uses Raft as a consensus protocol for certain key operations. While Raft is not in the data path of transactions, it is used for configuration management, electing a write leader for a subgroup, serializing DDL and global locks, and reaching consensus in distributed transaction reconciliation.
One of our first steps was to test the linearizability of PGD’s Raft implementation. We used a key-value API provided by the Raft subsystem to test linearizable behavior. The read, write, and CAS calls are implemented through Raft.
A failed operation appears in PINK, and a successful operation appears in BLUE. Operations may also return with uncertainty, meaning the operation may or may not have taken effect; the Knossos checker factors this in.
Using this API to test linearizability yields timeline charts like the following samples.
Example of correct execution

Example of invalid analysis

Linearizable reads and Raft
The tests showed some processes reading stale values, as seen in the above example. This is a bug, or at least a linearizability violation. The reason is that the key-value Raft implementation in PGD reads from the leader. The leader should have the latest value, because it acknowledges a request only after a majority of replicas acknowledge that they have committed the change, and after applying the changed state locally. But when the leader changes, the newly elected leader must have the change committed in its Raft logs, yet may not have applied it to the key-value store. On the new leader, the apply_index may lag the commit_index, so a read may return a stale value.
PGD code that uses Raft for other purposes, such as configuration management, already waits when the applied state must be visible locally. We added an explicit SQL call that waits until the apply index catches up with the commit index; with this call in place, the test succeeded.
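The fix can be sketched as follows. This is a simplified Python model of the idea, not PGD's actual code or SQL API; the class and method names are illustrative. Before serving a linearizable read, a newly elected leader waits until every entry committed in its Raft log has been applied to the key-value store:

```python
import time

# Simplified model of the fix: names here are illustrative, not PGD's API.
class RaftNode:
    def __init__(self):
        self.log = []            # committed entries: (key, value)
        self.commit_index = 0    # highest committed entry
        self.apply_index = 0     # highest entry applied to the KV store
        self.kv = {}

    def apply_one(self):
        """Apply the next committed entry to the key-value store."""
        if self.apply_index < self.commit_index:
            key, value = self.log[self.apply_index]
            self.kv[key] = value
            self.apply_index += 1

    def wait_for_apply(self, timeout=1.0, poll=0.01):
        """Block until apply_index catches up with commit_index."""
        deadline = time.monotonic() + timeout
        while self.apply_index < self.commit_index:
            if time.monotonic() > deadline:
                raise TimeoutError("apply index did not catch up")
            self.apply_one()     # in a real node, the apply loop runs elsewhere
            time.sleep(poll)
        return True

    def linearizable_read(self, key):
        # The fix: never serve a read while committed entries are unapplied,
        # so a new leader cannot return a stale value.
        self.wait_for_apply()
        return self.kv.get(key)
```

Without the wait, a new leader whose apply_index lags its commit_index would serve whatever happens to be in its key-value store, which is exactly the stale read the tests caught.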
Lessons Learned
During this process, we realized some valuable lessons.
- An API definition needs to be precise and allow callers to infer whether a request failed or its outcome is uncertain. For example, throwing errors with the right details, such as a request failing because a timeout expired, lets the caller infer that the result is uncertain: the request may or may not have succeeded. For tests to check results accurately, requests need to return :ok (success), :fail (failure), or :info (uncertainty). An :info result means the request could have succeeded or failed. Wrongly treating a request as failed when it could have succeeded produces false alarms in the tests.
- Some tests that use a register workload with the Knossos checker have very high space and time complexity. With few keys, verification can take a very long time and exhaust memory, so the test duration must be kept short; but a short duration, or a larger number of keys, reduces the chance of hitting a bug. Similarly, creating a lot of contention can make the code hit deadlocks and slow the test's progress. We are working to strike a balance among these factors so the code gets tested more thoroughly.
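The first lesson, mapping outcomes onto :ok, :fail, and :info, can be sketched like this. The classification rule and exception types below are illustrative assumptions, not PGD's or Jepsen's actual client code: only errors that prove the request did not take effect may be reported as a failure; anything ambiguous, such as a timeout, must be reported as uncertain:

```python
import socket

def classify(perform):
    """Run an operation and classify its outcome as ok / fail / info.

    `perform` is any zero-argument callable issuing the request; the
    exception types chosen here are hypothetical examples.
    """
    try:
        value = perform()
        return {"type": "ok", "value": value}
    except ValueError as e:
        # e.g. a CAS precondition failure: the server definitely rejected it,
        # so it is safe to report a definite failure.
        return {"type": "fail", "error": str(e)}
    except (socket.timeout, TimeoutError) as e:
        # A timeout means the request may or may not have been executed;
        # reporting :fail here would cause false alarms in the checker.
        return {"type": "info", "error": str(e)}
```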
Status and Work-in-Progress Activities
As a result of writing and running these tests, we were able to target a few areas of next-stage focus, including:
- Helping identify some rare cases when update conflicts, in the presence of write skew, cause data divergence among nodes.
- Improving PGD error handling on replicas in the presence of conflicts.
- Helping identify gaps in eager and async conflict resolution.
We continue to refine our efforts to validate PGD’s availability and consistency in geo-distributed databases with various configurations, and we’ll share those findings in future blogs, focusing on:
- Conflict resolution with CRDTs
- Qualifying the write leader with a majority commit scope for strict serializability
- Qualifying the write leader with a synchronous commit for linearizability
- Adding tests for the correctness of global locks
For more information about the latest EDB Postgres Distributed release, check out our EDB Docs.