Multigres Architecture Overview

Principles

Multigres will follow a set of principles suited for large-scale distributed systems. They are as follows:

Scalability

In a distributed system, scalability is mainly achieved by removing all possible bottlenecks. Among them, the most challenging one is the database. Multigres will be designed to scale the database horizontally by sharding it across multiple Postgres instances. Multigres will also provide an additional scalability option by managing read replicas.

High Availability

Multigres strives for enterprise grade availability. To achieve this, it will use a combination of techniques:

  • A consensus protocol for leader election and failover management.
  • Fully automated cluster management.
  • No disruption of service during upgrades or maintenance.

Data Durability

Multigres will ensure that data is durable by using a consensus protocol. It will guarantee that a write acknowledged as successful to the client is never lost.

Additionally, it will provide a backup and restore mechanism to ensure that data can be recovered in case of catastrophic failures.

All other metadata will be stored in a distributed key-value store like etcd, which can also be backed up regularly, or can be manually reconstructed.

Resilience

Multigres will provide resilience against spikes and overloads by employing queuing and load shedding mechanisms.

To protect from cascading failures, it will implement adaptive timeouts and exponential backoffs on retries.
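
As a concrete illustration of the retry side of this, a client-side retry loop with exponential backoff, jitter, and a per-attempt timeout might look like the sketch below. The function name, attempt count, and timeouts are illustrative and are not Multigres defaults.

    package resilience

    import (
        "context"
        "fmt"
        "math/rand"
        "time"
    )

    // retryWithBackoff retries op with exponentially growing, jittered delays.
    // The attempt count, per-attempt timeout, and base delay are illustrative
    // values, not Multigres defaults.
    func retryWithBackoff(ctx context.Context, op func(context.Context) error) error {
        const maxAttempts = 5
        backoff := 100 * time.Millisecond

        var err error
        for attempt := 0; attempt < maxAttempts; attempt++ {
            // Bound each attempt so a slow dependency cannot hold the caller forever.
            attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
            err = op(attemptCtx)
            cancel()
            if err == nil {
                return nil
            }

            // Full jitter: sleep a random duration up to the current backoff,
            // then double the ceiling for the next attempt.
            select {
            case <-time.After(time.Duration(rand.Int63n(int64(backoff)))):
            case <-ctx.Done():
                return ctx.Err()
            }
            backoff *= 2
        }
        return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
    }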

Observability

In spite of all precautions, incidents can happen. Multigres will provide a comprehensive set of metrics and logs to help diagnose issues.

Features

A primary goal of Multigres is to provide full Postgres compatibility while enhancing scalability, availability, and performance. Its key features will include:

  • Proxy layer and Connection pooling
  • Performance and High Availability
  • Cluster management across multiple zones
  • Indefinite scaling through sharding

Multigres can be deployed to suit different needs. We will introduce you to the key components as we illustrate various deployment scenarios.

Single database deployment

In a single database deployment, Multigres will act as a proxy layer in front of a single PostgreSQL instance. This setup will be ideal for small applications or development environments where simplicity is key.

Figure: Single database deployment

The main components involved will be:

  • MultiGateway: MultiGateway will speak the Postgres protocol and route queries to MultiPooler through a single multiplexed gRPC connection.
  • MultiPooler: The MultiPooler will be connected to a single Postgres server, and will manage a pool of connections to the database. They will both run on the same host, which will typically be a Kubernetes pod.

In this scenario, Multigres will not address the durability of the underlying data. Therefore, it is recommended to use a resilient form of cloud storage to ensure data safety.
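
Because MultiGateway speaks the Postgres wire protocol, the application needs no special client library: it connects to the gateway exactly as it would to Postgres. A minimal sketch in Go, where the host, database name, and credentials are placeholders:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq" // any standard Postgres driver works against MultiGateway
    )

    func main() {
        // Placeholder connection details; point them at a MultiGateway endpoint.
        db, err := sql.Open("postgres",
            "host=multigateway.example.internal port=5432 dbname=app1 user=app password=secret sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // The query is handled by MultiGateway, pooled by MultiPooler, and
        // executed on the underlying Postgres instance.
        var version string
        if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
            log.Fatal(err)
        }
        fmt.Println("connected through MultiGateway:", version)
    }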

For a Multigres cluster to operate, two other components will be required:

  • Provisioner: This will typically be a Kubernetes operator that handles provisioning of resources for the cluster. For example, a CREATE DATABASE command will be redirected to the Provisioner, which will allocate the necessary resources and launch the MultiPooler along with its associated Postgres instance.
  • Topo Server: This will typically be an etcd cluster. The Provisioner will store the existence of the newly created database in the Topo Server. The MultiPooler will also register itself in the Topo Server to allow MultiGateway to discover it.
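
To make the registration step concrete, the sketch below shows a component publishing itself in etcd under a lease, so that its entry disappears if the process stops renewing it. The key layout and payload are invented for the example; the actual Multigres topo schema may differ.

    package main

    import (
        "context"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        // Placeholder etcd endpoints for the Topo Server.
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx := context.Background()

        // Attach the registration to a lease so the key expires if this
        // MultiPooler stops renewing it.
        lease, err := cli.Grant(ctx, 10)
        if err != nil {
            log.Fatal(err)
        }

        // Hypothetical key layout; the real topo schema may differ.
        _, err = cli.Put(ctx, "/multigres/cells/cell1/poolers/db1-0",
            `{"host":"pooler-0.cell1","grpc_port":15200}`,
            clientv3.WithLease(lease.ID))
        if err != nil {
            log.Fatal(err)
        }

        // Keep the lease alive while the component is healthy. A real
        // implementation would drain this channel and watch for errors.
        if _, err := cli.KeepAlive(ctx, lease.ID); err != nil {
            log.Fatal(err)
        }
        select {} // block so the registration stays alive for this sketch
    }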

Multiple database deployment

Unlike in a traditional Postgres server, where many databases can share one instance, every Multigres database will be created in a brand-new Postgres instance coupled with its own MultiPooler.

Figure: Multiple database deployment

The MultiGateways will be scalable independently based on resource needs. The application will connect to any MultiGateway, which will route the queries to the appropriate MultiPooler based on the database name.

This deployment style will allow for a large number of databases to be deployed under a single Multigres cluster.

The figure does not show the Topo Server and Provisioner components, but they will still be required for the cluster to operate.

Performance and High Availability

Multigres can be configured to add replicas as standbys. In this setup, we introduce the MultiOrch component, which manages the health of replication across replicas. It monitors replication, repairs broken streams, and coordinates failover, ensuring replicas remain in sync with the primary database.

Figure: High Availability and Performance

MultiOrch will implement a distributed consensus algorithm that will provide the following benefits:

  • High Availability: by promoting one of the replicas to be the new primary in case of a failure.
  • Data Durability: by ensuring that all writes are acknowledged by a quorum of replicas before being considered successful.
  • Performance: because the data can be stored on local NVMe drives for faster access.

MultiOrch will operate on an unmodified Postgres engine by using full sync replication. For a better experience, we recommend using the two-phase sync plugin (more details on this later).

MultiOrch will be configurable to use a Raft-style majority quorum. It will also be configurable to support more advanced durability policies that don't depend on the quorum size. This will be achieved by using a new generalized consensus approach (covered later).
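
To make the idea of a configurable durability policy concrete, here is a rough sketch of what such an abstraction could look like. The interface and both policies are illustrative only and are not the actual Multigres API; a cross-cell policy of this kind is revisited in the Cluster management section below.

    package consensus

    // Ack identifies a replica that has acknowledged a write. The fields are
    // illustrative; the real Multigres types may differ.
    type Ack struct {
        ReplicaID string
        Cell      string
    }

    // DurabilityPolicy decides when an acknowledged write may be reported as
    // successful to the client.
    type DurabilityPolicy interface {
        IsDurable(acks []Ack) bool
    }

    // MajorityQuorum is a Raft-style policy: a write is durable once a
    // majority of the shard's nodes have acknowledged it.
    type MajorityQuorum struct {
        TotalNodes int
    }

    func (p MajorityQuorum) IsDurable(acks []Ack) bool {
        return len(acks) >= p.TotalNodes/2+1
    }

    // CrossCell is an example of a policy that does not depend on quorum size:
    // a write is durable once at least one replica outside the primary's cell
    // has acknowledged it.
    type CrossCell struct {
        PrimaryCell string
    }

    func (p CrossCell) IsDurable(acks []Ack) bool {
        for _, ack := range acks {
            if ack.Cell != p.PrimaryCell {
                return true
            }
        }
        return false
    }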

MultiGateway will make use of the replicas to scale reads for situations where the application can tolerate eventual consistency. It will also be configurable to support consistent reads from replicas, at the cost of waiting for preceding writes to reach the replicas.
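
One way to picture what that wait involves is the LSN check that stock Postgres already makes possible: the caller notes the primary's WAL position right after its write (SELECT pg_current_wal_lsn()) and waits until the replica has replayed past it. The sketch below uses only standard Postgres functions; how Multigres implements consistent replica reads internally may differ.

    package reads

    import (
        "context"
        "database/sql"
        "time"
    )

    // waitForReplay blocks until the replica has replayed WAL up to targetLSN,
    // which the caller captured on the primary right after its write. Only
    // standard Postgres functions are used; Multigres may track replica
    // positions differently.
    func waitForReplay(ctx context.Context, replica *sql.DB, targetLSN string) error {
        for {
            var caughtUp bool
            err := replica.QueryRowContext(ctx,
                "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), $1::pg_lsn) >= 0",
                targetLSN).Scan(&caughtUp)
            if err != nil {
                return err
            }
            if caughtUp {
                return nil // the replica has the write; a read here is consistent
            }
            select {
            case <-time.After(10 * time.Millisecond):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
    }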

Cluster management

A Multigres cluster will be deployable across multiple zones or geographical regions. In the previous examples, the components were deployed in a single zone; under the covers, they were placed in the default cell, which will be implicitly created for every new database. For a multi-zone deployment, you will have to explicitly create cells and deploy the components in those cells. You will not be required to keep the original default cell.

In Multigres parlance, a cell will be a user-defined grouping of components. It will represent a zone or a region.

Figure: Cluster management

Topo Servers

In a multi-cell deployment, the Topo Server will be splittable into multiple instances. It is recommended that the Global Topo Server be deployed with nodes in multiple cells. For every cell, a cell-specific topo server will be deployable.

The Global Topo Server will contain the list of databases and the cells in which replicas are deployed. This information will be accessed more sparingly than the cell-specific data described below.

The cell-specific topo servers will contain the list of components deployed in that cell, such as MultiGateways and MultiPoolers. The purpose of this design is to ensure that a cell that is partitioned from the rest of the system can continue to operate independently for as long as the data is not stale.

Single Primary

Irrespective of the number of cells, there will exist only one primary database at any given time. The MultiGateways will route all requests meant for the primary to the current primary even if it is not in the same cell.

However, read traffic directed at replicas will be served from the local cell.

MultiOrch

It will be recommended that one MultiOrch be deployed per cell to ensure that failovers can be successfully performed even if the network is partitioned.

The consensus protocol will ensure safety even if the MultiOrchs are not able to communicate with each other.

The durability policies will be settable to survive network partitions. For example, you may request a cross-cell durability policy that will require an acknowledgment from a replica in a different cell before considering a write successful.

You will also be able to request MultiOrch to prefer appointing a primary within the same cell as the previous primary to avoid unnecessary churn.

Backup and Restore

Multigres will perform regular backups of the databases. These backups will be restored when new replicas are brought online.

Sharding

In the previous examples, where the databases were unsharded, all the tables were stored in a single Postgres instance. In this situation, there will be a one-to-one mapping between a Multigres database and its Postgres instance.

In reality, a Multigres database will be distributable across multiple Postgres instances. These units of distribution will be known as TableGroups. Additionally, each TableGroup will be shardable independently, which will result in more Postgres instances within that TableGroup. When a Multigres database is created, a default TableGroup will be created as an unsharded Postgres instance; this is where all the initial tables will be created. When you decide to shard a set of tables, you will be able to create a new sharded TableGroup and migrate those tables to it. From the application's perspective, the tables will appear as if they are part of a single database.

You will also be able to create separate unsharded TableGroups.

Figure: Cluster management

In the above example:

  • t1 is stored in the original default unsharded TableGroup. The single shard in this TableGroup is named default:0-inf.
  • t2 is split across the two shards of tg1, named tg1:0-8 and tg1:8-inf. Note that the digits are hexadecimal (see the routing sketch after this list).
  • t3 is stored in TableGroup tg2. The single shard in this TableGroup is named tg2:0-inf.
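
The shard names above encode key ranges over hashed sharding keys. As a rough sketch of how a row might be routed to one of tg1's two shards, assuming a Vitess-style hash-and-range scheme (an assumption here; the actual model is covered in the sharding document):

    package sharding

    import (
        "crypto/sha256"
        "encoding/hex"
    )

    // Shard owns the key range [Start, End) in hexadecimal; an empty string
    // stands for the open end ("0" or "inf" in the shard names above). The
    // layout mirrors the figure but is only an illustration.
    type Shard struct {
        Name  string
        Start string // inclusive, hex; "" means 0
        End   string // exclusive, hex; "" means inf
    }

    // tg1 from the example: two shards splitting the key space at 0x8.
    var tg1 = []Shard{
        {Name: "tg1:0-8", Start: "", End: "8"},
        {Name: "tg1:8-inf", Start: "8", End: ""},
    }

    // route hashes the sharding key and returns the shard whose range contains
    // the hashed value. The hash function is a placeholder, not the one
    // Multigres will use.
    func route(shards []Shard, shardingKey string) string {
        sum := sha256.Sum256([]byte(shardingKey))
        h := hex.EncodeToString(sum[:]) // lowercase hex, so string comparison follows hex order
        for _, s := range shards {
            if (s.Start == "" || h >= s.Start) && (s.End == "" || h < s.End) {
                return s.Name
            }
        }
        return "" // unreachable when the shards cover the whole key space
    }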

The rest of the Multigres features, such as cluster management and high availability, will be designed to work seamlessly with these sharded TableGroups.

We will cover the Multigres sharding model in more detail in a separate document.

Combining Shards and Cells

If cells were one axis of a Multigres cluster, then shards would be another axis. The two multiply with each other to produce a matrix of components. This is illustrated in Figure 5 below:

Figure 5: Cluster management

In a fully deployed cluster, the components will function in the following ways:

  • MultiGateway: There can be multiple instances of MultiGateway in each cell. A MultiGateway uses the MultiPoolers within its own cell to serve traffic, and goes cross-cell only to reach the primary when the primary is in a different cell. The user or application can connect to any MultiGateway to run their queries.

  • MultiPooler: There will be one instance of MultiPooler per Postgres instance. Each such instance contains the data for one shard. Among the MultiPoolers of a shard, one will be the primary and the others will be standbys or replicas.

  • MultiOrch: A MultiOrch watches over a single shard across all cells. For a given shard, it is recommended that a MultiOrch be provisioned in each cell. This allows for resilience against network partitions: if one MultiOrch does not have the connectivity to perform a failover, another one in a different cell can take over. The MultiOrchs are capable of operating without interfering with each other.

  • Local Toposerver: One local toposerver per cell is required. For smaller deployments, the global toposerver could also be reused for this purpose. However, it is recommended that separate local toposervers be provisioned for each cell in larger deployments. This server is used by components within a cell to discover each other. For example, a MultiPooler will publish itself through the toposerver, which allows MultiGateway to discover its existence (see the discovery sketch after this list).

  • Global Toposerver: One global toposerver is needed to store the list of cells, the list of databases, and their backup locations. The global toposerver is typically deployed across multiple cells to survive network partitions.
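
The discovery half of that exchange can be pictured with the same hypothetical etcd layout used in the registration sketch earlier: a MultiGateway loads the current set of MultiPoolers for its cell and then watches for changes. Again, the key prefix is invented for illustration.

    package discovery

    import (
        "context"
        "log"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // watchPoolers tails the cell-local toposerver for MultiPooler registrations
    // so a MultiGateway can keep its routing table current. The key prefix is
    // the same hypothetical layout as in the registration sketch; the real
    // Multigres topo schema may differ.
    func watchPoolers(ctx context.Context, cli *clientv3.Client, cell string) error {
        prefix := "/multigres/cells/" + cell + "/poolers/"

        // Load the currently registered poolers first.
        resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
        if err != nil {
            return err
        }
        for _, kv := range resp.Kvs {
            log.Printf("known pooler %s -> %s", kv.Key, kv.Value)
        }

        // Then follow additions, updates, and expirations as they happen.
        for wresp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
            for _, ev := range wresp.Events {
                switch ev.Type {
                case clientv3.EventTypePut:
                    log.Printf("pooler added or updated: %s", ev.Kv.Key)
                case clientv3.EventTypeDelete:
                    log.Printf("pooler removed: %s", ev.Kv.Key)
                }
            }
        }
        return ctx.Err()
    }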

Upcoming Topics

We will cover the following topics in more detail in the future:

  • Two-phase sync replication
  • Generalized consensus for durability policies
  • Multigres sharding model