Optimizing MySQL Replication Lag with Parallel Replication and Writeset-based Dependency Tracking



By: Kun Zhou, Matthew Jin

Introduction

At Airtable, MySQL replication plays a crucial role in scaling our reads and achieving our long-term vision for high availability. Having low-lag replicas allows us to offload a significant chunk of read traffic to replicas and eases the operational burden of performing lag-sensitive tasks such as failovers and blue-green deployment switchover.

Our largest database is a monolithic MySQL 8 instance (we previously blogged about upgrading main from MySQL 5.6 to 8), deployed on AWS RDS as a Multi-AZ deployment with one standby. It handles approximately 50,000 read queries per second (QPS) and 150 writes per second, with occasional spikes reaching up to 1,000 writes per second. While this is a very manageable load for the instance type we are using, we have had repeated hours-long replication lag spikes, making replicas unsuitable for serving live interactive queries. We wanted to solve this problem to effectively scale to meet our growing customer usage. Over the last two quarters, we spent a significant amount of time investigating and tuning our replication configuration and successfully brought the replication lag down to consistent single-digit seconds. We also built a binary log parser, which we used to guide this tuning process.

Here is a peek at the before and after lag metrics:

[Chart: replication lag before tuning]

[Chart: replication lag after tuning]

In the rest of the post, we will touch on:

  1. A quick overview of MySQL replication
  2. Challenges with rolling out multi-threaded replication
  3. Disabling Binary Logs on the Replicas
  4. Rolling out GTID for a more stable multi-threaded replication setup
  5. Tuning binlog_group_commit_sync_delay and binlog_transaction_dependency_tracking to improve replication parallelizability
  6. Conclusion

Architecture of MySQL Replication

Replication is the mechanism used by MySQL to propagate changes from a primary instance to replica instances. Let’s look at how a single transaction is replicated from a MySQL primary to a replica.

[Diagram: replication of a transaction from a MySQL primary to a replica]

When a transaction is committed on the primary node, the associated data changes are recorded in a special log called the binary log (binlog). This log consists of “events” that describe database changes, such as DDL or DML changes, and is kept in addition to the write-ahead log, or the redo log as MySQL calls it. On the replica server, a special IO thread maintains a persistent connection with the primary server, reading binlog events as they get written on the primary, thereby keeping up-to-date with every change event. The replica’s IO thread writes these events to a special log called the relay log. A separate SQL thread on the replica server then reads from relay logs and executes the change events contained within. In effect, we are replaying all the data changes that took place on the primary on the replica. Replication lag, as reported by the replica, is the difference in time between the last change event written to the relay log (by the IO thread) and the last event applied from the relay log by the SQL thread. Note that the reported lag doesn’t take into account a slow IO thread.

MySQL supports several modes of replication, such as semi-synchronous replication, group replication, and asynchronous replication. Our RDS setup only allowed for asynchronous replication, and in this blog post, we will only focus on that.

Challenges of Multi-threaded Replication

The IO thread rarely becomes a bottleneck in replication because its primary task is to stream data from the network to files. The SQL thread, on the other hand, responsible for committing transactions to the database, often lags behind. It’s easy to see why: the primary server often has a few dozen threads processing transactions concurrently but the replica only has one SQL thread to apply all those transactions. Since 5.6.3, MySQL allows multiple SQL threads to apply changes concurrently to close the gap. This setup is known as multi-threaded replication.

Our team and the MySQL community agree that multi-threaded replication is the key to bringing lag down, and it showed in the numbers: when we enabled it last year, we saw the lag come down from O(hours) to O(seconds), with infrequent multi-minute lags.

[Chart: replication lag after enabling multi-threaded replication]

Unfortunately, we ran into two issues that forced us to go back to single-threaded replication two weeks later.

Replication Deadlocking

The first issue was a MySQL bug that caused replication to hang indefinitely due to deadlocks. Quoting from the changelog of MySQL 8.0.23, which fixes this issue:

On a multi-threaded replica where the commit order is preserved, worker threads must wait for all transactions that occur earlier in the relay log to commit before committing their own transactions. If a deadlock occurs because a thread waiting to commit a transaction later in the commit order has locked rows needed by a transaction earlier in the commit order, a deadlock detection algorithm signals the waiting thread to roll back its transaction. Previously, if transaction retries were not available, the worker thread that rolled back its transaction would exit immediately without signalling other worker threads in the commit order, which could stall replication. A worker thread in this situation now waits for its turn to call the rollback function, which means it signals the other threads correctly.

Preserving commit order means the replica must commit transactions in the same order as the primary, ensuring strong consistency between source and replica. This is a setting we had enabled on our replicas. It still allows for parallel replication, but the final commit phase needs to be synchronized to respect the original commit order. To work around the bug, we increased the retry count for replication.

Failed Crash Recovery of Relay Log Metadata

The other issue we encountered was that failing over the database instance to a standby would reliably break multi-threaded replication for the replicas. We would see the failure message

slave failed to initialize relay log info structure from the repository

and the only fix was re-provisioning a new replica from backup to replace the broken one. This is in part due to RDS’s management of replication, meaning we had limited tools in our toolbox and couldn’t reset the replication metadata and reconfigure the replica’s binary log offset. One workaround we considered was switching our replica to single-AZ instead of multi-AZ with a standby. Single-AZ removes the standby instance, leaving no failover options, which would have reduced our availability guarantees. This wasn’t ideal, as we only had one live-serving replica and couldn’t afford the availability hit.

This was an unexpected issue because since MySQL 8.0, the relay log info structure (see MySQL 8.0 Reference Manual: 17.2.4.2 Replication Metadata Repositories) is stored in the system tablespace, which is durable and consistent, unlike raw files in previous versions. We later realized the cause was a very peculiar bug surrounding relay log recovery. To better understand this issue, let’s dive a little deeper into how replication resumes following a database crash.

Replication Crash Recovery and Relay Logs

Jean-François Gagné’s blog post provides a nice overview of replication crash safety and relay logs. To summarize, relay log recovery is an opt-in MySQL feature that invalidates existing relay logs following a server crash, for crash safety: relay logs are often not synced to disk, so they may be incomplete or corrupted after a crash. In single-threaded replication, if the database crashes, all relay logs will be discarded during recovery. The IO thread will resume replication from the last recorded Exec_Source_Log_Pos, which is the file offset in the primary database’s binary log up to which the replica has already committed. This watermark is part of the relay log metadata. The SQL thread then resumes from the top of the new relay log. This whole process works because relay log metadata is stored in the system tablespace.

Recovery becomes a lot more complicated with multi-threaded replication because binary log events are no longer applied in order. Different SQL threads are responsible for different transactions, and a later transaction could commit on the replica before an earlier one does. Because of this, we can no longer guarantee a global watermark where transactions preceding it are replicated and transactions after it are not.

Instead, the replica now periodically records checkpoints up to which all preceding transactions have been replicated. In other words, Exec_Source_Log_Pos is now a low watermark (see here) for the chunk of binary log that’s been replicated, and it is likely that some transactions after this checkpoint have also been replicated. For this reason, on crash recovery, a multi-threaded replica must scan the existing relay logs, starting at Exec_Source_Log_Pos and continuing up to the most recently replicated transaction, to fill in the gaps of replicated transactions (this process is described in painstaking detail here), using the potentially corrupted relay logs. Only then can MySQL discard the existing relay logs and restart replication with a refreshed Exec_Source_Log_Pos, which is now an exact watermark.
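The gap-filling scan can be pictured with a small sketch (our illustration, not MySQL's actual implementation; binary log positions are simplified to integers):

```python
def find_gaps(low_watermark, applied):
    """Return the positions after the low watermark that still need applying.

    low_watermark: every transaction at or before this position is known applied.
    applied: set of positions committed out of order beyond the watermark.
    """
    if not applied:
        return []  # nothing was applied past the watermark; no gaps to fill
    high = max(applied)
    # Any position between the low watermark and the most recently
    # replicated transaction that was not applied is a gap.
    return [p for p in range(low_watermark + 1, high + 1) if p not in applied]

# Low watermark at 100; transactions 103 and 105 committed out of order past it.
print(find_gaps(100, {103, 105}))  # 101, 102, and 104 must be filled from relay logs
```

If the relay logs lost the events covering those gaps, this fill-in step has nothing to read from, which is exactly the failure mode described above.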

Now you might ask, what if the potentially corrupted relay logs lost the events we have already replicated due to a crash? Well, then MySQL will be unable to fill in the gaps, relay log recovery would fail, and replication would break. The only sure way to recover is to re-provision the replica from a snapshot.

However, even if there are no gaps in the transactions applied, which is the case in our setup (we had replica_preserve_commit_order enabled), the existing implementation still walks through the same recovery steps to identify gaps in transactions. And in the event of a crash and relay log data loss, replication breaks. It’s plausible that MySQL should be able to reconstruct the relay logs by downloading the missing events from the primary, and there is an open and verified feature request on this.

The good news is that this limitation of relay log crash recovery is confined to binary log file offset-based replication as of version 8.0.18. Since then, MySQL has reworked the recovery process for GTID-enabled replicas, and it no longer requires the old relay logs to be intact (MySQL Bugs: #92882: MTS not replication crash-safe with GTID and all the right parameters.). We will discuss GTID in a later section.

Disabling Binary Logs on the Replicas

When a transaction commits, its changes are initially written to an in-memory buffer to enhance write performance. To ensure durability and guard against potential crashes, MySQL writes the data changes to its write-ahead log, known as the redo log, during the transaction. And at commit time, these WAL writes are synced to the disk. This mechanism allows MySQL to recover committed transactions in the event of a crash.

When binary logging is enabled on an instance, it becomes part of the commit process, with changes being synced to the binary log file on disk immediately after the redo log is synced. This effectively doubles the number of expensive file syncs at commit time (you can learn more about this process here, which goes into greater detail).

In our setup, these fsyncs significantly impact commit latency, particularly because we use EBS (Elastic Block Store). To put things in context, the best available EBS type for RDS at the time was io1, which only guaranteed single-digit millisecond latency for read/write operations 99.9% of the time. In contrast, local SSDs typically offer latency in the range of tens of microseconds.

Earlier this year, we experimented with turning off binary logs on the replicas since it was not strictly required at that time. It was a delight to see replication latency drop dramatically from O(hours) to the O(minutes) depicted in this graph.

[Chart: replication lag after disabling binary logs on the replicas]

This was a huge improvement, and we got here without multi-threaded replication. Nevertheless, this was a bandaid we hoped to take off with multi-threaded replication because having binary logging enabled is a requirement for a replica to be a failover target. Without binary logs, a replica cannot be promoted to the primary because other instances will not be able to replicate from it. Currently, we perform failovers using RDS’s Multi-AZ Deployments feature. This is not ideal due to its minute-long recovery time. Our long-term vision is to use ProxySQL to orchestrate failover with an O(seconds) recovery time objective.

Utilizing GTID to Remedy the Rough Edges of Multi-threaded Replication

To work around the rough edges of multi-threaded replication we went over earlier and to further improve the ease and reliability of managing replication, we turned our attention to GTID. GTID stands for Global Transaction Identifier, where ‘global’ indicates that the identifier remains consistent throughout the replication topology. This means that a transaction on the primary and its corresponding replicated transaction on the replica share the same GTID.

With GTID, it is no longer necessary to keep track of the binary log file offset for replication, which is often a pain to manage. Instead, the correct binary log position can be inferred from the GTIDs of the transactions executed on a database instance. In addition to the better relay log crash recovery handling discussed previously, GTID also allows a replica to automatically position its replication starting point when recovered from a snapshot. Previously, one had to take note of the binary log file offset the snapshot was taken at and resume replication from that explicit offset. Another important feature is that auto-positioning allows a replica to change its replication source without having to be restored from a snapshot of the new source, thanks to the globality of GTIDs.

Earlier this year, we rolled out GTID to all the MySQL instances at Airtable, resolving the relay log crash recovery limitation that broke replication on failover. We also upgraded our MySQL version to 8.0.28, which contained the fix for the replication deadlock bug we encountered. With these two changes, we enabled multi-threaded replication once again without a hitch, and MySQL replication lag is now quite stable around 1 second, with binary logs disabled.

With multi-threaded replication in place, we attempted to re-enable binary logs. The result was replication lag with frequent spikes up to double-digit seconds throughout the day, which was not ideal, since any failover would need to wait for the replication lag to catch up.

Binary Log Transaction Dependency Tuning

We weren’t previously aware that having more replication threads on the replica does not automatically result in more concurrent replication. During replication, one transaction may depend on the result of another transaction. So even if multiple threads were available, dependent transactions must be executed in a specific order to maintain consistency with the primary.

Binary Log Dependencies

The database decides this dependency relationship by looking for overlapping commit windows (MySQL :: MySQL Replication :: 2.6.4 Binary Logging Options and Variables). Because every transaction holds the row locks of the records being modified, transactions whose commit windows overlap can never update the same record. These transactions are marked as independent in the binary logs and therefore parallelizable on the replica.

This dependency information is captured in the last_committed and sequence_number fields in the GTID events in the binary logs:

  • sequence_number is an auto-incrementing integer assigned to each transaction, uniquely identifying it within a single binary log file
  • last_committed is the sequence_number of the most recent transaction on which the current transaction depends

If the last_committed value of transaction B is less than the sequence_number of a previous transaction A, then B and A can be applied in parallel. Take the following simplified binary log excerpt as an example: the transactions with sequence_number in [1786, 1789] and 1791 are parallelizable because their last_committed values all point to a transaction before this set, while 1790 cannot be replicated in parallel with them because it depends on transaction 1787 within the set.

#204512 13:45:59... last committed=1784 sequence number=1785
#204512 13:45:59... last committed=1785 sequence number=1786
#204512 13:45:59... last committed=1785 sequence number=1787
#204512 13:45:59... last committed=1785 sequence number=1789
#204512 13:45:59... last committed=1787 sequence number=1790
#204512 13:45:59... last committed=1784 sequence number=1791
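To make the rule concrete, here is a short Python sketch (ours, not MySQL code) that applies the logical-clock check to the excerpt above:

```python
def parallel_with_first(txns):
    """Given (last_committed, sequence_number) pairs in binlog order,
    return the sequence_numbers that form one parallelizable group with
    the first transaction: every member's last_committed must point to a
    transaction before the group's first sequence_number."""
    first_seq = txns[0][1]
    return [seq for last_committed, seq in txns if last_committed < first_seq]

# (last_committed, sequence_number) pairs from the excerpt above
log = [(1785, 1786), (1785, 1787), (1785, 1789), (1787, 1790), (1784, 1791)]

print(parallel_with_first(log))  # 1790 is excluded: it depends on 1787
```

Running this on the excerpt yields exactly the set described in the text: 1786, 1787, 1789, and 1791 together, with 1790 left out.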

binlog_group_commit_sync_delay

Therefore, to improve parallelism, more transactions must overlap in their commit windows, and we can do that by introducing delays in the transaction commit phase through binlog_group_commit_sync_delay. The delay is implemented by blocking the binary log write during the commit phase to wait for more transactions to enter that phase, thereby increasing the number of independent transactions.
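The effect can be illustrated with a toy model (our simplification for this post, not MySQL's actual group-commit code): transactions that reach the commit phase within one delayed sync window flush together and therefore get overlapping commit windows.

```python
def commit_groups(arrival_times_ms, sync_delay_ms):
    """Group transactions whose commit-phase arrivals fall within one
    delayed sync window; members of a group are marked independent."""
    groups, current = [], []
    for t in sorted(arrival_times_ms):
        if not current or t - current[0] <= sync_delay_ms:
            current.append(t)  # still within the delayed sync window
        else:
            groups.append(current)
            current = [t]  # sync happened; start a new window
    if current:
        groups.append(current)
    return [len(g) for g in groups]

arrivals = [0, 5, 12, 40, 44, 120]  # hypothetical commit-phase arrival times
print(commit_groups(arrivals, 0))   # no delay: every transaction syncs alone
print(commit_groups(arrivals, 15))  # 15 ms delay: nearby commits batch up
```

The trade-off is visible even in the toy model: larger groups come at the cost of every transaction in a group waiting out the delay.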

To tune this delay, we needed a way to measure the parallelizability of the transactions. Since MySQL does not offer a native metric for this, we decided to build a parser to aggregate the dependency information already present in the binary logs. To do this, we implemented a binary log parser called binparse to examine the dependency relationships among the transactions in a binary log file. Somewhat similar to A Metric for Tuning Parallel Replication in MySQL 5.7, the parser partitions the list of transactions in a binary log as follows: each partition is greedily determined (in order) to be the largest partition where all transactions can be applied in parallel. We use the average size or distribution of such partitions as a measure of parallelizability. Let us refer to the size of each partition as binary log group commit size.

Rather than wait for days to observe the effects of sync_delay on replication lag, binparse provided immediate insights into how different values of binlog_group_commit_sync_delay could affect theoretical parallelizability. Using this tooling on our monolith database, we started observing discernible shifts in the average group commit sizes once we ramped up to around 15ms of sync delay. At this level of sync delay, the average group commit size went from 1.1 to 1.3, which isn’t a huge improvement. The reason is that the monolith database simply didn’t have that many writes per second, so even instituting a multi-millisecond delay didn’t get more transactions to commit around the same time. We also experimented with another MySQL cluster that’s significantly more write-heavy and observed close to a 2x increase in average binary log group commit size. Our final conclusion was that without significantly delaying transactions on our monolith database, we were unable to produce a meaningful increase in replication parallelizability with sync_delay alone.
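The partitioning step can be sketched as follows; this is our reconstruction of the algorithm described above, not binparse itself:

```python
def commit_group_sizes(txns):
    """Greedy, in-order partition of (last_committed, sequence_number)
    pairs: a transaction joins the current group only if it depends on
    nothing at or after the group's first transaction."""
    sizes = []
    group_start_seq = None
    count = 0
    for last_committed, seq in txns:
        if group_start_seq is not None and last_committed < group_start_seq:
            count += 1  # independent of everything in the current group
        else:
            if count:
                sizes.append(count)
            group_start_seq, count = seq, 1  # start a new group here
    if count:
        sizes.append(count)
    return sizes

# The simplified excerpt from the previous section
log = [(1784, 1785), (1785, 1786), (1785, 1787),
       (1785, 1789), (1787, 1790), (1784, 1791)]
sizes = commit_group_sizes(log)
print(sizes, sum(sizes) / len(sizes))  # the average is the parallelizability metric
```

On the excerpt this yields groups of sizes 1, 3, and 2, i.e. an average binary log group commit size of 2.0.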

binlog_transaction_dependency_tracking

After more research, we turned our attention to how MySQL tracks this dependency information of transactions. So far in this post, we have assumed that only transactions with overlapping commit windows are independent, but that doesn’t have to be the case. MySQL 8.0.1 introduced a new way to track dependency information using write-sets. It works by caching the rows each transaction modified, i.e. the write sets, and deciding on independence based on whether the write sets are disjoint. Using this approach, you can have two transactions many seconds apart and still have them marked as independent and therefore parallelizable so long as they each edit disjoint sets of rows, obviating binlog_group_commit_sync_delay.
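A simplified model of writeset-based dependency tracking (illustrative Python only; the real server hashes row identifiers into a bounded history map, which we gloss over here, and the row-key format is our invention):

```python
def assign_dependencies(transactions):
    """For each transaction (a set of row identifiers it writes), compute
    last_committed as the newest earlier transaction that touched any of
    the same rows; transactions with disjoint writesets stay independent."""
    writeset_history = {}  # row identifier -> sequence_number of last writer
    out = []
    for seq, rows in enumerate(transactions, start=1):
        last_committed = max((writeset_history.get(r, 0) for r in rows), default=0)
        for r in rows:
            writeset_history[r] = seq  # this transaction is now the last writer
        out.append((last_committed, seq))
    return out

# Hypothetical "table:primary_key" row identifiers
txns = [{"users:1"}, {"orders:7"}, {"users:1", "orders:9"}, {"items:3"}]
print(assign_dependencies(txns))
```

Note that only the third transaction picks up a dependency (on the first, which also wrote users:1); the others remain parallelizable no matter how far apart they committed, which is exactly why this mode obviates the sync delay.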

By changing to this new way of tracking transaction dependencies, we were able to increase our theoretical transaction parallelizability, or average binary log group commit size, from ~1.3 to ~37 on our monolith database. Here is a before and after comparison of the average commit group sizes from analyzing sampled binary logs with binparse.

Average Commit Group Size: 1.31
+-------------------+------------------------+-------------------------+------------------+-------------------+
| Commit Group Size | Number of Transactions | % of Total Transactions | Number of Groups | % of Total Groups |
+-------------------+------------------------+-------------------------+------------------+-------------------+
|                 1 |                   3008 |                      57 |             3008 |                74 |
|                 2 |                   1764 |                      33 |              882 |                22 |
|                 3 |                    429 |                       8 |              143 |                 4 |
|                 4 |                     80 |                       2 |               20 |                 0 |
|                 5 |                      5 |                       0 |                1 |                 0 |
|                 7 |                      7 |                       0 |                1 |                 0 |
+-------------------+------------------------+-------------------------+------------------+-------------------+

Exhibit 1: binlog_group_commit_sync_delay set to 15ms with binlog_transaction_dependency_tracking set to default

Average Commit Group Size: 37.27
+-------------------+------------------------+-------------------------+------------------+-------------------+
| Commit Group Size | Number of Transactions | % of Total Transactions | Number of Groups | % of Total Groups |
+-------------------+------------------------+-------------------------+------------------+-------------------+
|                 1 |                      2 |                       0 |                2 |                13 |
|                 2 |                      4 |                       1 |                2 |                13 |
|                 8 |                      8 |                       1 |                1 |                 7 |
|                14 |                     28 |                       5 |                2 |                13 |
|                22 |                     22 |                       4 |                1 |                 7 |
|                51 |                     51 |                       9 |                1 |                 7 |
|                59 |                     59 |                      11 |                1 |                 7 |
|                62 |                     62 |                      11 |                1 |                 7 |
|                64 |                     64 |                      11 |                1 |                 7 |
|                73 |                     73 |                      13 |                1 |                 7 |
|                77 |                     77 |                      14 |                1 |                 7 |
|               109 |                    109 |                      19 |                1 |                 7 |
+-------------------+------------------------+-------------------------+------------------+-------------------+

Exhibit 2: binlog_group_commit_sync_delay set to 15ms with binlog_transaction_dependency_tracking set to WRITESET

Note that the actual degree of parallelization is still constrained by the number of replica worker threads. Moreover, due to replica_preserve_commit_order, we also cannot commit replicated transactions in a different order than the primary, but all the work that happens before the final transaction commit is parallelized.

With this new dependency tracking method, we significantly cut down on the replication lag spikes even with binary logs enabled. You can see the lag after we enabled it.

[Chart: replication lag after enabling WRITESET dependency tracking]

Conclusion

Our experience at Airtable in attempting to reduce replica lag in our largest MySQL 8 database has been a process of exploration, learning, and gradual refinement. Despite a long path of trial and error and dealing with MySQL bugs, it’s extremely gratifying to see the lag down to consistent single-digit seconds.

To summarize, here are the most impactful improvements we implemented to reduce lag:

  1. Multi-threaded replication is the single most important thing you can do to reduce lag. We have found that enabling GTID and upgrading to MySQL 8.0.23 or later (the GTID-based relay log recovery rework landed in 8.0.18, and the replication deadlock fix in 8.0.23) remediates the reliability issues we encountered.
  2. Consider designating separate replicas as failover targets that do not serve live traffic, and disable binary logs on those non-failover replicas to minimize lag.
  3. Set binlog_transaction_dependency_tracking to WRITESET to increase the parallelizability of transaction replication. This has been made the default starting with 8.2.0.

Our long-term vision is to have consistent sub-second replication lag. Some future work in this space is:

  • More granular lag monitoring: MySQL’s native replica lag monitoring is measured in seconds; we hope to use pt-heartbeat to get millisecond-level lag monitoring.
  • Backpressure on writes: we think this will be our silver bullet for controlling replication lag spikes. AWS Aurora, a comparable database, has built-in lag-aware write throttling; GitHub implemented freno, a cooperative lag-aware throttle, for their MySQL databases to achieve sub-second lag.

If you’re interested in helping us build them, we’re hiring!

Acknowledgement

We would like to extend our special thanks to Gavin Towey and Brian Zhang for their contributions to this effort, as well as to our management, Micheal Benedict and Andrew Wang, for their support and valuable editorial guidance. We also appreciate the valuable suggestions provided by Daniel Kozlowski, Brian Larson, Brian O’Rourke, and Alexander Sorokin.