Introduction
This work is a postmortem of the periods of downtime between March 1st, 2026 and March 3rd, 2026. We apologize for the technical nature of this writeup, but it is a bit unavoidable since the underlying bug that caused the downtime involves multiple subsystems of our database software. To be clear, this downtime was not due to any sort of attack, compromise or traffic to any specific fandoms.
We also want to remind folks that the Systems committee is made up of 9 volunteers at the time of writing, and all of these events were squeezed in between sleeping and our day jobs.
All times are in UTC unless otherwise stated.
Background
AO3 uses MariaDB, a fork of the popular MySQL database software. Specifically, we utilize the enterprise version of MariaDB, and we maintain a support contract with them so they can assist us with any issues we run into with the database.
AO3 is much too large to run off a single database server, so we currently utilize 5 database servers which are connected in what is known as a Galera Cluster. This is an active-active database cluster, meaning that all of the servers are simultaneously responsible for serving requests. This also allows us to do maintenance on machines without taking AO3 offline. Galera is responsible for making sure all changes to the database are replicated across all of the nodes in the cluster.
We additionally utilize MariaDB’s MaxScale, a database proxy, on top of the Galera cluster. MaxScale runs on each application server, which has the advantage of scaling our proxy capacity as we add more application servers. The primary benefit of MaxScale is that it allows for intelligently routing database requests. The most important example of this is directing write requests to a single server (which is dynamically selected) to avoid conflicts when committing data, while still allowing all nodes to be used for reading, which drastically improves performance.
Diagram showing the flow of database queries from the AO3 application servers, through MariaDB MaxScale, to the database cluster. Database writes are sent to the primary server, while reads are sent to the secondary servers.
This database setup is intended to provide what is known as “High Availability”, meaning that the failure of a single node, or even multiple nodes, should not take down the cluster.
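For those curious, the overall health of a Galera cluster can be checked from the MariaDB console with a few status variables. Here is a quick illustrative check, with the values a healthy five-node cluster would report shown as comments:
SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_size', 'wsrep_cluster_status', 'wsrep_local_state_comment');
-- wsrep_cluster_size: 5 (all five nodes are currently in the cluster)
-- wsrep_cluster_status: Primary (this node is part of the primary component)
-- wsrep_local_state_comment: Synced (this node is fully caught up)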
Timeline of Events
January 4th, 2026
On January 4th, 2026, we upgraded our staging database cluster from MariaDB 10.6.22.18 to 11.4.9.6. We observed no issues as a result, but our staging environment carries far less load than the production environment.
February 28th, 2026
We proceeded to upgrade the production environment from MariaDB 10.6.21.17 to 11.4.9.6. When performing database upgrades, we utilize a rolling upgrade methodology whereby the software is upgraded on one server at a time, allowing it to leave the cluster and rejoin on the new version until every node has been upgraded. The Galera cluster is designed to allow for this, and under normal conditions, the end result is an upgrade with no downtime.
On the servers, we explicitly pin MariaDB and some of its supporting packages to specific versions, rather than just “the latest”, so that they don’t change unexpectedly. However, we did not enforce this specific versioning for the galera-enterprise-4 package, which contains libraries for the Galera clustering portion of MariaDB.
Following the upgrade, the server responsible for handling database write activity changed from ao3-db17 to ao3-db18, as seen by the changes in load:
Graph showing CPU usage on ao3-db17. While the server is the primary, it is more spiky as writes are more sporadic. This changes to a more flat CPU usage when the server switches to handling reads, as that workload is more constant.
Graph showing CPU usage on ao3-db18. It follows a reverse trend from ao3-db17 as it takes over the primary role - its usage goes from a flatter trend to a more sporadic one.
After completing the upgrade and letting things settle, overall, the CPU usage seemed okay, if a bit higher, so we assumed all was well.
March 1st, 2026
On March 1st, we began to see the first signs of issues. At about 17:13 UTC, there was a small but noticeable dip in CPU load on the application servers, and writes to the database seemingly began getting stuck. AO3 became slow and started generating 5XX errors.
Graph showing CPU usage on the application servers. A small dip is noticeable around 17:13 UTC.
The database server CPU load pattern around this time was somewhat odd looking:
Graph showing CPU usage on the database servers. The primary server’s values are very spiky, going from ~1% to anywhere between 3 and 20%.
Our Tag Wranglers first raised the issue internally at 17:29 UTC, and we began investigating shortly thereafter. Upon looking at the MariaDB logs on the primary database server at the time, we saw the following log entry:
Mar 01 17:13:19 ao3-db18 mariadbd[2055529]: 2026-03-01 17:13:19 821 [ERROR] mariadbd: Error writing file '/var/lib/mysql_bin_log/ao3-db18' (errno: 0 "Internal error/check (Not system error)")
On the secondary database servers, we saw messages similar to the following:
Mar 01 17:13:19 ao3-db17 mariadbd[2362947]: 2026-03-01 17:13:19 15 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: -1731559940, event_type: -1
Mar 01 17:13:19 ao3-db17 mariadbd[2362947]: 2026-03-01 17:13:19 15 [ERROR] WSREP: applier could not read binlog event, seqno: 7227029338, len: 18446744071146584725
These errors suggested an issue with the primary database server failing to write to the database “binary log”. The binary log contains data which describes changes to the database. This log is created on the primary server, and read by the secondary servers to maintain consistency.
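For context, the binary log can be inspected directly from the MariaDB console; the log file name below is just a placeholder for illustration:
SHOW BINARY LOGS;  -- lists the current binary log files and their sizes
SHOW BINLOG EVENTS IN 'ao3-db18.000123' LIMIT 5;  -- displays the first few events in a given log file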
These errors were foreign to us, particularly the one on the primary database server, as there were no real hints as to what the problem was. We confirmed that there were no issues with disk space or similar items that might have been causing problems with writing to the binary log files.
We logged a ticket with MariaDB support at 17:42 UTC and started working to bring the cluster back to a normal state. We went into a partial maintenance mode around 17:56 UTC, starting with only bot traffic, and expanding to traffic that we consider possible bots a few minutes later. At about 18:03 UTC, we went into full maintenance mode.
At 18:22 UTC, a member of Systems hopped on a Zoom call with MariaDB support, which was mainly an information gathering session. At 18:27 UTC, we allowed some traffic back in, but performance had not improved. At 19:17 UTC, we blocked bots and possible bots again, and we started changing some settings relating to how much traffic we allow into the site in an attempt to limit load. However, this didn’t seem to help and at 19:30 UTC we went back into full maintenance mode.
Despite the database servers having no traffic on them, we were still seeing “lock wait timeout” errors. This error occurs when a query attempts to “lock” a portion of the database in order to write some data, but fails to do so in a reasonable amount of time. This suggested that the database cluster was, quite literally, locked up, as no write queries were able to succeed. We also started seeing some servers in the database cluster reporting an “Inconsistent” state, essentially meaning that they were not synchronizing database changes correctly. On top of that, the MariaDB service on some servers would not stop correctly, likely because they could not gracefully finish writing before stopping.
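For reference, the timeout in question and any long-running transactions that might be holding locks can be inspected from the MariaDB console. This is an illustrative sketch, not the exact commands we ran:
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';  -- seconds a query waits for a lock before erroring
SELECT trx_id, trx_state, trx_started, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 10;  -- the oldest transactions currently open on this node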
By 20:07 UTC, we had managed to stop all database servers as part of an attempt to restart the cluster. However, when trying to bring the cluster back up from this state, we noticed that the nodes were performing what is called a State Snapshot Transfer (SST) rather than an Incremental State Transfer (IST). SSTs essentially transfer the entire contents of the database from one node to another, compared to an IST which only transfers the differences between the node that is behind and the rest of the cluster. SSTs usually only occur when adding a new node to the cluster, when a node has fallen so far behind that it cannot be caught up incrementally, or if there is some form of corruption in a node’s copy of the database. The fact that we were seeing nodes rejoining via SST rather than IST was a sign that there was something causing corruption on these nodes.
A simplified diagram showing State Snapshot Transfer replication, where the donor server is sending the entire dataset to the joiner server.
A simplified diagram showing Incremental State Transfer replication, where the donor server is only sending the portion of the dataset that the joiner server is missing.
The other difficult part of SST versus IST is that SST takes significantly more time and resources to complete (as it is a full recopy of the data). When a node is acting as a “donor” for another node performing SST, the donor can have degraded performance, and in some cases may be unable to correctly serve requests at all. Given this, and the fact that essentially all of the nodes besides the primary needed to be resynced, AO3 had to remain offline during this process.
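Roughly speaking, whether a rejoining node can use IST depends on how far behind it is, and on how much recent history the donor still has cached. Both can be inspected from the MariaDB console (an illustrative sketch; gcache.size is one of many settings packed into wsrep_provider_options):
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';  -- the last transaction sequence number this node has applied
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';  -- includes gcache.size, which bounds how much history is available for IST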
We considered bringing AO3 back once we reached the minimum required healthy nodes to run the database cluster, but after internal discussion and based on advice from MariaDB support, we decided to wait until all nodes were healthy.
March 2nd, 2026
We did notice that the new version of MariaDB seemed to perform SSTs a bit faster than our past experiences, and we were able to bring AO3 back for normal users at around 00:19 UTC on March 2nd. Considering we had to wait on 5 nodes, this was relatively “fast”.
When we came out of maintenance mode, we did so with more aggressive traffic shaping, and without allowing requests from bots or possible bots. We hoped these measures would help keep the database cluster from falling over again, and the original responding volunteer went to sleep.
Unfortunately, at approximately 01:43 UTC, the cluster fell into a similar state as before. Another volunteer responded and found the same error message on the primary about failing to write to the binary logs, and similar messages on the secondaries about failing to read events from the binary log.
At first, it looked like the cluster might have been somewhat functional, just running slowly. It was not entirely clear that we were suffering from the exact same failure, so we loosened the traffic shaping a bit around 02:23 UTC. However, it soon became apparent that database writes were not functional.
Graph showing the number of running database threads on the primary database server. The line begins close to 0, and shoots up well beyond 2000.
At 02:45 UTC, we went back into maintenance mode with the hope of flushing out the pending database writes, followed by a restart of the database cluster. However, we quickly noticed that the majority of the database cluster had been marked “inconsistent” and essentially all of the nodes were stalled and would not do much of anything.
Screenshot of terminal output showing the status of the database cluster according to MaxScale. 3 of 5 servers report their status as “Running, Inconsistent”.
Once more, the MariaDB service would not gracefully stop on the nodes marked inconsistent, and when we forcefully stopped them and attempted a restart, they began the process of a full resync. This time around, however, we ran into nodes that would complete the SST, then immediately fail to process binary logs in a similar fashion while finishing the resync with IST. On the advice of MariaDB support, we completely cleared out the secondary nodes and reattempted to resync them from the primary once more.
During this process, we noticed that the galera-enterprise-4 package we mentioned earlier had not been upgraded and was still the version from MariaDB 10.6. We thought this could have been the source of the issues, so we installed the appropriate version of the package for MariaDB 11.4. We note, however, that MariaDB’s packages do not seem to require a matching version of this library when upgrading to a later MariaDB version, even though apt (the package manager for Debian, our Linux distribution of choice) fully supports enforcing such dependencies. Additionally, the package versions are the same between the repositories for MariaDB 10.6 and 11.4, so even if we had pinned a specific version, it still would have been unclear whether the installed build was compiled for a specific version of MariaDB or not.
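Had we known to look, one way to spot this kind of mismatch is to ask the running server which Galera provider library it actually loaded (illustrative; the exact path and version string will vary):
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider';  -- path to the loaded Galera library, e.g. something like /usr/lib/galera/libgalera_smm.so
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';  -- version string reported by that library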
Regardless, following the upgrade of the Galera package, we were able to finish resynchronizing the cluster, and we took AO3 back out of maintenance mode around 11:44 UTC.
Generally, things seemed okay, although we did notice an odd looking load pattern once again from ao3-db19, the new primary database server:
Graph showing database server load. The value for the primary server intermittently spikes up to 30-40% sharply before dropping back down.
At 14:14 UTC, we allowed possible bot traffic back into the site since things continued to function normally. Separately, while investigating the load spikes, we found 67 running queries against the audit table (which stores various events relating to user accounts, such as sign-ins and password changes), all for a particular user on the site. Upon further analysis, we found that this user had over 2 million entries in the table, far beyond the amount a typical user would have. We initially thought this could have been malicious activity or a poorly written bot, so we elected to disable this particular user’s account at around 15:03 UTC, which caused the load pattern to stop.
Graph showing multiple spikes in CPU load on the primary database server. In the later part of the graph, no spikes are visible and load remains relatively stable.
We later found out that this seemed to be an issue with the particular Ruby library that we utilize for authentication, which was causing every request from this particular user to add a new row to our audits table. This was not a new bug, but it seems that between MariaDB 10.6 and 11.4, the query that is run in this circumstance became less efficient. We implemented a known workaround for the bug (since, as of writing, the issue has not yet been patched in the library itself) and prepared it for later deployment. The user was later reenabled.
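For illustration, spotting that kind of outlier boils down to a query along these lines (the audits table and user_id column here are assumptions based on a typical Rails audit setup, not necessarily our exact schema):
SELECT user_id, COUNT(*) AS audit_rows
FROM audits
GROUP BY user_id
ORDER BY audit_rows DESC
LIMIT 10;  -- lists the users with the most audit entries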
At 15:14 UTC, an AD&T volunteer noticed the following error in our error tracking tool:
Mysql2::Error: Local temporary space limit reached (ActiveRecord::StatementInvalid)
We thought this was an unrelated issue, so we noted it as something to look into at a later point.
At 17:31 UTC, all continued to be well, so we allowed bot traffic back into the site.
At 20:40 UTC, we looked into fixing the local temporary space limit error above, but realized it was not the issue we thought it was. While continuing to research, we finished deploying the authentication fix at 20:52 UTC. At 20:58 UTC, we opened a new ticket with MariaDB support regarding the temporary space error, as we weren’t finding very much information about it. They responded with a workaround and requested more information to debug the error. We did not initially apply this workaround as the error seemed benign at the time, we were afraid to disrupt the cluster further, and to be frank, our team needed a breather.
Unfortunately, around 22:14 UTC, we encountered the same cluster issue yet again, and despite some attempts to keep things working, we had to reenter maintenance mode at 22:37 UTC. As before, the database nodes began doing full SSTs when we attempted to rejoin them to the cluster. By this point, it was clear that we were stuck in a vicious cycle, so we began leaning harder on MariaDB support to diagnose the problem and give us a clearer idea of what was going on. We also started entertaining the idea of rolling back to MariaDB 10.6.
We joined another Zoom call with MariaDB support and were asked to continue attempting to restore nodes. We planned to have a subsequent call with MariaDB’s Galera expert to see if they could further debug what was going on.
We continued to send logs and config files to support for review. We asked about rolling back MariaDB versions, but were advised against it, as the only “safe” way to do so would be to essentially rebuild the cluster from scratch based on a backup. Additionally, since the issue had not yet been identified, we were also cautioned against upgrading the cluster further, as there was no guarantee that it would help.
March 3rd, 2026
At 02:49 UTC, all nodes in the cluster finished resynchronizing, but we held off on bringing the site live until we had more information. MariaDB support further advised at 03:13 UTC that their replication engineering team was continuing to investigate our logs.
MariaDB support told us that while they were still investigating, they had not immediately found anything that seemed relevant, and advised us to bring the site back online with the intention of capturing the corrupted binary logs for analysis by their engineers to try and pinpoint what exactly was happening. This is what we referenced in our status update around this time.
At 04:35 UTC, we allowed normal users back into the site. By this point, users had taken notice of our intermittent state, and gave our download servers a bit of a workout. 😉
Graph showing download server CPU usage. At 04:35 UTC, the lines begin rising from zero until they all reach 100% just before 04:50 UTC.
We took some measures to help with the load on the download servers, since they were not very responsive with this amount of load. At 04:54 UTC, we allowed possible bot traffic back into the site, and at 05:11 UTC, all traffic had been allowed back onto the site.
At this point, we were still fully expecting the cluster to crash again. However, when we checked our support case, we noticed an interesting response from MariaDB support, sent earlier at 04:57 UTC. They had noticed the second ticket we had filed regarding the “local temporary space limit” error and put the pieces together to realize that both issues were actually related. We were provided the same workaround that had been previously suggested for the space limit error, and we finished applying it across the cluster at 05:51 UTC.
March 4th, 2026
We held our breath and waited, and fortunately, the cluster did not crash again, so we called the incident resolved at 10:54 UTC.
Analysis
With the events of the incident covered, we’ll now try to explain the underlying bug that caused the problem. The bug was known to MariaDB and tracked as MDEV-37808. It has been fixed in the community versions of MariaDB, but the fix is not yet available in the enterprise versions, which we run.
In MariaDB Community 11.5, a new feature was added which allows for setting limits on the size of temporary files on disk. This feature was backported to MariaDB Enterprise 11.4. The idea is to prevent a single query from causing the server to run out of disk space due to the temporary files it may create. If a query would exceed the limit, that query alone returns an error, while the server continues to operate as normal for other queries. MariaDB notes in the documentation that care should be taken when setting small values for these limits while binary logging is in use, as an aborted query can cause problems with replication.
However, the default size limits for this feature are set to 1099511627776 bytes, which translates to just over a terabyte of storage. That is massively larger than any of the temporary files on disk on our database servers. In fact, it is even much larger than the binary logs themselves. So what gives?
According to the writeup by MariaDB developers, the issue arises specifically in Galera clusters. The temporary file limit feature uses a status variable to track how much temporary space is being used, called tmp_space_used. There is an additional variable known as binlog_cache_size, which determines how much space in memory binary log changes can occupy. When a transaction exceeds this size, it overflows into a temporary file on disk. When the transaction completes and is committed to the database (and thus the binary log), the memory space is cleared, and the temporary file is truncated (or in other words, the file is emptied, but still exists on disk). When the file is truncated, MariaDB accordingly reduces the tmp_space_used variable, as you would expect.
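Both of these knobs can be observed at the MariaDB console; as a rough illustrative check:
SHOW GLOBAL VARIABLES LIKE 'binlog_cache_size';  -- defaults to 32768 bytes (32KB)
SHOW GLOBAL STATUS LIKE 'Binlog_cache_disk_use';  -- counts transactions that overflowed the cache into a temporary file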
The issue arises when a database transaction exceeds the default value of binlog_cache_size (which is 32KB), but is smaller than 64KB. In this case, to improve performance, MariaDB does not truncate the temporary file on disk, but rather just resets the write pointer back to the beginning of the file. In a vacuum, this is fine, because MariaDB is still tracking the space this file is taking up in the tmp_space_used variable.
However, all of this changes if the user on the current SQL connection is changed. This is a common occurrence when using a solution such as MaxScale due to SQL connection reuse. When an existing SQL connection is reused, part of the reinitialization process is to change the user, even if it is the same as before. When the user is changed on the MariaDB server, the tmp_space_used variable is reset to zero. But the file on disk has still not been truncated. Therefore, the next time MariaDB writes to this file and truncates it, it performs a calculation which equates to tmp_space_used = tmp_space_used - size_of_temp_file_on_disk, which can result in a negative number.
This seems like it should be fine; however, the devil is in the details. The tmp_space_used variable is what is known as an “unsigned integer”. An unsigned integer is not capable of storing a negative number, unlike a “signed integer”, which can. So, what happens if you try to go below 0 with an unsigned integer? It wraps all the way around to the biggest possible number. That number in this case is over 18 exabytes, which is almost 16,777,216 times larger than the default limit!
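This wraparound is easy to demonstrate at the MariaDB console:
SELECT CAST(-1 AS UNSIGNED);  -- returns 18446744073709551615, the largest possible 64-bit unsigned value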
Therefore, MariaDB throws the temporary space limit error, which causes the binary log to not be written correctly (essentially corrupting it). Subsequent write queries also likely fail with this error, which is likely responsible for the write slowdowns. This broken write state, combined with the corrupted binary logs, seems to have then caused the inconsistent states on the secondary servers that we previously saw, and thus, the general collapse of the cluster.
The code fix from MariaDB was to not reset the tmp_space_used variable when the user changes. The temporary files are also truncated when the user changes for good measure. However, as noted before, this code fix is not yet released in MariaDB Enterprise. Therefore, the workaround that we were provided with was to simply disable this feature entirely, which bypasses all of the faulty logic associated with it. This was achieved with the following commands at the MariaDB console:
set global max_tmp_session_space_usage=0;
set global max_tmp_total_space_usage=0;
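To confirm the workaround took effect, the limits can be checked afterwards:
SHOW GLOBAL VARIABLES LIKE 'max_tmp%space_usage';  -- both limits should now report 0, i.e. disabled
Note that changes made with SET GLOBAL do not survive a server restart, so the same settings also need to be added to the server configuration file to make them permanent.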