MySQL transactions per second vs. fsyncs per second (2020)
sirupsen.com | 121 points by jcartw a month ago
The concept being described is called "furious flushing". I believe Percona did a writeup on this as well some time ago, and I had to assist a DBA with getting systems to keep up with Percona MySQL's demands around furious flushing and with adjusting filesystem and memory options. Here [1] is one of those writeups. OS tuning options have changed a bit since 2014 and there are more recent articles, I believe. The biggest win for us was moving the servers to SSDs and large capacitor-backed RAID caches.
[1] - https://ftp.fau.de/macports/distfiles/percona/PerconaServer-...
Are the Percona folks a bunch of core MySQL/MariaDB engineers?
Not as far as I know, but here is a brief history. [1] They commercially support all of these databases, including MariaDB. Their goal was to collectively update MySQL to take advantage of modern hardware. They are also useful for companies that have complex replication and backup requirements, and they have created a lot of tools for DBAs to manage data. The DBAs I worked with had a great relationship with them and great experiences.
[1] - https://www.percona.com/sites/default/files/The-Story-of-Per...
It hurts my soul that database engines still don’t batch multiple transactions per disk flush as a default setting.
It’s such an insane bottleneck on modern systems.
File system APIs should be able to return events "data is transferred, buffer no longer needed" and "data is safely stored on persistent media". That's what databases really need from the file system.
Interestingly, some IBM mainframe hardware makes this distinction, with "channel end" and "device end". But Linux does not.
Linux does make this distinction.
Linux has two main disk I/O paths, buffered and direct I/O.
Direct I/O indicates "channel end" by the completion of the write operation. By that point, data has been transferred to disk from the buffer and the buffer can be reused. (Notably, direct I/O may be performed asynchronously via the Linux AIO API.) Contrary to popular belief, this does not indicate that the data is safely committed to disk. [1]
Buffered I/O does not require the concept of "channel end", since write calls complete immediately after the buffer contents have been copied into kernel space. (This mode is what PostgreSQL uses by default. I don't know MySQL.)
In either case, `fsync(2)` (or `fdatasync(2)`) is used to indicate "device end". The disk has indicated that data is safely stored. (This can likewise be monitored asynchronously, either via the AIO API for direct I/O, or `sync_file_range(2)` for buffered I/O. The latter is used by PostgreSQL [2].)
Aside – Linux has also recently grown support for this concept in its networking API, via zerocopy `send(2)` functionality in io_uring.
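To make the buffered-I/O case concrete, here's a rough Python sketch (Linux; the path and record are placeholders): the write() returns once the data has been copied into the page cache, and only the fdatasync() corresponds to "device end".

```python
import os

# Minimal sketch of the buffered-I/O path (Linux; path and record are
# placeholders). write() returns once the data is copied into the kernel
# page cache; the buffer is reusable, but nothing is durable yet.
# fdatasync() is the "device end": it returns once the device has
# reported the data as safely stored.
fd = os.open("/tmp/wal-demo.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

record = b"BEGIN;INSERT ...;COMMIT;\n"   # stand-in for a WAL record

os.write(fd, record)    # copied into kernel space; not yet durable
os.fdatasync(fd)        # durable once this returns
os.close(fd)
```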
> This mode is what PostgreSQL uses by default. I don't know MySQL.
Assuming the InnoDB storage engine, the default recently changed to Direct I/O as of MySQL 8.4, see https://dev.mysql.com/doc/refman/8.4/en/innodb-parameters.ht...
It was common to set this even before it became the default though, since InnoDB maintains its own buffer pool caching scheme, and there's no sense in duplicating data pages between the InnoDB buffer pool and the OS cache.
The general idea is that you should assign the majority of the system's memory to the InnoDB buffer pool, and then use direct I/O to bypass the system cache, since InnoDB can manage memory more intelligently than the OS. This is because InnoDB has a better understanding of your workload and its own data structures, vs the OS cache being more general-purpose.
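To illustrate what direct I/O means at the syscall level (conceptually what innodb_flush_method=O_DIRECT asks for), here's a hedged Python sketch. Linux only; the path and sizes are placeholders, and O_DIRECT requires block-aligned buffers, hence the page-aligned mmap.

```python
import mmap
import os

# Rough illustration (Linux only) of a direct-I/O write at the syscall
# level: bypass the OS page cache and transfer straight from the
# application's buffer. O_DIRECT needs the buffer address, file offset,
# and length aligned (typically to the logical block size), hence the
# page-aligned anonymous mmap buffer. The path is a placeholder and must
# be on a filesystem that supports O_DIRECT (tmpfs, for example, does not).
BLOCK = 4096
buf = mmap.mmap(-1, BLOCK)     # anonymous mapping = page-aligned buffer
buf[:] = b"x" * BLOCK          # pretend this is a 4 KiB data page

fd = os.open("/var/tmp/direct-demo.dat",
             os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.write(fd, buf)    # completion is "channel end": buffer transferred, reusable
os.fdatasync(fd)     # still needed for "device end": durable on media
os.close(fd)
```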
Other MySQL storage engines behave differently, for example MyISAM had its own caching only for indexes but relied on the OS cache for row data. (if I'm remembering correctly, that is; been a long while since I've touched MyISAM. It's a non-ACID engine and generally should not be used.)
I thought PostgreSQL did, but now that I check the docs, you are correct, the default `commit_delay` is zero. That would be worth increasing a little if you can afford the latency.
It works even with that setting at zero! Just requires a bit more concurrency.
How would that work without risking loss of committed transactions?
> Fully durable transaction commits are synchronous and report a commit as successful and return control to the client only after the log records for the transaction are written to disk. Delayed durable transaction commits are asynchronous and report a commit as successful before the log records for the transaction are written to disk. Writing the transaction log entries to disk is required for a transaction to be durable. Delayed durable transactions become durable when the transaction log entries are flushed to disk.
https://learn.microsoft.com/en-us/sql/relational-databases/l...
Typically the SQL engine will allow flexibility on this. Not all transactions need to prioritize write-to-disk confirmation over throughput. If you're collecting observability metrics, for instance, those don't have the same data coherency constraints your app model (accounts, etc.) demands. In that case you can accept the logical commit and the tiny chance it might not actually hit the disk. Postgres at least allows customizing this per transaction, I believe, although I'm not quite sure how it works if you compose transactions with distinct synchronization constraints.
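For what it's worth, the per-transaction knob in Postgres is synchronous_commit, which can be scoped to a single transaction with SET LOCAL. A minimal sketch (psycopg2 shown as one possible driver; the DSN and the metrics table are made up):

```python
import psycopg2

# Sketch of per-transaction durability in Postgres. SET LOCAL scopes the
# setting to the current transaction only; with synchronous_commit = off,
# COMMIT returns before the WAL flush, trading a small window of possible
# loss (not corruption) for latency. The DSN and table are placeholders.
conn = psycopg2.connect("dbname=app user=app")   # placeholder DSN

with conn:                       # psycopg2 commits on leaving the block
    with conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit = off")
        cur.execute("INSERT INTO metrics (name, value) VALUES (%s, %s)",
                    ("requests", 1))
# The relaxed durability applied to that one transaction; the next one is
# back to the default, fully synchronous behavior.
```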
Sure, but the comment I responded to was lamenting that the commits are not asynchronous by default. The documentation I linked to was all about the considerations and behavior for asynchronous commits.
I see this insistence that SQL databases should do XYZ by default as corporate dogmatism, kind of. A database is only as useful as it's used! I realize you haven't argued for this, but if we're collectively claiming Postgres can handle 100% of the persistent database needs of an arbitrary app (a very common claim these days), we also need to accept that people will "abuse" SQL databases to prioritize availability over coherency, which was always a major draw of NoSQL engines. I suspect most consumer apps can scale with some form of inconsistency just fine, even if this creates a PR rat's nest, but consumers are far more forgiving of incompetence than greediness. This is very much an "understand your market" sort of decision to make.
So I see what you're saying, but I'd also like more async bindings that lean into customizing the behavior at query- or execution-time. You can build them today but you have to work around whatever sql-binding framework you use and it will still likely result in leaky abstractions.
I see what you mean, but ACID is a fairly foundational expectation for SQL transactions (D being the relevant property here).
That being said, my background is primarily Microsoft SQL more than Postgres. As such I'm occasionally bemused at the sort-of monoculture here around Postgres, where if Postgres doesn't have it, it may as well not exist*.
And so it is in this case (the DELAYED_DURABILITY documentation I linked above). Alas, this doesn't seem to be something I see in standard SQL, so indeed, as you say, it's too bad that the standard doesn't provide for relaxing the rules.
Relatedly, the other interesting thing is the chatter about fsync. I know on Windows that's not the mechanism that's used, and out of curiosity I looked deeper into what MS-SQL does on Linux, and indeed they were able to get significant improvement by leveraging similar mechanisms to ensure the data is hardened to disk without a separate flush (see https://news.ycombinator.com/item?id=43443703). They contributed to kernel 4.18 to make it happen.
> The repeated use of a write request followed by a flush request may be detrimental to performance and will increase traffic on the I/O bus.
> Prior to the Linux Kernel 4.18 updates, Linux could use Fua but only for the file system journaling writes and not data writes.
> If your system supports Fua and you have the Linux Kernel 4.18 or newer updates, you can enable SQL Server trace flag -T3979 and use /opt/mssql/bin/mssql-conf set control.alternatewritethrough 0. SQL Server will use Fua write behavior patterns instead of Forced Flush. File systems supporting optimized Fua align SQL Server on Linux with SQL Server on Windows behavior and performance.
*I think performance of CTEs/Views is another topic where I noticed it, where it was just taken as given that they can hurt performance, whereas in T-SQL they are simply equivalent to subqueries
> As such I'm occasionally bemused at the sort-of monoculture here around Postgres, where if Postgres doesn't have it, it may as well not exist.
FWIW, I, as a medium-to-long-term PG developer, am also regularly ... bemused by that attitude. We do some stuff well, but we also do a lot of shit not so well, and PG is succeeding despite that, not because of it.
> Relatedly, the other interesting thing is the chatter about fsync. I know on Windows that's not the mechanism that's used, and out of curiosity I looked deeper into what MS-SQL does on Linux, and indeed they were able to get significant improvement by leveraging similar mechanisms to ensure the data is hardened to disk without a separate flush (see https://news.ycombinator.com/item?id=43443703). They contributed to kernel 4.18 to make it happen.
Case in point about "we also do a lot of shit not so well" - you can actually get WAL writes utilizing FUA out of postgres, but it's advisable only under somewhat limited circumstances:
Most filesystems are only going to use FUA writes with O_DIRECT. The problem is that for streaming replication PG currently reads back the WAL from the filesystem. So from a performance POV it's not great to use FUA writes, because that then triggers read IO. And some filesystems have, uhm, somewhat odd behaviour if you mix buffered and unbuffered IO.
Another fun angle around this is that some SSDs have *absurdly* slow FUA writes. Many Samsung SSDs, in particular, have FUA write performance 2-3x slower than their already bad whole-cache-flush performance - and it's not even just client and prosumer drives, it's some of the more lightweight enterprise-y drives too.
Edit: fights with HN formatting.
Write data for transaction 1, write data for transaction 2, fsync, signal commit success for transactions 1 and 2.
Up to you how many transactions you want in a batch.
You missed a number of steps. The transactions are independent so they signal completion (to trigger the commit fsync) independently.
You can have the first transaction wait a bit to see if any other commits can be batched into the same fsync. However, that's off by default, as the assumption is you want the transaction to complete as fast as possible.
At least that’s how PostgreSQL implements it.
The clever way to do this is to immediately commit the first transaction when the storage engine is idle.
While it is waiting for the fsync to finish it should batch up any incoming WAL writes and then issue the next fsync immediately after the first one finishes, committing the entire batch at once. Then, and only then, it can reply to clients with “transaction complete”.
Some modern database engines now do this, but many older ones don't because too much of their code assumes one transaction per fsync.
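A toy sketch of that loop (Python; the names and structure are illustrative, not any real engine's code): clients enqueue their WAL record plus a wait handle, and a single writer thread drains whatever accumulated while the previous flush was in flight, writes it, issues one flush, and only then acknowledges the whole batch.

```python
import os
import queue
import threading

# Toy group-commit loop. Clients enqueue (wal_record, done_event); a single
# writer thread drains whatever accumulated while the previous flush was in
# flight, writes it all, issues one fsync, then acknowledges the whole batch.
commit_queue: "queue.Queue[tuple[bytes, threading.Event]]" = queue.Queue()

def wal_writer(path: str) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    while True:
        batch = [commit_queue.get()]       # block until the first commit arrives
        while True:                        # then grab everything else waiting
            try:
                batch.append(commit_queue.get_nowait())
            except queue.Empty:
                break
        os.write(fd, b"".join(rec for rec, _ in batch))
        os.fsync(fd)                       # one flush covers the whole batch
        for _, done in batch:
            done.set()                     # only now is it safe to report COMMIT

def commit(record: bytes) -> None:
    done = threading.Event()
    commit_queue.put((record, done))
    done.wait()                            # caller blocks until durable

threading.Thread(target=wal_writer, args=("/tmp/wal-demo.log",), daemon=True).start()
commit(b"INSERT ...;\n")                   # returns only after the fsync
```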
You still have to wait for the final fsync, which is only requested after the transaction work has completed. So I'm not sure you're gaining much, if anything, from this.
There’s also concurrency issues with writing and fsyncing the same fd: http://oldblog.antirez.com/post/fsync-different-thread-usele...
The key limit is the rate of fsyncs, which is constrained by the user mode to kernel mode transition and physical characteristics of the storage device. In the good old days, it was about a millisecond due to spinning disk latencies, but even on the best SSDs it's about 200 microseconds. This is only about 5K transactions per second, maximum, no matter how trivial the transactions are!
With automatic batching, trivial transactions can be grouped together so that the bottleneck becomes bandwidth, not an absolute rate.
You get to have your cake and eat it too: There's no additional latency added using automatic batching of transactions because when the I/O queue is empty, the next transaction commits immediately, same as normal. If the disk is already in the middle of an fsync, the next one will have to queue up behind it in the storage subsystem anyway, so the DB engine may as well accumulate more transactions in-memory while it is waiting.
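A crude way to probe the flush rate yourself (Python; the path and counts are arbitrary, and drives with volatile write caches can report misleadingly fast numbers):

```python
import os
import time

# Crude fsync-rate probe. Drives with volatile write caches (or firmware
# that acknowledges flushes early) can report numbers that are misleadingly
# fast, so treat the result as an upper bound on honest flushes.
N = 2000
fd = os.open("/tmp/fsync-probe.dat", os.O_WRONLY | os.O_CREAT, 0o644)

start = time.perf_counter()
for _ in range(N):
    os.write(fd, b"x" * 128)   # tiny "transaction"
    os.fsync(fd)               # one flush per transaction, no batching
elapsed = time.perf_counter() - start
os.close(fd)

print(f"{N / elapsed:.0f} fsyncs/sec ({elapsed / N * 1e6:.0f} us each)")
```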
Sure, but now when transaction 1 is "committed", it isn't actually guaranteed to be there in the face of interruption. That's a big change to the default behavior...
What state is new for transaction 1?
There was always a gap between "write to disk" and fsync. Now it's a bit longer because we did some other stuff, but that gap was possible before, too.
We still don't tell people it's committed until the fsync returns.
I'm not sure we're using the same terminology. Committed means the transaction has been hardened to disk. That's the D in ACID.
Otherwise, is the suggestion that there be an artificial delay to allow other transactions to piggyback before returning success on commit 1?
Should that be a default? (That was the context of this thread)
> I'm not sure we're using the same terminology. Committed means the transaction has been hardened to disk. That's the D in ACID.
Yes, the transaction is committed when the transaction is durably written to disk. However, there's not a great API for durably writing to disk: you can write to an FD (or to an mmapped file) and it'll get written eventually, hopefully. fsync asks the OS to confirm the writes on an FD are committed durably, but it is not without its quirks.
> Otherwise, is the suggestion that there be an artificial delay to allow other transactions to piggyback before returning success on commit 1?
Not really an artificial delay. More that if you have multiple transactions waiting to be committed, you shouldn't commit them to disk one at a time.
Instead, write several to disk, then fsync, then send commit notices.
A responsible database engine writes transaction data to an FD, then does an fsync, then signals completion to the client, then moves on to the next transaction, right?
The suggestion is because fsync is rate limited and blocks further writes while it's pending, you can get better throughput by writing several transactions before calling fsync. The database engine still doesn't signal completion until an fsync after a transaction is written, but you have more data written per fsync. There is a latency penalty for the first transaction in the batch, because you must wait for writes for the whole batch to become durable, but because you're increasing throughput, average latency likely decreases.
Really, there's a fundamental mismatch between the capabilities of the system, the requirements of the database engine, and the interface between them. Synchronous fsync meets the requirements, but an asynchronous fsync would be better for throughput. Then the database engine could write transaction 1, call for fsync 1, write transaction 2, call for fsync 2, etc and once the responses came in, signal commits to the relevant clients. Having more requests in pipeline is key to throughput in a communicating system.
I looked into this some more. There are other ways than explicit fsync. See this blurb on FUA (which basically treats the I/O as write-through)
https://techcommunity.microsoft.com/blog/sqlserver/sql-serve...
More from kernel.org: https://www.kernel.org/doc/html/latest/block/writeback_cache...
It is really shocking to me that this isn't more of a thing.
If the I/O block size is 4,096 bytes and you are doing things like updating a user's email address or some boolean preference, you could get a LOT more data packed in per unit space & time. Each block is enough to store an entire page of text or a small image.
The only caveat is that the system needs to be consistently under heavy load, or you would have to force clients to wait unreasonable amounts of time for a sufficient batch to accumulate.
I think the best goldilocks path is to put something like a MPSC ring buffer in front of the command processor and to "take what you can get" in terms of batch size each iteration.
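Quick back-of-the-envelope for the block-packing point above (all numbers are assumptions, not measurements):

```python
# Back-of-the-envelope for packing many small records into one flushed block.
BLOCK_BYTES = 4096       # typical I/O block size
RECORD_BYTES = 100       # e.g. a small "update email address" log record
FSYNCS_PER_SEC = 1000    # roughly the per-device flush rate from the article

one_txn_per_flush = FSYNCS_PER_SEC
batched_per_block = FSYNCS_PER_SEC * (BLOCK_BYTES // RECORD_BYTES)

print(f"one txn per fsync : ~{one_txn_per_flush:,} txn/s")
print(f"packed per block  : ~{batched_per_block:,} txn/s")  # ~40x in this sketch
```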
They do, if you issue the transactions concurrently (i.e., multiple DB connections). This is "group commit" as referenced in the article.
You can't batch sequential transactions without violating the semantics of `COMMIT`. (Data is after all not durable if it's not flushed to disk!) Hence such behavior is not the default in quality databases.
If you are issuing many sequential writes and you don't care about them being individually committed, you can and should group them.
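For the "group them yourself" case, a minimal sketch using Python's DB-API (sqlite3 only so the snippet is self-contained; the same pattern applies with MySQL or Postgres client libraries): one explicit transaction, many inserts, a single COMMIT, and therefore a single flush for the whole batch.

```python
import sqlite3

# Grouping sequential writes yourself: one explicit transaction, many
# inserts, a single COMMIT, and therefore a single flush for the batch.
conn = sqlite3.connect("/tmp/batch-demo.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, body TEXT)")

rows = [(f"event {i}",) for i in range(10_000)]

with conn:   # one transaction around everything in the block
    conn.executemany("INSERT INTO events (body) VALUES (?)", rows)
# Compare with committing after every insert, which pays one flush per row.

conn.close()
```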
>MySQL will group multiple writes with each fsync
In the old days of HDDs, the Linux I/O schedulers (some of them) would also reorder the writes in the queue to minimize head seeks.
>A modern disk can do ~1000 fsyncs per second
Sounds low for an SSD. Haven't benchmarked for a while, though. Sounds like something a 5-7 disk HDD array would do, if I remember the numbers correctly.
The article seems to imply that fsyncs need to happen in a linear order, that is, 1ms / fsync -> 1000 fsyncs/sec. It seems to imply that any batching happens in a linear order as well, that is, we have to completely finish one batch of fsyncs before the next one begins. Is that true? Obviously some systems (including databases) will happily let the beginning of one transaction overlap with another, only simulating linearizable transactions when that is required by the isolation level. But I don't have a great mental model of what the file systems are doing under the hood.
The error semantics of fsync are very important here and this can get messy, for example see https://danluu.com/fsyncgate/
So uh... I read a third of that, knowing roughly how the rest will go, based only on the context of the thread.
However, in the back of my mind I know Dan Luu is preserving it for a reason. And I don't know if that's explained later, or whatever; what I do know is Dan Luu has pointed out, in awe, that most filesystems don't report errors correctly on write failures - as an example.
As in, that isn't even implemented in some fs drivers.
So the very idea that somehow Postgres can do direct IO and it magically gets better - to me that's the joke Dan Luu sees. Maybe.
That Craig person, the OP in the thread. Imagine doing all that work, having other people say, hey, that's an issue; and then all the people saying "so what" or "nothing can be done".
Amazing. I'll have to read the rest a bit later.
> That Craig person, the OP in the thread. Imagine doing all that work, having other people say, hey, that's an issue; and then all the people saying "so what" or "nothing can be done"
For context - Craig's opinion won out, and what he was suggesting (crash-restart and perform recovery) is what postgres has been doing for many years (with an option to revert back to retrying, but I haven't seen anybody toggle that).
After reviewing the test results, the author does mention that the OS appears to be performing a sort of batching or aggregation of fsyncs, and concludes that more than 1000 fsyncs are occurring per second. I've also confirmed this by running some benchmarking on EC2 instances with gp2 volumes: https://justincartwright.com/2025/03/13/iops-and-fsync.html
There’s a nice summary of group commit in this MySQL 5.6 work log for binary log group commit implementation https://dev.mysql.com/worklog/task/?id=5223
Great blog here too on changes since : https://hackmysql.com/group-commit-and-trx-dependency-tracki...
This has been something I’ve been curious about for a long time, so thanks for the write up.
When it comes to system design, we often ignore what modern hardware is capable of and ignore fundamentals like fsyncs/sec or, more generally, IOPS.
So instead of assuming that we potentially need a horizontally scalable system, we should instead find hardware that matches our needs.
Only skimmed through, so might be wrong, but got the impression that the mystery is never really solved… If transactions are batched, then MySQL will sometimes have to return OK before the transaction is on disk, no? But it’s stated that the configuration is fully ACID. Seems like a contradiction.
No.
Two non-conflicting transactions (where neither transaction reads the other's changes) can both wait for the other to commit, so they can be left hanging, and the COMMIT/fsync can be batched.
Depending on how you interpret ACID, even two conflicting transactions (B reads what A wrote) can be left hanging. Whoever submitted A can be left blocked at COMMIT, while B is processed assuming A had committed; then either both commit or both fail. This is ACID as long as you only consider data committed once the COMMIT actually succeeds, and don't produce observable effects (outside the system) before that's true.
The performance test program does one insert at a time, sequentially…
No, it uses 16 threads, and all the inserts are non-conflicting (they don't use information from previous transactions).
The transaction is written to the write ahead log and fsync'd before the client is told OK.
It may not have made it to the table storage on disk yet (but will be in memory), but if there's a crash the WAL can be replayed to recover safely - you'll see postgres do this if it crashes from OOM or gets restarted unexpectedly
I might be wrong, but most disk controllers report a write as complete when it isn't actually written to the drive yet.
But then it's in their cache, which is battery-backed and survives power losses. It's the same concept that the write-ahead log implements at a higher level.
This is the same fella who is hosting github.com/sirupsen/logrus! Man, I would like to say a big THANK YOU for all your work there!
Couldn’t be too hard to find max batch time in the mysql source and see what it does…