I recently became a Major Contributor to the PostgreSQL project. It’s a pretty exciting milestone for me, so naturally my colleagues asked me to write a post about it. To make sure I didn’t have any excuses, they even went ahead and thoughtfully put together a list of my achievements for me. It came out great, but I just couldn’t bring myself to publish a “look how awesome I am” post under my own name. I don’t mind talking about it, and I’ll shout from the rooftops about the tech built by my team — or even by me personally — but usually only when I’m explaining how to use it or when I’m chatting one-on-one or within a small group.
Instead, I decided to write about what didn’t work. This post was written in a bit of a rush, so parts of it are fairly technical. Don’t worry if you have to Google a few terms — that’s normal. And if everything here makes perfect sense to you, it might be a sign to cut down on screen time and touch some grass.
Making incremental improvements to any popular technology often comes with a cost. More often than not, the “fixes” proposed for PostgreSQL do more harm than good. Building something new from scratch without breaking anything is hard enough, but trying to do it within the PostgreSQL core is like going through a labyrinth lined with traps.
Out of everything I promised to deliver to Yandex users, Yandex Cloud customers, support engineers (especially Pasha, aka Amatol), my managers, and my team, I feel like I’ve completed less than a third. Most of my adventures ultimately run into technical dead ends. Sometimes I return to old projects and find a new way through — only to reach another dead end further down the road.
That said, plenty of other patches did make it through: incremental improvements to SLRU, GiST index optimizations, UUID v7 support, and ongoing work on Cloudberry and SPQR. But those stories have been told elsewhere. Here I want to talk about what’s harder to see: the infrastructure, the dead ends, and the labour that makes the successful commits possible.
I’m not unique in hitting stalls and dead ends. Other famous efforts share the same fate: the zheap project, the 64-bit transaction counter, TDE, incremental materialized views, and plenty of other much-desired and technically profound work.
So what did I fail to pull off on my journey from my first message on pgsql-hackers to becoming a Major Contributor?
Merging Pages in a B-Tree
Before joining Yandex, I specialized in indexing. In my very first week on the job, we went over various projects that Vladimir Borodin wanted me to move forward. This included backup technologies, monitoring and diagnostic tools, traffic management, and several optimizations — basically everything he considered essential for the Data Platform he was putting together. Back then, the Yandex Cloud site was just a landing page with a video of spinning server fans, but inside Yandex people were already using his product.
One distinctive feature of the PostgreSQL-as-a-Service we were working on then was a regular Friday rebuild of B-tree indexes to clear out accumulated index bloat. At the time, we routinely rebuilt tens of terabytes of indexes. I suspect that today it’s already measured in petabytes.
One reason index pages end up poorly organized after a cleanup is that PostgreSQL lacks a built-in merging mechanism for B-tree pages. For the full algorithm behind this operation, see A Symmetric Concurrent B-Tree Algorithm.
This B-tree algorithm is fairly simple, and the paper illustrates its two stages in the following diagram:
[Diagram from the paper: the two stages of the merge algorithm]
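From memory (and hedged accordingly), the two stages look roughly like this; see the paper itself for the precise invariants:

```
stage 1, "half-merge": move all tuples from the right page into its left
    sibling; leave the emptied page in place, with its sideways link still
    routing concurrent scans toward the surviving data
stage 2, "remove":     delete the parent's downlink to the emptied page,
    then reclaim the page once no scan can still be visiting it
```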
I was familiar with this paper, so I figured I could implement the algorithm in two days. I genuinely thought I had given myself a generous amount of time. However, PostgreSQL community developers warned me that it probably wouldn’t work: the algorithm wasn’t very compatible with our B-tree implementation. I’ve revisited this project multiple times, but I always run into two structural constraints:
Under the PostgreSQL buffer pinning model, the proposed algorithm doesn’t guarantee that an IndexScan will encounter a specific indexed tuple only once.
PostgreSQL supports backward scans, which may skip rows when racing against page merges.
You can read more about the latest approach to this problem here.
Compression of the Replication Protocol
At Yandex, we follow the “N−1 Data Center” redundancy model: the service must survive the loss of any single data center, so our PostgreSQL databases are always spread across several of them. Every change on the database write node (the replication primary) must be streamed to a Hot Standby, which handles read queries.
Our tests showed that protocol compression could potentially make replication 20 times faster.
What made this even more promising was that we weren’t starting from scratch — Konstantin Knizhnik had already posted a working patch to the community!
My team and I joined the discussion, with Daniil Zakhlystov handling implementation on our side. Over time, the developer of the original patch seemed to lose interest, and Daniil made an attempt to revive the effort.
So what was holding this technology back? It seems that, early on, security concerns were a strong limiting factor. For example, compression was dropped from TLS not that long ago, in 2015. Why? Because compression can expose encrypted data: a vulnerability known as CRIME. If you compress secret data together with attacker-controlled input, the length of the resulting network packet reveals how closely the attacker’s guess matches the secret.
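The length-leak mechanism is easy to demonstrate in a few lines of Python (a toy sketch of the CRIME setup, not an actual exploit; the cookie value is made up):

```python
import zlib

def packet_len(secret: bytes, attacker_input: bytes) -> int:
    # The CRIME setup: secret and attacker-controlled data end up in the
    # same compressed stream, and the attacker observes only its length.
    return len(zlib.compress(secret + attacker_input, 9))

secret = b"Cookie: session=8f3a2c9d71\r\n"

# A correct guess repeats the secret, so deflate replaces it with a short
# back-reference and the packet shrinks; a wrong guess stays literal.
right = packet_len(secret, b"Cookie: session=8f3a2c9d71")
wrong = packet_len(secret, b"Cookie: session=q1w2e8r4t5")
print(right, wrong)  # the shorter packet marks the closer guess
```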
That said, MySQL and Oracle have supported protocol compression since long before CRIME was discovered, and they aren’t exactly rushing to remove it. It’s simply too useful, and CRIME is very hard to exploit in practice. Besides, PostgreSQL already has a CRIME-like exposure via WAL compression. That’s why WAL compression settings are only available to superusers who can weigh those risks themselves.
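For reference, the setting in question looks like this in postgresql.conf (a sketch; the lz4 and zstd values require PostgreSQL 15+ built with those libraries):

```ini
# Compress full-page images written to WAL.
# Changeable only by superusers, who can weigh the side-channel risk.
wal_compression = lz4    # off | pglz | lz4 | zstd
```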
By the time the community finally agreed that the benefits outweighed the risks, the patch authors no longer had the energy to continue working on it. We might revisit this topic later.
Extreme Cases of Synchronous Replication Guarantees
The guarantee that not a single byte is lost when a cluster node goes down rests on synchronous replication: the client receives a commit confirmation only after the replicas have received all of the transaction’s data. If the connection is temporarily disrupted, the client must wait for that confirmation, sometimes for quite a long time.
Why the wait? Because in the meantime the other nodes may form a new cluster, one that never received the transaction and will not accept it.
However, if the client cancels the transaction, its effects may become visible, and the client might assume that the data was updated even though it wasn’t.
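Sketched as a psql session (the table is illustrative; the WARNING and DETAIL lines are what PostgreSQL actually prints when the wait is cancelled):

```sql
SET synchronous_commit = on;  -- wait for the synchronous standby to confirm
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;   -- the standby is unreachable, so this blocks...
-- the client cancels (Ctrl+C) and the server replies:
-- WARNING:  canceling wait for synchronous replication due to user request
-- DETAIL:  The transaction has already committed locally, but might not
--          have been replicated to the standby.
-- The UPDATE is now visible to every session, even though the client
-- saw its COMMIT "fail".
```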
Evgeny Dyukov and I talked about this at HighLoad++ and FOSDEM.
We’ve brought this up with the community multiple times. Major cloud providers (AWS RDS, Azure, Yandex Cloud) use a patch that is functionally equivalent to the one we proposed to the community.
So far, I haven’t been able to convince the committers that we should disallow canceling queries that haven’t been replicated yet. Moreover, even if I were a committer right now, I would not commit this without consensus among a large group of committers. Still, I believe it is technically correct and the behaviour an HA system should offer.
Accelerating PGLZ Compression
In 2019, I told Vladimir Leskov over lunch how ClickHouse sped up LZ4 decompression using a whole set of hardware-specific optimization techniques.
Vladimir is a competitive programmer who holds a medal from the ACM ICPC World Finals. When I suggested moving bytes in larger, fixed-size chunks, he said something along the lines of: “Asymptotically, optimizations like that don’t change anything. You simply need to move bytes using an expanding range.”
When something is truly simple, competitive programmers don’t say it’s simple — they try to act it out in pantomime or come up with a relevant anecdote. If they call it “simple,” it’s usually something you’ll find in research papers or specialized forums. We implemented the idea right after lunch, and it became part of PostgreSQL, accelerating decompression for the PGLZ codec.
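The “expanding range” idea can be shown in miniature (a Python sketch of the technique; the real patch does this in C over PGLZ’s history buffer):

```python
def expand_copy(buf: bytearray, off: int, length: int) -> None:
    # Decode one LZ back-reference: append `length` bytes taken from
    # `off` bytes before the end of `buf`. When the match overlaps the
    # output (off < length), every chunk we append enlarges the region
    # we may read from, so chunk sizes grow geometrically instead of
    # moving a single byte per iteration.
    src = len(buf) - off
    avail = off                       # bytes safe to copy in one go
    while length > 0:
        n = min(avail, length)
        buf += buf[src:src + n]       # copy a whole chunk at once
        length -= n
        avail += n                    # readable range has just grown

buf = bytearray(b"ab")
expand_copy(buf, 2, 6)                # back-reference: offset 2, length 6
print(bytes(buf))                     # b'abababab'
```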
Vladimir didn’t stop there. On Sunday, while the team he coached was busy with a contest, he sat down and wrote a similar optimization for the compression side.
That optimization turned out to be bulky, though, with far too many lines of code.
In the end, we agreed to split the optimization into four separate parts, but we no longer have the interest or energy to do that.
Vladimir is now working on completely different projects, and I don’t have the mental bandwidth to dig into the details of that patch code. Current LLMs give up too, which puzzles me, because the code is not all that complicated.
BFS vs. DFS
These are only a few projects that ultimately produced little more than hot air and gave us a slightly better understanding of database processes. As my colleague Kirill Reshke put it: “Now you misunderstand PostgreSQL less.”
Looking at these projects, you might think the answer is simply to be more persistent: pick fewer directions and see them through, carrying each project to the point where its benefits outweigh its risks and the result is genuinely usable. Even I sometimes worry that some of my projects resemble this well-known meme:
[Image: the meme]
But I’m convinced that the tunnel-digging analogy is the wrong way to look at it. Most unfinished projects build the infrastructure that lets other projects reach the finish line and deliver real value.
They say, “Walk, and you shall reach.” Similarly, a labyrinth full of traps is explored step by step — but in many directions at once.
Exactly 10 years ago, I sent my first message to pgsql-hackers. On the 10th anniversary, I posted a short message in that same thread for those who are considering contributing to Postgres.