The AWS Lambda 'Kiss of Death' - Shattered Silicon

Our story begins as most database issues start: with hands on foreheads, internally or externally, saying ‘WTF is going on?’.

We observed a series of database freezes on our production environment, and they were quite severe. Connections spiked, writes stalled, and at some point the database froze outright before everything cleared again. Since this is a Galera environment, the first question was whether something in the cluster was stalling writes. But the issue only seemed to affect the writer node.

But what could it be? Was the server misconfigured?

It could have used a few tweaks, but nothing that would cause this.

An old, raggedy man stood up. ‘Two years ago’, he began, ‘we had database freezes that had to do with a long InnoDB history list length.’

We opened the monitor, typed in the metric and sure enough, there were very large spikes in the innodb_history_list_length.

The numbers were insane. Out of this world. As a reference, 100k is high and should be used as an alarm setting. What could be causing this?
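As a rough illustration of the alarm mentioned above, here is a minimal Python sketch that pulls the "History list length" line out of the text returned by SHOW ENGINE INNODB STATUS and compares it to the 100k threshold. The function names and the sample text are made up for this example; they are not taken from our monitoring setup.

```python
import re

# Alarm threshold from the text above: 100k is already high.
HISTORY_ALARM_THRESHOLD = 100_000


def history_list_length(innodb_status: str) -> int:
    """Extract the number from the 'History list length N' line."""
    match = re.search(r"History list length\s+(\d+)", innodb_status)
    if match is None:
        raise ValueError("no 'History list length' line found")
    return int(match.group(1))


def should_alarm(length: int, threshold: int = HISTORY_ALARM_THRESHOLD) -> bool:
    """Return True once the history list length exceeds the threshold."""
    return length > threshold


# Made-up excerpt for illustration only; not output from the incident.
sample_status = """
------------
TRANSACTIONS
------------
Trx id counter 123456789
History list length 2500000
"""

length = history_list_length(sample_status)
print(length, should_alarm(length))
```

In practice the status text would come from whatever client or monitoring agent you already use to query the server.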

We looked at the undo log files on disk and were blown away – 80 GB!!

We did a lot of research on this, but advice was hard to come by. Then we tried to see what was holding these transactions for so long to see if we could release them.

We ran some queries against the information_schema.INNODB_TRX table to see if we could get any answers.
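For readers who want a starting point, the following is a hypothetical sketch of this kind of investigation, not the queries we actually ran. It assumes information_schema.PROCESSLIST is available to join against, and the helper names, the 60-second cut-off and the sample rows are all invented for illustration.

```python
from collections import Counter

# Hypothetical diagnostic query: open InnoDB transactions with their owner.
OPEN_TRX_QUERY = """
SELECT p.USER    AS user,
       p.HOST    AS host,
       p.COMMAND AS command,
       t.trx_mysql_thread_id AS thread_id,
       TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS trx_age_seconds
FROM information_schema.INNODB_TRX AS t
JOIN information_schema.PROCESSLIST AS p
  ON p.ID = t.trx_mysql_thread_id
ORDER BY t.trx_started
"""


def idle_open_transactions(rows, min_age_seconds=60):
    """Keep rows whose connection is idle ('Sleep') while its transaction is still open."""
    return [
        row for row in rows
        if row["command"] == "Sleep" and row["trx_age_seconds"] >= min_age_seconds
    ]


def count_by_user(rows):
    """Count suspicious transactions per database user."""
    return Counter(row["user"] for row in rows)


# Made-up rows shaped like the query result; not data from the incident.
sample_rows = [
    {"user": "lambda_app", "host": "10.0.0.5:41000", "command": "Sleep",
     "thread_id": 101, "trx_age_seconds": 5400},
    {"user": "lambda_app", "host": "10.0.0.6:41020", "command": "Sleep",
     "thread_id": 102, "trx_age_seconds": 3600},
    {"user": "reporting", "host": "10.0.0.9:50100", "command": "Query",
     "thread_id": 103, "trx_age_seconds": 2},
]

suspects = idle_open_transactions(sample_rows)
print(count_by_user(suspects).most_common())
```

A sleeping connection with an old open transaction is the pattern to look for: nothing is running, yet the transaction is still holding on to its snapshot.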

Then we identified a particular user and tried killing its connections to see if that would help release the history.

So we ran this:

And then this happened:

The pressure was released – alongside some freezes.

It turns out that this was an issue of connection pooling in AWS Lambda. Connections are reused between invocations, and some of them start a transaction and then never commit it or roll it back.
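One way to guard against this on the application side is to make sure every unit of work ends in a commit or a rollback before the connection goes back to being idle. The sketch below is an assumption about how that could look in a Python Lambda, not code from the affected application. It relies only on the commit() and rollback() methods that Python DB-API connections provide; clean_transaction and RecordingConnection are hypothetical names, and the latter is just a stand-in so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def clean_transaction(conn):
    """Never let a reused connection leave this scope with an open transaction."""
    try:
        yield conn
        conn.commit()      # normal path: end the transaction
    except Exception:
        conn.rollback()    # error path: still end the transaction
        raise


class RecordingConnection:
    """Stand-in for a DB-API connection; it only records what was called."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")


conn = RecordingConnection()
with clean_transaction(conn):
    pass  # run the invocation's queries here
print(conn.calls)
```

With a real driver, the handler body would sit inside the with block, so that an early return or an exception cannot leave the transaction open.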

While trying to figure out how to reduce it, we noticed that setting the innodb_undo_log_truncate to ON and then giving innodb_max_undo_log_size a limit had a positive effect. But the best effect was when we asked that the Lambda user have a session variable of “transaction_isolation=READ-COMMITTED” (instead of the default of REPEATABLE-READ).
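The settings above could be applied roughly as follows. This is a sketch under assumptions: the size limit is a parameter because the value we used is not given here, undo truncation is assumed to have separate undo tablespaces to work on, and prepare_session is a hypothetical hook that the application would call right after opening a connection. It uses the SET SESSION TRANSACTION ISOLATION LEVEL form of the statement rather than assigning the variable directly.

```python
def undo_truncate_statements(max_undo_log_size_bytes: int) -> list[str]:
    """Build the statements that enable undo log truncation with a size limit."""
    if max_undo_log_size_bytes <= 0:
        raise ValueError("size limit must be positive")
    return [
        "SET GLOBAL innodb_undo_log_truncate = ON",
        f"SET GLOBAL innodb_max_undo_log_size = {max_undo_log_size_bytes}",
    ]


# Statement form of switching one session to READ COMMITTED.
SESSION_ISOLATION = "SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED"


def prepare_session(cursor) -> None:
    """Run once on a freshly opened connection for the Lambda user."""
    cursor.execute(SESSION_ISOLATION)


# The 1 GiB limit here is an arbitrary example value, not a recommendation.
for statement in undo_truncate_statements(1024 ** 3):
    print(statement)
print(SESSION_ISOLATION)
```

The global statements need an account with sufficient privileges and do not survive a restart unless they are also written to the server configuration.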

Why is that?

InnoDB uses Multi-Version Concurrency Control (MVCC) to provide non-locking reads. When a row is updated or deleted, InnoDB does not discard the old data immediately. Instead, it writes the new version of the row and keeps the old version(s) in the undo log (the “history”). The history list length tracks how much of this old history is still pending purge.

The purge thread can only remove old versions when they are no longer visible to any active transaction’s read view (a consistent snapshot of the database state).

Under REPEATABLE-READ, a transaction creates a single read view at its first consistent read (usually the first SELECT). This read view persists for the entire lifetime of the transaction. Even a simple transaction that only does one quick SELECT and then sits idle without committing continues to “pin” all row versions that existed at that moment, exactly as a long-running one would. In a busy system with frequent writes, this blocks the purge process, causing the history list to grow rapidly.

This effect is especially pronounced with connection pooling. Pools reuse the same underlying database connections across many application requests. A connection may sit idle in the pool for seconds, minutes or even hours between uses. Idle or slowly recycled pooled connections with lingering read views are a frequent culprit for unbounded history list growth, even without obvious “long-running transactions.”

Galera relies heavily on InnoDB and adds its own certification-based conflict resolution for multi-master writes – so the problem is compounded.

Running SET SESSION transaction_isolation='READ-COMMITTED' ensures that read views are short-lived: as soon as a statement finishes executing, its read view is released. The next statement in the same transaction gets a fresh read view reflecting the latest committed state at that moment. Purge can keep up better, which keeps the history list shorter and more stable.

Consequences:

Read views typically live only for the duration of a single query (often milliseconds).
Old row versions become eligible for purge much sooner — often right after the statement that might have needed them completes.
Even if a connection sits idle in the pool for a long time, there is no long-lived read view pinning history from previous requests (because each statement’s view dies when the statement ends).
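To make the difference concrete, here is a deliberately tiny toy model of the behaviour described above. It is not how InnoDB is implemented: it assumes a single connection that ran one SELECT, never committed, and then sat idle while other sessions committed writes, and it treats each committed write as one history entry. The function name and the numbers are made up.

```python
def simulate(isolation: str, writes: int) -> int:
    """Return the unpurged history length after `writes` committed updates
    while one idle connection keeps its transaction open."""
    if isolation == "REPEATABLE-READ":
        pinned_at = 0      # read view taken at time 0, kept until commit
    elif isolation == "READ-COMMITTED":
        pinned_at = None   # read view was released when the SELECT finished
    else:
        raise ValueError(f"unsupported isolation level: {isolation}")

    history = []           # commit numbers of entries waiting for purge
    for commit_no in range(1, writes + 1):
        history.append(commit_no)
        if pinned_at is None:
            # No read view can still need old versions: purge everything.
            history.clear()
        else:
            # Entries newer than the pinned snapshot must be kept for it.
            history = [c for c in history if c > pinned_at]
    return len(history)


print(simulate("REPEATABLE-READ", 1000))  # prints 1000
print(simulate("READ-COMMITTED", 1000))   # prints 0
```

In the toy model the history grows with every write for as long as the idle REPEATABLE-READ transaction stays open, while under READ-COMMITTED purge is never blocked by it.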

Conclusion

I hope this helps you avoid the Lambda ‘kiss of death’. I will be following up this post with my recommendation that transaction_isolation on MariaDB/MySQL, especially with Galera, default to READ-COMMITTED, just as it does on PostgreSQL and MS SQL Server.