Memory Debugging — A Cautionary Tale


For months, everything was working fine. Then one day it wasn’t.

Brian McManus

Gather round the campfire kiddies for a tale of adventure as I do battle with my very own white whale — background worker memory bloat! Minds are blown (mine). Souls are crushed (again, mine). Tears are shed (yep, me too). Nearly a year of my life goes by!

In the spirit of Captain Ahab, I never relent even whilst being dragged into the murky depths but thankfully, unlike poor Ahab, I don’t die in the end (spoiler alert?).

We released the new version of our platform in January of 2016. It was a ground-up rewrite (another story for another time). It was going to make everything better. And in the beginning, everything was perfect and beautiful…

A large part of our product involves ingesting large data sets from our customers. These files arrive in a variety of formats (CSV, Excel, JSON, etc.) which we then normalize, process, and import into our system to power our messaging platform. There’s a lot going on in this system.

The first sighting of the white whale, the dreaded Heroku memory quota exceeded error, was roughly three months after our release. It started off slowly at first… First just a few a week. Then a few a day.

Shit, OK well that’s not great. I’d dealt with this beast before in a previous life. But I rationalized to myself, “We’re scaling and adding more customers and the data we’re working with can be fairly large so maybe it’s just growing pains.”

Besides, we’re using the excellent Sidekiq Pro for our background processing needs with reliable fetch enabled, so even when Heroku bounces our worker dynos for memory usage, nothing bad happens. I tweaked a few things like dyno sizes and concurrency settings but didn’t put much effort into figuring it out.
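
For the curious, reliable fetch is what makes those restarts survivable: any jobs a killed dyno was holding get re-enqueued instead of lost. A minimal sketch of that kind of setup (the values and file layout here are illustrative, not our actual config):

    # config/initializers/sidekiq.rb
    Sidekiq.configure_server do |config|
      # Sidekiq Pro's reliable fetch: in-flight jobs survive a dyno restart.
      config.super_fetch!   # older Sidekiq Pro releases used config.reliable_fetch!
    end

    # Concurrency usually lives in config/sidekiq.yml (e.g. ":concurrency: 10") or the
    # -c flag; fewer worker threads means fewer jobs held in memory at once.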

Ok, I guess we’ll just let the beast be for now…

Round one goes to the whale…

Fast forward a few months and the memory errors were not getting any better… Nothing was on fire exactly and our data pipelines were still flowing but now when I looked at our memory graphs in Heroku, it was more or less a constant stream of pretty red memory errors. We were getting thousands of them per day.

I asked one of my lead engineers at the time to take a look at the memory problem. I know how much of a nightmare tracking down production memory problems like these can be, so I wanted to make sure one of my best people was on it.

Let’s just say he was a bit surprised by how bad things looked…


It looked a lot like this where all those red blips are multiple Heroku memory errors…

He spent a week or two profiling our data import system and fine-tuning various things. He swapped out parsing libraries and made various other optimizations. We were unable to reproduce the problem locally (or even in staging), which certainly didn’t help matters, so we were flying blind.
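
For anyone who hasn’t been through this particular ring of hell, the local profiling loop looks something like the sketch below; memory_profiler is one common tool for it, and the file and class names are placeholders rather than our actual code:

    require "memory_profiler"

    report = MemoryProfiler.report do
      # Exercise the suspect code path with a real production file pulled down from S3.
      ImportService.new("tmp/big_customer_export.csv").call
    end

    # Breaks down allocated vs. retained objects and bytes by gem, file, and class.
    # Retained objects (the ones still alive after the block) are the usual suspects.
    report.pretty_print(to_file: "tmp/memory_report.txt")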

We deployed the changes and waited…

Aaaaand fuck, that didn’t do a damn thing… OK.

By now we were knee deep into our acquisition process and nobody had time to look into this anymore. Besides, it still wasn’t really a problem.

Round two goes to the whale. Things were still working though…

After the dust settled from the acquisition I was left with a skeleton crew. One other engineer and I were the only ones they were willing to pay to help with the transition. Needless to say we had our hands full, so while this memory issue was fucking annoying, it had to wait.

But every so often I would go back and look at our memory usage in Heroku and I would weep quietly to myself.

We finally got past the initial transition and integration hurdles and found a tiny bit of breathing room to clear some technical debt. I checked the memory graphs just to see and…

HOLY SHIT EVERYTHING IS ON FIRE! THE SINGULARITY IS ABOUT TO OPEN BENEATH OUR FEET…

A slight exaggeration. But if you take that graph above and swap all the pale red out for bright blazing inferno red instead you’d have some idea of what I was seeing when I checked our metrics. We were now triggering over 20K memory errors per day (I’m so sorry Heroku). God only knows how often those worker dynos were being killed and restarted.

Surprisingly though, things were still more or less working…

But this could not stand! I mean hell if the folks that acquired us ever bothered to look at these graphs they’d probably (rightly) shit a brick!

Time to dive in again...

I started going through some of the same steps that I’m guessing my lead engineer had: memory profiling the import system while processing large files locally, testing with real production data files I downloaded from S3, tracking object retain counts. I started to see small things here and there that looked like they could be improved.

I refactored anything and everything that looked like it was holding memory unnecessarily. I used bang! methods where possible so I’d mutate the current object instead of making a new one. I needlessly froze strings left and right in a desperate attempt for something, anything I did to matter.
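
To give a flavor of the micro-optimizations I mean (illustrative snippets, not our actual code):

    # Mutate in place instead of allocating a copy:
    line.strip!                      # rather than line = line.strip
    rows.map! { |row| clean(row) }   # clean is a stand-in for per-row normalization

    # Freeze literals so repeated strings share a single object:
    DELIMITER = ",".freeze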

I had hoped these small changes would add up to something big. They never did.

Welp, another couple of weeks burned… Round three goes to the whale.

Another few months passed by. The memory errors are haunting my nightmares at this point. When I close my eyes I see the red rectangular menace mocking me…

One day, something snaps. Not a lightbulb, I mean something in me broke. I decide to do that which shall not be done. I decide to hunt the beast down straight on production. I will literally short-circuit our entire data import system to find the culprit.


Think of this as a really terrible version of puts debugging. A tragic git bisect that impacts our customers. But what else can I do, I’ve exhausted everything else. I’m exhausted. Please just make it stop…

Towards thee I roll, thou all-destroying but unconquering bug; to the last I grapple with thee; from hell’s heart I stab at thee; for hate’s sake I spit my last breath at thee. — Captain Ahab (modified slightly)

Now I know this sounds extreme, but trust me when I tell you it was safe. You must understand how your system works to make a decision like this. A few hours of lagging data would not hurt our customers.

My first goal was to confirm my theory that it was indeed the data import system. Oh, I didn’t mention that? Yah, we weren’t sure yet that it even was the import system — it just seemed like the most obvious candidate.

I turned pretty much everything off by just commenting out the service call our background job makes. Data would come in but it would just silently do nothing. Concretely, the off switch was about as crude as it sounds (the job and service names below are invented for illustration):
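
    class DataImportJob
      include Sidekiq::Worker

      def perform(file_id)
        # Temporarily short-circuited in production to isolate the memory bloat.
        # DataImportService.new(file_id).call
      end
    end

I deployed. I waited…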

Hey, it worked!

Ok, well that at least confirms the problem is somewhere in the data import system. That’s a HUGE win at this point. Ok, let’s turn that back on.

Next step, go down a bit deeper and run more of the file processing and data manipulation but again short circuit before doing too much work. I deployed. I waited…

Still works…

A bit more of this and I’ve narrowed it down to a single service composed of two collaborators. One does a bunch of heavy duty data import stuff. The other is two lines long and schedules an email to be sent.

Guess which one it was…

I had checked that service SO many times. It was such a small and seemingly insignificant thing in the grand scheme of things. There was almost too little code to even bother checking it. But then finally something stuck out at me…

We are calling length on an ActiveRecord relation? Oh hey, isn’t that potentially terrible? Oh no…

For those that aren’t aware (as many, including the original author of this small service, weren’t), in Rails, calling length on an unloaded relation actually loads and instantiates all of the associated objects in memory and then returns the length of that array. Calling count, on the other hand, issues a simple SQL SELECT COUNT(*) query and returns the number straight from the database; no objects are instantiated at all.
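
In code, the difference looks roughly like this (customer and data_imports are stand-in names, not our actual models):

    customer.data_imports.length  # loads and instantiates every associated record, then takes Array#length
    customer.data_imports.count   # runs SELECT COUNT(*) in the database; nothing is instantiated
    customer.data_imports.size    # uses the in-memory length if already loaded, otherwise issues a COUNT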

I quickly made the following fix and deployed.
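
The commit itself was a screenshot in the original post; in spirit it was a one-word change along these lines (the service, mailer, and exact condition are illustrative guesses, not the actual code):

    class ImportNotificationService
      def call(customer)
        # Before: customer.data_imports.length == 1, which instantiated every historical
        # import (millions of records for our older customers) just to get a number.
        # After: a single COUNT(*) query.
        return unless customer.data_imports.count == 1

        CustomerMailer.first_import_email(customer).deliver_later
      end
    end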


Oh, the humanity…

I waited for what seemed like an eternity watching the memory graph in Heroku. Five minutes — looking good... Ten minutes — still not exploding... Thirty minutes —

OMG I THINK IT WORKED!

Once again I wept, but this time they were tears of joy!

That simple commit changed everything. Here is a graph of our memory usage on the day that I snapped. All those vertical grey lines between 12pm-5pm are deployments from me binary searching for this tiny bug. And then, finally, smooth sailing.


Sweet, sweet victory!

In Rails, please be careful when working with ActiveRecord not to do anything you don’t absolutely have to (that’s good advice in general). In our case, all we wanted was to make sure we only sent customers an email the first time we received an import... If you only need a number, use count — pretty please.

The real lesson, for me, was a reminder to always expect the unexpected and leave no stone unturned.

The problem we were having was never what I thought it was. I mean I had the right system but it had nothing to do with the size of the files we were working with. Or the number of columns or rows. Or with how many customers we were dealing with concurrently. Or with inefficient parsing methods because our files were in S3 instead of on the local filesystem…

No, the problem was just that over the course of roughly a year of importing data for customers continuously, every 5 minutes, 24 hours a day, 7 days a week, we had some older customers that now had millions of imported data files. It was a kind of scaling problem, but not in the way I was thinking: it was a temporal scaling problem.

Take a single customer through a few months of usage and we’d have this same problem…

This bug was really hard to track down for a number of reasons…

  • There was no bright line between the introduction of this code (very early on when we had almost no import history) and when the memory problems started happening. They were separated by months with this code going untouched.
  • There were other red herrings that seemed to correlate better with the timing of these problems than this simple notification service.
  • Attempts at recreating the problem locally failed for everyone that tried. This was because we were recreating the wrong problem. We were trying to simulate point-in-time load and size of data, not over-time history of data.
  • The service at fault seemed quite innocuous. It’s like two lines of code that just sends an email…
  • The actual bug was a single method call.
  • The code, as originally written, actually served its intended purpose. There are three ways in Rails to get the number of associated objects in a relationship: length, size, and count. Of those three, only length exhibits this problem. Why? ¯\_(ツ)_/¯

It was a frustrating six or so months off and on, but in the end, I got my whale…