The Ruby VM and How Apps Break (Part 2)


See Global Interpreter Lock, Threads, Lies and Metrics for the first part of this. I’ll be covering how Ruby’s Incremental Generational Garbage Collector works in MRI 2.4.2, and how it can break your app.

In reality, Ruby’s GC is a lot more complicated than what I’m outlining here. It has a lot of techniques built in to keep compatibility across versions and maintain performance. I’m only covering the gist of how it works, because that’s the part you need in order to understand what GC metrics mean and why your app is broken.

I’ve included a “Further Reading” section if you want to go deeper into how the garbage collector works in Ruby.

Heaps, Pages and Slots

Ruby allocates two heaps: eden, which contains every page that holds at least one live object, and tomb, which contains empty pages.

A heap contains a linked list of pages. Each page is 16KB, and on a 64-bit OS a slot takes 40 bytes, so a single page can hold around 408 slots. Within a page, a linked list tracks the allocated and free slots.
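As a back-of-the-envelope check on those numbers, here is a small sketch. The 16KB and 40-byte defaults are the figures quoted above, and the method name is made up for illustration. Plain division gives an upper bound, and the "around 408" figure is what is left once the page's own bookkeeping is subtracted.

```ruby
# Upper bound on slots per page, ignoring the page header and alignment.
# The defaults are the figures from the text: 16KB pages, 40-byte slots.
def max_slots_per_page(page_bytes = 16 * 1024, slot_bytes = 40)
  page_bytes / slot_bytes
end

puts max_slots_per_page # 409, before per-page bookkeeping is subtracted

# Many MRI builds expose their real sizes here. The keys differ between
# Ruby versions, so this only prints whatever is available.
p GC::INTERNAL_CONSTANTS if defined?(GC::INTERNAL_CONSTANTS)
```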

A slot contains the type of the object (string, boolean, integer, etc.) and its value. When the value is small enough, it is stored entirely within the slot. When it isn’t, such as a string above ~23 bytes, the value is stored outside of the slot, and the slot only stores a reference to that memory.
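One way to see the difference between an in-slot value and an out-of-slot value is ObjectSpace.memsize_of from the objspace standard library. This is only a sketch: the helper name is made up, and the exact byte counts vary by Ruby version and platform, so the code prints them rather than assuming them.

```ruby
require 'objspace'

# Bytes MRI attributes to an object: the slot itself plus any memory the
# object owns outside of the slot.
def bytes_for(obj)
  ObjectSpace.memsize_of(obj)
end

short = 'a' * 10    # small enough to be stored inside the slot
long  = 'a' * 1_000 # too large for the slot, so the bytes live elsewhere

puts "short string: #{bytes_for(short)} bytes"
puts "long string:  #{bytes_for(long)} bytes"
```

The long string should report noticeably more than the short one, because its character data is counted on top of the slot.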

For the sake of brevity, I’m skipping over the linked list to track pages for mark/sweep, as it’s not relevant to the explanation. Visually, the Ruby heaps look something like:


In a real process, there are a lot more pages and slots.

Mark and Sweep

At its core, Ruby uses a mark-and-sweep algorithm. If you want to go down the rabbit hole, Wikipedia has an article on it.

The algorithm is straightforward. We start with a page from the eden heap:

The Ruby VM then walks the slots within the page, checks whether each one is still referenced, and marks the ones that are:

Marked slots are in green. White ones are unreferenced.

Once the mark phase completes, the sweep phase runs. It frees any unmarked (white) slots and returns them to the list of free slots.

Only one slot was unreferenced.

Repeat until your process exits.
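The mark and sweep steps can be sketched as a toy model in plain Ruby. This is not how MRI implements it: the heap here is just a hash of slot ids to the ids they reference, the roots and method names are invented for the example, and marking follows references outward from the roots, which is the textbook form of the algorithm.

```ruby
require 'set'

# Mark phase: follow references outward from the roots and record every
# slot that can be reached.
def mark(heap, roots)
  marked = Set.new
  pending = roots.dup
  until pending.empty?
    id = pending.pop
    next unless marked.add?(id) # add? returns nil if the id was already marked
    pending.concat(heap.fetch(id))
  end
  marked
end

# Sweep phase: every slot that was not marked is removed from the heap,
# which stands in for returning it to the free list.
def sweep(heap, marked)
  freed = heap.keys.reject { |id| marked.include?(id) }
  freed.each { |id| heap.delete(id) }
  freed
end

# Toy heap: slot id => ids of the slots it references.
heap  = { a: [:b], b: [], c: [:a], d: [] }
roots = [:c] # e.g. a local variable that is still in scope

marked = mark(heap, roots)
freed  = sweep(heap, marked)
puts "marked: #{marked.to_a.sort.inspect}" # marked: [:a, :b, :c]
puts "freed:  #{freed.inspect}"            # freed:  [:d]
```

As in the diagrams, only one slot ends up unreferenced and gets freed.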

In Ruby 1.9.3, this was improved with lazy sweeping: Ruby still marks everything in one go, but then runs the sweep lazily in smaller steps to reduce the pause time during GC.

Generational Garbage Collector

The main problem with the basic mark-and-sweep algorithm is that it has to walk every object to check whether it’s still referenced. A Rails process, for example, allocates a lot of objects during boot that will never be garbage collected, and then allocates short-lived objects to handle requests.

To improve performance, Ruby 2.1 added generational GC. This is semantically similar to the one in the JVM, but Ruby only has two buckets, NewGen and OldGen.

Newly created objects are put into NewGen, and if they survive 3 GCs they are promoted to OldGen. Now, instead of walking every object on every GC, Ruby walks only the NewGen bucket during a minor GC, and both NewGen and OldGen during a major GC.
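You can watch the two kinds of GC from inside a process with GC.stat. The sketch below uses the key names from Ruby 2.2 and later (:minor_gc_count, :major_gc_count, :old_objects), and the helper name is made up. Note that GC.start(full_mark: false) only asks for a minor GC, and Ruby may still decide to run a major one.

```ruby
# Snapshot of the generational counters exposed by GC.stat.
def gc_generation_counts
  stat = GC.stat
  {
    minor: stat[:minor_gc_count],
    major: stat[:major_gc_count],
    old_objects: stat[:old_objects]
  }
end

before = gc_generation_counts
GC.start(full_mark: false) # request a minor GC (NewGen only)
GC.start(full_mark: true)  # force a major GC (NewGen and OldGen)
after = gc_generation_counts

puts "minor GCs run: #{after[:minor] - before[:minor]}"
puts "major GCs run: #{after[:major] - before[:major]}"
puts "objects currently in OldGen: #{after[:old_objects]}"
```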

Incremental Garbage Collector

The generational garbage collector improved things, but it’s still not perfect. Ruby 2.2 added incremental garbage collection, which marks lazily, a subset of slots at a time, and then sweeps lazily as well.

Instead of spending 100ms on a full mark-and-sweep, you could spend 5ms on a mark and 5ms on a sweep, repeated 10 times. The overall time spent on GC is still the same, but you have more predictable pauses in your app.

How Apps Break the GC

Given all this, there are a few ways you can break a process.

Most GC issues come down to allocating objects too aggressively in a single request, which forces Ruby to run more major GCs. It’s easy to identify when looking at metrics, since you will see a spike in major GCs.
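If you don’t have per-request GC metrics yet, a rough way to get them is to diff GC.stat around a unit of work. This is a sketch with a made-up helper name, using Ruby 2.2+ key names. In a Rails app you would wrap the request in something like this rather than a block of string allocations.

```ruby
# Run a block and report how many objects it allocated and how many
# minor/major GCs ran while it was executing.
def gc_delta
  keys = %i[total_allocated_objects minor_gc_count major_gc_count]
  before = GC.stat
  yield
  after = GC.stat
  keys.each_with_object({}) { |key, delta| delta[key] = after[key] - before[key] }
end

delta = gc_delta { 100_000.times.map { |i| "row #{i}" } }
puts delta.inspect
```

A request whose major_gc_count delta is regularly above zero is a good candidate for the kind of aggressive allocation described above.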

Because GC (even incremental GC) in Ruby requires stopping the world, you run into the same issues outlined in part 1 around thread contention. If you see a large amount of time spent in GC, you can’t trust your timing metrics anymore.
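To get a feel for how much of a block’s wall-clock time went to GC, GC::Profiler from the standard runtime can be used. This is a minimal sketch with a made-up helper name. The profiler adds overhead of its own, so treat the numbers as a rough indication rather than a precise measurement.

```ruby
# Returns [wall_seconds, gc_seconds] for a block, using GC::Profiler.
def time_with_gc
  GC::Profiler.enable
  GC::Profiler.clear
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  wall = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [wall, GC::Profiler.total_time]
ensure
  GC::Profiler.disable
end

wall, gc = time_with_gc { 200_000.times.map { |i| "value #{i}" } }
puts format('wall: %.4fs, GC: %.4fs', wall, gc)
```

When the GC share is a large fraction of the wall time, the timings of whatever ran inside the block say more about the GC than about the code.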

Your timing metrics being thrown off makes it harder to sort out where the memory leak was introduced. Sometimes you can still use the metrics if you go far enough back and find the sweet spot where the app slows down but before it goes into a GC death spiral. Otherwise, you have to rely on diffing new vs old code.

Typically, when looking at code to determine where the problem started, I look at code that talks to a SQL database: anything from an N+1 query due to missing eager loads, to loading tens of thousands of ActiveRecord objects, to doing map(&:id) instead of pluck(:id). Those tend to be the common mistakes that lead to GC thrashing.
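The map(&:id) vs pluck(:id) difference is easy to show in terms of allocations. The sketch below does not use ActiveRecord. It uses plain hashes as a stand-in for model instances, and all of the names and data are hypothetical, so it only illustrates the shape of the problem: building a full object per row just to read one field.

```ruby
# Count the objects allocated while a block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Roughly the map(&:id) pattern: build a full record per row, then read
# a single field from each one.
def ids_via_full_records(count)
  records = (1..count).map do |i|
    { id: i, name: "user#{i}", email: "user#{i}@example.com" }
  end
  records.map { |record| record[:id] }
end

# Roughly the pluck(:id) pattern: only the ids are ever materialized.
def ids_only(count)
  (1..count).to_a
end

full  = allocations { ids_via_full_records(10_000) }
plain = allocations { ids_only(10_000) }
puts "full records: #{full} objects, ids only: #{plain} objects"
```

Both methods return the same ids, but the first one leaves tens of thousands of short-lived objects behind for the GC to clean up.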

Memory Fragmentation

While Ruby eventually frees up pages (and thus memory) in the Tomb Heap, getting a page into the Tomb Heap requires all slots in the page to be freed. What if we have two pages that look like:


Ideally, the Ruby VM would compact it into a single page that looks like:

That would allow page 2 to be moved to the Tomb Heap and free up memory. However, the Ruby VM can only free a page if all the slots in it are free; it does not compact partially full pages. If you repeat the process enough, you can end up with a slow leak in memory.
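The same two-page situation can be modelled in a few lines. This is a toy model, not MRI’s data structures: each page is an array of slots, nil is a free slot, and the layout and method names are made up for the example.

```ruby
# A page can only be released when every slot in it is free.
def releasable_pages(pages)
  pages.count { |slots| slots.none? }
end

# How many pages could be released if the live objects were packed
# together first, which is the compaction step the VM does not perform.
def releasable_after_compaction(pages)
  slots_per_page = pages.first.size
  live = pages.sum { |slots| slots.compact.size }
  pages_needed = (live.to_f / slots_per_page).ceil
  pages.size - pages_needed
end

pages = [
  [:obj, nil, :obj, nil], # page 1: two live objects
  [nil, :obj, nil, nil]   # page 2: one live object pins the whole page
]

puts releasable_pages(pages)            # 0
puts releasable_after_compaction(pages) # 1
```

On a live process, GC.stat(:heap_eden_pages) and GC.stat(:heap_live_slots) give the real counts, and a page count that stays high while live slots drop is a hint that you are in this situation.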

In general, this is a rare issue. I’ve only seen it come up a handful of times, and it was because the app had requests that allocated a large, variably sized chunk of memory that was freed immediately afterwards.

The fix was to use a different allocator and to reduce the RUBY_GC_HEAP_GROWTH_FACTOR value, which slowed the exponential growth of the heap. Tuning the growth factor is no longer necessary with Ruby 2.4.2, because the heap now grows toward a goal ratio of free slots (RUBY_GC_HEAP_FREE_SLOTS_GOAL_RATIO), and the growth factor, which defaults to 1.8, only acts as an upper bound.

jemalloc

jemalloc is magical. It’s built with the aim of reducing memory fragmentation in multi-threaded programs, and it uses multiple arenas to allocate memory rather than a single shared one.

We had moved all of Square’s big Ruby apps to jemalloc, and it reduced memory usage by around 20% to 30% depending on the app, as well as a minor improvement to GC time.

There are other alternative allocators, such as tcmalloc. I’ve heard of good results with using that, but haven’t personally tried it. I always recommend swapping out malloc with (at least) jemalloc for a Ruby app given how easy it is to do.
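The two usual ways to get jemalloc into a Ruby process are building Ruby with the --with-jemalloc configure flag, or preloading the shared library with LD_PRELOAD on Linux. The sketch below is a heuristic check for either one, not an authoritative test: the method name is made up, it only looks at build-time library flags and the environment, and it can miss setups that load the allocator some other way.

```ruby
require 'rbconfig'

# Heuristic hints that jemalloc is in use: either Ruby was linked against
# it at build time, or the library is being preloaded via LD_PRELOAD.
def jemalloc_hints(env = ENV, config = RbConfig::CONFIG)
  build_libs = [config['MAINLIBS'], config['LIBS']].compact.join(' ')
  {
    linked_at_build: build_libs.include?('jemalloc'),
    preloaded: env.fetch('LD_PRELOAD', '').include?('jemalloc')
  }
end

puts jemalloc_hints.inspect
```

Running this at boot and logging the result is a cheap way to confirm that a deploy actually picked up the allocator you expected.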

GC Tuning

Historically, tuning your GC was a necessity because Ruby’s defaults were inappropriate for Rails apps, and the GC was immature. As of Ruby 2.4, I’ve been recommending that people don’t tweak the GC by default.

What I’ve usually seen happen is that someone configures 8 GC options, and a year later Ruby has made improvements, their app has changed, and the GC settings are no longer appropriate and hurt more than they help.

One exception to this is the RUBY_GC_HEAP_INIT_SLOTS ENV variable, which was called RUBY_HEAP_MIN_SLOTS until Ruby 2.1. The option controls how many slots (and thus, pages) should be allocated at boot time. As of Ruby 2.4, it defaults to 10,000.

A typical Rails app will use at least a hundred thousand slots, and I’ve seen even basic ones use 800,000+. This leads to the GC spending more time allocating slots as your app boots up.

It’s an easy option to tune, and short of deleting large chunks of code, it doesn’t hurt you if it becomes out of date. The easiest way to figure out a value is to open a Rails/irb console, run GC.stat[:heap_live_slots], and set that as your RUBY_GC_HEAP_INIT_SLOTS.

The heap_live_slots metric is how many slots are still in use and survived GC.
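Put together, that lookup is only a few lines. This is a sketch with a made-up helper name. It uses the :heap_live_slots key, which is the name GC.stat uses on Ruby 2.2 and later, and it forces a full GC first so that garbage from booting the console is not counted. Run it in a console for the booted app, not a bare irb session, or the number will be far too low.

```ruby
# Live slot count after a full GC, as a starting point for the initial
# heap size.
def suggested_heap_init_slots
  GC.start
  GC.stat(:heap_live_slots)
end

puts "RUBY_GC_HEAP_INIT_SLOTS=#{suggested_heap_init_slots}"
```

The printed line can be copied into whatever sets the environment for your app processes.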