Introduction
Some weeks ago I wrote rag_embeddings, a native Ruby library for efficient storage and comparison of AI-generated embeddings, managing thousands of high-dimensional vectors in memory. You can read more about it in this article.
Each embedding is a hefty chunk of data: n floating-point numbers representing semantic meaning extracted from text. The current implementation works, but as any performance-conscious developer knows, there’s always that nagging question: “Could this be more efficient?”
That question led me down a fascinating rabbit hole of Ruby’s memory management internals, variable-width allocation, and a hard-learned lesson about why sometimes the “obvious” optimization isn’t actually better. This is the story of how I spent days implementing what seemed like a clear performance win, only to discover that my “improved” solution was worse in every measurable way.
The journey began with reading Peter Zhu’s excellent article on implementing embedded TypedData objects, which showcased impressive performance gains from Ruby’s variable-width allocation feature. Time.now got 80% faster, Object#to_enum saw a 68% speedup. Surely this technique could work similar magic for my embedding vectors?
Spoiler alert: assumptions are dangerous, and measurement is everything.
The Problem & Hypothesis
My RAG embeddings library was built as a Ruby C extension, storing embedding vectors as native C structures. The existing implementation was straightforward: each embedding object used xmalloc to allocate memory for its vector data separately from the Ruby object itself.
This meant two allocations per embedding: one for the Ruby object, one for the actual float array containing the embedding data.
// Simplified version of the original approach
typedef struct {
    float *data;
    size_t dimensions;
} embedding_t;

// Two separate allocations:
// 1. Ruby object allocation
// 2. xmalloc for the float array
embedding_t *embedding = ALLOC(embedding_t);
embedding->data = xmalloc(dimensions * sizeof(float));
Reading about Ruby’s embedded TypedData objects, the optimization seemed obvious. Instead of two allocations, I could use rb_data_typed_object_zalloc with the RUBY_TYPED_EMBEDDABLE flag to allocate the Ruby object and embedding data in a single, contiguous memory block. This variable-width allocation approach promised several compelling benefits:
Reduced allocations: From two allocations per embedding down to one, eliminating the overhead of separate memory management calls.
Better memory locality: The embedding data would be stored immediately after the Ruby object header, improving cache performance when accessing the vector data.
Lower memory overhead: No need to store an 8-byte pointer to separately allocated memory, and no malloc bookkeeping overhead.
Fragmentation resistance: Ruby’s garbage collector is designed to handle memory fragmentation better than system malloc.
The hypothesis was compelling: for a data structure that creates thousands of large, uniform objects, embedded allocation should provide measurable improvements in both performance and memory usage. The RAG embedding use case seemed perfect. We’re dealing with vectors of consistent size (typically from 768 to 4096 dimensions), created in batches, and accessed frequently for similarity calculations.
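To get a feel for the object sizes involved, here’s a quick back-of-envelope calculation (my own arithmetic, assuming the C extension stores 4-byte floats):

```ruby
# Raw vector data per embedding, assuming float32 storage in C.
FLOAT_BYTES = 4 # sizeof(float)

[768, 1536, 4096].each do |dims|
  kb = dims * FLOAT_BYTES / 1024.0
  puts "#{dims} dims -> #{kb} KB of vector data"
end
# 768 dims -> 3.0 KB per embedding, before any object overhead
```

At 10,000 embeddings, even the smallest dimension means roughly 30 MB of raw float data, so allocation strategy looked like it should matter.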
What could go wrong?
Technical Implementation
Let me show you the difference between the two approaches in simple terms. Think of it like organizing your desk: you can either keep your papers in a separate drawer (the old way) or attach them directly to your computer monitor (the new way).
The Original Approach: Separate Allocations
// Two-step process:
// 1. Create the Ruby object's struct
embedding_t *embedding = ALLOC(embedding_t);

// 2. Separately allocate memory for the actual data
embedding->data = xmalloc(dimensions * sizeof(float));
embedding->dimensions = dimensions;
Here, the Ruby object holds a pointer to the float array, which lives somewhere else in memory. When you want to access the embedding data, Ruby has to follow that pointer. Like looking up a file location in an index, then going to fetch the actual file.
The New Approach: Embedded Allocation
// The type definition marks the struct as embeddable
static rb_data_type_t embedding_type = {
    "embedding",
    { /* dmark, dfree, dsize callbacks */ },
    0, 0,
    RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_EMBEDDABLE
};

// Single allocation that includes both object and data
size_t total_size = sizeof(embedding_t) + (dimensions * sizeof(float));
VALUE obj = rb_data_typed_object_zalloc(EmbeddingClass, total_size, &embedding_type);
embedding_t *embedding = RTYPEDDATA_GET_DATA(obj);

// Data lives right after the struct, inside the same allocation
embedding->data = (float *)((char *)embedding + sizeof(embedding_t));
With embedded allocation, the data is stored immediately after the object, like having your papers taped directly to your monitor. No pointer to follow, no separate lookup, everything is right there.
The key insight is that rb_data_typed_object_zalloc can allocate variable amounts of memory. Instead of just allocating space for the basic object structure, it allocates extra space for however many floats your embedding needs. A 768-dimension embedding gets more space than a 512-dimension one.
The RUBY_TYPED_EMBEDDABLE flag tells Ruby's garbage collector: "This object's data is embedded, so when you move the object during compaction, move everything together."
The Experiment Setup
Time to put theory to the test. I wanted to measure what really matters for an embeddings library: creation speed, computation speed, and memory efficiency.
What I Measured
I designed benchmarks around the two most common operations:
Creation Performance: How fast can we create 10,000 embeddings? This matters when you’re processing large document collections.
Similarity Calculations: How fast can we compute cosine similarity between embeddings? This is the bread and butter of semantic search.
Memory Usage: How much RAM does each approach actually use? Memory efficiency can make or break applications dealing with large vector databases.
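For reference, the cosine similarity being benchmarked boils down to the following (sketched in plain Ruby over float arrays; the library itself computes this in C against the native vectors):

```ruby
# Cosine similarity between two plain-Ruby float arrays:
# dot(a, b) / (|a| * |b|). Reference math only, not the C implementation.
def cosine_similarity(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  dot = 0.0
  norm_a = 0.0
  norm_b = 0.0
  a.each_index do |i|
    dot    += a[i] * b[i]
    norm_a += a[i] * a[i]
    norm_b += b[i] * b[i]
  end
  dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
end

puts cosine_similarity([1.0, 0.0], [1.0, 0.0]) # identical vectors => 1.0
```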
The Testing Strategy
I used Ruby’s built-in benchmarking tools and measured RSS (Resident Set Size) memory at different points:
# Simplified benchmark structure
def benchmark_creation
  start_memory = get_memory_usage
  Benchmark.measure do
    10_000.times do
      Embedding.new(random_vector(768))
    end
  end
  end_memory = get_memory_usage
  puts "Memory delta: #{end_memory - start_memory} MB"
end
I also tested two different memory patterns:
- Create and hold: Create 10,000 embeddings and keep references to all of them (like building an in-memory search index)
- Create and discard: Create embeddings but let them be garbage collected (like processing documents in batches)
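A minimal sketch of the two patterns, with hypothetical Embedding and random_vector stand-ins for the library’s real API:

```ruby
# Stand-ins for the library's real API (hypothetical names):
Embedding = Struct.new(:vector)

def random_vector(dims)
  Array.new(dims) { rand }
end

# Create and hold: every embedding stays reachable, like an in-memory index.
def create_and_hold(n, dims)
  Array.new(n) { Embedding.new(random_vector(dims)) }
end

# Create and discard: each embedding becomes garbage immediately.
def create_and_discard(n, dims)
  n.times { Embedding.new(random_vector(dims)) }
  GC.start # give the collector a chance to reclaim everything
end
```

The hold pattern stresses steady-state memory usage; the discard pattern stresses allocation and GC throughput.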
Why 10,000 Iterations?
That number isn’t arbitrary. In real RAG applications, you often deal with thousands of document chunks. 10,000 embeddings represents a medium-sized document collection, enough to reveal performance patterns without taking forever to run.
I also made sure to test with realistic embedding dimensions (768 to 4096, matching common transformer models) rather than tiny test vectors.
The stage was set. I had my implementation, my benchmarks, and high expectations. Time to run the tests and collect my performance victory…
Results & Analysis
Then reality hit. The results weren’t just disappointing, they were the exact opposite of what I expected.
Here’s what I found after running the benchmarks on 10,000 embeddings:
[Benchmark results chart]
Wait, what? The “optimized” version was slower AND used nearly 4x more memory. This wasn’t a minor regression; it was a complete disaster.
The Memory Usage Paradox
The most shocking result was memory usage. How could storing data more efficiently use so much more RAM?
The answer lies in how Ruby’s garbage collector works. With the original approach using xmalloc, the float arrays live outside Ruby's managed heap. Ruby knows about them (thanks to the mark and free functions), but they don't count toward Ruby's heap size calculations.
With embedded allocation, those same float arrays are now part of Ruby’s heap. Suddenly, Ruby thinks it needs a much larger heap to accommodate all this data. The garbage collector becomes more conservative about cleaning up, leading to much higher memory usage.
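You can observe this shift from Ruby itself: GC.stat reports how much memory the GC believes it owns. With xmalloc, vector data surfaces only in the malloc-related counters; embedded data inflates the heap slot and page counts instead. (The keys below are standard CRuby GC.stat keys.)

```ruby
# A few GC.stat counters that reveal where the memory pressure lives.
stats = GC.stat
puts "Live slots on Ruby's heap:    #{stats[:heap_live_slots]}"
puts "Allocated heap pages:         #{stats[:heap_allocated_pages]}"
puts "Bytes malloc'd since last GC: #{stats[:malloc_increase_bytes]}"
```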
It’s like the difference between keeping your files in a separate filing cabinet versus stuffing them all in your desk drawers. Your desk gets overwhelmed much faster.
The Technical Deep-Dive
Why did embedded allocation fail so spectacularly? It comes down to a fundamental mismatch between the technique and the use case.
When Embedded Allocation Works
Embedded allocation shines with small, frequently created objects. Think Time.now or Object#to_enum: tiny objects that are created, used briefly, and discarded. The overhead of separate malloc calls becomes significant when you're doing this thousands of times per second.
When It Doesn’t Work
RAG embeddings are the opposite: large objects (768 floats = ~3KB each) that stick around for a while. Here’s what goes wrong:
Heap pressure: Adding 3KB objects to Ruby’s heap puts enormous pressure on the garbage collector. Ruby has to work much harder to manage memory.
GC overhead: The garbage collector has to scan through all those embedded float arrays during marking phases, even though they contain no Ruby object references.
Memory retention: Ruby becomes reluctant to shrink its heap when it thinks it needs to accommodate all this embedded data.
Cache pollution: Ironically, the “better locality” of embedded allocation can hurt cache performance when your objects are large enough to span multiple cache lines.
The cruel irony? For large objects, the overhead of following a pointer (the thing embedded allocation eliminates) is negligible compared to the cost of processing the data itself.
Key Takeaways
This experiment taught me several valuable lessons about performance optimization:
Size matters: Embedded allocation is brilliant for small objects but can backfire spectacularly for large ones. There’s no universal “better” approach.
Measure everything: My intuition about memory locality and allocation overhead was completely wrong for this use case. Without benchmarks, I would have shipped a performance regression.
Context is king: Techniques that work well in one scenario (like Ruby’s internal Time objects) might be terrible in another (like large embedding vectors).
Trust the data: When benchmarks contradict your expectations, the benchmarks are usually right. Don’t explain away bad results; investigate them.
Garbage collection complexity: Ruby’s GC is sophisticated, and changes that seem obviously beneficial can have unexpected interactions with its algorithms.
Conclusion
Here you can find the details I described in this article: https://github.com/marcomd/rag_embeddings/pull/3
To avoid any confusion: I remain very positive about the Ruby community’s efforts on the performance front.
I spent days implementing what seemed like an obvious optimization, only to discover that obvious isn’t always right. But sometimes the most valuable experiments are the ones that fail.
The embedded allocation approach failed because I was optimizing for the wrong bottleneck. For large objects like embedding vectors, the cost of malloc/free is insignificant compared to other operations. The “improvement” I was chasing wasn’t actually a problem worth solving.
In the end, I stuck with the original xmalloc approach. It's simpler, faster, and uses less memory. Sometimes boring wins.
The real victory wasn’t the performance improvement I didn’t get, it was the deeper understanding of Ruby’s memory management and the reminder that measurement beats intuition every time. Before you optimize, measure. After you optimize, measure again.
And sometimes, the best optimization is realizing you don’t need to optimize at all.
Next
Move from C to Rust. Is this a good idea? Any difference in performance? Read here.