From 3 Minutes to 7.8 Seconds: Improving on RocksDB performance


At SereneDB, we are building a search-OLAP database to allow for efficient execution of search, analytics, and combined queries. We chose RocksDB as our storage engine.

With our first beta release coming in March and our codebase steadily growing, it was time to start measuring ingestion efficiency.

This is a quick overview of how we used flamegraphs to investigate performance bottlenecks and which RocksDB settings we used to substantially improve ingestion speed. We thought our findings might be helpful for others using RocksDB-based DBMSs.

If you like this content, feel free to check out other posts about fast I/O buffers or the Postgres-compatible UI for SereneDB.

How did this post come about?

Over the years ClickBench has practically become the industry standard, so we decided to look no further. What does the ClickBench dataset look like? It is:

120 columns, 70 GB of data, around 100 million rows

So we got our hands on this dataset and started loading the data into SereneDB.

We ran a simple command —

COPY hits FROM '/home/vedernikoff/hits.tsv'

— and got nothing but dead silence.

Minutes passed. Hours passed. Still no result.

So we trimmed the file and ran experiments on a smaller set:

still 120 columns but 650 MB of data and only about 1 million rows

Can’t say we were particularly happy about the result we got from perf, but at least it gave us something tangible – a baseline of roughly 180 seconds.

3 minutes to get some data in — why so bad?

Initially, we ran the load using the regular rocksdb::Transaction::Put (since that’s how our usual INSERT works). Obviously, this class isn’t designed for such loads – key locking, constant sorting on every column insert, and other related “fun” stuff. All of this gets in the way of good performance.

Interestingly enough, we tried concatenating all columns into one string, putting it in a table, and running COPY from that table – it loaded in just 1 second. Why such a big difference? We explain it below, in the section about the structure of our storage.

Path to 19.5 seconds – Columnar storage on top of RocksDB

As you know, SereneDB is an analytical database, which means the storage layer also has to be analytical or – in other words – column-oriented.

At SereneDB, we use RocksDB for data storage but RocksDB is just a key/value store, so how do we make it columnar?

The answer is simple: it’s all in the keys.

We store data using composite keys of the following form:

key = StrConcat(table_id, column_id, primary_key)

This layout guarantees that all values belonging to the same column of the same table are stored next to each other in sorted order (like everything in RocksDB). Here’s an example:

Consider the following table t1 with internal table_id = 7:

id   | col1 | col2
228  | 20   | 200
993  | 10   | 100

id is the primary key but internally col1 will have column_id = 1 and col2 will have column_id = 2 (just using 1 and 2 as examples here).

The corresponding RocksDB key/value layout (already sorted by key) will look like this:

table_id | column_id | primary_key (id) | value
7        | 1 (col1)  | 228              | 20
7        | 1 (col1)  | 993              | 10
7        | 2 (col2)  | 228              | 200
7        | 2 (col2)  | 993              | 100

As long as keys are ordered as (table_id, column_id, primary_key), RocksDB naturally groups data by column and keeps rows inside each column sorted by the primary key. This gives us efficient column scans and predictable access patterns on top of a plain key/value store.
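To make this concrete, here is a minimal sketch of such a composite-key encoding. The helper names and the fixed-width big-endian layout are assumptions for illustration, not necessarily SereneDB’s actual encoding:

#include <cstdint>
#include <string>

// Big-endian, fixed-width encoding keeps RocksDB's lexicographic key order
// consistent with the numeric order of the encoded integers.
static void AppendBigEndian64(std::string& out, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((v >> shift) & 0xff));
  }
}

std::string MakeColumnKey(uint64_t table_id, uint64_t column_id,
                          uint64_t primary_key) {
  std::string key;
  key.reserve(24);
  AppendBigEndian64(key, table_id);
  AppendBigEndian64(key, column_id);
  AppendBigEndian64(key, primary_key);
  return key;  // sorts by (table_id, column_id, primary_key)
}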

Squeezing that many columns into one INSERT is a big problem: each row turns into 120 key/value entries, and keeping all of them sorted on every insert is very expensive. This is also why the concatenated-string experiment was so fast – it produced a single entry per row instead of 120.

This finding pushed us towards switching to a more suitable class – rocksdb::SstFileWriter, which writes directly into SST files. Our plan was to create an SST file per column and eventually – as a result of compaction – have them merged together.

The SstFileWriter approach made a huge difference, slashing the time from 180 seconds to just 19.5 seconds, but we didn’t want to stop there.
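As an illustration, a stripped-down per-column bulk write using the stock rocksdb::SstFileWriter API might look like the sketch below (simplified error handling, assumed helper names; not our production code):

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/sst_file_writer.h>
#include <string>
#include <utility>
#include <vector>

// Write one column's (key, value) pairs into a standalone SST file.
rocksdb::Status WriteColumnSst(
    const rocksdb::Options& options, const std::string& path,
    const std::vector<std::pair<std::string, std::string>>& kvs) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(path);
  if (!s.ok()) return s;
  // SstFileWriter requires strictly ascending keys, which the
  // (table_id, column_id, primary_key) layout already guarantees.
  for (const auto& [key, value] : kvs) {
    s = writer.Put(key, value);
    if (!s.ok()) return s;
  }
  return writer.Finish();
}

// The finished files are then handed to the DB in one call, e.g.:
//   db->IngestExternalFile(paths, rocksdb::IngestExternalFileOptions());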

SST is cool, but why stop there?

At this point we realized that it was pretty hard to keep shooting in the dark, so we decided to record a flamegraph. Flamegraphs are not too difficult to read: what you’re looking at is the function call stack, nested from bottom to top, and the wider a rectangle is, the more time that function takes.

If we look at CPUThreadPool1, we will see three main blocks:

  1. TableScan::getOutput — reading from the file and parsing CSV,
  2. TableWriter::addInput — writing to RocksDB,
  3. TableWriter::getOutput — which, judging by Standard128RibbonBitsBuilder, builds Ribbon filters (a space-efficient alternative to Bloom filters).

The most obvious thing that could help here was disabling the filter computation at the end (this shrinks the block by a third). Filter construction was eating up a whopping 20% of the CPU, and there was no point paying that cost during ingestion, especially since the filters would be recomputed during compaction anyway.

You can also notice a line with lz4 inside addInput, which means compression was going on. That’s not really needed here either, so it can be turned off in the settings, just like the filters.
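Both knobs live in the options used when building the SST files. A minimal sketch using the standard RocksDB API (exact placement in our code omitted):

#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options options;
// Don't build Ribbon/Bloom filters during the bulk load; compaction
// recomputes them later anyway.
rocksdb::BlockBasedTableOptions table_options;
table_options.filter_policy = nullptr;
options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
// Skip LZ4 compression for the freshly written files.
options.compression = rocksdb::kNoCompression;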

After disabling both of them and measuring again, we got an improvement from 19.5 down to approximately 14.3 seconds.

After that, only two components remained – we decided to start with the TextReader, the class responsible for parsing CSV files. The reader attempts to parse each type using the getString function. If you look at the flamegraph and the class itself, you can see that parsing relies on sscanf:

auto scanCount = sscanf(str.c_str(), "%" SCNd64 "%lln", &v, &scanPos);

This is a fairly slow function, constantly dealing with format parsing, locale checks, and so on. So we decided to replace it with the fast_float library:

fast_float::parse_options options{
    fast_float::chars_format::general |
    fast_float::chars_format::skip_white_space};
auto [parseEnd, ec] = fast_float::from_chars_advanced(ptr, end, v, options);

Running the load again gave us a pretty nice improvement to 12 seconds (16% faster).

The file is read in batches into a buffer, but the buffer itself is processed byte by byte:

while (true) {
  auto v = th.getByteOptimized(delim);
  if (!th.isNone(delim)) {
    break;
  }
  th.ownedString_.append(1, static_cast<char>(v));
}

We noticed that append() is always called with a single element, so we decided to change ownedString_ from std::string to std::vector<char>. This actually helped, speeding things up by another 12% and bringing the total time down to 10.6 seconds. The reason for this improvement is that std::string maintains a null terminator on every append, so switching to std::vector<char> effectively cut the number of character writes in half.
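Roughly, the per-byte hot path turns into a plain push_back (names assumed for illustration):

std::vector<char> ownedChars_;  // was: std::string ownedString_

// One write per byte; no null terminator to maintain after each append.
ownedChars_.push_back(static_cast<char>(v));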

Now let’s have a look at the flamegraph again.

As we can see, writes to RocksDB (addInput) now take 72% of CPU, so it made sense to focus next on optimizing this part.

Going under 10 seconds by making RocksDB writes faster

The most interesting part is in rocksdb::SstFileWriter::Rep::AddImpl, where we spotted several notably slow operations.

There were quite a few heavy checks here that would be better moved to debug mode (turned into asserts) – that’s exactly what we did. Here’s an example:

if (file_info.num_entries == 0) {
  file_info.smallest_key.assign(user_key.data(), user_key.size());
} else {
  if (internal_comparator.user_comparator()->Compare(
          user_key, file_info.largest_key) <= 0) {
    // Make sure that keys are added in order
    return Status::InvalidArgument(
        "Keys must be added in strict ascending order.");
  }
}
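A hedged sketch of the debug-only variant (the actual patch is linked at the end of this post):

#include <cassert>

// The ordering check compiles away in release builds instead of
// running for every single key.
assert(file_info.num_entries == 0 ||
       internal_comparator.user_comparator()->Compare(
           user_key, file_info.largest_key) > 0);
if (file_info.num_entries == 0) {
  file_info.smallest_key.assign(user_key.data(), user_key.size());
}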

Another hotspot was the repeated calls to the virtual status function, which just accessed an atomic_bool with memory_order_relaxed. In the flamegraph, this alone took up 20% of CPU, which was quite a lot. This could be addressed fairly easily — by adding a template parameter or doing a compile-time static_cast.
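To illustrate the template-parameter idea (hypothetical names, not the actual RocksDB patch):

#include <string>
#include <utility>
#include <vector>

// When the concrete writer type is known at compile time, binding it via a
// template parameter makes status() a plain inlined atomic load instead of
// a virtual call per entry.
template <typename Writer>
void AddAll(Writer& writer,
            const std::vector<std::pair<std::string, std::string>>& batch) {
  for (const auto& [key, value] : batch) {
    if (!writer.status().ok()) return;  // resolved statically, no vtable
    writer.Add(key, value);
  }
}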

After removing these awkward cases, we got another ~18% speedup, and the same dataset now loaded in 8.7 seconds. Nice!

We already thought it was quite good, but then we decided to give it another look and it turned out that there was a hidden string copy in the same function:

constexpr SequenceNumber sn = 0;
constexpr ValueType vt = kTypeValue;  // == 1
ikey.Set(key, sn, vt);                // copies key and appends the (sn, vt) trailer
builder->Add(ikey.Encode(), value);

The problem was that this function was called for every column (remember, there are 120 of them) and for every row, which results in a significant number of allocations.

We decided to pre-create this key for each column and eliminate these redundant allocations. This gave us another nice speedup of 10%, bringing the total time down to 7.8 seconds.
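One way to realize this, sketched under assumptions about names (see the linked patch for the real change): keep a reusable buffer per column and append the internal-key trailer manually, so no fresh string is allocated per call. RocksDB’s internal key is the user key followed by a little-endian fixed64 packing the sequence number and value type:

#include <cstdint>
#include <string>

void AppendInternalKeyTrailer(std::string& buf, uint64_t seqno, uint8_t type) {
  const uint64_t packed = (seqno << 8) | type;
  for (int i = 0; i < 8; ++i) {  // fixed64, little-endian
    buf.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
}

// Per call, the buffer's capacity is reused instead of reallocated:
//   buf.assign(key.data(), key.size());
//   AppendInternalKeyTrailer(buf, /*seqno=*/0, /*type=*/1);  // kTypeValue
//   builder->Add(buf, value);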

At that point we decided to call it a day. But in fact, this is a “story to be continued…” so we will come back with more improvements soon.

Summary

So in the end, using flamegraphs and a few inexpensive changes, we managed to speed up the program by almost exactly 23x (from 180 to 7.8 seconds). Not bad, is it?

Key takeaways:

  1. Avoid virtual functions in hot paths.
  2. Don’t copy strings unnecessarily or our engineer Valery will be angry.
  3. Move runtime checks into asserts if they’re not needed in production.

And don’t be afraid to change existing code, even in large, mature projects like RocksDB – careful measurements and small, well-targeted changes can still bring big wins.

If you have any questions, you can reach me on Slack https://serenedb.slack.com/ or support our work on GitHub https://github.com/serenedb/serenedb.

  1. First SstFileWriter attempt in SereneDB (https://github.com/serenedb/serenedb/pull/181/changes)
  2. RocksDB patch to make SstFileWriter faster (https://github.com/serenedb/rocksdb/commit/1a96363726010a3ba57bf0c1a212b90698e6c344)
  3. Removing key copies in SereneDB (https://github.com/serenedb/serenedb/pull/215)
  4. Velox text reader patch (https://github.com/serenedb/velox/pull/17)