A Hash Table That Uses Less Space Than the Items That It Stores

When people talk about “space-efficient hash tables”, they are usually talking about the following type of guarantee: If we are storing n keys, each of which is w bits, then the total space usage should be (1 + \epsilon)wn bits for some small \epsilon.

But, what if I told you we can do better? In fact, it’s possible to construct a hash table that uses fewer than nw bits. That is, the hash table uses less space than if we were to write each of the n keys, one after another, in an array. As terminology, I’ll refer to such a hash table as extremely tiny.

Techniques for building extremely tiny hash tables have been known to theoreticians for decades. But how hard is it to build one in the real world? Not that hard at all, it turns out.

In this post, I’ll show you how to build a very simple extremely tiny hash table for 32-bit keys.

The Construction. Since our keys are 32 bits, we can write each key x as x_1 \circ x_2 \circ x_3 \circ x_4, where each x_i is one byte. Our extremely tiny hash table H will have the following structure: it consists of 256 sub-hash-tables H_1, H_2, \ldots, H_{256}. To insert a key x = x_1 \circ x_2 \circ x_3 \circ x_4 into H, we simply insert the three-byte item x_2 \circ x_3 \circ x_4 into table H_{x_1}. And to query whether x \in H, we simply query whether x_2 \circ x_3 \circ x_4 \in H_{x_1}.

Finally, each H_i is just a classical linear-probing hash table, dynamically resized to have a load factor (i.e., space efficiency) between 80% and 90%. That’s the entire construction.
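To make this concrete, here is a minimal C++ sketch of the construction. It is an illustration, not the implementation linked at the end of the post: the names TinyTable and SubTable, the growth rule, and the choice to treat x_1 as the key's most significant byte are all just choices made for the sketch. As a further simplification, the 3-byte value 0xFFFFFF is reserved as the "empty" marker (so that one entry cannot be stored), and no attempt is made to match the exact metadata figure quoted below.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// One sub-table: linear probing over 3-byte entries, packed 3 bytes per slot.
// Simplification: the value 0xFFFFFF is reserved as the "empty" marker.
class SubTable {
 public:
  SubTable() { rebuild(8); }

  void insert(uint32_t entry) {  // only the low 24 bits of entry are used
    if (contains(entry)) return;
    // Grow whenever the insertion would push the load factor above 90%;
    // right after growing, the load factor is back down to roughly 80%.
    if (10 * (size_ + 1) > 9 * capacity_) rebuild((5 * (size_ + 1)) / 4);
    place(entry);
    ++size_;
  }

  bool contains(uint32_t entry) const {
    for (size_t i = hash(entry) % capacity_;; i = (i + 1) % capacity_) {
      uint32_t slot = load(i);
      if (slot == kEmpty) return false;
      if (slot == entry) return true;
    }
  }

 private:
  static constexpr uint32_t kEmpty = 0xFFFFFF;

  static size_t hash(uint32_t x) { return x * 2654435761u; }  // multiplicative hash

  uint32_t load(size_t i) const {
    return slots_[3 * i] | (slots_[3 * i + 1] << 8) | (slots_[3 * i + 2] << 16);
  }

  void store(size_t i, uint32_t v) {
    slots_[3 * i] = v & 0xFF;
    slots_[3 * i + 1] = (v >> 8) & 0xFF;
    slots_[3 * i + 2] = (v >> 16) & 0xFF;
  }

  void place(uint32_t entry) {
    size_t i = hash(entry) % capacity_;
    while (load(i) != kEmpty) i = (i + 1) % capacity_;
    store(i, entry);
  }

  void rebuild(size_t new_capacity) {
    std::vector<uint8_t> old = std::move(slots_);
    size_t old_capacity = capacity_;
    capacity_ = new_capacity;
    slots_.assign(3 * capacity_, 0xFF);  // every slot starts out empty
    for (size_t i = 0; i < old_capacity; ++i) {
      uint32_t v = old[3 * i] | (old[3 * i + 1] << 8) | (old[3 * i + 2] << 16);
      if (v != kEmpty) place(v);  // re-insert each occupied old slot
    }
  }

  std::vector<uint8_t> slots_;
  size_t capacity_ = 0;
  size_t size_ = 0;
};

// The extremely tiny hash table: 256 sub-tables, indexed by the key's first
// byte (taken here to be the most significant byte). Only the remaining three
// bytes are handed to the sub-table; the first byte is encoded implicitly by
// which sub-table the key lives in.
class TinyTable {
 public:
  void insert(uint32_t key) { tables_[key >> 24].insert(key & 0xFFFFFF); }
  bool contains(uint32_t key) const {
    return tables_[key >> 24].contains(key & 0xFFFFFF);
  }

 private:
  SubTable tables_[256];
};

int main() {
  TinyTable table;
  table.insert(0xAB123456u);
  std::cout << table.contains(0xAB123456u) << " "    // prints 1
            << table.contains(0xCD123456u) << "\n";  // prints 0
}
```

The only properties the analysis below relies on are that each occupied slot costs 3 bytes and that each sub-table's slot array stays at least 80% full once it has grown past its small initial size.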

Analysis. The trick here is that, even though H is storing 4-byte keys, each H_i is only storing 3-byte keys. The first byte of each key is encoded implicitly by the choice of which H_i the key is stored in. This is a (somewhat unusual) application of a technique known as “quotienting”.

How much space do the H_i's consume? Remember that each H_i is kept at a load factor of 80% or above. So, if H_i is storing n_i 3-byte keys, then it uses at most 3 n_i / 0.8 = 3.75 n_i bytes of space.

The different sub-hash-tables H_1, \ldots, H_{256} may have very different sizes from one another at any given moment. Nonetheless, since the sum of their sizes is n (the total number of keys in H), the total space used by the sub-hash-tables is at most 3.75 n bytes.

Finally, we need to account for the metadata of the 256 sub-hash-tables. All things considered, in my implementation, this uses less than 6000 bytes. So our total space is at most 3.75 n + 6000 bytes.
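Put differently, the space per key is at most (3.75 n + 6000)/n bytes, which works out to 30 + 48000/n bits.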

So, once n exceeds 24,000, the total space of our hash table comes out to less than 32 bits per key, and once n exceeds 48,000, it drops below 31 bits per key. We’ve built a real-world extremely tiny hash table.

Real-World Performance. So is this data structure fast? Well, it’s not blazingly fast. But it’s not embarrassingly slow either.

On my laptop, the time to first insert ten million keys and then do twenty million queries (half positive, half negative) is 3.7 seconds. If we instead use the C++ std::unordered_set (configured to use the same hash function as our extremely tiny hash table), the same sequence of operations takes 7.3 seconds.
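For reference, the operation mix looks roughly like the following (a sketch of the workload, not my actual benchmarking harness; it assumes the TinyTable sketch from earlier in the post is in scope, replacing the small main from that sketch, and the seed and key distribution are arbitrary choices):

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Insert ten million random keys, then do twenty million queries: ten million
// for keys that were inserted (positive) and ten million for fresh random
// keys (almost all negative).
int main() {
  std::mt19937 rng(12345);
  std::uniform_int_distribution<uint32_t> dist;

  std::vector<uint32_t> keys(10'000'000);
  for (uint32_t& k : keys) k = dist(rng);

  TinyTable table;  // the extremely tiny hash table sketched above
  auto start = std::chrono::steady_clock::now();

  for (uint32_t k : keys) table.insert(k);

  size_t found = 0;
  for (uint32_t k : keys) found += table.contains(k);                           // positive queries
  for (size_t i = 0; i < keys.size(); ++i) found += table.contains(dist(rng));  // (almost all) negative queries

  auto stop = std::chrono::steady_clock::now();
  std::cout << "found " << found << " keys in "
            << std::chrono::duration<double>(stop - start).count()
            << " seconds\n";
}
```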

So we’re faster than the C++ hash table, but at the same time, that’s not a Herculean feat. (And it’s also not a fair comparison since, of course, the C++ hash table supports much more functionality.) But it is a proof of concept: you can build extremely tiny hash tables in the real world.

My code is available at https://github.com/williamkuszmaul/tinyhash.