A Hash Table That Uses Less Space Than the Items That It Stores

When people talk about “space-efficient hash tables”, they are usually talking about the following type of guarantee: If we are storing n keys, each of which is w bits, then the total space usage should be (1 + \epsilon)wn bits for some small \epsilon.

But, what if I told you we can do better? In fact, it’s possible to construct a hash table that uses fewer than nw bits. That is, the hash table uses less space than if we were to write each of the n keys, one after another, in an array. As terminology, I’ll refer to such a hash table as extremely tiny.

Techniques for building extremely tiny hash tables have been known to theoreticians for decades. But how hard is it to build one in the real world? Not that hard at all, it turns out.

In this post, I’ll show you how to build a very simple extremely tiny hash table for 32-bit keys.

The Construction. Since our keys are 32 bits, we can write each key x as x_1 \circ x_2 \circ x_3 \circ x_4, where each x_i is one byte. Our extremely tiny hash table H will have the following structure: it consists of 256 sub-hash-tables H_1, H_2, \ldots, H_{256}. To insert a key x = x_1 \circ x_2 \circ x_3 \circ x_4 into H, we simply insert the three-byte item x_2 \circ x_3 \circ x_4 into table H_{x_1}. And to query whether x \in H, we simply query whether x_2 \circ x_3 \circ x_4 \in H_{x_1}.

Finally, each H_i is just a classical linear-probing hash table, dynamically resized to have a load factor (i.e., space efficiency) between 80% and 90%. That’s the entire construction.
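To make this concrete, here is a minimal C++ sketch of the construction. It is an illustration, not the implementation linked at the end of the post: the names TinyTable and SubTable, the growth rule, and the choice to treat x_1 as the key's most significant byte are all just choices made for the sketch. As a further simplification, the 3-byte value 0xFFFFFF is reserved as the "empty" marker (so that one entry cannot be stored), and no attempt is made to match the exact metadata figure quoted below.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// One sub-table: linear probing over 3-byte entries, packed 3 bytes per slot.
// Simplification: the value 0xFFFFFF is reserved as the "empty" marker.
class SubTable {
 public:
  SubTable() { rebuild(8); }

  void insert(uint32_t entry) {  // only the low 24 bits of entry are used
    if (contains(entry)) return;
    // Grow whenever the insertion would push the load factor above 90%;
    // right after growing, the load factor is back down to roughly 80%.
    if (10 * (size_ + 1) > 9 * capacity_) rebuild((5 * (size_ + 1)) / 4);
    place(entry);
    ++size_;
  }

  bool contains(uint32_t entry) const {
    for (size_t i = hash(entry) % capacity_;; i = (i + 1) % capacity_) {
      uint32_t slot = load(i);
      if (slot == kEmpty) return false;
      if (slot == entry) return true;
    }
  }

 private:
  static constexpr uint32_t kEmpty = 0xFFFFFF;

  static size_t hash(uint32_t x) { return x * 2654435761u; }  // multiplicative hash

  uint32_t load(size_t i) const {
    return slots_[3 * i] | (slots_[3 * i + 1] << 8) | (slots_[3 * i + 2] << 16);
  }

  void store(size_t i, uint32_t v) {
    slots_[3 * i] = v & 0xFF;
    slots_[3 * i + 1] = (v >> 8) & 0xFF;
    slots_[3 * i + 2] = (v >> 16) & 0xFF;
  }

  void place(uint32_t entry) {
    size_t i = hash(entry) % capacity_;
    while (load(i) != kEmpty) i = (i + 1) % capacity_;
    store(i, entry);
  }

  void rebuild(size_t new_capacity) {
    std::vector<uint8_t> old = std::move(slots_);
    size_t old_capacity = capacity_;
    capacity_ = new_capacity;
    slots_.assign(3 * capacity_, 0xFF);  // every slot starts out empty
    for (size_t i = 0; i < old_capacity; ++i) {
      uint32_t v = old[3 * i] | (old[3 * i + 1] << 8) | (old[3 * i + 2] << 16);
      if (v != kEmpty) place(v);  // re-insert each occupied old slot
    }
  }

  std::vector<uint8_t> slots_;
  size_t capacity_ = 0;
  size_t size_ = 0;
};

// The extremely tiny hash table: 256 sub-tables, indexed by the key's first
// byte (taken here to be the most significant byte). Only the remaining three
// bytes are handed to the sub-table; the first byte is encoded implicitly by
// which sub-table the key lives in.
class TinyTable {
 public:
  void insert(uint32_t key) { tables_[key >> 24].insert(key & 0xFFFFFF); }
  bool contains(uint32_t key) const {
    return tables_[key >> 24].contains(key & 0xFFFFFF);
  }

 private:
  SubTable tables_[256];
};

int main() {
  TinyTable table;
  table.insert(0xAB123456u);
  std::cout << table.contains(0xAB123456u) << " "    // prints 1
            << table.contains(0xCD123456u) << "\n";  // prints 0
}
```

The only properties the analysis below relies on are that each occupied slot costs 3 bytes and that each sub-table's slot array stays at least 80% full once it has grown past its small initial size.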

Analysis. The trick here is that, even though H is storing 4-byte keys, each H_i is only storing 3-byte keys. The first byte of each key is encoded implicitly by the choice of which H_i the key is stored in. This is a (somewhat unusual) application of a technique known as “quotienting”.

How much space do the H_i's consume? Remember that each H_i is kept at a load factor of 80% or above. So, if H_i is storing n_i 3-byte keys, then it uses at most 3 n_i / 0.8 = 3.75 n_i bytes of space.

The different sub-hash-tables H_1, \ldots, H_{256} may have very different sizes from one another at any given moment. Nonetheless, since the sum of their sizes is n (the total number of keys in H), the total space used by the sub-hash-tables is at most 3.75 n bytes.

Finally, we need to account for the metadata of the 256 sub-hash-tables. All things considered, in my implementation, this uses less than 6000 bytes. So our total space is at most 3.75 n + 6000 bytes.
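Put differently, the space per key is at most (3.75 n + 6000)/n bytes, which works out to 30 + 48000/n bits.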

So, once n exceeds 24,000, the total space of our hash table comes out to less than 32 bits per key, and once n exceeds 48,000, it drops below 31 bits per key. We’ve built a real-world extremely tiny hash table.

Real-World Performance. So is this data structure fast? Well, it’s not blazingly fast. But it’s not embarrassingly slow either.

On my laptop, the time to first insert ten million keys and then do twenty million queries (half positive, half negative) is 3.7 seconds. If we instead use the C++ std::unordered_set (configured to use the same hash function as our extremely tiny hash table), the same sequence of operations takes 7.3 seconds.
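For reference, the operation mix looks roughly like the following (a sketch of the workload, not my actual benchmarking harness; it assumes the TinyTable sketch from earlier in the post is in scope, replacing the small main from that sketch, and the seed and key distribution are arbitrary choices):

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Insert ten million random keys, then do twenty million queries: ten million
// for keys that were inserted (positive) and ten million for fresh random
// keys (almost all negative).
int main() {
  std::mt19937 rng(12345);
  std::uniform_int_distribution<uint32_t> dist;

  std::vector<uint32_t> keys(10'000'000);
  for (uint32_t& k : keys) k = dist(rng);

  TinyTable table;  // the extremely tiny hash table sketched above
  auto start = std::chrono::steady_clock::now();

  for (uint32_t k : keys) table.insert(k);

  size_t found = 0;
  for (uint32_t k : keys) found += table.contains(k);                           // positive queries
  for (size_t i = 0; i < keys.size(); ++i) found += table.contains(dist(rng));  // (almost all) negative queries

  auto stop = std::chrono::steady_clock::now();
  std::cout << "found " << found << " keys in "
            << std::chrono::duration<double>(stop - start).count()
            << " seconds\n";
}
```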

So we’re faster than the C++ hash table, but at the same time, that’s not a Herculean feat. (And it’s also not a fair comparison since, of course, the C++ hash table supports much more functionality.) But it is a proof of concept: you can build extremely tiny hash tables in the real world.

My code is available at https://github.com/williamkuszmaul/tinyhash.