When people talk about “space-efficient hash tables”, they are usually talking about the following type of guarantee: If we are storing keys, each of which are
bits, then the total space usage should be
bits for some small
.
But, what if I told you we can do better? In fact, it’s possible to construct a hash table that uses fewer than bits. That is, the hash table uses less space than if we were to write each of the
keys, one after another, in an array. As terminology, I’ll refer to such a hash table as extremely tiny.
Techniques for building extremely tiny hash tables have been known to theoreticians for decades. But how hard is it to build one in the real world? Not that hard at all, it turns out.
In this post, I’ll show you how to build a very simple extremely tiny hash table for 32-bit keys.
The Construction. Since our keys are 32-bits, we can write each key as
, where each
is one byte. Our extremely tiny hash table
will have the following structure: it consists of 256 sub-hash-tables
. To insert a key
into
, we simply insert the three-byte item
into table
. And to query whether
, we simply query whether
.
Finally, each is just a classical linear-probing hash table, dynamically resized to have a load factor (i.e., space efficiency) between 80% and 90%. That’s the entire construction.
Analysis. The trick here is that, even though is storing 4-byte keys, each
is only storing 3-byte keys. The first byte of each key is encoded implicitly by the choice of which
the key is stored in. This is a (somewhat unusual) application of a technique known as “quotienting”.
How much space do the ‘s consume? Remember that each
is kept at a load factor of 80% or above. So, if
is storing
3-byte keys, then it uses at most
bytes of space.
The different sub-hash-tables may have very different sizes than each other at any given moment. Nonetheless, since the sum of their sizes in
(the total number of keys in
), the total space used by the sub-hash tables is at most
bytes of space.
Finally, we need to account for the metadata of the 256 sub-hash-tables. All things considered, in my implementation, this uses less than 6000 bytes. So our total space is at most bytes.
For any , the total space of our hash table comes out to less than 31 bits per key. We’ve built a real-world extremely tiny hash table.
Real-World Performance. So is this data structure fast? Well, it’s not blazingly fast. But it’s not embarrassingly slow either.
On my laptop, the time to first insert ten million keys and then do twenty million queries (half positive, half negative) is 3.7 seconds. If we instead use the C++ std::unordered_set (configured to use the same hash function as our original hash table), the same sequence of operations takes 7.3 seconds.
So we’re faster than the C++ hash table, but at the same time, that’s not a Herculean feat. (And it’s also not a fair comparison since, of course, the C++ hash table supports much more functionality). But it is a proof of concept: you can build extremely tiny hash tables in the real world.
My code is available at https://github.com/williamkuszmaul/tinyhash.