(This is Part Two of a two-part answer. Part one is available here.)
Step 4: Think Recursively
Let's take a minute to think about how we got here. We started off by computing the prefix sum at each bit, as shown here:

This used O(n log n) total bits.
Then we saw that we could reduce the number of bits by subdividing our array into blocks of size log n, only computing the prefix sums at the start of those blocks, and then within each block writing out local prefix sums, as shown here:

This used O(n log log n) bits, a major improvement from before. What we learned here is the following insight: if you're going to write out prefix sums, it's better to subdivide into blocks before doing so.
And with that insight in mind, let's look back at where we are right now. Within our new structure, we are still writing out prefix sums within each block. But we also know that if you're going to write out prefix sums, it's better to subdivide into smaller pieces. Could we apply this idea within each block? Here's the idea. We'll subdivide each block into "miniblocks" of some size we'll pick later. Within each block, we'll write out our prefix sums at the start of each miniblock, and then, within each miniblock, write out prefix sums for each of the miniblock's elements. That might look like this:

As before, for space reasons I haven't drawn out the subdivision of each block into miniblocks, nor the prefix sums within each miniblock, since doing so would result in a diagram that doesn't fit on the screen. :-)
The procedure for doing a rank query here is essentially the same as what we did with the two-level structure, just now with three levels. We figure out which block our query ends in, then write down the prefix sum at the start of the block. Then we find out which miniblock our query ends in, and write down the prefix sum at the start of the miniblock. Finally, we add in the prefix sum stored for our bit within its miniblock, at the very bottom level. All of these steps take time O(1), so queries still take time O(1).
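To make the lookup pattern concrete, here's a minimal C sketch of what a query against this three-level structure might look like. Every name in it (ThreeLevelRank, blockSums, miniSums, bitSums) is invented for the illustration, and I'm assuming each level is stored as a flat array of integers:

```c
#include <stddef.h>
#include <stdint.h>

/* Purely illustrative three-level rank structure; these names and types
 * are invented for this sketch. */
typedef struct {
    size_t blockSize;      /* about log n                                             */
    size_t miniSize;       /* about log log n                                         */
    uint64_t *blockSums;   /* 1s before the start of each block (global count)        */
    uint32_t *miniSums;    /* 1s before the start of each miniblock, within its block */
    uint8_t  *bitSums;     /* 1s before each bit, within its miniblock                */
} ThreeLevelRank;

/* rank(i): the number of 1 bits among the first i bits. */
uint64_t rank1(const ThreeLevelRank *r, size_t i) {
    return r->blockSums[i / r->blockSize]   /* which block the query ends in     */
         + r->miniSums[i / r->miniSize]     /* which miniblock it ends in        */
         + r->bitSums[i];                   /* the offset of the bit itself      */
}
```

The whole query is three array reads and two additions, which is where the O(1) bound comes from.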
What does this do to our space usage? Intuitively, based on what we saw before, this should use less memory than our initial approach. But we have to work out the math to check whether that is indeed the case. Let's do that here.
At the top level, we've subdivided our array into blocks of size log n. There are, therefore, roughly n / log n blocks. A prefix sum at the top level ranges from 0 to n, inclusive, so each prefix sum uses O(log n) bits. Overall, that uses O((n / log n) log n) = O(n) bits.
At the second level, we've subdivided our blocks into "miniblocks." Let's say that our miniblocks have size b'. There are a total of O(n / b') miniblocks. At the start of each miniblock, we write down the prefix sum within the block. Those prefix sums range from 0 to log n, inclusive, since each block contains log n bits. This means that we need O(log log n) bits per prefix sum here. Overall, for this second level, we therefore need O((n log log n) / b') bits.
Now let's look at the third level. At each of the n bits in our original bit array, we'll need to write down the prefix sum at that bit within its miniblock. If each miniblock has size b', then the maximum prefix sum within a miniblock is b', so we need O(log b') bits for each of these numbers. This collectively takes O(n log b') bits.
When we add all these terms together, we're left with a space usage of O((n log log n) / b' + n log b') bits. As before, picking b' to be too small will mean our miniblocks aren't big enough and we'll use too much space within each block writing down prefix sums (the O((n log log n) / b') term will be too big). If we pick b' to be too big, then we'll use too many bits writing down offsets within a miniblock (the O(n log b') term will be too big). There's some optimal point at which we'd set b' to minimize the space usage, and it happens to be the case that this is when b' = O(log log n) (that is, b' is some constant multiple of log log n). When we choose b' this way, our space usage works out to O(n log log log n) bits - yet another improvement in the total amount of bits we use!
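To see where that choice of b' comes from, here's the arithmetic for one representative choice, b' = log log n (any constant multiple of it behaves the same way):

O((n log log n) / b' + n log b')
= O((n log log n) / log log n + n log log log n)
= O(n + n log log log n)
= O(n log log log n).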
At this point, you might spot a pattern. If we don't split into blocks at all, we use O(n log n) bits. If we split into blocks of size log n, we use O(n log log n) bits. If we split into miniblocks of size log log n, we use O(n log log log n) bits. Can we keep this up?
We can, but we're going to need to introduce some new notation in order to see how. :-)
Let's define log(k) n to be the logarithm function applied k times to the number n. So, for example:
- log(0) n = n
- log(1) n = log n
- log(2) n = log log n
- log(3) n = log log log n
- ...
Let's now reframe our previous approaches.
- If we split into blocks 0 times, our space usage is O(n log(1) n).
- If we split into blocks 1 time, our space usage is O(n log(2) n).
- If we split into blocks 2 times, our space usage is O(n log(3) n).
More generally, if we subdivide into blocks, then subdivide those blocks into blocks again, then subdivide those blocks into blocks again, etc., each time making our blocks logarithmically smaller than what we started with, and do this k total times, it looks like our space usage is O(n log(k+1) n). Is that a coincidence?
Turns out, no, it's not a coincidence, but there are a few details we have to watch out for. We can consider the following recursive construction that builds a data structure for ranking.
- If n is "sufficiently small," (say, n ≤ 2), just write down prefix sums for each bit.
- Otherwise, split your array of n items apart into blocks of size b = log n. Write down prefix sums at the start of each block. Then, recursively apply this construction to each block.
If this recursion goes on for k layers, you can work out that the space usage comes to O(nk + n log(k) n). The query time, if this goes on for k layers, is O(k), since at each layer we have to find which block we belong to and send the recursion a bit deeper.
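If it helps to see that recurrence in action, here's a small, purely illustrative C program that mimics the level-by-level accounting (the constants and the base case are arbitrary guesses, so treat the output as a trend rather than an exact bit count):

```c
#include <math.h>
#include <stdio.h>

/* Rough, purely illustrative bit count for the recursive construction.
 * Constant factors and the base case are arbitrary, so treat the output
 * as showing a trend, not an exact answer. */
double spaceBits(double n) {
    if (n <= 2.0) {
        return 2.0 * n;                       /* base case: a couple of bits per position */
    }
    double b = log2(n);                       /* block size at this level (about log n)   */
    double topLevel = (n / b) * log2(n);      /* one O(log n)-bit prefix sum per block    */
    return topLevel + (n / b) * spaceBits(b); /* plus a recursive structure per block     */
}

int main(void) {
    for (double n = 1e3; n <= 1e18; n *= 1e3) {
        printf("n = %.0e -> about %.2f bits per input bit\n", n, spaceBits(n) / n);
    }
    return 0;
}
```

Running it shows the bits-per-input-bit ratio creeping upward absurdly slowly as n grows, which is the O(nk + n log(k) n) bound in disguise.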
Just by eyeballing things, we can guess that k is going to be tiny compared to n. After all, every time we go through the recursion, the value of n shrinks by a log factor, and that's going to massively reduce how big it is! As an example, suppose we pick n to be the number of protons known to exist in the universe, which is approximately 2^256. Then
- At the top level of the recursion, n = log(0) 2^256 = 2^256.
- At the level below this, n = log(1) 2^256 = 256.
- At the level below this, n = log(2) 2^256 = 8.
- At the level below this, n = log(3) 2^256 = 3.
- At the level below this, n = log(4) 2^256 ≈ 1.58.
In other words, once we're five layers deep in the recursion, we've reduced the size of our input from "how many protons are estimated to be in the universe" to something smaller than two. There really aren't going to be that many layers here!
To quantify this, we can introduce the iterated logarithm function, denoted log* n. The value of log* n is, intuitively, "how many times you have to take a logarithm before you drop the number to 1 or lower." So, for example, log* 2^256 = 5. This function grows absurdly slowly. In fact, to find a value of n where log* n ≥ 10, we need to look at the number
2^2^2^2^2^2^2^2^2^2^2 (a power tower of eleven 2s)
which exceeds anything that anyone has ever conceived of that could fit into the known universe.
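Just to make the definition concrete, here's a tiny C helper that computes log* the way it's described above (a sketch using doubles, purely for building intuition):

```c
#include <math.h>

/* log* n: how many times we apply log2 before the value drops to 1 or below. */
int logStar(double n) {
    int count = 0;
    while (n > 1.0) {
        n = log2(n);
        count++;
    }
    return count;
}
```

With this, logStar(pow(2, 256)) comes out to 5, matching the count above; you'd need an input far too large to even represent as a double before it returned anything close to 10.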
Putting all this together, we now know that the number of layers of recursion used here is log* n. That means that the space usage for our structure is now
O(nk + n log(k) n)
= O(n log* n + n · O(1))    (once k = log* n, the remaining term log(k) n is a constant)
= O(n log* n),
and our query time is now O(log* n). For all intents and purposes, this is linear space usage and constant query time, since, as mentioned above, the smallest n where log* n exceeds 10 requires crazily iterated exponents to represent. Amazing!
So... we're done, right? Well, unfortunately, no. Here's why. First, from a practical perspective, while our space usage is essentially "some small constant times n" (say, 4n or 5n bits), we're still left with a situation where our data structure requires more space to store than the original array of bits. And if n is large, we might not have enough space in memory to store 4n or 5n bits.
Second, speaking as a proud citizen of Theoryland, there is a gap - albeit a shockingly small gap, but a gap nonetheless - between O(n log* n) and O(n) and between O(log* n) and O(1). What we'd ultimately want to do is get something that truly does use O(n) space with query times of O(1).
At this point it might not be clear how to get there. We've already seen the two major ideas from before - writing out fewer numbers, and writing out smaller numbers - and carried them to their logical conclusion. And indeed, to the best of my knowledge these ideas on their own can't push the space usage down further. To make additional progress, we're going to need to incorporate a new technique, one that's commonly employed in the data structures research community, but one that is almost unheard of in the general CS community. That technique goes by a mysterious title: the Method of Four Russians.
Step 5: Use the Method of Four Russians
To understand where we're going, I want to jump us backwards in time to the first place we ever tried subdividing our array of bits into blocks. That's when our data structure looked like this:

At this point, we'd split our array into blocks of some size b and written down the prefix sums at the start of each block. At the time we did this, we didn't know how big our blocks would be. Later on, we found out that choosing b = log n (or, more generally, choosing b to be some multiple of log n) worked out particularly well. Subjectively, log n is substantially smaller than n. In other words, intuitively, we're picking blocks that are absolutely tiny from the perspective of the size of our original input.
Let's play around with this idea a bit. For now, let's move away from the idea that we're picking blocks of size log n, and instead imagine that we pick a block size that's generally "very small." For example, suppose we pick our block size to be b = 3, and, as before, write down prefix sums at the start of each block. Here's what that might look like:

And now, for an observation that's going to get us a lot of mileage. I picked this particular bit array and block size because there are 12 total blocks. However, with b = 3, there are only 8 possible distinct blocks. Those are shown here:
000 001 010 011 100 101 110 111
By the pigeonhole principle, since there are more total blocks in our subdivision than there are different combinations of three bits, some of these blocks must appear multiple times in the original array.
"Okay," you might be saying. "So some blocks get repeated. Why is that significant?" To see why, think about our two-layer structure. As a reminder of how our two-layer structure worked, we
- subdivided the original array into blocks of some size, writing down prefix sums at the start of each block, then
- wrote down prefix sums within each block.
Here's what that might look like with b = 3:

As before, I can't draw out the entire bottom layer because it won't fit on your screen. But I've drawn enough to point out a key detail. In this array are three copies of the block 101. Importantly, the prefix sums within those blocks are identical, since the blocks have the same bits. It doesn't matter where in the top-level array those blocks appear. The prefix sums within the block just care about the sums of the bits purely in the block, not the surrounding context.
If our goal is to reduce space usage, this seems like a prime opportunity. Each of those blocks needs to know what its prefix sums are, but there's no reason for us to write out separate copies of those prefix sums every time we see that block. We could imagine just writing down the prefix sums once per block, then finding some way to share those prefix sums across the blocks. If we could do that, assuming that we were guaranteed that the same blocks would repeat over and over again, we could potentially save a lot of space!
Here's what this might look like. We'll pick some block size b. There are 2^b possible blocks of size b, since each bit can either be 0 or 1 and there are b total bits. For each of those 2^b possible blocks, there are b+1 prefix sums we need to store, one for each bit and one for after all those bits. We could therefore form a table containing O(2^b · b) entries representing every possible prefix sum query that could ever be made on any possible block. For b = 3, that would look like this:

To see how to use this table, let's see how to query rank(17). Here's how this would work:
- As with a regular two-level structure, we begin by determining what block our query ends in. To do that, we compute ⌊17 / 3⌋ = 5. We're therefore in block number 5 (zero-indexed), and we can read off the prefix sum up to the start of the block, which is 10.
- We know the sum of the first 15 bits, but we need the sum of the first 17 bits. We therefore need to compute the sum of the first 17 % 3 = 2 bits within this block. To do so, we consult our table! Our block is 111, and we want to look up the sum of the first two bits. Consulting our table, we see that the sum of the first two bits of this block is 2.
- We add these two values together to get 10 + 2 = 12, which is the correct answer.
The key to making this run fast is the following: each block is a series of bits, which can be interpreted as a number. For example, our block, 111, is the number 7. We can therefore use the bits of the block themselves as an index into our table! The cost of this lookup is then the cost of a regular 2D table lookup, which is O(1). Amazing!
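As a concrete, toy-sized illustration, here's a C sketch that builds the full table for a small block size and then answers a rank query with one prefix-sum read plus one table lookup. All of the names (buildTable, blockSums, blockBits, rank1) are invented for this example, and it isn't tuned code:

```c
#include <stddef.h>
#include <stdint.h>

#define B 3   /* block size; it has to stay tiny, because the table has 2^B rows */

/* table[block][i] = number of 1s among the first i bits of the pattern "block",
 * reading the pattern's bits from most significant to least significant. */
static uint8_t table[1 << B][B + 1];

static void buildTable(void) {
    for (unsigned block = 0; block < (1u << B); block++) {
        table[block][0] = 0;
        for (unsigned i = 0; i < B; i++) {
            unsigned bit = (block >> (B - 1 - i)) & 1u;        /* i-th bit of the pattern */
            table[block][i + 1] = (uint8_t)(table[block][i] + bit);
        }
    }
}

/* blockSums[j] = number of 1s before block j; blockBits[j] = block j's bits packed
 * into an integer, most significant bit first. Both are assumed to have been filled
 * in when the bitvector was preprocessed. */
static uint64_t rank1(const uint64_t *blockSums, const uint8_t *blockBits, size_t i) {
    return blockSums[i / B] + table[blockBits[i / B]][i % B];
}
```

In the rank(17) walkthrough above, blockBits[5] would hold the pattern 111 (the number 7), so the table lookup is table[7][2] = 2, which gets added to blockSums[5] = 10.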
Now, how much space does this approach use? Some of you may have noticed that the number of possible blocks of size b is 2^b and felt a twinge of suspicion. The function 2^b grows very quickly as a function of b, which means that we'd need to pick very small blocks for this to work! And indeed we will - but hold that thought for now.
To work out the exact details of how much space we'll need, we need to account for two separate parts of the structure. First, there's the top-level array of running prefix sums up to the start of each block. As we've seen before, that will use O((n log n) / b) bits.
Next, and most importantly, is our table. As we saw earlier, this table will have dimensions 2^b × (b + 1), since there are 2^b possible blocks of size b and each block can be queried at b+1 indices. That means our table has O(2^b · b) entries.
But, as we've seen many a time in the course of this journey, we then have to ask: how many bits is each entry? Each entry stores a number between 0 and b, inclusive, and therefore uses O(log b) bits. Putting this all together, we end up with a table that needs O(2^b · b · log b) total bits. (Wow, it's like three exponential generations of b! You have the "grandparent" 2^b, the "parent" b, and the "child" log b. ^_^)
Overall, this means that our space usage is O((n log n) / b + 2^b · b · log b). Let's think about what this means.
- If b is too small, then, as with our previous structures based on breaking things apart into blocks, we will have too many blocks and therefore O((n log n) / b) will be too big. In particular, if we're gunning for O(n) total space, we need b to be approximately log n.
- If b is too large, then there will be way too many possible blocks of size b, and the O(2^b · b · log b) term will be too big. In particular, if we want our total space usage to be O(n), we need to pick b so that the 2^b term isn't too big. That means that b will end up being approximately log n.
All of this is suggesting that we pick b = log n. However, that choice won't work. If we do this, then the O(2^b · b · log b) term will work out to
O(2^b · b · log b)
= O(2^(log n) · log n · log log n)
= O(n log n log log n).
(This works because 2^(log n) = n, since the function log n is the inverse of the function 2^n.) And now we're using more space than we started with.
However, what we can do is pick b = k log n for some constant k < 1 that we'll pick later. If we do this and apply properties of logarithms, we'll get the following:
O(2^b · b · log b)
= O(2^(k log n) · k log n · log(k log n))
= O(2^(log n^k) · k log n · (log k + log log n))    (properties of logarithms)
= O(2^(log n^k) · log n · log log n)    (k is a constant)
= O(n^k · log n · log log n)
Now, pick k = ½, meaning we pick b = ½ log n. Then our space usage simplifies down to
O(2^b · b · log b)
= O(n^k · log n · log log n)
= O(√n · log n · log log n)
= O(n^(2/3)).
Don't worry if you're scratching your head over that last step. The reason this works is that both log n and log log n grow slower than any root of n, and so we're able to conservatively bound the total space usage at O(n2/3).
Putting everything together, our space usage works out to
O((n log n) / b + 2^b · b · log b)
= O((n log n) / ((1/2) log n) + n^(2/3))
= O(n + n^(2/3))
= O(n).
(That last step follows because n^(2/3) grows much more slowly than n does.) Amazing! We've managed to get linear space usage and constant query time!
The key insight here, that if the blocks are sufficiently small, we can precompute all of them and share space, is sometimes called the Method of Four Russians or a Four-Russians Speedup. It takes its name from a paper by four Soviet computer scientists who pioneered the technique. I like to think of it as "divide, precompute, and conquer:" you break a large problem down into tiny pieces, precompute the solution to each tiny piece, and then combine the solutions of those pieces into a solution for the larger problem. It's an amazing technique that shows up all over advanced data structures as a way of removing a log factor from runtime or space usage.
So, we must be done at this point, right? Surely, there's no further room for improvement? Well, almost. But not quite.
It's great that we have O(n) total bits of storage, but exactly how many bits is that? If you work out the exact value, it's approximately 2n + n^(2/3) bits. That is a significant improvement over where we started, but we're still using roughly twice as many bits for our data structure as the original bitvector itself uses. And if that bitvector is enormous, that can be a problem!
Our new goal will be to reduce our space usage even further: to use a sublinear number of bits for our data structure. That is, we're going to aim to get our space usage so low that we end up using fewer bits for our rank query structure than are in the original bit array itself.
How is this possible? Turns out we already have all the pieces we need. We just need to put them together in the right way.
Step 6: Combine Strategies
We have essentially come up with two parallel strategies for computing ranks.
- Subdivide the input into blocks of size log n. Write down prefix sums at the start of each block. Then recursively repeat this process on each block.
- Split the input into blocks of size ½ log n. Write down prefix sums at the start of each block. Then precompute a table of how to handle queries within blocks of size ½ log n.
Strategy (1) gave us an O(n log* n)-bit data structure, with the space savings coming from the fact that it's more space-efficient to split things apart into blocks than it is to solve the problem directly. Strategy (2) gave us an O(n)-bit data structure, with the space savings coming from the fact that once we hit size ½ log n, we can precompute all possible queries.
Now for the last insight: what if we mix ideas (1) and (2) together? Specifically, here's how we're going to do things.
- Split the input array of n bits into blocks of some block size b. Write down prefix sums at the start of each block.
- Subdivide each block of size b into "miniblocks" of size ½ log n. Within each block, write down prefix sums at the start of each miniblock.
- Build a Four Russians table that says, for each possible miniblock of size ½ log n and for each query index within such a miniblock, what the prefix sum in that miniblock at that index is.
Querying this data structure feels like a mix of the recursive and the Four Russians approach.
- Find which block your query ends in. Write down the prefix sum at the start of that block.
- Find which miniblock within that block your query ends in. Write down the prefix sum at the start of that miniblock.
- Use the Four Russians table to look up the sum of the remaining bits within the miniblock.
Each step takes time O(1), so our queries still run in time O(1).
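Here's what a query against this combined structure might look like in C. As before, every name here (CombinedRank, miniBits, and so on) is invented for the sake of the sketch, and I'm glossing over how the raw bits get packed; the point is just that a query is still a constant number of array and table reads:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the combined structure; every name is invented for this example. */
typedef struct {
    size_t    blockSize;   /* about log^2 n                                              */
    size_t    miniSize;    /* about (1/2) log n                                          */
    uint64_t *blockSums;   /* global prefix sum at the start of each block               */
    uint16_t *miniSums;    /* prefix sum within the block at the start of each miniblock */
    uint32_t *miniBits;    /* each miniblock's bits, packed into an integer              */
    uint8_t **table;       /* table[pattern][offset]: the Four Russians lookups          */
} CombinedRank;

/* rank(i): the number of 1 bits among the first i bits. */
static uint64_t rank1(const CombinedRank *r, size_t i) {
    size_t mini = i / r->miniSize;                        /* which miniblock the query ends in */
    return r->blockSums[i / r->blockSize]                 /* sum up to the block start         */
         + r->miniSums[mini]                              /* plus sum up to the miniblock start */
         + r->table[r->miniBits[mini]][i % r->miniSize];  /* plus the table lookup             */
}
```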
The intuition behind this approach is the following. The Four Russians table uses O(n^(2/3)) space, which is already sublinear. To drop our space usage below n, we need to make sure our prefix sums don't use too many bits. By adding a middle layer, we can pick large-ish blocks so that the top-level prefix sums don't use too much space, but then avoid the cost of those larger blocks by using the Four Russians table.
If you work out the math on how much space this strategy will require, we will end up needing
- O((n log n) / b) bits for the top-level prefix sums,
- O((n log b) / log n) bits for the within-block prefix sums at the start of each miniblock (there are O(n / log n) miniblocks, and each of these prefix sums is a number between 0 and b, so it needs O(log b) bits), and
- O(n^(2/3)) bits for the Four Russians table.
Adding this together, we get a space usage of O((n log n) / b + (n log b) / log n) + O(n^(2/3)) bits. Once again, we find ourselves in a situation where if b is too small, we use too many bits at the top level, and if b is too big, we use too many bits one level below that. What choice of b gives the optimal balance?
Surprisingly, the optimal choice of b turns out to be b = log^2 n (that is, b = (log n)^2). If you plug this choice of b into the expression, we get the following space usage:
O((n log n) / b + (n log b) / log n) + O(n^(2/3))
= O((n log n) / log^2 n + (n log(log^2 n)) / log n) + O(n^(2/3))
= O(n / log n + (n log log n) / log n) + O(n^(2/3))
= O(n · (log log n / log n))    (the n^(2/3) term is dominated by the others)
That is a weird-looking space usage - is it good? Is it bad? What does it mean? Well, notice that even though log n grows slowly, log log n grows even more slowly than that. As an example, if n ≈ 4,000,000,000, then log n ≈ 32 and log log n ≈ 5. That means that log log n / log n ≈ 1/7. So the total number of bits we need for our data structure, O(n · (log log n / log n)), is a sublinear number of bits! In other words, we use fewer bits for our data structure than the original array of bits needs!
That isn't to say that we can throw the original array of bits away. Our data structure requires us to still have access to it, since once you're down to a miniblock you need to use the bits of that miniblock as an index into the Four Russians table. Rather, if you already have the n bits of the original bit array lying around, you can add in a sublinear number of additional bits and suddenly be able to compute ranks in time O(1).
So SURELY we are done at this point, right? We MUST have gone as far as we can go, right? Well...
Step 7: The Story Continues
From a Theoryland perspective, we could call it quits here. We have proved that it's possible to compute ranks in a bitvector that's n bits long using fewer than n additional bits.
This combined approach is an example of a succinct data structure. Intuitively, a succinct data structure is one whose space usage equals the space needed to write out the data, plus something that grows more slowly than that. If you're familiar with little-o notation, a succinct data structure is one whose space usage is X + o(X), where X is the number of bits needed to write out the data itself.
But in other senses, in Theoryland, we aren't yet done. We know that it's possible to solve ranking in constant time and with space O(n · (log log n / log n)). Is that the best possible space we can achieve with constant query time, or can we go lower? Turns out we can do much better than this. The approach I've shown here was invented in 1989 by Guy Jacobson in a frequently-cited PhD thesis. Recently (2019), Huacheng Yu published a paper describing a set of theoretically optimal succinct rank data structures that have the best possible tradeoffs between query time and space usage.
Then there's what things look like in practice. The structure we discussed here doesn't perform super well in practice due to the high constant factors required for reading variable-bit numbers (see the above C code, for example) and because of the poor locality of reference of the multiple table lookups. As an example, the poppy data structure supports fast ranking in practice with minimal overhead.
Finally, there are generalizations and other related problems. The wavelet tree generalizes rank on bitvectors to rank on arbitrary strings. In the generalized version of this operation, the rank operation takes as input an index and a character, then asks "how many times does this character appear before this index?" This has applications in compressed data structures for text, like the FM-index. The inverse of the rank operation is called select, where queries are of the form "where is the ith 1 bit in this bitvector?" Using similar techniques to succinct rank, plus some extra problem-specific insights, it's possible to get select queries that take time O(1) with sublinear overhead.
Hope this helps!
(This post is based on a lecture I gave in my data structures course on succinct rank and select.)