Canonical Huffman codes are a speed optimization, not a space optimization


I was working on a single-header compression codec for C/C++, and was doing research on how existing compression codecs work. DEFLATE uses Huffman codes, but specifically, “canonical” Huffman codes. Here’s a story about me trying to figure out why DEFLATE used them, and how hard it is to sift through irrelevant-but-interesting-sounding information that people keep parroting without digging into it.

What’s so good about the Huffman canon?

Huffman codes are a type of binary entropy coding used in formats like .zip and others. They’re a fairly simple and very effective way to store highly redundant data that doesn’t have a lot of local structure, like the output of an RLE encoder. The Deflate compression format uses them to further compress data after it’s already been compressed with a different byte-level algorithm. However, Deflate specifically uses “canonical” Huffman codes. Why?

The Wikipedia page on canonical Huffman codes goes on for many paragraphs and sections about the space savings they give when storing the tree itself. (There are no savings when storing the text encoded with the Huffman code.) It alludes to a possible performance improvement, but only once, and understates it. (“Dramatically simplified”, as the phrasing was when I read the article, and still is at the time of writing, does not mean or imply “made dramatically more efficient” to programmers. Often, if not usually, more complex code is much faster.)

I thought this was stupid. Yes, storing the entire code for each symbol one at a time would be wildly inefficient, but there are other ways to store arbitrary non-canonical Huffman trees that are more efficient than the naive way of doing it. For example, you could store a bit for whether the current node is a parent or a leaf, then store the byte/character/symbol it maps to if it’s a leaf, or its children if it’s not, in a list, walking down the tree starting on the left. Then, decoding the tree follows the same path down the tree, popping a bit to see if it’s at a leaf or not. This works, and for a set of only 256 symbols the maximum overhead is 511 bits of structure (a binary tree with 256 leaves has at most 511 nodes, one flag bit each), or about 64 bytes on top of the symbol bytes you’d store anyway. This is why I thought the space-savings argument for canonical Huffman codes was stupid. 64 bytes in the grand scheme of storing an entire compressed file is basically nothing.
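That naive serialization can be sketched in a few dozen lines of C. This is a minimal sketch of the scheme described above, not anything from Deflate; the `Node`/`BitIO` names and the MSB-first bit order are my own assumptions:

```c
#include <stdint.h>
#include <stdlib.h>

/* A plain Huffman tree node; these names are mine, not Deflate's. */
typedef struct Node {
    struct Node * left;   /* both child pointers are NULL for a leaf */
    struct Node * right;
    uint8_t symbol;
} Node;

/* Minimal MSB-first bit writer/reader over a caller-provided, zeroed buffer. */
typedef struct { uint8_t * bytes; size_t bit_pos; } BitIO;

static void put_bit(BitIO * io, int bit)
{
    if (bit)
        io->bytes[io->bit_pos >> 3] |= (uint8_t)(0x80u >> (io->bit_pos & 7));
    io->bit_pos += 1;
}

static int get_bit(BitIO * io)
{
    int bit = (io->bytes[io->bit_pos >> 3] >> (7 - (io->bit_pos & 7))) & 1;
    io->bit_pos += 1;
    return bit;
}

/* Walk the tree left-first: a 1 bit plus the symbol for leaves, a 0 bit for parents. */
static void write_tree(BitIO * io, const Node * node)
{
    if (node->left == NULL)
    {
        put_bit(io, 1);
        for (int i = 7; i >= 0; i--)
            put_bit(io, (node->symbol >> i) & 1);
    }
    else
    {
        put_bit(io, 0);
        write_tree(io, node->left);
        write_tree(io, node->right);
    }
}

/* Decoding pops the same bits while walking the same path back down the tree. */
static Node * read_tree(BitIO * io)
{
    Node * node = calloc(1, sizeof(Node));
    if (get_bit(io))
    {
        for (int i = 0; i < 8; i++)
            node->symbol = (uint8_t)((node->symbol << 1) | get_bit(io));
    }
    else
    {
        node->left = read_tree(io);
        node->right = read_tree(io);
    }
    return node;
}
```

For each node you pay exactly one flag bit beyond the symbol bytes, which is where the 511-bit worst case above comes from.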

Surely the Wikipedia page was misguided. I poked around on the internet for a couple days while doing other things, trying to make sense of this. It didn’t make sense. People just talked about the space savings and nothing else. Eventually, I gave up and focused on speeding up other parts of my codec, replacing the RLE system with an LZSS-style lookback system (yes, I was dumb enough to start with RLE).

It didn’t feel right that something so seemingly important only had a few dozen or hundred bytes of gain. Surely there had to be another reason to use it. Canonical Huffman codes are too elegant, when you look at them, to not have some useful property. What was it?

Oh Hi Mark

I went looking again, and this time, I got lucky. I found a post from 2018 on Stack Overflow by Mark Adler (of zlib fame) that, in addition to pointing out some optimization mistakes, alluded to the existence of a faster decompression algorithm that only works with canonical Huffman codes, and pointed at the puff.c source code. He also outright stated that the textbook approach, reading codewords from the compressed bitstream one at a time and traversing the Huffman tree branch by branch, was slower than what puff.c does. And I happened to be using the textbook approach.

(Mark’s post as I found it, with the relevant text highlighted)

It took me a few hours to figure out what the puff.c code was doing, and why it worked. But then I sat down and listed out a simple canonical Huffman code with just enough complexity to maybe give me a hint:

    0    : A
    100  : Z
    101  : B
    110  : D
    1110 : Q
    1111 : R

This was it. For every code length, there was an obvious cutoff value, below which the codes had that length, and above which the codes had to be longer.

This means that, assuming the maximum code length is N, you can decode an entire codeword, start to finish, with no memory accesses other than a read per bit from an N-element-long array of shorts holding the cutoff values. puff.c just calculates the cutoff value mid-loop with addition, based on an array of counts, which is why I didn’t understand it at first.

Once you know where the end of the codeword is, you have to do more memory accesses, but if you limit your codeword length to N, then you can implement it as an index operation into a 2^N-element-long array. If N is 15, then the array will fit entirely inside of L1/L2 cache on modern CPUs.

So, here’s pseudocode of how puff.c’s fast canonical Huffman decoder works:

argument bit_buff
var code = 0
// 15 bits is the max bit length of code words in deflate
for code_length in 1..=15
	var bit = bit_buff.pop_bit()
	code <<= 1
	code |= bit

	var cutoff = get_cutoff(code_length)
	// if we're below the cutoff, we're done. emit the symbol
	if code < cutoff:
		return emit_symbol(symbol_lookup[code])
	

// imagine the following huff tree:
// 0    : A
// 100  : Z
// 101  : B
// 110  : D
// 1110 : Q
// 1111 : R
// the cutoff for codes of length one is 1
// the cutoff for codes of length two is 10
// the cutoff for codes of length three is 111
// the cutoff for codes of length four is 10000
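The cutoff-from-counts calculation that puff.c folds into its decode loop can also be done up front as a table-building step. This is a sketch under my own naming (`build_tables`, `next_code`; though `max_codes` and `symbols` match the decoder code later in this post), not puff.c’s actual code, with the max length cut down to 4 so the example tree above fits:

```c
#include <stdint.h>

#define MAX_LEN 4  /* Deflate uses 15; 4 is enough for the example tree */

/* Build the per-length cutoff table (`max_codes`) and the flat symbol table
   from each symbol's code length. Canonical rule: the first code of length L
   is the previous length's cutoff shifted left by one bit, and the cutoff
   itself is first_code + count. */
static void build_tables(const uint8_t * lens, int num_symbols,
                         uint16_t max_codes[MAX_LEN + 1],
                         uint8_t symbols[1 << MAX_LEN])
{
    uint16_t counts[MAX_LEN + 1] = {0};
    uint16_t next_code[MAX_LEN + 1] = {0};
    for (int s = 0; s < num_symbols; s++)
        counts[lens[s]]++;

    uint16_t first = 0;
    for (int len = 1; len <= MAX_LEN; len++)
    {
        next_code[len] = first;
        max_codes[len] = (uint16_t)(first + counts[len]);
        first = (uint16_t)((first + counts[len]) << 1);
    }

    /* Walking the symbols in order hands out canonical codes within each
       length, and code values never collide across lengths (each length's
       first code is at least the previous cutoff doubled), so one flat
       2^MAX_LEN array can hold every symbol. */
    for (int s = 0; s < num_symbols; s++)
        if (lens[s] != 0)
            symbols[next_code[lens[s]]++] = (uint8_t)s;
}
```

Feeding it the lengths from the example (A:1, Z:3, B:3, D:3, Q:4, R:4) reproduces the cutoffs listed in the comments: 1, 10, 111, and 10000 in binary.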

After implementing a canonical Huffman code generator, a pass to make sure the Huffman code was limited-length, and finally, this type of fast Huffman string decoder, I saw a very substantial speed improvement of about 57% (1.797s down to 1.147s).

// compressed with only huffman, not with LZSS/RLE/delta
// best of 3 runs, but also representative of a typical run

// with length-limited canonical huffman code:
$ ./loh.exe x data/Godot_v4.1.3-stable_win64.exe.loh data/asdf
time to decode huffman-packed data: 1.147s

// with arbitrary huffman code:
$ ./loh.exe x data/Godot_v4.1.3-stable_win64.exe.loh data/asdf
time to decode huffman-packed data: 1.797s

New code:

    size_t i = 0;
    
    // operating on bit buffer input bytes is faster than operating on individual input bits
    uint8_t * in_data = buf->buffer.data;
    size_t in_data_len = buf->buffer.len;
    uint16_t code_word = 0;
    uint8_t code_len = 1;
    // loop over the remaining input bytes
    for (size_t j = buf->byte_index; j < in_data_len; j++)
    {
        // loop over bits in each byte
        uint64_t word = in_data[j];
        for (uint8_t b = 0; b < 8; b += 1)
        {
            code_word = code_word | (uint16_t)(word & 1);
            word >>= 1;
            if (code_word < max_codes[code_len])
            {
                ret.data[i++] = symbols[code_word];
                if (i >= output_len)
                    goto out;
                code_word = 0;
                code_len = 0;
            }
            code_len += 1;
            code_word <<= 1;
        }
    }
    out:
    ret.len = output_len;

This isn’t the best this technique can do by a long shot. I’m probably missing some trivial low-level optimization, like cutting the inner loop down by a couple instructions, or pre-aligning the input buffer and operating on it in 64-bit chunks. But for the sake of demonstrating what should be the first thing people teach about Deflate, it does a good job! (To clarify, this isn’t a Deflate decoder, it’s just doing the same cutoff-based decoding technique.)
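For the curious, the 64-bit-chunk idea usually takes the shape of a branchless bit-buffer refill along these lines. This is a sketch under my own naming, not code from loh or puff.c, and it assumes a little-endian host and at least 8 readable bytes past the read pointer:

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t * ptr;  /* next unread byte; caller guarantees 8 readable bytes */
    uint64_t bits;        /* bit buffer, LSB = next bit to pop */
    unsigned count;       /* number of valid bits in `bits`, always <= 63 */
} BitReader;

/* Top up the buffer to at least 56 valid bits with a single 8-byte read. */
static void refill64(BitReader * br)
{
    uint64_t chunk;
    memcpy(&chunk, br->ptr, 8);        /* unaligned-safe load; little-endian host assumed */
    br->bits |= chunk << br->count;    /* splice the new bits above the old ones */
    br->ptr += (63 - br->count) >> 3;  /* advance by the whole bytes actually consumed */
    br->count |= 56;                   /* count lands in 56..63; leftover top bits get
                                          harmlessly re-OR'd with identical data next time */
}

static unsigned pop_bits(BitReader * br, unsigned n)  /* n <= 31 */
{
    unsigned out = (unsigned)(br->bits & ((1u << n) - 1));
    br->bits >>= n;
    br->count -= n;
    return out;
}
```

With a refill like this, the per-bit `pop_bit()` in the inner decode loop turns into plain shifts and masks on a register, and one refill covers several codewords’ worth of bits.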