Individual URL compression experiment initial thoughts


Here is a seemingly simple problem that I can't find a proper solution for: given a single web page URL, devise an algorithm to compress it.

The only constraints are:

  • encoding and decoding overhead should be fairly low (say no more than O(n^2));
  • the algorithm can use as much static data as it needs (say at most 4 GiB);
  • however, it can't rely on a dynamically updated database (thus read-only file-system access only).

I've tried Huffman coding over (up to 256 K) tokens identified with Byte-Pair-Encoding (see Byte-Pair-Encoding on Wikipedia, and A New Algorithm for Data Compression for the original paper), with one dictionary per URL part (host, path, and query), and I get somewhere between 50% and 70% savings (i.e. a compression ratio between 0.3 and 0.5) depending on the part (more on the path and query, less on the host).
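
For concreteness, here is a minimal Python sketch of this per-part BPE-plus-Huffman idea. It is not my actual implementation: the toy corpus, the merge count, and the helper names are illustrative only, and the real experiment trained up to 256 K tokens on a large URL sample.

```python
import heapq
from collections import Counter
from itertools import count
from urllib.parse import urlsplit

def train_bpe(samples, num_merges=200):
    """Learn BPE merges over a sample of strings; returns the ordered merge list."""
    corpus = [list(s) for s in samples]              # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))        # count adjacent token pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent pair
        merges.append((a, b))
        for toks in corpus:                          # apply the merge in place
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

def tokenize(s, merges):
    """Split a string into tokens by replaying the learned merges."""
    toks = list(s)
    for a, b in merges:
        i = 0
        while i < len(toks) - 1:
            if toks[i] == a and toks[i + 1] == b:
                toks[i:i + 2] = [a + b]
            else:
                i += 1
    return toks

def huffman_code(freqs):
    """Build a Huffman code (token -> bit string) from token frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()                                    # tie-breaker for equal weights
    heap = [(w, next(tie), {t: ""}) for t, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in c1.items()}
        merged.update({t: "1" + c for t, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# One token dictionary and one Huffman code per URL part, as described above.
urls = ["https://example.com/a/b?x=1", "https://example.com/a/c?x=2"]   # toy corpus
parts = {"host": [], "path": [], "query": []}
for u in urls:
    sp = urlsplit(u)
    parts["host"].append(sp.netloc)
    parts["path"].append(sp.path)
    parts["query"].append(sp.query)

for name, sample in parts.items():
    merges = train_bpe(sample)
    freqs = Counter(t for s in sample for t in tokenize(s, merges))
    code = huffman_code(freqs)
    bits = sum(len(code[t]) for s in sample for t in tokenize(s, merges))
    print(name, "-- compressed bits:", bits, "original bits:", 8 * sum(map(len, sample)))
```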

Here are a few similar projects that I've found (but which I didn't try because they either use plain Huffman coding, or are tailored for other types of data):

I'm curious what techniques others would try, and what compression ratio they would expect to reach.


What I've learned so far from my experiment:

  • (expected) employing standard compression algorithms (like Gzip, Zstandard, LZ4, etc.) doesn't help much when applied to individual URLs (as opposed to batches); less than 1% of the URLs are actually individually compressible;

  • (unexpected) adding a compression dictionary, at least for LZ4, makes around two thirds of the URLs compressible, but the gain is negligible, saving on average at most 10% per URL (a small sketch of this measurement follows the list);

  • (expected) splitting the URL into parts, identifying common tokens with Byte-Pair-Encoding (over a sample of these parts), and then building a Huffman code over these tokens yields somewhere around 50% to 70% savings (thus a compression ratio of about 0.3 to 0.5);

  • (unexpected) generating two Huffman codes per URL part, one for LZ4-compressible strings and a separate one for LZ4-incompressible ones, then deciding at runtime, based on LZ4 compressibility, which Huffman code to use for a given part, yields only around 10% to 20% savings (worse than just using a single Huffman code for all cases and ignoring LZ4-with-dictionary compressibility).
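
As a rough illustration of the first two bullets, the following sketch measures how many URLs shrink when compressed individually, and what a shared dictionary buys per URL. My experiment used LZ4 with a trained dictionary; this sketch substitutes zlib's preset-dictionary feature (raw deflate) with a hand-made dictionary of common substrings, and the corpus is synthetic, so the numbers will differ.

```python
import zlib

def deflate(data, zdict=None):
    """Raw-deflate `data`, optionally against a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(level=9, wbits=-15)
    else:
        c = zlib.compressobj(level=9, wbits=-15, zdict=zdict)
    return c.compress(data) + c.flush()

# Placeholder corpus; the real experiment used a large sample of crawled URLs.
urls = [f"https://example{i % 50}.com/articles/2024/{i}?page={i % 7}&lang=en".encode()
        for i in range(5000)]

# Plain per-URL compression: short strings rarely get smaller on their own.
plain_wins = sum(1 for u in urls if len(deflate(u)) < len(u))
print("individually compressible:", plain_wins, "of", len(urls))

# A shared dictionary of frequent substrings (a properly trained one would do better).
zdict = b"https://example.com/articles/2024/?page=&lang=en"
saved = [len(u) - len(deflate(u, zdict)) for u in urls]
print("average bytes saved per URL with a dictionary:", sum(saved) / len(urls))
```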


Thus, I think I was defeated... I can't go beyond the 0.3 to 0.5 compression ratio... Maybe this is the inherent entropy of URLs?

For example, as a best case for a standard compression algorithm, taking only the path part of each URL, sorting and deduplicating the paths, and then compressing around 225 million such strings in bulk (around 9.7 GiB of data) yields the following compression ratios (a small-scale sketch of this benchmark follows the list):

  • lzip -9 -- 0.25 (took a long time)
  • bzip3 -b 511 -- 0.25 (quite quickly)
  • xz -9 -- 0.25 (took a long time)
  • brotli -q 11 -- 0.26 (took an eternity)
  • zstd -19 -- 0.27 (took a long time)
  • bzip2 -9 -- 0.30 (quite quickly)
  • gzip -9 -- 0.33
  • lzop -9 -- 0.40
  • lz4 -12 -- 0.41
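
Here is a small-scale approximation of that bulk benchmark: sort and deduplicate the paths, join them, and compress the whole blob at once. The real run used the CLI tools above on ~9.7 GiB; this sketch uses the Python standard-library codecs (lzma for xz, bz2 for bzip2, zlib for gzip) on a synthetic corpus, so the ratios will not match mine.

```python
import bz2, lzma, zlib

# Placeholder paths; the real input was ~225 million deduplicated path strings.
paths = [f"/articles/2024/{i % 1000}/comments".encode() for i in range(100_000)]
blob = b"\n".join(sorted(set(paths)))              # sort, deduplicate, concatenate

for name, compress in [
    ("xz (lzma, preset 9)", lambda d: lzma.compress(d, preset=9)),
    ("bzip2 -9           ", lambda d: bz2.compress(d, compresslevel=9)),
    ("gzip -9 (zlib)     ", lambda d: zlib.compress(d, level=9)),
]:
    ratio = len(compress(blob)) / len(blob)
    print(f"{name} ratio: {ratio:.2f}")
```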

My Huffman-BPE approach achieves these ratios:

  • 0.38 / 0.41 -- if weighted by domain "importance" (based on Common Crawl Harmonic Centrality), which my Huffman codes take into account when they are built (see the sketch after this list);
  • 0.43 / 0.45 -- if weighted by occurrence count;
  • 0.46 / 0.47 -- without any weighting (thus the exact same inputs as for the standard compression algorithms, making for a fair comparison).
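
To show what the importance weighting means in practice, here is a sketch that scales each token's frequency by the weight of the domain it came from before building the Huffman code, so tokens from high-centrality domains get shorter codewords. The huffman_code helper is the same as in the first sketch, and the corpus and weights are made up; this is only an illustration of the idea, not my actual code.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Token -> bit string Huffman code (same helper as in the first sketch)."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()
    heap = [(w, next(tie), {t: ""}) for t, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in c1.items()}
        merged.update({t: "1" + c for t, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# (tokens of a URL part, domain importance) pairs; importance stands in for
# the harmonic-centrality score of the domain the URL belongs to.
corpus = [
    (["wiki", "/", "Main", "_", "Page"], 0.9),   # high-centrality domain
    (["shop", "/", "sku", "12345"],      0.1),   # low-centrality domain
]

weighted = Counter()
for tokens, importance in corpus:
    for t in tokens:
        weighted[t] += importance                 # scale counts by importance

code = huffman_code(weighted)

# Average code length weighted the same way, i.e. what the 0.38 / 0.41 figures
# above optimize for (shorter output for URLs from "important" domains).
total_w = sum(w for toks, w in corpus for _ in toks)
avg_bits = sum(w * len(code[t]) for toks, w in corpus for t in toks) / total_w
print("importance-weighted average bits per token:", round(avg_bits, 2))
```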

In conclusion, even in its best-case scenario, a state-of-the-art compression algorithm squeezes out only around 10%-20% more than my Huffman-BPE approach, and perhaps there isn't much more to be done for individual URL compression while keeping the runtime complexity and storage requirements low.