Here is a seemingly simple problem that I can't find a proper solution for: given a single web page URL, devise an algorithm to compress it.
The only constraints are:
- encoding and decoding overhead should be fairly low (say no more than O(n^2));
- the algorithm can have as much static data as it needs (say at most 4 GiB);
- but it can't have a dynamically updated database (thus read-only file-system access only).
I've tried Huffman coding over (up to 256 K) tokens identified with Byte-Pair Encoding (see Byte-Pair Encoding on Wikipedia, and "A New Algorithm for Data Compression" for the original paper), with one dictionary per part (host, path, and query), and I get somewhere between 50% and 70% savings (i.e. a compression ratio between 0.3 and 0.5) depending on the part (more on the path and query, less on the host).
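For illustration, here is a minimal Python sketch of that scheme (not my actual implementation): split the URL into parts, greedily tokenize each part against a static vocabulary, and sum precomputed Huffman code lengths. The vocabulary and the code lengths below are made-up placeholders standing in for the real BPE output.

```python
from urllib.parse import urlsplit

# Hypothetical static tables: token -> Huffman code length in bits, per part.
# In the real scheme these come from BPE training plus Huffman construction.
CODE_LENGTH_BITS = {
    "host":  {"www.": 4, "example": 9, ".com": 3},
    "path":  {"/articles/": 7, "2024": 8, "/": 2},
    "query": {"?id=": 5, "42": 9},
}
FALLBACK_BITS = 12  # escape + literal byte for anything not in the vocabulary


def tokenize_greedy(text, vocabulary):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = next((t for t in sorted(vocabulary, key=len, reverse=True)
                      if text.startswith(t, i)), None)
        if match is None:
            tokens.append(text[i])   # unknown character, falls back to a literal
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens


def estimated_bits(url):
    """Rough compressed size of one URL under the per-part code tables."""
    parts = urlsplit(url)
    pieces = {"host": parts.netloc, "path": parts.path,
              "query": "?" + parts.query if parts.query else ""}
    total = 0
    for name, text in pieces.items():     # the scheme prefix is ignored in this toy sketch
        table = CODE_LENGTH_BITS[name]
        total += sum(table.get(token, FALLBACK_BITS)
                     for token in tokenize_greedy(text, table))
    return total


print(estimated_bits("https://www.example.com/articles/2024/?id=42"),
      "bits, vs", len("https://www.example.com/articles/2024/?id=42") * 8, "bits raw")
```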
Here are a few similar projects that I've found (but which I didn't try because they either use plain Huffman coding, or are tailored for other types of data):
- https://github.com/siara-cc/Unishox2 -- this might be the most promising one;
- https://github.com/Ed-von-Schleck/shoco
- https://github.com/antirez/smaz
- https://github.com/siara-cc/Shox96
I'm curious what techniques others might try, and what their expected compression ratio would be.
What I've learned so far from my experiment:
- (expected) employing standard compression algorithms (like Gzip, Zstandard, LZ4, etc.) doesn't help much when applied to individual URLs (as opposed to batches); less than 1% of the URLs are actually individually compressible;
- (unexpected) adding a compression dictionary, at least for LZ4, makes around two thirds of the URLs compressible, but the gain is negligible, saving on average at most 10% per URL (see the sketch after this list);
- (expected) splitting the URL into parts, identifying common tokens with Byte-Pair Encoding (on a sample of those parts), and then building a Huffman code over these tokens yields somewhere around 50% to 70% savings (thus a compression ratio of about 0.3 to 0.5);
- (unexpected) generating two Huffman codes per URL part, one for LZ4-compressible strings and a separate one for LZ4-incompressible ones, and then at runtime choosing which code to use based on the LZ4 outcome, yields only around 10% to 20% savings (worse than just using a single Huffman code for all cases and ignoring LZ4-with-dictionary compressibility).
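To give a flavor of the dictionary experiment, here is a small sketch using zlib's preset-dictionary support as a stand-in for LZ4's (my actual experiment used LZ4; the sample URL and the dictionary contents below are hand-picked for illustration):

```python
import zlib

# An illustrative preset dictionary of frequent URL substrings; zlib matches
# against it as if it preceded the input, so the most useful material goes last.
PRESET_DICT = b"https://www..com.org/index.html?utm_source=&id=/wp-content/uploads/"

def compressed_size(url, zdict=None):
    params = dict(level=9, wbits=-15)   # raw deflate, no header/trailer overhead
    if zdict is not None:
        params["zdict"] = zdict
    c = zlib.compressobj(**params)
    return len(c.compress(url.encode()) + c.flush())

url = "https://www.example.org/wp-content/uploads/2021/07/photo.jpg"
print("raw bytes:      ", len(url))
print("deflate:        ", compressed_size(url))
print("deflate + dict: ", compressed_size(url, PRESET_DICT))
```

This hand-picked URL overlaps heavily with the dictionary, so it shrinks noticeably; on my real data the average gain stayed around the 10% mentioned above, presumably because typical paths and queries share far less material with any fixed dictionary.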
Thus, I think I was defeated... I can't get below the 0.3 to 0.5 compression ratio... Maybe this is just the inherent entropy of URLs?
For example, as an optimal use-case for a standard compression algorithm, taking only the path part of each URL, sorting and deduplicating them, and then compressing around 225 million such strings in bulk (around 9.7 GiB of data) yields the following compression ratios:
- lzip -9 -- 0.25 (took a lot)
- bzip3 -b 511 -- 0.25 (quite quickly)
- xz -9 -- 0.25 (took a lot)
- brotli -q 11 -- 0.26 (took an eternity)
- zstd -19 -- 0.27 (took a lot)
- bzip2 -9 -- 0.30 (quite quickly)
- gzip -9 -- 0.33
- lzop -9 -- 0.40
- lz4 -12 -- 0.41
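The measurement itself is straightforward; here is a sketch using Python's lzma module (the library behind xz) on a tiny illustrative sample. Note that at this scale the xz container overhead dominates, so the 0.25 ratio only shows up on the real multi-gigabyte input.

```python
import lzma

# A few illustrative path strings; the real experiment used ~225 million of them.
paths = [
    "/wp-content/uploads/2021/07/photo.jpg",
    "/index.html",
    "/articles/2024/",
    "/index.html",          # duplicate, removed by the set() below
]

# Sort, deduplicate, join with newlines, and compress the whole blob at once.
blob = "\n".join(sorted(set(paths))).encode()
compressed = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
print("bulk compression ratio:", len(compressed) / len(blob))
```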
My Huffman-BPE approach achieves these ratios:
- 0.38 / 0.41 -- if weighted by domain "importance" (based on the Common Crawl harmonic centrality); my Huffman construction takes this weighting into account when building the codes (see the sketch after this list);
- 0.43 / 0.45 -- if weighted by occurrence count;
- 0.46 / 0.47 -- without any weighting (thus the exact same inputs as for the standard compression algorithms, making this a fair comparison).
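To clarify the weighting: instead of feeding raw token counts into the Huffman construction, each count is scaled by an importance weight before building the tree. Here is a minimal sketch (the tokens, counts, and weights are invented for illustration); the effect is that tokens appearing under "important" domains end up with shorter codes.

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Standard Huffman construction; returns {symbol: code length in bits}."""
    tie = count()   # tie-breaker so heapq never has to compare the dicts
    heap = [(w, next(tie), {sym: 0}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# Invented token statistics: raw occurrence counts and a per-token "importance"
# (e.g. derived from the harmonic centrality of the domains the token occurs under).
token_counts = {"/wp-content/": 900, "/news/": 400, "/cgi-bin/": 300, "/api/v1/": 250}
importance   = {"/wp-content/": 0.2, "/news/": 1.0, "/cgi-bin/": 0.1, "/api/v1/": 0.8}

unweighted = huffman_code_lengths(token_counts)
weighted   = huffman_code_lengths({t: c * importance[t] for t, c in token_counts.items()})

for token in token_counts:
    print(f"{token:14s} unweighted: {unweighted[token]} bits, weighted: {weighted[token]} bits")
```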
In conclusion, even in their best-case scenario, state-of-the-art compression algorithms squeeze only around 10% to 20% more than my Huffman-BPE approach, so perhaps there isn't much more to be done for individual URL compression (while still keeping the runtime complexity and storage requirements low).