An ode to bzip

purplesyringa.moe

169 points by signa11 3 days ago · 92 comments

idoubtit 3 days ago

My experience does not match theirs when compressing text and code:

> bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.

I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.

Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's.

  • mppm 2 days ago

    What you are seeing here is probably the effect of window size. BZip has to perform the BWT strictly block-wise and is quite memory-hungry, so `bzip2 -9` uses a window size of 900KB, if I recall correctly. Dictionary-based algorithms are more flexible in this regard, and can gain a substantial advantage on very large and repetitive files. The article kind of forgets to mention this. Not that BZip isn't remarkably efficient for its simplicity, but it's not without limitations.
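The window-size effect is easy to reproduce with Python's standard library (a rough sketch; lzma stands in for the large-window side, since zstd isn't in older stdlibs):

```python
import bz2
import lzma
import random

# A 1 MB block of incompressible noise, repeated four times: all the
# redundancy lies in long-range repeats, 1 MB apart.
random.seed(0)
block = random.randbytes(1_000_000)
data = block * 4

# bzip2 -9 transforms each ~900 KB block independently, so it never
# sees that the input repeats every 1 MB.
bz = bz2.compress(data, 9)

# LZMA's default dictionary (8 MB at the default preset) spans the
# whole input and reduces the repeats to a handful of long matches.
xz = lzma.compress(data)

print(len(data), len(bz), len(xz))
```

The bzip2 output stays near the full 4 MB while the LZMA output collapses to roughly the size of one block, which is the same mechanism behind the SQL-file numbers above.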

  • bigiain 3 days ago

    I suspect the reason for the difference here is the specific use case and its implications for file size. The author's use case is Lua files to run in Minecraft, and I strongly suspect their example file at 327KB is much closer to "typical" for that use case than a 1GB SQL file.

    It wouldn't surprise me at all that "more modern" compression techniques work better on larger files. It also wouldn't surprise me too much if there was no such thing as a 1GB file when bzip was originally written; according to Wikipedia, bzip2 is almost 30 years old ("Initial release 18 July 1996"), and there are mentions of the preceding bzip (without the 2), which must have been even earlier than that. In the mid/late 90s I was flying round-the-world trips with a dozen or so 380 or 500MB hard drives in my luggage to screw into our colo boxen in Singapore, London, and San Francisco (because our office only had 56k ADSL internet).

    • adrian_b 2 days ago

      For large files, it is frequent to obtain much higher compression ratios when using a preprocessing method, e.g. by using lrzip (which invokes internal or external standard compressors after preprocessing the input to find long-range similarities).

      For instance, "lrzip -b", which uses bzip2 for compression, typically achieves much higher compression ratios on big files than using either xz or zstd alone. Of course, you can also use lrzip with xz or zstd, with various parameters, but among the many existing possibilities you must find an optimum compromise between compression ratio and compression/decompression times.

  • ac29 3 days ago

    > Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's

    I compressed kernel 6.19.8 with zstd -19 --long and bzip3 (default settings). The latter compressed better and was about 8x faster.

  • cogman10 3 days ago

    bzip is old and slow.

    It was long surpassed by lzma and zstd.

    But back in roughly the 00s, it was the best standard for compression, because the competition was DEFLATE/gzip.

    • duskwuff 3 days ago

      Also potentially relevant: in the 00s, the performance gap between gzip and bzip2 wasn't quite as wide - gzip has benefited far more from modern CPU optimizations - and slow networks / small disks made a higher compression ratio more valuable.

    • yyyk 3 days ago

      Even then, there were better options in the Windows world (RAR/ACE/etc.). Also, bzip2 was considered slow even when it was new.

      • thesz 3 days ago

        RAR/ACE/etc. used continuous compression: all files were concatenated and compressed as if they were one single large file, much like what is done with .tar.bz2. Bzip on Windows did not do that; there was no equivalent of .tar.bz2 on Windows.

        You can bzip2 -9 the files in some source code directory and tar the resulting .bz2 files; that is more or less equivalent to creating a ZIP archive with a BWT compression method. Then compare the result with tar-ing the same source directory and running bzip2 -9 on the resulting .tar.
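        That comparison can be sketched with Python's bz2 module on some hypothetical near-identical source files:

```python
import bz2

# Twenty small "source files" that differ only slightly, a made-up
# stand-in for a real source tree.
files = [f"int func_{i}(int x) {{ return x * {i}; }}\n".encode() * 50
         for i in range(20)]

# Per-file compression (zip-style): each member pays its own header
# overhead and cannot reference the other files.
per_file = sum(len(bz2.compress(f, 9)) for f in files)

# Solid compression (tar.bz2 / RAR solid style): one stream over the
# concatenation, so cross-file redundancy is exploited.
solid = len(bz2.compress(b"".join(files), 9))

print(per_file, solid)
```

The solid stream comes out smaller because the redundancy between files, not just within each file, gets compressed away.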

        The continuous mode in RAR was something back then, exactly because RAR had long LZ77 window and compressed files as continuous stream.

        • yyyk 3 days ago

          >continuous compression

          'Solid compression' (as WinRAR calls it) is still optional with RAR. I recall the default is 'off'. At the time, that mode was still pretty good compared to bzip2.

  • eviks 3 days ago

    is SQL file text or code?

  • 8n4vidtmkvmk 3 days ago

    And here i got best compression out of xz for SQL.

saghm 3 days ago

Early on, the article mentions that xz and zstd have gotten more popular than bzip, and my admittedly naive understanding is that they're considered to have better tradeoffs in terms of compression time and overall space saved. The performance section heavily discusses encoding performance of gzip and bzip, but unless I'm missing something, the only references to xz or zstd in that section are briefly handwaving about the decoding times probably being similar.

My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of the diminished popularity of bzip in favor of the non-gzip alternatives that it admits are the more popular choices in recent years.

  • 0cf8612b2e1e 3 days ago

    There was an analysis which argued that zstd is pareto optimal.

    https://insanity.industries/post/pareto-optimal-compression/

    • lucb1e 2 days ago

      > > Pareto optimality is a situation where no […] preference criterion can be better off without making at least one […] preference criterion worse off […].

      (Omissions theirs.)

      Wasn't that zstandard's stated goal? Not very surprising that it has this property, also considering it's much newer (2015) than the established tools like gzip (1992), bzip2 (1996), and LZMA as used by xz utils (1999).

      Edit: https://github.com/facebook/zstd/blob/4856a00164c1d7b947bd38... initial commit indeed states it's meant to have good ratios and good (de)compression speeds as compared to other tools, without sacrificing one for the other (»"Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).«). So Pareto by design, just not using that name

      • bonzini 2 days ago

        > meant to have good ratios and good (de)compression speeds as compared to other tools

        That does not mean it's Pareto optimal; Pareto-optimality forms a curve and while zstd, LZMA, LZ4, ZPAQ all want to be as close as possible to the curve, they focus on different parts of it. In particular zstd tries to stay on the middle part of the curve, while LZMA and LZ4 focus on opposite sides

                  ___.--- higher throughput
                /     LZ4
               / zStd
              |
              ; LZMA
             |
             | ZPAQ
            lower size
        
        Also, the Pareto curve is not necessarily known in advance. All you can do is add more and more algorithms or tweaks to understand what it looks like. For example, this blog post [https://insanity.industries/post/pareto-optimal-compression/] shows that prior to zstd, bzip2 and gzip were both pretty much Pareto optimal in the same area for ratio vs. compression speed. LZMA at low settings was a bit better but much slower. There was a huge gap between LZMA and LZ4, and bzip2/gzip filled it as best as they could.

        The same blog post shows that zstd is an absolute speed demon at decompression; while not all zstd settings are Pareto optimal when looking at size vs compression speed (in particular LZMA wins at higher compression ratios, and even considering zstd only there's hardly a reason to use levels 11-15), zstd is pretty much Pareto optimal at all settings when looking at size vs. decompression speed. On the other hand at intermediate settings zstd is faster and produces smaller files than gzip, which therefore is not Pareto optimal (anymore).
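        One axis of that curve can be sketched with the codecs in Python's standard library (zstd isn't available in older stdlibs, and real Pareto measurements need careful timing on real data, so this only illustrates the ratio/speed tension):

```python
import time
import zlib
import lzma

# ~225 KB of line-oriented text with a varying counter per line.
data = "".join(f"the quick brown fox jumps over the lazy dog {i}\n"
               for i in range(5000)).encode()

def measure(name, compress):
    # Time one compression call and report the output size.
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print(f"{name}: {len(out)} bytes in {dt * 1e3:.1f} ms")
    return len(out)

z1 = measure("zlib -1", lambda d: zlib.compress(d, 1))
z9 = measure("zlib -9", lambda d: zlib.compress(d, 9))
xz = measure("lzma   ", lzma.compress)
```

Each codec lands at a different point: zlib -1 is fast and large, zlib -9 smaller and slower, lzma smallest and slowest. Whether any of them is on the Pareto frontier depends on the full field of competitors.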

        • rurban 2 days ago

          This misses the very best compressors by Fabrice Bellard. https://bellard.org/nncp/ and, for text, ts_zip

          • lucb1e 2 days ago

            Interesting approach. The fastest of the 4 presented compressors ("LSTM (small)") is 24 times slower than xz, and their best compressor ("LSTM (large1)") is 429 times slower than xz. Let alone gzip or, presumably, zstandard (not shown in paper). They also ran the models on different CPUs (a Core i5 and a Xeon E5) so the results are not even comparable within the same paper. A linked webpage lists the author's decompression times, which are even worse: xz decompresses twelve thousand times faster (50MB/s vs. 4kB/s) when nncp has an Nvidia RTX 3090 and 24GB RAM available to it, which apparently speeds it up by 3x compared to the original CPU implementation.

            At half the size of xz's output, there can be applications for this, but you need to:

            - not care about compression time

            - not be constrained on hardware requirements (7.6GB RAM, ideally let it run on a GPU)

            - not care about decompression time either

            - and the data must be text (I can't find benchmarks other than from English Wikipedia text, but various sources emphasize it's a text compressor so presumably this is no good on e.g. a spacecraft needing to transmit sensor/research data over a weak connection, even if the power budget trade-off of running a GPU instead of pumping power into the antenna were the optimal thing to do)

  • usefulcat 3 days ago

    Bzip is slower than zstd and doesn’t compress as well as xz. There’s no place for it.

    • jgalt212 2 days ago

      10 years ago we looked at replacing gzip archives with bzip. It was just too slow for the incremental space savings offered. Creating bzip archives took forever. I know xz is "better", but we still just gzip everywhere.

      • Sesse__ 2 days ago

        xz is generally slower-but-denser; it's not meant to be a full gzip replacement. Zstd, on the other hand, aims to be a “better gzip”, in that it compresses about as fast as gzip but packs somewhat denser (and decompresses faster).

    • darkwater 2 days ago

      You can pry bzip from my dead cold hands.

    • int_19h 3 days ago

      Article has the numbers for their input:

      uncompressed: 327005

      (gzip) zopfli --i100: 75882

      zstd -22 --long --ultra: 69018

      xz -9: 67940

      brotli -Z: 67859

      lzip -9: 67651

      bzip2 -9: 63727

      bzip3: 61067

      • lucb1e 3 days ago

        This matches my experience compressing structured text btw. Bzip2 will beat every other tool out there, both on compression ratio and, sadly, decompression time

        OP says decompression time is so high because it has similar properties to a memory-hard password hash: it's bandwidth-bound due to the random access requirement. Even xz decompresses 2.5x faster, and I don't find it particularly fast

        This is why I switched away, also for text compression; searching for anything that isn't near the beginning of a large file is tedious. My use-case for compression is generally not like OP's, that is, compressing 100KB so that it can fit into Minecraft (if I understood their purpose correctly); I compress files because they take too much disk space (gigabytes). But if I never wanted to access them, I'd not store them, so decompression speed matters. So I kinda do agree with GP that Bzip2 has limited purposes when Zstd is just a few % more bytes to store for over an order of magnitude more speed (1GB/s instead of 45MB/s)

        Edit: And all that ignores non-json/xml/code/text compression tasks, where Bzip2/LZMA doesn't give you the best compression ratio. I'd argue it is premature optimization to use Bzip2 without a very specific use-case like OP has for very good code compression ratios and a simple decoder

        • necovek 2 days ago

          I wonder what the combined speed would be for small to mid-sized text files if they were fully loaded into memory first? That swaps multiple random accesses for a single sequential read (even if "sequential" is not quite real with flash memory, it should still perform better than totally random access), plus random access in memory, which should not be a bottleneck at these speeds.

          Or perhaps this is already being done this way and it is a bottleneck?

          This might not work for multi-gigabyte files, but most of text content is well under 1MiB.

          • lucb1e 2 days ago

            This is essentially already being done. The kernel will keep in RAM as much of the file that the system is operating on as will fit

            Code can care about cache locality and improve upon this default behavior, like if you find a heuristic where you can do another part of the decompression already while some of the data is still in a CPU cache and you don't have to hit main RAM, but whether that's possible will depend on the algorithm

            Being in RAM sadly doesn't mean that random access doesn't matter anymore. Just like how hard drives (compare to, say, tape drives) are random access, you still have swapping times to load the data into the CPU (or arm-moving times in the case of an HDD)

      • ac29 3 days ago

        I tested bzip3 vs xz compressing enwik8: bzip3 was 7x faster and had ~20% better compression

        • adrian_b 2 days ago

          I have also tested bzip3 and initially I encountered a great number of cases where bzip3 was much better than zstd from all points of view, i.e. compression ratio and execution times, for any zstd options.

          Unfortunately, later I have also found other cases where the bzip3 performance was not so good.

          I spend a lot of time on compression/decompression, but sadly there is no algorithm that can be said to be superior to all others in all circumstances, so for optimum results you must still be prepared to use any of them.

          Even the ancient gzip is preferable in certain cases, when speed is more important than compression ratio, because sometimes it happens to provide a better compromise than newer algorithms.

thesz 3 days ago

BWT is a prediction by partial match (PPM) in disguise.

Consider "bananarama":

  "abananaram"
  "amabananar"
  "ananaramab"
  "anaramaban"
  "aramabanan"
  "bananarama"
  "mabananara"
  "nanaramaba"
  "naramabana"
  "ramabanana"
The last symbols on each line get context from first symbols of the same line. It is so due to rotation.

But, due to sorting, contexts are not contiguous for the (last) character predicted, and long dependencies are broken. This is why MTF, which implicitly transforms direct symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.

[1] https://en.wikipedia.org/wiki/Zipf%27s_law
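The rotation table above, plus MTF, fits in a few lines of Python (a toy sketch: the alphabet is hard-coded for this example, and a real codec also records the row index, or uses a sentinel, so the transform can be inverted):

```python
def bwt(s):
    # Plain rotation-sort BWT as in the table above: sort all rotations
    # of s and read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(s, alphabet="abmnr"):
    # Move-to-front over a fixed toy alphabet: recently seen symbols get
    # small indices, so the clustered BWT output becomes mostly zeros.
    stack, out = list(alphabet), []
    for c in s:
        i = stack.index(c)
        out.append(i)
        stack.insert(0, stack.pop(i))
    return out

print(bwt("bananarama"))        # mrbnnaaaaa
print(mtf(bwt("bananarama")))
```

The BWT output "mrbnnaaaaa" already shows the run clustering, and the MTF pass turns it into a stream dominated by zeros, which is exactly what the final entropy coder exploits.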

Given that, the author may find PPM*-based compressors to perform better compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: some "durilka-bububu" compressor that uses PPM fares better than BWT, almost by a third.

fl0ki 3 days ago

This seems as good a thread as any to mention that the gzhttp package in klauspost/compress for Go now supports zstd on both server handlers and client transports. Strangely this was added in a patch version instead of a minor version despite both expanding the API surface and changing default behavior.

https://github.com/klauspost/compress/releases/tag/v1.18.4

  • klauspost 3 days ago

    About the versioning, glad you spotted it anyway. There isn't as much use of the gzhttp package compared to the other ones, so the bar is a bit higher for that one.

    Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.

hexxagone 3 days ago

Notice that bzip3 has close to nothing to do with bzip2. It is a different BWT implementation with a different entropy codec, from a different author (as noted in the GitHub description "better and stronger spiritual successor to BZip2").

  • lucb1e 3 days ago

    Was the name used with permission? Even if not trademarked (because open source freedom woohoo), it's a bit weird to release, say, Windows 12 without permission from the authors of Windows 11

    I tried looking it up myself but it doesn't say in the readme or doc/ folder, there is no mention of any of the Bzip2 authors, and there is no website listed so I presume this Github page is canonical

    • tosti 2 days ago

      Free software doesn't have a lot of trademarks, with the notable exception of Linux.

      Also, the name is the algorithm. Bzip2 has versions and bzip3 is something else which has its own updated versions. Programs that implement a single algorithm often follow this pattern.

      • lucb1e 2 days ago

        > Free software doesn't have a lot of trademarks

        Hence my saying "not trademarked (because open source freedom woohoo)". I understand bzip3 is legal, but my question is whether it's a dick move

        > the name is the algorithm. Bzip2 has versions and bzip3 is something else

        I don't understand this. If bzip3 is "something else" compared to bzip2 then how can both be named after the algorithms they use? Nothing in bzip2's algorithm is related to the digit 2. If it used, say, powers of 2 centrally and bzip3 used pi for something, then the naming could make sense since it's indeed not a version number, but I don't remotely see this argument in these algorithms

        Looking on Wikipedia and bzip2's website, there is no explanation of the name, and yesterday I read somewhere (either in the OP or in another comment) that it would stand for "better zip". It has nothing to do with PKZIP though. If they had both implemented pkzip's format, then a "spiritual successor" to bzip2 would be something like uzip, uzip2 (replacing "better" with "ultimate"), bbzip2, or any of a million other options, but that's not the pattern they chose to follow

        How is this not just taking their established name for your own project with the version number +1?

eichin 3 days ago

Interesting detail on the algorithm, but it seems to completely miss that if you care about non-streaming performance, there are parallel versions of xz and gzip (pxzip encodes compatible metadata about the break-up points, so that xz can still decompress the output while pxzip itself can use as many cores as you let it have). Great for disk-image OS installers (the reason I was benchmarking it in the first place - but this was about 5 years back, I don't know if those have gotten upstreamed...)

joecool1029 3 days ago

Just use zstd unless you absolutely need to save a tiny bit more space. bzip2 and xz are extremely slow to compress.

  • lucb1e 3 days ago

    > bzip2 and xz are extremely slow to compress

    This depends on the setting. At setting -19 (not even using --long or other tuning), Zstd is 10x slower to compress than bzip2, and 20x slower than xz, and it still gets a worse compression ratio for anything that vaguely looks like text!

    But I agree if you look at the decompression side of things. Bzip2 and xz are just no competition for zstd or the gzip family (but then gzip and friends have worse ratios again, so we're left with zstd). Overall I agree with your point ("just use zstd") but not for the fast compression speed, if you care somewhat about ratios at least

  • hexxagone 3 days ago

    In the LZ high compression regime where LZ can compete in terms of ratio, BWT compressors are faster to compress and slower to decompress than LZ codecs. BWT compressors are also more amenable to parallelization (check bsc and kanzi for modern implementations besides bzip3).

  • silisili 3 days ago

    I'd argue it's more workload dependent, and everything is a tradeoff.

    In my own testing of compressing internal generic json blobs, I found brotli a clear winner when comparing space and time.

    If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

    zstd is good for many use cases, too, perhaps even most...but I think just telling everyone to always use it isn't necessarily the best advice.

    • joecool1029 3 days ago

      > If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

      It’s slower and compresses less than zstd. gzip should only be reached for as a compatibility option, that’s the only place it wins, it’s everywhere.

      EDIT: If you must use it, use the modern implementation, https://www.zlib.net/pigz/

      • adrian_b 2 days ago

        Any claims about compressing programs are extremely data-dependent so any general claims will be false for certain test cases.

        I do a lot of data compression and decompression, and I would have liked a lot to find a magic algorithm that works better than all others, to simplify my work.

        After extensive tests I have found no such algorithm. Depending on the input files and depending on the compromise between compression ratio and execution times that is desired, I must use various algorithms, including zstd and xz, but also bzip2, bzip3 and even gzip.

        I use quite frequently gzip (executed after lrzip preprocessing) for some very large files, where it provides better compression at a given execution time, or faster execution at a given compression ratio, than zstd with any options.

        Of course for other kinds of files, zstd wins, but all the claims that zstd should be used for ALL applications are extremely wrong.

        Whenever you must compress or decompress frequently, you must test all available algorithms with various parameters, to determine what works best for you. For big files, something like lrzip preprocessing must also be tested, as it can change a lot the performance of a compression algorithm.

  • NooneAtAll3 3 days ago

    why would one even care about compression speed on minecraft ComputerCraft machine?

    size and decompression are the main limitations

pella 3 days ago

imho: the future is a specialized compressor optimized for your specific format. ( https://openzl.org/ , ... )
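As a toy illustration of the format-aware idea (not OpenZL itself): delta-encoding a sorted column of hypothetical timestamps before handing it to a generic compressor, a classic format-specific pre-pass:

```python
import random
import struct
import zlib

# A made-up column of 10,000 sorted 64-bit "timestamps".
random.seed(1)
timestamps, t = [], 0
for _ in range(10_000):
    t += random.randrange(1, 100)
    timestamps.append(t)

# Raw layout: each value stored as a little-endian 64-bit integer.
raw = struct.pack("<10000q", *timestamps)

# Format-aware layout: first value, then successive differences, which
# are tiny and highly repetitive.
deltas = struct.pack("<10000q", timestamps[0],
                     *(b - a for a, b in zip(timestamps, timestamps[1:])))

print(len(zlib.compress(raw, 9)), len(zlib.compress(deltas, 9)))
```

The generic compressor is unchanged; only the knowledge that the column is sorted gets baked into the encoding, and the delta form compresses markedly better.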

  • lucb1e 3 days ago

    It's good, but is it "the future" when it's extra work?

    Consider that you could hand-code an algorithm to recognize cats in images but we would rather let the machine just figure it out for itself. We're kind of averse to manual work and complexity where we can brute force or heuristic our way out of the problem. For the 80% of situations where piping it into zstd gets you to stay within budget (bandwidth, storage, cpu time, whatever your constraint is), it's not really worth doing about 5000% more effort to squeeze out thrice the speed and a third less size

    It really is considerably better, but I wonder how many people will do it, which means less implicit marketing by seeing it everywhere like we do the other tools, which means even fewer people will know to do it, etc.

  • cgag 3 days ago

    This seems very cool. Was going to suggest submitting it, but I see there was a fairly popular thread 5 months ago for anyone interested: https://news.ycombinator.com/item?id=45492803

  • srean 3 days ago

    That is an interesting link.

    Does gmail use a special codec for storing emails ?

    • duskwuff 3 days ago

      The biggest savings for a service like GMail are going to be based around deduplication - e.g. if you can recognize that a newsletter went out to a thousand subscribers and store those all as deltas from a "canonical" copy - congratulations, that's >1000:1 compression, better than you could achieve with any general-purpose compression. Similarly, if you can recognize that an email is an Amazon shipping confirmation or a Facebook message notification or some other commonly repeated "form letter", you can achieve huge savings by factoring out all the common elements in them, like images or stylesheets.
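      One way to sketch the "deltas from a canonical copy" idea with a real API is zlib's preset-dictionary support (the newsletter text here is made up):

```python
import zlib

# A hypothetical canonical newsletter; each personalized copy differs
# only in the greeting placeholder.
canonical = b"".join(
    b"Story %d: compression tool number %d was benchmarked today.\n" % (i, i * 7)
    for i in range(30))
canonical = b"Dear {name},\n" + canonical + b"Unsubscribe at example.com\n"
personalized = canonical.replace(b"{name}", b"Alice")

# Plain deflate of the personalized copy.
plain = zlib.compress(personalized, 9)

# Deflate against the canonical copy as a preset dictionary: the body
# becomes a few long back-references into the dictionary.
co = zlib.compressobj(9, zdict=canonical)
delta = co.compress(personalized) + co.flush()

# Decompression needs the same dictionary.
do = zlib.decompressobj(zdict=canonical)
restored = do.decompress(delta) + do.flush()
print(len(personalized), len(plain), len(delta))
```

Storing one canonical copy plus per-recipient deltas of a few dozen bytes is the same shape as the newsletter deduplication described above, though a production system would use a purpose-built delta format rather than deflate dictionaries.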

      • dataflow 3 days ago

        I kind of doubt they would do this to be honest. Every near-copy of a message is going to have small differences in at least the envelope (not sure if encoding differences are also possible depending on the server), and possibly going to be under different guarantees or jurisdictions. And it would just take one mistake to screw things up and leak data from one person to another. All for saving a few gigabytes over an account's lifetime. Doesn't really seem worth it, does it?

        • srean 2 days ago

          That's why a base and a delta. Whereas PP was talking about general compression algorithm, my question was different.

          In line with the original comment, I was asking about specialized "codecs" for gmail.

          Humans do not read the same email many times. That makes it a good target for compression. I believe machines do read the same email many times, but that could be architected around.

      • srean 2 days ago

        Yes.

        These and other email specific redundancies ought to be covered by any specialized compression scheme. Also note, a lot of standard compression is deduplication. Fundamentally they are not that different.

        Given that one needs to support deletes, this will end up looking like a garbage collected deduplication file system.

jiggawatts 2 days ago

I recently did some work on genomics / bioinformatics, where terabyte-sized text datasets are common and often compressed and decompressed multiple times in some workloads. This often becomes the serial bottleneck we all know and love from Amdahl's law.

I ran a bunch of benchmarks, and found that the only thing that mattered was if a particular tool or format supported parallel compression and/or parallel decompression. Nothing else was even close as a relevant factor.

If you're developing software for processing even potentially large files and you're using a format that is inherently serial, you've made a mistake. You're wasting 99.5% of a modern server's capacity, and soon that'll be 99.9%.

It really, really doesn't matter if one format is 5% faster or 5% bigger or whatever if you're throwing away a factor of 200 to 1,000 speedup that could be achieved through parallelism! Or conversely, the ability to throw up to 1,000x the compute at improving the compression ratio in the available wall clock time.

Checksumming, compression, decompression, encryption, and decryption over bulk data must all switch to fully parallel codecs, now. Not next year, next decade, or the year after we have 1,000+ core virtual machines in the cloud available on a whim. Oh wait, we do already: https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

Not to mention: 200-400 Gbps NICs are standard now in all HPC VMs and starting to become commonplace in ordinary "database optimised" cloud virtual machines. Similarly, local and remote SSD-backed volumes that can read and write at speeds north of 10 GB/s.

There are very few (any?) compression algorithms that can keep up with a single TCP/IP stream on a data centre NIC or single file read from a modern SSD unless using parallelism. Similarly, most CPUs struggle to perform even just SHA or AES at those speeds on a single core.
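A pigz-style sketch of the parallel approach in Python (chunked compression on threads; zlib releases the GIL during compression, and concatenated gzip members remain one valid gzip stream, at a small ratio cost since chunks can't reference each other):

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Made-up stand-in for a large line-oriented genomics file.
data = b"genomics reads, tab separated, highly repetitive\n" * 200_000

def parallel_gzip(payload, chunk_size=1 << 20, workers=8):
    # Split into independent chunks and gzip each one concurrently;
    # joining the members yields a stream any stock gunzip can read
    # serially.
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)]
    with ThreadPoolExecutor(workers) as pool:
        return b"".join(pool.map(gzip.compress, chunks))

packed = parallel_gzip(data)
print(len(data), len(packed))
```

Tools like pigz and the parallel zstd/xz modes do the same thing in C with a shared window carried between chunks; the key property is that decompression compatibility is preserved.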

cobbzilla 3 days ago

My first exposure to bzip: The first Linux kernels I ever compiled & built myself (iirc ~v2.0.x), I packed as .tar.bz2 images. Ah the memories.

Yes, there are better compression options today.

Grom_PE 3 days ago

PPMd (of 7-Zip) would beat BZip2 for compressing plain text data.

vintermann 3 days ago

If you're implementing it for Computercraft anyway, there's no reason to stick to the standard. It's well known that bzip2 has a couple of extra steps which don't improve compression ratio at all.

I suggest implementing Scott's Bijective Burrows-Wheeler variant on bits rather than bytes, and do bijective run-length encoding of the resulting string. It's not exactly on the "pareto frontier", but it's fun!

roytam87 2 days ago

I wonder why people think there is only 1 bzip (which seems to be bzip2), when the original bzip exists: https://unix.stackexchange.com/a/125803 and also bzip3: https://github.com/iczelia/bzip3
