Bzip3 – A better and stronger spiritual successor to bzip2

github.com

158 points by palaiologos 4 years ago · 106 comments

hannob 4 years ago

It seems somewhat suspicious that the benchmarks don't compare to zstd.

It's not entirely clear to me what the selling point is. "Better than bzip2" isn't exactly a convincing sales pitch given bzip2 is mostly of historic interest these days.

Right now the modern compression field is basically covered by xz (if you mostly care about best compression ratio) and zstd (if you want decent compression and very good speed), so when someone wants to pitch a new compression they should tell me where it stands compared to those.

  • nousermane 4 years ago

    > benchmarks don't compare to zstd.

      wget http://corpus.canterbury.ac.nz/resources/calgary.tar.gz
      zcat calgary.tar.gz|time zstd -19|wc -c
    
      902963 (=902.9KB, vs. 807.9KB for bzip3)
    
    > "Better than bzip2" isn't exactly a convincing sales pitch

    Sure, but nobody is pitching that. TFA does compare to lzma (~xz), and claims bzip3 outperforms it quite handsomely in speed, while being competitive in compression ratio.

  • grumpyprole 4 years ago

    There's more to a compression standard than benchmarks and ratios. Xz does not seem to score well in other, perhaps more important areas: https://www.nongnu.org/lzip/xz_inadequate.html

  • sounds 4 years ago

    Can you test it out and post results back here?

    • linsomniac 4 years ago

      For linux-5.17.6.tar:

      Original file: 129MB xz, 1.2G uncompressed.

      "zstd -T0": 1.34 seconds, 189M

      "xz -T0": 63 seconds, 131M

      "xz -T0 -9": 183 seconds, 125M

      "bzip3 -e -j 6": 21 seconds, 129M (edited, was SIGSEGV)

      "bzip3 -e": 84 seconds, 129M

      I used the Linux source because the project's website uses it and recommends bzip3 for compressing source and text. Results were on Ubuntu 22.04, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz

      • linsomniac 4 years ago

        One more note:

        "bzip3 -e -j 6 -b 50": 25 seconds, 125MB

        So nearly as good as the best xz result, but in a twentieth of the time.

        However: Do note that any unexpected use is met with a SIGSEGV: using as a filter, using "-j6" instead of "-j 6", not specifying "-e"...

        • palaiologosOP 4 years ago

          lies. not specifying -e displays an error message:

            % bzip3 -e -j 6 -b 50 corpus/calgary.tar
            % bzip3 -j 6 -b 50 corpus/calgary.tar
            bzip3 - A better and stronger spiritual successor to bzip2.
            Copyright (C) by Kamila Szewczyk, 2022. Licensed under the terms of GPLv3.
            Usage: bzip3 [-e/-d/-t/-c] [-b block_size] input output
            Operations:
              -e: encode
              -d: decode
              -t: test
            Extra flags:
              -c: force reading/writing from standard streams
              -b N: set block size in MiB
              -j N: set the amount of parallel threads
          
          you can use bzip3 as a filter:

            % cat corpus/calgary.tar | bzip3 -b 10 -e -c | wc -c
            807959
          
          and using "-j6" is simply being unable to read the help page.
          • eesmith 4 years ago

            The code has, at https://github.com/kspalaiologos/bzip3/blob/bf2f0e02fd59f4c4... :

                        } else if (argv[i][1] == 'j') {
                            workers = atoi(argv[i + 1]);
                            i++;
            
            If the last argument is "-j6" then this will read past the end of the allocated argv strings and try to do atoi(NULL):

              % ./bzip3 -j3 < README.md
              Segmentation fault
            
            "-j6" is standard getopt() behavior, and the default expected behavior from Unix/POSIX systems.
            • ttybird2 4 years ago

              They are not disputing that - they actually acknowledged it. Instead they are disputing that omitting -e will lead to a crash.

          • linsomniac 4 years ago

            You're right, what I thought was encoding/filter SEGVs were the bug in handling "-j6".

      • JoshTriplett 4 years ago

        If you ramp up the compression level on zstd, does it get smaller than bzip3 before it gets to the point of taking more time?

        • nousermane 4 years ago

          No, not even if you disregard time altogether. Compression of Calgary Corpus on a random old laptop:

                       sec   KB
            gzip      0.19  1070
            zstd      0.02  1063
            zstd -19  1.47   897
            bzip2     0.35   891
            brotli    8.12   862
            xz        1.35   853
            bzip3     0.52   808
      • palaiologosOP 4 years ago

        it's `bzip3 -e -j 6`. you need a space.

      • adgjlsfhk1 4 years ago

        Can you also give decompression speeds?

        • linsomniac 4 years ago

          "zstd -d": 1.03 seconds

          "xz -d": 7.92 seconds

          "bzip3 -d" as filter: SEGSEGV

          "bzip3 -d linux-5.17.6.tar.bz3": 81 seconds

          "bzip -d -j 6": 21.35 seconds (edit)

          Lest I be called a liar again:

            $ ./bzip3-1.0.1/bzip3 -d <linux-5.17.6.tar.bz3 >/dev/null
            fish: Job 1, 'time ./bzip3-1.0.1/bzip3 -d <li…' terminated by signal SIGSEGV (Address boundary error)
          • palaiologosOP 4 years ago

            Consider using `-c`, which makes the compressor use standard streams, or pull the main branch because I had just pushed a tiny patch that automatically enables it when no positional arguments are given.

    • Matumio 4 years ago

      No, but it would be nice to see visually where it is on the size vs (de-)compression speed pareto-front. Like this graphic (from the zstd homepage): https://raw.githubusercontent.com/facebook/zstd/master/doc/i...

      • codewiz 4 years ago

        This chart is from 2016. Both zstd and brotli are under active development, so I'd like to see a more recent comparison presented in this same format.

klauspost 4 years ago

Looks interesting, but my main objections to general adoption are the same as for bzip2, lzma and context-modelling-based codecs - decompression speed.

Take compressed logs, for instance: a decompression speed of 23MB/s per core is simply too slow when you need to grep through gigabytes of data. Same for data analysis; you don't want your input speed to be this limited when analysing gigabytes of data.

I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manners to take over a name like this.

  • bayindirh 4 years ago

    > I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manners to take over a name like this.

    I think it boils down to the feelings of the author (of the previous format).

    I don't think PKWARE feels bad because ZSTD is a homage to ZIP. Similarly, if someone created a follow-up file format to something I've designed, I'd just want credit or a link to my version as a homage and a pointer for historical continuity, nothing else.

    Open source software is designed to be mangled, modified, shared and leapfrogged. If a completely different implementation advertises itself as a newer iteration of a format because it's built on the same theory, I think it's ethical as long as the developer is not intending to capitalize on the name. In either case, if the original developer returns to the game, they can create a BZIP4 and point to the divergence as, "hey, somebody liked BZIP2 too much and created this, give them kudos", and continue.

    • altairprime 4 years ago

      Pkware might not have been so forgiving if someone had released ZIP2. Incrementing the version number like that is only an acceptable thing in relatively unusual circumstances, but does happen sometimes; and still, I would really hesitate to say that it’s a good idea for a third party to call itself bzip3.

      The original author replaced their own bzip release with bzip2 to avoid a patent issue with arithmetic coding, so this is the first time a third party has done so: https://web.archive.org/web/19980704181204/http://www.muraro...

      So if the release of bzip3 is approved by the current maintainer, then I guess it’s fine, but otherwise it makes me uncomfortable to consider using under this name.

    • eru 4 years ago

      > Open source software is designed to be mangled, modified, shared and leapfrogged.

      I agree in spirit, but I can also see why someone might want their source to be free and mangleable, but still care about trademarks.

      (Just imagine Linus Torvalds getting lots of emails with support requests for a hypothetical Linux2 operating system that I wrote, and that he has no relation with. That could become pretty annoying; even if he doesn't mind me taking his source code.)

      • bayindirh 4 years ago

        A trademark changes the situation in every aspect. If your code is getting big, famous, and needs its name and likeness (e.g. Firefox), you get a trademark, and your derivatives shall not use the name. It's simple and clear.

        When I quickly looked, bzip2 didn't have a trademark, and I assumed the developer doesn't care.

        It's the same for me. If I cared, I'd trademark it and prevent people from using it. If I'm giving the name away, I'd not trademark it. It's plain and simple.

        • eru 4 years ago

          So, a trademark is a 'legal' concept, and comes with certain rights and obligations. Eg you need to actively defend your trademark in order to keep it.

          I was perhaps a bit fuzzy: I used the word trademark, but I meant the moral equivalent that only confers moral rights and obligations. Ie don't be a jerk and name your stuff in a confusing way, even if there's no legal obligation.

          Even this fuzzier 'moral' trademark only applies, if the original owner wants it to apply. I have no special insight into the bzip2 situation. So I have no clue whether the authors of bzip or bzip2 cared. I gladly defer to your research on this question.

          My point was meant purely abstractly: about the possibility of wanting your source to be free and open and fondled by others, while still having a (morally) protected name.

    • adrianmonk 4 years ago

      > I don't think PKWARE feels bad because ZSTD is a homage to ZIP.

      Is zstd actually an homage to zip?

      I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.

      The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format. As do two very old, obsolete Unix compression programs: "pack"[3], which uses a ".z" suffix for compressed files, and "compress"[4], which uses a ".Z" suffix for compressed files.

      In those algorithm names, the L and Z stand for Lempel and Ziv, respectively. But interestingly, the Unix "pack" program uses a ".z" suffix even though its algorithm is just Huffman (not one of the Lempel-Ziv family of algorithms), so the letter Z somehow came to signify data compression more generally.

      Rough timeline of letter Z in data compression:

      1977: LZ77

      1978: LZ78

      1982 or earlier: Unix "pack" (.z)

      1984: LZW

      1985: Unix "compress" (.Z)

      1989: PKZIP

      ---

      [1] https://en.wikipedia.org/wiki/LZ77_and_LZ78

      [2] https://en.wikipedia.org/wiki/Lempel-Ziv-Welch

      [3] https://en.wikipedia.org/wiki/Pack_(compression)

      [4] https://en.wikipedia.org/wiki/Compress

      • bayindirh 4 years ago

        > Is zstd actually an homage to zip?

        I think so. Or, more generally, to the DEFLATE family of algorithms used by zlib and ZIP.

        > I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.

        > The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format.

        I'm well aware of Ziv, Lempel & Welch's work, since I've developed a compression algorithm [*] and did extensive research on the family before; however, the use of those extensions is new information to me.

        On the other hand, the official repository of zstd [0] lists zlib and other libraries which use DEFLATE in one way or another, pointing back to ZIP at the same time.

        At the end of the day, it might not be tipping its hat to "ZIP" per se, but to the general direction of DEFLATE, which is used by the ZIP format too.

        [*]: The algorithm I developed was working on syllables rather than bytes. It had a deterministic and fast hyphenation engine for the language, and used an embedded dictionary to minimize bit-flip damage during transit. I have a paper published on it, and I'm still planning to re-implement and open it, since the PoC was beyond bad from a code quality perspective.

        • adrianmonk 4 years ago

          Right, there's definitely a connection in the algorithms. They're all derivatives of the same family of stuff.

          I could've been clearer, but when I said the only connection I saw between them was the Z, I meant the only connection between their names. In other words, while zstd surely has zip among its main influences, I'm not sure that the Z in "zstd" is there to convey "zip-like". It could be, or maybe it's just there because Z is generally associated with data compression.

  • perihelions 4 years ago

    That's funny, 23 MiB/s is exactly what I get for reading systemd logs (on an NVME SSD). Is it supposed to be otherwise?

        $ sudo journalctl -r | pv -a > /dev/null
        [22.8MiB/s]
    • yakubin 4 years ago

      That appears to be systemd being slow.

        $ dd if=/dev/urandom of=test bs=1G count=1 iflag=fullblock
        $ gzip -k test
        $ zcat test.gz | pv -a >/dev/null
        [ 228MiB/s]
      
        $ sudo journalctl -r | pv -a >/dev/null
        [13.1MiB/s]
      
      UPDATE: Gzip with more real-world data[1]:

        $ gzip -k adventures-of-huckleberry-finn.txt
        $ zcat adventures-of-huckleberry-finn.txt.gz | pv -a >/dev/null
        [ 151MiB/s]
      
      [1]: <https://gutenberg.org/files/76/76-0.txt>
      • klauspost 4 years ago

        By "compressing" random data you are bypassing gzip, since it will just store your data as uncompressed blocks, making "decompression" a memcopy.

        With real data, deflate maxes out somewhere around there either way, but that is a bit coincidental.

        With modern CPUs getting increasingly smaller IPC improvements this will likely be pretty much the max decompression speed we can expect from gzip going forward.

        • yakubin 4 years ago

          It actually made the file bigger (1.1G from 1.0G) :)

          I was getting the same numbers with text data I have scattered on my disk, but those files were small, so I decided to generate a bigger file. But yes, I agree a more robust benchmark would use e.g. a Mark Twain novel.

    • erk__ 4 years ago

      I think the lack of speed here is more that it has to serialize the data from disk into a readable format. I assume using the `--grep=` option is faster than piping the output through grep because of this.
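
      If journald was built with PCRE2 support, that looks something like this (the pattern is just an example), instead of piping everything through grep:

        sudo journalctl --grep='oom-killer' --since=yesterday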

      • perihelions 4 years ago

            --grep=
        
        That's exactly what I needed to know! I'm glad I asked the stupid question. Thank you!
  • vintermann 4 years ago

    One of the strengths of BWT-based compression is that there has been a lot of research on search operations directly on compressed data, though with the other parts of this codec I doubt that's possible for files compressed with it.

  • usefulcat 4 years ago

    If you’re using xz, pixz can do multithreaded decompression. It’s still xz/lzma, so still expensive to decompress, but at least that allows you to throw as many cores as you want at it.
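
    Roughly like this, if I remember the pixz invocation right (treat it as a sketch and check the help):

      pixz -d < archive.tar.xz > archive.tar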

lynguist 4 years ago

If anyone just cares for speed instead of compression I’d recommend lz4 [1]. I only recently started using it. Its speed is almost comparable to memcpy.

[1] https://github.com/lz4/lz4

  • klodolph 4 years ago

    Zstandard achieves similar speeds at higher ratios. LZ4 only comes out ahead if you use LZ4 at, like, level 1.

    • klauspost 4 years ago

      Yes, zstd has forced bitwise match coding, whereas lz4 is byte-aligned with inline literals.

      So lz4 has some base advantages in terms of speed, which zstd is unlikely to match. But as you point out it is only relevant for very high speed operations.

    • Beltalowda 4 years ago

      It's still quite a bit faster than zstd -1, at least according to their GitHub page. The trade-off is it's a worse compression ratio (2.8 vs. 2.1), but in some cases that's a good trade-off.

    • adgjlsfhk1 4 years ago

      zstd has similar compression speeds, but lz4 crushes everything else for decompression.

    • loeg 4 years ago

      My impression was that lz4 ratios were still marginally better than zstd for the same compression speed, and decompression is much, much faster.

      • klodolph 4 years ago

        We must be looking at different graphs. See p.7

        https://indico.fnal.gov/event/16264/contributions/36466/atta...

        You can see the classic Pareto frontier, with LZ4 filling the niche at the very bottom right edge of the graph.

        • loeg 4 years ago

          I'm looking at http://facebook.github.io/zstd/ "Benchmarks."

          P.7 of your slides doesn't seem to cover Zstd --fast or decompression. It would be interesting to see how Zstd performs in those modes.

          • rincebrain 4 years ago

            Depending on your data, experimentally, some people find zstd --fast can beat LZ4 for them on compression, some people find the opposite; my usual advice to people considering one or the other is to experiment and find out.

            (An interesting anecdote about the differing notions of compressibility - when I recently wrote something to do a clever dance to avoid burning a great deal of CPU on incompressible data with higher zstd levels, I ended up using LZ4 -> zstd-1 as a two-tier filter to catch incompressible data, because what they each thought was incompressible was different enough that only using LZ4 lost a significant amount of compression sometimes, but only using zstd-1 was comparatively expensive and also lost a significant fraction.)
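
            Very roughly, the idea looks like this as a shell sketch (purely illustrative; the real thing does this per block in memory, not with files and pipes):

              # cheap probes first: only pay for the expensive zstd level if either
              # LZ4 or zstd-1 manages to shrink the chunk at all
              orig=$(wc -c < chunk)
              if [ "$(lz4 -c chunk | wc -c)" -lt "$orig" ] || \
                 [ "$(zstd -1 -c chunk | wc -c)" -lt "$orig" ]; then
                zstd -19 -c chunk > chunk.zst
              else
                cp chunk chunk.raw   # both probes say incompressible: store as-is
              fi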

        • adgjlsfhk1 4 years ago

          note that for decompression speeds, they quote "We know LZ4 is significantly faster than ZSTD on standalone benchmarks: likely bottleneck is ROOT IO API"

pcwalton 4 years ago

The Burrows-Wheeler transform, which was the main innovation of bzip2 over gzip, and which this bzip3 retains, is one of the most fascinating algorithms to study: https://en.wikipedia.org/wiki/Burrows-Wheeler_transform

It hasn't been used lately because of the computational overhead, but it's interesting and I'm glad that there's still work in this area. For anyone interested in algorithms it's a great one to wrap your head around.
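
As a quick illustration (the standard textbook example, with $ as an end-of-string sentinel that sorts before every letter): take all rotations of "banana$", sort them, and keep the last column.

  $banana -> a
  a$banan -> n
  ana$ban -> n
  anana$b -> b
  banana$ -> $
  na$bana -> a
  nana$ba -> a

The output "annb$aa" clusters equal characters into runs, which is what makes the later move-to-front and entropy-coding stages so effective, and the sentinel is enough to invert the transform exactly.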

Klasiaster 4 years ago

Here are some other BWT compressors in the large text compression benchmark (look for "BWT" in the "Alg" column): http://mattmahoney.net/dc/text.html

And here is a BWT library with benchmarks: https://github.com/IlyaGrebnov/libsais#benchmarks

denzquix 4 years ago

From their own benchmarks it seems more like bzip3 is geared towards a different compression/speed trade-off than bzip2, rather than an unambiguous all-around improvement. Am I misreading it?

  • once_inc 4 years ago

    That's what I took away from it too. It sacrifices a bit of speed and a lot of memory for a smaller output size.

    edit: ah, bzip3 is parallelizable, while bzip2 isn't. That alone is enough for me to be able to claim 'faster'.

    • thilog 4 years ago

      bzip2 can exploit concurrency through pbzip2, can't it?

      • klauspost 4 years ago

        It is kind of a hack for decompression, where you look forward in the stream for a block signature, and try decompressing from there.

        In practice it works, but it isn't pretty ;)

      • gjvc 4 years ago

        see also: lbzip2

        (context: I have had a situation where files created by pbzip2 on linux were not able to be decompressed with some library on .NET, but using lbzip2, they were. I never looked into the details.)

        • Klasiaster 4 years ago

          Couldn't agree more; lbzip2 is the go-to tool for dealing with bzip2 compression and decompression. It's a whole lot faster than bzip2, which is single-threaded!

joelthelion 4 years ago

In the Era of zstandard, do we really need this?

  • wongarsu 4 years ago

    I find it somewhat telling that they don't benchmark themselves against zstd.

    Right now I'm almost exclusively using zstd (general stuff) or lzma2/xz (high compression where read speed doesn't matter). And of course gz and zip for data interchange where compatibility is key. From the information presented bzip3 won't replace any of those use cases for me, but that's fine. Maybe it fits somebody else's use case, or maybe it's the foundation for the next great algorithm that we all end up using.

    • palaiologosOP 4 years ago

        % zstd -19 linux.tar
        462.58s user 0.76s system 100% cpu 217M memory 7:42.56 total

        % wc -c linux.tar.zst linux.bz3
        134980904 linux.tar.zst
        129255792 linux.bz3

      • 149764 4 years ago

          # compression
        
          bzip3 -j 4 -e linux-5.18-rc6.tar linux-5.18-rc6.tar.bz3 
            user: 345.48s system: 0.59s cpu: 373% total: 1:32.75
        
          zstd -19 --long -T4 -f linux-5.18-rc6.tar
            user: 1270.48s system: 0.89s cpu: 376% total: 5:37.9
        
          > du -b linux-5.18-rc6.tar.* | sort -rn | reln
          1.000000  130907738  linux-5.18-rc6.tar.zst
          0.994715  130215881  linux-5.18-rc6.tar.bz3
        
        With an additional ‘--ultra -22’ the tar.zst is smaller, but the compression time skyrockets.

          # decompression 
        
          bzip3 -j 4 -d linux-5.18-rc6.tar.bz3 linux-5.18-rc6.tar 
            user: 222.57s system: 0.92s cpu: 362% total: 1:01.69
        
          bzip3 -d linux-5.18-rc6.tar.bz3 linux-5.18-rc6.tar 
            user: 141.29s system: 0.89s cpu: 99% total: 2:22.19
        
          zstd -d -T4 -f linux-5.18-rc6.tar.zst 
            user: 2.26s system: 0.84s cpu: 99% total: 3.102
        
        zstd doesn’t seem to support parallel decoding, but it's still ~20x faster.
    • cout 4 years ago

      Have you ever tried lzma/lzma2 with the hc3 (hash chain) match finder instead of the default (bt3 or bt4) match finder? I've found this to be a really good middle ground between gz/deflate and lzma2 with default settings.
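
      With the xz CLI that tweak looks roughly like this (the preset number is just an example; see the --lzma2 option in xz's man page):

        xz -k --lzma2=preset=6,mf=hc3 file.tar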

  • proofrock 4 years ago

    Yes, because someone said the same thing when zstandard came out. This one may not have the same strong points, but maybe the next one will… compression is not a solved problem.

  • trasz 4 years ago

    Not to mention the restrictive license which effectively prohibits its use in any Open Source project licensed under anything other than GPLv3.

    • palaiologosOP 4 years ago

      Frankly, the same holds for gzip. I've been planning to relicense bzip3 under the more permissive LGPLv3.

      • chungy 4 years ago

        gzip has BSD-licensed compatible alternatives already. It's doubtful the same attention would be given to bzip3; chicken-and-egg scenario there. Plus the lingering question of "Why not zstd?"

      • trasz 4 years ago

        Gzip is just a frontend for zlib, which is BSD(ish).

  • baybal2 4 years ago

    1. zStandard is not a standard

    2. Bzip2 is somewhat of a standard

    3. zStandard is not a substitute for Bzip2

yakubin 4 years ago

From the "disclaimers" section:

> Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.

> However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.

That got me thinking: I've always implicitly assumed that authors of lossless compression algorithms write mathematical proofs that D o C = id[1]. However, now that I've started looking, I can't seem to find that even for Deflate. What is the norm?

[1]: C being the compression function, D being the decompression function, and o being function composition.
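
For what it's worth, a cheap empirical spot check (using the -e/-d/-c flags quoted elsewhere in this thread) is to verify D(C(x)) = x on your own data:

  bzip3 -e -c < input.bin > input.bz3
  bzip3 -d -c < input.bz3 | cmp - input.bin && echo "round trip OK"

Of course that only tests particular inputs, not the algorithm; it says nothing about the low-probability special cases the disclaimer is worried about.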

asicsp 4 years ago

Good work!

I was also confused with faster speed claims than bzip2, and then saw the discussion in the issue: https://github.com/kspalaiologos/bzip3/issues/2

  • roelschroeven 4 years ago

    That discussion doesn't really clear up my confusion though.

    I don't understand how bzip3 gets to claim "A better, faster and stronger spiritual successor to BZip2" when all of its own benchmarks show it's slower than bzip2.

    • palaiologosOP 4 years ago

      bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2. additionally, bzip3 supports parallel compression/decompression out of the box. for fairness, the benchmarks have been performed using single thread mode, but they aren't quite as fair towards bzip3 itself, as it uses a way bigger block size.

      what bzip3 aims to be is a replacement for bzip2 on modern hardware. what used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) became viable nowadays, as CPU Frequencies don't tend to change, while cache and RAM keep getting bigger and faster.

      it should be noted that using 16 times larger block sizes than bzip2 while providing compression ratios 10%-50% better, at a cost of (as empirically shown) 17 seconds per 1.3GB of data, is a pretty good trade-off; if bzip2 wanted to get anywhere close to that (e.g. using the C API to tweak the block size), it'd have to sacrifice a lot of its performance.

      • roelschroeven 4 years ago

        First, I don't understand what the block size has to do with the number of threads used. I understand why one could consider the benchmarks unfair to bzip3 because they're single-threaded (depending on exactly how one defines "faster"), but why do you say they are not fair towards bzip3 because of the bigger block size it uses? Do the benchmarks not use the optimal block size for each compression tool?

        I can see how bzip3 is better able to exploit the characteristics of modern hardware, but that's not enough to call it faster. The proof of the pudding is in the eating, and if the benchmarks show bzip3 is slower than bzip2, then bzip3 is clearly not faster (but perhaps "aiming to be faster", with some work to be done before it reaches that goal).

        Better compression at a slightly slower pace can indeed be a good trade off. I'm not saying bzip3 doesn't work well. What I'm saying is that the "faster" in its description "A better, faster and stronger spiritual successor to BZip2" is not supported by the evidence.

    • easytiger 4 years ago

      And what does stronger mean? It's not cryptography.

      Unless it is

      • verytrivial 4 years ago

        I read it as three quarters of a Daft Punk reference (only).

        There is no shortage of compression tools that are better in this or that respect. At the very least this is a fun engineering exercise, but there is always a massive inertia problem regarding install base with compression, esp. for data-at-rest. gzip is still used by default in so many contexts, and that's not such a bad thing IMHO.

      • palaiologosOP 4 years ago

        it's fairly common, at least in the circles i usually dwell in, to call compression ratio "compression _strength_". bzip3 is _better_ than bzip2 since it uses a better technological model as outlined in one of my replies.

williamkuszmaul 4 years ago

One of the things that's cool about Bzip is that it makes use of algorithmic techniques developed by theoretical computer scientists in order to perform the Burrows-Wheeler Transform efficiently. It's a great example of theory and practice working symbiotically.

forgotpwd16 4 years ago

>better, faster

If I'm reading the benchmarks correctly, it gets higher compression but is slower and has higher memory usage. So I can't call it better.

>spiritual successor to BZip2

What does that mean? If it isn't related to bzip2, why choose this name?

  • alerque 4 years ago

    It is related to bzip2 in the sense of using the Burrows-Wheeler algorithm.

fefe23 4 years ago

Hmm, I see LZ77, PPM and entropy coding in the description, and obviously Burrows-Wheeler.

Has anyone tried doing zstd at the end instead of LZ77 and entropy coding?

Does the idea even make sense? (I'm a layman)

iruoy 4 years ago

So bzip2 and bzip3 focus on compressed size, lz4 on compression speed and zstd on decompression speed?

  • jeffbee 4 years ago

    I don't know if that's really accurate. LZ4 is often faster on both sides, while usually having a larger compressed size. On most of the inputs listed at this benchmark, LZ4 is twice as fast for compression, 50% faster for decompression, while having a compressed size about 125% as large as Zstd. My rule of thumb is that zstd is good if you're going to store or transmit the result, while lz4 is the better choice if you're planning to compress and decompress exactly once without storing (i.e. as a transfer encoding between two network peers).
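
    The classic example of that compress-once/decompress-once pattern is wrapping a network copy, something like (host and paths hypothetical):

      tar -C /data -cf - . | lz4 -c | ssh backup-host 'lz4 -dc | tar -C /restore -xf -'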

jkbonfield 4 years ago

It doesn't compare itself against bsc, which feels a bit poor IMO given it's using Grebnov's libsais and LZP algorithm (he's the author of libbsc).

On my own benchmarks, it's basically comparable size (about 0.1% smaller than bsc), comparable encode speeds, and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.

Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet though.

  • palaiologosOP 4 years ago

    You've literally tested it on a single file, enwik8. That's not enough to extrapolate valuable results. One of the benchmarks:

      time ./bsc e ../linux.tar linux.bsc -e2 -b16 -T
      68.69s user 1.14s system 99% cpu 117M memory 1:09.84 total
    
    bzip3, meanwhile, uses 98M and takes 1min 17s to produce a 129023171 byte file, compared to 127747834B from BSC. They're very similar, except bzip3 tends to use less memory and decompresses a little slower. BSC is much more mature than bzip3, though, and the benchmarks might be subject to change some time in the future. Surprisingly, the BSC code isn't really that robust (I reported a UB bug to libsais and had to pretty much rework the LZP code because it couldn't stand fuzzing).
    • jkbonfield 4 years ago

      Well yes it was one file, but it was stated as being good on text and enwik8 is a pretty standard test corpus for text compressors.

      I could have done more, but it somewhat vindicated what I was saying really. It has a very similar core to bsc (based on the same code) and gives very similar file sizes as expected. Note you may wish to use bsc -tT to disable both forms of threading. I don't know if that changes memory usage any.

      Have you tried making PRs back to the libbsc GitHub to fix the UB and fuzzing issues? I'm sure the author would welcome fixes given you've already done the leg work.

      Anyway, please do consider benchmarking against libbsc. It's conspicuously absent given the shared ancestry.

      • palaiologosOP 4 years ago

        I haven't figured out a libsais fix, and my LZP fix changes the functionality a little (it removes chunking for better compression at a rather small runtime cost), so I don't think the author would like me to submit it. I have opened tickets, though.

kstenerud 4 years ago

There comes a point where the complexity itself becomes too much of a liability. It's important to be able to trust these algorithms as well as all popular implementations with your data.

  • mschuster91 4 years ago

    One should verify the integrity of stuff like backups or archives anyway, by supplying the end user with a sha1 or better hash of both the compressed/encrypted archive as well as all of the files it contains, and by regularly verifying if both still match.
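
    For example (names hypothetical), keep a manifest of the file hashes plus a hash of the archive itself, and re-check both with sha256sum -c:

      find data/ -type f -exec sha256sum {} + > backup.manifest
      tar -cf - data/ backup.manifest | bzip2 > backup.tar.bz2
      sha256sum backup.tar.bz2 > backup.tar.bz2.sha256

      # later, or on the restore side:
      sha256sum -c backup.tar.bz2.sha256
      tar -xjf backup.tar.bz2 && sha256sum -c backup.manifest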

    • bell-cot 4 years ago

      Yes...though I'd say to rule out sha1, or any other "no longer considered secure" hashes. The space & time savings (vs, say, sha-512) are really not worth baking into your backup format & procedures. Keep in mind that you might need to really verify the integrity of your backup during a ransomware incident, or as part of a high-stakes legal situation, or ...

      • joveian 4 years ago

        SHA1 and even MD5 are broken for collision resistance, not second preimage (which is what matters for backups). While it can still make sense to not use SHA1 for anything, it is fine if you do use it. Blake3 is the one to use if you are looking for maximum speed, although it is newer and some may avoid it for that. In a quick test (using hyperfine -w 1 -m 3) on an 8GB file (Arch Linux, i5-6260U 2-core processor with hyperthreading disabled) times are 1.995s for b3sum, 11.847s for sha1sum, 14.495s for b2sum, 16.903s for sha512sum, and 24.737s for sha256sum. md5sum for comparison is 16.492s, just a tiny bit faster than sha512sum, so that is why you rarely see it these days even for things like backups. SHA3-224 (sha3sum) took 39.531s and SHA3-512 (sha3sum -a 512) took 73.164s, although they will eventually be very quick with hardware support.

        xz -C sha256 will use sha256 on the uncompressed data.

        Personally, I use mtree inside the backup file (from pkgsrc bootstrap on Linux), although it has trouble with a few unicode file names. That kind of tool is great and I'm not sure why there doesn't seem to be an equivalent in the Linux world.

    • cestith 4 years ago

      Ideally, yes. Yet logrotate uses zlib to compress log files and deletes the originals. That's how trusted it is.

AceJohnny2 4 years ago

Will bzip3 be added to the Squash benchmarks?

https://quixdb.github.io/squash-benchmark/

I note that the "Calgary Corpus" that bzip3 prominently advertises is obsolete, dating back to the late 80s:

https://en.wikipedia.org/wiki/Calgary_corpus

the-alchemist 4 years ago

I'm really interested in GPU-based compression / decompression.

Anyone know what the current SOTA GPU-based algorithms are, and why they haven't taken off?

Brotli has gotten browser support, so it seems to my naive self that a GPU-based algorithm is just waiting to take over.

  • alerque 4 years ago

    GPUs are good at massively parallel tasks. Compression is, almost by definition, not a parallel problem. If you want speed you can break it up into chunks, and if you are optimizing for throughput there are gains to be made there. But if you are optimizing for compression, the more chunks you break the task into, the less opportunity you have to find ways to compress it. For example, a fast compression tool creating an archive of files might split each file into a different thread, which gets the job done fast, but it will lose out on huge gains in compression if there are common parts to files that could have been compressed if they were processed as a single blob. GPUs are designed to do lots of small chunks of work in parallel; CPUs are better at doing bigger jobs faster.

oefrha 4 years ago

Interesting, this seems to be a good replacement for xz if the benchmarks are representative.

joppy 4 years ago

Why is there such a big disclaimer/warning on the front? Shouldn’t the program just check that decompress(compress(x)) = x as it goes, and then it can be sure that compress(x) has not lost any data?

  • palaiologosOP 4 years ago

    no compressor tests the output while compressing as it hurts the performance. you can do it after compressing, though, using `bzip3 -t`.

    • joppy 4 years ago

      Right, but I would probably test the output while compressing instead of putting a big USE AT YOUR OWN PERIL sign across the front…

      • cestith 4 years ago

        It seems PRs are welcome. The author/maintainer may accept that as optional behavior with a command-line flag.

72deluxe 4 years ago

I use pbzip2 with gusto because the original bzip2 is single-threaded. I heartily recommend it to all I meet, even those in the street!

rurban 4 years ago

Could easily be improved by using the hardware CRC32 instruction; right now it's just software CRC32.

themusicgod1 4 years ago

> Github link

...so long as this lives in NSA/Microsoft Github, it's not a 'spiritual successor' to anything.
