Show HN: pzip: blazing fast concurrent zip archiver and extractor
github.com

Wow. Thank you for making this. I frequently have to zip and unzip ~100 GB zip archives, and I waste 10 minutes waiting, on a workstation with a fast NVMe drive and 32 cores. I know about zstd and pigz, but the format must be zip.
7-Zip by Igor Pavlov can create zip files and is multi-threaded; in my small test comparing it with pzip, it was just as fast in "real" time and produced a smaller file (while using a similar amount of CPU, distributed differently between user and sys).
https://www.7-zip.org/download.html
Also, in my example the Info-ZIP compression level that best matched pzip's was -3. This can, of course, depend on the data set.
    ~/c/measure_pzip$ time 7z -tzip -mx1 a a7.zip /usr/lib/apache2/*
    7-Zip (z) 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
    64-bit locale=en_US.UTF-8 Threads:3 OPEN_MAX:1024, ASM
    ...
    real  0m0,074s
    user  0m0,121s
    sys   0m0,014s

    ~/c/measure_pzip$ time ./pzip a-p.zip /usr/lib/apache2/*
    real  0m0,073s
    user  0m0,097s
    sys   0m0,038s

    ~/c/measure_pzip$ time zip -3 -r a-zip.zip /usr/lib/apache2/ >/dev/null
    real  0m0,118s
    user  0m0,114s
    sys   0m0,004s

    ~/c/measure_pzip$ ls -l a*.zip | ./my2
    1576511 a7.zip
    1619733 a-p.zip
    1613607 a-zip.zip

Testing with the 100 MB set from mattmahoney.net, and with relatively comparable sizes, pzip is twice as fast as the previously mentioned Pavlov's 7z. That's clearly useful for those who need the fastest possible creation of a "classic" zip with compressed files, when the lower compression ratio (a 1.6 MB bigger compressed file when compressing the 100 MB set, compared to 7z) is acceptable.
If the "classic" (i.e. the goal to unpack the archive using older programs) compatibility is not important, it could be interesting to consider that at least since 2020 zstd is officially a "standard" method for ZIP files too, allowing even faster compression speed for the same compression size targets.$ time zip -2 -r a-zip.zip 100mb/ >/dev/null real user sys: 2,1 1,8 0,1 $ time 7z -tzip -mx=1 a a-7z-1.zip 100mb/ >/dev/null real user sys: 1,0 2,7 0,0 $ time ../pzip a-pzip.zip 100mb/ >/dev/null real user sys: 0,5 1,0 0,1 $ L a 48197707 a-7z-1.zip 49921626 a-pzip.zip 49553097 a-zip.zip
https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.9.TX...

    93 - Zstandard (zstd) Compression

I'm aware that there are some attempts at modifying 7-Zip to allow using that method in ZIP files, but I don't know more than that (see the sketch after the links below):
https://github.com/mcmilk/7-Zip-zstd
https://github.com/mcmilk/7-Zip-zstd/issues/132
https://github.com/libarchive/libarchive/issues/1403
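For what it's worth, Go's archive/zip (which pzip reportedly builds on) lets you register a compressor for an arbitrary method ID, so a minimal, hedged sketch of writing method-93 (zstd) entries could look like the following. It uses the third-party github.com/klauspost/compress/zstd encoder, and only unzippers that know method 93 (such as the 7-Zip-zstd fork above, or a Go reader with a matching RegisterDecompressor) will be able to extract the result:

    package main

    import (
        "archive/zip"
        "io"
        "log"
        "os"

        "github.com/klauspost/compress/zstd"
    )

    const zstdMethod = 93 // method ID assigned to Zstandard in APPNOTE 6.3.9

    func main() {
        f, err := os.Create("out.zip")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        zw := zip.NewWriter(f)
        // archive/zip only ships Store and Deflate out of the box;
        // register a zstd compressor for method 93.
        zw.RegisterCompressor(zstdMethod, func(w io.Writer) (io.WriteCloser, error) {
            return zstd.NewWriter(w, zstd.WithEncoderLevel(zstd.SpeedFastest))
        })

        w, err := zw.CreateHeader(&zip.FileHeader{Name: "data.bin", Method: zstdMethod})
        if err != nil {
            log.Fatal(err)
        }
        if _, err := io.Copy(w, os.Stdin); err != nil {
            log.Fatal(err)
        }
        if err := zw.Close(); err != nil {
            log.Fatal(err)
        }
    }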
If the ZIP target format is not a requirement, here's the speed of using zstd on tar for the same input and approximately the same resulting size:
    $ time tar -c 100mb | zstd -2 -o a.tar.zst 2>/dev/null
    real user sys: 0,4 0,4 0,1
    48585639 a.tar.zst
Is that by chance a public dataset? I would like to do some benchmarks with large non-synthetic zip archives.
Here you go, it’s ~200 GB: https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detect...
Note that even when not considering the speedup from the compression happening in multiple threads, the libraries used for compression here use much less CPU (user 3m33s) than "the standard zip utility" (user 13m13s, i.e. 3.7 times the former; if I understand correctly, this "standard" is Info-ZIP). That is a little less surprising knowing that the source for the latter hasn't been updated in 15 years, while, if I understand correctly, this new Go version depends on the compression routines maintained in https://pkg.go.dev/compress/flate
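As an illustration of how those routines get wired in (a hedged sketch, not pzip's actual code): archive/zip compresses Deflate entries via compress/flate at the package default level, and a caller can override that compressor, e.g. to trade ratio for speed:

    package main

    import (
        "archive/zip"
        "compress/flate"
        "io"
        "log"
        "os"
    )

    func main() {
        out, err := os.Create("fast.zip")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        zw := zip.NewWriter(out)
        // Override the stock Deflate compressor with the fastest
        // flate level, trading compression ratio for CPU time.
        zw.RegisterCompressor(zip.Deflate, func(w io.Writer) (io.WriteCloser, error) {
            return flate.NewWriter(w, flate.BestSpeed)
        })

        w, err := zw.Create("hello.txt")
        if err != nil {
            log.Fatal(err)
        }
        io.WriteString(w, "hello, world\n")
        if err := zw.Close(); err != nil {
            log.Fatal(err)
        }
    }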
I also don't see a comparison of the resulting compressed sizes of the two programs.
Yep, the standard I refer to is Info-ZIP (zip(1)).
I will add the resulting compressed sizes; there is not much between them (pzip was around 2% larger for the 10 GB directory). I do have some optimizations in mind, though, which will bring this down further.
Allowing for a 2% bigger resulting file could mean a huge speedup in these circumstances, even with the same compression routines, judging by these benchmarks of zlib and zlib-ng at different compression levels:
https://github.com/zlib-ng/zlib-ng/discussions/871
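To see the same level-versus-size trade-off on your own data, here is a minimal, hedged measurement sketch using Go's compress/flate (the thread's Go tooling builds on that package; the countWriter helper and the command-line argument are mine, not from any of the tools discussed):

    package main

    import (
        "compress/flate"
        "fmt"
        "log"
        "os"
        "time"
    )

    // countWriter discards the compressed bytes but counts them.
    type countWriter struct{ n int64 }

    func (c *countWriter) Write(p []byte) (int, error) {
        c.n += int64(len(p))
        return len(p), nil
    }

    func main() {
        if len(os.Args) < 2 {
            log.Fatal("usage: flatebench <file>")
        }
        data, err := os.ReadFile(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        // Compress the same input at every flate level and report
        // the compressed size and the wall-clock time for each.
        for level := flate.BestSpeed; level <= flate.BestCompression; level++ {
            cw := &countWriter{}
            start := time.Now()
            fw, _ := flate.NewWriter(cw, level)
            fw.Write(data)
            fw.Close()
            fmt.Printf("level %d: %d bytes in %v\n", level, cw.n, time.Since(start))
        }
    }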
IMO, a fair comparison of the real speed improvement brought by a new program can only be made between nearly identical resulting compressed sizes.
This also depends on whether your distro is using zlib or zlib-ng, which is significantly faster.
Do you know which distributions are using zlib-ng for zip and unzip programs?
If I understand correctly, the improvement would be around 3 times less CPU use for a comparable resulting size, but I see it shown there for "minizip", not zip:
I can't find out whether this supports encryption, or streaming from stdin and/or to stdout. I haven't found a single zip tool that does all of the above.
At the moment it doesn't; if it's something you're looking for, please open an issue.
I didn't want to go down the route of implementing a bunch of features without there being an actual need for them, especially for an initial version.
Info-ZIP (aka zip(1)) does, assuming that the heavily flawed standard encryption and single-file input from stdin suffice.
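For what it's worth, the stdout half is straightforward in Go: archive/zip's writer never seeks, so it can stream an archive to a pipe. Reading a zip from stdin is the hard part, since the reader needs an io.ReaderAt. A minimal sketch of a single-member stdin-to-stdout zipper (the member name "stdin.dat" is an arbitrary placeholder of mine):

    package main

    import (
        "archive/zip"
        "io"
        "log"
        "os"
    )

    func main() {
        // zip.Writer writes strictly sequentially, so a
        // non-seekable destination like stdout works fine.
        zw := zip.NewWriter(os.Stdout)
        w, err := zw.Create("stdin.dat")
        if err != nil {
            log.Fatal(err)
        }
        if _, err := io.Copy(w, os.Stdin); err != nil {
            log.Fatal(err)
        }
        if err := zw.Close(); err != nil {
            log.Fatal(err)
        }
    }

Usage would be something like: ./zipstream < big.bin > big.zip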
How does this compare against pigz? [1] Afaik pigz comes bundled in some modern distros; I've also personally used it reliably in some backup operations.
This appears to be for the .zip format, not gzip.
That's generally true, but theoretically pigz can extract single-member zip archives. I assume that they are both about equally fast, assuming they both use zlib; libdeflate or ISA-L should speed this up significantly.
Furthermore, for compression this might still be a valid question, especially for the single-file case, because it sounds like pzip parallelizes over the file members and cannot speed up compression/decompression of a single file member.
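To make that member-level parallelism concrete, here is a hedged sketch (not pzip's actual implementation) of the general approach such tools can take: deflate each file in its own goroutine, then stitch the pre-compressed bytes together with archive/zip's CreateRaw. It reads whole files into memory and spawns one goroutine per file, which a real tool would bound:

    package main

    import (
        "archive/zip"
        "bytes"
        "compress/flate"
        "hash/crc32"
        "log"
        "os"
        "sync"
    )

    type member struct {
        name string
        crc  uint32
        size uint64
        comp []byte // deflate-compressed bytes
    }

    func main() {
        paths := os.Args[1:]
        members := make([]member, len(paths))

        // Compress every input file concurrently.
        var wg sync.WaitGroup
        for i, p := range paths {
            wg.Add(1)
            go func(i int, p string) {
                defer wg.Done()
                data, err := os.ReadFile(p)
                if err != nil {
                    log.Fatal(err)
                }
                var buf bytes.Buffer
                fw, _ := flate.NewWriter(&buf, flate.BestSpeed)
                fw.Write(data)
                fw.Close()
                members[i] = member{p, crc32.ChecksumIEEE(data), uint64(len(data)), buf.Bytes()}
            }(i, p)
        }
        wg.Wait()

        out, err := os.Create("out.zip")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        // Write the pre-compressed members sequentially; CreateRaw
        // expects the sizes and CRC that zip.Writer would normally
        // compute for us.
        zw := zip.NewWriter(out)
        for _, m := range members {
            w, err := zw.CreateRaw(&zip.FileHeader{
                Name:               m.name,
                Method:             zip.Deflate,
                CRC32:              m.crc,
                CompressedSize64:   uint64(len(m.comp)),
                UncompressedSize64: m.size,
            })
            if err != nil {
                log.Fatal(err)
            }
            w.Write(m.comp)
        }
        if err := zw.Close(); err != nil {
            log.Fatal(err)
        }
    }

A single huge member gains nothing from this scheme, since one deflate stream is still compressed (and decompressed) sequentially.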
pzstd := parallel zstd