Show HN: pzip- blazing fast concurrent zip archiver and extractor

github.com

26 points by exposition 2 years ago · 17 comments

thangngoc89 2 years ago

Wow. Thank you for making this. I frequently have to zip and unzip ~100GB zip archives, and I waste 10 minutes waiting each time, even on a workstation with a fast NVMe drive and 32 cores. I know about zstd and pigz, but the format must be zip.

  • acqq 2 years ago

    7-Zip by Igor Pavlov can create zip files and has multi-threading. In my small test comparing it with "pzip", it was just as fast in "real" time and produced a smaller file (while using a similar amount of CPU, though distributed differently between user and sys).

    https://www.7-zip.org/download.html

    Also, in my example the Info-ZIP compression level that best matched pzip's was -3. This can, of course, depend on the data set.

        ~/c/measure_pzip$ time 7z -tzip -mx1 a a7.zip /usr/lib/apache2/*
    
        7-Zip (z) 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
         64-bit locale=en_US.UTF-8 Threads:3 OPEN_MAX:1024, ASM
    
        ...
    
        real 0m0,074s
        user 0m0,121s
        sys 0m0,014s
        ~/c/measure_pzip$ time ./pzip a-p.zip /usr/lib/apache2/*
    
        real 0m0,073s
        user 0m0,097s
        sys 0m0,038s
        ~/c/measure_pzip$ time zip -3 -r a-zip.zip /usr/lib/apache2/ >/dev/null
    
        real 0m0,118s
        user 0m0,114s
        sys 0m0,004s
    
        ~/c/measure_pzip$ ls -l a*.zip | ./my2
        1576511  a7.zip
        1619733  a-p.zip
        1613607  a-zip.zip
    • acqq 2 years ago

      Testing with the 100 MB set from mattmahoney.net, at relatively comparable output sizes, pzip is twice as fast as the previously mentioned 7z by Pavlov. That's clearly useful for those who need the fastest possible creation of a "classic" zip with compressed files, when a lower compression ratio (a 1.6 MB bigger compressed file for the 100 MB set, compared to 7z) is acceptable.

          $ time zip -2 -r a-zip.zip 100mb/ >/dev/null
          real user sys: 2,1 1,8 0,1 
          $ time 7z -tzip -mx=1 a a-7z-1.zip 100mb/ >/dev/null
          real user sys: 1,0 2,7 0,0 
          $ time ../pzip a-pzip.zip 100mb/ >/dev/null
          real user sys: 0,5 1,0 0,1 
          $ L a
          48197707 a-7z-1.zip
          49921626 a-pzip.zip
          49553097 a-zip.zip
      
      If "classic" compatibility (i.e. being able to unpack the archive with older programs) is not important, it's worth considering that since at least 2020 zstd has officially been a "standard" method for ZIP files too, allowing even faster compression for the same compressed-size targets.

          93 - Zstandard (zstd) Compression 
      
      https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.9.TX...
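To see which method an existing archive actually uses, the per-member method ID can be read from the archive's directory. Below is a minimal sketch using Python's standard `zipfile` module; the `METHOD_NAMES` table and the helper names `list_methods` and `build_sample` are made up for illustration, and the sample archive only contains stored and deflated members, not zstd ones.

```python
import io
import zipfile

# Method IDs as listed in the ZIP APPNOTE; 93 is the Zstandard entry quoted above.
METHOD_NAMES = {0: "stored", 8: "deflate", 12: "bzip2", 14: "lzma", 93: "zstd"}


def list_methods(zip_bytes):
    """Map each member name to the name of its compression method."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: METHOD_NAMES.get(
                info.compress_type, f"unknown ({info.compress_type})"
            )
            for info in zf.infolist()
        }


def build_sample():
    """Build a tiny in-memory archive (hypothetical test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", b"hello " * 100, compress_type=zipfile.ZIP_STORED)
        zf.writestr("b.txt", b"hello " * 100, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


if __name__ == "__main__":
    for name, method in list_methods(build_sample()).items():
        print(name, method)
```

Any member reported as something other than "stored" or "deflate" is a hint that older unzip programs may not be able to extract it.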

      I'm aware that there are some attempts to modify 7-Zip so that it can use that method in ZIP files, but I don't know more than that:

      https://github.com/mcmilk/7-Zip-zstd

      https://github.com/mcmilk/7-Zip-zstd/issues/132

      https://github.com/libarchive/libarchive/issues/1403

      If the ZIP target format is not a requirement, here's the speed of using zstd on tar for the same input and approximately the same resulting size:

          time tar -c 100mb | zstd -2 -o a.tar.zst 2>/dev/null
          real user sys: 0,4 0,4 0,1 
          48585639 a.tar.zst
  • mxmlnkn 2 years ago

    Is that by chance a public dataset? I would like to do some benchmarks with large non-synthetic zip archives.

acqq 2 years ago

Note that even without considering the speedup from compressing in multiple threads, the libraries used for compression here use much less CPU (user 3m33s) than "the standard zip utility" (user 13m13s, i.e. 3.7 times the former; if I understand correctly, this "standard" is Info-ZIP). That is a little less surprising knowing that the source for the latter hasn't been updated for 15 years, while, if I understand correctly, this new Go version depends on the compression routines maintained in https://pkg.go.dev/compress/flate

I also don't see the comparison of the resulting compression sizes of the two programs.

  • expositionOP 2 years ago

    Yep the standard I refer to is Info-ZIP (zip(1)).

    I will add the resulting compression sizes; there is not much between them (pzip was around 2% larger for the 10GB directory). That said, I do have some optimizations in mind which will bring this down further.

    • acqq 2 years ago

      Allowing a 2% bigger resulting file could mean a huge speedup in these circumstances even with the same compression routines, judging by these benchmarks of zlib and zlib-ng at different compression levels:

      https://github.com/zlib-ng/zlib-ng/discussions/871

      IMO, a fair comparison of the real speed improvement brought by a new program is only possible at almost identical resulting compressed sizes.
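One way to apply that rule in practice is to measure size and time per compression level on the same input first, and then compare tools only at the levels where their output sizes match. Here is a rough sketch with Python's `zlib` module; `measure` and `make_sample` are hypothetical helpers, the sample is synthetic text rather than a real benchmark corpus, and the timings it prints say nothing about pzip or Info-ZIP themselves.

```python
import time
import zlib


def measure(data, levels=(1, 3, 6, 9)):
    """Return (level, compressed_size, seconds) for each zlib level."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must reproduce the input exactly.
        assert zlib.decompress(compressed) == data
        results.append((level, len(compressed), elapsed))
    return results


def make_sample(n=20000):
    """Synthetic, repetitive text (a stand-in for a real data set)."""
    return b"".join(b"line %d: the quick brown fox\n" % i for i in range(n))


if __name__ == "__main__":
    sample = make_sample()
    print("input bytes:", len(sample))
    for level, size, seconds in measure(sample):
        print(f"level {level}: {size} bytes in {seconds * 1000:.1f} ms")
```

With a table like this for each tool, the speed comparison can be made between the rows whose sizes are closest, instead of between each tool's default settings.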

  • singron 2 years ago

    This also depends on whether your distro is using zlib or zlib-ng, the latter being significantly faster.

sntran 2 years ago

Can't find out if this supports encryption, or streaming from stdin and/or to stdout. I haven't found one zip tool that does the above.

  • expositionOP 2 years ago

    At the moment it doesn't. If it's something you're looking for, please open an issue.

    I didn't want to go down the route of implementing a bunch of features without there being an actual need for them, especially for an initial version.

  • lifthrasiir 2 years ago

    Info-ZIP (aka zip(1)) does, assuming that a heavily flawed standard encryption and a single file input from stdin suffice.

steelbrain 2 years ago

How does this compare against pigz [1]? AFAIK pigz comes bundled with some modern distros, and I've personally used it reliably in some backup operations.

[1]: https://zlib.net/pigz/

  • Rolcol 2 years ago

    This appears to be for the .zip format, not gzip.

    • mxmlnkn 2 years ago

      That's generally true, but theoretically, pigz can extract single-member zip archives. I assume that they are both equally fast, assuming that they use zlib. libdeflate or ISA-L should speed this up significantly.

      Furthermore, for compression, this still might be a valid question especially for the single-file case because it sounds like pzip parallelizes over the file members and cannot speed up compression/decompression of a single file member.
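To illustrate what "parallelizing over the file members" means, here is a minimal sketch in Python. It is not pzip's actual implementation (which is in Go); `deflate_raw` and `compress_members` are hypothetical names, and writing the resulting streams into a valid zip container is left out. It relies on CPython's `zlib` module releasing the GIL while compressing, so the threads can run on separate cores.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def deflate_raw(data, level=6):
    """Compress one member to a raw deflate stream (the payload of ZIP method 8)."""
    # wbits=-15 asks zlib for a raw stream without the zlib header and checksum.
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def compress_members(members, workers=4):
    """Compress every member independently, one task per member."""
    names = list(members)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(deflate_raw, (members[n] for n in names)))
    return dict(zip(names, streams))


if __name__ == "__main__":
    files = {f"file{i}.txt": (b"member %d " % i) * 50000 for i in range(8)}
    for name, stream in compress_members(files).items():
        print(name, len(files[name]), "->", len(stream))
```

Each task handles one whole member, so an archive with a single large file ends up as a single task and gains nothing from the extra workers, which is the limitation described above.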

eisbaw 2 years ago

pzstd := parallel zstd
