Many times faster (de)compression using multiple processors. (iasylum.net)

It's good that people are getting interested in the subject. But this is very odd and has some errors. For example, xz requires a lot more memory than bzip2 (see the benchmarks below, Mem column).
http://mattmahoney.net/dc/text.html
http://mattmahoney.net/dc/uiq/
Matt Mahoney maintains the best benchmarks on text and generic compression. Some of the best people in the field (like Matt) usually hang out at encode.ru.
By resources I meant mostly wall time, which is what I am optimizing for (I move >200 GB ZFS snapshots around a lot). I did not pay specific attention to memory; in that light I updated the side note.
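For instance, the kind of pipeline I'm optimizing looks something like this (pool, snapshot, and host names are illustrative):

  zfs send tank/data@snap | pigz | ssh otherhost 'pigz -d | zfs receive tank/data'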
What else did you find odd/wrong?
You don't mention ratios. Better compression becomes exponentially harder as it gets close to the optimal code size.
Thanks for the shout-out. BTW, Matt's benchmarks cover all major compression engines, not only bzip2.
That benchmark runs xz with the -9e flags, which turn on its slowest and most memory-intensive mode. If you pass it -0 it only needs 3 MB to compress and 1 MB to decompress.
I usually use -0 with xz because it is extremely fast and memory-efficient yet still compresses better than gzip or bzip2. You can also use -0e for a slower, better compression that still requires only 1 MB to decompress. This way the decompressor can run entirely within the CPU's cache.
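For a rough comparison of the presets being discussed (file name illustrative; the memory figures are the ones cited above):

  xz -0  file.tar    # fastest preset: ~3 MB to compress, ~1 MB to decompress
  xz -0e file.tar    # slower, better ratio, still ~1 MB to decompress
  xz -9e file.tar    # slowest, most memory-hungry preset (what the benchmark used)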
Yes, it's in the benchmarks. But comparing against bzip2 is not very meaningful; it's more relevant to compare with compressors in the same efficiency range. With those flags it compresses to 26 MB, and in that range there are many equivalent ROLZ/LZP engines with similar numbers. For example, csc32 is a bit faster but uses more memory.
http://mattmahoney.net/dc/text.html#2118
http://mattmahoney.net/dc/text.html#2300
Proper analysis would need benchmarking with different data and different flags for all compressors.
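Something like this rough sketch, for instance (file names and flag sets are purely illustrative):

  for f in text.tar dna.tar binary.tar; do
    for c in "gzip -9" "bzip2 -9" "xz -0" "xz -9e"; do
      # GNU time -v reports wall time and maximum resident set size
      /usr/bin/time -v sh -c "$c -c '$f' > '$f.z'" 2>> results.log
    done
  done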
Is this decompressing a single stream on multiple processors? My knowledge of gzip is very limited, but I would have thought sequential processing was required. What's the trick here? (TFA doesn't explain anything, and the pigz homepage doesn't either.)
You are right. The PDF version of the pigz user manual (at http://zlib.net/pigz/pigz.pdf ) has a few details (the last paragraph below answers your question):
"Pigz compresses using threads to make use of multiple processors and cores. The input is broken up into 128 KB chunks with each compressed in parallel. The individual check value for each chunk is also calculated in parallel. The compressed data is written in order to the output, and a combined check value is calculated from the individual check values.
The compressed data format generated is in the gzip, zlib, or single-entry zip format using the deflate compression method. The compression produces partial raw deflate streams which are concatenated by a single write thread and wrapped with the appropriate header and trailer, where the trailer contains the combined check value.
Each partial raw deflate stream is terminated by an empty stored block (using the Z_SYNC_FLUSH option of zlib), in order to end that partial bit stream at a byte boundary. That allows the partial streams to be concatenated simply as sequences of bytes. This adds a very small four to five byte overhead to the output for each input chunk.

The default input block size is 128K, but can be changed with the -b option. The number of compress threads is set by default to the number of online processors, which can be changed using the -p option. Specifying -p 1 avoids the use of threads entirely.

The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread. This can be turned off using the -i or --independent option, so that the blocks can be decompressed independently for partial error recovery or for random access.
Decompression can’t be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances. Parallel decompression can be turned off by specifying one process ( -dp 1 or -tp 1 )."
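Concretely, the options mentioned there map onto invocations like these (untested sketch, file names illustrative):

  pigz -b 256 -p 8 big.tar    # 256 KB input blocks, 8 compression threads
  pigz -i big.tar             # independent blocks, for error recovery / random access
  pigz -d -p 1 big.tar.gz     # decompression with no helper threads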
I also just updated the article with this info.
Interesting. How is this capable of extracting LZMA "solid" archives where all files are part of a single stream to achieve better (and very much slower) compression?
Had to try this on my quad-core laptop, as I'd never heard of these tools.
  josh@snoopy:~/Downloads $ grep -m2 -i intel /proc/cpuinfo
  vendor_id : GenuineIntel
  model name : Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz
  josh@snoopy:~/Downloads $ ls -l test
  -rw-r--r-- 1 josh josh 1073741824 2012-03-07 20:06 test
  josh@snoopy:~/Downloads $ time gzip test
  real 0m16.430s
  user 0m10.210s
  sys 0m0.490s
  josh@snoopy:~/Downloads $ time pigz test
  real 0m5.028s
  user 0m16.040s
  sys 0m0.620s
Looks good... although the man page describes it as "an almost compatible replacement for the gzip program".

That's actually a dual-core with hyperthreading. (I'm not just being pedantic; I'm curious how well hyperthreading works.) http://ark.intel.com/products/52224
I wonder what the speed looks like with two threads (pigz -p 2)?
But what's the compression ratio and what's the test data?
Test data (and with -p 2 it runs ~2 sec slower):

  josh@snoopy:~/Downloads $ dd if=/dev/zero of=test bs=1024K count=1000
  josh@snoopy:~/Downloads $ gzip -l test.gz
       compressed        uncompressed  ratio uncompressed_name
          1176024          1048576000  99.9% test.txt

For that kind of data you can just use RLE instead. It's the trivial case (an edge case, in fact).
Is xz less resource-intensive than bzip2? My testing (admittedly two years ago or so) showed significant differences: a better compression ratio with xz, but significantly longer run times and/or more memory used.
In the tests I did (including compression of an entire filesystem image), xz always compresses better for the same effort (amount of time).
One example (using the standard tools, not the multithreaded ones):
In this case xz compressed to roughly the same size (slightly better than bzip2, actually) using only 72% of the time. This is often the case.

  igorhvr:tmp/ $ ls -la co_2011-08_import.mdb*
  -rw-r--r-- 1 igorhvr igorhvr 261832704 2011-10-11 07:19 co_2011-08_import.mdb
  igorhvr:tmp/ $ sudo time nice -n -20 bzip2 -k co_2011-08_import.mdb
  42.85user 0.14system 0:43.16elapsed 99%CPU (0avgtext+0avgdata 31968maxresident)k
  112inputs+0outputs (3major+2268minor)pagefaults 0swaps
  igorhvr:tmp/ $ sudo time nice -n -20 xz -3 -k co_2011-08_import.mdb
  30.99user 0.19system 0:31.48elapsed 99%CPU (0avgtext+0avgdata 130640maxresident)k
  96inputs+0outputs (2major+8389minor)pagefaults 0swaps
  igorhvr:tmp/ $ ls -la co_2011-08_import.mdb*
  -rw-r--r-- 1 igorhvr igorhvr 261832704 2011-10-11 07:19 co_2011-08_import.mdb
  -rw-r--r-- 1 igorhvr igorhvr  24020243 2011-10-11 07:19 co_2011-08_import.mdb.bz2
  -rw-r--r-- 1 igorhvr igorhvr  23949408 2011-10-11 07:19 co_2011-08_import.mdb.xz

Interesting. I may have been comparing parallel bzip2 with single-threaded xz, or perhaps it was just the data I was using for the test.
I'll have to look at it again, thanks!
Put two spaces before your lines of terminal output to prevent HN from munging them.
Done - thanks
If you're handling a lot of data, it makes sense to hash-partition it on some key and spread it out across a large number of files.
In that case you might have, say, 512 partitions and you can farm out compression, decompression and other tasks to as many CPUs as you want, even other machines in a cluster.
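As a minimal sketch of the compression step, assuming 512 partition files named part-000 through part-511 (names illustrative):

  # compress up to 8 partitions concurrently
  ls part-* | xargs -P 8 -n 1 gzip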
I like to use PPMd (via 7zip) for large volumes of text, but it seems to only be single-threaded, which is a shame. It cuts a good 30% again off the size of the .xml.bz2's that Wikipedia provides.
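For reference, the invocation I mean is along these lines (file names illustrative):

  7z a -m0=PPMd enwiki-pages.xml.7z enwiki-pages.xml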
This is awesome, since parallel compression has been largely neglected in practice.
It's not neglected; it's just hard, and a lot of projects have failed. But there have been many papers on the subject over the last 10 years. Also, "parallel" is a vague term: it can mean multicore, SIMD, GPU, or clusters. All are different, each with a unique set of tradeoffs.