GNU Parallel Cheat Sheet [pdf] (gnu.org)

Ah, the rare case of nagware in GNU.
From the man page:
"--citation Print the BibTeX entry for GNU parallel and silence citation notice. If it is impossible for you to run --bibtex you can use --will-cite. If you use --will-cite in scripts to be run by others you are making it harder for others to see the citation notice. The development of GNU parallel is indirectly financed through citations, so if your users do not know they should cite then you are making it harder to finance development. However, if you pay 10000 EUR, you should feel free to use --will-cite in scripts."
Asking for donations/citations is one thing, but putting this junk about 10000 EUR in the man page and nagging users is quite an annoyance. How GNU allows such junk in their man pages puzzles me. Obviously the GPL allows one to remove the nagware and redistribute, but I don't know if anyone has forked it.
Yes. Another issue: the author has been promoting GNU Parallel pretty heavily in many StackOverflow questions dealing with xargs or parallel execution, even when an additional tool is neither needed nor wanted (because a) it's not already installed, unlike xargs, and b) of the aforementioned citation issue, which I disagree with).
It's a great tool I'm sure, but I've been able to get by using just xargs, flock, etc. for most use cases.
Your criticism would be fair if FOSS authors were rightly compensated for their work.
This isn't nearly the case, so until then, blaming FOSS authors for some experimentation is just unwarranted.
They can do experiments and distribute their work through their own channels, then. But GNU tools (and FOSS by extension) are so popular because of their no-nonsense philosophy: here it is, do with it whatever you want. Run it anywhere and any way you please.
Citing it or not is an issue of academic practice/considerations (whether its use was a significant part of the research etc.). Mandating it through nag messages is too much.
What's next? make will print ads while the compilation runs? GIMP will watermark my images if I don't pay 10K or promise to cite it if I make figures for my paper?
So again, my main confusion is about how this can be an official GNU tool.
Are your questions answered in the FAQ? http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/cita...
The FAQ includes a lot of support for what seems like the wrong hill to die on.
Here’s my thought process:
- the GNU Parallel author(s) want/wants people to use and contribute to it.
- they think that most users are academics who write papers and that potential users will find the project after reading the citation, which may or may not be true
- they include a nagware message that “reminds” users to cite the software
- despite the message being controversial and being the subject of the #1 comment in an otherwise unrelated HN thread about the software in general, an FAQ is written to back up the existence of this message
This brings me to the question of whether the inclusion of this message acts more as a deterrent to potential contributors and users. I agree with the motivation, but the means feel petty and undercut the original goal.
I enjoy following your thought process. I cannot make that fit with the content of the FAQ:
"In other words: It is preferable having fewer users, who all know they should cite, over having many users, who do not know they should cite.
If the goal had been to get more users, then the license would have been public domain.
...
The citation notice is about (indirect) funding - nothing else."
Does that fit with your assumption that "the GNU Parallel author(s) want/wants people to use and contribute to it"?
What kind of “funding” is referenced in the FAQ? Is there some kind of organization that I am not aware of that pays the author(s) for citations in papers? How is the “long term survival” impacted by whether the author receives citations?
I’m confused as to how “[not including citations] would not have been sustainable in the long term” unless either citations become money at some point or the author is motivated sufficiently by citations to the extent that they would otherwise not work on the project.
If you are an author or are involved in the project, please know that this isn’t intended to be an attack, I’m just interested as to why a project would do something that seems counterintuitive (at least from my point of view).
Isn't that explained in the link in the very first question?
https://lists.gnu.org/archive/html/parallel/2013-11/msg00006...
Make will build executables with time limits? A bit of hyperbole, don't you think? It's just some text.
Gooood who caaaaares.
They spent a lot of time and effort, made a cool thing, and gave it away for free. If it bothers you so much, just add the flag. Or patch it out.
parent comment was deleted. the tl;dr was "how rude of them to do the licensing flag thing"
Don't use it if you don't like it. It's that simple.
Do you feel this is covered in the FAQ? http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/cita...
The FAQ makes it worse:
"GNU Parallel is indirectly funded through citations. It is therefore important for the long term survival of GNU Parallel that it is cited. The citation notice makes users aware of this."
It's a bit like saying:
"Webkit is indirectly funded by iPhones. It is therefore important for the long term survival of Webkit that people purchase iPhones. The iPhone notice make users aware of this."
We probably have to agree to disagree on that one:
I see loads of commercials for buying iPhones. I do not see a lot of commercials for citing GNU Parallel.
If I have to make a citation, it will cost me no money, but one line of text if I write an article. If I have to buy an iPhone, it will cost me many hours of work.
To me the two things are not even close to being similar.
But I can find one similar aspect: No one forces you to use an iPhone.
Beyond being supremely irritating, nagware is simply not scalable.
Imagine if every utility, library, or driver in a typical Linux distribution took this approach. :(
I encourage Debian et al. to adopt a "no nagware" policy.
There's something similar in Debian Policy §2.3 (https://www.debian.org/doc/debian-policy/ch-archive.html#cop...):
> Programs whose authors encourage the user to make donations are fine for the main distribution, provided that the authors do not claim that not donating is immoral, unethical, illegal or something similar; in such a case they must go in non-free.
BTW, the nagware code has been removed in Debian unstable:
It's about time! In addition to my "it's annoying and simply not scalable" comment, the bug discussion brings up some additional compelling points:
1. It included a click-wrap agreement in violation of the Debian Free Software Guidelines.
2. Fishing for inappropriate citations should not be encouraged, as it compromises the integrity of scholarship.
I am trying to understand your take. Do you also call Firefox nagware, when it pops up with a dialog box where you can click "Don't show this again"?
To me the dialog box is actually worse, because the program often blocks until you close the dialog box (not 100% sure if that is the case with Firefox).
With GNU Parallel you run 'parallel --citation' once, and you are done. We are talking an effort of 15 seconds or less.
When I install a library I often have to run the install command and it often takes longer than 15 seconds.
Finally, I would like to understand why you do not just use another utility? Would that not solve your issue?
Nagware is irritating and simply not scalable.
At least with web browsers they are user-facing and you only have a few of them to deal with.
I never choose parallel intentionally, but I still encounter the nagware messages in the output of scripts that other people wrote. And disabling the nagware message on my laptop doesn't disable it in a container, in the cloud, etc. It's very annoying.
Wasting 15 seconds of human time certainly isn't scalable over dozens or hundreds of utilities. And applying the Steve Jobs computation[1]: 15 seconds * 1 million users = 15,000,000 seconds, which is nearly six months of wasted human lifetime.
Fortunately Debian-unstable seems to have fixed the issue by removing the nag message (which violates the DFSG). With luck this will propagate into mainline and into all of the downstream distributions like Ubuntu.
[1] https://www.folklore.org/StoryView.py?story=Saving_Lives.txt
It would be trivial to fork parallel. If people cared enough, a fork would appear and be adopted. That's the beauty of free software. If you don't like it you can change it.
It was already forked in its early Perl days, and thus it's very hard to use properly as a build tool, as the non-GNU version has an entirely different argument syntax, e.g. on macOS or BSD. You really have to probe for the GNU version (the more popular and newer one, with this awkward citation and begging), but for longer tasks it speeds up processing immensely. There's no need for Hadoop when you can use parallel. I'm processing hundreds of log files in one of my build steps (similar to PGO, profile-guided optimization), and with parallel it needs 30s, without it 3min. That makes a difference.
Where can I read about this fork?
Googling for "bsd parallel command" doesn't seem to show anything relevant.
Mixed it up. It was from moreutils, not bsd.
My configure.ac recipe for the proper parallel is this, setting logs_all to the GNU parallel version.
    dnl GNU parallel, skip the old non-perl version from moreutils so far
    AC_CHECK_PROGS([PARALLEL], [parallel])
    logs_all=logs-all-serial.sh.in
    if test -n "$PARALLEL"; then
      AC_MSG_CHECKING([PARALLEL version])
      parallel_version=`$PARALLEL --version 2>&1 | head -n1 | cut -c14-`
      case "$parallel_version" in
        [0-9]*)
          AC_MSG_RESULT([$parallel_version])
          logs_all=logs-all-parallel.sh.in
          ;;
        *invalid*)
          PARALLEL=
          parallel_version="skip old moreutils version, need GNU parallel"
      esac
      AC_MSG_RESULT([$parallel_version])
    fi
    AM_CONDITIONAL([HAVE_PARALLEL], [test -n "$PARALLEL"])

For the avoidance of doubt, moreutils' parallel is not a fork of GNU parallel. They are independent implementations.
Not really independent. Ole Tange tried to get his second parallel into findutils (to replace xargs and allow parallel processing) in 2005. Because it was written in Perl, findutils refused to add it. So Ole Tange contacted moreutils in 2009, but they never answered; later one of them, Tollef Fog Heen, rewrote parallel in C, with just minor API discrepancies. They say it predates GNU parallel because the 2005-2007 package was not a GNU project then. It was added to Savannah as an individual GNU project in 2010, when Ole gave up on inclusion into findutils.
moreutils parallel was written in 2008 and added to moreutils in 2009, just when Ole asked them. https://git.joeyh.name/index.cgi/moreutils.git/commit/?id=0f...
https://www.gnu.org/software/parallel/history.html
Hence I called it a "fork". Independent, yes, but when you know about the project, steal its name, and put your version into wide distribution under that same name because you think it has a better chance of being adopted, that is called a fork. Like a pitchfork. Poking into the original author's eyes with a sharp instrument.
If you are just looking for an alternative, GNU Parallel publishes a list of alternatives: https://www.gnu.org/software/parallel/parallel_alternatives....
A few slightly more advanced GNU Parallel features that I've used:
- --joblog writes out a detailed logfile of the jobs, which can be used to resume from interrupted runs with --resume{,-failed}
- `--slf filename` can be used to provide a list of ssh logins to remote worker nodes to run jobs. Importantly, parallel will automatically reread this list when it changes. This lets you very easily distribute batch jobs across preemptible gcloud vms (or ec2 spot instances) and gracefully handle worker nodes appearing/disappearing with just a few lines of bash https://gist.github.com/gpittarelli/5e14fb772ce0230a3c40ffad...
- When used with bash, parallel can run bash functions if you export them with `export -f functionName` .
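A minimal sketch tying those together (the function, file names, and worker list are all invented for illustration):

    # bash: export a function so parallel can call it by name
    doit() { gzip -9 "$1"; }
    export -f doit

    # --joblog records every job; --resume-failed reruns only the failures
    parallel --joblog /tmp/jobs.log doit ::: data/*.txt
    parallel --resume-failed --joblog /tmp/jobs.log doit ::: data/*.txt

    # --slf spreads jobs over the ssh logins listed in workers.txt;
    # parallel rereads the file when it changes, so workers can come and go
    parallel --slf workers.txt echo processed ::: task1 task2 task3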
Yeah, --joblog is a very handy feature. I once hacked a small Python script to produce an ASCII time plot from its output.
Those are all really good tips, thank you for sharing them.
I've never used GNU Parallel. But could someone explain to me the value add vs GNU xargs -P/--max-procs? From the examples at the top, it seems like those could be achieved with xargs.
The value add is that you don't need to pass `--max-procs N` to xargs yourself. By default N is 1 for xargs; for parallel, the default is N = the number of CPUs.
Additionally, you can run a series of unrelated commands that aren't from a list/piped in with parallel using the `--` syntax:
`parallel -j 3 -- ls df "echo hi"`
You can limit system load using parallel, which as far as I know isn't possible with xargs: `parallel --load L`, where L is the load average you want to stay below.
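For example, a hedged sketch (the command and threshold here are just placeholders):

    # do not start new jobs while the load average is 8 or above
    parallel --load 8 gzip -9 ::: *.log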
parallel is like xargs++; for simple cases it does the same thing as xargs, but it also has many more advanced features such as:
- Splitting input lines into multiple fields and building more complex commands from them
- Running jobs on remote nodes
- Pausing/resuming batch jobs (--joblog)
- ETA and progress bars
- Passing data to programs on stdin and generally many, many other ways of distributing and collecting data that xargs can't do
You can see a bunch of examples at: https://www.gnu.org/software/parallel/man.html
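For instance, the field-splitting might look like this (the CSV file and its columns are invented for illustration):

    # each line of jobs.csv is "name,url"; --colsep splits it into {1} and {2}
    cat jobs.csv | parallel --colsep ',' wget -O {1}.html {2}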
    $ PAGER=cat man xargs | wc -l
    259
    $ PAGER=cat man parallel | wc -l
    3985

OT: I got curious, and this also works:
    PAGER="wc -l" man xargs
(although my man page for xargs is just 211 lines)
Try changing your terminal's width and running it again ;)
Ah.
    $ MANWIDTH=80 PAGER="wc -l" man xargs
    292
A couple months ago, I parallelized execution of thousands of slow batch jobs on a fleet of remote servers. With parallel that was one command, including estimated time to completion and retries for failed jobs. It was awfully nice not to need to install or setup anything or spend time coding built-in features. Once it was done, I will almost certainly never run that exact operation again.
I normally use xargs for simple things, and if it's a regular business operation I'd set up a task queue, but there's a fair amount of work in the middle where it's nice to have a solid tool with most of the features you could want built in and tested.
It has some more granular control over "pasting" in values. For example, you can use {} for the arg value itself, {.} for just what's before the extension, {/} for the basename, {/.} for the basename without extension, etc. You can also get a progress bar, ETA, etc.
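A quick sketch of those placeholders (assuming ImageMagick's convert and some invented file names):

    # for photos/cat.jpg: {} = photos/cat.jpg, {.} = photos/cat, {/} = cat.jpg, {/.} = cat
    parallel convert {} {.}.png ::: photos/*.jpg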
I use both. xargs does a simple job reasonably well; if I'm just typing on the command line it often is the tool I use. parallel has many, many more options and ways to turn output from script A into parallel invocations of script B on multiple machines. parallel is also handy just for parsing filenames; it's become my default tool for manipulating a stream of filenames and then running commands on them. The parallel part is just a nice plus.
Parallel ensures that the output of each process is kept together, which can aid readability.
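A toy demonstration (output grouping is parallel's default, --group; `-u`/`--ungroup` turns it off):

    # each job's begin/end lines stay together; with xargs -P (or parallel -u) they could interleave
    parallel 'echo begin {}; sleep 1; echo end {}' ::: a b c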
Mainly parallel remote execution. Possibly resumption (depending on the task, see gcommer's comment).
You might want to look at:
https://unix.stackexchange.com/questions/104778/gnu-parallel...
I'll say that field separation / null termination is a bit annoying for xargs/find etc., but more so perhaps for novice users of the shell. I do like shell pipelines, but quoting can be gnarly.
To get an exhaustive answer you really need to go through the command line options as described in its man pages. xargs is good for simple stuff as long as you avoid some gotchas, but this really does a whole lot more, more than anyone can do justice to in a comment. AWK with GNU parallel is a surprisingly potent combination.
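A hedged sketch of that awk+parallel combination (the log files and column are invented for illustration):

    # run one awk per log file in parallel; each prints its file name and the sum of column 3
    parallel "awk '{s+=\$3} END {print FILENAME, s}' {}" ::: logs/*.log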
Parallel is Good Stuff (tm) and works very well but I haven't had much cause to use it.
For ad-hoc system modifications I've found myself using tmux's synchronize-panes feature, or xargs. For anything bigger or more involved then I break out Ansible/Chef/Puppet depending on which client project I'm working on.
I remember one place I worked at had a huge elaborate configuration/deployment system hand written by the head IT guy which used Parallel+bash+perl extensively. Thing is, while it was a great system, I could make the same changes in Ansible or Puppet with a couple of lines and push them within minutes, while making changes using the hand written system might take hours. Plus no logging and poor error handling led to all sorts of problems with that system, despite it being a real labour of love by that wacky Finnish dude.
However this sheet is really nice because it is just one side of a letter/A4 piece of paper and lays out the information clearly. I definitely want to mess around with Parallel now because of this cheat sheet. I wonder how it was typeset or laid out on the page? I try to write my own cheat sheets but they always seem way too sparse with too much white space. Maybe it is written in LaTeX or similar.
Not LaTeX, but LibreOffice: http://git.savannah.gnu.org/cgit/parallel.git/tree/src/paral...
I use GNU Parallel for pulling stock data from various sources, massaging it, creating flatfiles of the data, creating models of the data, etc.
I also use it as a rudimentary queue system for stacking up the next jobs (while scripts stack up the next jobs, but..).
It had a bit of a learning curve because the docs are really technical and not geared enough towards new users, but reading and re-reading and trying some examples helped cement it.
Here are a few ways I use it:
    echo "Number of RAR archives: "$(ls *.rar | wc -l)
    ls *.rar | parallel -j0 1_1_rarFilesExtraction
    ls -d stocks_all/Intraday/*.txt | parallel -j${ccj}% 1_2_stockFileProcessing {}
I'd like to scale this to work across multiple machines (as Parallel can do), but I get really tempted to write my own parallel processor just to rely on my own code.
My favorite parallel command:

    $ find ~/Source/folder -name .git | parallel "cd {}/.. ; git pull ; git checkout -b new_branch"
Each time I've seen something about GNU parallel pop up I've been tempted to post, but I've never made an account until now.
I wrote a very different style of command parallelizer that I named lateral. It doesn't require constructing elaborate commandlines that define all of your work at once. You start a server, and separate invocations of 'lateral run' add your commands to a queue to run on the server, including their filedescriptors. It makes for easier parallelization of complex arguments.
Take a look if this sort of thing interests you, as I haven't seen anyone write one like this before. Its primary difference is the ease with which each separate command can output to its own log, and the lack of need to play games with shell quoting and positional arguments.
Check it out: https://github.com/akramer/lateral
I think it is good you finally made an account: How are people going to find your software if you do not tell them about it :)
Can you make a comparison between lateral and sem?
This looks neat! Much <3 for using Golang and YAML.
Can a single lateral server queue be used across multiple host machines? And in the other direction, can lateral launch and monitor processes that reside across multiple machines?
Lots of good examples also here: https://www.gnu.org/software/parallel/man.html
If you're using GNU Parallel for simple, non-parallel command line tasks and scripting, I've written a tool which I find to be much more intuitive:
https://github.com/Miserlou/Loop
The author of GNU Parallel wrote a pretty detailed comparison, which you can find in the linked README.
Your tool looks nice, but it doesn't seem to parallelize the work in any way.
Never mind, missed your point about not being parallel.
Is there a Rust port of GNU parallel? It's written in Perl, and having to install dependencies for Perl is not as simple as downloading a binary :)
https://github.com/mmstick/parallel but it's unmaintained and the author wanted to do a rewrite.
Still often the simplest way to get parallel computation in python, sadly.
The multiprocessing module is pretty good in Python.
Not in my experience. It has edge cases, especially on Windows. It's understandable though; if you look under the covers there is a huge amount of complexity there.
Logging was lost last time I tried it.
Had this issue recently. Turns out there's a great library for this specifically: https://github.com/jruere/multiprocessing-logging
pickle is gross.
It's often the only way you really need.