GNU Parallel Cheat Sheet [pdf] (gnu.org)

Ah, the rare case of nagware in GNU.
From the man page:
"--citation Print the BibTeX entry for GNU parallel and silence citation notice. If it is impossible for you to run --bibtex you can use --will-cite. If you use --will-cite in scripts to be run by others you are making it harder for others to see the citation notice. The development of GNU parallel is indirectly financed through citations, so if your users do not know they should cite then you are making it harder to finance development. However, if you pay 10000 EUR, you should feel free to use --will-cite in scripts."
Asking for donations/citations is one thing, but putting this junk about 10000 EUR in the man page and nagging users is quite an annoyance. How GNU allows such junk in their man pages puzzles me. Obviously the GPL allows one to remove the nagware and redistribute, but I don't know if anyone has forked it.
Yes. Another issue: the author has been promoting GNU Parallel pretty heavily in many StackOverflow questions dealing with xargs or parallel execution, even when an additional tool is neither needed nor wanted (because a) it's not already installed, unlike xargs, and b) of the aforementioned citation issue, which I disagree with).
It's a great tool I'm sure, but I've been able to get by using just xargs, flock, etc. for most use cases.
Your criticism would be fair if FOSS authors were rightly compensated for their work.
This isn't nearly the case, so until then, blaming FOSS authors for some experimentation is just unwarranted.
They can do experiments and distribute their work through their own channels, then. But GNU tools (and FOSS by extension) are so popular because of their no-nonsense philosophy: here it is, do with it whatever you want. Run it anywhere and any way you please.
Citing it or not is an issue of academic practice/considerations (whether its use was a significant part of the research etc.). Mandating it through nag messages is too much.
What's next? make will print ads while the compilation runs? GIMP will watermark my images if I don't pay 10K or promise to cite it if I make figures for my paper?
So again, my main confusion is about how this can be an official GNU tool.
Are your questions answered in the FAQ? http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/cita...
The FAQ includes a lot of support for what seems like the wrong hill to die on.
Here’s my thought process:
- the GNU Parallel author(s) want/wants people to use and contribute to it.
- they think that most users are academics who write papers and that potential users will find the project after reading the citation, which may or may not be true
- they include a nagware message that “reminds” users to cite the software
- despite the message being controversial and being the subject of the #1 comment in an otherwise unrelated HN thread about the software in general, an FAQ is written to back up the existence of this message
This brings me to the question of whether the inclusion of this message acts more as a deterrent to potential contributors and users. I agree with the motivation, but the means feel petty and undercut the original goal.
I enjoy following your thought process. I cannot make that fit with the content of the FAQ:
"In other words: It is preferable having fewer users, who all know they should cite, over having many users, who do not know they should cite.
If the goal had been to get more users, then the license would have been public domain.
...
The citation notice is about (indirect) funding - nothing else."
Does that fit with your assumption that "the GNU Parallel author(s) want/wants people to use and contribute to it"?
What kind of “funding” is referenced in the FAQ? Is there some kind of organization that I am not aware of that pays the author(s) for citations in papers? How is the “long term survival” impacted by whether the author receives citations?
I’m confused as to how “[not including citations] would not have been sustainable in the long term” unless either citations become money at some point or the author is motivated sufficiently by citations to the extent that they would otherwise not work on the project.
If you are an author or are involved in the project, please know that this isn’t intended to be an attack, I’m just interested as to why a project would do something that seems counterintuitive (at least from my point of view).
Isn't that explained in the link in the very first question?
https://lists.gnu.org/archive/html/parallel/2013-11/msg00006...
Make will build executables with time limits? A bit of hyperbole, don't you think? It's just some text.
Gooood who caaaaares.
They spent a lot of time and effort, made a cool thing, and gave it away for free. If it bothers you so much, just add the flag. Or patch it out.
parent comment was deleted. the tl;dr was "how rude of them to do the licensing flag thing"
Don't use it if you don't like it. It's that simple.
Do you feel this is covered in the FAQ? http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/cita...
The FAQ makes it worse:
"GNU Parallel is indirectly funded through citations. It is therefore important for the long term survival of GNU Parallel that it is cited. The citation notice makes users aware of this."
It's a bit like saying:
"Webkit is indirectly funded by iPhones. It is therefore important for the long term survival of Webkit that people purchase iPhones. The iPhone notice make users aware of this."
We probably have to agree to disagree on that one:
I see loads of commercials for buying iPhones. I do not see a lot of commercials for citing GNU Parallel.
If I have to make a citation, it will cost me no money, but one line of text if I write an article. If I have to buy an iPhone, it will cost me many hours of work.
To me the two things are not even close to being similar.
But I can find one similar aspect: No one forces you to use an iPhone.
Beyond being supremely irritating, nagware is simply not scalable.
Imagine if every utility, library, or driver in a typical Linux distribution took this approach. :(
I encourage Debian et al. to adopt a "no nagware" policy.
There's something similar in Debian Policy §2.3 (https://www.debian.org/doc/debian-policy/ch-archive.html#cop...):
> Programs whose authors encourage the user to make donations are fine for the main distribution, provided that the authors do not claim that not donating is immoral, unethical, illegal or something similar; in such a case they must go in non-free.
BTW, the nagware code has been removed in Debian unstable:
It's about time! In addition to my "it's annoying and simply not scalable" comment, the bug discussion brings up some additional compelling points:
1. It included a click-wrap agreement in violation of the Debian Free Software Guidelines.
2. Fishing for inappropriate citations should not be encouraged, as it compromises the integrity of scholarship.
I am trying to understand your take. Do you also call Firefox nagware, when it pops up with a dialog box where you can click "Don't show this again"?
To me the dialog box is actually worse, because the program often blocks until you close the dialog box (not 100% sure if that is the case with Firefox).
With GNU Parallel you run 'parallel --citation' once, and you are done. We are talking an effort of 15 seconds or less.
When I install a library I often have to run the install command and it often takes longer than 15 seconds.
Finally, I would like to understand why you do not just use another utility? Would that not solve your issue?
Nagware is irritating and simply not scalable.
At least with web browsers they are user-facing and you only have a few of them to deal with.
I never choose parallel intentionally, but I still encounter the nagware messages in the output of scripts that other people wrote. And disabling the nagware message on my laptop doesn't disable it in a container, in the cloud, etc. It's very annoying.
Wasting 15 seconds of human time certainly isn't scalable over dozens or hundreds of utilities. And applying the Steve Jobs computation[1]: 15 seconds * 1 million users = 15,000,000 seconds, which is nearly six months of wasted human lifetime.
Fortunately Debian-unstable seems to have fixed the issue by removing the nag message (which violates the DFSG). With luck this will propagate into mainline and into all of the downstream distributions like Ubuntu.
[1] https://www.folklore.org/StoryView.py?story=Saving_Lives.txt
It would be trivial to fork parallel. If people cared enough, a fork would appear and be adopted. That's the beauty of free software. If you don't like it you can change it.
It was already forked in its early Perl days, and thus it's very hard to use properly as a build tool, as the non-GNU version has an entirely different argument syntax, e.g. on macOS or BSD. You really have to probe for the GNU version (the more popular and newer one, with this awkward citation and begging), but for longer tasks it speeds up processing immensely. There's no need for Hadoop when you can use parallel. I'm processing hundreds of log files in one of my build steps (similar to PGO, profile-guided optimization), and with parallel it needs 30s, without it 3min. That makes a difference.
Where can I read about this fork?
Googling for "bsd parallel command" doesn't seem to show anything relevant.
Mixed it up. It was from moreutils, not bsd.
My configure.ac recipe for the proper parallel is this, setting logs_all to the GNU parallel version.
    dnl GNU parallel, skip the old non-perl version from moreutils so far
    AC_CHECK_PROGS([PARALLEL], [parallel])
    logs_all=logs-all-serial.sh.in
    if test -n "$PARALLEL"; then
      AC_MSG_CHECKING([PARALLEL version])
      parallel_version=`$PARALLEL --version 2>&1 | head -n1 | cut -c14-`
      case "$parallel_version" in
        [0-9]*)
          AC_MSG_RESULT([$parallel_version])
          logs_all=logs-all-parallel.sh.in
          ;;
        *invalid*)
          PARALLEL=
          parallel_version="skip old moreutils version, need GNU parallel"
      esac
      AC_MSG_RESULT([$parallel_version])
    fi
    AM_CONDITIONAL([HAVE_PARALLEL], [test -n "$PARALLEL"])

For the avoidance of doubt, moreutils' parallel is not a fork of GNU parallel. They are independent implementations.
Not really independent. Ole Tange tried to get his second parallel into findutils (to replace xargs and allow parallel processing) in 2005. Because it was written in Perl, findutils refused to add it. So Ole Tange contacted moreutils in 2009, but they never answered; later one of them, Tollef Fog Heen, rewrote parallel in C, with just minor API discrepancies. They say it predates GNU parallel because the 2005-2007 package was not a GNU project then. It was added to Savannah as an individual GNU project in 2010, when Ole gave up on inclusion into findutils.
moreutils parallel was written in 2008 and added to moreutils in 2009, just when Ole asked them. https://git.joeyh.name/index.cgi/moreutils.git/commit/?id=0f...
https://www.gnu.org/software/parallel/history.html
Hence I called it a "fork". Independent, yes, but when you know about the project, steal its name, and put your version into wide distribution under that same name because you think it has a better chance of being adopted, that is called a fork. Like a pitchfork. Poking into the original author's eyes with a sharp instrument.
If you are just looking for an alternative, GNU Parallel publishes a list of alternatives: https://www.gnu.org/software/parallel/parallel_alternatives....
A few slightly more advanced GNU Parallel features that I've used:
- --joblog writes out a detailed logfile of the jobs, which can be used to resume from interrupted runs with --resume{,-failed}
- `--slf filename` can be used to provide a list of ssh logins to remote worker nodes to run jobs. Importantly, parallel will automatically reread this list when it changes. This lets you very easily distribute batch jobs across preemptible gcloud vms (or ec2 spot instances) and gracefully handle worker nodes appearing/disappearing with just a few lines of bash https://gist.github.com/gpittarelli/5e14fb772ce0230a3c40ffad...
- When used with bash, parallel can run bash functions if you export them with `export -f functionName` .
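A minimal sketch tying those together (the function, file names, and worker list are all invented for illustration):

    # bash: export a function so parallel can call it by name
    doit() { gzip -9 "$1"; }
    export -f doit

    # --joblog records every job; --resume-failed reruns only the failures
    parallel --joblog /tmp/jobs.log doit ::: data/*.txt
    parallel --resume-failed --joblog /tmp/jobs.log doit ::: data/*.txt

    # --slf spreads jobs over the ssh logins listed in workers.txt;
    # parallel rereads the file when it changes, so workers can come and go
    parallel --slf workers.txt echo processed ::: task1 task2 task3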
Yeah, --joblog is a very handy feature. I once hacked a small Python script to produce an ASCII time plot from its output.
Those are all really good tips, thank you for sharing them.
I've never used GNU Parallel. But could someone explain to me the value add vs GNU xargs -P/--max-procs? From the examples at the top, it seems like those could be achieved with xargs.
The value add is that you don't need to pass `--max-procs N` to xargs yourself. By default N is 1 for xargs; for parallel, the default is N = the number of CPUs.
Additionally, you can run a series of unrelated commands that aren't from a list/piped in with parallel using the `--` syntax:
`parallel -j 3 -- ls df "echo hi"`
You can limit system load using parallel, which as far as I know isn't possible with xargs: `parallel --load L`, where L is the load average you want to stay below.
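For example, a hedged sketch (the command and threshold here are just placeholders):

    # do not start new jobs while the load average is 8 or above
    parallel --load 8 gzip -9 ::: *.log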
parallel is like xargs++; for simple cases it does the same thing as xargs, but it also has many more advanced features such as:
- Splitting input lines into multiple fields and building more complex commands from them
- Running jobs on remote nodes
- Pausing/resuming batch jobs (--joblog)
- ETA and progress bars
- Passing data to programs on stdin and generally many, many other ways of distributing and collecting data that xargs can't do
You can see a bunch of examples at: https://www.gnu.org/software/parallel/man.html
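For instance, the field-splitting might look like this (the CSV file and its columns are invented for illustration):

    # each line of jobs.csv is "name,url"; --colsep splits it into {1} and {2}
    cat jobs.csv | parallel --colsep ',' wget -O {1}.html {2}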
    $ PAGER=cat man xargs | wc -l
    259
    $ PAGER=cat man parallel | wc -l
    3985

OT: I got curious, and this also works:
    PAGER="wc -l" man xargs
(although my man page for xargs is just 211 lines)
Try changing your terminal's width and running it again ;)
Ah.
    $ MANWIDTH=80 PAGER="wc -l" man xargs
    292
A couple months ago, I parallelized execution of thousands of slow batch jobs on a fleet of remote servers. With parallel that was one command, including estimated time to completion and retries for failed jobs. It was awfully nice not to need to install or setup anything or spend time coding built-in features. Once it was done, I will almost certainly never run that exact operation again.
I normally use xargs for simple things, and if it's a regular business operation I'd set up a task queue, but there's a fair amount of work in the middle where it's nice to have a solid tool with most of the features you could want built in and tested.
It has some more granular control over "pasting" in values. For example, you can use {} for the arg value itself, {.} for just what's before the extension, {/} for the basename, {/.} for the basename without extension, etc. You can also get a progress bar, ETA, etc.
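A quick sketch of those placeholders (assuming ImageMagick's convert and some invented file names):

    # for photos/cat.jpg: {} = photos/cat.jpg, {.} = photos/cat, {/} = cat.jpg, {/.} = cat
    parallel convert {} {.}.png ::: photos/*.jpg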
I use both. xargs does a simple job reasonably well; if I'm just typing on the command line it often is the tool I use. parallel has many, many more options and ways to turn output from script A into parallel invocations of script B on multiple machines. parallel is also handy just for parsing filenames; it's become my default tool for manipulating a stream of filenames and then running commands on them. The parallel part is just a nice plus.
Parallel ensures that the output of each process is kept together, which can aid readability.
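A toy demonstration (output grouping is parallel's default, --group; `-u`/`--ungroup` turns it off):

    # each job's begin/end lines stay together; with xargs -P (or parallel -u) they could interleave
    parallel 'echo begin {}; sleep 1; echo end {}' ::: a b c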
Mainly parallel remote execution. Possibly resumption (depending on the task, see gcommer's comment).
You might want to look at:
https://unix.stackexchange.com/questions/104778/gnu-parallel...
I'll say that field separation / null termination is a bit annoying for xargs/find etc., but more so perhaps for novice users of the shell. I do like shell pipelines, but quoting can be gnarly.
To get an exhaustive answer you really need to go through the command line options as described in its man pages. xargs is good for simple stuff as long as you avoid some gotchas, but this really does a whole lot more, more than anyone can do justice to in a comment. AWK with GNU parallel is a surprisingly potent combination.
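A hedged sketch of that awk+parallel combination (the log files and column are invented for illustration):

    # run one awk per log file in parallel; each prints its file name and the sum of column 3
    parallel "awk '{s+=\$3} END {print FILENAME, s}' {}" ::: logs/*.log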
Parallel is Good Stuff (tm) and works very well but I haven't had much cause to use it.
For ad-hoc system modifications I've found myself using tmux's synchronize-panes feature, or xargs. For anything bigger or more involved then I break out Ansible/Chef/Puppet depending on which client project I'm working on.
I remember one place I worked at had a huge elaborate configuration/deployment system hand written by the head IT guy which used Parallel+bash+perl extensively. Thing is, while it was a great system, I could make the same changes in Ansible or Puppet with a couple of lines and push them within minutes, while making changes using the hand written system might take hours. Plus no logging and poor error handling led to all sorts of problems with that system, despite it being a real labour of love by that wacky Finnish dude.
However this sheet is really nice because it is just one side of a letter/A4 piece of paper and lays out the information clearly. I definitely want to mess around with Parallel now because of this cheat sheet. I wonder how it was typeset or laid out on the page? I try to write my own cheat sheets but they always seem way too sparse with too much white space. Maybe it is written in LaTeX or similar.
Not LaTeX, but LibreOffice: http://git.savannah.gnu.org/cgit/parallel.git/tree/src/paral...
I use GNU Parallel for pulling stock data from various sources, massaging it, creating flatfiles of the data, creating models of the data, etc.
I also use it as a rudimentary queue system for stacking up the next jobs (while scripts stack up the next jobs, but..).
It had a bit of a learning curve because the docs are really technical and not geared enough towards new users, but reading and re-reading and trying some examples helped cement it.
Here are a few ways I use it:
    echo "Number of RAR archives: "$(ls *.rar | wc -l)
    ls *.rar | parallel -j0 1_1_rarFilesExtraction
    ls -d stocks_all/Intraday/*.txt | parallel -j${ccj}% 1_2_stockFileProcessing {}
I'd like to scale this to work across multiple machines (as Parallel can do), but I get really tempted to write my own parallel processor just to rely on my own code.
My favorite parallel command:

    $ find ~/Source/folder -name .git | parallel "cd {}/.. ; git pull ; git checkout -b new_branch"
Each time I've seen something about GNU parallel pop up I've been tempted to post, but I've never made an account until now.
I wrote a very different style of command parallelizer that I named lateral. It doesn't require constructing elaborate commandlines that define all of your work at once. You start a server, and separate invocations of 'lateral run' add your commands to a queue to run on the server, including their filedescriptors. It makes for easier parallelization of complex arguments.
Take a look if this sort of thing interests you, as I haven't seen anyone write one like this before. Its primary difference is the ease with which each separate command can output to its own log, and the lack of need to play games with shell quoting and positional arguments.
Check it out: https://github.com/akramer/lateral
I think it is good you finally made an account: How are people going to find your software if you do not tell them about it :)
Can you make a comparison between lateral and sem?
This looks neat! Much <3 for using Golang and YAML.
Can a single lateral server queue be used across multiple host machines? And in the other direction, can lateral launch and monitor processes that reside across multiple machines?
Lots of good examples also here: https://www.gnu.org/software/parallel/man.html
If you're using GNU Parallel for simple, non-parallel command line tasks and scripting, I've written a tool which I find to be much more intuitive:
https://github.com/Miserlou/Loop
The author of GNU Parallel wrote a pretty detailed comparison, which you can find in the linked README.
Your tool looks nice, but it doesn't seem to parallelize the work in any way.
Never mind, missed your point about not being parallel.
Is there a Rust port of GNU parallel? It's written in Perl, and having to install dependencies for Perl is not as simple as downloading a binary :)
https://github.com/mmstick/parallel but it's unmaintained and the author wanted to do a rewrite.
Still often the simplest way to get parallel computation in python, sadly.
The multiprocessing module is pretty good in Python.
Not in my experience. It has edge cases, especially on Windows. It's understandable though; if you look under the covers there is a huge amount of complexity there.
Logging was lost last time I tried it.
Had this issue recently. Turns out there's a great library for this specifically: https://github.com/jruere/multiprocessing-logging
pickle is gross.
It's often the only way you really need.