Dimsum is like head or tail, but pulls randomly from the entire file or stream

blog.noblemail.ca

29 points by snoble 13 years ago · 30 comments

neilk 13 years ago

This doesn't come with an elaborate test suite, but it does pretty much everything dimsum does in a few dozen lines of Perl.

https://github.com/neilk/misc/blob/master/randline

I've had this script (or versions of it) around for more than a decade. I didn't know the technique had a name.

  • andrewcooke 13 years ago

    it was an example in the original camel book.

    [edit: i was going to delete this, but since you replied i'll leave it - it does appear (too?) in the camel book, on p 246 of my copy, but like you say, it's for a single line. hadn't opened that book in years, took me some time to find it...]

  • snobleOP 13 years ago

    the code isn't special. but it's easy for anyone to install and use

andrewcooke 13 years ago

uses reservoir sampling - http://en.wikipedia.org/wiki/Reservoir_sampling

(so it presumably consumes the entire stream before giving any results; any alternative i can think of would not be "really random" unless you knew the length of the stream in advance).

  • snobleOP 13 years ago

    the magic of reservoir sampling is that the memory footprint is the size of the output while the sample is still completely random. In this particular implementation the order of the output is slightly biased, but each row has an equal chance of being in the output.
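
    For reference, the whole technique is only a handful of lines. A rough sketch in Perl of the basic idea (not the actual dimsum code; it just reads stdin and takes the sample size as its first argument):

        #!/usr/bin/perl
        # Keep a uniform random sample of k lines from a stream of unknown length.
        use strict;
        use warnings;

        my $k = shift || 10;                      # size of the reservoir
        my @sample;

        while (my $line = <STDIN>) {
            if ($. <= $k) {
                push @sample, $line;              # fill the reservoir with the first k lines
            }
            elsif (rand($.) < $k) {               # keep line n with probability k/n
                $sample[int rand $k] = $line;     # evict a random slot
            }
        }

        print @sample;                            # only now, once the stream is done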

    • snobleOP 13 years ago

      but yes. it has to wait until the stream is done before producing any output.

  • avibryant 13 years ago

    Yep, though a feature request I've put in is to respond to a ctl-c by producing the results from the stream so far... that way if it's taking a while on a large file you can interrupt and still get something useful.

    • cgs1019 13 years ago

      This breaks ctl-c in my opinion. When I ctl-c I want shit to stop, not dump (potentially large quantities of) output into my terminal.

      • wodow 13 years ago

        It could accept another signal (e.g. SIGUSR1) and have this clearly documented.

        • neilk 13 years ago

          I just added the SIGUSR1 feature to my hacky perl script (see my other comment).

          e.g.

              $ (while true; do cat /usr/share/dict/words; done;) | ./randline 3 &
              [2] 93937
              
              $ kill -s SIGUSR1 93937
              declinograph
              brotheler
              woolpack
          
              $ kill -s SIGUSR1 93937
              lustrify
              brotheler
              bromophenol
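
          The handler itself is only a line or two. Roughly (a sketch of the idea, assuming the reservoir lives in an array called @sample; not the exact randline code):

              $SIG{USR1} = sub { print @sample };   # dump the current reservoir on demand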
  • cgs1019 13 years ago

    Maybe I'm overlooking something obvious but wouldn't piping through awk 'rand()<.1{print}' give a decent streaming random sample of roughly 10% of input?

    • minimax 13 years ago

      Reservoir sampling allows you to return a specific number of samples regardless of the size of the input rather than a specific fraction of samples (which is what your awk script does).

      • philsnow 13 years ago

        On the other hand, the awk script handles streaming quite nicely. If you have a few hundred machines and you want to get a sense of the log messages on all of them (in real time; say you're about to change something and want to eyeball whether the mix of error messages changes), you could do something very quick and dirty with that awk script.

    • taltman1 13 years ago

      As minimax stated, your awk code won't provide exactly k samples. Here's a bit of awk code that implements reservoir sampling (apologies in advance for any bugs) and prints the current sample to stdout whenever it updates, without needing to first process the entire stream; successive sets of samples are separated by a distinct string. It is called as follows (of course, replace dmesg with any stream of your choosing):

        dmesg | gawk -f reservoir-sample.awk k=5 record_separator='==='
      
      
        #!/usr/bin/gawk -f
      
        ### reservoir-sample.awk
        ###
        ### Sample k random lines from a stream, without knowing the size of the stream.
        ###
        ### (Tomer Altman)
      
        ### Parameters: (set from command-line)
        ##
        ## k: number of lines to sample
        ## record_separator: string used to separate iterations of the sample in stdout.
      
        ## Define a function for returning a number between 1 and n:
        function random_int (n) { return 1 + int(rand() * n) }
      
        ## Initialize the PRNG seed:
        BEGIN { srand() }
      
        ## For the first k lines, initialize the sample array:
      
        NR <= k { sample[NR] = $0 }
      
        ## If we've initialized the sample array, and we pick an integer between one and the current record number that is less than or equal to k, update the sample array and print it to stdout:
        NR > k && (current_random_int = random_int(NR)) <= k {
      
          sample[current_random_int] = $0
      
          for (i in sample) print sample[i]
          
          print (record_separator != "" ? record_separator : "--")
        }
nullc 13 years ago

uhhh. You mean like shuf -n NNNN ?

  • bo1024 13 years ago

    I wonder if the implementation of shuf would handle very large input efficiently? Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage. But I don't know how shuf works.

    • teraflop 13 years ago

      Doesn't look like it. I just tried running "yes | shuf -n 1" (using the latest version of GNU coreutils, 8.20) and its memory consumption increased steadily until I killed it.

      It seems like this would be a really useful improvement, and I'm surprised that it doesn't already seem to have been requested on the coreutils issue tracker.

      • malcook 13 years ago

        did you try "yes | dimsum -n 1"?

        In my hands, `top` shows resident memory increasing steadily too....

        It is perhaps more instructive to compare output from, for example,

            seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.dimsum.100000.out.%p dimsum -n 1

        with

            seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.shuf.100000.out.%p shuf -n 1

        In my hands, shuf is faster and uses less memory for this task.

        How about you?

        • snobleOP 13 years ago

          sigh, memory leak. It's fixed in github. When camilo is around I'll get him to update the gem

      • malcook 13 years ago

        Agreed it would be.

        So, let's pursue this: http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079...

  • nburger 13 years ago

    or "sort -R"...

    • gwern 13 years ago

      sort -R isn't actually a random permutation or shuffle like one would ordinarily expect from a 'random' output: if you read the man page very carefully, you'll notice it only speaks of a 'random hash', which means that two lines which are identical (quite possible and common) will be hashed to the same value and placed consecutively in the output.

      (This bit me once. Filed a bug about maybe clarifying the man page, but nope, apparently we're all supposed to recognize instantly the implication of a random hash.)
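
      A quick way to see it with GNU sort:

          $ printf 'a\nb\na\nb\n' | sort -R

      The duplicate lines always come out adjacent (only the order of the groups varies from run to run), whereas shuf on the same input gives a genuine random permutation.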

patrick_grant 13 years ago

I really don't like how this behaves for populating the array initially, and how it behaves for small inputs...

  $ seq 15 | dimsum -n 10
  14
  12
  3
  4
  5
  6
  7
  8
  9
  10
  • taltman1 13 years ago

    Looks like a valid sample to me. Are you bothered by the ordering of the sample members? Then I'd continue the pipeline to include a call to shuf.
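
    e.g.

        $ seq 15 | dimsum -n 10 | shuf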

  • snobleOP 13 years ago

    yup, the order is biased. definitely worth a couple of minutes to fix that
