Show HN: st – simple statistics from the command line
For casual purposes st may be convenient, but it doesn't have state-of-the-art numerical stability:

    my $variance = $count > 1 ? ($sum_square - ($sum**2/$count)) / ($count-1) : undef;
Taking the difference between two similar numbers loses precision, and in extreme cases squaring the raw numbers could cause overflow. For comparison, see the recently posted http://www.python.org/dev/peps/pep-0450/ and https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...

Thanks! I changed the algorithm to online variance; I hope it is more stable:
https://github.com/nferraz/st/commit/d0fb1bf814fc5940c5aae39...
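For reference, a Welford-style online update looks roughly like this. This is a minimal Perl sketch of the technique, not necessarily the exact code in that commit:

    # one pass over the data; no large intermediate sums that cancel
    my ($n, $mean, $m2) = (0, 0, 0);

    while (defined(my $x = <STDIN>)) {
        chomp $x;
        $n++;
        my $delta = $x - $mean;
        $mean += $delta / $n;
        $m2   += $delta * ($x - $mean);   # uses the updated mean
    }

    my $variance = $n > 1 ? $m2 / ($n - 1) : undef;   # sample variance

Because each step works with deviations from the running mean rather than with raw sums of squares, the cancellation in $sum_square - $sum**2/$count never arises.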
If you need numerical stability, I'd use Gary Perlman's |STAT (http://oldwww.acm.org/perlman/stat/history.html). It's older and somewhat harder to get a copy of, but it's as reliably correct as a piece of software can be...
I've just had a look at the |STAT source, and it computes the variance with

    double M = Sum/N;                 /* mean */
    double var = (s2 - M*Sum)/(N-1);  /* variance */

where s2 is the sum of squares. In most reasonable situations this approach will work fine; it just doesn't accept as wide a range of inputs as is easily achievable with ordinary double-precision floats. In fairness, the |STAT terms and conditions state:

    |STAT PROGRAMS HAVE NOT BEEN VALIDATED FOR LARGE DATASETS, HIGHLY VARIABLE DATA, NOR VERY LARGE NUMBERS.
I'd just use octave. It's as simple as
$ octave
octave:1> a=load('numbers.txt');
octave:2> sum(a)
ans = 55
octave:3> mean(a)
ans = 5.5000
octave:4> std(a)
ans = 3.0277
octave:5> quantile(a)
ans =
1.0000
3.0000
5.5000
8.0000
10.0000
etc.

I like octave and R!
The reason I wrote this script was to get quick results from the command line.
For instance, I can use grep, cut and other unix tools to extract the numbers from a file and make quick calculations.
Of course, for complex processing I would use octave or R.
Yeah, I was thinking about that and spent the past few minutes making myself some Bash functions like:

    function mean() { octave -q --eval "mean = mean(load('$1'))"; }

Then just run "mean numbers.txt".

I am sure your approach is much quicker; octave takes a good 0.5s(!) to load on my machine.
Yup, octave requires more time to warm up.
Regarding speed, for simple calculations like sum, mean and variance, the bottleneck is in I/O.
Would you be able to use Octave for reading from stdin?
Yes, Sprint suggested this:

    octave -q --eval "mean = mean(load('$1'))"

But, again, octave requires more time to warm up...

That does not read the data from stdin. You could probably get it to work with some bash wizardry, but maybe not. I did spend some time on this a few years ago and it may have changed in the interim, but my memory is that I tried load("/dev/stdin"), using a fifo, etc., and it doesn't work (probably because load() seeks around to determine the matrix shape before reading the data in, i.e. data is read in as columns instead of being read in rows and transposed). At least that was my takeaway, if you want to use the load() builtin.
Basically you just need to write an octave function that reads values from standard input.
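A minimal sketch of that idea, assuming fscanf() on stdin behaves as documented (untested on older Octave versions):

    # read all whitespace-separated numbers from stdin into a column vector
    x = fscanf(stdin, "%f");
    printf("mean = %g\n", mean(x));

which should allow something like: seq 1 10 | octave -q --eval 'x = fscanf(stdin, "%f"); disp(mean(x))'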
One suggestion: whatever the default may be, give an option to have line-delimited output rather than column-delimited.
IMHO if you want your script's output to be easily usable by other scripts, line-delimited is easier, since you can grep out the lines you want rather than relying on the column position never changing (you can give cut only a field number, not a field name like "average").
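To illustrate the point (these output shapes are hypothetical, not necessarily st's actual format), compare

    N     mean    sd
    10    5.5     3.03

with

    N: 10
    mean: 5.5
    sd: 3.03

The second form survives grep '^mean:' even if statistics are added or reordered, while the first forces consumers to hard-code a field position for cut or awk.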
suckless' terminal emulator already uses the name st, though it's not quite popular enough to be in any major repos.
Thanks for the information!
I wanted to use "stat", but it was already taken (it displays file status); "statistics" was too long.

Just as a curiosity: I got the idea for this script when I wanted to calculate the sum of some numbers and discovered that the "sum" command already serves another purpose (displaying file checksums and block counts)!
"sta" seems to be available.
maybe you are not aware of it, but there is a nifty little tool in freebsd called ministat that somewhat overlaps with what you did, maybe of interest:
http://www.freebsd.org/cgi/man.cgi?query=ministat&apropos=0&...
EXACTLY!
I ported this tool to Linux for the Arch Linux package forever ago:
https://github.com/codemac/ministat
There are a few forks (adding autoconf, an osx branch, etc) as well.
Nice! They definitely overlap, although there are a few differences... I'm not sure if ministat accepts bignum, scientific notation, etc.
Lovely! As someone with my own little script to sum up the values in a given column, I can see how you'd want to have this tool sitting ready to hand in ~/bin or wherever. And this script seems to adhere better to the Unix way than mine, since it's easy to use cut(1) to extract whatever column you want, but it makes sense for one tool to do sum, mean, sd, etc. Thanks for sharing!
Nice, is there a maximum rowcount? What would be nice is a way to do a sum on a second or third column - or would you use awk to get those and pipe the result to st?
There isn't a max rowcount for sum, mean, variance, etc, because it is not necessary to hold the data in memory.
Calculating the median and quartiles requires storing the whole set and sorting it, so those are limited by the available memory.
Regarding your suggestion -- I'm considering the idea of dealing with multiple columns and even CSV and other types of tabulated data.
cut is also a good option when dealing with column-based data. Just specify the delimiter and which column you want.
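For example, either of these would feed a single column to st (the file names here are made up):

    awk '{print $3}' data.txt | st    # third whitespace-separated column
    cut -d',' -f2 data.csv | st       # second comma-separated column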
So, http://suso.suso.org/programs/num-utils/index.phtml already exists and is written in perl. It seems like your major contribution is a statistical slant, which might be compatible with the existing code base.
edit: it's perl, not python. brainfart on my part.
I wrote a program to quickly generate histograms from data. Seems like it would complement "st" nicely for quick command-line stats calculations.
Why not write a one-line python program instead? I'd never use the shell for these kinds of things. They quickly grow into more than a simple one-liner can handle. Before you know it, you're reading from a CSV and summing column "foo" and so on. That turns your shell approach into a mess, instead of a (now 5-line) python program.
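For what it's worth, the growth path described there might look something like this hypothetical sketch (the column name "foo" comes from the comment, not from st):

    import csv, sys

    # sum the "foo" column of a CSV file given as the first argument
    total = 0.0
    with open(sys.argv[1], newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["foo"])

    print(total)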