Most computer science guys and “software engineers” never deal with the floating point end of their computer machines. This is too bad, as the computer was invented basically to do floating point calculations, whether to build nuclear weapons or to intercept Rooskie bombers. The present use cases of watching TV, selling underpants on the internet, mining bitcoins, and sending email to the guy across the cubicle from you are later developments: all integer oriented.
Because the subject is so old and obscure, you have to be a sort of historian of science to understand why things are the way they are on the floating point side of the computer machine. You also have to know Fortran, because that’s more or less what most of it is written in. In fact, you have to know old Fortran, because an awful lot of it was written a long time ago.

Much longer ago than this, even
It is an exaggeration to say everything you can do to simulate or model the world happens using numeric linear algebra of some kind, but it isn’t much of an exaggeration. Numeric linear algebra can be dense, meaning all numbers are defined, and a matrix looks like a matrix in memory, or sparse, meaning most of the numbers are 0 and it looks like a hash table in memory. Most problems are inherently dense, but sparsification can make impossible things possible, and there are many data science applications (recommendation engines, for example) where sparsity can be assumed if you think your matrix of data has any meaning to it.
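To make that concrete, here is a minimal sketch in C (the names are mine, purely illustrative) of what the two layouts amount to in memory: a dense matrix is one contiguous blob you index into with arithmetic, a sparse one only stores the nonzero entries along with their coordinates.

```c
#include <stdlib.h>

/* Dense: an m x n matrix is a single contiguous block; element (i,j) lives at
 * a fixed offset, here in column-major order as Fortran and LAPACK expect. */
double *dense_alloc(int m, int n) {
    return malloc((size_t)m * n * sizeof(double));
}
#define DENSE(A, m, i, j) ((A)[(size_t)(j) * (m) + (i)])

/* Sparse (COO / triplet form): only the nonzero entries are stored, each one
 * tagged with its row and column, much like a hash table keyed on (i,j). */
typedef struct {
    int    *row;   /* row index of each nonzero    */
    int    *col;   /* column index of each nonzero */
    double *val;   /* the nonzero values themselves */
    int     nnz;   /* how many nonzeros are stored  */
} coo_matrix;
```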
You’d think this sort of thing is a trivial problem, but it isn’t. The reasons are twofold. One is that the operations are hard. A naive dense square matrix multiply with no trickery involved is O(n^3). If it’s dense, and you need to keep all three matrices around, it’s 3 * O(n^2) in memory. There is an additional problem here which “big O” doesn’t say much about: when you have big-O memory problems, you have IO problems. Your computer doesn’t operate directly on the memory chips you install in it; it operates on registers. While matrix multiply is just vector multiply and sum, the conga dance of doing this in a way that keeps data moving efficiently through the registers, caches and main memory is extremely intricate and usually machine dependent (because everyone’s computer has a slightly different memory profile).
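For the record, here is the naive triple loop in C (column-major indexing, function name mine, not a tuned routine). Every tuned library is doing this same arithmetic; the whole game is reordering and blocking the loops so the data already sitting in registers and cache gets reused instead of fetched again.

```c
/* Naive dense matrix multiply, C = A*B, all n x n, column-major storage.
 * Three nested loops: O(n^3) multiply-adds over 3*n^2 doubles of memory.
 * Tuned BLAS do the same arithmetic, but block and reorder the loops so
 * data already in registers and cache gets reused. */
void naive_dgemm(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)          /* column of C */
        for (int i = 0; i < n; i++) {    /* row of C    */
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[k * n + i] * B[j * n + k];
            C[j * n + i] = sum;
        }
}
```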
Most of you have never thought about these matters, because you haven’t tried to simulate a quantum mechanical system, or solve differential equations using finite element analysis. If you build your linear algebra library right, you can get speedups by factors of 100 or 1000. That’s not a big-O speedup, but it sure as hell matters. Computer dorks worry a lot about big-O. Of course, big-O matters very much for both memory and processor cycles. But other things matter too, and computer dorks rarely explicitly worry about them, in part because they have no way of really understanding when they’re happening. I’d posit that for any interestingly “large” problem, you’re dealing with something I call “big-IO.” Once you get past having a correct linear algebra algorithm, big-IO is screechingly important on practical problems. It’s the difference between being able to solve a problem and not being able to solve it.
I think I’ve convinced any sane reader that matrix multiplies are hard. But this is usually the easiest thing you have to deal with, as it is an actual array operation: something that can be done without explicit loops and branches. QR decompositions, Givens rotations, eigenvalues, eigenvectors, singular value decompositions, LU decompositions, inverses, partial inverses: these are more algorithmically challenging. Depending on the properties of the matrix you’re working with (and there’s no simple way to make matrices introspective to tell you this), there may be cheap solutions to these problems, and there may be very expensive ones. Lots of interesting problems are in the O(n^3) range. Regardless of what you use, you have “big-IO” bottlenecks everywhere.
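One concrete instance of the cheap-versus-expensive point: LAPACK has a general LU factorization and a Cholesky factorization, and nothing in the matrix data itself tells the library which one applies; you have to know your matrix is symmetric positive definite to get the cheap one. A sketch of the two raw Fortran entry points as seen from C (the trailing-underscore symbol names are the usual, though not universal, binding convention):

```c
/* General LU factorization with pivoting: works on any matrix,
 * roughly (2/3)*n^3 flops plus pivot bookkeeping. */
extern void dgetrf_(const int *m, const int *n, double *a, const int *lda,
                    int *ipiv, int *info);

/* Cholesky factorization: only valid for symmetric positive definite
 * matrices, but about half the flops and no pivoting. */
extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info);
```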
Further complicating your problem, linear algebra on a computer also takes place on several types of floating point data: 32 bit floats, 64 bit doubles, and the single and double precision versions of complex numbers. Some problems are inherently 32 bit, some are 64 bit; some are inherently real, and some complex. So already, we’re talking about every kind of operation having 4 different implementations, all of which must be written, maintained and optimized for your IO and data locality requirements.
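This fourfold multiplication is baked right into the BLAS/LAPACK naming scheme: the same routine exists with an S, D, C or Z prefix for single, double, single complex and double complex. For instance, the four flavors of the humble axpy operation (y = a*x + y), as declared from C under the usual Fortran binding assumptions:

```c
#include <complex.h>

/* The same axpy operation, once per floating point type.  The S/D/C/Z
 * prefix is the whole "type system": single, double, complex, double complex. */
extern void saxpy_(const int *n, const float *alpha,
                   const float *x, const int *incx, float *y, const int *incy);
extern void daxpy_(const int *n, const double *alpha,
                   const double *x, const int *incx, double *y, const int *incy);
extern void caxpy_(const int *n, const float _Complex *alpha,
                   const float _Complex *x, const int *incx,
                   float _Complex *y, const int *incy);
extern void zaxpy_(const int *n, const double _Complex *alpha,
                   const double _Complex *x, const int *incx,
                   double _Complex *y, const int *incy);
```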
Most dense linear algebra work presently happens in something called LAPACK. If you want to calculate eigenvectors in Matlab, Incanter/Clojure, Python, R, Julia and what have you: it’s almost certainly getting piped through LAPACK. If you’re working in C, C++ or Fortran, you’re still probably doing it via LAPACK. If you’re doing linear algebra in Java, you are a lost soul, and you should seriously consider killing yourself, but the serious people I know who do this allocate blobs of memory and run LAPACK on the blobs. The same thing was true 25 years ago. In fact, people have basically been running their numerics code on LAPACK and its predecessors, EISPACK and LINPACK, for over 40 years. Quite a lot of it conforms to… Fortran 66; a language invented 50 years ago which is specifically formatted for punched cards. That is an astounding fact.
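The “blobs of memory” workflow is less exotic than it sounds. A minimal sketch, assuming a Fortran-convention LAPACK linked the usual way (trailing-underscore symbols, something like -llapack): solve A x = b by handing DGESV two raw column-major arrays.

```c
#include <stdio.h>

/* Fortran LAPACK entry point: everything passed by reference, column-major. */
extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                   int *ipiv, double *b, const int *ldb, int *info);

int main(void) {
    int n = 2, nrhs = 1, ipiv[2], info;
    /* 2x2 system, stored column by column: A = [3 1; 1 2] */
    double a[4] = {3.0, 1.0, 1.0, 2.0};
    double b[2] = {9.0, 8.0};          /* right hand side, overwritten with x */

    dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
    if (info != 0) { fprintf(stderr, "dgesv failed: %d\n", info); return 1; }
    printf("x = [%g, %g]\n", b[0], b[1]);   /* expect [2, 3] */
    return 0;
}
```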

LAPACK: significantly older than this photo; still being used on your computer now
LAPACK is a sort of layer cake on top of something called the BLAS (“Basic Linear Algebra Subprograms”). I think the BLAS were originally designed as a useful abstraction so your eigendecomposition functions have more elementary operations in common with your QR decomposition functions. This made LAPACK better than its ancestors from a code management and complexity point of view. It also made LAPACK faster, as it allowed one to use machine optimized BLAS when they are available, and BLAS are the type of thing a Fortran compiler could tune pretty well by itself, especially on old-timey architectures without complicated cache latencies. While most of these functions seem boring, they are in fact extremely interesting. Just to give a hint: there is a function in there for doing matrix multiply via the Winograd version of the Strassen algorithm, which multiplies matrices in O(N^2.8) instead of O(N^3). It’s the GEMMS family of functions if you want to go look at it. Very few people end up using GEMMS, because you have to be a numerical linear algebraist to understand when it is useful enough to justify the O(N^0.2) speedup. For example, there are exactly no LAPACK functions which call GEMMS. But GEMMS is there, implemented for you, just waiting for the moment when you might need to solve … I dunno, a generalized Sylvester equation where Strassen’s algorithm applies.
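For orientation, the plain GEMM everybody actually calls computes C := alpha*op(A)*op(B) + beta*C. A hedged sketch of calling the double precision version through the usual Fortran binding (the wrapper name is mine):

```c
/* C := alpha * op(A) * op(B) + beta * C, column-major throughout. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

/* Illustrative wrapper: square multiply C = A*B with no transposition. */
void multiply(int n, const double *A, const double *B, double *C) {
    const char no = 'N';
    const double one = 1.0, zero = 0.0;
    dgemm_(&no, &no, &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
}
```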
BLAS can be hyperoptimized for an individual machine by specialized packages like ATLAS or GotoBLAS (and its descendants BLIS and OpenBLAS). It’s usually worth a factor of 5 or so, so I generally do it, even though it is a pain in the ass. Those packages shipped with your Linux distribution that call themselves ATLAS BLAS technically are ATLAS BLAS, but they’re never as fast as rolling your own. The original GotoBLAS were mostly hand tuned assembler, meta-compiled into very fast BLAS for various permutations and combinations of supported architectures. They’re not called GotoBLAS because they use goto statements (though they do); they’re named after the author, Goto Kazushige, who appears to be some kind of superhuman mentat.
LAPACK is intensely important. Computer people who are not physicists or electrical engineers have probably never given it a second thought. The fact that it’s old Fortran code has implications. The most obvious one is that Fortran arrays are column major, just like linear algebra itself is. I have no idea why C decided to be row-major; probably because they weren’t doing linear algebra. There are ways of handling this, but it’s one more thing you have to keep track of, and it can prevent you from doing things on the C side without expensive memory copies. There is also the matter that, because old school Fortran looks like it stack-allocates its arrays (in code appearance only), you have to do weird things like instantiate your arrays and bizarro looking “workspaces” way up the calling pipeline when you’re calling from a C-like language. This is painful, and efforts have been made (LAPACKE) to make the API a little more friendly, with this handled under the covers.
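Here is what the bizarro workspaces look like in practice. Many LAPACK routines want scratch space handed to them, and the sanctioned way to find out how much is a “workspace query” call with lwork = -1. A sketch using the symmetric eigensolver DSYEV (wrapper name mine, usual Fortran binding assumed):

```c
#include <stdlib.h>

/* Symmetric eigensolver: eigenvalues in w, optionally eigenvectors in a. */
extern void dsyev_(const char *jobz, const char *uplo, const int *n,
                   double *a, const int *lda, double *w,
                   double *work, const int *lwork, int *info);

/* a is n x n column-major symmetric; on return w holds eigenvalues, ascending. */
int symmetric_eigenvalues(int n, double *a, double *w) {
    const char jobz = 'N', uplo = 'U';   /* 'N': eigenvalues only */
    int info, lwork = -1;
    double wkopt;

    /* Workspace query: lwork = -1 asks DSYEV how much scratch it wants. */
    dsyev_(&jobz, &uplo, &n, a, &n, w, &wkopt, &lwork, &info);
    if (info != 0) return info;

    lwork = (int)wkopt;
    double *work = malloc((size_t)lwork * sizeof(double));
    dsyev_(&jobz, &uplo, &n, a, &n, w, work, &lwork, &info);
    free(work);
    return info;
}
```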

The problem with stuff like this: if you’ve already gone through the trouble of porting a version of C or Fortran LAPACK to your system, LAPACKE makes you do it all over again. The only thing you get out of the exercise is a nicer calling convention, and you have to write the code for that new and improved calling convention yourself. So the old calling conventions persist, and probably will persist for another 20-30 years. People will probably be coding up errors that amount to “punched card rejected” in 2050 or 2060.
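For what it’s worth, the nicer calling convention really is nicer. The same eigenvalue job as the raw dsyev_ sketch above, through LAPACKE, which does the workspace query under the covers and lets you state your matrix layout explicitly:

```c
#include <lapacke.h>

/* Same job as the raw dsyev_ sketch: LAPACKE hides the workspace query
 * and takes the matrix layout as an argument. Wrapper name is mine. */
int symmetric_eigenvalues_lapacke(int n, double *a, double *w) {
    return LAPACKE_dsyev(LAPACK_COL_MAJOR, 'N', 'U', n, a, n, w);
}
```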
It’s weird that we still use LAPACK, but considering the difficulty of writing such a thing and moving it from place to place, not terribly surprising. There have been attempts at rewrites, some of which look pretty good to me. FLENS was one of them. I’ve never actually used it, though I have struggled with some vintage of its peculiar build system when examining a neural net package I was interested in. Considering that I found it easier to write my own neural net than to get an old Python package based on FLENS working, I don’t see this project taking over any time soon. While calling FLENS from C++/Python might have been sweet, it didn’t seem useful enough to justify its existence. Another attempt was libFLAME, which I have never tried to use. LibFLAME was claiming some performance improvements, which is a reason to think about using it. It was clean and easy to read and came with Goto-san’s enhanced BLAS. Unfortunately, the last major release was in 2010, and the last point release in March of 2014, so I’m presuming the authors are doing other things now. As a historical point, the libFLAME group actually employed Goto-san back when he was writing GotoBLAS. Unfortunately for the libFLAME team, he is now at Intel, presumably working on MKL.

For all this, you’d think there has been no progress in numeric dense linear algebra. This isn’t remotely true. It is one of the most interesting areas of computer science research at present. All kinds of crazy things have happened since LAPACK was written. They’re scattered to the winds, existing as scraps of C/Matlab/Fortran. In fact, I would go so far as to say LAPACK doesn’t even contain the most interesting subset of available dense numeric linear algebra routines, though it inarguably contains the most useful subset. Just one obvious example of its deficiencies: the standard eigen-decomposition routine in LAPACK calculates the complete eigen-decomposition of a matrix. Very often, you only need the top few eigenvectors, and that’s a lot cheaper; you can use a Lanczos algorithm. What does one do when this happens? Well, there’s this other ancient Fortran numeric linear algebra package called ARPACK out there. Usually that’s what gets used, very often for this one function. What about sparse methods? Well, that’s another package. In Fortran days it was IMSL or SPARSKIT. Nowadays, there are many choices.
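To make the Lanczos point concrete: the algorithm only touches A through matrix-vector products, building up a small tridiagonal matrix whose extreme eigenvalues converge quickly to the extreme eigenvalues of A, which is exactly why the top few eigenvectors come so much cheaper than the full decomposition. A bare-bones sketch of the recurrence (names mine; no reorthogonalization, which real implementations like ARPACK take great pains over):

```c
#include <math.h>
#include <stdlib.h>

/* k steps of the Lanczos recurrence on a symmetric n x n matrix A
 * (dense column-major here, but A only enters via matrix-vector products).
 * Outputs: alpha[0..k-1] and beta[0..k-2], the diagonal and off-diagonal of
 * the small tridiagonal matrix whose extreme eigenvalues approximate A's.
 * V is n x k scratch; V[:,0] must be a unit-norm starting vector. */
void lanczos(int n, int k, const double *A, double *alpha, double *beta, double *V) {
    double *w = malloc((size_t)n * sizeof(double));
    for (int j = 0; j < k; j++) {
        const double *vj = V + (size_t)j * n;
        /* w = A * v_j */
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int l = 0; l < n; l++) s += A[(size_t)l * n + i] * vj[l];
            w[i] = s;
        }
        /* w -= beta_{j-1} * v_{j-1} */
        if (j > 0)
            for (int i = 0; i < n; i++) w[i] -= beta[j - 1] * V[(size_t)(j - 1) * n + i];
        /* alpha_j = <w, v_j>; then w -= alpha_j * v_j */
        double a = 0.0;
        for (int i = 0; i < n; i++) a += w[i] * vj[i];
        alpha[j] = a;
        for (int i = 0; i < n; i++) w[i] -= a * vj[i];
        if (j + 1 == k) break;
        /* beta_j = ||w||; v_{j+1} = w / beta_j (breakdown if beta_j == 0) */
        double b = 0.0;
        for (int i = 0; i < n; i++) b += w[i] * w[i];
        beta[j] = sqrt(b);
        for (int i = 0; i < n; i++) V[(size_t)(j + 1) * n + i] = w[i] / beta[j];
    }
    free(w);
}
```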
If I had to bet on a future linear algebra package taking over, it would be libelemental. I only heard about this last year (thanks Ryan), and I have yet to use it for anything important, but it appears to be a very good place to stick new numeric linear algebra algorithms. The problem with it is that it depends on a lot of other stuff: LAPACK, libFLAME (optionally, for the good parts anyway), ScaLAPACK (old Fortran code for distributed matrix math) and the kitchen sink. It would be nice if they eventually replaced these old pieces, but in the meanwhile it has a decent framework for fetching and building some version of them. A very big upside to libelemental is that it gives you access to a lot of useful old code via a nice modern C API and a modern C++ API. It also gives you access to some useful things built on top of these guys, such as non-negative matrix factorization and skeleton decomposition. It includes convex optimization routines, logistic regression, and quite a few other things that are nice to have kicking around. With luck, Jack Poulson, the primary author, will dedicate enough of his time to keep this going and turn it into something that lots of other code depends on out of the box. Seems like there is enough interest at Stanford to keep this moving forward.