Revisiting the Fast Inverse Square Root – Is It Still Useful?

hllmn.net

145 points by nhellman 3 years ago · 53 comments

thegeomaster 3 years ago

One important point that the article doesn't touch on is determinism. rsqrtps is implemented differently on different CPUs, so if having reproducible floating-point results is a requirement (there are a lot of use cases for this), you simply cannot use it. Your only remaining option is a fully IEEE-754-compliant algorithm that is guaranteed to work the same on every CPU that implements IEEE-754 floats, and for that there's still no better approach than the Q_rsqrt idea, of course with some modifications for the "modern age".

  • Findecanor 3 years ago

    > rsqrtps is implemented differently on different CPUs

    That applies to x86 for sure. But the designers of ARM and RISC-V had the foresight to standardise the implementation of rsqrt to make it deterministic ... on each respective platform. But on either, the precision is only 7 bits.

    Recent AMD and Intel x86-64 processors use 11 input bits, and the results are similar enough that only 22 results over the whole input range are different. Source: <https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-e...>

    • adgjlsfhk1 3 years ago

      The point of the 7-bit estimate is that you can double the number of correct digits with a Newton iteration if you need to, so a bad-but-fast version lets you choose how accurate the result needs to be.
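
      A minimal sketch of that refinement, assuming the SSE estimate intrinsic as the starting point (the same idea applies to the ARM/RISC-V estimate instructions); one step roughly doubles the number of correct bits:

          #include <xmmintrin.h>  /* SSE intrinsics */

          /* Refine the hardware rsqrt estimate with one Newton-Raphson step. */
          static inline float rsqrt_refined(float x)
          {
              float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  /* coarse estimate */
              return y * (1.5f - 0.5f * x * y * y);                  /* one Newton step */
          }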

  • stephencanon 3 years ago

    > for that there's still no better approach than using the Q_rsqrt idea

    Or you can just take a square root and divide; these are (partially-)pipelined operations on modern CPUs, with _much_ shorter latency than they had when "fast inverse square root" was a thing.

    It still has a niche, but that niche is very, very narrow today.

    • pclmulqdq 3 years ago

      If you're doing ~20 other FP ops (not including division) per inverse square root and not using a huge dependency chain of inverse square roots, you might not even notice the impact of using an exact inverse square root with this method.

      Short of functions like exp and the trig functions, DIY approximations are usually not worth it these days. However, FP division getting faster has made these functions faster, too, since they can now use rational approximations rather than pure polynomials.

    • thegeomaster 3 years ago

      What do you mean by "taking the square root"? sqrtps has exactly the same problem, it's not standardized, so you need a custom algorithm.

      • dooglius 3 years ago

        Sqrt is standardized by IEEE, it is required to be the correctly rounded value.

      • stephencanon 3 years ago

        square root is an IEEE 754 basic operation, correctly rounded (and therefore bitwise reproducible) on all compliant systems. What made you think that it isn't?

  • nhellmanOP 3 years ago

    This is a good point; I updated the article to include a comparison where the naive method only uses standardized floating-point operations. When not using -funsafe-math-optimizations, the compiler emits sqrtps followed by divps (sqrtps seems to implement IEEE-754 sqrt).

    In this case, the Q_rsqrt actually seems to provide a 2-4x speedup compared to the reproducible naive method.
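
    For reference, a sketch of what such a reproducible naive version looks like in source (assumed to be essentially what the updated comparison measures); sqrt and division are IEEE-754 basic operations, so the result is bit-identical across compliant CPUs:

        #include <math.h>

        /* Reproducible "naive" reciprocal square root: both sqrtf and the
         * division are correctly rounded IEEE-754 operations. */
        float naive_rsqrt(float x)
        {
            return 1.0f / sqrtf(x);
        }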

  • colejohnson66 3 years ago

    But how often is determinism in the LSB really needed?

    • jcranmer 3 years ago

      There are some things that end up "accidentally" demanding exact equality of floats. Comparing against the reference output of a program that writes floating-point numbers comes to mind. A multiplayer game that relies on each client computing the game state locally can get desyncs if two clients compute floats that differ even in that last bit.

      Another, admittedly niche, scenario is that numerics code can use lower-precision types to emulate higher-precision types (e.g., represent a number with a pair of doubles).

      • nine_k 3 years ago

        AFAICT, modern multiplayer games use a central node that forces a coherent state of the world on all clients; otherwise cheating through desync becomes a serious problem. (Trusting clients in adversarial transactions, like deathmatch, is hard in general: cheating through altered rendering also used to be a thing.)

        • yvdriess 3 years ago

          Depends on the game and platform. RTS games tend to still have peer-to-peer netcode. Deterministic simulation is still a thing.

          Here is an interesting GDC talk on building a fixed precision engine core to ensure determinism: https://youtu.be/wwLW6CjswxM

      • bee_rider 3 years ago

        This is totally off the cuff, but if you were using a pair of doubles like that, you’d essentially need to represent these new operations with numerical algorithms on the pair of doubles, right? So the algorithms would presumably need to deal with the last bit being a bit dodgy at times, right?

        • pclmulqdq 3 years ago

          With appropriate rounding modes, the basic operations (+, -, *) are fairly short code sequences. Things like exp and 1/x can become adventures, but they're not too bad.

          • stephencanon 3 years ago

            Division is actually pretty straightforward: compute a residual using the multiply-add that you already have, divide _that_, and then add it to the quotient.

            Or (roughly equivalently, but maybe easier to understand) do a native double division, then do a Newton-Raphson step, which requires only multiplication and addition (just like you would refine a reciprocal estimate).
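
            A rough sketch of the residual-correction idea for a pair-of-doubles quotient, assuming the usual head/tail representation (the names here are illustrative):

                #include <math.h>

                /* Illustrative double-double quotient: the FMA recovers the rounding
                 * error of the native division (exact barring over/underflow), and
                 * dividing that residual gives the correction term. */
                typedef struct { double hi, lo; } dd;

                static dd dd_div(double a, double b)
                {
                    dd q;
                    q.hi = a / b;                 /* native, correctly rounded quotient */
                    double r = fma(-q.hi, b, a);  /* residual a - q.hi*b                */
                    q.lo = r / b;                 /* correction to add to the quotient  */
                    return q;
                }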

    • andrewmcwatters 3 years ago

      Happens very frequently in multiplayer games. As a result, if you don't have it, you can't byte-compare floats and need to have an "approximate" function with some predefined epsilon slop.
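
      A minimal sketch of that kind of "approximate" comparison (one common formulation; the right tolerances are application-specific):

          #include <math.h>
          #include <stdbool.h>

          /* Relative tolerance with an absolute floor, one common shape for
           * the epsilon-slop comparison described above. */
          bool approx_equal(float a, float b, float rel_eps, float abs_eps)
          {
              float diff  = fabsf(a - b);
              float scale = fmaxf(fabsf(a), fabsf(b));
              return diff <= fmaxf(abs_eps, rel_eps * scale);
          }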

    • magicalhippo 3 years ago

      I recall LHC@Home had some issue with this[1]. They ran simulations of beams in the LHC using BOINC; each run could be up to a million revolutions. The goal was to do parameter searches to find the best magnet settings before going live, so they could spend less time tuning the real thing.

      Anyway, as with other BOINC projects, they sent each work unit (simulation run) to at least two (or was it three?) different computers and compared results, to ensure correctness. And they found that they got quite a lot of work units which disagreed and had to be sent to more computers for validation.

      After some digging and eliminating factors like overclocked CPUs, they found that usually, all Intel machines would agree and all AMDs would agree, but Intel and AMD would disagree. Like, a run that would hit the detector wall after 30 revolutions on Intels could go on for many thousands of revolutions on AMDs.

      Further digging traced the discrepancy to the lower bits of transcendental operations in the FPUs[2]. After switching to a software library for these operations, at the cost of a few % in speed, they got the Intels and AMDs to agree.

      So yeah, when you do a large number of iterated operations like this, even a single LSB of difference can lead to issues.

      As an aside, LHC@Home was initially run almost like a hobby project by a few researchers connected to the LHC, without much official support. However, the data the project produced was AFAIK highly beneficial to the machine commissioning, and it later became a more official part of the High Luminosity upgrade.

      [1]: https://cds.cern.ch/record/1463291/files/CERN-ATS-2012-159.p...

      [2]: https://epaper.kek.jp/icap06/PAPERS/MOM1MP01.PDF

      • arnoldjm 3 years ago

        I looked into this issue with rsqrt (and with rcp as well) between Intel and AMD processors in connection with CERN some years ago (2016). An unpublished report can be found at [1].

        TL;DR: The same (very small) executables gave different results when run on Intel and AMD processors because the rsqrt and rcp instructions produced slightly different outputs on the two systems.

        [1]: https://github.com/jeff-arnold/math_routines/blob/main/rsqrt....

    • hansvm 3 years ago

      Determinism in the LSB is often a prerequisite for epsilon-determinism in the result. Algorithms with many possible solutions and any step that bifurcates on a floating point value should be treated with suspicion.

      Classic examples include most under-constrained randomized algorithms, like training a neural network. Rejection sampling is required to accurately produce some sorts of randomness, and that yields bifurcations in the initialization if you don't have LSB determinism. The complicated loss landscape then virtually guarantees you'll converge to a different minimum. Even with a deterministic seed, algorithms guaranteed to converge, a principled way to ensure that concurrent computations yield bitwise identical results no matter the execution order, and most of your operations being bitwise identical, a few stray LSB issues in your inverse square roots or transcendentals will still nearly ensure that the final result isn't even approximately the same.

      As to why that latter thing matters, it varies, but at a minimum reproducibility makes lots of federated processes cheaper (and not just federated in the "folding at home" sense, but generally when some people are performing computations and other people are making actions based on them -- being able to explain credit scores or parole denials or whatever, or validating that several people you trust yield the same compiled binary). Bitwise reproducibility would be better, but even being approximately right would probably be good enough and isn't tenable without bitwise identical building blocks.

    • moralestapia 3 years ago

      It is typical to need exact bit-by-bit equality between outputs, for integrity and security purposes.

    • ufo 3 years ago

      What does LSB mean?

      • nomel 3 years ago

        For IEEE 754 [1], the last bit of the mantissa [2] is also the last bit of the binary representation. So changing it results in the smallest possible change in the number.
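
        A tiny sketch of that, assuming a positive finite float: adding 1 to the bit pattern advances the value by one unit in the last place (ULP):

            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            int main(void)
            {
                float x = 1.0f, y;
                uint32_t bits;
                memcpy(&bits, &x, sizeof bits);
                bits += 1;                       /* advance the least significant bit */
                memcpy(&y, &bits, sizeof y);
                printf("%.9g -> %.9g\n", x, y);  /* 1 -> 1.00000012 */
                return 0;
            }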

      • bee_rider 3 years ago

        Least significant bit

  • snek_case 3 years ago

    > Your only remaining option is to use a totally IEEE-754 compliant algorithm that is guaranteed to work the same on every CPU that implements IEEE-754 floats

    OK, but doesn't Intel internally use 80 bits of precision for float64 computations? If that's the case, you can't even guarantee that float64 multiplication and addition behave the same on, say, x86 vs ARM.

    • josefx 3 years ago

      That was the old x87 stack, which has been (almost) completely replaced by SSE and other vector instruction sets that operate on 32/64-bit elements. Unless you need 80-bit floats or have a mountain of ancient 32-bit Intel CPUs without SSE support lying around, you should get the expected rounding behavior by default.

    • kevingadd 3 years ago

      The higher internal precision only applies to the classic x87 operations under most circumstances, AFAIK.

bee_rider 3 years ago

I wonder to what extent the Newton-Raphson strategy plays nicer with big fancy reordering/pipelining/superscalar chips. It has more little instructions to shuffle around, so my gut says it should be beneficial, but the gut can be misleading for this kind of stuff.

Also,

-funsafe-math-optimizations

Fun, safe math optimizations should be turned on by default! ;)

  • pclmulqdq 3 years ago

    If you have to do this in software, Goldschmidt's algorithm parallelizes a lot better than Newton-Raphson, but isn't always general-purpose (IIRC). It uses a multiplicative update rule instead of an addition like NR. Division, square root, and inverse square root all use that algorithm under the hood (at least in AMD and IBM processors).
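
    A rough sketch of a Goldschmidt-style reciprocal square root iteration (a textbook formulation, not a model of any particular CPU's implementation); note that the two updates in each pass are independent of one another, which is what makes it parallelize better than a Newton-Raphson chain:

        #include <math.h>

        /* Goldschmidt-style rsqrt: y0 is a low-precision seed (from a table or
         * an estimate instruction); each pass roughly doubles the accuracy. */
        static double goldschmidt_rsqrt(double s, double y0, int passes)
        {
            double x = s * y0;    /* converges to sqrt(s)     */
            double h = 0.5 * y0;  /* converges to 0.5/sqrt(s) */
            for (int i = 0; i < passes; i++) {
                double r = 0.5 - x * h;  /* shared correction factor          */
                x = x + x * r;           /* these two updates are independent */
                h = h + h * r;
            }
            return 2.0 * h;  /* 1/sqrt(s) */
        }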

stephc_int13 3 years ago

The benchmark should not average the values but take the lowest.

I could not write a better explanation than Daniel Lemire did on his blog:

https://lemire.me/blog/2023/04/06/are-your-memory-bound-benc...

  • robocat 3 years ago

    You also need to randomise the object code location across multiple builds.

    I could not give a better explanation than Emery Berger does in this talk: https://youtu.be/r-TLSBdHe1A?t=10m41s Key point: “Layout changes can shift performance by +/-40%”

    Not sure it matters so much on small in-cache programs, but worth thinking about when profiling larger programs.

  • cycomanic 3 years ago

    From that post:

    > It is not possible, in a normal distribution, to be multiple times the standard deviation away from the mean. However, it happens much more easily with a log normal.

    Should that read "it is not _impossible_"? Otherwise it is quite wrong. I mean, it's a Gaussian probability distribution; you certainly can be several standard deviations away from the mean, admittedly with smaller and smaller probability.

microtonal 3 years ago

A colleague and I were once discussing the fast inverse square root and joked that we need to make a (neural net) activation function that uses an inverse square root as an excuse to use the fast inverse square root. At any rate, I did come up with an activation function that is very similar to Swish/GELU but uses an inverse square root:

https://twitter.com/danieldekok/status/1484898130441166853?s...

It's quite a bit cheaper, because it doesn't need expensive elementary functions like exp or erf.

(We did add it to Thinc: https://thinc.ai/docs/api-layers#dish)
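
The exact definition is in the linked tweet and the Thinc docs; purely as an illustration of the general shape (x scaled by a smooth 0-to-1 gate built from an inverse square root rather than exp or erf), here is a sketch that may differ in details from the actual Dish function:

    #include <math.h>

    /* Illustrative Swish/GELU-like activation whose gate uses an inverse
     * square root instead of exp or erf; not necessarily the exact Dish. */
    static inline float rsqrt_gated_activation(float x)
    {
        float gate = 0.5f * (1.0f + x / sqrtf(1.0f + x * x));  /* smooth step in (0,1) */
        return x * gate;
    }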

  • kadarakos 3 years ago

    Haha, I remember that the internal name for it was DoomSwish for a while or something like that, due to the fast inverse square root often being (I think wrongly?) attributed to John Carmack. But it was used in Quake anyway, right? XD

jabl 3 years ago

It's a shame that -fno-math-errno isn't the default. The errno requirement pessimizes many common operations, as can be seen in the article. Take e.g. a simple sqrt() call like

    #include <math.h>

    double mysqrt(double d) { return sqrt(d); }

with and without -fno-math-errno: https://godbolt.org/z/bvrz9r8ce

One can see that with -fno-math-errno the function can be entirely inlined. But if errno is enabled, it has to first check whether the input is negative, and in that case call the libc sqrt() function which sets errno.

As for why it's not the default, I guess it's historical. The errno approach was common back in the days before IEEE 754, with its exception model, provided another way.

E.g. for glibc: https://man7.org/linux/man-pages/man7/math_error.7.html

Musl libc, being newer, does away with that and never sets errno in libm functions: https://wiki.musl-libc.org/mathematical-library.html

commandlinefan 3 years ago

I've always wondered why they did the casting rather than a union like:

    float my_rsqrt( float number )
    { 
        float x2;
        union {
          float y;
          long i;
        } u;
        const float threehalfs = 1.5F;
    
        x2 = number * 0.5F;
        u.y = number;
        u.i  = 0x5f3759df - ( u.i >> 1 );               // what the fuck?
        u.y  = u.y * ( threehalfs - ( x2 * u.y * u.y ) );   // 1st iteration
    
        return u.y;
    }
Were unions not supported by the compilers back then?
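
For comparison, a sketch of a third option: memcpy-based type punning, which modern compilers optimize to the same register moves, and which with a fixed-width int32_t also sidesteps the fact that long is 8 bytes on LP64 platforms:

    #include <stdint.h>
    #include <string.h>

    float my_rsqrt(float number)
    {
        const float threehalfs = 1.5F;
        float x2 = number * 0.5F;
        float y  = number;
        int32_t i;

        memcpy(&i, &y, sizeof i);             /* read the bits of y */
        i = 0x5f3759df - (i >> 1);            /* what the fuck?     */
        memcpy(&y, &i, sizeof y);             /* write the bits back */
        y = y * (threehalfs - (x2 * y * y));  /* 1st iteration       */

        return y;
    }
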
geertj 3 years ago

Awesome write up, a lot of effort must have gone into this.

I believe the benchmark program outputs the wrong units? It should be picoseconds (ps) instead of femtoseconds (fs)?

  • nhellmanOP 3 years ago

    I believe you are correct, thanks for pointing that out. clock_gettime gives us ns and I multiply it by 1000 which gives us ps. Not sure how I initially got it to be fs. Updated it now.

jabl 3 years ago

As a small nit on the benchmark code, it should use CLOCK_MONOTONIC rather than CLOCK_REALTIME.

clircle 3 years ago

Inverse square root and reciprocal square root are not the same. The inverse of the square root function is x^2, not 1/sqrt(x).

mikerg87 3 years ago

Anyone have an idea which one is more power efficient? Is there a tool that could help make that determination?

seventytwo 3 years ago

Hell of a write up! Nice work.

  • djmips 3 years ago

    Yes, it's very thorough and even then there's still more ground to cover when you get this detailed.

nikanj 3 years ago

This would never pass code review today. "Why are you optimizing this?" "Why use a magic constant?" "Optimization is evil!"
