Beware of fast-math

simonbyrne.github.io

189 points by simonbyrne 4 years ago · 110 comments

pavpanchekha 4 years ago

Fun fact: when working on Herbie (http://herbie.uwplse.org), our automated tool for reducing floating-point error by rearranging your mathematical expressions, we found that fast-math often undid Herbie's improvements. In a sense, Herbie and fast-math are opposites: one makes code more accurate (sometimes slower, sometimes faster), while the other makes code faster (sometimes less accurate, sometimes more).

  • kazinator 4 years ago

    If you have a program which finds some order of items in order to optimize something, and then you introduce some confounding technology which scrambles the order of the items afterward, that's indistinguishable from introducing bugs into that optimizing program; the output no longer implements the optimization requirements that the program tries to ensure.

    I don't see how Herbie's accuracy improvements could not be undone, if Herbie's output is fed to a back-end which doesn't preserve Herbie's order of operations as Herbie requires and depends on.

  • Quekid5 4 years ago

    That's a real "fun fact" kind of thing. Love it.

    Of course, the 'real' solution is actual Numerical Analysis (as I'm sure you know) to keep the error properly bounded, but it's really interesting to have a sort of middle ground which just stabilizes the math locally... which might be good enough.

    Other fun fact: Numerical Analysis is a thing that's really hard to imagine unless you happen to be introduced to it during an education. It's so obviously a thing once you've heard of it, but REALLY hard to come up with ex nihilo.

  • simonbyrneOP 4 years ago

    Herbie is a great tool, especially for teaching.

  • shoo 4 years ago

    thank you for sharing the link to Herbie, that looks like a useful tool.

    If I follow at a high level, it looks like Herbie is trying to rewrite expressions to minimise error without runtime performance constraints.

    Are there alternative tools that focus on rewriting code to maximise performance while keeping error below some configurable bound?

    I guess compilers are generally focused on the latter problem, perhaps without giving the user much control over the degree of error they are willing to tolerate.

    • mattpharr 4 years ago

      > Are there alternative tools that focus on rewriting code to maximise performance while keeping error below some configurable bound?

      There are! See followup work by @pavpanchekha and others on "Pherbie", which finds a set of Pareto-optimal rewritings of a program so that it's possible to trade-off error and performance: https://ztatlock.net/pubs/2021-arith-pherbie/paper.pdf.

      • shoo 4 years ago

        Fantastic! This appears to be available in the command line version of herbie with the `--pareto` flag:

        > Enables multi-objective improvement. Herbie will attempt to simultaneously optimize for both accuracy and expression cost. Rather than generating a single "ideal" output expression, Herbie will generate many output expressions. This mode is still considered experimental. This will take a long time to run. We recommend timeouts measured in hours.

      • adgjlsfhk1 4 years ago

        I've generally been disappointed by Herbie when I've tried it, but this does look really cool. I would love it if this provides a path to making it easier to get good tradeoffs between performance and accuracy.

headPoet 4 years ago

-funsafe-math-optimizations always makes me laugh. Of course I want fun and safe math optimisations

  • nusaru 4 years ago

    Personally I’m a fan of Kotlin’s “fun factory()”

  • seba_dos1 4 years ago

    If you expect them to be safe, you're in for some fun!

  • krylon 4 years ago

    They're not fun and safe, though, they're "fun-safe", so you don't enjoy yourself (too much) while doing math.

SeanLuke 4 years ago

The other examples he gave trade off significant math deficiencies for small speed gains. But flushing subnormals to zero can produce a MASSIVE speed gain. Like 1000x. And including subnormals isn't necessarily good floating point practice -- they were rather controversial during the development of IEEE 754 as I understand it. The tradeoff here is markedly different than in the other cases.

  • adrian_b 4 years ago

    Flushing subnormals to zero produces speed gains only on certain CPU models, while on others it almost does not have any effect.

    For example Zen CPUs have negligible penalties for handling denormals, but many Intel models have a penalty between 100 and 200 clock cycles for an operation with denormals.

    Even on the CPU models with slow denormal processing, a 100x to 1000x slowdown applies only to the operations with denormals themselves, and only when they occur in a stream of operations running at maximum CPU SIMD speed, where during the hundred-odd lost clock cycles the CPU could otherwise have done 4 or 8 operations per clock cycle.

    No complete computation can have a significant percentage of operations with denormals, unless it is written in an extremely bad way.

    So for a complete computation, even on the models with bad denormal handling, a speedup of more than a few times would be abnormal.

    The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers, instead of designing the FPU in the right way.

    When the correctness of the results is not important, like in many graphic or machine-learning applications, using flush-to-zero is OK, otherwise it is not.
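
    As a rough illustration, this is how the flush-to-zero and denormals-are-zero modes are usually switched on in C with the standard SSE intrinsics (a minimal sketch; whether DAZ is supported depends on the CPU):

      #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
      #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

      void enable_ftz_daz(void)
      {
          /* FTZ: denormal results are replaced with zero. */
          _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
          /* DAZ: denormal inputs are treated as zero. */
          _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
          /* Both set bits in the thread's MXCSR register, so they affect
             every subsequent SSE/AVX operation on that thread. */
      }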

    • titzer 4 years ago

      When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.

      Outside of a vanishingly few edge cases, I think the subnormal debate is basically over, except, apparently, inside of Intel. Every single other architecture and microarchitecture manages to handle subnormals with relative ease, with only a handful of clock cycle penalty. I think Intel hardware should be called out, not programmers who just want the 35 year old floating point standard to be fast like it is on other chips.

      Similar stories happened in the GPU world, and my understanding is that essentially all GPUs are converging on IEEE compliance by default now.

      • SeanLuke 4 years ago

        > When we were debating whether WebAssembly should support subnormal numbers (i.e. be IEEE compliant), some people often cited these mythical subnormal slowdowns. So Dan Gohman ran some benchmarks and the scary-sounding slowdowns amounted to something like less than 1% (i.e. noise) for almost all benchmarks. Interestingly, one benchmark did not converge correctly with FTZ (i.e. no subnormals) and actually ran 3x more iterations, leading to a 3x slowdown.

        I recently built a modular additive music synthesizer called Flow (https://github.com/eclab/flow). When certain modules in the synthesizer [gradually] push certain state variables into the denormal range, my synthesizer will experience a roughly 100x slowdown. Mind you, this isn't due to DSP or even sound processing, and Flow isn't written in C, but in 100% pure *Java*. Since Java can't turn off denormals, I have to manually check for and zero them at strategic locations to avoid getting mired in the denormal quicksand.

        • adrian_b 4 years ago

          This is strange, so it is likely more of a Java problem than a CPU problem.

          I do not know how Java handles this, but maybe it actually enables exceptions for underflow which invoke some handler.

          Otherwise I cannot see how you can obtain such a huge slowdown, unless your code consists entirely of back-to-back operations with denormals and of nothing else.

          I am not sure what you mean by "state variables", but if they are pushed into the denormal range, they should be changed to double, not float.

          If you push double variables into the denormals range, then it is likely that the algorithm must be modified, because this should not happen.

          Underflows, i.e. denormals, are difficult to avoid when using float variables, which can be mandatory in DSP algorithms for audio or video, but outside the arrays processed with SIMD instructions at maximum speed, the scalar variables can be double, which should never underflow in most correct algorithms.

          For computations run on CPUs, not GPUs, there are only very seldom reasons to use a scalar float variable. Normally float should be used only for arrays.

          • nitrogen 4 years ago

            > entirely of back-to-back operations with denormals

            In the context of sound, I could see this happening with an exponentially decaying envelope generator (or an IIR filter).

            • SeanLuke 4 years ago

              100% correct. Many audio and synthesis algorithms, including mine, perform LOTS of iterated exponential or high-polynomial decays on variables, such as x <- x * 0.1, or x <- x * x or whatnot. These decays rapidly pull values to the denormal range and keep them there, never hitting zero. Depending on the CPU, this in turn forces everything to go into microcode or software emulation, producing a gigantic slowdown. There are other common cases as well.

              The only way to get around this in languages like Java, which cannot flush to zero, is to vigilantly check the values and flush them to zero manually.
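
              A minimal sketch of that manual flush, in C for illustration (the decay factor and threshold are just example values):

                #include <math.h>

                /* Exponentially decaying state variable, e.g. an envelope or a
                   one-pole filter: it approaches zero but never reaches it, so it
                   eventually lands in the denormal range and stays there. */
                static float decay_step(float x)
                {
                    x *= 0.99f;
                    /* Manual flush: once the value is far below anything audible,
                       snap it to zero so later operations never touch denormals. */
                    if (fabsf(x) < 1e-20f)
                        x = 0.0f;
                    return x;
                }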

            • spacechild1 4 years ago

              Yes. It is very easy to accidentally produce denormals in recursive audio algorithms.

          • jcelerier 4 years ago

            > unless your code consists entirely of back-to-back operations with denormals and of nothing else.

            Ending up with data entirely in the denormal range is a common occurrence in some audio algorithms (and there, Intel CPUs dominate by such a large margin it's not even funny); if that happens at the beginning of your signal processing pipeline, you're in for a rough time.

            • adrian_b 4 years ago

              I agree that this happens, but solving such cases with FTZ is a lazy solution, which is guaranteed to give bad results, due to the loss of precision.

              Even when 32-bit floating point is used, for a greater dynamic range and for a 24-bit precision, instead of using 16-bit fixed-point numbers, proper DSP algorithm implementations still need to use some of the techniques that are necessary with fixed-point number algorithms, i.e. suitable scale factors must be inserted in various places.

              A correct implementation must avoid almost all underflows and overflows, by appropriate scalings.

              • jcelerier 4 years ago

                The problem is (speaking from the end-user side) that you can't guarantee every plug-in you are going to use is coded properly, and you don't want that 2007 plug-in, whose author has been dead for a decade but which is super important for your sound, to bring your whole performance down when it gets silence-ish input.

        • titzer 4 years ago

          Complain to Intel. AMD and ARM chips have no such 100x penalties.

          • SeanLuke 4 years ago

            Perhaps true. But the point is: you're calling denormal failures "edge cases", yet my primary experience with denormals is exactly this.

            • titzer 4 years ago

              I sympathize. Software/hardware is littered with one person's edge cases being another person's entire world. But in the grand scheme, yes, subnormals are exceedingly rare. Clearly Intel microarchitecture designers think that, as they seem perfectly willing to continue punishing some applications with a massive performance cliff. Their mitigation should never have been "we'll add a cheat switch for speed" but rather "we'll work as hard as our competitors do to make these cases fast." Standards are supposed to do that, but cheaters abound (and yes I am being a bit pejorative--cheaters don't think of themselves as cheating, they merely have important use cases that demand special dispensation).

              GPU hardware is a different, but similar story, from what I can see. It saves transistors to do FTZ, and the originally niche usage of FP to put pixels on the screen didn't really care so much about niggling details. But GPUs became general purpose and important, and they've been dragged into full compliance by application demands. It's the only sane outcome in the end. Instead, all this FTZ stuff has just made a mess at layers above. It would all be unnecessary if subnormals were as fast as AMD, ARM, IBM, and other chip manufacturers have managed to make them.

      • simonbyrneOP 4 years ago

        That's a fascinating thread, thanks: https://github.com/WebAssembly/design/issues/148

    • Someone 4 years ago

      > The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers

      You could also say some companies have been kind enough to make hardware for gamers that doesn’t have costly features they do not need.

      • titzer 4 years ago

        Except CPUs do have those features, they are just slow. FTZ is kind of a cheat mode for extra speed. The problem is that cheats just mushroom into software problems and generally make a crappier, less reliable platform. The situation is rife in computer hardware.

  • StefanKarpinski 4 years ago

    That's true, but the danger of flushing subnormals to zero is correspondingly worse because it's global CPU state and there's commonly used code that relies on not flushing subnormals to zero in order to work correctly, like `libm`. The example linked in the post is of a case where loading a shared library that had been compiled with `-Ofast` (which includes `-ffast-math`) broke a completely unrelated package because of this. Of course, the fact that CPU designers made this a global hardware flag is atrocious, but they did, so here we are.

    • Joker_vD 4 years ago

      Wait, what is "local" CPU state/hardware flag? In any case, since x64 ABI doesn't require MXCSR to be in any particular state on function entry, libm should set/clear whatever control flags it needs on its own (and restore them on exit since MXCSR control bits are defined to be callee-saved).
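
      A rough sketch of that save/clear/restore discipline with the SSE control-register intrinsics (do_sensitive_math is a hypothetical stand-in for a routine that needs IEEE-conformant subnormals):

        #include <xmmintrin.h>

        void do_sensitive_math(void);   /* hypothetical routine needing subnormals */

        void call_with_default_fp_env(void)
        {
            unsigned int saved = _mm_getcsr();   /* save the caller's MXCSR */
            /* Clear FTZ (bit 15) and DAZ (bit 6) so subnormals behave per IEEE 754. */
            _mm_setcsr(saved & ~((1u << 15) | (1u << 6)));

            do_sensitive_math();

            _mm_setcsr(saved);                   /* restore on exit */
        }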

      • StefanKarpinski 4 years ago

        Local would be not using register flags at all and instead indicating with each operation whether you want flushing or not (and rounding mode, ideally). Some libms may clear and restore the control flags and some may not. Libm is just an example here and one where you're right that most of function calls that might need to avoid flushing subnormals to zero are expensive enough that clearing and restoring flags is an acceptable cost. However, that's not always the case—sometimes the operation in question is a few instructions and it may get inlined into some other code. It might be possible to handle this better at the compiler level while still using the MXCSR register, but if it is, LLVM certainly can't currently do that well.

      • simonbyrneOP 4 years ago

        In theory, every function should do that to check things like rounding mode etc. But that would be pretty slow, especially for low-latency operations (modifying mxcsr will disrupt pipelining for example).

      • pcwalton 4 years ago

        That wouldn't be practical. C math library performance really matters for numerical-intensive apps like games.

dmitrygr 4 years ago

Contrarian viewpoint: beware of not-fast-math. Making things like atan2f and sqrtf set errno takes you down a very slow path, costing you significant perf in cases where you likely do not want it. And most math will work fine with fast-math, if you are careful how you write it. (Free online numerical methods classes are available, e.g. [1]) Without fast-math most compilers cannot even use FMA instructions (costing you up to 2x in cases where they could be used otherwise) since they cannot prove it will produce the same result - FMA will actually likely produce a more accurate result, but your compiler is handicapped by lack of fast-math to offer it to you.

[1] https://ocw.mit.edu/courses/mathematics/18-335j-introduction...

  • mbauman 4 years ago

    That's precisely the part that makes it so impossible to use! Sometimes it means fewer bits of accuracy than IEEE would otherwise give you; sometimes it means more. Sometimes it results in your code being interpreted in a more algebra-ish way, sometimes it's less.

    That's why finer-grained flags are needed — yes, FMAs and SIMD are essential for _both_ performance and improved accuracy, but `-ffast-math` bundles so many disparate things together it's impossible to understand what your code does.

    > And most math will work fine with fast-math, if you are careful how you write it.

    The most hair-pulling part about `-ffast-math` is that it will actively _disable_ your "careful code." You can't check for nans. You can't check for residuals. It'll rearrange those things on your behalf because it's faster that way.

  • an1sotropy 4 years ago

    (in case anyone reading doesn't know: FMA = Fused Multiply and Add, as in a*b+c, an operation on 3 values, which increases precision by incurring rounding error once instead of twice)

    I'm not an expert on this, but for my own code I've been meaning to better understand the discussion here [1], which suggests that there ARE ways of getting FMAs, without the sloppiness of fast-math.

    [1] https://stackoverflow.com/questions/15933100/how-to-use-fuse...

    • simonbyrneOP 4 years ago

      -ffp-contract=fast will enable FMA contraction, i.e. replacing a * b + c with fma(a,b,c). This is generally okay, but there are a few cases where it can cause problems: the canonical example is computing an expression of the form:

      a * d - b * c

      If a == b and c == d (and all are finite), then this should give 0 (which is true for strict IEEE 754 math), but if you replace it with an fma then you can get either a positive or negative value, depending on the order in which it was contracted. Issues like this pop up in complex multiplication, or applying the quadratic formula.
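
      A small self-contained illustration, calling fma() explicitly to mimic one way the contraction can land (the value of a is arbitrary, chosen so the product is inexact):

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            double a = 1e9 + 0.1;                  /* plays the role of a == b, c == d */
            double strict = a * a - a * a;         /* strict IEEE 754: exactly 0.0 */
            /* One possible contraction: the subtraction folded into an fma, which
               combines the exact product a*a with the rounded product a*a. */
            double contracted = fma(a, a, -(a * a));
            printf("%g %g\n", strict, contracted); /* 0 versus a small nonzero value */
            return 0;
        }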

    • a_e_k 4 years ago

      C99 [0] and C++11 [1] both have fma() functions that let you directly request it without the need to mess around with sloppier FP contracts to infer it.

      [0] https://en.cppreference.com/w/c/numeric/math/fma

      [1] https://en.cppreference.com/w/cpp/numeric/math/fma

    • StefanKarpinski 4 years ago

      The way Julia handles this is worth noting:

      - `fma(a, b, c)` is exact but may be slow: it uses intrinsics if available and falls back to a slow software emulation when they're not

      - `muladd(a, b, c)` uses the fastest possibly inexact implementation of `a*b + c` available, which is FMA intrinsics if available or just doing separate `*` and `+` operations if they're not

      That gives the user control over what they need—precision or speed. If you're writing code that needs the extra precision, use the `fma` function but if you just want to compute `a*b + c` as fast as possible, with or without extra precision, then use `muladd`.

      • adgjlsfhk1 4 years ago

        Note that this is only true in theory. In practice, there are still some bugs here that will hopefully be fixed by julia 1.8

    • dahart 4 years ago

      > which suggests that there ARE ways of getting FMAs, without the sloppiness of fast-math.

      There are ways, indeed, but they are pretty slow, it’s prioritizing accuracy over performance. And they’re still pretty tricky too. The most practical alternative for float FMA might be to use doubles, and for double precision FMA might be to bump to a 128 bit representation.

      Here’s a paper on what it takes to do FMA emulation: https://www.lri.fr/~melquion/doc/08-tc.pdf

      • an1sotropy 4 years ago

        I remember a teacher who said (when I was a student) something like "if you care about precision use double". Now that I'm teaching, I force students to only use single-precision "float"s in their code, with the message that FP precision is a finite resource, and you don't learn how to manage any resource by increasing its supply. I think my students hate me.

        • dahart 4 years ago

          Knowing said teacher ;) I wonder if he’d still say the same thing now… It’s good practice to have to use single precision (or even half-precision!) now and then in order to be forced to deal with precision issues. Yes, use doubles if you really need them and aren’t trying to learn. But they’re often a lot more than 2x more expensive, and they might not be necessary at all. I’ve heard people who develop commercial rendering software for movies you’ve probably seen say out loud that you never need doubles, you just need to understand how to use floats.

          • couchand 4 years ago

            Perhaps you were in the same lecture as me, when I asked the lead developer on Big Hero 6 why they didn't just use doubles to solve their precision woes, and he informed me that they literally couldn't afford to use doubles at that scale.

            • dahart 4 years ago

              You know, that is actually ringing a bell, I think I might have indeed. Above I was thinking of someone else who works on a certain renderer made in New Zealand, but it's true that many studios use doubles either sparingly or not at all. That might be getting even more true as GPUs blend into production…

              • a_e_k 4 years ago

                I worked on a certain hopping lamp renderer for more than eleven years. I can confirm that probably 99+% of the floating point math in it was in single precision.

                And to this day, typing out the 'f' suffix on single precision literals is muscle memory for me after having had Steven Parker for my Ph.D. advisor.

                • dahart 4 years ago

                  Is everyone on this thread in Steve’s sphere?? I’m surprised (in a good way) to see so many familiar faces, and I guess a little surprised it’s in a thread about fast-math and not a thread about ray tracing. Okay on second thought it’s not very surprising.

                  Pixar has told me recently they still use doubles for some things in the CPU side of RenderMan, but I don’t know what for. There are some legitimate cases for it, and occasionally I dip my toes in hot water attempting to give advice to avoid doubles to people who know more than I do about how floats work and why they need doubles.

                • an1sotropy 4 years ago

                  In the coding framework I created for students we have a "real" typedef that is either for float or double, and with C99's <tgmath.h> you just write "cos" once, and it will turn into cosf or cos depending on how the typedef is set, which allows controlled experimentation on how FP precision affects performance. But for submitted code the grading scripts grep for "double" and turn on various extra warnings to ensure that there are no implicit casts from double to float, in an effort to ensure that single precision is always being used (but I should probably scan the assembly).
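
                  A bare-bones sketch of that arrangement (REAL_IS_DOUBLE is a hypothetical stand-in for however the build actually selects the type):

                    #include <tgmath.h>

                    #ifdef REAL_IS_DOUBLE
                    typedef double real;
                    #else
                    typedef float real;
                    #endif

                    real wave(real x)
                    {
                        /* <tgmath.h> picks cosf() or cos() from the argument type,
                           so the same source works at either precision. */
                        return cos((real)6.283185307179586 * x);
                    }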

                  Steve Parker was the first person to explain to me (while he was still a student, and I was a much younger one) the sometimes surprising cost of having image sizes be powers of two (because of cache conflicts). Small world.

        • johncowan 4 years ago

          Memory is a finite resource too, but would you force your students to run all their programs in 12K of memory, just because that was how much memory I had in the machine I learned to program on in 1972?

          • dahart 4 years ago

            Why not? It’s a professor’s absolute prerogative what lessons they’re offering, and working in low memory is a great lesson to learn. Kids these days are lazy and spoiled with their gigabytes of ram and terabytes of disk. In my day… wait, never mind, I’m starting to sound old, eh?

            The flip side question to you is, why should students get away with more than they need? Memory and cycles are wasting energy. We need engineers to understand how to be deeply efficient, not careless with resources. Memory is generally much more expensive than compute cycles in terms of energy use. Yes, please, teach the students how to program with less memory.

            Low memory programming is a fantastic exercise for learning modern GPU programming, since you still need to conserve individual bytes when you’re trying to run ten thousand threads at the same time. Or if you’re just into Arduinos.

            Other lessons that are great to learn, but take time to appreciate are how to avoid using any dynamic memory, how to avoid recursion, how to avoid function pointers or any of today’s tricky constructs (closures/futures/monads/y-combinators/etc.) I’m of course referring to how some people (like NASA) think of safety critical code https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev... But I will add that many of these rules have applied to console video game programming for a long time. They’re easing up lately, but the concepts still apply since coding for a console is effectively embedded programming.

            • adgjlsfhk1 4 years ago

              One reason is that one of the main resources students should learn to be efficient with is their time. There are definitely places where low memory use is important, but 95% of the time, the first place you should go is to use all the tricks you have to make writing code faster. Knowing how to be careful with precision is great, but so is just using Double (or even BigFloat) to get something that will work robustly without having to analyze as carefully.

              • dahart 4 years ago

                I agree students should learn how to be efficient with their time to learn the concepts they need to learn and pass the courses they choose to take in the school they’re choosing to attend, knowing the lessons are going to help them in their future careers. If a student thinks learning how floats work isn’t valuable, Computer Science might not be their thing.

                It’s not the professor’s job to minimize the student’s effort, it’s the student’s job. The arrangement is the opposite of your implication. The professor’s job is to get lazy students to confront and learn these concepts and have them practice enough to understand the concepts. Having to analyze carefully is the whole point.

                I also agree about the first place you should go is to use all the tricks you have to make writing code faster… at least in business. I’m not sure that applies in school. But either way, this is precisely why school should have students practice things like floating point analysis and low-memory programming until they are part of the students’ bag of tricks, until they can do high quality engineering fluently.

                BTW, just using doubles in the name of not having to analyze is not particularly great outside of school either. That does not fly where I work now (on GPU ray tracing), and would not have been acceptable when I worked in CG films or video games either. You might be underestimating how expensive doubles are. If you don’t know whether you need doubles, you probably don’t. If you have a problem that needs more than floats, and accuracy is that important, then you’ll need to justify why doubles are enough, so in practice you’ll have to analyze carefully anyway.

                Maybe you’re just teasing me with the BigFloat suggestion, I can’t tell. Since they might be orders of magnitude slower than floats, they’re rarely justifiable as a robustness replacement, especially by someone who hasn’t analyzed carefully. That might be a firing offense at some jobs if done more than once. :P

          • an1sotropy 4 years ago

            (setting aside the anachronistic snark) you may have noticed other comments here attesting to how managing 32 bits of FP precision endures today as a relevant skill.

        • shoo 4 years ago

          each time complaints are raised about single precision, you could deduct 1 bit from the allowance of bits per floating-point value for the next assignment

      • adgjlsfhk1 4 years ago

        That's not what the parent meant. The parent meant that there are ways of generating fma instructions without using fast-math. Emulating an fma instruction is almost always a bad idea (I should know, I've written fma emulation before. It sucks).

        • dahart 4 years ago

          Oh, my mistake, thanks. Yes you can use FMA instructions without the fast-math compiler flag for sure. Emulation being a bad idea is the impression I got; I’m glad to hear the confirmation from experience.

  • simonbyrneOP 4 years ago

    My point isn't that fast-math isn't useful: it very much is. The problem is that it is a whole grab bag, parts of which can do very dangerous things. Rather than using a sledgehammer, you should try to be selective and enable only the useful optimizations, e.g. you could just enable -ffp-contract=fast and -fno-math-errno.

    • djmips 4 years ago

      One thing I don't think you pointed out is that tracking down issues with NaNs seems hard with fast-math since, I believe, it also disables any exceptions that might be useful for being alerted to their formation?

  • kristofferc 4 years ago

    > Free online numerical methods classes are available

    How can you use any numerical methods (like error analysis) if you don't have a solid foundation with strict rules to analyze on top on?

vchuravy 4 years ago

Especially the fact that loading a library compiled with GCC and fast-math enabled can modify the global state of the program... It's one of the most baffling decisions made in the name of performance.

I would really like for someone to take fast math seriously, and to provide well scoped and granular options to programmers. The Julia `@fastmath` macro gets close, but it is too broad. I want to control the flags individually.

Also the question how that interacts with IPO/inlining...

  • mhh__ 4 years ago

    D (LDC) lets you control the flags on a per function basis.

    So one can (and we do at work) have @optmath, which is a specific set of flags we want (just a value we defined at compile time; the compiler recognizes it as a UDA), as opposed to letting the compiler bulldoze everything.

    • physicsguy 4 years ago

      You can do that on a per-compiler basis for e.g. with

      #pragma GCC optimize("fast-math")

      • mhh__ 4 years ago

        Hopefully it still works as an attribute but my point is that you can (say) opt in to allowing more liberal use of FMA without (say) opting in to aggressive NaN assumptions

        • gnufx 4 years ago

          GCC attributes give you per-function control over both target and optimization options (unfortunately not with Fortran). I'd have to look up what's available, but some per-loop control is possible with OpenMP pragmas too (perhaps with GCC's -fopenmp-simd if you don't want the threading).
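
          Something like the following sketch, using GCC's documented `optimize` function attribute (the function names are just for illustration):

            /* Only this function gets the relaxed FP semantics; the rest of the
               translation unit keeps its default, strict settings. */
            __attribute__((optimize("fast-math")))
            float dot_fast(const float *a, const float *b, int n)
            {
                float s = 0.0f;
                for (int i = 0; i < n; ++i)
                    s += a[i] * b[i];   /* may be reassociated and vectorized */
                return s;
            }

            float dot_strict(const float *a, const float *b, int n)
            {
                float s = 0.0f;
                for (int i = 0; i < n; ++i)
                    s += a[i] * b[i];   /* evaluated strictly left to right */
                return s;
            }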

        • physicsguy 4 years ago

          Fast math is a collection of optimisations; you can enable them individually. You just have to look up what the appropriate ones are.

          • mhh__ 4 years ago

            As I have said above, the whole point I am making is that you want to turn a set of them on for only one function.

  • titzer 4 years ago

    It should always have been a bit in the instruction encoding, never global state.

smitop 4 years ago

LLVM IR is more expressive than Clang for expressing fast-math: it supports applying fast-math optimizations on a per-operation basis (https://llvm.org/docs/LangRef.html#fastmath).

  • simonbyrneOP 4 years ago

    Do you know what happens when you have ops with different flags? e.g. if you have (a + b) + c, where one + allows reassoc but one doesn't?

    • diamondlovesyou 4 years ago

      (a+b)+c has two ops in LLVM: addition is a binop, meaning it has two "arguments", thus (a+b) and adding "c" are separate instructions. You can't directly add three or more values.

      • gmueckl 4 years ago

        I believe the parent wants to know how a sequence of IR instructions with alternating modes is translated to the target.

        My guess would be that explicit state changes show up in the generated machine code.

        • StefanKarpinski 4 years ago

          Right, the question becomes: if you have `(a + b) + c` and one of the `+` operations allows re-association but the other one doesn't, then what happens? The problem being that associativity is a property that only makes sense for an entire tree of associable operations, not individual ones.

dlsa 4 years ago

Never considered fast-math. I get the sense it's useful but can create awkward and/or unexpected surprises. If I were to use it I'd have to have a verification test harness as part of some pipeline to confirm no weirdness. Literally a bunch of example canary calculations to determine if fast-math will kill or harm some real use case.

Is this a sensible approach? What are others experiences around this? I've never bothered with this kind of optimisation and I now vaguely feel like I'm missing out.

I tend to use calculations for deterministic purposes rather than pure accuracy. 1+1=2.1 where the answer is stable and approximate is still better and more useful than 1+1=2.0 where the answer is unstable, e.g. because one of those is 0.9999999 and the precision triggers some edge case.

  • simonbyrneOP 4 years ago

    I tried to lay out a reasonable path: incrementally test accuracy and performance, and only enable the necessary optimizations to get the desired performance. Good tests will catch the obvious catastrophic cases, but some will inevitably be weird edge cases.

    As always, the devil is in the details: you typically can't check exact equality, as e.g. reassociating arithmetic can give slightly different (but not necessarily worse) results. So the challenge is coming up with an appropriate measure for determining whether something is wrong.

willis936 4 years ago

I use single precision floating point to save memory and computation in applications where it makes sense. I had a case where I didn't care about the vertical precision of a signal very much. It had a sample rate in the tens of thousands of samples per second. I was generating a sinusoid and transmitting it. On the receiver the signal would become garbled after about a minute. I slapped my head and immediately realized I ran out of precision by using a single precision time value feeding the sin function when t became too large with the small increment.

  sin(single(t)) == bad

  single(sin(t)) == good
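
A minimal C sketch of that failure mode (the sample rate and duration here are just illustrative):

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      const double fs = 48000.0;   /* samples per second */
      const double dt = 1.0 / fs;
      float  tf = 0.0f;            /* single-precision time base */
      double td = 0.0;             /* double-precision reference */
      for (long i = 0; i < (long)(600 * fs); ++i) {   /* ten minutes of samples */
          tf += (float)dt;         /* increments are rounded away as tf grows */
          td += dt;
      }
      /* The float time base drifts by many samples, so sin(2*pi*f*tf) produces
         a badly garbled waveform long before this point. */
      printf("float t = %.6f, double t = %.6f\n", (double)tf, td);
      return 0;
  }
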
  • adgjlsfhk1 4 years ago

    IMO, sinpi(x)=sin(pi*x) is a better function because it does much better here. The regular trig functions are approximately 20% slower for most implementations in order to accurately reduce mod 2pi, while reducing mod 2 is pretty much trivial.

    • willis936 4 years ago

      I think the real solution I should have adopted is incrementing t like this:

        t = mod(t + ts, 1 / f)
      
      Since I'm just sending a static frequency the range of time never needs to be beyond one period. However, using a double here is far from the critical path in increasing performance and it all runs fast enough anyway.
    • willis936 4 years ago

      Also, thanks. I had not heard of this function before, but apparently it was added to MATLAB in 2018.

zoomablemind 4 years ago

On the subject of the floating-point math in general, I wonder what's the practical way to treat the extreme order values (close to zero ~ 1E-200, or infinity ~ 1E200, but not zero or inf)? This can take place in some iterative methods, expansion series, or around some singularities.

How reliable is it to keep the extreme orders in expectation that the resp. quantities would cancel the orders properly, yielding a meaningful value (rounding-wise)?

For example, calculating some resulting value function, expressed as

v(x)=f(x)/g(x),

where both f(x) and g(x) are oscillating with a number of roots in a given interval of x.

  • simonbyrneOP 4 years ago

    The key thing about floating point is that it maintains relative accuracy: in your case, say f(x) and g(x) are both O(1e200) and are correct to some small relative tolerance, say 1e-10 (that is, the absolute error is 1e190). Then the relative error of f(x)/g(x) stays nicely bounded at about 2e-10.

    However if you do f(x) - g(x), the absolute error is on the order of 2e190: if f(x) - g(x) is small, then the relative error can be huge (this is known as catastrophic cancellation).
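
    A tiny concrete example of that cancellation effect (not the f(x)/g(x) case, just the standard 1 - cos(x) illustration):

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          double x = 1e-8;
          /* cos(x) rounds to exactly 1.0 here, so the subtraction cancels
             every significant digit: the relative error is 100%. */
          double naive  = 1.0 - cos(x);
          /* Algebraically identical, but no cancellation occurs. */
          double stable = 2.0 * sin(x / 2.0) * sin(x / 2.0);
          printf("naive = %g, stable = %g\n", naive, stable);   /* 0 vs ~5e-17 */
          return 0;
      }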

    • zoomablemind 4 years ago

      Both f(x) and g(x) could be calc'ed to proper machine precision (say, GSL double prec ~ 1E-15). Would this imply that beyond the machine precision the values are effectively fixed to resp. zero or infinity, instead of carrying around the extreme orders of magnitude?

      • simonbyrneOP 4 years ago

        I'm not exactly sure what you're asking here, but the point is that "to machine precision" is relative: if f(x) and g(x) are O(1e200), then the absolute error of each is still O(1e185). However f(x)/g(x) will still be very accurate (with absolute error O(1e-15)).

  • gsteinb88 4 years ago

    If you can, working with the logarithms of the intermediate large or small values is one way around the issue

    One example talking about this here: http://aosabook.org/en/500L/a-rejection-sampler.html#the-mul...
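
    For instance, a minimal log-sum-exp sketch in C (the helper name logaddexp is mine):

      #include <math.h>
      #include <stdio.h>

      /* log(exp(a) + exp(b)) computed without forming exp(a) or exp(b),
         which would underflow to zero for very negative a and b. */
      static double logaddexp(double a, double b)
      {
          double m = fmax(a, b);
          if (isinf(m) && m < 0) return m;             /* both terms are zero */
          return m + log1p(exp(fmin(a, b) - m));
      }

      int main(void)
      {
          /* exp(-1000) underflows in double precision, yet the log of the
             sum is perfectly representable. */
          printf("%f\n", logaddexp(-1000.0, -1001.0)); /* about -999.6867 */
          return 0;
      }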

  • junon 4 years ago

    Not very reliable. For precision one usually seeks out the MPFR library or something similar.

kzrdude 4 years ago

It looks like -fassociative-math is "safe" in the sense that it can not be used to get UB in working code? That's a good property to make it easier to use in the right context.

  • mbauman 4 years ago

    See the one footnote: you can re-associate a list of 2046 numbers such that they sum to _any_ floating point number between 0 and 2^970.

    https://discourse.julialang.org/t/array-ordering-and-naive-s...

    • StefanKarpinski 4 years ago

      To be fair though, as noted further down in that thread, naive left-to-right summation is the worst case here since the tree of operations is as skewed as possible. I think that any other tree shape is better and compilers will tend to make the tree more balanced, not less, when they use associativity to enable SIMD optimizations. So while reassociation is theoretically arbitrarily bad, in practice it's probably mostly ok. Probably.

      • stephencanon 4 years ago

        This is exactly right. Also, _all_ of the summation orderings satisfy the usual backwards-stability bounds, and so all of them are perfectly stable in normal computations.

    • evilotto 4 years ago

      Floating point math is fun:

        def re_add(a,b,c,d,e):
            return (a+b+c+d+e) == (2 * (c+a+b+d+e))
        
        print(re_add(1e17, -1e17, 3, 2, 1))
  • simonbyrneOP 4 years ago

    Not necessarily: if your cospi(x) function is always returning 1.0 (https://github.com/JuliaLang/julia/issues/30073#issuecomment...), but you wrote your code assuming the result was in a different interval, then you could quite easily invoke undefined behavior.

  • adgjlsfhk1 4 years ago

    Yeah. It's safe in that you won't get UB, but it's bad in that you can get arbitrarily wrong answers.

    • wiml 4 years ago

      To be fair, if you're using floating point at all you can get arbitrarily wrong answers. The nice thing about ieee754 conformance is that you can, with a lot of expertise, somewhat reason about the kinds of error you're getting. But for code that wasn't written by someone skilled in numerical techniques, and that's the vast majority of fp code, is fast-math worse than the status quo?

      • simonbyrneOP 4 years ago

        From personal experience, yes: I've seen multiple cases of scientists finding the ultimate cause of their bugs was some fast-math-related optimization.

        The problem isn't necessarily the code they wrote themselves: it is often that they've compiled someone else's code or an open source library with fast-math, which broke some internal piece.

gnufx 4 years ago

You will generally want at least -funsafe-math-optimizations for performance-critical loops. Otherwise you won't get vectorization at all with ARM Neon, for instance. You also won't get some simple loops vectorized (like products) or generally(?) loop nest optimizations. You just may not be able to afford the maybe order of magnitude cost if your code is bottlenecked on such things (although HPC code actually may well not be).

In my experience much scientific Fortran code, at least, is OK with something like -ffast-math, at least because it's likely to have been used with ifort at some stage, and even with non-754-compliant hardware if it's old enough. Obviously you should check, though, and perhaps confine such optimizations to where they're needed.

BLIS turned on -funsafe-math-optimizations (if I recall correctly) to provide extra vectorization, and still passed its extensive test suite. (The GEMM implementation is possibly the ultimate loop nest restructuring.)

pfdietz 4 years ago

The link to Kahan Summation was interesting.

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

optimalsolver 4 years ago

"-fno-math-errno" and "-fno-signed-zeros" can be turned on without any problems.

I got a four times speedup on <cmath> functions with no loss in accuracy.

  • owlbite 4 years ago

    Unless, of course, you have some algorithm that depends on signed zeros. Which is basically the same as with all the optimizations the article complains about.

    I'd suggest -ffp-contract=fast is a good idea for 99% of code. It's only going to break things where very specific effort has gone into the numerical analysis, and likely the authors of such things are sufficiently fp-savvy to tell you not to do the thing.

jjgreen 4 years ago

One trick that I happened upon was speeding up complex multiplication (like a factor of 5) under gcc with the -fcx-fortran-rules switch.

markhahn 4 years ago

NaNs should trap, but compilers should not worry about accurate debugging.
