How do computers calculate sine? (androidcalculator.com)
I recently learned how Doom was ported to the SNES. It's quite impressive. The SNES hardware was nowhere near fast enough to do all the trig calculations needed for the game, but cartridge-based games had a trick up their sleeve: they could include actual hardware inside the cart that the game code could make use of. It was more expensive, but if you expected to sell a boatload of copies, it could be worth it. However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent, etc. for every angle at the necessary precision. They were helped by the fact that the game resolution in this case was fairly low.
If you're interested, you can peruse the C code that was used to generate the tables. Here's the file for sine/cosine:
https://github.com/RandalLinden/DOOM-FX/blob/master/source/m...
Games targeting pre-Pentium PCs also used precomputed trig tables.
Pentium was fast enough that it didn't matter as much.
Just a few years later, it was slower to read from a precomputed trig table than to compute the value directly.
Yup, I remember watching a video about how the RAM bus is the bottleneck when running Super Mario 64 on the N64. The original implementation used trig lookup tables, but the person optimized it by instead using Taylor series (I think) and some negation / shifting.
For anyone curious, the video:
Also enjoy his discovery that, while the game got flak online for years because the release build shipped files that were not optimized, it turns out most of the optimizations that were done actually made things worse due to the small instruction cache / slow RAM (I forget the exact details). Things like unrolling loops just increased the code size and required more slow code paging.
in other words, for those of us who remember, they used the equivalent of a slide rule
More like a Trigonometry table, which predates even slide rules:
Back in the 80's, when I made demos on the C64, we also used pre-calculated sines. I remember going to the library to get a book with the values.
Abramowitz and Stegun's Handbook of Mathematical Functions[0] - I still use it to test whenever I need to implement any kind of fundamental math function that's not built into whatever environment I'm using. Haven't used it in some time but it comes in handy.
I just wrote a BASIC program generating values using the SIN function [1] and poked them into memory.
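For anyone curious, a rough modern equivalent of such a table generator, as a Python sketch (the 256-entry size and signed 8-bit scale here are illustrative assumptions, not what any particular demo used):

    import math

    ENTRIES = 256   # assumed table size: one full period split into 256 steps
    SCALE = 127     # assumed scale: fit a signed 8-bit range

    # Precompute sin for each step of the period, rounded to integers.
    table = [round(SCALE * math.sin(2 * math.pi * i / ENTRIES)) for i in range(ENTRIES)]
    print(table[:8])   # these values would then be poked into memory or baked into ROM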
It turns out that SNES DOOM missed out on a big optimization that people figured out later on. If you use the SNES's Mosaic feature combined with scrolling tricks, you can nearly double the fill rate. Rather than outputting two pixels, you let the SNES's mosaic hardware do the pixel doubling.
> […] pre-calculated lookup tables […]
The approach is older than that. I remember my grandfather's engineering books from the 1950s – nearly every one of them had a large appendix with said pre-calculated sine, cosine, tangent and logarithm lookup tables. And there was at least one book that only had such tables and no other information.
That is how engineers used to calculate before the advent of computers.
The classic of this field of books is Abramowitz and Stegun's "Handbook of Mathematical Functions" - although the two listed names are merely those of the compilation editors, as the calculations of the numerous tables of values (and sheets of mathematical identities) required hundreds of human computers operating for years. Ironically, on publication in 1964 it was just in time to see the dawn of the electronic computer age that would supplant it.
I still use it when testing implementations of mathematical functions. Like, if all I need is a Bessel function, why pull in a whole CAS to do that?
Once I tested lookup tables for a path tracer (ray tracer).
It is interesting that you can get very decent results even with low quality tables. Of course there will be artifacts but due to the randomness of a path tracer this is not always very noticeable.
I always wonder when hearing about these old optimizations why they aren't used in contemporary code. Wouldn't you want to squeeze every bit of performance even on modern hardware?
The "processor-memory performance gap" is a big reason why lookup tables aren't as clear of a win on modern hardware as they were on the SNES.
If it takes two CPU cycles to read from RAM, a lookup table will basically always be faster than doing the math at runtime. If it takes fifty cycles (because, while your RAM may be faster, your CPU is a lot faster), and your processor has more advanced hardware that can do more math per cycle, maybe just doing the math is faster.
I think this is the only answer that addresses the issue. We always underestimate the cost of a read and overestimate the cost of a compute.
Some of these old optimizations are now deprecated. For example, there’s a famous trick for inverse square root: https://en.wikipedia.org/wiki/Fast_inverse_square_root Modern processors have a special instruction for that. The hardware instruction is several times faster, and a couple of orders of magnitude more precise: https://www.felixcloutier.com/x86/rsqrtps
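For anyone who hasn't seen it, the linked trick is just a bit-level initial guess plus one Newton step. A rough Python rendering for illustration (reinterpreting the float bits via struct), obsolete in practice now that rsqrtps exists:

    import math
    import struct

    def fast_inv_sqrt(x):
        # Reinterpret the 32-bit float as an integer, apply the magic-constant guess,
        # then reinterpret back and refine with one Newton-Raphson step.
        i = struct.unpack('<I', struct.pack('<f', x))[0]
        i = 0x5F3759DF - (i >> 1)
        y = struct.unpack('<f', struct.pack('<I', i))[0]
        return y * (1.5 - 0.5 * x * y * y)

    print(fast_inv_sqrt(2.0), 1 / math.sqrt(2.0))   # within ~0.2% of the exact value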
Other optimizations are now applied automatically by compilers. For example, all modern compilers optimize integer division by compile-time constants, here’s an example: https://godbolt.org/z/1b8r5c5MG
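The compiler transformation is essentially "multiply by a precomputed fixed-point reciprocal, then shift"; here's a rough Python sketch of the idea for unsigned 32-bit division by 7 (the shift amount and magic constant are illustrative choices, not necessarily what any given compiler emits):

    d = 7
    s = 35                    # shift amount: 2**s must be large enough for 32-bit inputs
    m = (1 << s) // d + 1     # "magic" multiplier, roughly 2**s / d rounded up

    # Spot-check that (x * m) >> s reproduces x // d across the 32-bit range.
    for x in range(0, 1 << 32, 12_345_677):
        assert (x * m) >> s == x // d
    print("division by", d, "== multiply by", m, "then shift right by", s)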
Squeezing performance out of modern hardware requires doing very different things.
Here’s an example about numerical computations. On paper, each core of my CPU can do 64 single-precision FLOPs each cycle. In reality, to achieve that performance a program needs to spam _mm256_fmadd_ps instructions while only loading at most 1 AVX vector per FMA, and only storing at most 1 AVX vector per two FMAs.
Artifacts are ugly. So why force it on modern hardware when GPUs are extremely fast?
For reference: I was doing a path tracer in PHP :) so yeah, that renders like ancient hardware.
(The browser requested different buckets of an image. A PHP script then rendered and returned that bucket. So it was a kind of multi-threading but still very slow.)
A lot get antiquated by instruction additions, like the infamous inverse square root
Doing so costs time/wage dollars.
> However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent etc. for every angle at the necessary precision.
Is this really the order of events? I imagine the pre-calculated route is what you'd try first, and only go for extra hardware if that failed somehow.
Lookup tables were commonplace in 80s-90s graphics programming. They would have been tried before any consideration of custom hardware solutions.
imagine thinking lookup tables are innovative
This made me realize that trigonometric functions are not deterministic across different CPU architectures, OS, and programming languages (floating point precision aside).
E.g. I would assume that Math.sin(x) returns the same thing in NodeJS on Windows and Mac/M1, but it turns out that is not necessarily so. https://stackoverflow.com/questions/74074312/standard-math-f...
Rounding transcendentals correctly has unknown time and space complexity. [1] Sort of brushes up against the halting problem. With limited precision, the upper limit becomes calculable but it's rather large - packages that offer correct rounding on 64-bit floating point use potentially hundreds of bytes to deal with a single floating point value. Dedicated circuitry to implement it fast would be big and complicated even by today's standards.
https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma
True in general, almost false for binary64. Important univariate functions have been thoroughly mapped for known hard-to-round cases, with the result that we need at most a triple-double format to do the correct rounding. (Bivariate functions like `pow` are much harder, and not yet fully mapped as of this writing.) As a result we now have a mathematical library that almost ensures correct rounding [1], and a further optimization is currently in progress.
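For context, the double-double / triple-double formats mentioned above are built from error-free transformations such as Knuth's TwoSum, which recovers the exact rounding error of an addition. A minimal Python sketch:

    def two_sum(a, b):
        # Knuth's TwoSum: s is the rounded sum, e the rounding error, so a + b == s + e exactly.
        s = a + b
        t = s - a
        e = (a - (s - t)) + (b - t)
        return s, e

    print(two_sum(0.1, 0.2))   # (0.30000000000000004, -2.7755575615628914e-17)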
> unknown time and space complexity
True in general, but for the basic datatypes sent through hardware registers, your processor architecture has fixed precision. So the time and space complexity is O(1).
We should just use an analog circuit for this sort of thing where the exact precision doesn't matter.
Yeah, but you'd be surprised at how frequently they appear to be the same. I once worked on an HTML5 game that relied on deterministic simulation for networking. And it wasn't until pretty late in development that a build of Chrome shipped on some platform that finally triggered a desync in our extensive cross-platform test suite.
We implemented a deterministic approximation, and moved on. But I learned something important about trig functions that day.
Well, what's fun is that (AFAIK) trigonometric functions tend not to be implemented in the newer floating-point instruction sets, such as SSE or AVX.
So while what you say is true about the x87 implementation of those functions, for anything targeting a machine built in the last 20 years it's likely the code will run consistently regardless of the architecture (barring architecture floating-point bugs, which aren't terribly uncommon in the less significant bits and when overclocking comes into play).
x86 compilers won't use x87 instructions when SSE2 and later are available. x87 is just a really weird and funky instruction set that's best left in the gutter of history.
Sadly even SSE vs. AVX is enough to often give different results, as SSE doesn't have support for fused multiply-add instructions which allow calculation of a*b + c with guaranteed correct rounding. Even though this should allow CPUs from 2013 and later to all use FMA, gcc/clang don't enable AVX by default for the x86-64 targets. And even if they did, results are only guaranteed identical if implementations have chosen the exact same polynomial approximation method and no compiler optimizations alter the instruction sequence.
Unfortunately, floating point results will probably continue to differ across platforms for the foreseeable future.
That's a bit of a different problem IMO.
Barring someone doing a "check if AVX is available" check inside their code, binaries are generally compiled targeting either SSE or AVX and not both. You can reasonably expect that the same binary thrown against multiple architectures will have the same output.
This, of course, doesn't apply if we are talking about a JIT. All bets are off if you are talking about javascript or the JVM.
That is to say, you can expect that a C++ binary blob from the Ubuntu repo is going to get the same numbers regardless of the machine, since they generally target fairly old architectures.
> Barring someone doing a "check if AVX is available" check inside their code
Afaik that is exactly what glibc does internally
GCC won't use FMA without fast-math though. Even when AVX is otherwise enabled.
Sure it will:
> -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them
> The default is -ffp-contract=off for C in a standards compliant mode (-std=c11 or similar), -ffp-contract=fast otherwise.
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#ind...
Oh, wow, forgot about fp-contract. It says it is off in C by default, what about C++?
Read closer, it defaults to fast, not off
I would have expected that to be a bug in the documentation. Why would they turn FMA off for standards-compliant C mode, but not for standards-compliant C++ mode?
But the documentation does appear to be correct: https://godbolt.org/z/3bvP136oc
Crazy.
it defaults to off for standard-compliant mode. Which in my mind was the default mode as that's what we use everywhere I have worked in the last 15 years. But of course that's not the case.
In any case, according to the sibling comment, the default is 'fast' even in std-compliant mode in C++, which I find very surprising. I'm not very familiar with that corner of the standard, but it must be looser than the equivalent wording in the C standard.
> x87 is just a really weird and funky instruction set that's best left in the gutter of history
hmmm, can you use the long doubles in sse or avx? They are glorious, and as far as I see from playing with godbolt, they still require dirtying your hands with the x87 stack.
The 80-bit float? Not as far as I'm aware. However, it's fairly trivial to get much higher precision by representing a value as a pair of 64-bit floats (double-double arithmetic). And with the nature of AVX/SSE, you don't really take much of a performance hit for doing that, as you are often operating on both parts of the pair with the same instruction.
Do you know if there's language support for that? Are there obscure gcc options that make "long double" be quadruple precision floats?
You can just use standard C _Float128 type https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html
Which language?
For C++, there's this: https://en.cppreference.com/w/cpp/types/floating-point
They do, however, have some intrinsics for trig functions in AVX in their compilers. Not as good as having an instruction of course.
What about GPU ISAs?
> deterministic
I would use the word "consistent."
Non-determinism implies randomness.
Safer to assume that floats are never deterministic.
Floats follow a clear specification which determines precisely how basic arithmetic should work. They should work the same on all popular modern platforms. (Whether specific software libraries are the same is a separate question.)
But transcendentals like sine are not part of the strictly defined basic arithmetic; they are intentionally defined with relaxed behavior.
If you implement sine in software using the same sequence of basic arithmetic instructions, the result should be the same across platforms. If you make two different implementations using different arithmetic, then of course you can't rely on them being the same.
Point being that IEEE 754 defines two sets of operations: the required operations (section 5) that should produce correctly rounded results to the last digit, and recommended operations (section 9) with relaxed requirements. Sine belongs to the latter section, so IEEE 754 does not mandate reproducible results for sine.
My understanding is that most software always uses some software implementation of sine, rather than calling a hardware instruction. Which is definitely what you should do if you care about getting the exact same results across platforms.
Software implementations can and do differ (even dynamically) based on the hardware, though. For example, glibc's sin(x), which is what C code ends up using (and other languages relying on the C stdlib), uses FMA instructions on my CPU. So the exact same binary on the exact same OS with the exact same glibc will behave differently on a very old CPU without FMA, where it falls back to a different implementation (code that uses FMA generally can't be ported exactly to hardware without it, short of a gigantic and unacceptable drop in performance).
Yeah, the lack of FMA in some contexts is a serious bummer. It would be great if every popular CPU platform would figure out a way to get FMA implemented, and if programming languages would figure out better ways to help programmers use it explicitly without making their code too ugly.
At this point Intel, AMD, Arm, and RISC-V all do, and have for a while. The only one that is at all relevant that doesn't is Apple M-series under Rosetta.
Even there, many standard libraries provide very good precision, at least within a sane domain.
Yeah, that’s kind of my point. 99% consistent isn’t.
This is not always a safe assumption (in certain scenarios floating point results being nondeterministic has the possibility to introduce bugs and security issues) and is also a kind of sad way to look at the world. The response to "I don't understand how this works" should not be to adopt an incorrect viewpoint, but to know the limitations of your understanding.
It’s not that I don’t understand, it’s that I do. Floats are inherently lossy representations. Yes, this means the more operations you perform on a float input, the fuzzier the value is. You ignore that harsh reality at your peril. If you find engineering rigor sad, I don’t know what to tell you.
"Floats are not deterministic" is not engineering rigor, it's just wrong. They are specified precisely by IEEE-754 in how they must behave and which operations are allowed to produce which results.
IEEE 754 conforming floats conform to IEEE-754. If they actually conform. Low end devices with shitty software implementations often get the hard edge cases wrong.
Yes and when that happens it is important to know what went wrong rather than just handwaving "oh it's that float stuff again I can't trust it".
> the more operations you perform on a float input, the fuzzier the value is
No, any float always precisely represents a specific number. The issue is that only a finite number of numbers are representable.
Some algorithms are poorly conditioned and when implemented using floating point arithmetic will lead to a result that is different than what you would get in idealized real number arithmetic. That doesn't make any floating point value "fuzzy".
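A quick way to see this in Python: every float is an exact rational, just not necessarily the one you wrote down.

    from fractions import Fraction

    # 0.1 is stored as the nearest representable binary fraction -- exactly this rational:
    print(Fraction(0.1))                     # 3602879701896397/36028797018963968
    print(Fraction(0.1) == Fraction(1, 10))  # False: it is not exactly one tenth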
> No, any float always precisely represents a specific number. The issue is that only a finite number of numbers are representable.
A float always precisely represents a specific number, but that number is not always precisely equal to the algebraic result of the operations performed (even when ignoring transcendental and irrational functions). This should be obvious, since there are infinitely many rational numbers but only finitely many floating point numbers.
If you design your algorithms very carefully, you can keep the ratio of your algorithm's output to the algebraic result close to unity over a wide domain of inputs.
Irrational numbers cannot have an exact representation in digital bits. (Computers use rational numbers with modular arithmetic under the hood.)
Yes I am aware of this fact
It depends. If you're constrained to one chip and one platform, you can characterize (or at least estimate) the floating-point behavior that matters in your application. In some applications, like embedded, that's actually totally fine, and modern embedded chips can often do floating point as fast as or faster than they can emulate fixed point to work around floating point's drawbacks. On one project I worked on, they originally wrote everything in fixed point out of fear that floating point would introduce some deleterious effect. But in the end they rewrote parts of the project using floating point to no ill effect and a great performance improvement. And there were features of the product they had to strike, because the rewrite needed to support them couldn't touch certain sensitive areas of the code that had been tested extensively over the 2 or 3 years of development. It would have been much better to evaluate the assumption that floats are bad early on in the project and make the decision based on real information. The heuristic they were applying ended up costing part of the product that was strategically important.
> constrained to one chip and one platform
and constrained to one compiler at a precise version, and one set of compiler options
Floats are well defined, and it is perfectly possible to reason about how algorithms based on them should behave. Few languages specify the accuracy of things like trig functions, so relying on them can be tricky, and JavaScript is particularly bad in that respect.
They're always deterministic in some sense (and as long as your OS respects the rounding mode after a context switch properly). This might sound pedantic but it determines how we think about floats — the behaviour is specified quite exactly.
Curiously, early Intel 386 processors had a bug where 32-bit multiplies were genuinely nondeterministic: some answers would depend on the voltage, frequency, temperature, and manufacturing conditions. The problem was essentially analog, a layout issue, where the signal didn't always have enough margin. (This is unrelated to the famous Pentium FDIV bug.) Until Intel got the problem fixed, they would stamp bad chips with "16 BIT S/W ONLY", while good chips were labeled "ΣΣ".
What I mean is that the same code running on different hardware/os may not always give the same answer. It’ll be close, but you can’t always expect bit for bit identical.
They're always deterministic, just as long as physics is.
But why? We all know 0.1+0.2 won’t give 0.3 with floats but at least we should expect deterministic result for same numbers and same operations and same order, no?
I don't think that's safe at all. Catastrophic cancellation would be quite a lot less catastrophic if rounding errors were random but accurate on average.
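A classic example of the cancellation being discussed, as a quick Python check (1 - cos(x) versus the algebraically identical 2*sin(x/2)^2):

    import math

    x = 1e-8
    print(1 - math.cos(x))            # 0.0 -- every significant digit cancels
    print(2 * math.sin(x / 2) ** 2)   # ~5e-17 -- same quantity, rewritten to avoid cancellation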
Somewhat annoyingly, the ECMAScript standard only specifies that various math functions return an approximation but does not set any bounds on that approximation. So for many functions you could just return NaN and still be compliant.
Isn’t NaN the one value that can’t possibly count as an approximation, because it’s not a number and unordered? ;)
You might think so, but if it’s not specified in the standard…
It is the worst possible approximation though.
I was submitting an interpretation of the standard.
Some languages (e.g. Julia) provide their own math library so that you get the same results across operating systems.
So you can use sin(x) for various x to tell what you are running on. Maybe even in the browser?
There is a shader which uses similar idea to detect what kind of GPU you have https://www.shadertoy.com/view/slySWV
V8 and SpiderMonkey have converged to the same underlying library (fdlibm), partly for the interoperability, so you generally can't.
It's much clearer if you read the source code of one of the libm implementations.
Plan 9: https://9p.io/sources/plan9/sys/src/libc/port/sin.c
Freebsd: https://cgit.freebsd.org/src/tree/lib/msun/src/k_sin.c
The FreeBSD code is missing the range reduction step (it's named a "kernel" for a reason): https://cgit.freebsd.org/src/tree/lib/msun/src/e_rem_pio2.c
Ah, I always referred to the musl implementation, but I just now realized that they copied the FreeBSD one.
it says "Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved."
and the directory name is literally: `msun`
Does 9front keep the same sin.c implementation?
P.J. Plauger's _The Standard C Library_ provides an implementation for all functions in the (then) C standard: https://www.amazon.com/Standard-Library-P-J-Plauger/dp/01383...
After reducing the interval, you don't want to use the Taylor series, as you're building an approximation that's really good at 0 but not so good moving away from 0. It's better to use an interpolating polynomial (Chebyshev comes to mind) over the whole target interval.
There are many ways to do this. It's not a difficult problem unless memory is constrained.
I've seen the third-order Taylor series used, but with the coefficients calculated at various offsets for a quarter wave. So you look up where you are in the quarter wave, then look up the 3 or 4 coefficients. This keeps the error somewhat bounded, as x stays small and the series does not diverge too much.
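To make the Chebyshev-versus-Taylor point above concrete, a rough numpy sketch fitting sin on [-pi/4, pi/4] (the degree and sample count are arbitrary illustrative choices):

    import numpy as np

    a, b, deg = -np.pi / 4, np.pi / 4, 7
    x = np.linspace(a, b, 10001)

    # Fit in the Chebyshev basis over the whole interval (spreads the error out
    # instead of concentrating all the accuracy at 0).
    cheb = np.polynomial.Chebyshev.fit(x, np.sin(x), deg)

    # Taylor polynomial of the same degree around 0: x - x^3/6 + x^5/120 - x^7/5040.
    taylor = x - x**3 / 6 + x**5 / 120 - x**7 / 5040

    print("max error, Chebyshev fit:", np.max(np.abs(cheb(x) - np.sin(x))))
    print("max error, Taylor at 0:  ", np.max(np.abs(taylor - np.sin(x))))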
You might also find this video interesting:
"Finding the BEST sine function for Nintendo 64"
Anyone experienced with the Remez algorithm mentioned at the end of the article?
The degree-9 polynomial, said to be a thousand times better than the original Taylor approximation in maximum error, also appears to be very close to the Taylor series in the first place.
Rounding the Taylor coefficients to 6 digits after the decimal:
1/3! = 0.166667
1/5! = 0.008333
1/7! = 0.000198
1/9! = 0.0000027(56)
The first 2 are exact, the third is 5 digits only (so 0.000190), and the fourth is more different starting from the 6th digit (0.0000026019).
The delta in the 9th-order coefficient is what you'd expect from truncating the Taylor series from the 11th order onward (the x^11/11!, x^13/13!, ... terms).
I don't think the polynomial given in the article was calculated via Remez. Perhaps the author merely meant it as an illustration. Here is Wolfram Alpha plotting the error of the equation from the article:
https://www.wolframalpha.com/input?i=plot+sin%28x%29+-+P%28x....
You'll note that the error is small near zero and large near pi/4, which is characteristic of a Taylor series around zero, and not the level equioscillation characteristic of Remez approximations[1]. Note also that the polynomial only includes odd terms, which is not something I would expect from Remez unless it was run on the symmetric interval [-pi/4, pi/4].
I ran Remez for the problem and after 4 iterations obtained a degree 8 polynomial with error less than 1e-9, but it didn't look anything like the polynomial given in the article.
Although of course the first few digits of the low-order matching terms will be very similar - any polynomial approximation method will agree there because they are fitting the same function, after all. But by the time we reach x^5 or x^7 the agreement is very loose; really only the first couple of digits are the same. The polynomial I obtained:
1.7209863008943345e-05 x^8 - 0.00024575124459624625 x^7 + 7.194649190849227e-05 x^6 + 0.008268794893899754 x^5 + 3.425379759410762e-05 x^4 - 0.16667692317020713 x^3 + 1.5422400957890642e-06 x^2 + 0.9999999106213322 x + 8.526477071446453e-10
> Perhaps the author merely meant it as an illustration.
Yes I too was expecting oscillations (since I saw Chebyshev polynomials mentioned on the wiki page). I'm beginning to think that the word "might" in their statement "One way to calculate is Remez’s algorithm. The results might look like this:" is doing a lot of heavy lifting.
A couple of notes about your results:
- The interval only needs to be [0, pi/16] and not [0, pi/4]; adjusting that in your wolframalpha query yields a similar bound to what the author lists: 3.5e-9. (Although same observation about the change of behavior at the bounds as you made).
- The polynomial you obtained actually looks pretty close to the Taylor one: the even coefficients are small, and rounding to 5 or 6 digits might yield a close one, maybe if we look for a degree 9 polynomial instead of degree 8. (I don't have an implementation of Remez handy right now to try it myself).
> Although of course the first few digits of the low-order matching terms will be very similar
Makes sense.
https://en.wikipedia.org/wiki/Remez_algorithm
It’s a very simple iterative algorithm, essentially the dumbest thing that could possibly work (like most good algorithms). It fails to converge for functions that have poles nearby unless you have a very good initial guess (the Chebyshev or Carathéodory-Fejér approximants are ~always good starting points and easily computed). In practice you want to optimize a weighted L-inf norm rather than absolute, because floating-point errors are measured in a relative norm.
Robin Green goes into this in this GDC 2002 article.
https://basesandframes.files.wordpress.com/2016/05/rgreenfas...
I blogged my adventure of implementing cosine from scratch and how others have done it:
This was my first use of open source, around 1978. I wondered how calculators and computers did trig functions, and was also using Unix V7. We had a large disk and kept the source on line. So I was able to find this: https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/libm/s... and from there this book: https://www.biblio.com/book/computer-approximations-john-f-h...
1 - cos^2(x), obviously
instead, you could just double negate to optimize away the square root
-(-(sin(x)))
Clear, but too verbose. 1/csc is what you want.
or for style points just -cos'
http://steve.hollasch.net/cgindex/math/inccos.html is a great technique if you need a fast integer approximation for some arbitrary sampling interval (e.g. motor control)
I've been doing FOC motor control for a long time and I've settled on a neat little fixed-point approximation for sin/cos. I haven't been able to find the blog I got the idea from. It's accurate to 9 bits, but is very symmetric and hits 1, 0, -1 exactly. It's also smooth, which usually makes it better than lookup tables.
Just using the 32 entry LUT and the small-angle approximations (sin(x) = x, cos(x) = 1-x^2/2) lets you calculate sin(x) within +/- 0.00015, which is rather impressive for something that can be done quickly by hand.
If you use cos(x) = 1 then you are still accurate to within 1%
[edit]
I think I also found an error in TFA? It seems like picking the "best" N would allow r to be in the range -pi/32 < r < pi/32, which makes the 3rd order Taylor series have an error of 4e-12, significantly better than the error range for 0 < r < pi/16
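For anyone who wants to check that +/- 0.00015 figure, a quick Python sketch: reduce to the nearest multiple of pi/16 so that |r| <= pi/32, then apply the angle-addition formula with the small-angle approximations.

    import math

    STEP = math.pi / 16
    SIN_LUT = [math.sin(n * STEP) for n in range(32)]   # sin at n*pi/16
    COS_LUT = [math.cos(n * STEP) for n in range(32)]   # cos at n*pi/16

    def sin_approx(x):
        n = round(x / STEP)              # nearest table entry, so |r| <= pi/32
        r = x - n * STEP
        s, c = SIN_LUT[n % 32], COS_LUT[n % 32]
        # sin(n*STEP + r) = s*cos(r) + c*sin(r), with cos(r) ~ 1 - r^2/2 and sin(r) ~ r
        return s * (1 - r * r / 2) + c * r

    worst = max(abs(sin_approx(i / 1000) - math.sin(i / 1000)) for i in range(6284))
    print(worst)   # about 1.5e-4, in line with the parent's figure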
CORDIC is how it's usually done in hardware (and FPGAs).
CORDIC is pretty obsolete, AFAIK. Its advantage is that its hardware requirements are absolutely tiny: two (?) accumulator registers, and hardware adders and shifters—I think that's all. No multiplication needed, in particular. Very convenient if you're building things from discrete transistors, like some of those early scientific calculators!
(Also has a nice property, apparently, that CORDIC-like routines exist for a bunch of special functions and they're very similar to each other. Does anyone have a good resource for learning the details of those algorithms? They sound elegant).
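For anyone wanting to see the structure, here is a rough Python sketch of CORDIC in rotation mode (floats are used for readability; in hardware x, y, z would be fixed-point and the multiplications by 2^-i plain shifts, and the 32-iteration count is an arbitrary choice):

    import math

    N = 32                                              # iterations ~ bits of precision
    ANGLES = [math.atan(2.0 ** -i) for i in range(N)]   # precomputed table of atan(2^-i)
    K = math.prod(1 / math.sqrt(1 + 4.0 ** -i) for i in range(N))   # combined rotation gain

    def cordic_sincos(theta):
        """Return (sin(theta), cos(theta)) for |theta| up to ~pi/2 by shift-and-add rotations."""
        x, y, z = 1.0, 0.0, theta
        for i in range(N):
            d = 1.0 if z >= 0 else -1.0                 # rotate toward the remaining angle
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * ANGLES[i]
        return y * K, x * K

    print(cordic_sincos(0.5))
    print(math.sin(0.5), math.cos(0.5))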
CORDIC still is used in tiny microcontrollers (smaller than Cortex-M0) and in FPGAs when you are very resource-constrained. Restricting the domain and using Chebyshev/Remez is the way to go pretty much everywhere.
I recently came across CORDIC in a fun read I just finished, 'Empire of the Sum', by Keith Houston. It's a history of calculators, and there's a chapter on the HP-35, which used CORDIC. The author goes into some detail on how it was used by that constrained device.
There are a lot of references given in the book including more details on CORDIC.
CORDIC doesn't make sense if multiplies have a similar cost to adds. If you have cheap multiplications then a successive approximation is faster. If multiplications are expensive relative to adds then CORDIC can still generally win. Basically this is only the case nowadays if you are doing ASICs or FPGAs.
I have used CORDIC to compute fixed-point log/exp with 64 bits of precision, mostly because I wanted to avoid having to implement (portable) 64x64->128 bit integer multiplication in C. It is probably still slower than doing that, but the code was really simple, and very accurate.
I think it's still used out of necessity when hardware floating point is not available (like on FPGAs).
Multiplication is pretty much needed in CORDIC! And it is far from obsolete! It works perfectly fine, and doesn't have any of the problems mentioned in the article.
CORDIC doesn't use multipliers. That's the whole appeal for low performance hardware since it's all shifts and adds. It can still be useful on more capable platforms when you want sin and cos in one operation since there is no extra cost.
It uses multiplication by powers of two, which is a floating-point bit shift.
no it's not. cordic has awful convergence of 1 bit per iteration. pretty much everyone uses power series.
Actually, pretty much everyone implements double-precision sin/cos using the same (IIRC) pair of 6th-order polynomials. The same SunPro code exists unchanged in essentially every C library everywhere. It's just a fitted curve, no fancy series definition beyond what appears in the output coefficients. One for the "mostly linear" segment where the line crosses the origin and another for the "mostly parabolic" peak of the curve.
That is 64 iterations for a double, that is nothing!
53, but that's still a lot more than the 5th degree polynomial that you need.
yeah, but 52 adds can be a lot cheaper than a few multiplies, if you're making them out of shift registers and logic gates (or LUT). in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation.
> in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation
Moving data is indeed expensive, but there’s another reason to not care. Modern CPUs take same time to add or multiply floats.
For example, the computer I’m using, with AMD Zen3 CPU cores, takes 3 cycles to add or multiply numbers, which applies to both 32- and 64-bit flavors of floats. See addps, mulps, addpd, mulpd SSE instructions in that table: https://www.uops.info/table.html
> moving around the data is 100x more expensive than the ALU operation.
This is exactly the problem with CORDIC. 52 dependent adds requires moving data from a register to the ALU and back 52 times.
It's the problem with CORDIC in that context, yes!
Why would you ever use CORDIC if you had any other option?
It's great for hardware implementations, because it's simple and you get good/excellent accuracy. I wouldn't be surprised if that's still how modern x86-64 CPUs compute sin, cos, etc.
That said, last time I had to do that in software, I used Taylor series. Might not have been an optimal solution.
EDIT:
AMD's Zen 4 takes 50-200 cycles (latency) to compute sine. I think that strongly suggests AMD uses CORDIC. https://www.agner.org/optimize/instruction_tables.pdf page 130.
Same for Intel, Tiger Lake (Intel gen 11) has 60-120 cycles of latency. Page 353.
I'd guess usually ~50 cycles for Zen 4 (and ~60 for Intel) for the float32, float64, and float80 datatypes. Denormals might also cost more cycles.
They switched away from CORDIC at one point: https://www.intel.com/content/www/us/en/developer/articles/t...
(there doesn't seem to actually be a linked article there, just the summary)
Pretty weird Intel's sine computation latency hasn't changed all that much over the years. Latencies have been pretty similar for 20 years.
EDIT: That's a paper for a software library, not the CPU's internal implementation. Which is probably still done with CORDIC.
> EDIT: That's a paper for a software library, not the CPU's internal implementation.
Unless you're seeing something I'm not, it's talking about x87, which hasn't been anything other than 'internal' since they stopped selling the 80486sx.
Ah you're right.
Anyways I wonder why it's still so slow.
60-120 cycles sure looks like a CORDIC implementation, but perhaps not.
I was truly amazed when my high school computer science teacher expanded the sine function into a Taylor series for us to implement in class.
Couldn't wrap my head around the concept until the topic was revisited in college, but the idea was there and helped me understand the tradeoffs brought by the "fast math" libraries I was using during high school.
Link doesn't appear to be valid, but aren't these usually precalculated and stored in a lookup table?
Think about how big that lookup table would be for a 64-bit double float.
They use a lookup table for nπ/16, and then a polynomial to approximate the difference to other values.
That would be quite a big lookup table... Half of float numbers are in -1…+1 range. You don't need the negative part, so that's a quarter of float numbers, over one billion numbers. And then some to get from 1 to π/2.
And that's only float32!
What’s the best way to calculate it by hand?
I’m brushing up my math basics (I graduated CS while dodging the math requirements) and it frustrates me that in trig I need to remember values at all. The values such as sqrt(2)/2 make sense but how hard is it to calculate sin(5 degrees) by hand?
Use a Taylor series with a four function calculator.
0 is a decent approximation of sin near 0.
x is a better one.
x - x^3/6 is an even better one.
x - x^3/6 + x^5/120 ...
Note that x here is in radians rather than degrees so convert (degrees * pi/180) first. Repeat until you're satisfied with how many stable digits you get
Five is small enough that you can get away with sin(5 deg) = 5pi/180
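As a concrete check of that 5-degree example, in Python:

    import math

    x = 5 * math.pi / 180     # 5 degrees in radians, about 0.0872665
    print(x)                  # one term: already close
    print(x - x**3 / 6)       # two terms: 0.0871557...
    print(math.sin(x))        # 0.0871557... (agrees to about 6 significant digits)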
I remember seeing the source code for a version of SpaceWar! running on an Adage Graphics Terminal (a one's complement machine) around 1970 that used a precomputed sine table.
I wonder what the first program was that ever used a precomputed trig table.
The same way you can eat an elephant: one byte at a time.
Any calculating job can be undertaken by a proper Turing machine. You just have to keep in mind the old triangle of usage: Cost, capability and speed.
If a human can calculate a sine, so can any full-blown computer.
I thought modern CPUs since the late 1980's used a lookup table for trig/transcendental functions. Is the LUT just an expansion of the polynomial? I never really understood how FPUs worked...
That would take way too much room. A full lookup table would have 2^64 entries of 64 bits each, at 2^70 bits of ROM.
For comparison:
- Apple’s M2 Ultra has about 134 billion transistors. That’s about 2^37.
- Avogadro’s number is about 2^79.
Reducing the argument to a small range around zero decreases that a lot, but not enough by a far stretch. There are 2^52 doubles in [0.5, 1.0), 2^52 more in [0.25, 0.5], 2^52 more in [0.125, 0.25], etc. so you’d still easily need 2^52 entries or 2^58 bits (likely way, way more)
They DO use a lookup table, because that’s what the FDIV bug came from:
https://en.m.wikipedia.org/wiki/Pentium_FDIV_bug
“It is implemented using a programmable logic array with 2,048 cells, of which 1,066 cells should have been populated with one of five values: −2, −1, 0, +1, +2.”
Not sure what you’re trying to demonstrate, they wouldn’t store every single float!! I hope you don’t program. ;)
Older CPUs generally have used CORDIC (which does use LUT but that's only a part of the algorithm) due to its simplicity and compactness, while later CPUs with extensive microcode support would do the same thing as software implementations.
TFA says they use a 32 entry LUT, then do some math on the result.
Is this still current? The paper has a publication year of 1999.
Ignorant question:
Given the ridiculous number of transistors and so on we can use in CPUs and GPUs nowadays how feasible is a relatively huge trig lookup table burned into rom?
There is a huge number of 64-bit floats, and a huge portion of those are between 0 and pi/2.
A lookup table (for any function) that covered all values between 0 and 1 in single precision, would be ~4GB; there are approximately 1B values between 0 and 1, and the result of each value is 4 bytes.
Such a table for double-precision would be much, much larger.
Seems like a terrible idea on latency grounds alone.
Why isn't this just done with an industry standard lookup table these days?
sine is easy because the series is globally convergent and fast converging
It would also be extremely inaccurate. The x^n numerators grow very quickly and digits get lost because unlimited precision isn't available. Likewise, the n! denominators also grow rapidly. Then the series is alternating which means cancellation is happening for every added term.
If you don't believe me try for x=10.
You need around 26 terms for x=10 if you do it without reduction and you want an accuratish result for double precision.
You wouldn't evaluate the terms naively from left-to-right.
x - x^3/3! + x^5/5! - ... = x * (1 - x^2/(2*3) * (1 - x^2/(4*5) * ... ) )
I just checked in python and you get a result that is around 1000*machine epsilon off. Not great, not terrible.
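For reference, a quick Python sketch of that nested evaluation at x = 10, using the 26 terms mentioned upthread:

    import math

    def sin_series_nested(x, terms=26):
        # Evaluate x*(1 - x^2/(2*3)*(1 - x^2/(4*5)*(1 - ...))) from the innermost factor outward.
        acc = 1.0
        for k in range(terms, 0, -1):
            acc = 1.0 - acc * x * x / ((2 * k) * (2 * k + 1))
        return x * acc

    x = 10.0
    err = sin_series_nested(x) - math.sin(x)
    print(err, abs(err) / 2**-52)   # absolute error, and the same in units of machine epsilon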
Did you read the article? It is specifically about how the series looks simple, but the error is actually very bad if you do things naively.
That still makes it easier compared to computing functions whose series are not globally convergent, like the inverse trig functions. Obviously, you would have to break it apart to speed convergence.
And arcsin?
harder due to convergence issues