Why is division so much more complex than other arithmetic operations?


While all current CPUs appear to use an iterative approach, as aterrel suggests, there has been some work done on non-iterative approaches. "Variable Precision Floating Point Division and Square Root" describes a non-iterative implementation of floating point division and square root in an FPGA, using lookup tables and Taylor series expansion.
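To make "iterative approach" concrete, here is a minimal sketch in C of Newton-Raphson reciprocal refinement, one of the well-known iterative schemes (alongside SRT and Goldschmidt) that hardware dividers are built around. The seed constant and iteration count here are illustrative, not taken from any particular CPU:

```c
#include <stdio.h>

/* Newton-Raphson refinement of a reciprocal: x_{n+1} = x_n * (2 - d * x_n).
 * The linear seed 48/17 - 32/17*d is the classic starting estimate for a
 * divisor normalised into [0.5, 1); it is good to about 4.5 bits, and each
 * iteration doubles the number of correct bits. */
static double recip_nr(double d, int iterations) {
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);
    return x;
}

int main(void) {
    double a = 177.5, d = 0.565;   /* divisor already normalised to [0.5, 1) */
    printf("iterative: %.15f\n", a * recip_nr(d, 4));  /* a/d via reciprocal */
    printf("hardware:  %.15f\n", a / d);
    return 0;
}
```

Because the bit count only doubles per step, a full-precision divide needs a handful of dependent iterations, which is exactly why it costs multiple cycles rather than one.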

I suspect that the same techniques could get these operations down to a single cycle (in throughput, if not latency), but you would likely need huge lookup tables, and thus infeasibly large areas of silicon real estate, to do it.
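As a rough illustration of where the area goes, here is a toy version of the table-plus-Taylor idea in C: a reciprocal table indexed by the leading mantissa bits, corrected with a first-order Taylor term. The table size and interpolation order are arbitrary choices for this sketch, not the paper's parameters:

```c
#include <stdio.h>

#define INDEX_BITS 10                /* table indexed by top 10 mantissa bits */
#define TABLE_SIZE (1 << INDEX_BITS)

static double recip_table[TABLE_SIZE];     /* 1/d0 at interval midpoints */
static double recip_sq_table[TABLE_SIZE];  /* 1/d0^2 for the Taylor term */

static void init_tables(void) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        double d0 = 1.0 + (i + 0.5) / TABLE_SIZE;  /* midpoint in [1, 2) */
        recip_table[i] = 1.0 / d0;
        recip_sq_table[i] = recip_table[i] * recip_table[i];
    }
}

/* First-order Taylor expansion around the table point d0:
 * 1/d ~= 1/d0 - (d - d0)/d0^2.  Assumes d normalised into [1, 2). */
static double recip_approx(double d) {
    int i = (int)((d - 1.0) * TABLE_SIZE);
    double d0 = 1.0 + (i + 0.5) / TABLE_SIZE;
    return recip_table[i] - (d - d0) * recip_sq_table[i];
}

int main(void) {
    init_tables();
    double d = 1.13;
    printf("approx: %.12f\nexact:  %.12f\n", recip_approx(d), 1.0 / d);
    return 0;
}
```

A 1024-entry first-order table like this only buys you on the order of 20 bits; reaching full double precision in a single pass means either more Taylor terms or exponentially larger tables, which is where the silicon cost explodes.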

Why would it not be feasible?

In designing CPUs there are many trade-offs to make. Functionality, complexity (transistor count), speed, and power consumption are all interrelated, and the decisions made during design can have a huge impact on performance.

A modern processor probably could have a main floating point unit that dedicates enough transistors to perform a floating point division in a single cycle, but it would be unlikely to be an efficient use of those transistors.

Floating point multiply made this transition from iterative to non-iterative a decade ago. These days, single-cycle multiply and even multiply-accumulate are commonplace, even in mobile processors.
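Multiply-accumulate is even exposed directly in C99 as the standard library function fma(); on a CPU with a hardware FMA unit it compiles to a single instruction, and otherwise falls back to a slower software routine. A trivial usage sketch:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.5, b = 2.0, c = 0.25;
    /* fma() computes a * b + c with a single rounding; on hardware with an
     * FMA unit (compile with e.g. -mfma on x86, link with -lm) it is one
     * instruction. */
    printf("%g\n", fma(a, b, c));   /* prints 3.25 */
    return 0;
}
```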

Before it became an efficient use of the transistor budget, multiply, like division, was often performed by an iterative method. Back then, a dedicated DSP processor might devote most of its silicon to a single fast multiply-accumulate (MAC) unit. A Core 2 Duo CPU has a floating point multiply latency of 3 cycles (the result comes out of the pipeline 3 cycles after the operands go in), but it can have 3 multiplies in flight at once, giving single-cycle throughput; meanwhile, its SSE2 unit can pump out multiple FP multiplies in a single cycle.
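You can see the latency/throughput distinction from ordinary C with a rough micro-benchmark: one loop whose multiplies form a serial dependency chain (bound by multiply latency) against one with three independent chains the pipeline can overlap (bound by throughput). This is an illustrative sketch; the ratio you measure depends on compiler flags (avoid -ffast-math, which reassociates the chains away) and on the microarchitecture:

```c
#include <stdio.h>
#include <time.h>

/* Serial chain: every multiply must wait for the previous result, so the
 * loop runs at roughly one multiply per multiply-latency (e.g. 3 cycles). */
static double chain_dependent(double x, long n) {
    double a = x;
    for (long i = 0; i < n; i++)
        a *= 1.0000001;
    return a;
}

/* Three independent chains: the pipeline can overlap them, so the loop
 * runs at roughly one multiply per cycle if throughput is 1/cycle. */
static double chains_independent(double x, long n) {
    double a = x, b = x, c = x;
    for (long i = 0; i < n; i += 3) {
        a *= 1.0000001;
        b *= 1.0000001;
        c *= 1.0000001;
    }
    return a + b + c;
}

int main(void) {
    const long n = 100000000;   /* same total number of multiplies each way */
    clock_t t0 = clock();
    double r1 = chain_dependent(1.0, n);
    clock_t t1 = clock();
    double r2 = chains_independent(1.0, n);
    clock_t t2 = clock();
    printf("dependent:   %.2fs (%g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, r1);
    printf("independent: %.2fs (%g)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, r2);
    return 0;
}
```

On a machine with 3-cycle multiply latency and single-cycle throughput, the first loop tends to take roughly three times as long as the second for the same number of multiplies.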

Instead of dedicating a huge area of silicon to a single-cycle divide unit, modern CPUs have multiple units, each of which can operate in parallel but is optimised for its own specific situations. In fact, once you take into account SIMD instructions such as SSE, or the integrated graphics of Sandy Bridge and later CPUs, there may be many such floating-point divide units on your CPU.
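For example, the SSE divps instruction divides four single-precision lanes at once, and SSE also exposes the lookup-table idea directly through its approximate-reciprocal instruction. The intrinsics below are the standard SSE API from xmmintrin.h:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {8.0f, 4.0f, 2.0f, 1.0f};
    float r[4];

    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* One divps: four full-precision divisions in a single instruction. */
    _mm_storeu_ps(r, _mm_div_ps(va, vb));
    printf("div: %g %g %g %g\n", r[0], r[1], r[2], r[3]);

    /* rcpps is the table idea in instruction form: a ~12-bit reciprocal
     * estimate, which code often refines with one Newton-Raphson step. */
    _mm_storeu_ps(r, _mm_mul_ps(va, _mm_rcp_ps(vb)));
    printf("rcp: %g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}
```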

If generic floating point division were more important to modern CPUs, then it might make sense to dedicate enough silicon area to make it single-cycle. However, most chip makers have obviously decided that they can make better use of that silicon by using those gates for other things. Thus one operation is slower, but overall (for typical usage scenarios) the CPU is faster and/or consumes less power.