# x86 Processors and Infinity


This article was written in 2003, and while the basic message remains accurate the details have changed. An updated version of this article is available here, as part of a series of blog posts that explored many aspects of floating-point math.

That said, on to the article!

# Old article starts here

Modern microprocessors, such as the Pentium 4 and the Athlon, can do floating point calculations at a tremendous speed. Floating point addition takes three to five clock cycles, so at 3 GHz a Pentium 4 can do about 600 million additions per second. Better yet, the floating point unit is heavily pipelined, meaning that many operations can be going on simultaneously. If your code is well designed and if there aren't too many dependencies, these processors can do a floating point add every cycle - so that 3 GHz Pentium 4 can actually do about 3 billion additions per second! That's pretty amazing considering that these calculations are being done with up to 19 decimal digits of accuracy. Wow.

Floating point math is described by IEEE 754-1985 and IEC 559. These standards describe the formats and behaviours of floating point math. Some of the things mandated by the standards are special bit patterns for infinities, NANs (Not A Number), and denormals.

Infinities are important for handling divide by zero and overflow. An infinity records the fact that the result is really big, possibly infinitely big, in a manner that a really large finite number just can't. It can make some calculations, like the formula for the resistance of a parallel circuit, magically give the correct answer even when one of the resistances is zero, causing a divide by zero and then a divide by infinity.

NANs are important for signifying meaningless results. If you take the logarithm of a negative number, divide zero by zero, subtract infinity from itself, or any other operation where there is no meaningful answer, a NAN signals this fact. Since any operation that involves a NAN returns another NAN, you are guaranteed to be able to detect when some part of your calculations went haywire.

Denormals are special floating point numbers that are not normalized - instead of being 1.x times two to the power of the exponent, they are 0.x times two to the power of the exponent. They are used for very small numbers that would otherwise flush to zero. Their non-standard format causes difficulty for many floating point units, and was the cause of much wrangling when IEEE 754 was being worked on. However, they have some nice properties. With denormals it is always true that if x - y == 0, then x == y. Without denormals this is not necessarily true. They make floating point math more predictable and regular, and fill in a yawning gap in the real number line.

The trouble with special numbers is that it's hard to make processors that handle the regular numbers really fast, while still dealing with the special numbers. It's not impossible, but it's hard, and tradeoffs sometimes have to be made.

The Intel Pentium 4 handles infinities, NANs, and denormals very badly.

If you write code that adds floating point numbers at the rate of one per clock cycle, and then throw infinities at it as input, the performance drops. A lot. A huge amount. Suddenly it takes about 850 cycles to add two numbers together! Infinity plus infinity is equal to infinity, but a Pentium 4 takes 850 times as long to figure that out as it takes to figure out that 1234.1341234874 + 843.09821341234 = 2077.2323368997399.

Denormals are a bit trickier to measure. A number that is a denormal in 'double precision' is no longer a denormal once it's loaded into the 80-bit registers of an Intel compatible floating point unit (because the exponent range is increased), so denormals add just fine in many tests, but there is a substantial penalty for loading and saving them. Loading and saving denormals runs about 350 times slower than with normal numbers.

Is this inevitable? Is this the price we pay for having a floating point unit that is usually fast - the risk that it will occasionally run almost a thousand times slower? Well, not necessarily. The AMD Athlon can also do one floating point add per cycle, to exactly the same accuracy as the Intel processor. It also handles infinities, NANs, and denormals. However it handles infinities and NANs with no penalty. Zero. In fact, addition with infinities and NANs seems to run a tiny bit faster than with normal numbers.

Denormals do slow down the Athlon a bit, but not as much as they slow down the Pentium 4. Exact and meaningful numbers are harder to come by once you start dealing with loading and storing numbers, but the Athlon seems to load and store denormals in about one eighth as many clock cycles as the Pentium 4.

Now, this isn't an entirely fair comparison. The Pentium 4, as of this writing, is available at speeds up to 3.06 GHz. The fastest Athlon you can buy only runs at 2.16 GHz. So, in normal usage the Pentium 4 may execute floating point code faster. However, a 50% clock speed advantage doesn't always matter. Consider how much faster (in clock ticks per operation) the Athlon is on various operations, including those discussed above and some others:

• Infinities - 850:1
• NANs - 930:1
• Divides - 2:1
• Square roots - 4:3
• Cache unaligned reads - 9:1
• Cache unaligned writes - 35:1

There are probably places where the Pentium 4 runs faster per clock-tick than the Athlon. However there are certainly some places where the Pentium 4 runs spectacularly slower than the Athlon.

But does this really matter? If one floating point instruction out of a million involves an infinity, then it doesn't matter. But if one out of a hundred involves a NAN, then your floating point performance just dropped by a factor of ten - ninety-nine one-cycle adds plus one roughly 900-cycle add averages out to about ten cycles per operation.

This web site documents one person who ran into this problem, finding that the Pentium 4 ran approximately two orders of magnitude (that's one hundred times) slower than a comparable Athlon.

The Fractal eXtreme shareware fractal program also runs into this problem. Unwinding of the fractal calculation loop means that overflow to infinity sometimes happens. The code handles this properly, but there can be a performance hit.

Code that initializes variables with +- infinity as the largest number in order to avoid special case checks is vulnerable. In most cases this code can be 'fixed' by replacing infinity with DBL_MAX, but infinity is really the correct value to use.

The really peculiar thing about this is that the Pentium 4 handles NANs and infinities both at full speed and with a roughly 900 times penalty - it all depends on which floating point unit you use. If you use the regular stack based floating point unit that has been around since the '287, you get the huge penalty. However, if you use the new SSE2 instructions for double precision floating point, there is no penalty for operating on infinities or NANs.

Maybe the stack based architecture of the original floating point unit made further optimizations too difficult, or maybe Intel decided to devote their transistors to SSE2 instead, but for whatever reason, if you want reliably fast math on the Pentium 4, you need to think about SSE2.

SSE2 still has the penalties for denormals, but it also adds a couple of flags to disable the generation and use of denormals. This lets you tell the processor to flush SSE2 denormals to zero, which avoids the penalties quite nicely, at the cost of giving up true IEEE 754 compliance, and potentially some numerical stability.

The test results that isolated these problems are available for a wide range of processors. So is the source code that you can use to verify these results. All recent Intel processors suffer from these problems, but the Pentium 4 definitely has the largest penalties.

Thanks to Brian Brown for bringing this issue to my attention and writing the SSE2 test code. Thanks to Paul Komarek for documenting his problems with anomalous performance on the P4.

Pentium and Athlon are trademarks of Intel and AMD, respectively.