Geek Page - Floating-Point Arithmetic

Why computation is necessarily inaccurate.

The recent Pentium fiasco reminds us that arithmetic is not innate to computers. Although these bundles of circuits are often thought of as the physical manifestation of mathematics, they are far more limited and imprecise. A computer requires a "floating-point" unit to perform arithmetic operations on real numbers. Until recently, this unit was found in a separate chip called a math co-processor. Now integrated into most microprocessors, the unit remains one of the computer's most complex parts.

The correct way to perform arithmetic with real numbers has long been a subject of controversy among computer scientists. That may sound strange - few of us harbor doubts about how to multiply or divide. But because a computer has only finite memory to represent an infinitely long, infinitely divisible number line, the range and accuracy of computer arithmetic are limited. The method chosen to approximate the infinite can be critical for scientific applications, since small approximations can quickly snowball into inaccurate results when millions of calculations are done in a row.
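
To get a feel for how small errors compound, here is a minimal sketch in Python (whose floats follow the IEEE double-precision format, a wider cousin of the 32-bit format described below). The decimal value 0.1 cannot be stored exactly in binary, so every addition rounds a little:

    # Summing 0.1 ten million times: each addition rounds slightly,
    # and the tiny errors accumulate into a visible discrepancy.
    total = 0.0
    for _ in range(10_000_000):
        total += 0.1

    print(total)              # close to, but not exactly, 1000000.0
    print(total - 1_000_000)  # the accumulated rounding error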

Historically, every computer manufacturer had its own, slightly different scheme for approximating arithmetic. But after a decade-long battle between committee members, the Institute of Electrical and Electronics Engineers published a set of guidelines that became an industry standard in 1985.

The cornerstone of the IEEE standard is its definition of how to store a number - how many bits should be used and how these bits should be interpreted. Because most personal computers operate on 32-bit chunks, the IEEE chose to use 32 bits for its basic format. A question then arose about how to use those bits so that both very large numbers (the number of atoms in a gram, for example) and very small numbers (say, the diameter of an atom) could be represented.

One scheme, found on very early computers, is the "fixed-point" format, which represents a real number as a string of digits interrupted by a decimal point. But this method allows only a small range of numbers to be represented. For example, if we had only enough bits to store six digits and we fixed the decimal point in the middle, the largest number would be 999.999 and the smallest positive number would be 0.001.
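
One way to picture fixed point is as an ordinary integer with an agreed-upon scale factor; a minimal Python sketch of the six-digit example above:

    # Fixed point with three digits after the decimal point: every value
    # is stored as a whole number of thousandths.
    SCALE = 1000

    def to_fixed(x):
        return round(x * SCALE)

    def from_fixed(n):
        return n / SCALE

    print(from_fixed(999999))   # 999.999, the largest representable value
    print(from_fixed(1))        # 0.001, the smallest positive value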

So, modern computers use an alternate format: floating point. As in scientific notation, numbers are represented in terms of a significand and an exponent. For example, in 3.2275e8, 3.2275 is the significand and 8 is the exponent. Because we can move, or "float," the decimal point by changing the exponent, this scheme can efficiently represent both very large and very small numbers. With six digits, for example, we can represent 9.9999e9 - a number 10 million times larger than the largest fixed-point number.
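
A quick sketch of the same idea in Python, reusing the six digits of the fixed-point example as five digits of significand and one of exponent:

    # A floating-point value is a significand times a power of the base.
    significand, exponent = 3.2275, 8
    print(significand * 10 ** exponent)   # 322750000.0

    # Storing the exponent separately stretches the same six digits
    # across a vastly wider range than fixed point allowed.
    print(9.9999e9)    # about ten million times the fixed-point maximum
    print(1.0000e-9)   # far below the fixed-point minimum of 0.001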

The IEEE standard uses a floating-point representation: it allocates 23 bits to store the significand, 8 bits to store the exponent, and 1 bit to specify whether a number is positive or negative. All numbers are "normalized" so they have exactly one nonzero digit before the decimal point. For example, 32.275e2 must be stored as 3.2275e3. Because the IEEE standard specifies that numbers are stored in binary, in which numbers are represented by ones and zeros, the nonzero digit before the decimal point can only be a 1. Therefore, there is no need to store it, and we save a whole bit.
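
Here is a sketch of how that 1 + 8 + 23 layout looks in practice, using Python's struct module to expose the raw bits of a single-precision value (the exponent field is stored with an offset, or "bias," of 127 so that it never needs a sign of its own):

    import struct

    def unpack_float32(x):
        # Reinterpret the 32-bit single-precision encoding of x as an integer.
        [raw] = struct.unpack(">I", struct.pack(">f", x))
        sign = raw >> 31                 # 1 bit: 0 for positive, 1 for negative
        exponent = (raw >> 23) & 0xFF    # 8 bits, stored with a bias of 127
        fraction = raw & 0x7FFFFF        # 23 bits; the leading 1 is implicit
        return sign, exponent - 127, fraction

    print(unpack_float32(1.0))    # (0, 0, 0): +1.0 x 2**0
    print(unpack_float32(-2.5))   # (1, 1, 2097152): -1.25 x 2**1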

But this 32-bit representation of the infinite leaves many important questions unanswered. What happens when calculations result in numbers too large or too small to be stored? What happens when a result is nonnumerical: the square root of -1, say, or zero divided by zero? How should numbers be rounded? The IEEE committee fought long and hard over all these questions.
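
Two of the standard's answers are easy to see in action: results too large to store become a special "infinity" value, and nonnumerical results become NaN, short for "not a number." A quick Python illustration (double precision follows the same rules):

    import math

    # Overflow produces a special infinity value rather than a wrong number.
    huge = 1e308 * 10
    print(huge, math.isinf(huge))    # inf True

    # Undefined results become NaN, which can be detected after the fact.
    nan = float("inf") - float("inf")
    print(nan, math.isnan(nan))      # nan True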

The most contentious issue proved to be underflow - what to do with results smaller than the smallest number that can be represented. Because of the implicit "1" before the decimal point, the smallest normalized number that can be represented is 1.0 x 2^-126 (-126 is the smallest exponent the standard allows for normalized numbers). This results in a rather odd-looking number line: numbers slowly decrease until they reach 1.0 x 2^-126 and then suddenly jump to zero.
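
A short Python sketch of where that cliff sits in the 32-bit format:

    # The smallest normalized single-precision value: an implicit leading 1
    # times the most negative normalized exponent, -126.
    smallest_normal = 1.0 * 2.0 ** -126
    print(smallest_normal)    # roughly 1.18e-38

    # With normalized numbers alone, the next representable value below
    # this would be zero itself: the sudden jump described above.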

Many people found this precipitous jump acceptable, but a faction of the IEEE standardization committee, led by William Kahan of the University of California at Berkeley, proposed using a special value in the exponent field to signal that a number has a zero, rather than the usual implicit 1, before the decimal point. This allows smaller values to be represented, but at the expense of accuracy. For example, if 1.234 already has the smallest allowable exponent, the result of dividing it by 100 can now be represented as the "denormalized number" 0.012 rather than being rounded to zero - but two significant digits have been lost. Despite arguments that denormalized numbers can lead to inaccurate results, Kahan's scheme was eventually approved.
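
Denormalized numbers (now usually called "subnormals") can be seen by round-tripping a value through the 32-bit format; a small Python sketch:

    import struct

    def to_float32(x):
        # Round-trip through the 32-bit single-precision format to see
        # what actually gets stored.
        return struct.unpack(">f", struct.pack(">f", x))[0]

    below_normal = 2.0 ** -140       # smaller than the smallest normalized value, 2**-126
    print(to_float32(below_normal))  # nonzero: kept as a denormalized number
    print(to_float32(2.0 ** -160))   # 0.0: eventually even denormals run out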

Unfortunately, the IEEE standard is not yet universally supported. Although most modern personal computers obey the key strictures, there is enough variance that different computers will give slightly different answers to multiplication and division problems. And many mainframes and supercomputers continue to ignore the standard altogether. Cray supercomputers, for example, are notorious for their sloppy arithmetic. It might seem strange that a machine used almost exclusively for number crunching would treat numerical accuracy cavalierly, but Cray was able to shave a few nanoseconds off arithmetic operations by sacrificing accuracy. As Kahan likes to say, "the fast drives out the slow even if the fast is wrong."

It's a choice of priorities all too common in the computer industry; it's why the IEEE floating-point standard is so significant. By defining what constitutes correctness, the standard marks the first step toward achieving it.

Steve G. Steinberg (steve@wired.com) is a section editor at Wired.