Leonid Pospeev

Jan 8, 2019 · 5 min read

IEEE 754 floating point numbers

General

IEEE 754 is an IEEE standard that defines how floating point numbers are stored and processed efficiently in computer memory. The main challenge in storing such numbers is that there are infinitely many of them, even within a bounded range. If you have the beginning and the end of some numeric range (from 0 to 10 for example), storing each natural number that falls into this range is easy because there are finitely many of them. The same range contains infinitely many real numbers, so only a finite subset of them can be stored exactly in a fixed number of bits.

IEEE 754 provides the most widely used approach to storing floating point numbers. This article describes the conversion principles. For convenience, unless specified otherwise, numbers followed by a subscript 2 are written in base 2, while numbers with no subscript are written in base 10.

How numbers are stored

Most commonly, floating point numbers are stored as 32 bit (single precision) and 64 bit (double precision) values, although the standard defines other widths as well.

Each floating point number is stored as three values:

sign;
biased exponent;
mantissa (or significand).

The sign is a single bit indicating the sign of the stored number. If the number is positive, the sign bit is 0. If the number is negative, the sign bit is 1.

The mantissa and the exponent represent the number in scientific notation. For base 10 numbers, scientific notation writes a number as the product of a number in the range [1; 10) and a power of 10. Say, if you have the number 48991.9223, its scientific notation will be

$$48991.9223 = 4.89919223 \times 10^{4}$$

For base 2 numbers the principle is the same, but a power of 2 is used instead of a power of 10. Since 2 is written as 10 in base 2, this is still a power of 10 when everything is written in base 2.

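Say, you have the fractional number 1011011.110101 in base 2. Its scientific notation in base 2 will be

$$1011011.110101_2 = 1.011011110101_2 \times 10_2^{110_2}$$

which is $1.011011110101_2 \times 2^{6}$ with the base and the exponent written in decimal.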

Here the number 1.011011110101 is the mantissa. The leading 1 is not stored in memory: in base 2 scientific notation the leading digit of a nonzero number can only be 1, so it is assumed implicitly and no bit is allocated for it. Only the fraction part of the mantissa is stored (011011110101 in this case).
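The normalization described above can be reproduced with a short Python sketch. It assumes Python's `float` is an IEEE 754 double (true on common platforms); the helper name `binary_scientific` is made up for this illustration.

```python
import math

def binary_scientific(x: float) -> tuple:
    """Return (mantissa, exponent) with x == mantissa * 2**exponent and 1 <= |mantissa| < 2."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite non-zero values have a normalized form")
    m, e = math.frexp(x)   # frexp returns 0.5 <= |m| < 1
    return m * 2, e - 1    # rescale so that 1 <= |m| < 2

# 1011011.110101 in base 2: read the digits as an integer, then shift by 6 places
value = int("1011011110101", 2) / 2**6
mantissa, exponent = binary_scientific(value)
print(value, mantissa, exponent)  # 91.828125 1.434814453125 6
```

The mantissa 1.434814453125 is the decimal value of 1.011011110101 in base 2, and the exponent 6 is 110 in base 2.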

The power to which 10 (in base 2, that is, 2 in decimal) is raised is the exponent. Since both large and small numbers need to be stored, the exponent can be negative as well as positive. To store it as an unsigned value, a bias is added to the exponent to form the biased exponent. For 32 bit numbers the bias is 127, for 64 bit numbers the bias is 1023. If the biased exponent stored in a 32 bit number is 129, then the actual exponent value is:

$$129 - 127 = 2$$

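In the 32 bit format, bits 0 to 22 store the mantissa fraction, bits 23 to 30 store the biased exponent and bit 31 stores the sign:

[bit 31: sign] [bits 30 to 23: biased exponent, 8 bits] [bits 22 to 0: mantissa fraction, 23 bits]

In the 64 bit format, bits 0 to 51 store the mantissa fraction, bits 52 to 62 store the biased exponent and bit 63 stores the sign:

[bit 63: sign] [bits 62 to 52: biased exponent, 11 bits] [bits 51 to 0: mantissa fraction, 52 bits]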
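As an illustration of the two layouts, the three fields can be pulled out of a value with shifts and masks. This is a minimal sketch using Python's standard `struct` module; the function names `float32_fields` and `float64_fields` are chosen for this example only.

```python
import struct

def float32_fields(x: float) -> tuple:
    """Return (sign, biased_exponent, fraction) of x encoded as a 32 bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # bit 31
    biased_exponent = (bits >> 23) & 0xFF  # bits 30 to 23
    fraction = bits & 0x7FFFFF             # bits 22 to 0
    return sign, biased_exponent, fraction

def float64_fields(x: float) -> tuple:
    """Return (sign, biased_exponent, fraction) of x encoded as a 64 bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # bit 63
    biased_exponent = (bits >> 52) & 0x7FF  # bits 62 to 52
    fraction = bits & ((1 << 52) - 1)       # bits 51 to 0
    return sign, biased_exponent, fraction

print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-2.0))  # (1, 128, 0)
print(float64_fields(1.0))   # (0, 1023, 0)
```

The value 1.0 has an actual exponent of 0, so its biased exponent equals the bias itself: 127 for 32 bit and 1023 for 64 bit numbers.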

Conversion examples

Let’s consider the 32 bit format. Say, we have this number:

11000001110111011000000000000000

Here the bits are numbered from right to left, i.e. the 0th bit is the rightmost and 31st is the leftmost.

The number is stored as:

sign bit is 1;
biased exponent is 10000011;
mantissa is 1.10111011000000000000000.

This makes it possible to write down the scientific notation of the stored number in base 2 immediately:

$$-1.10111011_2 \times 10_2^{\,10000011_2 - 1111111_2}$$

Here 1111111 in the power is the bias value (127 as decimal). Since the sign bit is 1, the number is negative.

To get the stored number in base 10 we’ll convert both biased exponent and mantissa to decimal format.

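Biased exponent is:

$$10000011_2 = 131$$

so the actual exponent is $131 - 127 = 4$.

Mantissa is:

$$1.10111011_2 = 1 + 2^{-1} + 2^{-3} + 2^{-4} + 2^{-5} + 2^{-7} + 2^{-8} = 1.73046875$$

Finally, the stored number is:

$$-1.73046875 \times 2^{4} = -27.6875$$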
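The decoding steps of this example can be sketched in Python as follows. The function `decode_float32` is a hypothetical helper written for this article; it handles normalized numbers only, and the result is cross-checked against the standard `struct` module.

```python
import struct

BIAS = 127  # exponent bias of the 32 bit format

def decode_float32(bit_string: str) -> float:
    """Decode a normalized 32 bit pattern given as a string of 0s and 1s."""
    bits = bit_string.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = -1.0 if bits[0] == "1" else 1.0
    biased_exponent = int(bits[1:9], 2)
    fraction = int(bits[9:], 2)
    if not 0 < biased_exponent < 255:
        raise ValueError("this sketch handles normalized numbers only")
    mantissa = 1 + fraction / 2**23  # restore the implicit leading 1
    return sign * mantissa * 2.0 ** (biased_exponent - BIAS)

pattern = "1 10000011 10111011000000000000000"  # sign, biased exponent, fraction
decoded = decode_float32(pattern)
reference = struct.unpack(">f", int(pattern.replace(" ", ""), 2).to_bytes(4, "big"))[0]
print(decoded, reference)  # -27.6875 -27.6875
```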

However, converting decimal numbers to a bit representation is trickier. Consider the simple-looking number 12.45. To convert it, we first need to separate the integer and fraction parts:

$$12.45 = 12 + 0.45$$

Converting 12 to base 2 is simple:

$$12 = 8 + 4 = 1100_2$$

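To convert 0.45 to base 2, multiply it by 2 repeatedly. The integer part of each product gives the next base 2 digit. The integer part is then dropped and the multiplication is repeated on the remaining fraction part.

$$\begin{aligned}
0.45 \times 2 &= 0.90 &&\rightarrow 0\\
0.90 \times 2 &= 1.80 &&\rightarrow 1\\
0.80 \times 2 &= 1.60 &&\rightarrow 1\\
0.60 \times 2 &= 1.20 &&\rightarrow 1\\
0.20 \times 2 &= 0.40 &&\rightarrow 0\\
0.40 \times 2 &= 0.80 &&\rightarrow 0\\
0.80 \times 2 &= 1.60 &&\rightarrow 1
\end{aligned}$$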

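As you can see, the digits then repeat: after the first two digits 01, the group 1100 recurs indefinitely. This yields a base 2 representation:

$$0.45 = 0.01\,1100\,1100\,1100\ldots_2$$

$$12.45 = 1100.0111001100110011\ldots_2$$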
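The successive multiplication can be written as a small loop. The sketch below uses `fractions.Fraction` so that 0.45 is represented exactly (a Python `float` would already carry a conversion error); the function name `fraction_to_binary` is an assumption of this example.

```python
from fractions import Fraction

def fraction_to_binary(value: Fraction, digits: int) -> str:
    """Return the first `digits` base 2 digits of a fraction in the range [0, 1)."""
    if not 0 <= value < 1:
        raise ValueError("value must be in the range [0, 1)")
    out = []
    for _ in range(digits):
        value *= 2
        bit = int(value)      # the integer part (0 or 1) is the next digit
        out.append(str(bit))
        value -= bit          # keep only the fraction part for the next step
    return "".join(out)

print(fraction_to_binary(Fraction(45, 100), 23))  # 01110011001100110011001
```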

The accuracy of this conversion can be checked by converting the number back. Say, we take the first 23 digits of the fraction part (the number of mantissa bits in the 32 bit format):

$$0.01110011001100110011001_2 = \frac{3774873}{2^{23}} \approx 0.44999993$$

The more digits of the fraction are kept, the more accurate the conversion is.
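To see how the accuracy improves with the number of digits, the fraction 0.45 can be truncated to different lengths and compared with the exact value. This is a sketch with exact rational arithmetic; `truncate_binary` is a helper name invented for the illustration.

```python
from fractions import Fraction

def truncate_binary(value: Fraction, digits: int) -> Fraction:
    """Keep only the first `digits` base 2 digits after the point (non-negative values)."""
    return Fraction(int(value * 2**digits), 2**digits)

target = Fraction(45, 100)
errors = []
for digits in (4, 8, 16, 23):
    approx = truncate_binary(target, digits)
    errors.append(target - approx)
    print(digits, float(approx), float(target - approx))
```

With 4 digits the approximation is 7/16 = 0.4375, and the error shrinks as more digits are kept.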

This gives the following normalized representation for the 32 bit format:

$$12.45 \approx 1.10001110011001100110011_2 \times 10_2^{11_2}$$

Here the mantissa fraction is trimmed to 23 bits. The sign, exponent and mantissa values are then:

sign bit is 0 since the value is positive;
biased exponent is 1111111 + 11 = 10000010;
mantissa is 1.10001110011001100110011.

Finally, the 32 bit representation of 12.45 is 01000001010001110011001100110011.
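The whole encoding procedure can be sketched in Python. The function below, `encode_float32_truncating`, is a hypothetical helper that follows the article's steps and trims the mantissa by truncation. Real hardware rounds to the nearest representable value by default, which happens to give the same bits for 12.45, so the result can be cross-checked against the `struct` module.

```python
import struct
from fractions import Fraction

def encode_float32_truncating(value: Fraction) -> str:
    """Encode a non-zero value as a normalized 32 bit pattern, truncating the mantissa."""
    if value == 0:
        raise ValueError("zero has a special encoding not covered by this sketch")
    sign = "1" if value < 0 else "0"
    magnitude = abs(value)
    exponent = 0
    while magnitude >= 2:      # normalize so that 1 <= magnitude < 2
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1
    if not -126 <= exponent <= 127:
        raise ValueError("outside the normalized 32 bit range")
    fraction_bits = int((magnitude - 1) * 2**23)  # drop the leading 1, keep 23 bits
    return sign + format(exponent + 127, "08b") + format(fraction_bits, "023b")

encoded = encode_float32_truncating(Fraction("12.45"))
hardware = format(struct.unpack(">I", struct.pack(">f", 12.45))[0], "032b")
print(encoded)   # 01000001010001110011001100110011
print(hardware)  # 01000001010001110011001100110011
```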

As the steps above show, the conversion error comes from trimming the mantissa. To calculate the error, we can convert the number back from base 2 to base 10. The actual stored value is:

$$1.10001110011001100110011_2 \times 2^{3} = 12.44999980926513671875$$

Conversion error in this case is:

$$12.45 - 12.44999980926513671875 \approx 1.9 \times 10^{-7}$$
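The stored value and the conversion error can be checked with Python's `decimal` module, which converts a binary floating point value to its exact decimal expansion. The sketch assumes that `struct.pack(">f", ...)` rounds to single precision the same way the hardware does.

```python
import struct
from decimal import Decimal

# Round 12.45 to single precision, then widen it back without further loss
stored = struct.unpack(">f", struct.pack(">f", 12.45))[0]

exact_stored = Decimal(stored)            # exact decimal value of the stored bits
error = Decimal("12.45") - exact_stored   # exact conversion error

print(exact_stored)  # 12.44999980926513671875
print(error)
```

The error is exactly 0.2 × 2^-20, that is, one fifth of the spacing between neighbouring 32 bit values near 12.45.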

Special values

The standard reserves specific bit patterns for special values. These values are:

Zero. In this case both mantissa and exponent are all 0. Depending on the sign bit value, +0 and -0 exist.
Infinity. In this case exponent is all 1 and mantissa is all 0. Similar to zero value, +infinity and -infinity exist depending on the sign bit value.
Denormalised number. In this case the exponent is all 0 and the mantissa is not all 0. The mantissa is assumed to have a leading 0 instead of the leading 1 of normalized numbers, and the exponent is treated as the smallest normalized exponent (-126 for 32 bit numbers, -1022 for 64 bit numbers).
Not a number (NaN). In this case the exponent is all 1 and the mantissa is not all 0. On most current hardware, if the most significant mantissa bit is 1 the NaN is quiet (QNaN), and if it is 0 the NaN is signaling (SNaN). A QNaN is the usual result of an operation that has no defined numeric result, such as 0/0, and it propagates through further operations without raising an exception. An SNaN is not normally produced by arithmetic. It signals an invalid operation exception when it is used as an operand, which makes it useful for marking uninitialized data.
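The rules above map directly to a few comparisons on the exponent and mantissa fields. The following sketch classifies a raw 32 bit pattern; `classify_float32` is a name made up for this example, and the quiet/signaling distinction follows the convention described above (most significant mantissa bit set means quiet), which is an assumption about the target platform.

```python
def classify_float32(bits: int) -> str:
    """Classify a raw 32 bit pattern according to the IEEE 754 special value rules."""
    if not 0 <= bits <= 0xFFFFFFFF:
        raise ValueError("expected a 32 bit unsigned integer")
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return sign + ("zero" if fraction == 0 else "denormalized")
    if exponent == 0xFF:
        if fraction == 0:
            return sign + "infinity"
        # assumed convention: top mantissa bit 1 means quiet, 0 means signaling
        return "quiet NaN" if fraction & 0x400000 else "signaling NaN"
    return sign + "normalized"

for pattern in (0x00000000, 0x80000000, 0x7F800000, 0x00000001,
                0x7FC00000, 0x7F800001, 0xC1DD8000):
    print(f"{pattern:#010x}", classify_float32(pattern))
```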

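IEEE 754 numbers can be used to store numbers in these ranges:

Single precision, normalized: $\pm 2^{-126}$ to $\pm(2 - 2^{-23}) \times 2^{127}$
Single precision, denormalized: $\pm 2^{-149}$ to $\pm(1 - 2^{-23}) \times 2^{-126}$
Double precision, normalized: $\pm 2^{-1022}$ to $\pm(2 - 2^{-52}) \times 2^{1023}$
Double precision, denormalized: $\pm 2^{-1074}$ to $\pm(1 - 2^{-52}) \times 2^{-1022}$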

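The cases in which the standard cannot provide a reasonably accurate representation of a number are (single precision bounds first, double precision bounds second):

Negative overflow: numbers less than $-(2 - 2^{-23}) \times 2^{127}$ or $-(2 - 2^{-52}) \times 2^{1023}$
Negative underflow: negative numbers greater than $-2^{-149}$ or $-2^{-1074}$
Positive underflow: positive numbers less than $2^{-149}$ or $2^{-1074}$
Positive overflow: numbers greater than $(2 - 2^{-23}) \times 2^{127}$ or $(2 - 2^{-52}) \times 2^{1023}$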

Overflow here means that the number is too large to be stored, underflow means the number is too small (close to zero) for the standard to maintain its precision.
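Both situations can be observed from Python by forcing a value into the 32 bit format with the `struct` module. Note one difference from hardware arithmetic: CPython's `struct.pack` raises `OverflowError` for a finite value that is too large, whereas a floating point unit would normally produce infinity. The helper names below are assumptions of this sketch.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to single precision and widen it back to a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def fits_float32(x: float) -> bool:
    """Return False if x is too large in magnitude to be packed as a 32 bit float."""
    try:
        struct.pack(">f", x)
    except OverflowError:
        return False
    return True

print(fits_float32(3.0e38))   # True: just below the largest 32 bit value
print(fits_float32(1.0e39))   # False: overflow
print(to_float32(1.0e-40))    # denormalized: non-zero, but with reduced precision
print(to_float32(1.0e-50))    # 0.0: underflow to zero
```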

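Combining the ranges for normalized and denormalized numbers, the effective ranges of values that can be stored as IEEE 754 numbers are:

Single precision: $\pm 2^{-149}$ to $\pm(2 - 2^{-23}) \times 2^{127}$, approximately $\pm 1.4 \times 10^{-45}$ to $\pm 3.4 \times 10^{38}$
Double precision: $\pm 2^{-1074}$ to $\pm(2 - 2^{-52}) \times 2^{1023}$, approximately $\pm 4.9 \times 10^{-324}$ to $\pm 1.8 \times 10^{308}$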

The ranges are symmetrical around zero since the sign of a number is only affected by the sign bit.
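The limits of a format follow from its field widths alone, which the sketch below illustrates. It assumes Python's `float` is an IEEE 754 double, so the 64 bit results can be compared with `sys.float_info`; the function name `format_limits` is invented for this example.

```python
import math
import sys

def format_limits(fraction_bits: int, exponent_bits: int) -> tuple:
    """Return (smallest denormalized, smallest normalized, largest) positive values."""
    bias = 2 ** (exponent_bits - 1) - 1
    min_exponent = 1 - bias   # exponent of the smallest normalized number
    max_exponent = bias       # the all-ones exponent is reserved for infinity and NaN
    smallest_denormal = math.ldexp(1.0, min_exponent - fraction_bits)
    smallest_normal = math.ldexp(1.0, min_exponent)
    largest = math.ldexp(2.0 - math.ldexp(1.0, -fraction_bits), max_exponent)
    return smallest_denormal, smallest_normal, largest

single = format_limits(fraction_bits=23, exponent_bits=8)
double = format_limits(fraction_bits=52, exponent_bits=11)
print(single)
print(double)
print(double[1] == sys.float_info.min, double[2] == sys.float_info.max)  # True True
```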
