Jamil Khatib

Introduction to Floating-Point Calculations and the IEEE 754 Standard

1  Introduction

Floating point is a representation of real (fractional) numbers. In this representation the position of the radix point is not fixed: it can move ("float") from one position to another according to the exponent, which is where the name comes from.
A floating-point number takes the general form of an exponent (Exp.) together with a fraction:
Exp.   0.Fraction
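
As a rough illustration of this general form, the sketch below splits a number into a fraction and an exponent. It assumes the form is read as value = 0.Fraction × 2^Exp, which is one possible interpretation of the layout above; the helper name split_float is hypothetical, and math.frexp is Python's standard function for this decomposition.

```python
import math


def split_float(x: float) -> tuple[float, int]:
    """Return (fraction, exponent) with x == fraction * 2**exponent.

    math.frexp keeps the magnitude of the fraction in [0.5, 1),
    i.e. a binary fraction of the form 0.1xxx...
    """
    return math.frexp(x)


fraction, exponent = split_float(6.0)
print(fraction, exponent)               # 0.75 3, since 6.0 == 0.75 * 2**3
print(math.ldexp(fraction, exponent))   # 6.0, the original value rebuilt
```

Note that this is only a view of the value; the stored IEEE-754 encoding uses a biased exponent and a different normalization.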

Although floating-point numbers solve many of the problems of integers, they have their own problems and considerations. The storage available for each number is limited, so an infinite number of digits cannot be represented, unlike real numbers and exact real-life calculations. The rounding method and the order of calculations are among the things that must be considered in floating-point arithmetic.
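
The effect of the order of calculations can be seen in a few lines. This is a minimal sketch using Python's built-in float type, which is assumed here to be an IEEE-754 double (true on common platforms); the function names are hypothetical.

```python
def sum_left(a: float, b: float, c: float) -> float:
    """Add from left to right: (a + b) + c."""
    return (a + b) + c


def sum_right(a: float, b: float, c: float) -> float:
    """Add from right to left: a + (b + c)."""
    return a + (b + c)


left = sum_left(0.1, 0.2, 0.3)
right = sum_right(0.1, 0.2, 0.3)
print(left, right)       # 0.6000000000000001 0.6
print(left == right)     # False: each intermediate sum is rounded

# A small term can be lost entirely when it is added to a large one first.
print((1e16 + 1.0) - 1e16)   # 0.0
print((1e16 - 1e16) + 1.0)   # 1.0
```

Mathematically both orderings are equal; the difference comes only from rounding each intermediate result to the available precision.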

2  Introduction to IEEE-754 standard

In the early days of computers, vendors developed their own representations and methods of calculation. These different approaches led to different results for the same calculations. The IEEE therefore defined, in the IEEE-754 standard, a representation of floating-point numbers and the operations on them.

3  Representation

                    Single   Single-Extended   Double   Double-Extended   Quad-Precision
Exponent (max)      +127     +1023             +1023    +16383            +16383
Exponent (min)      -126     -1022             -1022    -16382            -16382
Exponent bias       +127     +1023             +1023    +16383            +16383
Precision (#bits)   24       ≥ 32              53       ≥ 64              113
Total bits          32       ≥ 43              64       80                128
Sign bits           1        1                 1        1                 1
Exp bits            8        11                11       15                15
Fraction            23       ≥ 32              52       64                112
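
To make the Single column concrete, the following sketch splits a single-precision value into its 1 sign bit, 8 exponent bits and 23 fraction bits, and removes the bias of 127. It is an illustration only; the function name decode_single is hypothetical, and Python's standard struct module is used to obtain the 32-bit pattern.

```python
import struct


def decode_single(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, fraction) of x as a 32-bit single."""
    # Pack as a big-endian single and reinterpret the 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 sign bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 exponent bits
    fraction = bits & 0x7FFFFF             # 23 fraction bits
    return sign, biased_exponent, fraction


sign, biased_exponent, fraction = decode_single(-6.5)
# -6.5 == -1.625 * 2**2, so the unbiased exponent is 129 - 127 == 2
print(sign, biased_exponent - 127, hex(fraction))   # 1 2 0x500000
```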

4  Precision

5  Normalization

6  Special values

The IEEE-754 standard reserves some special values that serve special purposes and signal exceptional conditions.

Table of special values
Name     Exponent        Fraction   Sign   Exp bits    Fract bits
+0       min-1           = 0        +      All zeros   All zeros
-0       min-1           = 0        -      All zeros   All zeros
Number   min ≤ e ≤ max   any        any    Any         Any
+∞       max+1           = 0        +      All ones    All zeros
-∞       max+1           = 0        -      All ones    All zeros
NaN      max+1           ≠ 0        any    All ones    Any nonzero
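
The table can be read as a decision procedure on the bit fields. The sketch below applies it to 32-bit single-precision patterns; the function name classify_single is hypothetical. Patterns with an all-zero exponent and a nonzero fraction are not listed in the table, so the sketch simply reports them as "number", which is an assumption of this example.

```python
def classify_single(bits: int) -> str:
    """Classify a 32-bit single-precision bit pattern using the table above."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits
    fraction = bits & 0x7FFFFF       # 23 fraction bits
    if exponent == 0xFF:             # exponent bits all ones
        return "NaN" if fraction != 0 else sign + "inf"
    if exponent == 0 and fraction == 0:
        return sign + "0"            # exponent and fraction bits all zeros
    # Everything else, including patterns the table does not list.
    return "number"


for pattern in (0x00000000, 0x80000000, 0x3F800000,
                0x7F800000, 0xFF800000, 0x7FC00000):
    print(f"{pattern:#010x}", classify_single(pattern))
```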

6.1  Zero

6.2  NaN

Sources of NaN
Operation   Produced by
+           ∞ + (-∞)
×           0 × ∞
/           0/0, ∞/∞
REM         x REM 0, ∞ REM y
√           √x (when x < 0)
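
Some of these rows can be tried directly. The sketch below uses Python floats, assumed to be IEEE-754 doubles. Python is only one possible host language: it returns NaN for the first three expressions but raises an exception instead of returning NaN for 0/0, the remainder cases, and the square root of a negative number, so those are checked with a small hypothetical helper, python_traps.

```python
import math

inf = math.inf

# Rows of the table that yield NaN directly in Python.
nan_results = {
    "inf + (-inf)": inf + (-inf),
    "0 * inf": 0.0 * inf,
    "inf / inf": inf / inf,
}
for expression, value in nan_results.items():
    print(expression, "->", value, math.isnan(value))


def python_traps(operation) -> bool:
    """Return True if Python raises instead of producing a NaN."""
    try:
        operation()
    except (ZeroDivisionError, ValueError):
        return True
    return False


print(python_traps(lambda: 0.0 / 0.0))            # True
print(python_traps(lambda: math.fmod(1.0, 0.0)))  # True
print(python_traps(lambda: math.fmod(inf, 1.0)))  # True
print(python_traps(lambda: math.sqrt(-1.0)))      # True
```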

6.3  Infinity

7  Exceptions

Exceptions in the IEEE 754 standard
Exception        Caused by                               Result
Overflow         Operation produces too large a number   ±∞
Underflow        Operation produces too small a number   0
Divide by zero   x/0                                     ±∞
Invalid          Undefined operation                     NaN
Inexact          Result is not exact                     Round(x)
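
The default results in the table can be observed with ordinary arithmetic. This is a minimal sketch with Python floats, assumed to be IEEE-754 doubles; the variable names are illustrative. Python does not expose the exception flags themselves, and it raises ZeroDivisionError for x/0 instead of returning ±∞, so that row is shown as a trapped error.

```python
import math

overflow = 1e200 * 1e200     # too large for a double: result is +inf
underflow = 1e-200 * 1e-200  # too small for a double: result is 0.0
invalid = math.inf - math.inf  # undefined operation: result is NaN
inexact = 0.1 + 0.2          # not exactly representable: result is rounded

try:
    1.0 / 0.0
    divide_by_zero_trapped = False
except ZeroDivisionError:
    # Python raises here rather than returning an infinity.
    divide_by_zero_trapped = True

print(overflow)                 # inf
print(underflow)                # 0.0
print(math.isnan(invalid))      # True
print(inexact == 0.3)           # False
print(divide_by_zero_trapped)   # True
```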

8  Rounding modes

9  Comparison


