## Floating Point mathematics for 6502

Any application that performs more than trivial arithmetic should use floating point.

Consider for example a simple linear interpolation/extrapolation:

At time T1 the value was Y1 and at time T2 the value was Y2. What is the interpolated value of Y for another time T?

This interpolated value is given by the formula

Y=Y1+(T-T1)*(Y2-Y1)/(T2-T1)

Even if the input values T1, T2, Y1, Y2 are all integers and the result is needed as an integer, i.e. as the integer that is closest to Y, the intermediate computations should be made in floating point.
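To illustrate why the intermediate computations matter, the following Python sketch compares a naive all-integer evaluation with a floating point evaluation that is rounded only at the end. The function names and the sample values are illustrative assumptions and do not come from the original text.

```python
def interpolate_int(t1, y1, t2, y2, t):
    # Naive all-integer evaluation: the slope is truncated by integer
    # division before it is multiplied by the time difference.
    return y1 + (t - t1) * ((y2 - y1) // (t2 - t1))


def interpolate_float(t1, y1, t2, y2, t):
    # Floating point intermediates; round to the nearest integer at the end.
    y = y1 + (t - t1) * (y2 - y1) / (t2 - t1)
    return round(y)
```

With T1=0, Y1=0, T2=10, Y2=7 and T=6 the integer version returns 0, while the floating point version returns 4 (the exact value is 4.2). Multiplying before dividing would rescue the integer version in this case, but it needs wider intermediate values, which is costly on an 8-bit processor.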

The basic principle of floating point with 8 bits for the exponent and 24 bits for the mantissa (IEEE single precision format) is to represent a value as

M * 2**(E-23-B)

where M is an integer between 2**23 = \$800000 and 2**24-1 = \$FFFFFF and E is an integer between 0 and 2**8-1 = \$FF. B is a fixed "bias", which for the IEEE format is 127.

Some examples:

| Value | Exponent | Mantissa |
|-------|----------|----------|
| 1 | 127 = \$7F | 2**23 = \$800000 |
| 2 | 128 = \$80 | 2**23 = \$800000 |
| 3 | 128 = \$80 | 3*2**22 = \$C00000 |
| 4 | 129 = \$81 | 2**23 = \$800000 |
| 5 | 129 = \$81 | 5*2**21 = \$A00000 |

The "bias" is in fact the exponent for the value "1" for which the mantissa is 2**23.
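The representation above can be checked with a short Python sketch that splits a positive value into exponent and mantissa. The function name `to_exponent_mantissa` is a hypothetical helper. The sketch handles positive values only and does not check that the exponent fits in 8 bits.

```python
import math

BIAS = 127       # IEEE single precision exponent bias
MANT_BITS = 24   # mantissa width including the leading 1 bit


def to_exponent_mantissa(value):
    """Return (E, M) such that value ~= M * 2**(E - 23 - BIAS)."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    frac, exp = math.frexp(value)              # value = frac * 2**exp, 0.5 <= frac < 1
    mantissa = round(frac * (1 << MANT_BITS))  # scale into 2**23 .. 2**24
    if mantissa == 1 << MANT_BITS:             # rounding carried into bit 24
        mantissa >>= 1
        exp += 1
    return exp - 1 + BIAS, mantissa
```

For the values 1, 3 and 5 this reproduces the corresponding rows of the table, for example `(0x81, 0xA00000)` for the value 5.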

With this representation the most significant bit of the mantissa is always set. The IEEE specification says that this "implied bit" shall be left out to make room for a "sign bit" within the 32 bit word indicating the sign of the floating point number, 0 meaning "+" and 1 meaning "-".

The value 0 cannot be represented in this form. The IEEE rule is that an exponent of zero indicates that the value is zero. This means that the smallest positive value that can be represented has mantissa 2**23 = \$800000 and exponent 1, i.e. it is 2**(-126). In a similar way the exponent 255 = \$FF is reserved for "over-flow", i.e. for a number too large to be represented. The largest value that can be represented therefore has the mantissa 2**24-1 = \$FFFFFF and exponent 254 = \$FE, i.e. it is 2**(254-150) * (2**24-1).

An additional example, the value PI as IEEE floating point value:

3.141593 = 2**(128-150) * 2**22 * 3.141593

The mantissa is therefore 2**22 * 3.141593 = \$C90FDC and the exponent is 128 = \$80.
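How the sign bit takes the place of the "implied bit" in the 32 bit word can be sketched in Python as follows. `pack_ieee_single` is a hypothetical helper. `ieee_bits` uses the standard `struct` module to obtain the single precision bit pattern for comparison.

```python
import struct


def pack_ieee_single(sign, exponent, mantissa):
    """Assemble a 32 bit word from sign (0 or 1), 8 bit exponent, 24 bit mantissa."""
    # Dropping the always-set top mantissa bit frees bit 31 for the sign.
    return (sign << 31) | (exponent << 23) | (mantissa & 0x7FFFFF)


def ieee_bits(value):
    """Bit pattern that the struct module produces for a single precision float."""
    return struct.unpack(">I", struct.pack(">f", value))[0]
```

For the value 3.141593 with exponent \$80 and mantissa \$C90FDC both functions give the word `0x40490FDC`.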

The COMMODORE floating point format used for the PET computer deviates from the IEEE standard in that 32 bits are used for the mantissa and that there is a bias of 129 for the exponent. One therefore has

| Value | Exponent | Mantissa |
|-------|----------|----------|
| 1 | 129 = \$81 | 2**31 = \$80000000 |
| 2 | 130 = \$82 | 2**31 = \$80000000 |
| 3 | 130 = \$82 | 3*2**30 = \$C0000000 |
| 4 | 131 = \$83 | 2**31 = \$80000000 |
| 5 | 131 = \$83 | 5*2**29 = \$A0000000 |

The value PI as COMMODORE floating point value:

3.14159265... = 2**(130-31-129) * 2**30 * 3.14159265...

The mantissa is therefore 2**30 * 3.14159265... = \$C90FDAA2 (rounded to 32 bits) and the exponent is 130 = \$82.
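The Commodore variant, with its 32 bit mantissa and bias of 129, can be sketched in Python in both directions. The names `cbm_encode` and `cbm_decode` are hypothetical, and only positive values are handled.

```python
import math

CBM_BIAS = 129       # Commodore exponent bias as described above
CBM_MANT_BITS = 32   # 32 bit mantissa with explicit leading 1 bit


def cbm_encode(value):
    """Return (E, M) with value ~= M * 2**(E - 31 - CBM_BIAS), for value > 0."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    frac, exp = math.frexp(value)                  # 0.5 <= frac < 1
    mantissa = round(frac * (1 << CBM_MANT_BITS))  # scale into 2**31 .. 2**32
    if mantissa == 1 << CBM_MANT_BITS:             # rounding carried into bit 32
        mantissa >>= 1
        exp += 1
    return exp - 1 + CBM_BIAS, mantissa


def cbm_decode(exponent, mantissa):
    """Evaluate M * 2**(E - 31 - CBM_BIAS)."""
    return math.ldexp(mantissa, exponent - 31 - CBM_BIAS)
```

Calling `cbm_encode(math.pi)` gives exponent `0x82` and mantissa `0xC90FDAA2`, in agreement with the values above.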

With the Commodore "REGISTER FORMAT" used in the "Floating Point Accumulator" (zero page \$B0-\$B5) and in the "Floating Point Argument" (zero page \$B8-\$BD) one has

| Address | Contents |
|---------|----------|
| \$B0 (\$B8) | The exponent |
| \$B1-\$B4 (\$B9-\$BC) | The mantissa |
| \$B5 (\$BD) | The sign, \$00 for "+" and \$FF for "-" |

The most significant bit (the "implied bit") of the \$B1 and \$B9 bytes is explicitly set, and the sign is instead indicated by the last byte. The exception is the value zero, which is indicated by the byte \$B1 (\$B9) being zero. In that case the values of the other five bytes can be ignored. This differs from the IEEE convention, where the "implied bit" is suppressed and the value zero must instead be indicated by a zero exponent.

To save space when storing floating point values, the Commodore "MEMORY FORMAT" omits the byte defining the sign. Instead the "implied bit" is replaced by a sign bit, just as in the IEEE format. With this "MEMORY FORMAT" zero is therefore indicated by a zero exponent, as in the IEEE format.
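The conversion between the two formats can be sketched in Python on plain byte lists. The function names are hypothetical, and the zero handling follows the conventions as described above rather than a disassembly of the ROM.

```python
def register_to_memory(reg):
    """Convert 6 register format bytes to 5 memory format bytes.

    reg = [exponent, m1, m2, m3, m4, sign], sign 0x00 for "+" and 0xFF for "-".
    """
    exponent, m1, m2, m3, m4, sign = reg
    if m1 == 0:
        # Register format zero (first mantissa byte is zero) becomes exponent zero.
        return [0, 0, 0, 0, 0]
    sign_bit = 0x80 if sign & 0x80 else 0x00
    # Overwrite the always-set top mantissa bit with the sign bit.
    return [exponent, (m1 & 0x7F) | sign_bit, m2, m3, m4]


def memory_to_register(mem):
    """Inverse conversion: restore the explicit leading bit and the sign byte."""
    exponent, m1, m2, m3, m4 = mem
    if exponent == 0:
        return [0, 0, 0, 0, 0, 0]
    sign = 0xFF if m1 & 0x80 else 0x00
    return [exponent, m1 | 0x80, m2, m3, m4, sign]
```

For PI in register format, `[0x82, 0xC9, 0x0F, 0xDA, 0xA2, 0x00]`, the memory format is `[0x82, 0x49, 0x0F, 0xDA, 0xA2]`.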

Floating point addition, subtraction, multiplication and division can all lead to "over-flow", i.e. to a number too large to be represented. This is indicated by setting the exponent to 255 = \$FF. "Under-flow", on the other hand, is not really an error. If, for example, the result of multiplying two floating point numbers is too small to be represented with the range of exponents available, the product is simply set to a pure zero.

The basic software for floating point computation:

The subroutine "FLOAT" will transform a right-justified positive integer in FAC+1,..,FAC+4 into a floating point number in Commodore format. In an application that uses signed integers, a negative integer is transformed by converting the corresponding positive integer to floating point and then changing the sign of the floating point value by setting the last byte to \$FF.
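The normalization that such a conversion has to perform can be sketched in Python. This is a model under the assumption that the integer is shifted left until the top mantissa bit is set. It is not a transcription of the 6502 routine, and the name `float_from_int` is hypothetical.

```python
def float_from_int(n):
    """Normalize a right-justified positive 32 bit integer to (exponent, mantissa)."""
    if not 0 < n < (1 << 32):
        raise ValueError("expected a positive integer that fits in 32 bits")
    exponent = 129 + 31      # exponent for which the unshifted integer has its own value
    mantissa = n
    while (mantissa & 0x80000000) == 0:
        mantissa <<= 1       # shift left until the top bit is set ...
        exponent -= 1        # ... and compensate in the exponent
    return exponent, mantissa
```

The integer 5, for example, is shifted 29 places, which gives exponent `0x83` and mantissa `0xA0000000`.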

The subroutine "FADD" will add the floating point numbers in FAC,..,FAC+5 and WORK,..,WORK+5 and put the result in FAC,..,FAC+5.

The subroutine "FSUB" will first change the sign of the value in FAC,..,FAC+5 and then continue as "FADD".

The subroutine "FDIV" will divide the floating point number in WORK,..,WORK+5 by the value in FAC,..,FAC+5 and put the result in FAC,..,FAC+5.

The subroutine "FMUL" will multiply the floating point number in WORK,..,WORK+5 by the value in FAC,..,FAC+5 and put the result in FAC,..,FAC+5.
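As an illustration of what a multiplication has to do with exponents and mantissas, including the "over-flow" and "under-flow" handling described earlier, here is a Python model. It is a sketch under simplifying assumptions: signs are ignored, the result is truncated rather than rounded, and the function name and argument layout are hypothetical rather than taken from the 6502 code.

```python
CBM_BIAS = 129


def fmul_model(e1, m1, e2, m2):
    """Multiply two positive values given as (exponent, 32 bit mantissa) pairs."""
    if m1 == 0 or m2 == 0:
        return 0, 0                   # a zero operand gives a zero product
    product = m1 * m2                 # 64 bit product, 2**62 <= product < 2**64
    exponent = e1 + e2 - CBM_BIAS
    if product & (1 << 63):
        mantissa = product >> 32      # top bit already set: keep the upper 32 bits
        exponent += 1
    else:
        mantissa = product >> 31      # normalize by one extra bit
    if exponent >= 0xFF:
        return 0xFF, mantissa         # "over-flow": exponent forced to 0xFF
    if exponent <= 0:
        return 0, 0                   # "under-flow": result becomes a pure zero
    return exponent, mantissa
```

Multiplying 3 (`0x82`, `0xC0000000`) by 5 (`0x83`, `0xA0000000`) gives (`0x84`, `0xF0000000`), which is the value 15.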

As an example of use, take the initially addressed problem of interpolation/extrapolation for which

Y=Y1+(T-T1)*(Y2-Y1)/(T2-T1)

must be computed.

This can be done by the following main program calling the floating point subroutines:
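As a rough model of such a main program (not the 6502 listing itself), the order of the subroutine calls can be written out in Python. The functions below are hypothetical stand-ins named after the subroutines. Each takes the WORK operand first and the FAC operand second and returns the new content of FAC, and Python's `float()` plays the role of "FLOAT".

```python
def fsub(work, fac):
    return work + (-fac)     # change the sign of FAC, then continue as an addition


def fmul(work, fac):
    return work * fac        # WORK multiplied by FAC


def fdiv(work, fac):
    return work / fac        # WORK divided by FAC


def fadd(work, fac):
    return work + fac        # WORK plus FAC


def interpolate(t1, y1, t2, y2, t):
    dt = fsub(float(t), float(t1))       # T - T1
    dy = fsub(float(y2), float(y1))      # Y2 - Y1
    span = fsub(float(t2), float(t1))    # T2 - T1
    num = fmul(dt, dy)                   # (T - T1) * (Y2 - Y1)
    quot = fdiv(num, span)               # (T - T1) * (Y2 - Y1) / (T2 - T1)
    return fadd(float(y1), quot)         # Y1 + ...
```

On the 6502 each intermediate result would additionally have to be saved or moved between FAC and WORK, a step this model leaves out.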

## Original Commodore code from the MS Basic interpreter implemented for PET computer

Corresponding routines are available in the ROM of the PET computer. In principle this code can be derived from a ROM dump by a "Commodore hacker" using the information from a 30-year-old "PET Assembler Programmers Guide", which states that

• FADD has entry point \$D73F
• FMUL has entry point \$D900
• FDIV has entry point \$D9E4

Correct results are indeed obtained (except for the sign byte?) by the calls "JSR \$D73F", "JSR \$D900", "JSR \$D9E4" provided that

• the two floating point numbers are in \$B0-\$B5 and in \$B8-\$BD
• the zero-flag is in the correct state before the call (in general unset!), as the first instruction of each of these routines is "BNE/BEQ"

The Microsoft version of "FADD" is of less interest, as it is not only less accessible/documented than the FADD version provided above but also slower.

The Microsoft versions of FMUL/FDIV could, by contrast, be of some interest, as they are faster than the corresponding modules provided above. Any "Commodore hacker" is invited to extract the relevant code from the ROM dump to build source code for portable and self-contained subroutines FMUL and FDIV with "Microsoft flavour".

Something completely different: