Floating-Point Representation¶
Learning Objectives: IEEE 754 standard floating-point representation.
In learning C, we already encountered floating-point data types float and double. Let’s now go through how the computer actually represents and processes floating-point numbers.
Floating-point numbers are the computer’s way of representing real numbers. The standard floating-point representation in computer memory is based on the IEEE 754 standard, which we will outline here. A precise explanation for those interested can be found in textbooks and/or online.
But before going further, let’s first look at how decimal points are represented in binary.
Just as in base 10, we use a point to separate the integer part from the fractional part. As shown in the image, the powers of two to the left of the point have non-negative exponents, while those to the right have negative exponents.
!["Decimal in binary form"](/media/images/tkj-bindec_1eoS1UL.png)
Examples.
- Binary number 10.101 = 1*2^1 + 0*2^0 + 1*2^-1 + 0*2^-2 + 1*2^-3 = 2 + 0 + 1/2 + 0 + 1/8 = 2.625
- Binary number 101.01 = 4 + 0 + 1 + 0 + 1/4 = 5.25
- Binary number 1010.1 = 8 + 2 + 1/2 = 10.5
- Binary number 0.111111 = 0 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64 = 63/64 = 0.984375
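If you want to check such a conversion in C, a minimal sketch like the one below simply adds up the bit weights by hand (the bits of 10.101 are written out directly for clarity):

```c
#include <stdio.h>

int main(void)
{
    /* 10.101 in binary: bit weights 2^1, 2^0, 2^-1, 2^-2, 2^-3 */
    double value = 1 * 2.0 + 0 * 1.0 + 1 * 0.5 + 0 * 0.25 + 1 * 0.125;

    printf("10.101 (binary) = %f\n", value);   /* prints 2.625000 */
    return 0;
}
```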
Notice that the smallest nonzero number we can represent with the chosen bit length is the value of the LSB.
Example. The smallest nonzero number in the cases above would be:

- 00.001, which is 2^-3 = 1/8
- 0.000001, which is 2^-6 = 1/64
This is why, with binary numbers, we can only approximate real numbers up to the chosen precision. The approximation could be made more accurate by increasing the number of bits in the decimal part, but the precision achievable is naturally limited by the programming language, computer memory, and processor architecture.
Example. Try to represent the real number 0.1 as a binary number with a 2-bit decimal part. The decimal-part bits correspond to 2^-1 = 0.5 and 2^-2 = 0.25, so it is impossible to represent 0.1 exactly. However, if we add a third bit to the decimal part, corresponding to 2^-3 = 0.125, we get closer to 0.1.
Also, how would you represent in binary a real number that cannot be expressed as a sum of powers of two (2^-n) within the selected precision, for instance 1/3? In this case the binary expansion repeats, just as it does in decimal arithmetic, resulting in the recurring pattern
1/3 = 0.01010101010101...
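You can see this limitation directly in C: printing a float with more digits than it can actually hold reveals the stored approximation. A minimal sketch (the exact digits printed depend on the platform's float format, which in practice is IEEE 754 single precision):

```c
#include <stdio.h>

int main(void)
{
    float tenth = 0.1f;          /* 0.1 has no exact binary representation */
    float third = 1.0f / 3.0f;   /* 1/3 is a repeating binary fraction     */

    /* Printing with 20 decimals shows the nearest representable values. */
    printf("0.1f      is stored as %.20f\n", tenth);
    printf("1.0f/3.0f is stored as %.20f\n", third);
    return 0;
}
```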
Binary Rounding¶
Binary numbers can also be rounded, just like decimal numbers. The reason for discussing this here is that floating-point conversions on computers also involve relatively rough rounding.
Example. Let’s round a binary number to the nearest 1/4. The rounding precision is then two bits in the decimal part, the last kept bit (.01) corresponding to 2^-2 = 1/4, so we keep only two bits after the binary point. The bit immediately after these two is the midpoint bit, and together with the bits following it, it decides whether to round up or down. Writing the number as yyy.aamxxx, aa are the kept bits, m is the midpoint bit, and xxx are the bits following the midpoint.

Rounding algorithm, looking at the bits that follow the kept precision:

- If the bits following the two kept decimal bits aa are 0xx (i.e., the midpoint bit is 0), round down.
- If the bits following the two kept decimal bits aa are 1xx (i.e., the midpoint bit is 1), they take part in the rounding:
  - If the three bits 1xx are greater than 100, round up.
  - If the three bits 1xx are exactly 100 (exactly halfway), check the bit preceding the midpoint (the last kept bit):
    - If the bit preceding the midpoint is 1, round up.
    - If the bit preceding the midpoint is 0, round down.
Examples. Rounding to 2^-2 = 1/4 precision, i.e., keeping two bits in the decimal part (.aa):

- 10.00011: the bits after .00 are 011, so the midpoint bit is 0 and the number rounds down to 10.00 (= 2.00)
- 10.00110: the bits after .00 are 110, the midpoint bit is 1 and 110 > 100, so the number rounds up to 10.01 (= 2.25)
- 10.11100: the bits after .11 are exactly 100 (halfway); the preceding bit is 1, so the number rounds up to 11.00 (= 3.00)
- 10.10100: the bits after .10 are exactly 100 (halfway); the preceding bit is 0, so the number rounds down to 10.10 (= 2.50)
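This "round to nearest, ties toward even" behaviour is also what floating-point hardware uses by default. As a sketch, the helper round_to_quarter below (a name chosen here for illustration) rounds to the nearest 1/4 by scaling and calling nearbyint(), which follows the current rounding mode (round to nearest, ties to even, by default on typical systems):

```c
#include <stdio.h>
#include <math.h>

/* Round x to the nearest multiple of 1/4 using the current rounding mode. */
static double round_to_quarter(double x)
{
    return nearbyint(x * 4.0) / 4.0;
}

int main(void)
{
    printf("%.2f\n", round_to_quarter(2.09375));  /* 10.00011b -> 2.00 */
    printf("%.2f\n", round_to_quarter(2.18750));  /* 10.00110b -> 2.25 */
    printf("%.2f\n", round_to_quarter(2.87500));  /* 10.11100b -> 3.00 */
    printf("%.2f\n", round_to_quarter(2.62500));  /* 10.10100b -> 2.50 */
    return 0;
}
```

The four test values are the same binary numbers as in the examples above, written in decimal. On some systems you need to link the math library with -lm.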
IEEE 754 Floating-Point Standard¶
Before diving into the standard, let’s note that there are two main ways to represent real numbers in computers/microchips.
- Fixed-point representation: Here, the decimal point location is fixed, meaning there is always a predetermined number of decimal bits in the representation. This type of representation is convenient because it simplifies the implementation of the calculating logic circuit.
- Floating-point representation: Here, the location of the decimal point "floats" between bits as needed, and the number of decimal bits can vary. However, due to the limited number of bits, we can still only represent an approximation of the decimal number.
The method we presented earlier for writing real numbers in binary was a fixed-point representation; now we will look at how the floating-point representation is constructed.
The standard for floating-point representation includes three components (using base 2, which is suitable for computers):
!["Floating-point representation format"](/media/images/tkj-float_vslUFlj.png)
- Sign s, where s=0 is positive and s=1 is negative.
- The representation of the Mantissa M (the significand)
- M = 1.xxx = 1 + fraction, where the bits xxx form the fraction.
- The fraction is found by shifting the decimal point of the binary number (in either direction, as needed) until only a single 1 remains to the left of the point. See the example below.
- The representation of the Exponent E
- E = exponent - bias
- bias = (2^(n-1)) - 1, where n is the number of exponent bits
- For single-precision floating-point numbers, bias = 127, and for double-precision floating-point numbers, bias = 1023.
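As a quick sanity check of the bias formula, here is a minimal C sketch (bias() is just a helper name used for illustration):

```c
#include <stdio.h>

/* bias = 2^(n-1) - 1 for n exponent bits */
static int bias(int exponent_bits)
{
    return (1 << (exponent_bits - 1)) - 1;
}

int main(void)
{
    printf("float  (8 exponent bits):  bias = %d\n", bias(8));   /* 127  */
    printf("double (11 exponent bits): bias = %d\n", bias(11));  /* 1023 */
    return 0;
}
```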
Example: Converting the decimal number 12345.0 to a float-type floating-point number.
The number 12345.0 in binary is 11000000111001 * 2^0.

1. Find M by shifting the decimal point to the left (the value of the number must remain the same while shifting!):
   - 11000000111001 * 2^0 = 12345.0 -> E = 0
   - 1100000011100.1 * 2^1 = 12345.0 -> E = 1
   - 110000001110.01 * 2^2 = 12345.0 -> E = 2
   - ...
   - 1.1000000111001 * 2^13 = 12345.0 -> E = 13
2. Read the fraction from M: drop the integer part (the leading 1) from the front of the number, leaving 1000000111001. Pad it with ten 0-bits on the right to make it 23 bits long -> fraction = 10000001110010000000000.
3. Calculate the exponent field: with E = 13 and bias = 127, exponent = E + bias = 13 + 127 = 140, which is 10001100 in binary.
4. Finally, the sign bit s = 0 because the number is positive.
The result of the above example is:
!["Floating-point example format"](/media/images/tkj-float-esim_a3NoADN.png)
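We can verify the result in C by copying the raw bit pattern of 12345.0f into an integer and picking the fields apart. This sketch assumes float is the 32-bit IEEE 754 single-precision format, which is the case on practically every current platform:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float f = 12345.0f;
    uint32_t bits;

    /* Copy the raw bit pattern of the float into a 32-bit integer. */
    memcpy(&bits, &f, sizeof bits);

    uint32_t sign     = bits >> 31;            /* 1 sign bit       */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 exponent bits  */
    uint32_t fraction = bits & 0x7FFFFF;       /* 23 fraction bits */

    printf("sign     = %u\n", sign);                                     /* 0            */
    printf("exponent = %u (E = %d)\n", exponent, (int)exponent - 127);   /* 140 (E = 13) */
    printf("fraction = 0x%06X\n", fraction);                             /* 0x40E400     */
    return 0;
}
```

The printed fraction 0x40E400 is the bit pattern 100 0000 1110 0100 0000 0000, i.e., exactly the 23-bit fraction 10000001110010000000000 computed above.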
The floating-point representation described above is general: the same scheme works for other bit widths as well. An example is the cute (Japanese: kawaii) 8-bit representation shown below, with one sign bit, 4 exponent bits, and 3 fraction bits. Here, bias = 2^3 - 1 = 7.
!["Kawaii"](/media/images/tkj-float-cute_sN68My6.png)
Additionally, the standard defines two types of floating-point precision:
- Single precision, which uses a 32-bit representation, with one sign bit, 8 exponent bits, and 23 bits for the fraction.
- Double precision, which uses a 64-bit representation, with one sign bit, 11 exponent bits, and 52 bits for the fraction.
!["Floating-point representation format"](/media/images/tkj-float-bit_DWVsVCc.png)
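In C, the corresponding sizes and significand widths can be read from <float.h>. A small sketch (the reported values assume the usual IEEE 754 formats):

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* FLT_MANT_DIG / DBL_MANT_DIG count the significand bits including
     * the implicit leading 1, so they report 24 and 53 rather than 23 and 52. */
    printf("float:  %zu bytes, %d significand bits\n", sizeof(float),  FLT_MANT_DIG);
    printf("double: %zu bytes, %d significand bits\n", sizeof(double), DBL_MANT_DIG);
    return 0;
}
```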
It’s clear that double-precision floating-point numbers have a much larger range and higher precision, though the trade-off is increased memory usage.
Special Cases¶
Because the normalized form always contains the implicit leading 1, the IEEE 754 representation cannot express every value directly, so the standard defines special bit patterns for certain cases, such as zero.
- Zero: exp = 000..0 and frac = 000..0
- Infinity: exp = 111..1 and frac = 000..0
- NaN (Not a Number): exp = 111..1 and frac is not equal to 000..0
- NaN represents cases that cannot be calculated, such as sqrt(-1), infinity - infinity, etc.
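These special values also show up in ordinary C code. A minimal sketch (on IEEE 754 systems, dividing by a floating-point zero yields infinity; isinf() and isnan() come from <math.h>):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double zero = 0.0;
    double inf  = 1.0 / zero;    /* exp = 111..1, frac = 0  -> infinity */
    double nan_ = sqrt(-1.0);    /* exp = 111..1, frac != 0 -> NaN      */

    printf("1.0/0.0   -> %f (isinf: %d)\n", inf, isinf(inf));
    printf("sqrt(-1)  -> %f (isnan: %d)\n", nan_, isnan(nan_));
    printf("inf - inf -> %f (isnan: %d)\n", inf - inf, isnan(inf - inf));
    return 0;
}
```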
Precision of Floating-Point Numbers¶
As previously mentioned, floating-point numbers are in general only an approximation of real numbers and therefore carry a rounding error. Let’s examine this with the cute 8-bit floating-point representation, since its small number range makes it easy to visualize.
The image below shows the range of values when the exponent has 4 bits and the fraction has 3 bits. We can see that the range is dense near zero but becomes sparser as the numbers increase. The lower image zooms in on the range near zero.
!["4-3"](/media/images/tkj-float-big_dZ2Ugg9.png)
-> Thus, the rounding error resulting from this representation is large.
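The same effect is easy to observe with real 32-bit floats: nextafterf() from <math.h> returns the next representable value after its argument, so the difference between the two is the local gap. A small sketch:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    float small = 1.0f;
    float large = 1000000.0f;

    /* The gap between neighbouring floats grows with the magnitude:
     * near 1.0 it is roughly 1.2e-7, near one million it is 0.0625. */
    printf("gap near %.1f: %g\n", small, nextafterf(small, 2.0f * small) - small);
    printf("gap near %.1f: %g\n", large, nextafterf(large, 2.0f * large) - large);
    return 0;
}
```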
Let’s change the floating-point representation so that the exponent now has 3 bits and the fraction has 4 bits (bias is now 3) and see what happens to the range. We observe that the range is now narrower and denser near zero.
!["3-4"](/media/images/tkj-float-small_8oCtxnO.png)
-> As a result, the rounding error decreases within the number range.
In conclusion, the range of floating-point numbers can be increased by adding bits to the exponent, and the precision (resolution) can be improved by adding bits to the fraction. However, rounding errors must always be taken into account.
Note! This may seem trivial, as the standardized number ranges are quite large and accurate, but it wouldn’t be the first time that scientific calculations encountered insufficient precision or overlooked rounding errors.
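One classic way to see this in C is to accumulate a value that has no exact binary representation. A minimal sketch comparing float and double (the exact sums depend on the platform, but the float result drifts visibly from the expected 100000):

```c
#include <stdio.h>

int main(void)
{
    float  fsum = 0.0f;
    double dsum = 0.0;

    /* Add 0.1 one million times; the exact answer would be 100000. */
    for (int i = 0; i < 1000000; i++) {
        fsum += 0.1f;
        dsum += 0.1;
    }

    printf("float  sum: %f\n", fsum);   /* noticeably off from 100000       */
    printf("double sum: %f\n", dsum);   /* much closer, but still not exact */
    return 0;
}
```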
Conclusion¶
Today, we also have floating-point representations larger than 32 and 64 bits.
The floating-point representation is covered in detail in textbooks. For this course, it is enough that we understand the IEEE 754 standard format and can convert floating-point numbers between representations.
Additional material¶
- You can have a look at this interesting video, which explains how the Quake programmers used properties of the IEEE 754 float representation to simplify a complex math operation. This video was recommended by Eetu Vierimaa.