LEARNING GOALS:
2. Learn about floating point arithmetic.
3. Learn about logical operations on binary numbers.
TABLE OF CONTENTS
3.1.2 Representation of floating point numbers
3.1.3 The IEEE floating point format together with a Field Trip
3.3 LOGICAL OPERATIONS ON BINARY NUMBERS
3.4 SUMMARY
If you want to do some exercises, here is a little quiz.
3.1.1 Why do we use the floating point format?
In general, computers store real numbers in Scientific Notation,
or Floating Point Format. That means that instead of storing the
binary number 1010.1101 as it is written on the screen right now, the computer
may represent it internally as 0.10101101 * 2^4. The reasons for
this choice of format will become clear when we examine the differences
between the fixed-point and floating point representations.
Example:
| Integer Arithmetic | Fixed-Point Arithmetic Case I | Fixed-Point Arithmetic Case II |
| 587965 + 197548 = 785513 | 587.965 + 197.548 = 785.513 | 00000.25858 + 78454.10000 = 78454.35858 |
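The fixed-point cases in the table behave exactly like the integer case once every value is scaled by a constant. The following Python sketch illustrates this under the assumption of three digits to the right of the point (Case I); the names SCALE, to_fixed and from_fixed are hypothetical helpers introduced only for this illustration.

```python
from decimal import Decimal

SCALE = 1000  # assumption: three digits reserved to the right of the point


def to_fixed(text):
    """Convert a decimal string such as '587.965' to a scaled integer."""
    return int(Decimal(text) * SCALE)


def from_fixed(value):
    """Convert a non-negative scaled integer back to a decimal string."""
    return f"{value // SCALE}.{value % SCALE:03d}"


# The fixed-point addition is carried out as a plain integer addition.
total = to_fixed("587.965") + to_fixed("197.548")
print(total, from_fixed(total))
```

The position of the point never takes part in the addition itself; it only matters when values are converted in and out.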
Remember that with fixed-point representation we reserve a fixed number
of bits on the left and on the right of the binary point. In many cases,
we would need to reserve several words to represent a large range of values.
Consider an astrophysicist who works with numbers ranging from the mass
of the Sun (1990000000000000000000000000000000 grams) to the mass of the
electron (0.000000000000000000000000000910956 grams), or an economist who
works with values ranging from $0.01 to the national debt of Canada or
the US (the Canadian debt is approx. $575 billion (1998 figures), the American
debt is a couple of times larger). Those people would have to take an extravagantly
large number of bits to represent a very large interval of numbers, to
be sure that any number from the interval of interest can fit in the allocated
space. For example, the astrophysicist would need approx. 14 bytes for
the integer part of the number and 12 bytes for the fractional part - that
makes a 26-byte (208-bit) word! The use of Scientific Notation, or Floating
Point Format, would greatly reduce the storage space needed for huge numbers
like the ones above, because those numbers have lots of zeroes, but few
significant figures. In Scientific Notation, the mass of the Sun becomes
1.99 * 10^33 grams. Needless to say, the numbers 1.99, 10 and
33 require much less storage space than the 34-digit fixed-point number.
Note that for a particular machine working with a small range of numbers,
the fixed-point system is a viable alternative for representing real numbers,
because of its simple implementation. This is not the case, however, for
general-purpose computers.
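The storage estimate for the integer part of the astrophysicist's number can be checked with a few lines of Python. This is only a rough sketch: it counts the bits of the mass of the Sun written out as an integer and compares them with the size of a 64-bit floating point value. The variable names are chosen for this illustration.

```python
import struct

sun_mass_grams = 199 * 10**31  # 1.99 * 10^33 written out as a 34-digit integer

integer_bits = sun_mass_grams.bit_length()  # bits needed for the integer part
integer_bytes = (integer_bits + 7) // 8     # rounded up to whole bytes

float_bytes = struct.calcsize("<d")         # size of a 64-bit floating point value

print(integer_bits, integer_bytes, float_bytes)
```

The fixed-point integer part alone needs 14 bytes, while the floating point value fits in 8 bytes, at the cost of keeping only a limited number of significant figures.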
Now that we've seen why the floating point format is much more popular
than the fixed-point format, our new challenge is to find a systematic
method of representing floating point numbers as strings of ones and zeroes.
3.1.2 Representation of floating point numbers
Floating point numbers are represented in the form m * r^e,
where m is the mantissa, r is the radix or
base, and e is the exponent.
Example:
1.011011_2 * 2^3 is equivalent to 1011.011_2.
The mantissa is 1.011011, the radix is 2, and the exponent is 3.
A binary floating point number can be represented internally in the computer as a binary sequence with 2 fields:
| Exponent | Mantissa |
The radix r is understood to be 2 and the computer doesn't need
to store it explicitly.
There are many different floating point number formats, and each one
of them can have different levels of precision. Depending on the format
that is used, the total size of the binary sequence of a real number and
the relative size of the different fields in that sequence can be variable.
For example, as we'll see later with the IEEE floating point format, the
exponent can occupy 8 bits out of a total of 32 bits (25% of the total
size), or 11 out of 64 bits (17%), or 15 out of 128 bits (12%). A
floating point number doesn't need to be stored in a single memory location.
The binary sequence of the number can be spread over several words to achieve
a greater precision.
The number of bits allocated to the exponent part and to the mantissa
part of the number depends entirely on the needs of the user. The number
of bits allocated to the exponent will determine the range of numbers
that can be represented. For example, the astrophysicist from above has
to deal with numbers from 9 * 10^-28 to 2 * 10^33, a range
of about 10^61, which means that (s)he needs to allocate enough bits
to represent the 62 different (decimal) exponent values from -28 to +33.
The number of bits allocated to the mantissa part will determine the
maximum number of significant figures that can be stored, which in turn
determines the precision that a number can have. The number
written as 3.14159 is more precise than the number written as 3.14.
Please don't confuse precision with accuracy. Accuracy
is a measure of correctness, while precision is a measure of exactitude,
i.e. of how many significant figures are given. As an approximation of pi,
the number written as 3.04159 is more precise, but less accurate, than
the number written as 3.14 (note the typo in 3.04159).
Normalizing a mantissa: There are two accepted ways of normalizing a mantissa. One of them is to always constrain the mantissa to the range from 0.100...00 to 0.111...11. If the result of a binary operation is of the form 1.xxx... * 2^e, it would be normalized to 0.1xxx... * 2^{e+1}, and if the result is of the form 0.01... * 2^e it would be normalized to 0.1... * 2^{e-1}. Note that a special exception has to be made for zero, because it can't be normalized.
Another accepted way is to constrain mantissas to the range 1.000...00 to 1.111...11, so that the floating point number is always expressed in the form 1.xxx... * 2^e.
The main advantage of normalizing a mantissa is the gain of precision.
The 8 bit unnormalized mantissa 0.00000011 has only 2 significant figures,
while the 8-bit normalized mantissa 0.11010010 has 8 significant figures.
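Both normalization conventions can be tried out in Python. The sketch below relies on the standard library function math.frexp, which returns a mantissa in the 0.1xxx... form; the two wrapper functions are hypothetical names introduced for this illustration, and the second one converts the result to the 1.xxx... form.

```python
import math


def normalize_fraction(x):
    """Return (m, e) with x == m * 2**e and 0.5 <= |m| < 1 (binary 0.1xxx...)."""
    return math.frexp(x)  # frexp(0.0) returns (0.0, 0): zero stays unnormalized


def normalize_leading_one(x):
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2 (binary 1.xxx...)."""
    m, e = math.frexp(x)
    if m == 0.0:
        return 0.0, 0  # special case: zero cannot be normalized
    return m * 2.0, e - 1


x = 0b1011011 / 8  # 1011.011 in binary, i.e. 11.375 in decimal
print(normalize_fraction(x))     # mantissa 0.1011011 in binary, exponent 4
print(normalize_leading_one(x))  # mantissa 1.011011 in binary, exponent 3
```

Moving the binary point by one position changes the exponent by one, so both pairs describe the same number.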
Biasing an exponent: In real life, we deal with both positive and negative mantissas, as well as with negative and positive exponents. A negative mantissa is in general stored in two's complement form, but exponents are always stored in biased form. Remember that an n-bit exponent allows you to represent 2^n possible unsigned exponent values from 0 to 2^n - 1. If we subtract a constant value (the bias) of 2^{n-1} from each of those unsigned values, we get a series of numbers from -2^{n-1} to 2^{n-1} - 1 that is represented by the unsigned numbers from 0 to 2^n - 1. Adding a constant to the most negative term to make it equal to zero is simply another way of representing negative numbers.
Example:
If (in decimal) we have the series -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, we can add 5 to each of those numbers to obtain the new series 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Thus, we can use the unsigned numbers from 0 to 9 to represent the numbers from -5 to +4, knowing that we have to subtract 5 from the unsigned series to get to the signed series.
A simple formula relates the true value of the exponent to the exponent stored in the computer:
E_s = E_t + B
where E_s is the stored, or biased, exponent, E_t is the true exponent and B is the bias (a constant), chosen so that B added to the most negative true exponent always gives 0. As far as we are concerned in this course, B = 2^{n-1}, where n is the number of bits allocated to the exponent.
Example:
With n = 3, 2^n = 8, and the bias is 2^{n-1} = 4. In this case,
the true exponent ranges from -4 to 3, and it will be stored
as a biased exponent ranging from 0 to 7.
0.00010111_2 is normalized to 0.10111 * 2^-3.
The true exponent is -3; it will be stored in its biased form -3 + 4 = 1.
The following table illustrates the relationship between the true and the biased exponent for the above example.
| True exponent | Biased exponent | Stored in binary as |
| -4 | 0 | 000 |
| -3 | 1 | 001 |
| -2 | 2 | 010 |
| -1 | 3 | 011 |
| 0 | 4 | 100 |
| 1 | 5 | 101 |
| 2 | 6 | 110 |
| 3 | 7 | 111 |
The advantage of representing exponents in biased form is that the most
negative exponent value is always stored as 0, so that we can store signed
values as unsigned numbers.
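The relation between true and stored exponents can be sketched in a few lines of Python. The function names are hypothetical, and the bias is assumed to be 2^{n-1}, as in the 3-bit example above.

```python
def bias(n_bits):
    """Bias used in this section: 2**(n - 1) for an n-bit exponent field."""
    return 2 ** (n_bits - 1)


def to_stored(true_exponent, n_bits):
    """Apply the bias and check that the result fits in n_bits."""
    stored = true_exponent + bias(n_bits)
    if not 0 <= stored < 2 ** n_bits:
        raise ValueError("exponent out of range for this field width")
    return stored


def to_true(stored_exponent, n_bits):
    """Remove the bias to recover the true exponent."""
    return stored_exponent - bias(n_bits)


# Reproduce the 3-bit table: true exponent, biased exponent, stored bits.
for true_exponent in range(-4, 4):
    stored = to_stored(true_exponent, 3)
    print(true_exponent, stored, format(stored, "03b"))
```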
3.1.3 The IEEE floating point format
There are several different formats for storing real data in computers. We've already seen a rather simplistic one: a binary sequence divided into two fields - exponent and mantissa. We will take a look now at the most popular format that is used in the real world: the IEEE floating point format.
The IEEE (Institute of Electrical and Electronics Engineers) format is itself divided into three basic formats, called single, double, and quad, each one of them having a specific level of precision.
An IEEE floating point number is defined as follows:
N = (-1)^S * 1.F * 2^{E - bias}
where S is the sign bit: 0 for a positive mantissa, 1 for a negative mantissa; F is the fractional part of the mantissa; and E is the biased exponent, so that E - bias gives you the true exponent. Note that the IEEE biases (127, 1023 and 16383) are equal to 2^{n-1} - 1, not to the 2^{n-1} used in the examples of the previous section.
The single precision format is laid out as follows:
| S | 8 bits for the exponent | 23 bits for the fractional part of the mantissa |
The number occupies a total of 32 bits. The first bit is the sign bit (0 <=> positive, 1 <=> negative), the next 8 bits are reserved for the biased exponent, and the remaining 23 bits are left for the fractional part of the mantissa.
Why do we store only the fractional part of the mantissa, and not the
entire mantissa? The IEEE floating point numbers are normalized so that
their mantissas always have an integer part equal to 1, i.e. the mantissas
always lie in the range from 1.000...00 to 1.111...11 (unless otherwise
specified, we always talk about binary numbers). Since by the definition
of the IEEE format the integer part of the mantissa will always be 1, there
is no need to store it explicitly, which saves one bit. This way, the fractional
part of the mantissa can have one extra significant bit. Double precision
and quad precision deal with the mantissa in the same way.
| | Single | Double | Quad |
| Number of bits taken by: | | | |
| Sign | 1 | 1 | 1 |
| Exponent | 8 | 11 | 15 |
| Fractional mantissa | 23 | 52 | 112 |
| Total | 32 | 64 | 128 |
| Exponent: | | | |
| Bias | 127 | 1023 | 16383 |
| Range of (biased) exponent | 0..255 | 0..2047 | 0..32767 |
In all three formats, the minimum exponent (i.e. 0) together with a mantissa = 0 (both the integer and the fractional part equal to 0) is used to represent signed zero, while the maximum exponent is used to represent plus or minus infinity, or NaN. NaN means Not A Number. Also note that since the IEEE format uses a sign bit, there is no need for two's complement arithmetic here.
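The special bit patterns can be inspected with Python's standard struct module, which packs a value as an IEEE single. The helper names single_bits and fields are hypothetical and exist only for this sketch.

```python
import math
import struct


def single_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


def fields(bits):
    """Split a 32-bit pattern into (sign, biased exponent, fraction)."""
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


# Signed zero uses the minimum exponent; infinities and NaN use the maximum.
for value in (0.0, -0.0, math.inf, -math.inf, math.nan):
    print(value, fields(single_bits(value)))
```

Infinity has the maximum exponent with a zero fraction, while NaN has the maximum exponent with a non-zero fraction.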
It is time now for an example.
Example:
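Represent -69.125_10 in a) single-precision IEEE format.
Solution: a) 69.125_10 = 1000101.001_2 = 1.000101001_2 * 2^6. The number is negative, so S = 1. The true exponent is 6, so the biased exponent is E = 6 + 127 = 133_10 = 10000101_2. The fractional part of the mantissa is F = 000101001, padded with zeroes on the right to 23 bits. The single-precision representation is therefore:
| 1 | 10000101 | 00010100100000000000000 |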
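The single-precision conversion asked for in this example can be checked mechanically. The sketch below uses Python's struct module to obtain the stored bits; decompose_single and recompose_single are hypothetical helper names, and recompose_single only handles normalized numbers.

```python
import struct

BIAS = 127  # single-precision bias


def decompose_single(x):
    """Return (sign, biased exponent, fraction) of x stored as an IEEE single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def recompose_single(sign, exponent, fraction):
    """Evaluate (-1)^S * 1.F * 2^(E - bias); normalized numbers only."""
    mantissa = 1 + fraction / 2**23  # re-insert the implicit leading 1
    return (-1) ** sign * mantissa * 2.0 ** (exponent - BIAS)


sign, exponent, fraction = decompose_single(-69.125)
print(sign, format(exponent, "08b"), format(fraction, "023b"))
print(recompose_single(sign, exponent, fraction))
```

The first line of output shows the three fields in binary; the second confirms that they evaluate back to the original number.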
Click here to visit a wonderful page that converts IEEE numbers to decimal
form and vice versa, and shows in detail how an IEEE number is decomposed.
Highly recommended.
3.2 FLOATING POINT ARITHMETIC
Floating point arithmetic is a little bit more complicated than integer
and fixed point arithmetic.
Consider the addition of two real decimal numbers as fixed point numbers:
  1234.00
+   56.78
_________
  1290.78
Now if we try to add the same numbers written in floating point notation, we see that simply adding the mantissas will not make sense unless the exponents are equal:
  0.1234 * 10^4
+ 0.5678 * 10^2
_______________
  ???????
Thus, the following steps must be carried out before adding / subtracting two floating point numbers:
1. Make the exponents of the two numbers equal by making the smaller exponent equal to the larger and dividing the mantissa of the smaller number by the same factor by which its exponent was increased, in order to preserve the actual value of the number.
2. Add / subtract the mantissas.
3. If necessary, re-normalize the result (this is
called post-normalization).
We apply those steps to the example above:
1. 0.1234 * 10^4 + 0.5678 * 10^2 = 0.1234 * 10^4 + 0.005678 * 10^4
2. 0.1234 * 10^4 + 0.005678 * 10^4 = 0.129078 * 10^4
3. In this example, the result is already normalized.
Note that the result of the operation has a greater precision than the two operands: the result has a mantissa with six significant figures, while the operands have mantissas with four significant figures. The mantissa of the result has overflowed its field, and since computers work with a mantissa of fixed length (four digits in this case), something has to be done to bring the result down to four significant figures. The action to be taken can be either truncation (simply dropping the unwanted digits) or rounding. Here truncation gives 0.1290 * 10^4 and rounding gives 0.1291 * 10^4. Rounding is more accurate than truncation, and it leads to an unbiased error, i.e. the result is sometimes overestimated and sometimes underestimated, while truncation always underestimates the magnitude of the result. However, truncation is much simpler to implement than rounding.
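The three addition steps, followed by the reduction of the mantissa to a fixed length, can be sketched with Python's decimal module. This is an illustration under stated assumptions, not a description of any particular hardware: fp_add is a hypothetical name, mantissas are decimal strings in the 0.xxxx form, and the mantissa length defaults to four digits.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def fp_add(m1, e1, m2, e2, digits=4, rounding=ROUND_HALF_EVEN):
    """Add m1 * 10**e1 and m2 * 10**e2; return a (mantissa, exponent) pair."""
    m1, m2 = Decimal(m1), Decimal(m2)
    quantum = Decimal(1).scaleb(-digits)  # e.g. 0.0001 for four digits
    # Step 1: make the smaller exponent equal to the larger one.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2.scaleb(e2 - e1)  # divide the smaller number's mantissa accordingly
    # Step 2: add the mantissas.
    m, e = m1 + m2, e1
    # Step 3: post-normalize so that 0.1 <= |m| < 1 (zero is left alone).
    while abs(m) >= 1:
        m, e = m.scaleb(-1), e + 1
    while m != 0 and abs(m) < Decimal("0.1"):
        m, e = m.scaleb(1), e - 1
    # Bring the mantissa down to the fixed number of digits.
    m = m.quantize(quantum, rounding=rounding)
    if abs(m) >= 1:  # rounding can carry out, e.g. 0.99996 becomes 1.0000
        m, e = m.scaleb(-1).quantize(quantum, rounding=ROUND_DOWN), e + 1
    return m, e


print(fp_add("0.1234", 4, "0.5678", 2))                       # rounding
print(fp_add("0.1234", 4, "0.5678", 2, rounding=ROUND_DOWN))  # truncation
```

ROUND_DOWN drops the unwanted digits (truncation), while ROUND_HALF_EVEN is one common rounding rule; other rounding rules would serve equally well for the illustration.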
When you multiply two floating point numbers, follow these steps:
1. Add the exponents.
2. Multiply the mantissas (as unsigned numbers).
3. Like signs give a positive result, different signs
give a negative result.
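The multiplication steps can be sketched in the same decimal notation. In this hypothetical fp_mul helper, a number is a (sign, mantissa, exponent) triple with sign 0 for positive and 1 for negative. The final post-normalization loop is an assumption added for this sketch; it is not one of the three steps listed above.

```python
from decimal import Decimal


def fp_mul(a, b):
    """Multiply two (sign, mantissa, exponent) triples in base 10.

    sign is 0 or 1; mantissa is an unsigned decimal string such as '0.1234'.
    """
    sign_a, m_a, e_a = a
    sign_b, m_b, e_b = b
    e = e_a + e_b                    # Step 1: add the exponents.
    m = Decimal(m_a) * Decimal(m_b)  # Step 2: multiply the unsigned mantissas.
    sign = sign_a ^ sign_b           # Step 3: like signs give 0, different signs give 1.
    # Extra step (assumption, not listed in the text): post-normalize.
    while m != 0 and m < Decimal("0.1"):
        m, e = m.scaleb(1), e - 1
    return sign, m, e


# (+0.1234 * 10^4) * (-0.5678 * 10^2)
print(fp_mul((0, "0.1234", 4), (1, "0.5678", 2)))
```

The product of two mantissas in the range 0.1 to 1 can fall below 0.1, which is why the sketch shifts the mantissa back into range. As with addition, the result would still have to be truncated or rounded to the mantissa length.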
When performing floating point arithmetic, you should always bear in mind the following facts:
1. Because in general the exponent and the mantissa share the same word, it is often necessary to unpack them, i.e. separate them, before carrying out operations on them. For example, with the IEEE format, the exponent and the mantissa are packed together in a 32-bit word to minimize storage space when they are stored in memory. When the number is unpacked in order to participate in an operation, the format is said to be extended. The leading 1 is inserted at the front of the 23-bit fractional mantissa, which results in a 24-bit mantissa, and then the mantissa is extended to 32 bits. In a computer like the 68000, where words have 16 bits, the mantissa is extended to occupy two full words, which greatly increases its precision. All calculations are performed with the extended 32-bit mantissa. After the number "has done its job", it is packed back into its basic format and stored in memory.
2. If the two exponents differ by more than m + 1, where m is the number of significant bits of the mantissa, there is no point in adding or subtracting those two numbers, because the smaller number is too small to affect the larger number in a significant way. For example, there is no point in adding 0.123987 * 10^3 to 0.112233 * 10^41, because this addition will have no effect on a 6-digit mantissa. The end result will be effectively equal to the larger operand.
3. If the exponent of the result of an operation is
greater than the maximum possible value, or smaller than the minimum possible
value, we have a case of exponent overflow or exponent underflow,
respectively. Those two cases represent out-of-range conditions, because
in each case the number is outside the range of numbers the computer can
handle. In case of exponent underflow, the number is in general made equal
to zero, while exponent overflow in general results in an error requiring
special measures.
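Facts 2 and 3 can be observed directly with Python floats, which are IEEE double-precision numbers on common platforms (an assumption of this sketch). Note that in IEEE arithmetic, as exposed by Python here, exponent overflow in a multiplication produces infinity rather than raising an error.

```python
import sys

# Fact 2: a double has a 53-bit mantissa (including the implicit leading 1),
# so adding 1 to 2**53 has no effect on the stored result.
big = 2.0 ** 53
absorbed = (big + 1.0 == big)

# Fact 3, exponent overflow: the result is too large to be represented.
overflowed = sys.float_info.max * 2.0

# Fact 3, exponent underflow: the result is too small and becomes zero.
underflowed = sys.float_info.min * 2.0 ** -60

print(absorbed, overflowed, underflowed)
```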
3.3 LOGICAL OPERATIONS ON BINARY NUMBERS
The logical operations we are concerned with in this course are AND,
OR, EOR, and NOT. They are called "logical" because
they consider 1 as true and 0 as false (this is the accepted
convention, but there is no reason why 0 can't be called true and 1 false).
Those operations are essentially bitwise, i.e. they are executed bit
by bit. The NOT operation is the only one that takes one operand
instead of two. The other operations are carried out between pairs of corresponding
bits in the two operands. We will later see that the Motorola 68000 processor
has specific instructions that execute those operations.
| a | b | a AND b |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
When a logical operation is applied to two words of n bits each, the operation is carried out between corresponding pairs of bits. E.g., if we have abcd AND wxyz , the result is mnop where p = d AND z, o = c AND y, n = b AND x, m = a AND w.
Example:
A = 01011001
B = 00001111
A AND B = C = 00001001
You can see that the bits in the result are set to 1 only when the two corresponding bits in the operands are set to 1. That means that the AND operator acts as a selective mask. If you want to mask, or clear (set to 0), the 4 most significant bits of A and leave the other bits of A unchanged, just AND A with a number B = 00001111 that has its 4 most significant bits set to 0 and its other bits set to 1.
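The masking example can be reproduced in Python, where the AND operation is written `&`. The values are those of the example; the 8-bit formatting is only for display.

```python
A = 0b01011001
B = 0b00001111  # mask: clears the 4 most significant bits, keeps the rest

C = A & B       # bitwise AND, carried out bit by bit
print(format(C, "08b"))
```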
| a | b | a OR b |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
Example:
A = 01011001
B = 00001111
A OR B = C = 01011111
You can see that the bits in the result are set to zero only when the two corresponding bits in the operands are set to zero. That means that the OR operator can be used to selectively set bits (to 1). If you want to set to 1 all of the 4 least significant bits of A and leave its other bits unchanged, just OR A with a number B = 00001111 that has its 4 least significant bits set to 1 and its other bits set to 0. In this sense the OR operator can be considered the counterpart of the AND operator: AND selectively clears bits, while OR selectively sets them.
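The same example in Python, where the OR operation is written `|`:

```python
A = 0b01011001
B = 0b00001111  # sets the 4 least significant bits, leaves the rest alone

C = A | B       # bitwise OR, carried out bit by bit
print(format(C, "08b"))
```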
| a | b | a EOR b |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Example:
A = 01011001
B = 00001111
A EOR B = C = 01010110
This example shows that the EOR operation can be used to selectively toggle (change the state of) bits. If you want to invert the 4 least significant bits of A and leave its other bits unchanged, just EOR A with B = 00001111. Note that if we neglect any carry bits, the EOR of two bits is identical to their sum.
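In Python the EOR (exclusive OR) operation is written `^`. The sketch below reproduces the example, shows that toggling the same bits twice restores the original value, and checks the remark about addition without carries on single bits.

```python
A = 0b01011001
B = 0b00001111  # toggles the 4 least significant bits

C = A ^ B         # bitwise EOR, carried out bit by bit
restored = C ^ B  # toggling the same bits again undoes the change
print(format(C, "08b"), format(restored, "08b"))

# For single bits, EOR equals the sum with the carry bit neglected.
same_as_sum = all((a ^ b) == (a + b) % 2 for a in (0, 1) for b in (0, 1))
print(same_as_sum)
```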
| a | NOT a |
| 0 | 1 |
| 1 | 0 |
Example:
A = 1010011
NOT A = 0101100 = A's complement
Note that the two's complement of N can be defined as NOT(N) + 1.
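Python integers have no fixed width, so a sketch of NOT has to mask the result to the word size. Here a width of 7 bits is assumed, to match the example; WIDTH and MASK are names introduced for this illustration.

```python
WIDTH = 7                 # assumption: 7-bit words, as in the example
MASK = (1 << WIDTH) - 1   # 1111111 in binary

A = 0b1010011
not_a = ~A & MASK                    # bitwise NOT, limited to WIDTH bits
twos_complement = (not_a + 1) & MASK # NOT(A) + 1

print(format(not_a, "07b"), format(twos_complement, "07b"))
```

The last value is the same bit pattern that -A has when it is reduced to 7 bits.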
3.4 SUMMARY
N = (-1)^S * 1.F * 2^{E - bias}
where S is the sign bit: 0 for a positive mantissa, 1 for a negative mantissa; F is the fractional part of the mantissa; and E is the biased exponent. E - bias gives you the true exponent.
| | Single | Double | Quad |
| Number of bits taken by: | | | |
| Sign | 1 | 1 | 1 |
| Exponent | 8 | 11 | 15 |
| Fractional mantissa | 23 | 52 | 112 |
| Total | 32 | 64 | 128 |
| Exponent: | | | |
| Bias | 127 | 1023 | 16383 |
| Range of (biased) exponent | 0..255 | 0..2047 | 0..32767 |
1. Make the exponents of the two numbers equal by
making the smaller exponent equal to the larger and dividing the
mantissa of the smaller number by the same factor by which its exponent
was increased, in order to preserve the actual value of the number.
2. Add / subtract the mantissas.
3. If necessary, re-normalize the result (this is called post-normalization).
| a | b | a AND b |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
| a | b | a OR b |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
| a | b | a EOR b |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| a | NOT a |
| 0 | 1 |
| 1 | 0 |
Copyright
© McGill University, 1998. All rights reserved.