Lecture 3


LEARNING GOALS:

1. Learn how real data is stored in binary.

2. Learn about floating point arithmetic.

3. Learn about logical operations on binary numbers.


TABLE OF CONTENTS

3.1 REAL DATA AND COMPUTERS

3.1.1 Why do we use the floating point format?

3.1.2 Representation of floating point numbers

3.1.3 The IEEE floating point format together with a Field Trip

3.2 FLOATING POINT ARITHMETIC

3.3 LOGICAL OPERATIONS ON BINARY NUMBERS

3.4 SUMMARY

If you want to do some exercises, here is a little quiz.


3.1 REAL DATA AND COMPUTERS
 

3.1.1 Why do we use the floating point format?

In general, computers store real numbers in Scientific Notation, or Floating Point Format. That means that instead of storing the binary number 1010.1101 as it is written on the screen right now, the computer may represent it internally as 0.10101101 * 2^4. The reasons for this choice of format will become clear when we examine the differences between the fixed-point and floating point representations.
 

In the fixed-point representation (the one that we humans use in everyday life), a fixed, predetermined number of bits is allocated for the integer part and for the fractional part of the real number, and the radix point is assumed to lie between the two. Since we know where the radix point is assumed to lie, we don't need to store it explicitly in the computer, and we can treat fixed-point numbers internally just like integers, knowing that there is a radix point sitting at a given place in the middle of the number. Consider the following example of the use of fixed-point representation in decimal arithmetic:

Example:
 
 
Integer Arithmetic     Fixed-Point Arithmetic Case I     Fixed-Point Arithmetic Case II

    587965                 587.965                           00000.25858
  + 197548               + 197.548                         + 78454.10000
  ________               _________                         _____________
    785513                 785.513                           78454.35858

 
As you can see, the calculations in fixed-point and integer arithmetic are identical; all that we need to do is keep track of the location of the decimal point. The computer can therefore represent real numbers internally as integers, carry out integer arithmetic on them, and simply scale the end results before outputting them. The advantage of fixed-point representation is that it requires no complex software or hardware to be implemented. Then why isn't this a convenient way of representing real numbers?
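The idea that a fixed-point number is just an integer with an implied radix point can be sketched in a few lines of Python. This is only an illustration of Case I above, not part of the original lecture: the helper names `to_fixed` and `from_fixed` and the choice of three fractional digits are assumptions made for the example, and only non-negative values are handled.

```python
# Fixed-point arithmetic carried out as plain integer arithmetic.
# FRACTION_DIGITS is an assumed format: three decimal digits after the point.
FRACTION_DIGITS = 3
SCALE = 10 ** FRACTION_DIGITS

def to_fixed(text):
    """Convert a non-negative decimal string such as '587.965' to a scaled integer."""
    whole, _, frac = text.partition(".")
    frac = (frac + "0" * FRACTION_DIGITS)[:FRACTION_DIGITS]  # pad or cut to 3 digits
    return int(whole) * SCALE + int(frac)

def from_fixed(value):
    """Re-insert the implied decimal point for output."""
    return f"{value // SCALE}.{value % SCALE:0{FRACTION_DIGITS}d}"

a = to_fixed("587.965")   # stored as the integer 587965
b = to_fixed("197.548")   # stored as the integer 197548
total = a + b             # ordinary integer addition
print(total, from_fixed(total))   # 785513 785.513
```

The only place where the radix point matters is at input and output; the addition itself never sees it.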
 

Remember that with fixed-point representation we reserve a fixed number of bits on the left and on the right of the binary point. In many cases, we would need to reserve several words to represent a large range of values. Consider an astrophysicist who works with numbers ranging from the mass of the Sun (1990000000000000000000000000000000 grams) to the mass of the electron (0.000000000000000000000000000910956 grams), or an economist who works with values ranging from $0.01 to the national debt of Canada or the US (the Canadian debt is approx. $575 billion (1998 figures), the American debt is a couple of times larger). Those people would need an extravagantly large number of bits to cover such a wide interval, so that any number from the interval of interest fits in the allocated space. For example, the astrophysicist would need approx. 14 bytes for the integer part of the number and 12 bytes for the fractional part, which makes a 26-byte (208-bit) word! The use of Scientific Notation, or Floating Point Format, greatly reduces the storage space needed for huge numbers like the ones above, because those numbers have lots of zeroes but few significant figures. In Scientific Notation, the mass of the Sun becomes 1.99 * 10^33 grams. Needless to say, the numbers 1.99, 10 and 33 require much less storage space than the 34-digit fixed-point number. Note that for a particular machine working with a small range of numbers, the fixed-point system is a viable alternative for representing real numbers, because of its simple implementation. This is not the case, however, for general-purpose computers.
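The 14-byte and 12-byte estimates can be checked with a short calculation. The sketch below is one way of arriving at them, under the assumption that the fractional part needs at least enough bits to reach the first non-zero bit of the electron mass; the variable names are chosen for this example only.

```python
import math

sun_mass_grams = 199 * 10**31          # 1.99 * 10^33 as an exact integer
electron_mass_grams = 9.10956e-28

# Bits needed to hold the integer part of the largest value.
integer_bits = sun_mass_grams.bit_length()

# Position (counted after the binary point) of the first 1 bit of the
# smallest value; at least this many fractional bits are needed.
fraction_bits = math.ceil(-math.log2(electron_mass_grams))

integer_bytes = math.ceil(integer_bits / 8)
fraction_bytes = math.ceil(fraction_bits / 8)
print(integer_bits, integer_bytes)     # 111 14
print(fraction_bits, fraction_bytes)   # 90 12
```

Note that 12 bytes (96 bits) of fraction leave only the last 7 bits to describe the electron mass itself, so a usable fixed-point format would have to be even wider.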
 

Now that we've seen why the floating point format is much more popular than the fixed-point format, our new challenge is to find a systematic method of representing floating point numbers as strings of ones and zeroes.
 
 


 

3.1.2 Representation of floating point numbers
 

Floating point numbers are represented in the form m * r^e, where m is the mantissa, r is the radix or base, and e is the exponent.
 

Example:

1.011011_2 * 2^3 is equivalent to 1011.011_2
The mantissa is 1.011011, the radix is 2, and the exponent is 3.

A binary floating point number can be represented internally in the computer as a binary sequence with 2 fields, as illustrated below:

Exponent (e) | Mantissa (m)
 

The radix r is understood to be 2 and the computer doesn't need to store it explicitly.
 
There are many different floating point number formats, and each one of them can have different levels of precision. Depending on the format that is used, the total size of the binary sequence of a real number and the relative size of the different fields in that sequence can be variable. For example, as we'll see later with the IEEE floating point format, the exponent can occupy 8 bits out of a total of 32 bits (25% of the total size), or 11 out of 64 bits (17%), or 15 out of 128 bits (12%).  A floating point number doesn't need to be stored in a single memory location. The binary sequence of the number can be spread over several words to achieve a greater precision.
 

The number of bits allocated to the exponent part and to the mantissa part of the number depends entirely on the needs of the user. The number of bits allocated to the exponent will determine the range of numbers that can be represented. For example, the astrophysicist from above has to deal with numbers from 9 * 10^-28 to 2 * 10^33, a range of about 10^61, which means that (s)he needs to allocate enough bits to represent the 62 different (decimal) exponent values from -28 to +33.
 

The number of bits allocated to the mantissa part will determine the maximum number of significant figures that can be stored, which in turn determines the precision that a number can have. The number π written as 3.14159 is more precise than the number π written as 3.14. Please don't confuse precision with accuracy. Accuracy is a measure of correctness, while precision is a measure of exactitude. The number π written as 3.04159 is more precise, but less accurate, than the number π written as 3.14 (note the typo in 3.04159).
 

 
By convention, floating point numbers are always normalized (unless otherwise stated, in this section we are always dealing with binary numbers). The normalization of the mantissa, however, has nothing to do with the normalization of the exponent. The normalization of the exponent even has a separate name of its own: biasing.
 

Normalizing a mantissa: There are two accepted ways of normalizing a mantissa. One of them is to always constrain the mantissa to the range from 0.100...00 to 0.111...11. If the result of a binary operation is of the form 1.xxx... * 2^e, it would be normalized to 0.1xxx... * 2^(e+1), and if the result is of the form 0.01... * 2^e, it would be normalized to 0.1... * 2^(e-1). Note that a special exception has to be made for zero, because it can't be normalized.

Another accepted way is to constrain mantissas to the range 1.000...00 to 1.111...11, so that the floating point number is always expressed in the form 1.xxx... * 2^e.

The main advantage of normalizing a mantissa is the gain of precision. The 8 bit unnormalized mantissa 0.00000011 has only 2 significant figures, while the 8-bit normalized mantissa 0.11010010 has 8 significant figures.
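Both normalization conventions can be tried out with Python's standard `math.frexp`, which returns a mantissa in the range 0.5 to 1 (binary 0.1xxx...) together with a power-of-two exponent. The two wrapper functions below are hypothetical names introduced for this sketch; they are applied to the number 1011.011_2 from the example in this section.

```python
import math

def normalize_fraction_form(x):
    """Return (m, e) with x = m * 2**e and 0.5 <= |m| < 1 (binary 0.1xxx...)."""
    return math.frexp(x)

def normalize_one_form(x):
    """Return (m, e) with x = m * 2**e and 1 <= |m| < 2 (binary 1.xxx...)."""
    m, e = math.frexp(x)
    return m * 2, e - 1

x = 0b1011011 / 2**3                  # 1011.011 in binary = 11.375
print(normalize_fraction_form(x))     # (0.7109375, 4)  i.e. 0.1011011 * 2^4
print(normalize_one_form(x))          # (1.421875, 3)   i.e. 1.011011  * 2^3
```

Zero is the special case mentioned above: `math.frexp(0.0)` simply returns `(0.0, 0)`, since no shift can bring a 1 into the leading position.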
 

Biasing an exponent: In real life, we deal with both positive and negative mantissas, as well as with negative and positive exponents. A negative mantissa is in general stored in two's complement form, but exponents are always stored in biased form. Remember that an n-bit exponent allows you to represent 2^n possible unsigned exponent values from 0 to 2^n - 1. If we subtract a constant value (the bias) of 2^(n-1) from each of those unsigned values, we get a series of numbers from -2^(n-1) to 2^(n-1) - 1 that is represented by the unsigned numbers from 0 to 2^n - 1. Adding a constant to the most negative term to make it equal to zero is simply another way of representing negative numbers.

Example:

If (in decimal) we have the series -5,-4,-3,-2,-1,0,1,2,3,4, we can add 5 to each of those numbers to obtain the new series 0,1,2,3,4,5,6,7,8,9. Thus, we can use the unsigned numbers from 0 to 9 to represent the numbers from -5 to +4, knowing that we have to subtract 5 from the unsigned series to get to the signed series.

A simple formula relates the true value of the exponent to the exponent stored in the computer:

Es = Et + B

Es is the stored, or biased, exponent, Et is the true exponent and B is the bias (a constant), chosen so that B added to the most negative true exponent always gives 0. As far as we are concerned in this course, B = 2^(n-1), where n is the number of bits allocated to the exponent.

Example:
 

With n = 3, 2^n = 8, and the bias is 2^(n-1) = 4. In this case,
the true exponent ranges from -4 to 3, and it will be stored
as a biased exponent ranging from 0 to 7.

0.00010111_2 is normalized to 0.10111 * 2^-3
The true exponent is -3; it will be stored in its biased form -3 + 4 = 1

The following table illustrates the relationship between the true and the biased exponent for the above example.
 
 
True exponent    Biased exponent    Stored in binary as
     -4                 0                  000
     -3                 1                  001
     -2                 2                  010
     -1                 3                  011
      0                 4                  100
      1                 5                  101
      2                 6                  110
      3                 7                  111
 

The advantage of representing exponents in biased form is that the most negative exponent value is always stored as 0, so that we can store signed values as unsigned numbers.
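The relation Es = Et + B is easy to express in code. The sketch below uses the bias convention of this section, B = 2^(n-1); the function names are made up for the example. Note that the IEEE formats discussed in section 3.1.3 use a slightly different bias (127 for an 8-bit exponent), so this is an illustration of the principle rather than of any particular standard.

```python
def bias(n_bits):
    """Bias used in this section's simple format: 2**(n_bits - 1)."""
    return 2 ** (n_bits - 1)

def to_stored(true_exponent, n_bits):
    """Es = Et + B, checked against the range of an n-bit unsigned field."""
    stored = true_exponent + bias(n_bits)
    if not 0 <= stored < 2 ** n_bits:
        raise ValueError("exponent out of range for this field width")
    return stored

def to_true(stored_exponent, n_bits):
    """Et = Es - B."""
    return stored_exponent - bias(n_bits)

# Reproduce the table for n = 3: true exponent, biased exponent, bit pattern.
for et in range(-4, 4):
    es = to_stored(et, 3)
    print(f"{et:>3} {es} {es:03b}")
```

Because the stored values are plain unsigned integers, two exponents can be compared with an ordinary unsigned comparison.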
 

 

3.1.3 The IEEE floating point format
 

There are several different formats for storing real data in computers. We've already seen a rather simplistic one: a binary sequence divided into two fields - exponent and mantissa. We will take a look now at the most popular format that is used in the real world: the IEEE floating point format.

The IEEE (Institute of Electrical and Electronics Engineers) format is itself divided into three basic formats, called single, double, and quad, each one of them having a specific level of precision.

An IEEE floating point number is defined as follows:

 N = (-1)^S * 1.F * 2^(E - bias)

where S is the sign bit (0 for a positive mantissa, 1 for a negative mantissa), F is the fractional part of the mantissa, and E is the biased exponent, so that E - bias is the true exponent.

The single precision format is laid out as follows:

The number occupies a total of 32 bits. The first bit is the sign bit (0 <=> positive, 1 <=> negative), the next 8 bits are reserved for the biased exponent, and the remaining 23 bits are left for the fractional part of the mantissa.

Why do we store only the fractional part of the mantissa, and not the entire mantissa? The IEEE floating point numbers are normalized so that their mantissas always have an integer part equal to 1, i.e. the mantissas always lie in the range from 1.000....00 to 1.111....11 (unless otherwise specified, we always talk about binary numbers). Since by the definition of the IEEE format the integer part of the mantissa will always be 1, there is no need to store it explicitly, which saves one bit. This way, the fractional part of the mantissa can have one extra significant bit. Double precision and quad precision deal with the mantissa in the same way.
 
 

Total number of bits: 32
S (1 bit) | Exponent (8 bits) | Fractional part of the mantissa (23 bits)
 
The layout of the fields is the same in all three formats, i.e. first comes the sign bit, then the exponent bits, and then the mantissa bits. The only difference between single precision, double precision and quad precision is that each of them allocates a different number of bits for the exponent and for the mantissa. Remember: the more bits allocated to the mantissa, the more precise it is, and the more bits allocated to the exponent, the greater the range of numbers that can be represented. Here are the features of those 3 formats:
 
 
                              Single      Double      Quad
Number of bits taken by:
  Sign                        1           1           1
  Exponent                    8           11          15
  Fractional mantissa         23          52          112
  Total                       32          64          128
Exponent:
  Bias                        127         1023        16383
  Range of (biased) exponent  0..255      0..2047     0..32767

In all three formats, the minimum exponent (i.e. 0) together with a mantissa = 0 (both the integer and the fractional part equal to 0) is used to represent signed zero, while the maximum exponent is used to represent plus or minus infinity, or NaN. NaN means Not A Number. Also note that since the IEEE format uses a sign bit, there is no need for two's complement arithmetic here.
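To make the field layout and the special exponent values concrete, here is a small single-precision decoder. It is a sketch written for this text, not a reference implementation: the function name `decode_single` is invented, and the branch for a zero exponent with a non-zero mantissa (denormalized numbers) goes beyond what is described above and is included only so that every bit pattern gets a value.

```python
import struct

def decode_single(word):
    """Split a 32-bit pattern into IEEE single fields and compute its value."""
    sign = (word >> 31) & 0x1
    exponent = (word >> 23) & 0xFF          # biased exponent E
    fraction = word & 0x7FFFFF              # 23-bit fractional mantissa F
    if exponent == 0xFF:                    # maximum exponent: infinity or NaN
        value = float("nan") if fraction else (-1) ** sign * float("inf")
    elif exponent == 0 and fraction == 0:   # minimum exponent, zero mantissa
        value = -0.0 if sign else 0.0
    elif exponent == 0:
        # Denormalized numbers: not covered in the text (no hidden leading 1).
        value = (-1) ** sign * (fraction / 2**23) * 2.0 ** (-126)
    else:                                   # N = (-1)^S * 1.F * 2^(E - 127)
        value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, value

print(decode_single(0xC0A00000))   # (1, 129, 2097152, -5.0)

# Cross-check against Python's own interpretation of the same four bytes.
print(struct.unpack(">f", (0xC0A00000).to_bytes(4, "big"))[0])   # -5.0
```

The pattern C0A00000 has sign 1, biased exponent 129 (true exponent 2) and mantissa 1.01, which gives -1.25 * 2^2 = -5.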

It is time now for an example.

Example:
 

Represent -69.125_10 in

a) single-precision IEEE format
b) double-precision IEEE format

Solution:
a)
69.125_10 = 1000101.001_2
Normalize your result to get an integer part equal to one: 1000101.001 = 1.000101001 * 2^6
Throw away the leading 1 to end up with a fractional mantissa of 000101001
The true exponent is 6, but you need it in its biased form. In single precision, the bias is 127, so
the biased exponent = 6 + 127 = 133_10 = 10000101_2. The number is negative, so the sign bit is 1.

Now you have all the information you need, simply write the sign bit, exponent and mantissa in the IEEE format:

1    10000101    00010100100000000000000

Pack the three groups of bits into one word to get the final answer: 11000010100010100100000000000000
It is good practice to represent the final answer in hexadecimal : C28A4000

Note that our fractional mantissa had only 9 bits, so we had to add zeros at the right end to get a 23-bit mantissa as required by the IEEE single precision format.

b)

We can use the figures from above; we only need to update the exponent, because in double precision the bias is 1023. The true exponent is still 6, but the biased exponent becomes 6 + 1023 = 1029_10 = 10000000101_2.
The fractional mantissa is still 000101001; you just need to add enough zeros at its right end to get a 52-bit mantissa. The sign bit is still one. Thus the answer is:

1 10000000101 000101001000...0_2 = C051480000000000_16
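The steps of the worked example (normalize, drop the leading 1, bias the exponent, pack the fields) can be followed mechanically. The function `encode_ieee` below is a hypothetical helper written for this illustration; it handles only non-zero, normalized values and truncates instead of rounding, so it is a sketch of the procedure rather than a full IEEE encoder.

```python
import struct
from fractions import Fraction

def encode_ieee(value, exponent_bits, fraction_bits):
    """Encode a non-zero finite value following the steps of the example.

    Zero, infinity, NaN, range checks and rounding are deliberately
    left out of this sketch.
    """
    sign = 1 if value < 0 else 0
    mantissa = Fraction(abs(value))       # exact rational, no rounding surprises
    true_exponent = 0
    while mantissa >= 2:                  # normalize to the 1.xxx... form
        mantissa /= 2
        true_exponent += 1
    while mantissa < 1:
        mantissa *= 2
        true_exponent -= 1
    bias = 2 ** (exponent_bits - 1) - 1   # 127 for single, 1023 for double
    biased = true_exponent + bias
    # Drop the leading 1 and keep fraction_bits bits (truncating any excess).
    fraction = int((mantissa - 1) * 2 ** fraction_bits)
    return (sign << (exponent_bits + fraction_bits)) | (biased << fraction_bits) | fraction

single = encode_ieee(-69.125, 8, 23)
double = encode_ieee(-69.125, 11, 52)
print(f"{single:08X}")    # C28A4000
print(f"{double:016X}")   # C051480000000000

# Cross-check with the encodings produced by Python's struct module.
print(struct.pack(">f", -69.125).hex(), struct.pack(">d", -69.125).hex())
```

Only the field widths change between the two calls, which mirrors the observation that parts a) and b) share the same mantissa and true exponent.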


FIELD TRIP:

Click here to visit a wonderful page which converts for you IEEE numbers to decimal form and vice versa, and shows in detail  how an IEEE number is decomposed. Highly recommended.
 
 



3.2 FLOATING POINT ARITHMETIC
 
Floating point arithmetic is a little bit more complicated than integer and fixed point arithmetic.

Consider the addition of two real decimal numbers as fixed point numbers:

   1234.00
+
     56.78
   _______

   1290.78

Now if we try to add the same numbers written in floating point notation, we see that simply adding the mantissas will not make sense unless the exponents are equal:

   0.1234 * 10^4
+
   0.5678 * 10^2
   ___________

        ???????

Thus, the following steps must be carried out before adding / subtracting two floating point numbers:

1.    Make the exponents of the two numbers equal by making the smaller exponent equal to the larger and dividing the mantissa of the smaller number by the same factor by which its exponent was increased, in order to preserve the actual value of the number.

2.    Add / subtract the mantissas.

3.    If necessary, re-normalize the result (this is called post-normalization).
 

With B the larger of the two exponents:

a*2^A + b*2^B = a*2^(A-B) * 2^B + b*2^B = (a*2^(A-B) + b) * 2^B
 

We apply those steps to the example above:

1.    0.1234 * 10^4 + 0.5678 * 10^2 = 0.1234 * 10^4 + 0.005678 * 10^4
2.    0.1234 * 10^4 + 0.005678 * 10^4 = 0.129078 * 10^4
3.    In this example, the result is already normalized.

Note that the result of the operation has a greater precision than the two operands: the result has a mantissa with six significant figures (0.129078), while the operands have mantissas with four significant figures each. The mantissa of the result has overflowed, and since computers work with a mantissa of fixed length (four digits in this case), something has to be done to bring the result down to four significant figures. The action to be taken can be either truncation (simply dropping the unwanted digits, which gives 0.1290 * 10^4) or rounding (which gives 0.1291 * 10^4). Rounding is more accurate than truncation, and it leads to an unbiased error, i.e. the result is sometimes overestimated and sometimes underestimated, while truncation always underestimates the magnitude of the result. However, truncation is much simpler to implement than rounding.
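The three steps (align, add, post-normalize) together with the choice between rounding and truncation can be sketched as follows. The representation is an assumption made for this example: a mantissa 0.1234 is held as the integer 1234, the mantissa length is fixed at four digits, and only positive, normalized operands are handled. The name `fp_add` is hypothetical.

```python
DIGITS = 4  # mantissa length used in the example: 0.1234 has 4 digits

def fp_add(m1, e1, m2, e2, rounding=True):
    """Add 0.m1 * 10**e1 and 0.m2 * 10**e2 for positive, normalized operands.

    Mantissas are DIGITS-digit integers, so 1234 stands for 0.1234.
    Returns (mantissa, exponent) in the same convention.
    """
    # Step 1: align the exponents on the larger one.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    shift = e1 - e2
    # Scaling the larger operand up is equivalent to dividing the smaller
    # one by 10**shift, but keeps every digit in an exact integer.
    total = m1 * 10**shift + m2               # Step 2: add the mantissas
    # Step 3: post-normalize and cut the mantissa back to DIGITS digits.
    n = len(str(total))
    exponent = e1 - shift - DIGITS + n
    drop = n - DIGITS
    kept, lost = divmod(total, 10**drop)
    if rounding and 2 * lost >= 10**drop:     # round half up
        kept += 1
        if kept == 10**DIGITS:                # rounding carried out, e.g. 9999 -> 10000
            kept //= 10
            exponent += 1
    return kept, exponent

print(fp_add(1234, 4, 5678, 2))                   # (1291, 4): 0.1291 * 10^4
print(fp_add(1234, 4, 5678, 2, rounding=False))   # (1290, 4): 0.1290 * 10^4
```

Real hardware shifts the smaller mantissa to the right and keeps only a few extra guard digits; the exact integer used here is a simplification that makes the effect of the final cut easy to see.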

When you multiply two floating point numbers, follow these steps:
1.    Add the exponents.
2.    Multiply the mantissas (as unsigned numbers).
3.    Like signs give a positive result, different signs give negative result.
 

(a*2^A) * (b*2^B) = (a*b) * 2^A * 2^B = (a*b) * 2^(A+B)
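The multiplication steps can be sketched in the same style, again in decimal with a four-digit mantissa held as an integer (an assumption made for this example, as is the name `fp_mul`). The sign is kept as a separate bit, and a final normalization with rounding is added because the raw product has about twice as many digits as the mantissa can hold.

```python
DIGITS = 4  # assumed mantissa length: 1234 stands for 0.1234

def fp_mul(s1, m1, e1, s2, m2, e2):
    """Multiply (-1)**s1 * 0.m1 * 10**e1 by (-1)**s2 * 0.m2 * 10**e2.

    Mantissas are normalized DIGITS-digit unsigned integers.
    Returns (sign, mantissa, exponent).
    """
    exponent = e1 + e2                  # Step 1: add the exponents
    product = m1 * m2                   # Step 2: multiply unsigned mantissas
    sign = s1 ^ s2                      # Step 3: like signs give 0 (positive)
    # The product has 2*DIGITS or 2*DIGITS - 1 digits; renormalize it.
    n = len(str(product))
    exponent += n - 2 * DIGITS
    drop = n - DIGITS
    kept, lost = divmod(product, 10**drop)
    if 2 * lost >= 10**drop:            # round half up
        kept += 1
        if kept == 10**DIGITS:          # rounding carried out of the mantissa
            kept //= 10
            exponent += 1
    return sign, kept, exponent

# (0.1234 * 10^4) * (-0.5678 * 10^2) = -70066.52, stored as -0.7007 * 10^5
print(fp_mul(0, 1234, 4, 1, 5678, 2))   # (1, 7007, 5)
```

When exponents are stored in biased form, adding two stored exponents counts the bias twice, so one bias has to be subtracted from the sum; the sketch sidesteps this by working with true exponents.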
 

When performing floating point arithmetic, you should always bear in mind the following facts:

1.    Because in general the exponent and the mantissa share the same word, it is often necessary to unpack them, i.e. separate them, before carrying out operations on them. For example, with the IEEE single precision format, the exponent and the mantissa are packed together in a 32-bit word to minimize storage space when they are stored in memory. When the number is unpacked in order to participate in an operation, the format is said to be extended. The leading 1 is inserted at the front of the 23-bit fractional mantissa, which results in a 24-bit mantissa, and then the mantissa is extended to 32 bits. In a computer like the 68000, where words have 16 bits, the mantissa is extended to occupy two full words, which greatly increases its precision. All calculations are performed with the extended 32-bit mantissa. After the "number has done its job", it is packed back to its basic format and stored in memory.

2.    If the two exponents differ by more than m + 1, where m is the number of significant bits of the mantissa, there is no point in adding or subtracting those two numbers, because the smaller number is too small to affect the larger number in a significant way. There is no point in adding 0.123987 * 10^3 to 0.112233 * 10^41, because this addition will have no effect on a 6-digit mantissa. The end result will be effectively equal to the larger operand.

3.    If the exponent of the result of an operation is greater than the maximum possible value, or smaller than the minimum possible value, we have a case of exponent overflow or exponent underflow, respectively. Those two cases represent out-of-range conditions, because in each case the number is outside the range of numbers the computer can handle. In case of exponent underflow, the number is in general made equal to zero, while exponent overflow in general results in an error requiring special measures.
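The second and third facts can be observed directly with ordinary Python floats. The sketch assumes that Python's `float` is an IEEE double precision number, which is the case on common platforms. Note one difference from the general description above: for multiplication, this implementation reports exponent overflow by returning infinity rather than by raising an error.

```python
import math
import sys

# Fact 2: a double has a 53-bit mantissa (52 stored bits plus the hidden 1),
# so an operand 2**60 times smaller than the other cannot affect the sum.
big = 2.0 ** 60
print(big + 1.0 == big)            # True: the smaller operand is absorbed

# Fact 3: exponent overflow and underflow in double precision.
largest = sys.float_info.max       # about 1.8 * 10^308
smallest = sys.float_info.min      # smallest normalized double, about 2.2 * 10^-308
print(largest * 10.0)              # inf: the exponent no longer fits
print(smallest * 1e-20)            # 0.0: the result underflows to zero
print(math.isinf(largest * 10.0))  # True
```

Absorption is the reason why summing many small numbers into one large accumulator can silently lose all of the small contributions.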
 



3.3 LOGICAL OPERATIONS ON BINARY NUMBERS

The logical operations we are concerned with in this course are AND, OR, EOR, and NOT. They are called "logical" because they consider 1 as true and 0 as false (this is the accepted convention, but there is no reason why 0 can't be called true and 1 false). Those operations are essentially bitwise, i.e. they are executed bit by bit. The NOT operation is the only one that takes one operand instead of two. The other operations are carried out between pairs of corresponding bits in the two operands. We will later see that the Motorola 68000 processor has specific instructions that execute those operations.
 

All logical operations take operands which can be either true or false, and yield a result which can be either true or false. The result of the expression  a AND b is true if and only if both a and b are true. Otherwise, it is false. Here is the truth table for the AND operation. A truth table gives the result of the logical operation for all possible combinations of input values. Remember that 1 means true and 0 means false.
 
 
a b a AND b
0 0 0
0 1 0
1 0 0
1 1 1
 
As you can see, a AND b is 1 if and only if both a and b are 1.

When a logical operation is applied to two words of n bits each, the operation is carried out between corresponding pairs of bits. E.g., if we have abcd AND wxyz, the result is mnop, where p = d AND z, o = c AND y, n = b AND x, m = a AND w.

Example:

A           = 01011001
B           = 00001111
A AND B = C = 00001001

You can see that the bits in the result are set to 1 only when the two corresponding bits in the operands are set to 1. That means that the AND operator acts as a selective mask. If you want to mask, or clear (set to 0), the 4 most significant bits of A and leave the other bits of A unchanged, just AND A with a number B = 00001111 that has its 4 most significant bits set to 0 and its other bits set to 1.

The result of the operation a OR b is false if and only if both a and b are false. It is true when both operands are true, or when only one of them is true. Here is the truth table for the OR operation:
 
a b a OR b
0 0 0
0 1 1
1 0 1
1 1 1
 
As you can see, it is enough for one operand to be true for the whole expression to be true. ORing two words with n bits each is done in a way similar to the ANDing of two words, the only difference being a different truth table.

Example:

A          = 01011001
B          = 00001111
A OR B = C = 01011111

You can see that the bits in the result are set to zero only when the two corresponding bits in the operands are set to zero. That means that the OR operator can be used to selectively set bits (to 1). If you want to set to 1 all of the 4 least significant bits of A and leave its other bits unchanged, just OR A with a number B = 00001111 that has its 4 least significant bits set to 1 and its other bits set to 0. In this sense the OR operator can be considered the counterpart of the AND operator: ANDing with a mask clears selected bits, while ORing with a mask sets them.

The EOR, or Exclusive OR, operation is similar to the OR operation, except that the result is true if and only if just one of the operands is true. It is false when both operands are true, or when both operands are false. Here is the truth table for the EOR operation:
 
a b a EOR b
0 0 0
0 1 1
1 0 1
1 1 0
 
In other words, the result of the EOR operation is 1 exactly when the two inputs differ.

Example:

A           = 01011001
B           = 00001111
A EOR B = C = 01010110

This example shows that the EOR operation can be used to selectively toggle (change the state of) bits. If you want to invert the 4 least significant bits of A and leave its other bits unchanged, just EOR A with B = 00001111. Note that if we neglect any carry bits, the EOR operation is identical to a bit-by-bit addition.
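The three masking examples of this section (clear with AND, set with OR, toggle with EOR) can be reproduced with Python's bitwise operators `&`, `|` and `^`; Python writes EOR as `^`. The variable names are chosen for this sketch. The last lines check the remark about addition without carries by adding each pair of bits modulo 2.

```python
A = 0b01011001
MASK = 0b00001111          # B in the examples: the 4 least significant bits

cleared = A & MASK         # AND: clears the 4 most significant bits
set_low = A | MASK         # OR:  sets the 4 least significant bits to 1
toggled = A ^ MASK         # EOR: inverts the 4 least significant bits

for name, value in (("A AND B", cleared), ("A OR B", set_low), ("A EOR B", toggled)):
    print(f"{name:8} = {value:08b}")
# A AND B  = 00001001
# A OR B   = 01011111
# A EOR B  = 01010110

# EOR as addition without carries: add each pair of bits modulo 2.
bitwise_sum = sum((((A >> i) + (MASK >> i)) & 1) << i for i in range(8))
print(bitwise_sum == toggled)   # True
```

Applying the same EOR mask twice restores the original value, which is why toggling is reversible while clearing and setting are not.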
  The NOT operation is the simplest of the four. It takes only one operand, and simply inverts, or complements, its bits. Here is the truth table for the NOT operation:
 
 
a NOT a
0 1
1 0
 
Example:
A = 1010011            NOT A = 0101100 = A's complement
Note that the two's complement of N can be defined as  NOT(N) + 1 .
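In a language with unbounded integers such as Python, NOT has to be restricted to a fixed word width, because `~a` on its own yields a negative number. The sketch below assumes the 7-bit width of the example above; `not_bits` and `twos_complement` are hypothetical helper names.

```python
WIDTH = 7                       # A in the example above has 7 bits
FULL = (1 << WIDTH) - 1         # 1111111: keeps results inside WIDTH bits

def not_bits(a):
    """Bitwise NOT restricted to WIDTH bits."""
    return ~a & FULL

def twos_complement(n):
    """NOT(n) + 1, wrapped to WIDTH bits."""
    return (not_bits(n) + 1) & FULL

A = 0b1010011
print(f"{not_bits(A):0{WIDTH}b}")          # 0101100
print(f"{twos_complement(A):0{WIDTH}b}")   # 0101101

# A number plus its two's complement is zero once the carry out is dropped.
print((A + twos_complement(A)) & FULL)     # 0
```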



3.4 SUMMARY
 


If you want to do some exercises, here is a little quiz.



Copyright © McGill University, 1998. All rights reserved.
Reproduction of all or part of this work is permitted for educational or research purposes provided that this copyright notice is included in any copy.