Decision Rules

Notation:

As was discussed in the introduction, in most circumstances there is some information about the objects we are trying to classify.

For example, we may have the probability distribution for the colour of apples, as well as that for oranges. To introduce some notation, let wapp represent the state of nature where the fruit is an apple, let worg represent that state where the fruit is an orange, and let x be the continuous random variable that represents the colour of a fruit. Then the expression p(x|wapp) represents the density function for x given that the state of nature is an apple.

In a typical problem, we would know (or be able to calculate) the conditional densities p(x|wj) for j either an apple or an orange. We would also typically know the prior probabilities P(wapp) and P(worg), which represent simply the total number of apples versus oranges that are on the conveyer belt. What we are looking for is some formula that will tell us about the probability of a fruit being an apple or an orange given that we observe a certain colour x. If we had such a probability, then for some given color that we observed we would classify the fruit by comparing the probability that an orange had such a color versus the probability that an apple had such a color. If it were more probable that an apple had such a color, the fruit would be classified as an apple. Fortunately, we can use Baye's Formula which states that :

P(wj|x) = p(x|wj) P(wj)/p(x)
What this is means, is that using our a priori information, we can calculate the a posteriori probability of the state of nature being in state wj given that the feature value x has been measured. So, if you observe a certain x for a random fruit on the conveyer belt, then by calculating P(wapp|x) and P(worg|x) we would be inclined to decide that the fruit was an apple if the first value were greater than the second one. Similary, if P(worg|x) is greater, we would decide that the fruit was most likely an orange. Therefore, Baye's Decision Rule can be stated as:

Decide worg if P(worg|x) > P(wapp|x); otherwise decide wapp,
Since p(x) occurs on both sides of the comparison, the rule is equivalent to saying:

Decide worg if p(x|worg)P(worg) > p(x|wapp)P(wapp);
otherwise decide wapp.

The following graph shows the
a posteriori probabilities for the 2 class
decision problem. At every x, the
posteriors must sum to 1. The red region
on the x axes depicts values for x for
which the decision rule would decide
'apple'. The orange region represents
values for x for which you would decide
'orange'.
The probability that we make an error is just the minimum of the 2 curves at any point, since that represents the smaller prbability that we didn't pick. So
P(error|x) = min[p(wapp|x), p(worg|x)].

This formula represents the probability of making an error for a specific measurment x. But it is often useful to know the average probability of error over all possible measurements. This can be calculated using Bayes' Law of total Probabilities, which implies that

Allowing more than 1 feature and more than 1 class:

In a more general case, there are several different features that we measure, so instead of x we have a feature vector x in Rd for d different features. We also allow for more than 2 possible states of nature, where w1 ... wc represent the c states of nature. Bayes' formula can be computed in the same way as:

P(wj|x) = p(x|wj)P(wj)/ p(x), for j=1..c

but now p(x) can be calculated using the Law of Total Probabilities so that

As before, if we measure feature vector x, we want to classify an object into class j if p(wj|x) is the maximum of all the probability densities for j=1..c. This is the same as the Bayes' decision rule for the 2 category case.
The role of Neural Networks

There is an important relationship between neural networks and pattern classification.

A simple Neural Network contains a set of c discriminant functions gi(x), for i=1,..,c. The network will assign feature vector x to class wj if

gi(x) > gj(x) for all j!=i.

This classifier is like a machine that computes c different discriminant functions and selects the class corresponding to the largest discriminant. Clearly, the Bayes decision rule for the multi-category case can been seen as such a network if we assign
gi(x) = P(wi|x).

In this way, the neural network will classify the object according to the maximum a posteriori probability.

The choice of discriminant functions is not unique. You can always multiply the functions by a positive constant and still have the same decision rule. You can also calculate f(gi(x)) for some monotonically increasing function f, which will also give you a new set of discriminant functions for the same decision rule. Because applying these changes may lead to computational simplifications, there are 4 commonly used discriminant functions used for Bayes' Decision Rule:


The decision rules are equivalent for each of the above sets of discriminant functions.

Decision Regions

When any decision rule is applied to the d-dimentional feature space Rd, the result is that the space is split up into c decision regions R1, ..., Rc. In the above graph for the 2 category case, the decision regions were marked in red and orange at the bottom of the graph. In general, if x lies in decision region Ri then it means that the pattern classifer selected the function gi(x) to be the maximum of all the discriminant functions. The decision regions are any subset of the space Rd. For example, if the feature vector is a 2-dimentional vector, then the discriminant functions gi(x) will be functions of 2 variables and will be mapped in 3-D. The decision regions for this case will be subsets of the x-y plane. Here are 2 simple examples:

An example of 1 dimentional decision regions: R1 and R2.



An example of 2 dimentional decision regions: R1 and R2.


In general,decision regions may be any subsets of Rd and it is common to have a region Ri that is disconnected.

Obviosly, the shape of the decision bondary depends on the functions P(wi|x). The next section takes a closer look at discriminant functions and their corresponding decision regions for the Normal Density in particular.