Chapter 8 - Statistical Reasoning

So far, we have described several representation techniques that can be used to model belief systems in which, at any given point, a particular fact is believed to be true, believed to be false, or not considered one way or the other. For some kinds of problem solving, though, it is useful to be able to describe beliefs that are not certain but for which there is some supporting evidence. Let's consider two classes of such problems.

The first class contains problems in which there is genuine randomness in the world. Playing card games such as bridge and blackjack is a good example of this class. Although in these problems it is not possible to predict the world with certainty, some knowledge about the likelihood of various outcomes is available, and we would like to be able to exploit it.

The second class contains problems that could, in principle, be modeled using the techniques we described in the last chapter. In these problems, the relevant world is not random; it behaves "normally" unless there is some kind of exception. The difficulty is that there are many more possible exceptions than we care to enumerate explicitly (using techniques such as AB and UNLESS). Many common sense tasks fall into this category, as do many expert reasoning tasks such as medical diagnosis. For problems like this, statistical measures may serve a very useful function as summaries of the world; rather than enumerating all the possible exceptions, we can use a numerical summary that tells us how often an exception of some sort can be expected to occur.

In this chapter we explore several techniques that can be used to augment knowledge representation techniques with statistical measures that describe levels of evidence and belief.

8.1 Probability and Bayes' Theorem

An important goal for many problem-solving systems is to collect evidence as the system goes along and to modify its behavior on the basis of the evidence. To model this behavior, we need a statistical theory of evidence. Bayesian statistics is such a theory. The fundamental notion of Bayesian statistics is that of conditional probability:

P(H|E)

Read this expression as the probability of hypothesis H given that we have observed evidence E. To compute this, we need to take into account the prior probability of H (the probability that we would assign to H if we had no evidence) and the extent to which E provides evidence of H. To do this, we need to define a universe that contains an exhaustive, mutually exclusive set of H_i_'s, among which we are trying to discriminate. Then, let

P(H_i_ | E) = the probability that hypothesis H_i_ is true given evidence E

P(E | H_i_) = the probability that we will observe evidence E given that hypothesis i is true

P(H_i_) = the a priori probability that hypothesis i is true in the absence of any specific evidence. These probabilities are called prior probabilities or priors.

k = the number of possible hypotheses

Bayes' theorem then states that

P(H_i \mid E) = \frac{P(E \mid H_i) \cdot P(H_i)}{\sum_{n=1}^{k} P(E \mid H_n) \cdot P(H_n)}

Suppose, for example, that we are interested in examining the geological evidence at a particular location to determine whether that would be a good place to dig to find a desired mineral. If we know the prior probabilities of finding each of the various minerals and we know the probabilities that if a mineral is present then certain physical characteristics will be observed, then we can use Bayes' formula to compute, from the evidence we collect, how likely it is that the various minerals are present. This is, in fact, what is done by the PROSPECTOR program [Duda et al., 1979], which has been used successfully to help locate deposits of several minerals, including copper and uranium.
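
To make this concrete, here is a minimal Python sketch of Bayes' theorem applied to a set of competing, mutually exclusive hypotheses. The mineral names, priors, and likelihoods are invented for illustration; they are not PROSPECTOR's actual figures.

def bayes_posteriors(priors, likelihoods):
    # priors[h] = P(h); likelihoods[h] = P(E | h); returns P(h | E) for every h.
    evidence = sum(likelihoods[h] * priors[h] for h in priors)   # the denominator
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

# Hypothetical survey: how likely is each mineral, given the observed rock type?
priors = {"copper": 0.2, "uranium": 0.1, "neither": 0.7}       # P(H_i)
likelihoods = {"copper": 0.6, "uranium": 0.5, "neither": 0.1}  # P(E | H_i)

print(bayes_posteriors(priors, likelihoods))
# {'copper': 0.5, 'uranium': 0.208..., 'neither': 0.291...}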

The key to using Bayes' theorem as a basis for uncertain reasoning is to recognize exactly what it says. Specifically, when we say P(A|B), we are describing the conditional probability of A given that the only evidence we have is B. If there is also other relevant evidence, then it too must be considered. Suppose, for example, that we are solving a medical diagnosis problem. Consider the following assertions:

S: patient has spots

M: patient has measles

F: patient has high fever

Without any additional evidence, the presence of spots serves as evidence in favor of measles. It also serves as evidence of fever since measles would cause fever. But suppose we already know that the patient has measles. Then the additional evidence that he has spots actually tells us nothing about the likelihood of fever. Alternatively either spots alone or fever alone would constitute evidence in favor of measles. If both are present, we need to take both into account in determining the total weight of evidence. But, since spots and fever are not independent events, we cannot just sum their effects. Instead, we need to represent explicitly the conditional probability that arises from their conjunction. In general, given a prior body of evidence e and some new observation E, we need to compute

P(H \mid E, e) = P(H \mid E) \cdot \frac{P(e \mid E, H)}{P(e \mid E)}
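
To see why this identity holds, expand both sides using the definition of conditional probability (a one-line check, with H, E, and e as above):

P(H \mid E, e) = \frac{P(H, E, e)}{P(E, e)}
              = \frac{P(e \mid E, H)\, P(H \mid E)\, P(E)}{P(e \mid E)\, P(E)}
              = P(H \mid E) \cdot \frac{P(e \mid E, H)}{P(e \mid E)}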

Unfortunately, in an arbitrarily complex world, the size of the set of joint probabilities that we require in order to compute this function grows as 2^n^ if there are n different propositions being considered. This makes using Bayes' theorem intractable for several reasons:

  • The knowledge acquisition problem is insurmountable: too many probabilities have to be provided.
  • The space that would be required to store all the probabilities is too large.
  • The time required to compute the probabilities is too large.

Despite these problems, though, Bayesian statistics provide an attractive basis for an uncertain reasoning system. As a result, several mechanisms for exploiting its power while at the same time making it tractable have been developed. In the rest of this chapter, we explore three of these:

  • Attaching certainty factors to rules
  • Bayesian networks
  • Dempster-Shafer theory

We also mention one very different numerical approach to uncertainty, fuzzy logic.

There has been an active, strident debate for many years on the question of whether pure Bayesian statistics are adequate as a basis for the development of reasoning programs. (See, for example, Cheeseman [1985] for arguments that it is and Buchanan and Shortliffe [1984] for arguments that it is not.) On the one hand, non-Bayesian approaches have been shown to work well for some kinds of applications (as we see below). On the other hand, there are clear limitations to all known techniques. In essence, the jury is still out. So we sidestep the issue as much as possible and simply describe a set of methods and their characteristics.

8.2 Certainty Factors and Rule-Based Systems

In this section we describe one practical way of compromising on a pure Bayesian system. The approach we discuss was pioneered in the MYCIN system [Shortliffe, 1976; Buchanan and Shortliffe, 1984; Shortliffe and Buchanan, 1975], which attempts to recommend appropriate therapies for patients with bacterial infections. It interacts with the physician to acquire the clinical data it needs. MYCIN is an example of an expert system, since it performs a task normally done by a human expert. Here we concentrate on the use of probabilistic reasoning; Chapter 20 provides a broader view of expert systems.

MYCIN represents most of its diagnostic knowledge as a set of rules. Each rule has associated with it a certainty factor, which is a measure of the extent to which the evidence that is described by the antecedent of the rule supports the conclusion that is given in the rule's consequent. A typical MYCIN rule looks like:

If: (1) the stain of the organism is gram-positive, and

(2) the morphology of the organism is coccus, and

(3) the growth conformation of the organism is clumps,

then there is suggestive evidence (0.7) that

the identity of the organism is staphylococcus.

This is the form in which the rules are stated to the user. They are actually represented internally in an easy-to-manipulate LISP list structure. The rule we just saw would be represented internally as

PREMISE: ($AND (SAME CNTXT GRAM GRAMPOS)

(SAME CNTXT MORPH COCCUS)

(SAME CNTXT CONFORM CLUMPS))

ACTION: (CONCLUDE CNTXT IDENT STAPHYLOCOCCUS TALLY 0.7)

MYCIN uses these rules to reason backward to the clinical data available from its goal of finding significant disease-causing organisms. Once it finds the identities of such organisms, it then attempts to select a therapy by which the disease(s) may be treated. In order to understand how MYCIN exploits uncertain information, we need answers to two questions: "What do certainty factors mean?" and "How does MYCIN combine the estimates of certainty in each of its rules to produce a final estimate of the certainty of its conclusions?" A further question that we need to answer, given our observations about the intractability of pure Bayesian reasoning, is, "What compromises does the MYCIN technique make and what risks are associated with those compromises?" In the rest of this section we answer all these questions.

Let's start first with a simple answer to the first question (to which we return with a more detailed answer later). A certainty factor (CF[h, e]) is defined in terms of two components:

  • MB[h, e] - a measure (between 0 and 1) of belief in hypothesis h given the evidence e. MB measures the extent to which the evidence supports the hypothesis. It is zero if the evidence fails to support the hypothesis.
  • MD[h, e] - a measure (between 0 and 1) of disbelief in hypothesis h given the evidence e. MD measures the extent to which the evidence supports the negation of the hypothesis. It is zero if the evidence supports the hypothesis.

From these two measures, we can define the certainty factor as

CF[h, e] = MB[h, e] - MD[h, e]

Since any particular piece of evidence either supports or denies a hypothesis (but not both), and since each MYCIN rule corresponds to one piece of evidence (although it may be a compound piece of evidence), a single number suffices for each rule to define both the MB and MD and thus the CF.

Figure 8.1: Combining Uncertain Rules

The CF's of MYCIN's rules are provided by the experts who write the rules. They reflect the experts' assessments of the strength of the evidence in support of the hypothesis. As MYCIN reasons, however, these CF's need to be combined to reflect the operation of multiple pieces of evidence and multiple rules applied to a problem. Figure 8.1 illustrates three combination scenarios that we need to consider. In Figure 8.1(a), several rules all provide evidence that relates to a single hypothesis. In Figure 8.1(b), we need to consider our belief in a collection of several propositions taken together. In Figure 8.1(c), the output of one rule provides the input to another.

What formulas should be used to perform these combinations? Before we answer that question, we need first to describe some properties that we would like the combining functions to satisfy:

  • Since the order in which evidence is collected is arbitrary, the combining functions should be commutative and associative.
  • Until certainty is reached, additional confirming evidence should increase MB (and similarly for disconfirming evidence and MD).
  • If uncertain inferences are chained together, then the result should be less certain than either of the inferences alone.

Having accepted the desirability of these properties, let's first consider the scenario in Figure 8.1(a), in which several pieces of evidence are combined to determine the CF of one hypothesis. The measures of belief and disbelief of a hypothesis given two observations s_1_ and s_2_ are computed from:

MB[h, s_1_ ∧ s_2_] = 0 if MD[h, s_1_ ∧ s_2_] = 1
                   = MB[h, s_1_] + MB[h, s_2_] × (1 - MB[h, s_1_]) otherwise

MD[h, s_1_ ∧ s_2_] = 0 if MB[h, s_1_ ∧ s_2_] = 1
                   = MD[h, s_1_] + MD[h, s_2_] × (1 - MD[h, s_1_]) otherwise

One way to state these formulas in English is that the measure of belief in h is 0 if h is disbelieved with certainty. Otherwise, the measure of belief in h given two observations is the measure of belief given only one observation plus some increment for the second observation. This increment is computed by first taking the difference between 1 (certainty) and the belief given only the first observation. This difference is the most that can be added by the second observation. The difference is then scaled by the belief in h given only the second observation. A corresponding explanation can be given, then, for the formula for computing disbelief. From MB and MD, CF can be computed. Notice that if several sources of corroborating evidence are pooled, the absolute value of CF will increase. If conflicting evidence is introduced, the absolute value of CF will decrease.

A simple example shows how these functions operate. Suppose we make an initial observation that confirms our belief in h with MB = 0.3. Then MD[h,s_1_] = 0 and CF[h,s_1_] = 0.3. Now we make a second observation, which also confirms h, with MB[h, s_2_] = 0.2. Now:

MB[h, s_1_ ∧ s_2_] = 0.3 + 0.2 × 0.7 = 0.44

MD[h, s_1_ ∧ s_2_] = 0.0

CF[h, s_1_ ∧ s_2_] = 0.44

You can see from this example how slight confirmatory evidence can accumulate to produce increasingly larger certainty factors.
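
As a quick illustration, here is a minimal Python sketch (not MYCIN's actual code; the function name is ours) of the parallel combination just described, reproducing the example numbers:

def combine_mb(mb1, mb2):
    # Parallel combination of two measures of belief in the same hypothesis,
    # assuming the hypothesis is not disbelieved with certainty (MD != 1).
    return mb1 + mb2 * (1.0 - mb1)

mb = combine_mb(0.3, 0.2)   # the two confirming observations from the text
md = 0.0                    # neither observation disconfirms h
print(mb, mb - md)          # 0.44 and CF = 0.44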

Next let's consider the scenario of Figure 8.1(b), in which we need to compute the certainty factor of a combination of hypotheses. In particular, this is necessary when we need to know the certainty factor of a rule antecedent that contains several clauses (as, for example, in the staphylococcus rule given above). The combination certainty factor can be computed from its MB and MD. The formulas MYCIN uses for the MB of the conjunction and the disjunction of two hypotheses are:

MB[h_1_ ∧ h_2_, e] = min(MB[h_1_, e], MB[h_2_, e])

MB[h_1_ ∨ h_2_, e] = max(MB[h_1_, e], MB[h_2_, e])

MD can be computed analogously.

Finally, we need to consider the scenario in Figure 8.1(c), in which rules are chained together with the result that the uncertain outcome of one rule must provide the input to another. Our solution to this problem will also handle the case in which we must assign a measure of uncertainty to initial inputs. This could easily happen in situations where the evidence is the outcome of an experiment or a laboratory test whose results are not completely accurate. In such a case, the certainty factor of the hypothesis must take into account both the strength with which the evidence suggests the hypothesis and the level of confidence in the evidence. MYCIN provides a chaining rule that is defined as follows. Let MB'[h, s] be the measure of belief in h given that we are absolutely sure of the validity of s. Let e be the evidence that led us to believe in s (for example, the actual readings of the laboratory instruments or the results of applying other rules). Then:

MB[h, s] = MB'[h, s] × max(0, CF[s, e])
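
The remaining two combination scenarios are just as easy to sketch in the same style; again, these are illustrative helper functions, not MYCIN's code:

def mb_conjunction(mb1, mb2):
    return min(mb1, mb2)        # scenario (b): belief in h1 and h2 together

def mb_disjunction(mb1, mb2):
    return max(mb1, mb2)        # scenario (b): belief in h1 or h2

def chain_mb(mb_prime, cf_s):
    # Scenario (c): MB[h, s] = MB'[h, s] * max(0, CF[s, e]),
    # where cf_s is our confidence in the intermediate evidence s.
    return mb_prime * max(0.0, cf_s)

print(chain_mb(0.8, 0.9))       # 0.72, the chained value in the sprinkler example below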

Since initial CF's in MYCIN are estimates that are given by experts who write the rules, it is not really necessary to state a more precise definition of what a CF means than the one we have already given. The original work did, however, provide one by defining MB (which can be thought of as a proportionate decrease in disbelief in h as a result of e) as:

MB[h, e] = 1 if P(h) = 1
         = (max[P(h|e), P(h)] - P(h)) / (1 - P(h)) otherwise

Similarly, the MD is the proportionate decrease in belief in h as a result of e:

MD[h, e] = 1 if P(h) = 0
         = (min[P(h|e), P(h)] - P(h)) / (0 - P(h)) otherwise

It turns out that these definitions are incompatible with a Bayesian view of conditional probability. Small changes to them, however, make them compatible [Heckerman, 1986]. In particular, we can redefine MB as

The definition of MD must also be changed similarly.

With these reinterpretations, there ceases to be any fundamental conflict between MYCIN's techniques and those suggested by Bayesian statistics. We argued at the end of the last section that pure Bayesian statistics usually leads to intractable systems. But MYCIN works [Buchanan and Shortliffe, 1984]. Why?

Each CF in a MYCIN rule represents the contribution of an individual rule to MYCIN's belief in a hypothesis. In some sense then, it represents a conditional probability, P(H|E). But recall that in a pure Bayesian system, P(H|E) describes the conditional probability of H given that the only relevant evidence is E. If there is other evidence, joint probabilities need to be considered. This is where MYCIN diverges from a pure Bayesian system, with the result that it is easier to write and more efficient to execute, but with the corresponding risk that its behavior will be counterintuitive. In particular, the MYCIN formulas for all three combination scenarios of Figure 8.1 make the assumption that all rules are independent. The burden of guaranteeing independence (at least to the extent that it matters) is on the rule writer. Each of the combination scenarios is vulnerable when this independence assumption is violated.

Let's first consider the scenario in Figure 8.1(a). Our example rule has three antecedents with a single CF rather than three separate rules; this makes the combination rules unnecessary. The rule writer did this because the three antecedents are not independent. To see how much difference MYCIN's independence assumption can make, suppose for a moment that we had instead had three separate rules and that the CF of each was 0.6. This could happen and still be consistent with the combined CF of 0.7 if the three conditions overlap substantially. If we apply the MYCIN combination formula to the three separate rules, we get

MB[h, s_1_ ∧ s_2_] = 0.6 + (0.6 × 0.4) = 0.84

MB[h, (s_1_ ∧ s_2_) ∧ s_3_] = 0.84 + (0.6 × 0.16) = 0.936

This is a substantially different result than the true value, as expressed by the expert, of 0.7.
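
Using the combine_mb sketch from earlier in this section, the arithmetic is easy to reproduce:

# Three "independent" rules, each with CF 0.6, for overlapping evidence that
# the expert actually judged to be worth a combined 0.7.
mb = combine_mb(combine_mb(0.6, 0.6), 0.6)
print(mb)   # 0.936, far from the expert's 0.7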

Now let's consider what happens when independence assumptions are violated in the scenario of Figure 8.1(c). Let's consider a concrete example in which:

S: sprinkler was on last night

W: grass is wet

R: it rained last night

We can write MYCIN-style rules that describe predictive relationships among these three events:

If: the sprinkler was on last night

then there is suggestive evidence (0.9) that

the grass will be wet this morning

Taken alone, this rule may accurately describe the world. But now consider a second rule:

If: the grass is wet this morning

then there is suggestive evidence (0.8) that

it rained last night

Taken alone, this rule makes sense when rain is the most common source of water on the grass. But if the two rules are applied together, using MYCIN's rule for chaining, we get

MB[W, S] = 0.9 {sprinkler suggests wet}

MB[R, W] = 0.8 × 0.9 = 0.72 {wet suggests rain}

In other words, we believe that it rained because we believe the sprinkler was on. We get this despite the fact that if the sprinkler is known to have been on and to be the cause of the grass being wet, then there is actually almost no evidence for rain (because the wet grass has been explained some other way). One of the major advantages of the modularity of the MYCIN rule system is that it allows us to consider individual antecedent/consequent relationships independently of others. In particular, it lets us talk about the implications of a proposition without going back and considering the evidence that supported it. Unfortunately, this example shows that there is a danger in this approach whenever the justifications of a belief are important to determining its consequences. In this case, we need to know why we believe the grass is wet (e.g., because we observed it to be wet as opposed to because we know the sprinkler was on) in order to determine whether the wet grass is evidence for it having just rained.

It is worth pointing out here that this example illustrates one specific rule structure that almost always causes trouble and should be avoided. Notice that our first rule describes a causal relationship (the sprinkler causes wet grass). The second rule, although it looks the same, actually describes an inverse causality relationship (wet grass is caused by rain and thus is evidence for its cause). Although one can derive evidence for a symptom from its cause and for a cause from observing its symptom, it is important that evidence that is derived one way not be used again to go back the other way with no new information. To avoid this problem, many rule-based systems either limit their rules to one structure or clearly partition the two kinds so that they cannot interfere with each other. When we discuss Bayesian networks in the next section, we describe a systematic solution to this problem.

We can summarize this discussion of certainty factors and rule-based systems as follows. The approach makes strong independence assumptions that make it relatively easy to use; at the same time, those assumptions create dangers if rules are not written carefully so that important dependencies are captured. The approach can serve as the basis of practical application programs. It did so in MYCIN. It has done so in a broad array of other systems that have been built on the EMYCIN platform [van Melle et al., 1981], which is a generalization (often called a shell) of MYCIN with all the domain-specific rules stripped out. One reason that this framework is useful, despite its limitations, is that it appears that in an otherwise robust system the exact numbers that are used do not matter very much. The other reason is that the rules were carefully designed to avoid the major pitfalls we have just described. One other interesting thing about this approach is that it appears to mimic quite well [Shultz et al., 1989] the way people manipulate certainties.

8.3 Bayesian Networks

In the last section, we described CF's as a mechanism for reducing the complexity of a Bayesian reasoning system by making some approximations to the formalism. In this section, we describe an alternative approach, Bayesian networks [Pearl, 1988], in which we preserve the formalism and rely instead on the modularity of the world we are trying to model. The main idea is that to describe the real world, it is not necessary to use a huge joint probability table in which we list the probabilities of all conceivable combinations of events. Most events are conditionally independent of most other ones, so their interactions need not be considered. Instead, we can use a more local representation in which we describe clusters of events that interact.

Recall that in Figure 8.1 we used a network notation to describe the various kinds of constraints on likelihoods that propositions can have on each other. The idea of constraint networks turns out to be very powerful. We expand on it in this section as a way to represent interactions among events; we also return to it later in Sections 11.3.1 and 14.3, where we talk about other ways of representing knowledge as sets of constraints.

Let's return to the example of the sprinkler, rain, and grass that we introduced in the last section. Figure 8.2(a) shows the flow of constraints we described in MYCIN-style

Figure 8.2: Representing Causality Uniformly

rules. But recall that the problem that we encountered with that example was that the constraints flowed incorrectly from "sprinkler on" to "rained last night." The problem was that we failed to make a distinction that turned out to be critical. There are two different ways that propositions can influence the likelihood of each other. The first is that causes influence the likelihood of their symptoms; the second is that observing a symptom affects the likelihood of all of its possible causes. The idea behind the Bayesian network structure is to make a clear distinction between these two kinds of influence.

Specifically, we construct a directed acyclic graph (DAG) that represents causality relationships among variables. The idea of a causality graph (or network) has proved to be very useful in several systems, particularly medical diagnosis systems such as CASNET [Weiss et al., 1978] and INTERNIST/CADUCEUS [Pople, 1982]. The variables in such a graph may be propositional (in which case they can take on the values TRUE and FALSE) or they may be variables that take on values of some other type (e.g., a specific disease, a body temperature, or a reading taken by some other diagnostic device). In Figure 8.2(b), we show a causality graph for the wet grass example. In addition to the three nodes we have been talking about, the graph contains a new node corresponding to the propositional variable that tells us whether it is currently the rainy season.

A DAG, such as the one we have just drawn, illustrates the causality relationships that occur among the nodes it contains. In order to use it as a basis for probabilistic reasoning, however, we need more information. In particular, we need to know, for each value of a parent node, what evidence is provided about the values that the child node can take on. We can state this in a table in which the conditional probabilities are provided. We show such a table for our example in Figure 8.3. For example, from the table we see that the prior probability of the rainy season is 0.5. Then, if it is the rainy season, the probability of rain on a given night is 0.9; if it is not, the probability is only 0.1.

To be useful as a basis for problem solving, we need a mechanism for computing the influence of any arbitrary node on any other. For example, suppose that we have observed that it rained last night. What does that tell us about the probability that it is the

Figure 8.3: Conditional Probabilities for a Bayesian Network

rainy season? To answer this question requires that the initial DAG be converted to an undirected graph in which the arcs can be used to transmit probabilities in either direction, depending on where the evidence is coming from. We also require a mechanism for using the graph that guarantees that probabilities are transmitted correctly. For example, while it is true that observing wet grass may be evidence for rain, and observing rain is evidence for wet grass, we must guarantee that no cycle is ever traversed in such a way that wet grass is evidence for rain, which is then taken as evidence for wet grass, and so forth.
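
For the particular query just posed, the answer can be computed by brute-force enumeration; here is a minimal Python sketch that uses only the entries quoted from Figure 8.3 (the prior of 0.5 for the rainy season and the 0.9/0.1 probabilities of rain). The general algorithms described next do this kind of computation systematically for arbitrary networks and queries.

p_season = 0.5                              # P(rainy season)
p_rain = {True: 0.9, False: 0.1}            # P(rained last night | rainy season?)

# Posterior P(rainy season | it rained) by Bayes' rule over the two cases.
joint_season = p_rain[True] * p_season              # P(rain, season)
joint_not    = p_rain[False] * (1.0 - p_season)     # P(rain, not season)
print(joint_season / (joint_season + joint_not))    # 0.9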

There are three broad classes of algorithms for doing these computations: a message-passing method [Pearl, 1988], a clique triangulation method [Lauritzen and Spiegelhalter, 1988], and a variety of stochastic algorithms. The idea behind these methods is to take advantage of the fact that nodes have limited domains of influence. Thus, although in principle the task of updating probabilities consistently throughout the network is intractable, in practice it may not be. In the clique triangulation method, for example, explicit arcs are introduced between pairs of nodes that share a common descendent. For the case shown in Figure 8.2(b), a link would be introduced between Sprinkler and Rain. This explicit link supports assessing the impact of the observation Sprinkler on the hypothesis Rain. This is important since wet grass could be evidence of either of them, but wet grass plus one of its causes is not evidence for the competing cause since an alternative explanation for the observed phenomenon already exists.

The message-passing approach is based on the observation that to compute the probability of a node A given what is known about other nodes in the network, it is necessary to know three things:

  • π - the total support arriving at A from its parent nodes (which represent its causes).
  • λ - the total support arriving at A from its children (which represent its symptoms).
  • The entry in the fixed conditional probability matrix that relates A to its causes.

Several methods for propagating π and λ messages and updating the probabilities at the nodes have been developed. The structure of the network determines which approach can be used. For example, in singly connected networks (those in which there is only a single path between every pair of nodes), a simpler algorithm can be used than in the case of multiply connected ones. For details, see Pearl [1988].

Finally, there are stochastic, or randomized, algorithms for updating belief networks. One such algorithm [Chavez, 1989] transforms an arbitrary network into a Markov chain. The idea is to shield a given node probabilistically from most of the other nodes in the network. Stochastic algorithms run fast in practice, but may not yield absolutely correct results.

8.4 Dempster-Shafer Theory

So far, we have described several techniques, all of which consider individual propositions and assign to each of them a point estimate (i.e., a single number) of the degree of belief that is warranted given the evidence. In this section, we consider an alternative technique, called Dempster-Shafer theory [Dempster, 1968; Shafer, 1976]. This new approach considers sets of propositions and assigns to each of them an interval

[Belief, Plausibility]

in which the degree of belief must lie. Belief (usually denoted Bel) measures the strength of the evidence in favor of a set of propositions. It ranges from 0 (indicating no evidence) to 1 (denoting certainty).

Plausibility (Pl) is defined to be

Pl(s) = 1 - Bel(¬s)

It also ranges from 0 to 1 and measures the extent to which evidence in favor of ¬s leaves room for belief in s. In particular, if we have certain evidence in favor of ¬s, then Bel(¬s) will be 1 and Pl(s) will be 0. This tells us that the only possible value for Bel(s) is also 0.

The belief-plausibility interval we have just defined measures not only our level of belief in some propositions, but also the amount of information we have. Suppose that we are currently considering three competing hypotheses: A, B, and C. If we have no information, we represent that by saying, for each of them, that the true likelihood is in the range [0,1]. As evidence is accumulated, this interval can be expected to shrink, representing increased confidence that we know how likely each hypothesis is. Note that this contrasts with a pure Bayesian approach, in which we would probably begin by distributing the prior probability equally among the hypotheses and thus assert for each that P(h) = 0.33. The interval approach makes it clear that we have no information when we start. The Bayesian approach does not, since we could end up with the same probability values if we collected volumes of evidence, which taken together suggest that the three values occur equally often. This difference can matter if one of the decisions that our program needs to make is whether to collect more evidence or to act on the basis of the evidence it already has.

So far, we have talked intuitively about Bel as a measure of our belief in some hypothesis given some evidence. Let's now define it more precisely. To do this, we need to start, just as with Bayes' theorem, with an exhaustive universe of mutually exclusive hypotheses. We'll call this the frame of discernment and we'll write it as Θ. For example, in a simplified diagnosis problem, Θ might consist of the set {All, Flu, Cold, Pneu}:

All: allergy

Flu: flu

Cold: cold

Pneu: pneumonia

Our goal is to attach some measure of belief to elements of Θ. However, not all evidence is directly supportive of individual elements. Often it supports sets of elements (i.e., subsets of Θ). For example, in our diagnosis problem, fever might support {Flu, Cold, Pneu}. In addition, since the elements of Θ are mutually exclusive, evidence in favor of some may have an effect on our belief in the others. In a purely Bayesian system, we can handle both of these phenomena by listing all of the combinations of conditional probabilities. But our goal is not to have to do that. Dempster-Shafer theory lets us handle interactions by manipulating sets of hypotheses directly.

The key function we use is a probability density function, which we denote as m. The function m is defined not just for elements of Θ but for all subsets of it (including singleton subsets, which correspond to individual elements). The quantity m(p) measures the amount of belief that is currently assigned to exactly the set p of hypotheses. If Θ contains n elements, then there are 2^n^ subsets of Θ. We must assign m so that the sum of all the m values assigned to the subsets of Θ is 1. Although dealing with 2^n^ values may appear intractable, it usually turns out that many of the subsets will never need to be considered because they have no significance in the problem domain (and so their associated value of m will be 0).

Let's see how m works for our diagnosis problem. Assume that we have no information about how to choose among the four hypotheses when we start the diagnosis task. Then we define m as:

Θ (1.0)

All other values of m are thus 0. Although this means that the actual value must be some one element All, Flu, Cold, or Pneu, we do not have any information that allows us to assign belief in any other way than to say that we are sure the answer is somewhere in the whole set. Now suppose we acquire a piece of evidence that suggests (at a level of 0.6) that the correct diagnosis is in the set {Flu, Cold, Pneu}. Fever might be such a piece of evidence. We update m as follows:

{Flu, Cold, Pneu} (0.6)

Θ (0.4)

At this point, we have assigned to the set {Flu, Cold, Pneu} the appropriate belief. The remainder of our belief still resides in the larger set Θ. Notice that we do not make the commitment that the remainder must be assigned to the complement of {Flu, Cold, Pneu}.

Having defined m, we can now define Bel(p) for a set p as the sum of the values of m for p and for all of its subsets. Thus Bel(p) is our overall belief that the correct answer lies somewhere in the set p.
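
A minimal Python sketch shows how little machinery this takes; the representation (an m function as a dictionary from frozensets to masses) is our own choice, and the masses are the running example from the text:

THETA = frozenset({"All", "Flu", "Cold", "Pneu"})
m = {frozenset({"Flu", "Cold", "Pneu"}): 0.6, THETA: 0.4}

def bel(m, p):
    # Bel(p): total mass assigned to p and to every subset of p.
    return sum(mass for s, mass in m.items() if s <= p)

def pl(m, p):
    # Pl(p) = 1 - Bel(complement of p).
    return 1.0 - bel(m, THETA - p)

print(bel(m, frozenset({"Flu", "Cold", "Pneu"})))   # 0.6
print(pl(m, frozenset({"All"})))                    # 0.4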

In order to be able to use m (and thus Bel and Pl) in reasoning programs, we need to define functions that enable us to combine m's that arise from multiple sources of evidence.

Recall that in our discussion of CF's, we considered three combination scenarios, which we illustrated in Figure 8.1. When we use Dempster-Shafer theory, on the other hand, we do not need an explicit combining function for the scenario in Figure 8.1(b) since we have that capability already in our ability to assign a value of m to a set of hypotheses. But we do need a mechanism for performing the combinations of scenarios (a) and (c). Dempster's rule of combination serves both these functions. It allows us to combine any two belief functions (whether they represent multiple sources of evidence for a single hypothesis or multiple sources of evidence for different hypotheses).

Suppose we are given two belief functions m_1_ and m_2_. Let X be the set of subsets of Θ to which m_1_ assigns a nonzero value and let Y be the corresponding set for m_2_. We define the combination m_3_ of m_1_ and m_2_ to be

m_3_(Z) = Σ_{X ∩ Y = Z} m_1_(X) · m_2_(Y) / (1 - Σ_{X ∩ Y = ∅} m_1_(X) · m_2_(Y))

This gives us a new belief function that we can apply to any subset Z of Θ. We can describe what this formula is doing by looking first at the simple case in which all ways of intersecting elements of X and elements of Y generate nonempty sets. For example, suppose m_1_ corresponds to our belief after observing fever:

{Flu, Cold, Pneu} (0.6)

Θ (0.4)

Suppose m_2_ corresponds to our belief after observing a runny nose:

{All, Flu, Cold} (0.8)

Θ (0.2)

Then we can compute their combination m_3_ using the following table (in which we further abbreviate disease names), which we can derive using the numerator of the combination rule:

                      {A, F, C} (0.8)         Θ (0.2)
{F, C, P} (0.6)       {F, C} (0.48)           {F, C, P} (0.12)
Θ (0.4)               {A, F, C} (0.32)        Θ (0.08)

The four sets that are generated by taking all ways of intersecting an element of X and an element of Y are shown in the body of the table. The value of m_3_ that the combination rule associates with each of them is computed by multiplying the values of m_1_ and m_2_ associated with the elements from which they were derived. Although it did not happen in this simple case, it is possible for the same set to be derived in more than one way during this intersection process. If that does occur, then to compute m_3_ for that set, it is necessary to compute the sum of all the individual values that are generated for all the distinct ways in which the set is produced (thus the summation sign in the numerator of the combination formula).

A slightly more complex situation arises when some of the subsets created by the intersection operation are empty. Notice that we are guaranteed by the way we compute m_3_ that the sum of all its individual values is 1 (assuming that the sums of all the values of m_1_ and m_2_ are 1). If some empty subsets are created, though, then some of m_3_ will be assigned to them. But from the fact that we assumed that Θ is exhaustive, we know that the true value of the hypothesis must be contained in some nonempty subset of Θ. So we need to redistribute any belief that ends up in the empty subset proportionately across the nonempty ones. We do that with the scaling factor shown in the denominator of the combination formula. If no empty subsets are created, the scaling factor is 1, so we were able to ignore it in our first example. But to see how it works, let's add a new piece of evidence to our example. As a result of applying m_1_ and m_2_, we produced m_3_:

{Flu, Cold} (0.48)

{All, Flu, Cold} (0.32)

{Flu, Cold, Pneu} (0.12)

Θ (0.08)
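
Dempster's rule itself is also short to sketch in Python (again, the frozenset-based representation and the names are ours, not part of the theory). The following reproduces m_3_ from m_1_ and m_2_; the normalization step only matters when some of the intersections are empty, as in the next example:

THETA = frozenset({"All", "Flu", "Cold", "Pneu"})

def combine(m1, m2):
    combined = {}
    conflict = 0.0
    for x, mx in m1.items():
        for y, my in m2.items():
            z = x & y                        # intersect the two sets of hypotheses
            if z:
                combined[z] = combined.get(z, 0.0) + mx * my
            else:
                conflict += mx * my          # mass that falls on the empty set
    scale = 1.0 - conflict                   # redistribute over the nonempty sets
    return {z: mass / scale for z, mass in combined.items()}

m1 = {frozenset({"Flu", "Cold", "Pneu"}): 0.6, THETA: 0.4}
m2 = {frozenset({"All", "Flu", "Cold"}): 0.8, THETA: 0.2}
m3 = combine(m1, m2)
# {Flu, Cold}: 0.48, {All, Flu, Cold}: 0.32, {Flu, Cold, Pneu}: 0.12, THETA: 0.08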

Now, let m_4_ correspond to our belief given just the evidence that the problem goes away when the patient goes on a trip:

{All} (0.9)

Θ (0.1)

We can apply the numerator of the combination rule to produce (where ∅ denotes the empty set):

But there is now a total belief of 0.54 associated with ∅; only 0.46 is associated with outcomes that are in fact possible. So we need to scale the remaining values by dividing by the factor 1 - 0.54 = 0.46. If we do this, and also combine alternative ways of generating the set {All, Flu, Cold}, then we get the final combined belief function, m_5_:

{Flu, Cold} (0.104)

{All, Flu, Cold} (0.696)

{Flu, Cold, Pneu} (0.026)

{All} (0.157)

Θ (0.017)

Figure 8.4: Fuzzy versus Conventional Set Membership

In this example, the percentage of m_5_ that was initially assigned to the empty set was large (over half). This happens whenever there is conflicting evidence (as in this case between m_1_ and m_4_).

8.5 Fuzzy Logic

In the techniques we have discussed so far, we have not modified the mathematical underpinnings provided by set theory and logic. We have instead augmented those ideas with additional constructs provided by probability theory. In this section, we take a different approach and briefly consider what happens if we make fundamental changes to our idea of set membership and corresponding changes to our definitions of logical operations.

The motivation for fuzzy sets is provided by the need to represent such propositions as:

  • John is very tall.
  • Mary is slightly ill.
  • Sue and Linda are close friends.
  • Exceptions to the rule are nearly impossible.
  • Most Frenchmen are not very tall.

While traditional set theory defines set membership as a boolean predicate, fuzzy set theory allows us to represent set membership as a possibility distribution, such as the ones shown in Figure 8.4(a) for the set of tall people and the set of very tall people. Notice how this contrasts with the standard boolean definition for tall people shown in Figure 8.4(b). In the latter, one is either tall or not and there must be a specific height that defines the boundary. The same is true for very tall. In the former, one's tallness increases with one's height until the value of 1 is reached.
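
As a small illustration, here is a Python sketch of possibility distributions like those in Figure 8.4(a). The breakpoints (heights in centimeters) are invented, since the figure's axes are not reproduced here, and treating "very" as squaring the membership value is one common convention rather than the only one:

def tall(height_cm):
    # Membership rises linearly from 0 at 165 cm to 1 at 185 cm (illustrative numbers).
    return min(1.0, max(0.0, (height_cm - 165.0) / 20.0))

def very_tall(height_cm):
    # "Very" as concentration: squaring pushes the curve to the right.
    return tall(height_cm) ** 2

print(tall(175), very_tall(175))   # 0.5 0.25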

Once set membership has been redefined in this way, it is possible to define a reasoning system based on techniques for combining distributions [Zadeh, 1979] (or see the papers in the journal Fuzzy Sets and Systems). Such reasoners have been applied in control systems for devices as diverse as trains and washing machines.

8.6 Summary

In this chapter we have shown that Bayesian statistics provide a good basis for reasoning under various kinds of uncertainty. We have also, though, talked about its weaknesses in complex real tasks, and so we have talked about ways in which it can be modified to work in practical domains. The thing that all of these modifications have in common is that they substitute, for the huge joint probability matrix that a pure Bayesian approach requires, a more structured representation of the facts that are relevant to a particular problem. They typically do this by combining probabilistic information with knowledge that is represented using one or more other representational mechanisms, such as rules or constraint networks.

Comparing these approaches for use in a particular problem-solving program is not always straightforward, since they differ along several dimensions, for example:

  • They provide different mechanisms for describing the ways in which propositions are not independent of each other.
  • They provide different techniques for representing ignorance.
  • They differ substantially in the ease with which systems that use them can be built and in the computational complexity that the resulting systems exhibit.

We have also presented fuzzy logic as an alternative for representing some kinds of uncertain knowledge. Although there remain many arguments about the relative overall merits of the Bayesian and the fuzzy approaches, there is some evidence that they may both be useful in capturing different kinds of information. As an example, consider the proposition

  • John was pretty sure that Mary was seriously ill.

Bayesian approaches naturally capture John's degree of certainty, while fuzzy techniques can describe the degree of Mary's illness.

Throughout all of this discussion, it is important to keep in mind the fact that although we have been discussing techniques for representing knowledge, there is another perspective from which what we have really been doing is describing ways of representing lack of knowledge. In this sense, the techniques we have described in this chapter are fundamentally different from the ones we talked about earlier. For example, the truth values that we manipulate in a logical system characterize the formulas that we write; certainty measures, on the other hand, describe the exceptions, that is, the facts that do not appear anywhere in the formulas that we have written. The consequences of this distinction show up in the ways that we can interpret and manipulate the formulas that we write. The most important difference is that logical formulas can be treated as though they represent independent propositions. As we have seen throughout this chapter, uncertain assertions cannot. As a result, for example, while implication is transitive in logical systems, we often get into trouble in uncertain systems if we treat it as though it were (as we saw in our first treatment of the sprinkler and grass example). Another difference is that in logical systems it is necessary to find only a single proof to be able to assert the truth value of a proposition. All other proofs, if there are any, can safely be ignored. In uncertain systems, on the other hand, computing belief in a proposition requires that all available reasoning paths be followed and combined.

One final comment is in order before we end this discussion. You may have noticed throughout this chapter that we have not maintained a clear distinction among such concepts as probability, certainty, and belief. This is because although there has been a great deal of philosophical debate over the meaning of these various terms, there is no clear agreement on how best to interpret them if our goal is to create working programs. Although the idea that probability should be viewed as a measure of belief rather than as a summary of past experience is now quite widely held, we have chosen to avoid the debate in this presentation. Instead, we have used all those words with their everyday, undifferentiated meaning, and we have concentrated on providing simple descriptions of how several algorithms actually work. If you are interested in the philosophical issues, see, for example, Shafer [1976] and Pearl [1988].

Unfortunately, although in the last two chapters we have presented several important approaches to the problem of uncertainty management, we have barely scraped the surface of this area. For more information, see Kanal and Lemmer [1986], Kanal and Lemmer [1988], Kanal et al. [1989], Shafer and Pearl [1990], and Clark [1990]. In particular, our list of specific techniques is by no means complete. For example, you may wish to look into probabilistic logic [Nilsson, 1986; Halpern, 1989], in which probability theory is combined with logic so that the truth value of a formula is a probability value (between 0 and 1) rather than a boolean value (TRUE or FALSE). Or you may wish to ask not what statistics can do for AI but rather what AI can do for statistics. In that case, see Gale [1986].

8.7 Exercises

  1. Consider the following puzzle:
    • A pea is placed under one of three shells, and the shells are then manipulated in such a fashion that all three appear to be equally likely to contain the pea. Nevertheless, you win a prize if you guess the correct shell, so you make a guess. The person running the game does know the correct shell, however, and uncovers one of the shells that you did not choose and that is empty. Thus, what remains are two shells: one you chose and one you did not choose. Furthermore, since the uncovered shell did not contain the pea, one of the two remaining shells does contain it. You are offered the opportunity to change your selection to the other shell. Should you?

    Work through the conditional probabilities mentioned in this problem using Bayes' theorem. What do the results tell about what you should do?

  2. Using MYCIN's rules for inexact reasoning, compute CF, MB, and MD of h_1_, given three observations where

    CF(h_1_,o_1_) = 0.5

    CF(h_1_,o_2_) = 0.3

    CF(h_1_,o_3_) = -0.2

  3. Show that MYCIN's combining rules satisfy the three properties we gave for them.
  4. Consider the following set of propositions:
    • patient has spots
    • patient has measles
    • patient has high fever
    • patient has Rocky Mountain Spotted Fever
    • patient has previously been inoculated against measles
    • patient was recently bitten by a tick
    • patient has an allergy
    1. Create a network that defines the causal connections among these nodes.
    2. Make it a Bayesian network by constructing the necessary conditional probability matrix.
  5. Consider the same propositions again, and assume our task is to identify the patient's disease using Dempster-Shafer theory.
    1. What is Θ?
    2. Define a set of m functions that describe the dependencies among sources of evidence and elements of Θ.
    3. Suppose we have observed spots, fever, and a tick bite. In that case, what is our Bel({RockyMountainSpottedFever})?
  6. Define fuzzy sets that can be used to represent the list of propositions that we gave at the beginning of Section 8.5.
  7. Consider again the ABC Murder story from Chapter 7. In our discussion of it there, we focused on the use of symbolic techniques for representing and using uncertain knowledge. Let's now explore the use of numeric techniques to solve the same problem. For each part below, show how knowledge could be represented. Whenever possible, show how it can be combined to produce a prediction of who committed the murder given at least one possible configuration of the evidence.
    1. Use MYCIN-style rules and CF's. Example rules might include:

      If (1) relative (x,y), and

      (2) on speaking terms (x,y),

      then there is suggestive evidence (0.7) that

      will-lie-for (x,y)

    2. Use Bayesian networks. Represent as nodes such propositions as brother-in-law-lied, Cabot-at-ski-meet, and so forth.
    3. Use Dempster-Shafer theory. Examples of m's might be:

      m_1_ = {Abbott, Babbitt} (0.8) {beneficiaries in will}

      Θ (0.2)

      m_2_ = {Abbott, Cabot} (0.7) {in line for his job}

      Θ (0.3)

    4. Use fuzzy logic. For example, you might want to define such fuzzy sets as honest people or greedy people and describe Abbott, Babbitt, and Cabot's memberships in those sets.
    5. What kinds of information are easiest (and hardest) to represent in each of these frameworks?