In this chapter, we discuss the use of rules to encode knowledge. This is a particularly important issue since rule-based reasoning systems have played a very important role in the evolution of AI from a purely laboratory science into a commercially significant one, as we see later in Chapter 20.
We have already talked about rules as the basis for a search program. But we gave little consideration to the way knowledge about the world was represented in the rules (although we can see a simple example of this in Section 4.2). In particular, we have been assuming that search control knowledge was maintained completely separately from the rules themselves. We will now relax that assumption and consider a set of rules that represents both knowledge about relationships in the world and knowledge about how to solve problems using the content of the rules.
Since our discussion of knowledge representation has concentrated so far on the use of logical assertions, we use logic as a starting point in our discussion of rule-based systems.
In the previous chapter, we viewed logical assertions as declarative representations of knowledge. A declarative representation is one in which knowledge is specified, but the use to which that knowledge is to be put is not given. To use a declarative representation, we must augment it with a program that specifies what is to be done to the knowledge and how. For example, a set of logical assertions can be combined with a resolution theorem prover to give a complete program for solving problems. There is a different way, though, in which logical assertions can be viewed, namely as a program, rather than as data to a program. In this view, the implication statements define the legitimate reasoning paths and the atomic assertions provide the starting points (or, if we reason backward, the ending points) of those paths. These reasoning paths define the possible execution paths of the program in much the same way that traditional control constructs, such as if-then-else, define the execution paths through traditional programs. In other words, we could view logical assertions as procedural representations of knowledge. A procedural representation is one in which the control information that is necessary to use the knowledge is considered to be embedded in the knowledge itself. To use a procedural representation, we need to augment it with an interpreter that follows the instructions given in the knowledge.
Actually, viewing logical assertions as code is not a very radical idea, given that all programs are really data to other programs that interpret (or compile) and execute them. The real difference between the declarative and the procedural views of knowledge lies in where control information resides. For example, consider the knowledge base:
man(Marcus)
man(Caesar)
person(Cleopatra)
∀x : man(x) → person(x)
Now consider trying to extract from this knowledge base the answer to the question
∃y : person(y)
We want to bind y to a particular value for which person is true. Our knowledge base justifies any of the following answers:
y = Marcus
y = Caesar
y = Cleopatra
Because there is more than one value that satisfies the predicate, but only one value is needed, the answer to the question will depend on the order in which the assertions are examined during the search for a response. If we view the assertions as declarative, then they do not themselves say anything about how they will be examined. If we view them as procedural, then they do. Of course, nondeterministic programs are possible; consider, for example, the concurrent and parallel programming constructs described in Dijkstra [1976], Hoare [1985], and Chandy and Misra [1989]. So we could view these assertions as a nondeterministic program whose output is simply not defined. If we do this, then we have a "procedural" representation that actually contains no more information than does the "declarative" form. But most systems that view knowledge as procedural do not do this. The reason is that, at least if the procedure is to execute on any sequential or on most existing parallel machines, some decision must be made about the order in which the assertions will be examined. There is no hardware support for randomness. So if the interpreter must have a way of deciding, there is no real reason not to specify it as part of the definition of the language and thus to define the meaning of any particular program in the language. For example, we might specify that assertions will be examined in the order in which they appear in the program and that search will proceed depth-first, by which we mean that if a new subgoal is established then it will be pursued immediately and other paths will be examined only if the new one fails. If we do that, then the assertions we gave above describe a program that will answer our question with
y = Cleopatra
To see clearly the difference between declarative and procedural representations, consider the following assertions:
man(Marcus)
man(Caesar)
∀x : man(x) → person(x)
person(Cleopatra)
Viewed declaratively, this is the same knowledge base that we had before. All the same answers are supported by the system and no one of them is explicitly selected. But viewed procedurally, and using the control model we used to get Cleopatra as our answer before, this is a different knowledge base since now the answer to our question is Marcus. This happens because the first statement that can achieve the person goal is the inference rule
∀x : man(x) → person(x). This rule sets up a subgoal to find a man. Again the statements are examined from the beginning, and now Marcus is found to satisfy the subgoal and thus also the goal. So Marcus is reported as the answer.
It is important to keep in mind that although we have said that a procedural representation encodes control information in the knowledge base, it does so only to the extent that the interpreter for the knowledge base recognizes that control information. So we could have gotten a different answer to the person question by leaving our original knowledge base intact and changing the interpreter so that it examines statements from last to first (but still pursuing depth-first search). Following this control regime, we report Caesar as our answer.
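To make the role of ordering concrete, here is the same pair of knowledge bases rendered in PROLOG syntax (introduced formally in the next section). The renderings are ours, and the comments show the first answer a standard left-to-right, depth-first interpreter produces:

    % First ordering: the person fact precedes the rule.
    man(marcus).
    man(caesar).
    person(cleopatra).
    person(X) :- man(X).
    % ?- person(Y).   first answer: Y = cleopatra

    % Second ordering (a separate program): the rule precedes the fact.
    man(marcus).
    man(caesar).
    person(X) :- man(X).
    person(cleopatra).
    % ?- person(Y).   first answer: Y = marcus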
There has been a great deal of controversy in AI over whether declarative or procedural knowledge representation frameworks are better. There is no clearcut answer to the question. As you can see from this discussion, the distinction between the two forms is often very fuzzy. Rather than try to answer the question of which approach is better, what we do in the rest of this chapter is to describe ways in which rule formalisms and interpreters can be combined to solve problems. We begin with a mechanism called logic programming, and then we consider more flexible structures for rule-based systems.
Logic programming is a programming language paradigm in which logical assertions are viewed as programs, as described in the previous section. There are several logic programming systems in use today, the most popular of which is PROLOG [Clocksin and Mellish, 1984; Bratko, 1986]. A PROLOG program is described as a series of logical assertions, each of which is a Horn clause. [Footnote: Programs written in pure PROLOG are composed only of Horn clauses. PROLOG, as an actual programming language, however, allows departures from Horn clauses. In the rest of this section, we limit our discussion to pure PROLOG.] A Horn clause is a clause (as defined in Section 5.4.1) that has at most one positive literal. Thus p, ¬p ∨ q, and p → q are all Horn clauses. The last of these does not look like a clause, and it appears to have two positive literals. But recall from Section 5.4.1 that any logical expression can be converted to clause form. If we do that for this example, the resulting clause is ¬p ∨ q,
which is a well-formed Horn clause. As we will see below, when Horn clauses are written in PROLOG programs, they actually look more like the form we started with (an implication with at most one literal on the right of the implication sign) than the clause form we just produced. Some examples of PROLOG Horn clauses appear below.
The fact that PROLOG programs are composed only of Horn clauses and not of arbitrary logical expressions has two important consequences. The first is that because of the uniform representation, a simple and efficient interpreter can be written. The second consequence is even more important: unlike full first-order predicate logic, the logic of Horn clause systems (at least in the absence of function symbols) is decidable.
The control structure that is imposed on a PROLOG program by the PROLOG interpreter is the same one we used at the beginning of this chapter to find the answers Cleopatra and Marcus. The input to a program is a goal to be proved. Backward reasoning is applied to try to prove the goal given the assertions in the program. The program is read top to bottom, left to right, and search is performed depth-first with backtracking.
Figure 6.1 shows an example of a simple knowledge base represented in standard logical notation and then in PROLOG. Both of these representations contain two types of statements: facts, which contain only constants (i.e., no variables), and rules, which do contain variables. Facts represent statements about specific objects. Rules represent statements about classes of objects.
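Figure 6.1 itself is not reproduced here, but the apartmentpet example traced in the next section implies a knowledge base along the following lines (our reconstruction; the predicate names come from that trace). In logic:

    ∀x : pet(x) ∧ small(x) → apartmentpet(x)
    ∀x : cat(x) → pet(x)
    ∀x : dog(x) → pet(x)
    ∀x : poodle(x) → dog(x) ∧ small(x)
    poodle(fluffy)

In PROLOG:

    apartmentpet(X) :- pet(X), small(X).
    pet(X) :- cat(X).
    pet(X) :- dog(X).
    dog(X) :- poodle(X).
    small(X) :- poodle(X).
    poodle(fluffy).

Notice that the rule with the conjunctive consequent becomes two separate PROLOG clauses. (A modern interpreter such as SWI-Prolog would also need the declaration :- dynamic cat/1. so that the unprovable cat subgoal fails quietly instead of raising an existence error.)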
Notice that there are several superficial, syntactic differences between the logic and the PROLOG representations, including:

1. In logic, variables are explicitly quantified. In PROLOG, quantification is implicit: variables (written with an initial uppercase letter) are understood to be quantified over the clause in which they appear.
2. In logic, there are explicit symbols for and (∧) and or (∨). In PROLOG, there is an explicit symbol for and (the comma), but there is none for or; alternatives must instead be written as separate clauses.
3. In logic, implications of the form "p implies q" are written as p → q. In PROLOG, the same implication is written "backward," as q :- p.
The first two of these differences arise naturally from the fact that PROLOG programs are actually sets of Horn clauses that have been transformed as follows:

1. If the Horn clause contains no negative literals (i.e., it consists of a single positive literal), then leave it as it is.
2. Otherwise, rewrite the clause as an implication, combining all of the negative literals into the antecedent of the implication and leaving the single positive literal (if there is one) as the consequent.
This procedure causes a clause, which originally consisted of a disjunction of literals (all but one of which were negative), to be transformed into a single implication whose antecedent is a conjunction of (what are now positive) literals. Further, recall that in a clause, all variables are implicitly universally quantified. But, when we apply this transformation (which essentially inverts several steps of the procedure we gave in Section 5.4.1 for converting to clause form), any variables that occurred in negative literals and so now occur in the antecedent become existentially quantified, while the variables in the consequent (the head) are still universally quantified. For example, the PROLOG clause
P(x) :- Q(x, y)
is equivalent to the logical expression
∀x : ∃y : Q(x, y) → P(x)
A key difference between logic and the PROLOG representation is that the PROLOG interpreter has a fixed control strategy, and so the assertions in the PROLOG program define a particular search path to an answer to any question. In contrast, the logical assertions define only the set of answers that they justify; they themselves say nothing about how to choose among those answers if there is more than one.
The basic PROLOG control strategy outlined above is simple. Begin with a problem statement, which is viewed as a goal to be proved. Look for assertions that can prove the goal. Consider facts, which prove the goal directly, and also consider any rule whose head matches the goal. To decide whether a fact or a rule can be applied to the current problem, invoke a standard unification procedure (recall Section 5.4.4). Reason backward from that goal until a path is found that terminates with assertions in the program. Consider paths using a depth-first search strategy and using backtracking. At each choice point, consider options in the order in which they appear in the program. If a goal has more than one conjunctive part, prove the parts in the order in which they appear, propagating variable bindings as they are determined during unification. We can illustrate this strategy with a simple example.
Suppose the problem we are given is to find a value of X that satisfies the predicate apartmentpet(X). We state this goal to PROLOG as
?- apartmentpet(X).
Think of this as the input to the program. The PROLOG interpreter begins looking for a fact with the predicate apartmentpet or a rule with that predicate as its head. Usually PROLOG programs are written with the facts containing a given predicate coming before the rules for that predicate, so that the facts can be used immediately if they are appropriate and the rules will be used only when the desired fact is not immediately available. In this example, there are no facts with this predicate, though, so the one rule there is must be used. Since the rule will succeed if both of the clauses on its right-hand side can be satisfied, the next thing the interpreter does is to try to prove each of them. They will be tried in the order in which they appear. There are no facts with the predicate pet, but again there are rules whose heads match it. This time, though, there are two such rules, rather than one. All that is necessary for a proof is that one of them succeed. They will be tried in the order in which they occur. The first will fail because there are no assertions about the predicate cat in the program. The second will eventually lead to success, using the rule about dogs and poodles and using the fact poodle(fluffy). This results in the variable X being bound to fluffy. Now the second clause small(X) of the initial rule must be checked. Since X is now bound to fluffy, the more specific goal, small(fluffy), must be proved. This too can be done by reasoning backward to the assertion poodle(fluffy). The program then halts with the result apartmentpet(fluffy).
Logical negation (¬) cannot be represented explicitly in pure PROLOG. So, for example, it is not possible to encode directly the logical assertion
∀x : dog(x) → ¬cat(x)
Instead, negation is represented implicitly by the lack of an assertion. This leads to the problem-solving strategy called negation as failure [Clark, 1978]. If the PROLOG program of Figure 6.1 were given the goal
?- cat(fluffy).
it would return FALSE because it is unable to prove that Fluffy is a cat. Unfortunately, this program returns the same answer when given the goal
?- cat(mittens).
even though the program knows nothing about Mittens and specifically knows nothing that might prevent Mittens from being a cat. Negation as failure requires that we make what is called the closed world assumption, which states that all relevant, true assertions are contained in our knowledge base or are derivable from assertions that are so contained. Any assertion that is not present can therefore be assumed to be false. This assumption, while often justified, can cause serious problems when knowledge bases are incomplete. We discuss this issue further in Chapter 7.
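Full PROLOG, as opposed to the pure subset discussed in this section, does expose negation as failure directly through the \+ operator. A minimal sketch against the reconstructed program above:

    :- dynamic cat/1.
    dog(X) :- poodle(X).
    poodle(fluffy).

    % ?- \+ cat(fluffy).    succeeds: cat(fluffy) cannot be proved
    % ?- \+ cat(mittens).   also succeeds, even though nothing at
    %                       all is known about mittens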
There is much to say on the topic of PROLOG-style versus LISP-style programming. A great advantage of logic programming is that the programmer need only specify rules and facts, since a search engine is built directly into the language. The disadvantage is that the search control is fixed. Although it is possible to write PROLOG code that uses search strategies other than depth-first with backtracking, it is difficult to do so. It is even more difficult to apply domain knowledge to constrain a search. PROLOG does allow for rudimentary control of search through a non-logical operator called cut. A cut can be inserted into a rule to specify a point that may not be backtracked over.
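A standard illustration of cut (our example, not part of the program above) is max/3:

    % Once X >= Y has succeeded, the cut commits to this clause:
    % backtracking can never reach the second clause, so
    % max(5, 3, M) yields only M = 5.
    max(X, Y, X) :- X >= Y, !.
    max(_, Y, Y).

Without the cut, max(5, 3, M) would also produce the wrong answer M = 3 on backtracking.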
More generally, the fact that PROLOG programs must be composed of a restricted set of logical operators can be viewed as a limitation of the expressiveness of the language. But the other side of the coin is that it is possible to build PROLOG compilers that produce very efficient code.
In the rest of this chapter, we retain the rule-based nature of PROLOG, but we relax a number of PROLOG's design constraints, leading to more flexible rule-based architectures.
The object of a search procedure is to discover a path through a problem space from an initial configuration to a goal state. While PROLOG searches only backward from a goal state, there are actually two directions in which such a search could proceed:

- Forward, from the start states
- Backward, from the goal states
The production system model of the search process provides an easy way of viewing forward and backward reasoning as symmetric processes. Consider the problem of solving a particular instance of the 8-puzzle. The rules to be used for solving the puzzle can be written as shown in Figure 6.2. Using those rules we could attempt to solve the puzzle shown back in Figure 2.12 in one of two ways:

- Reason forward from the initial state: build a tree of move sequences rooted at the initial configuration, growing it until a configuration matching the goal state is generated.
- Reason backward from the goal state: build a tree rooted at the goal configuration, growing it until a configuration matching the initial state is generated.
Notice that the same rules can be used both to reason forward from the initial state and to reason backward from the goal state. To reason forward, the left sides (the preconditions) are matched against the current state and the right sides (the results) are used to generate new nodes until the goal is reached. To reason backward, the right sides are matched against the current node and the left sides are used to generate new nodes representing new goal states to be achieved. This continues until one of these goal states is matched by an initial state.
In the case of the 8-puzzle, it does not make much difference whether we reason forward or backward: about the same number of paths will be explored in either case. But this is not always true. Depending on the topology of the problem space, it may be significantly more efficient to search in one direction rather than the other.
Four factors influence the question of whether it is better to reason forward or backward:

1. Are there more possible start states or goal states? We would like to move from the smaller set of states to the larger (and thus easier to find) set of states.
2. In which direction is the branching factor (i.e., the average number of nodes that can be reached directly from a single node) greater? We would like to proceed in the direction with the lower branching factor.
3. Will the program be asked to justify its reasoning process to a user? If so, it is important to proceed in the direction that corresponds more closely with the way the user will think.
4. What kind of event is going to trigger a problem-solving episode? If it is the arrival of a new fact, forward reasoning makes sense. If it is a query to which a response is desired, backward reasoning is more natural.
A few examples make these issues clearer. It seems easier to drive from an unfamiliar place home than from home to an unfamiliar place. Why is this? The branching factor is roughly the same in both directions (unless one-way streets are laid out very strangely). But for the purpose of finding our way around, there are many more locations that count as being home than there are locations that count as the unfamiliar target place. Any place from which we know how to get home can be considered as equivalent to home. If we can get to any such place, we can get home easily. But in order to find a route from where we are to an unfamiliar place, we pretty much have to be already at the unfamiliar place. So in going toward the unfamiliar place, we are aiming at a much smaller target than in going home. This suggests that if our starting position is home and our goal position is the unfamiliar place, we should plan our route by reasoning backward from the unfamiliar place.
On the other hand, consider the problem of symbolic integration. The problem space is the set of formulas, some of which contain integral expressions. The start state is a particular formula containing some integral expression. The desired goal state is a formula that is equivalent to the initial one and that does not contain any integral expressions. So we begin with a single easily identified start state and a huge number of possible goal states. Thus to solve this problem, it is better to reason forward using the rules for integration to try to generate an integral-free expression than to start with arbitrary integral-free expressions, use the rules for differentiation, and try to generate the particular integral we are trying to solve. Again we want to head toward the largest target; this time that means chaining forward.
These two examples have illustrated the importance of the relative number of start states to goal states in determining the optimal direction in which to search when the branching factor is approximately the same in both directions. When the branching factor is not the same, however, it must also be taken into account.
Consider again the problem of proving theorems in some particular domain of mathematics. Our goal state is the particular theorem to be proved. Our initial states are normally a small set of axioms. Neither of these sets is significantly bigger than the other. But consider the branching factor in each of the two directions. From a small set of axioms we can derive a very large number of theorems. On the other hand, this large number of theorems must go back to the small set of axioms. So the branching factor is significantly greater going forward from the axioms to the theorems than it is going backward from theorems to axioms. This suggests that it would be much better to reason backward when trying to prove theorems. Mathematicians have long realized this [Polya, 1957], as have the designers of theorem-proving programs.
The third factor that determines the direction in which search should proceed is the need to generate coherent justifications of the reasoning process as it proceeds. This is often crucial for the acceptance of programs for the performance of very important tasks. For example, doctors are unwilling to accept the advice of a diagnostic program that cannot explain its reasoning to the doctors' satisfaction. This issue was of concern to the designers of MYCIN [Shortliffe, 1976], a program that diagnoses infectious diseases. It reasons backward from its goal of determining the cause of a patient's illness. To do that, it uses rules that tell it such things as "If the organism has the following set of characteristics as determined by the lab results, then it is likely that it is organism x." By reasoning backward using such rules, the program can answer questions like "Why should I perform that test you just asked for?" with such answers as "Because it would help to determine whether organism x is present." (For a discussion of the explanation capabilities of MYCIN, see Chapter 20.)
Most of the search techniques described in Chapter 3 can be used to search either forward or backward. By describing the search process as the application of a set of production rules, it is easy to describe the specific search algorithms without reference to the direction of the search. [Footnote: One exception to this is the means-ends analysis technique, described in Section 3.6, which proceeds not by making successive steps in a single direction but by reducing differences between the current and the goal states, and, as a result, sometimes reasoning backward and sometimes forward.]
We can also search both forward from the start state and backward from the goal simultaneously until two paths meet somewhere in between. This strategy is called bidirectional search. It seems appealing if the number of nodes at each step grows exponentially with the number of steps that have been taken. Empirical results [Pohl, 1971] suggest that for blind search, this divide-and-conquer strategy is indeed effective. Unfortunately, other results [Pohl, 1971; de Champeaux and Sint, 1977] suggest that for informed, heuristic search it is much less likely to be so. Figure 6.3 shows why bidirectional search may be ineffective. The two searches may pass each other, resulting in more work than it would have taken for one of them, on its own, to have finished. However, if individual forward and backward steps are performed as specified by a program that has been carefully constructed to exploit each in exactly those situations where it can be the most profitable, the results can be more encouraging. In fact, many successful AI applications have been written using a combination of forward and backward reasoning, and most AI programming environments provide explicit support for such hybrid reasoning.
Although in principle the same set of rules can be used for both forward and backward reasoning, in practice it has proved useful to define two classes of rules, each of which encodes a particular kind of knowledge:

- Forward rules, which encode knowledge about how to respond to certain input configurations
- Backward rules, which encode knowledge about how to achieve particular goals
By separating rules into these two classes, we essentially add to each rule an additional piece of information, namely how it should be used in problem solving. In the next three sections, we describe in more detail the two kinds of rule systems and how they can be combined.
Backward-chaining rule systems, of which PROLOG is an example, are good for goal-directed problem solving. For example, a query system would probably use backward chaining to reason about and answer user questions.
In PROLOG, rules are restricted to Horn clauses. This allows for rapid indexing because all of the rules for deducing a given fact share the same rule head. Rules are matched with the unification procedure. Unification tries to find a set of bindings for variables to equate a (sub)goal with the head of some rule. Rules in a PROLOG program are matched in the order in which they appear.
Other backward-chaining systems allow for more complex rules. In MYCIN, for example, rules can be augmented with probabilistic certainty factors to reflect the fact that some rules are more reliable than others. We discuss this in more detail in Chapter 8.
Instead of being directed by goals, we sometimes want to be directed by incoming data. For example, suppose you sense searing heat near your hand. You are likely to jerk your hand away. While this could be construed as goal-directed behavior, it is modeled more naturally by the recognize-act cycle characteristic of forward-chaining rule systems. In forward-chaining systems, left sides of rules are matched against the state description. Rules that match dump their right-hand side assertions into the state, and the process repeats.
Matching is typically more complex for forward-chaining systems than for backward ones. For example, consider a rule that checks for some condition in the state description and then adds an assertion. After the rule fires, its conditions are probably still valid, so it could fire again immediately. However, we will need some mechanism to prevent repeated firings, especially if the state remains unchanged.
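To make the recognize-act cycle concrete, here is a toy forward chainer, written in PROLOG for continuity with the earlier examples. The rule/2 and fact/1 representations are our own invention; real forward-chaining systems use dedicated rule languages and far more efficient matchers:

    :- dynamic fact/1.

    % A hypothetical rule base: rule(Preconditions, Conclusion).
    rule([poodle(X)], dog(X)).
    rule([dog(X)],    pet(X)).

    fact(poodle(fluffy)).

    all_hold([]).
    all_hold([P|Ps]) :- fact(P), all_hold(Ps).

    % One recognize-act cycle: fire any rule whose preconditions all
    % match facts in the state and whose conclusion is new.  The
    % \+ fact(C) test is one crude way to prevent the repeated
    % firings discussed above.
    step :- rule(Pre, C), all_hold(Pre), \+ fact(C), assertz(fact(C)).

    % Repeat until quiescence, i.e., until no rule can fire.
    run :- step, !, run.
    run.

After ?- run., the state contains dog(fluffy) and pet(fluffy) in addition to the original fact.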
While simple matching and control strategies are possible, most forward-chaining systems (e.g., OPS5 [Brownston et al., 1985]) implement highly efficient matchers and supply several mechanisms for preferring one rule over another. We discuss matching in more detail in the next section.
Sometimes certain aspects of a problem are best handled via forward chaining and other aspects by backward chaining. Consider a forward-chaining medical diagnosis program. It might accept twenty or so facts about a patient's condition, then forward chain on those facts to try to deduce the nature and/or cause of the disease. Now suppose that at some point, the left side of a rule was nearly satisfied, say, with nine out of ten of its preconditions met. It might be efficient to apply backward reasoning to satisfy the tenth precondition in a directed manner, rather than wait for forward chaining to supply the fact by accident. Or perhaps the tenth condition requires further medical tests. In that case, backward chaining can be used to query the user.
Whether it is possible to use the same rules for both forward and backward reasoning also depends on the form of the rules themselves. If both left sides and right sides contain pure assertions, then forward chaining can match assertions on the left side of a rule and add to the state description the assertions on the right side. But if arbitrary procedures are allowed as the right sides of rules, then the rules will not be reversible. Some production languages allow only reversible rules; others do not. When irreversible rules are used, then a commitment to the direction of the search must be made at the time the rules are written. But, as we suggested above, this is often a useful thing to do anyway because it allows the rule writer to add control knowledge to the rules themselves.
So far, we have described the process of using search to solve problems as the application of appropriate rules to individual problem states to generate new states, to which the rules can then be applied, and so forth, until a solution is found. We have suggested that clever search involves choosing from among the rules that can be applied at a particular point the ones that are most likely to lead to a solution. But we have said little about how we extract from the entire collection of rules those that can be applied at a given point. To do so requires some kind of matching between the current state and the preconditions of the rules. How should this be done? The answer to this question can be critical to the success of a rule-based system. We discuss a few proposals below.
One way to select applicable rules is to do a simple search through all the rules, comparing each one's preconditions to the current state and extracting all the ones that match. But there are two problems with this simple solution:

1. In order to solve very interesting problems, it will be necessary to use a large number of rules. Scanning through all of them at every step of the search would be hopelessly inefficient.
2. It is not always immediately obvious whether a rule's preconditions are satisfied by a particular state.
Sometimes there are easy ways to deal with the first of these problems. Instead of searching through the rules, use the current state as an index into the rules and select the matching ones immediately. For example, consider the legal-move generation rule for chess shown in Figure 6.4. To be able to access the appropriate rules immediately, all we need do is assign an index to each board position. This can be done simply by treating the board description as a large number. Any reasonable hashing function can then be used to treat that number as an index into the rules. All the rules that describe a given board position will be stored under the same key and so will be found together. Unfortunately, this simple indexing scheme works only because the preconditions of the rules match exact board configurations. Thus the matching process is easy, but at the price of a complete lack of generality in the statement of the rules. As discussed in Section 2.1, it is often better to write rules in a more general form, such as that shown in Figure 6.5. When this is done, such simple indexing is not possible. In fact, there is often a trade-off between the ease of writing rules (which is increased by the use of high-level descriptions) and the simplicity of the matching process (which is decreased by such descriptions).
All of this does not mean that indexing cannot be helpful even when the preconditions of rules are stated as fairly high-level predicates. In PROLOG and many theorem-proving systems, for example, rules are indexed by the predicates they contain, so all the rules that could be applicable to proving a particular fact can be accessed fairly quickly.
In the chess example, rules can be indexed by pieces and their positions. Despite some limitations of this approach, indexing in some form is very important in the efficient operation of rule-based systems.
The problem of selecting applicable rules is made more difficult when preconditions are not stated as exact descriptions of particular situations but rather describe properties (of varying complexity) that the situations must have. It often turns out that discovering whether there is a match between a particular situation and the preconditions of a given rule must itself involve a significant search process.
If we want to match a single condition against a single element in a state description, then the unification procedure of Section 5.4.4 will suffice. However, in many rule-based systems, we need to compute the whole set of rules that match the current state description. Backward-chaining systems usually use depth-first backtracking to select individual rules, but forward-chaining systems generally employ sophisticated conflict resolution strategies to choose among the applicable rules. [Footnote: Conflict resolution is discussed in the next section.] While it is possible to apply unification repeatedly over the cross product of preconditions and state description elements, it is more efficient to consider the many-many match problem, in which many rules are matched against many elements in the state description simultaneously.
One efficient many-many match algorithm is RETE, which gains efficiency from three major sources:

- The temporal nature of data. Rules usually do not alter the state description radically; most of the state stays the same from cycle to cycle, so RETE records the partial matches it has already found and updates them incrementally rather than rematching everything.
- Structural similarity in rules. Different rules often share preconditions, and RETE compiles the rules into a network in which each shared condition is matched only once.
- Persistence of variable binding consistency. Even when all the individual preconditions of a rule are matched by elements of the state, the match may fail because the variables cannot be bound consistently. For example, the individual preconditions of the rule
son(x, y) ∧ son(y, z) → grandparent(x, z)
can be matched, but not in a manner that satisfies the constraint imposed by the variable y. Fortunately, it is not necessary to compute binding consistency from scratch every time a new condition is satisfied. RETE remembers its previous calculations and is able to merge new binding information efficiently.
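In PROLOG terms (a backward-chaining rendering, but the join a forward matcher must compute is the same), with invented facts:

    son(mike, john).        % john is a son of mike
    son(mary, richard).     % richard is a son of mary

    grandparent(X, Z) :- son(X, Y), son(Y, Z).

    % ?- grandparent(mike, Z).
    % fails: each son/2 condition is matched by some fact, but no
    % single binding of Y satisfies both conditions at once.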
For more details about the RETE match algorithm, see Forgy [1982]. Other matching algorithms (e.g., Miranker [1987] and Oflazer [1987]) take different stands on how much time to spend on saving state information between cycles. They can be more or less efficient than RETE, depending on the types of rules written for the domain and on the degree of hardware parallelism available.
A more complex matching process is required when the preconditions of a rule specify required properties that are not stated explicitly in the description of the current state. In this case, a separate set of rules must be used to describe how some properties can be inferred from others.
An even more complex matching process is required if rules should be applied when their preconditions approximately match the current situation. This is often the case in situations involving physical descriptions of the world. For example, a speech-understanding program must contain rules that map from a description of a physical waveform to phones (instances of English phonemes, such as p or d). There is so much variability in the physical signal, as a result of background noise, differences in the way individuals speak, and so forth, that one can hope to find only an approximate match between the rule that describes an ideal sound and the input that describes an unideal world. Approximate matching is particularly difficult to deal with because as we increase the tolerance allowed in the match, we also increase the number of rules that will match, thus increasing the size of the main search process. But approximate matching is nevertheless superior to exact matching in situations such as speech understanding, where exact matching may often result in no rules being matched and the search process coming to a grinding halt. Although symbolic techniques for approximate matching exist, there is another, very different approach that can be used to solve this problem. We discuss it in detail in Chapter 18, where we describe connectionist systems (also called neural nets).
For some problems, almost all the action is in the matching of the rules to the problem state. Once that is done, so few rules apply that the remaining search is trivial. This was the case, for example, in ELIZA [Weizenbaum, 1966], an early AI program that simulated the behavior of a Rogerian therapist. A fragment of a dialogue between ELIZA and a user is shown in Figure 6.6. ELIZA's knowledge about both English and psychology was coded in a set of simple rules. Figure 6.7 shows some ELIZA-like rules.
ELIZA operated by matching the left sides of the rules against the user's last sentence and using the appropriate right side to generate a response. For example, if the user typed "My brother is mean to me," ELIZA might respond, "Who else in your family is mean to you?" or "Tell me more about your family." The rules were indexed by keywords, so only a few actually had to be matched against a particular sentence. Some of the rules had no left side, so the rule could apply anywhere. These rules were used if no other rules matched, and they generated replies such as "Tell me more about that." Notice that the rules themselves cause a form of approximate matching to occur. The patterns ask about specific words in the user's sentence. They do not need to match entire sentences. Thus a great variety of sentences can be matched by a single rule, and the grammatical complexity of English is pretty much ignored. This accounts both for ELIZA's major strength, its ability to say something fairly reasonable almost all of the time, and its major weakness, the superficiality of its understanding and its ability to be led completely astray. Approximate matching can easily lead to both these results.
As if the matching process were not already complicated enough, recall the frame problem mentioned in Chapter 4. One way of dealing with the frame problem is to avoid storing entire state descriptions at each node but instead to store only the changes from the previous node. If this is done, the matching process will have to be modified to scan backward from a node through its predecessors, looking for the required objects.
The result of the matching process is a list of rules whose antecedents have matched the current state description along with whatever variable bindings were generated by the matching process. It is the job of the search method to decide on the order in which rules will be applied. But sometimes it is useful to incorporate some of that decision making into the matching process. This phase of the matching process is then called conflict resolution.
There are three basic approaches to the problem of conflict resolution in a production system:

- Assign a preference based on the rule that matched
- Assign a preference based on the objects that matched
- Assign a preference based on the action that the matched rule would perform
There are two common ways of assigning a preference based on the rules themselves. The first, and simplest, is to consider the rules to have been specified in a particular order, such as the physical order in which they are presented to the system. Then priority is given to the rules in the order in which they appear. This is the scheme used in PROLOG.
The other common rule-directed preference scheme is to give priority to special-case rules over rules that are more general. We ran across this in Chapter 2, in the case of the water jug problem of Figure 2.3. Recall that rules 11 and 12 were special cases of rules 9 and 5, respectively. The purpose of such specific rules is to allow for the kind of knowledge that expert problem solvers use when they solve problems directly, without search. If we consider all rules that match, then the addition of such special-purpose rules will increase the size of the search rather than decrease it. In order to prevent that, we build the matcher so that it rejects rules that are more general than other rules that also match. How can the matcher decide that one rule is more general than another? There are a few easy ways:

- If the preconditions of one rule are a subset of those of another, the second rule is more specific: it demands everything the first does and more, and so matches in fewer situations.
- If the preconditions of two rules are the same except that where one refers to a variable the other refers to a constant, the rule with the constant is more specific.
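A minimal sketch of the first test, assuming rule preconditions are represented as ground lists:

    % contains_all(+Big, +Small): every element of Small occurs in Big.
    contains_all(_, []).
    contains_all(Big, [P|Ps]) :- member(P, Big), contains_all(Big, Ps).

    % Pre1 is more specific than Pre2 if it demands strictly more.
    more_specific(Pre1, Pre2) :-
        contains_all(Pre1, Pre2),
        \+ contains_all(Pre2, Pre1).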
Another way in which the matching process can ease the burden on the search mechanism is to order the matches it finds based on the importance of the objects that are matched. There are a variety of ways this can happen. Consider again ELIZA, which matched patterns against a user's sentence in order to find a rule to generate a reply. The patterns looked for specific combinations of important keywords. Often an input sentence contained several of the keywords that ELIZA knew. If that happened, then ELIZA made use of the fact that some keywords had been marked as being more significant than others. The pattern matcher returned the match involving the highest-priority keyword. For example, ELIZA knew the word "I" as a keyword. Matching the input sentence "I know everybody laughed at me" by the keyword "I" would have enabled it to respond, "You say you know everybody laughed at you." But ELIZA also knew the word "everybody" as a keyword. Because "everybody" occurs more rarely than "I," ELIZA knows it to be more semantically significant and thus to be the clue to which it should respond. So it will produce a response such as "Who in particular are you thinking of?" Notice that priority matching such as this is particularly important if only one of the choices will ever be tried. This was true for ELIZA and would also be true, say, for a person who, when leaving a fast-burning room, must choose between turning off the lights (normally a good thing to do) and grabbing the baby (a more important thing to do).
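A sketch of such priority-based keyword selection, with invented priorities and using SWI-Prolog's sort/4:

    % Hypothetical keyword priorities: rarer words score higher.
    priority(i, 1).
    priority(everybody, 5).

    % Choose the known keyword of highest priority in the sentence.
    best_keyword(Words, Best) :-
        findall(P-K, (member(K, Words), priority(K, P)), Scored),
        sort(0, @>=, Scored, [_-Best|_]).

    % ?- best_keyword([i, know, everybody, laughed, at, me], K).
    % K = everybody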
Another form of priority matching can occur as a function of the position of the matchable objects in the current state description. For example, suppose we want to model the behavior of human short-term memory (STM). Rules can be matched against the current contents of STM and then used to generate actions, such as producing output to the environment or storing something in long-term memory. In this situation, we might like to have the matcher first try to match against the objects that have most recently entered STM and only compare against older elements if the newer elements do not trigger a match. For a discussion of this method as a conflict resolution strategy in a production system, see Newell [1973].
Suppose that there are several rules waiting to fire. One way of selecting among them is to fire all of them temporarily and to examine the results of each. Then, using a heuristic function to evaluate each of the resulting states, compare the merits of the results and select the preferred one. Throw away (or maybe keep for later if necessary) the remaining ones.
This approach should look familiar: it is identical to the best-first search procedure we saw in Chapter 3. Although conceptually this approach can be thought of as a conflict resolution strategy, it is usually implemented as a search control technique that operates on top of the states generated by rule applications. The drawback to this design is that LISP-coded search control knowledge is procedural and therefore difficult to modify. Many AI search programs, especially ones that learn from their experience, represent their control strategies declaratively. The next section describes some methods for capturing knowledge about control using rules.
Solving hard problems efficiently requires that promising paths through the search space be pursued and that fruitless ones not be pursued. Knowledge about which paths are most likely to lead quickly to a goal state is often called search control knowledge. It can take many forms:

1. Knowledge about which states are more preferable to others
2. Knowledge about which rule to apply in a given situation
3. Knowledge about the order in which to pursue subgoals
4. Knowledge about useful sequences of rules to apply
In Chapter 3, we saw how the first type of knowledge could be represented with heuristic evaluation functions. There are many ways of representing the other types of control knowledge. For example, rules can be labeled and partitioned. A medical diagnosis system might have one set of rules for reasoning about bacteriological diseases and another set for immunological diseases. If the system is trying to prove a particular fact by backward chaining, it can probably eliminate one of the two rule sets, depending on what the fact is. Another method [Etzioni, 1989] is to assign cost and probability-of-success measures to rules. The problem solver can then use probabilistic decision analysis to choose a cost-effective alternative at each point in the search.
By now it should be clear that we are discussing how to represent knowledge about knowledge. For this reason, search control knowledge is sometimes called meta-knowledge. Davis [1980] first pointed out the need for meta-knowledge and suggested that it be represented declaratively using rules. The syntax for one type of control rule is shown in Figure 6.8.
A number of AI systems represent their control knowledge with rules. We look briefly at two such systems, SOAR and PRODIGY.
SOAR [Laird et al., 1987] is a general architecture for building intelligent systems. SOAR is based on a set of specific, cognitively motivated hypotheses about the structure of human problem solving. These hypotheses are derived from what we know about short-term memory, practice effects, etc. In SOAR:

1. Long-term memory is stored as a set of productions (or rules).
2. Short-term memory (also called working memory) is a buffer that is affected by perceptions and serves as a storage area for facts deduced by rules in long-term memory. Working memory is analogous to the state descriptions used in problem solving.
3. All problem-solving activity takes place as search in problem spaces.
4. All intermediate and final results of problem solving are remembered (or chunked) for future reference.
The third feature is of most interest to us here. When SOAR is given a start state and a goal state, it sets up an initial problem space. In order to take the first step in that space, it must choose a rule from the set of applicable ones. Instead of employing a fixed conflict resolution strategy, SOAR considers that choice of rules to be a substantial problem in its own right, and it actually sets up another, auxiliary problem space. The rules that apply in this space look something like the rule shown in Figure 6.8. Operator preference rules may be very general, such as the ones described in the previous section on conflict resolution, or they may contain domain-specific knowledge.
SOAR also has rules for expressing a preference for applying a whole sequence of rules in a given situation. In learning mode, SOAR can take useful sequences and build from them more complex productions that it can apply in the future.
We can also write rules based on preferences for some states over others. Such rules can be used to implement the basic search strategies we studied in Chapters 2 and 3. For example, if we always prefer to work from the state we generated last, we will get depth-first behavior. On the other hand, if we prefer states that were generated earlier in time, we will get breadth-first behavior. If we prefer any state that looks better than the current state (according to some heuristic function), we will get hill climbing. Best-first search results when state preference rules prefer the state with the highest heuristic score. Thus we see that all of the weak methods are subsumed by an architecture that reasons with explicit search control knowledge. Different methods may be employed for different problems, and specific domain knowledge can override the more general strategies.
PRODIGY [Minton et al., 1989] is a general-purpose problem-solving system that incorporates several different learning mechanisms. A good deal of the learning in PRODIGY is directed at automatically constructing a set of control rules to improve search in a particular domain. We return to PRODIGY's learning methods in Chapter 17, but we mention here a few facts that bear on the issue of search control rules. PRODIGY can acquire control rules in a number of ways:

- Through hand coding by programmers
- Through a static analysis of the domain's operators
- Through looking at traces of its own problem-solving behavior
PRODIGY learns control rules from its experience, but unlike SOAR it also learns from its failures. If PRODIGY pursues an unfruitful path, it will try to come up with an explanation of why that path failed. It will then use that explanation to build control knowledge that will help it avoid fruitless search paths in the future.
One reason why a path may lead to difficulties is that subgoals can interact with one another. In the process of solving one subgoal, we may undo our solution of a previous subgoal. Search control knowledge can tell us something about the order in which we should pursue our subgoals. Suppose we are faced with the problem of building a piece of wooden furniture. The problem specifies that the wood must be sanded, sealed, and painted. Which of the three goals do we pursue first? To humans who have knowledge about this sort of thing, the answer is clear. An AI program, however, might decide to try painting first, since any physical object can be painted, regardless of whether it has been sanded. However, as the program plans further, it will realize that one of the effects of the sanding process is to remove the paint. The program will then be forced to plan a repainting step or else backtrack and try working on another subgoal first. Proper search control knowledge can prevent this wasted computational effort. Rules we might consider include:

- If the current goal set contains both a sand goal and a paint goal, attempt the sand goal first.
- If the current goal set contains both a seal goal and a paint goal, attempt the seal goal first.
Before closing this section, we should touch on a couple of seemingly paradoxical issues concerning control rules. The first issue is called the utility problem [Minton, 1988]. As we add more and more control knowledge to a system, the system is able to search more judiciously. This cuts down on the number of nodes it expands. However, in deliberating about which step to take next in the search space, the system must consider all the control rules. If there are many control rules, simply matching them all can be very time-consuming. It is easy to reach a situation (especially in systems that generate control knowledge automatically) in which the system's problem-solving efficiency, as measured in CPU cycles, is worse with the control rules than without them. Different systems handle this problem in different ways, as demonstrated in Section 17.4.4.
The second issue concerns the complexity of the production system interpreter. As this chapter has progressed, we have seen a trend toward explicitly representing more and more knowledge about how search should proceed. We have found it useful to create meta-rules that talk about when to apply other rules. Now, a production system interpreter must know how to apply various rules and meta-rules, so we should expect that our interpreters will have to become more complex as we progress away from simple backward-chaining systems like PROLOG. And yet, moving to a declarative representation for control knowledge means that previously hand-coded LISP functions can be eliminated from the interpreter. In this sense, the interpreter becomes more streamlined.
In this chapter, we have seen how to represent knowledge declaratively in rule-based systems and how to reason with that knowledge. We began with a simple mechanism, logic programming, and progressed to more complex production system models that can reason both forward and backward, apply sophisticated and efficient matching techniques, and represent their search control knowledge in rules.
In later chapters, we expand further on rule-based systems. In Chapter 7, we describe the use of rules that allow default reasoning to occur in the absence of specific counter evidence. In Chapter 8, we introduce the idea of attaching probabilistic measures to rules. And, in Chapter 20, we look at how rule-based systems are being used to solve complex, real-world problems.
The book Pattern-Directed Inference Systems [Waterman and Hayes-Roth, 1978] is a collection of papers describing the wide variety of uses to which production systems have been put in AI. Its introduction provides a good overview of the subject. Brownston et al. [1985] is an introduction to programming in production rules, with an emphasis on the OPS5 programming language.
∀x : ∀y : cat(x) ∧ fish(y) → likes-to-eat(x, y)
∀x : calico(x) → cat(x)
∀x : tuna(x) → fish(x)
tuna(Charlie)
tuna(Herb)
calico(Puss)
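Following the conventions of Section 6.2, these assertions translate into PROLOG as (our rendering):

    likes_to_eat(X, Y) :- cat(X), fish(Y).
    cat(X) :- calico(X).
    fish(X) :- tuna(X).
    tuna(charlie).
    tuna(herb).
    calico(puss).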