Chapter 17 - Learning

17.1 What Is Learning?

One of the most often heard criticisms of AI is that machines cannot be called intelligent until they are able to learn to do new things and to adapt to new situations, rather than simply doing as they are told to do. There can be little question that the ability to adapt to new surroundings and to solve new problems is an important characteristic of intelligent entities. Can we expect to see such abilities in programs? Ada Augusta, one of the earliest philosophers of computing, wrote that

    The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.

This remark has been interpreted by several AI critics as saying that computers cannot learn. In fact, it does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in such a way that its performance gradually improves.

Rather than asking in advance whether it is possible for computers to "learn," it is much more enlightening to try to describe exactly what activities we mean when we say "learning" and what mechanisms could be used to enable us to perform those activities. Simon [1983] has proposed that learning denotes

    changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.

As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play tennis, the better you get. At the other end of the spectrum lies knowledge acquisition. As we have seen, many AI programs draw heavily on knowledge as their source of power. Knowledge is generally acquired through experience, and such acquisition is the focus of this chapter.

Knowledge acquisition itself includes many different activities. Simple storing of computed information, or rote learning, is the most basic learning activity. Many computer programs, e.g., database systems, can be said to "learn" in this sense, although most people would not call such simple storage learning. However, many AI programs are able to improve their performance substantially through rote-learning techniques, and we will look at one example in depth, the checker-playing program of Samuel [1963].

Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving. The advice may need to be first operationalized, a process explored in Section 17.3.

People also learn through their own problem-solving experience. After solving a complex problem, we remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily. In contrast to advice taking, learning from problem-solving experience does not usually involve gathering new knowledge that was previously unavailable to the learning program. That is, the program remembers its experiences and generalizes from them, but does not add to the transitive closure [Footnote: The transitive closure of a program's knowledge is that knowledge plus whatever the program can logically deduce from it.] of its knowledge, in the sense that an advice-taking program would, i.e., by receiving stimuli from the outside world. In large problem spaces, however, efficiency gains are critical. Practically speaking, learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn through problem-solving experience may be able to come up with qualitatively better solutions in the future.

Another form of learning that does involve stimuli from the outside is learning from examples. We often learn to classify things in the world without being given explicit rules. For example, adults can differentiate between cats and dogs, but small children often cannot. Somewhere along the line, we induce a method for telling cats from dogs based on seeing numerous examples of each. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher.

AI researchers have proposed many mechanisms for doing the kinds of learning described above. In this chapter, we discuss several of them. But keep in mind throughout this discussion that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise definition of learning that distinguishes it from other problem-solving tasks. Thus it should come as no surprise that, throughout this chapter, we will make extensive use of both the problem-solving mechanisms and the knowledge representation techniques that were presented in Parts I and II.

17.2 Rote Learning

When a computer stores a piece of data, it is performing a rudimentary form of learning. After all, this act of storage presumably allows the program to perform better in the future (otherwise, why bother?). In the case of data caching, we store computed values so that we do not have to recompute them later. When computation is more expensive than recall, this strategy can save a significant amount of time. Caching has been used

Figure 17.1: Storing Backed-Up Values

in AI programs to produce some surprising performance improvements. Such caching is known as rote learning.
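
The caching idea is easy to make concrete. The sketch below is a minimal illustration, not tied to any particular AI program: each computed value is stored under its input, so that a later request becomes a table lookup rather than a recomputation. The function being cached is a hypothetical stand-in for any expensive computation.

    # Minimal sketch of data caching: store computed values so that they
    # need not be recomputed.  expensive_evaluation stands in for any
    # computation that is slower than a table lookup.
    cache = {}

    def expensive_evaluation(position):
        return sum(ord(ch) for ch in position)   # placeholder computation

    def cached_evaluation(position):
        if position not in cache:                # not seen before: compute and store
            cache[position] = expensive_evaluation(position)
        return cache[position]                   # otherwise, simple recall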

In Chapter 12, we mentioned one of the earliest game-playing programs, Samuel's checkers program [Samuel,1963]. This program learned to play checkers well enough to beat its creator. It exploited two kinds of learning: rote learning, which we look at now, and parameter (or coefficient) adjustment, which is described in Section 17.4.1. Samuel's program used the minimax search procedure to explore checkers game trees. As is the case with all such programs, time constraints permitted it to search only a few levels in the tree. (The exact number varied depending on the situation.) When it could search no deeper, it applied its static evaluation function to the board position and used that score to continue its search of the game tree. When it finished searching the tree and propagating the values backward, it had a score for the position represented by the root of the tree. It could then choose the best move and make it. But it also recorded the board position at the root of the tree and the backed-up score that had just been computed for it. This situation is shown in Figure 17.1(a).

Now suppose that in a later game, the situation shown in Figure 17.1(b) were to arise. Instead of using the static evaluation function to compute a score for position A, the stored value for A can be used. This creates the effect of having searched an additional several ply since the stored value for A was computed by backing up values from exactly such a search.
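
In code, the idea might be sketched as follows. This is only a rough approximation of Samuel's rote learning (his program stored root positions, along with bookkeeping omitted here); the helpers static_eval and successors, and the requirement that board positions be hashable, are assumptions of the sketch.

    # Rough sketch of rote learning inside a minimax search, in the spirit of
    # Samuel's checkers program (not his actual procedure).
    stored_values = {}     # board position -> previously backed-up score

    def minimax(board, depth, maximizing, static_eval, successors):
        if board in stored_values:
            # Reusing the stored backed-up value gives the effect of having
            # searched several additional ply below this position.
            return stored_values[board]
        moves = successors(board)
        if depth == 0 or not moves:
            return static_eval(board)
        scores = [minimax(b, depth - 1, not maximizing, static_eval, successors)
                  for b in moves]
        value = max(scores) if maximizing else min(scores)
        stored_values[board] = value   # rote learning: record the backed-up score
        return value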

Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-solving capabilities. But even it shows the need for some capabilities that will become increasingly important in more complex learning systems. These capabilities include:

  • Organized storage of information. For it to be faster to use a stored value than to recompute it, there must be a way to access the appropriate stored value quickly.
  • Generalization. The number of distinct objects that might potentially be stored can be very large; some form of generalization is needed to keep that number manageable.

At this point, we have begun to see one way in which learning is similar to other kinds of problem solving. Its success depends on a good organizational structure for its knowledge base.

17.3 Learning by Taking Advice

A computer can do very little without a program for it to run. When a programmer writes a series of instructions into a computer, a rudimentary kind of learning is taking place: The programmer is a sort of teacher, and the computer is a sort of student. After being programmed, the computer is now able to do something it previously could not. Executing the program may not be such a simple matter, however. Suppose the program is written in a high-level language like LISP. Some interpreter or compiler must intervene to change the teacher's instructions into code that the machine can execute directly.

People process advice in an analogous way. In chess, the advice "fight for control of the center of the board" is useless unless the player can translate the advice into concrete moves and plans. A computer program might make use of the advice by adjusting its static evaluation function to include a factor based on the number of center squares attacked by its own pieces.
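
As a rough sketch of what such an adjustment might look like, the fragment below adds one weighted center-control term to a static evaluation function. The board methods material_balance and attacks, the particular squares chosen, and the weight names are assumptions made only for the illustration.

    # Hypothetical sketch: operationalizing "fight for control of the center"
    # as one extra weighted term in a static evaluation function.
    CENTER_SQUARES = ("d4", "d5", "e4", "e5")

    def static_eval(board, weights):
        material = board.material_balance()                             # assumed helper
        center = sum(1 for sq in CENTER_SQUARES if board.attacks(sq))   # assumed helper
        return weights["material"] * material + weights["center"] * center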

Mostow [1983] describes a program called FOO, which accepts advice for playing hearts, a card game. A human user first translates the advice from English into a representation that FOO can understand. For example, "Avoid taking points" becomes:

    (avoid (take-points me) (trick))

FOO must operationalize this advice by turning it into an expression that contains concepts and actions FOO can use when playing the game of hearts. One strategy FOO can follow is to UNFOLD an expression by replacing some term by its definition. By UNFOLDing the definition of avoid, FOO comes up with:

    (achieve (not (during (trick) (take-points me))))

FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition of trick:

    (achieve (not (during
                    (scenario
                      (each p1 (players) (play-card p1))
                      (take-trick (trick-winner)))
                    (take-points me))))

In other words, the player should avoid taking points during the scenario consisting of (1) players playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which steps could cause one to take points. It rules out step 1 on the basis that it knows of no intersection of the concepts take-points and play-card. But step 2 could affect taking points, so FOO UNFOLDs the definition of take-points:

    (achieve (not (there-exists c1 (cards-played)
                    (there-exists c2 (point-cards)
                      (during (take (trick-winner) c1)
                              (take me c2))))))

This advice says that the player should avoid taking point-cards during the process of the trick-winner taking the trick. The question for FOO now is: Under what conditions does (take me c2) occur during (take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes that points will be taken if me = trick-winner and c2 = c1. It transforms the advice into:

    (achieve (not (and (have-points (cards-played))
                       (= (trick-winner) me))))

This means "Do not win a trick that has points." We have not traveled very far conceptually from "avoid taking points," but it is important to note that the current vocabulary is one that FOO can understand in terms of actually playing the game of hearts. Through a number of other transformations, FOO eventually settles on:

    (achieve (>= (and (in-suit-led (card-of me))
                      (possible (trick-has-points)))
                 (low (card-of me))))

In other words, when playing a card that is the same suit as the card that was played first, if the trick possibly contains points, then play a low card. At last, FOO has translated the rather vague advice "avoid taking points" into a specific, usable heuristic. FOO is able to play a better game of hearts after receiving this advice. A human can watch FOO play, detect new mistakes, and correct them through yet more advice, such as "play high cards when it is safe to do so." The ability to operationalize knowledge is critical for systems that learn from a teacher's advice. It is also an important component of explanation-based learning, another form of learning discussed in Section 17.6.
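
The flavor of the UNFOLD step can be suggested with a toy sketch. Here advice is represented as a nested list and a definition maps an operator and its arguments onto a new expression; FOO's actual representation and transformation rules are considerably richer than this, so the names below are illustrative assumptions.

    # Toy sketch of UNFOLDing a term by its definition (not Mostow's code).
    definitions = {
        # (avoid cond situation)  =>  (achieve (not (during situation cond)))
        "avoid": lambda cond, situation:
            ["achieve", ["not", ["during", situation, cond]]],
    }

    def unfold(expr):
        if isinstance(expr, list) and expr and expr[0] in definitions:
            return definitions[expr[0]](*expr[1:])
        return expr

    advice = ["avoid", ["take-points", "me"], ["trick"]]
    print(unfold(advice))
    # -> ['achieve', ['not', ['during', ['trick'], ['take-points', 'me']]]]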

17.4 Learning in Problem Solving

In the last section, we saw how a problem solver could improve its performance by taking advice from a teacher. Can a program get better without the aid of a teacher? It can, by generalizing from its own experiences.

17.4.1 Learning by Parameter Adjustment

Many programs rely on an evaluation procedure that combines information from several sources into a single summary statistic. Game-playing programs do this in their static evaluation functions, in which a variety of factors, such as piece advantage and mobility, are combined into a single score reflecting the desirability of a particular board position. Pattern classification programs often combine several features to determine the correct category into which a given stimulus should be placed. In designing such programs, it is often difficult to know a priori how much weight should be attached to each feature being used. One way of finding the correct weights is to begin with some estimate of the correct settings and then to let the program modify the settings on the basis of its experience. Features that appear to be good predictors of overall success will have their weights increased, while those that do not will have their weights decreased, perhaps even to the point of being dropped entirely.

Samuel's checkers program [Samuel,1963] exploited this kind of learning in addition to the rote learning described above, and it provides a good example of its use. As its static evaluation function, the program used a polynomial of the form

    c1t1 + c2t2 + ... + c16t16

The t terms are the values of the sixteen features that contribute to the evaluation. The c terms are the coefficients (weights) that are attached to each of these values. As learning progresses, the c values will change.

The most important question in the design of a learning program based on parameter adjustment is "When should the value of a coefficient be increased and when should it be decreased?" The second question to be answered is then "By how much should the value be changed?" The simple answer to the first question is that the coefficients of terms that predicted the final outcome accurately should be increased, while the coefficients of poor predictors should be decreased. In some domains, this is easy to do. If a pattern classification program uses its evaluation function to classify an input and it gets the right answer, then all the terms that predicted that answer should have their weights increased. But in game-playing programs, the problem is more difficult. The program does not get any concrete feedback from individual moves. It does not find out for sure until the end of the game whether it has won. But many moves have contributed to that final outcome. Even if the program wins, it may have made some bad moves along the way. The problem of appropriately assigning responsibility to each of the steps that led to a single outcome is known as the credit assignment problem.

Samuel's program exploits one technique, albeit imperfect, for solving this problem. Assume that the initial values chosen for the coefficients are good enough that the total evaluation function produces values that are fairly reasonable measures of the correct score even if they are not as accurate as we hope to get them. Then this evaluation function can be used to provide feedback to itself. Move sequences that lead to positions with higher values can be considered good (and the terms in the evaluation function that suggested them can be reinforced).
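
A minimal sketch of this kind of coefficient adjustment is given below. It assumes the evaluation is the weighted sum described above and uses the deeper, backed-up value as the training signal; the update rule shown (nudging each weight in proportion to its term's contribution to the error) is in the spirit of Samuel's procedure rather than a reproduction of it.

    # Sketch of parameter (coefficient) adjustment; features is the list of
    # term values t1..t16 for a position, weights the coefficients c1..c16.
    def evaluate(weights, features):
        return sum(c * t for c, t in zip(weights, features))

    def adjust(weights, features, backed_up_value, rate=0.01):
        error = backed_up_value - evaluate(weights, features)
        # Reinforce terms that would move the static value toward the
        # backed-up value; weaken those that push it the other way.
        return [c + rate * error * t for c, t in zip(weights, features)]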

Because of the limitations of this approach, however, Samuel's program did two other things, one of which provided an additional test that progress was being made and the other of which generated additional nudges to keep the process out of a rut:

This process of learning by successive modifications to the weights of terms in a scoring function has many limitations, mostly arising out of its lack of exploitation of any knowledge about the structure of the problem with which it is dealing and the logical relationships among the problem's components. In addition, because the learning procedure is a variety of hill climbing, it suffers from the same difficulties as do other hill-climbing programs. Parameter adjustment is certainly not a solution to the overall learning problem. But it is often a useful technique, either in situations where very little additional knowledge is available or in programs in which it is combined with more knowledge-intensive methods. We have more to say about this type of learning in Chapter 18.

17.4.2 Learning with Macro-Operators

We saw in Section 17.2 how rote learning was used in the context of a checker-playing program. Similar techniques can be used in more general problem-solving programs. The idea is the same: to avoid expensive recomputation. For example, suppose you are faced with the problem of getting to the downtown post office. Your solution may involve getting in your car, starting it, and driving along a certain route. Substantial planning may go into choosing the appropriate route, but you need not plan about how to go about starting your car. You are free to treat START CAR as an atomic action, even though it really consists of several actions: sitting down, adjusting the mirror, inserting the key, and turning the key. Sequences of actions that can be treated as a whole are called macro-operators.

Macro-operators were used in the early problem-solving system STRIPS [Fikes and Nilsson,1971; Fikes et al.,1972]. We discussed the operator and goal structures of STRIPS in Section 13.2, but STRIPS also has a learning component. After each problem-solving episode, the learning component takes the computed plan and stores it away as a macro-operator, or MACROP. A MACROP is just like a regular operator except that it consists of a sequence of actions, not just a single one. A MACROP's preconditions are the initial conditions of the problem just solved, and its postconditions correspond to the goal just achieved. In its simplest form, the caching of previously computed plans is similar to rote learning.

Suppose we are given an initial blocks world situation in which ON(C, B) and ON(A, Table) are both true. STRIPS can achieve the goal ON(A, B) by devising a plan with the four steps UNSTACK(C, B), PUTDOWN(C), PICKUP(A), STACK(A, B). STRIPS now builds a MACROP with preconditions ON(C, B), ON(A, Table) and postconditions ON(C, Table), ON(A, B). The body of the MACROP consists of the four steps just mentioned. In future planning, STRIPS is free to use this complex macro-operator just as it would use any other operator.

But rarely will STRIPS see the exact same problem twice. New problems will differ from previous problems. We would still like the problem solver to make efficient use of the knowledge it gained from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to accomplish this. The simplest idea for generalization is to replace all of the constants in the macro-operator by variables. Instead of storing the MACROP described in the previous paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(x1, x2), PUTDOWN(x1), PICKUP(x3), STACK(x3, x2), where x1, x2, and x3 are variables. This plan can then be stored with preconditions ON(x1, x2), ON(x3, Table) and postconditions ON(x1, Table), ON(x3, x2). Such a MACROP can now apply in a variety of situations.
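
The simplest generalization step, replacing constants by variables, can be sketched in a few lines. The plan representation below is an assumption of the sketch, not the STRIPS data structures.

    # Sketch of naive MACROP generalization: every constant becomes a variable,
    # with the same constant always mapped to the same variable.
    def generalize(plan):
        mapping = {}
        def var(constant):
            if constant not in mapping:
                mapping[constant] = "x%d" % (len(mapping) + 1)
            return mapping[constant]
        return [(op, [arg if arg == "Table" else var(arg) for arg in args])
                for op, args in plan]

    plan = [("UNSTACK", ["C", "B"]), ("PUTDOWN", ["C"]),
            ("PICKUP", ["A"]), ("STACK", ["A", "B"])]
    print(generalize(plan))
    # -> [('UNSTACK', ['x1', 'x2']), ('PUTDOWN', ['x1']),
    #     ('PICKUP', ['x3']), ('STACK', ['x3', 'x2'])]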

Generalization is not so easy, however. Sometimes constants must retain their specific values. Suppose our domain included an operator called STACK-ON-B(x), with preconditions that both x and B be clear, and with postcondition ON(x, B). Consider the same problem as above:

    start: ON(C, B), ON(A, Table)        goal: ON(A, B)

STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK-ON-B(A). Let's generalize this plan and store it as a MACROP. The precondition becomes ON(x3, x2), the postcondition becomes ON(x1, x2), and the plan itself becomes UNSTACK(x3, x2), PUTDOWN(x3), STACK-ON-B(x1). Now, suppose we encounter a slightly different problem:

    start: ON(E, C), ON(A, Table)        goal: ON(A, C)

The generalized MACROP we just stored seems well-suited to solving this problem if we let x1 = A, x2 = C, and x3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C), PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the postcondition of the MACROP is overgeneralized. This operation is only useful for stacking blocks onto B, which is not what we need in this new example. In this case, this difficulty will be discovered when the last step is attempted. Although we cleared C, which is where we wanted to put A, we failed to clear B, which is where the MACROP is going to try to put it. Since B is not clear, STACK-ON-B cannot be executed. If B had happened to be clear, the MACROP would have executed to completion, but it would not have accomplished the stated goal.

In reality, STRIPS uses a more complex generalization procedure. First, all constants are replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its preconditions. In our example, the preconditions of steps 1 and 2 are satisfied, but the only way to ensure that B is clear for step 3 is to assume that block x2, which was cleared by the UNSTACK operator, is actually block B. Through "reproving" that the generalized plan works, STRIPS locates constraints of this kind.

More recent work on macro-operators appears in Korf [1985b]. It turns out that the set of problems for which macro-operators are critical are exactly those problems with nonserializable subgoals. Nonserializability means that working on one subgoal will necessarily interfere with the previous solution to another subgoal. Recall that we discussed such problems in connection with nonlinear planning (Section 13.5). Macro-operators can be useful in such cases, since one macro-operator can produce a small global change in the world, even though the individual operators that make it up produce many undesirable local changes.

For example, consider the 8-puzzle. Once a program has correctly placed the first four tiles, it is difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted. For many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the 8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are disturbed within the macro itself). Korf [1985b] gives an algorithm for learning a complete set of macro-operators. This approach contrasts with STRIPS, which learned its MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it takes to solve a single problem without macro-operators.

17.4.3 Learning by Chunking

Chunking is a process similar in flavor to macro-operators. The idea of chunking comes from the psychological literature on memory and problem solving. Its computational basis is in production systems, of the type studied in Chapter 6. Recall that in that chapter we described the SOAR system and discussed its use of control knowledge. SOAR also exploits chunking [Laird et al.,1986] so that its performance can increase with experience. In fact, the designers of SOAR hypothesize that chunking is a universal learning method, i.e., it can account for all types of learning in intelligent systems.

SOAR solves problems by firing productions, which are stored in long-term memory. Some of those firings turn out to be more useful than others. When SOAR detects a useful sequence of production firings, it creates a chunk, which is essentially a large production that does the work of an entire sequence of smaller ones. As in MACROPs, chunks are generalized before they are stored.

Recall from Section 6.5 that SOAR is a uniform processing architecture. Problems like choosing which subgoals to tackle and which operators to try (i.e., search control problems) are solved with the same mechanisms as problems in the original problem space. Because the problem solving is uniform, chunking can be used to learn general search control knowledge in addition to operator sequences. For example, if SOAR tries several different operators, but only one leads to a useful path in the search space, then SOAR builds productions that help it choose operators more wisely in the future.

SOAR has used chunking to replicate the macro-operator results described in the last section. In solving the 8-puzzle, for example, SOAR learns how to place a given tile without permanently disturbing the previously placed tiles. Given the way that SOAR learns, several chunks may encode a single macro-operator, and one chunk may participate in a number of macro sequences. Chunks are generally applicable toward any goal state. This contrasts with macro tables, which are structured toward reaching a particular goal state from any initial state. Also, chunking emphasizes how learning can occur during problem solving, while macro tables are usually built during a preprocessing stage. As a result, SOAR is able to learn within trials as well as across trials. Chunks learned during the initial stages of solving a problem are applicable in the later stages of the same problem-solving episode. After a solution is found, the chunks remain in memory, ready for use in the next problem.

The price that SOAR pays for this generality and flexibility is speed. At present, chunking is inadequate for duplicating the contents of large, directly-computed macro-operator tables.

17.4.4 The Utility Problem
PRODIGY [Minton et al., 1989], which we described in Section 6.5, also acquires control knowledge automatically. PRODIGY employs several learning mechanisms. One mechanism uses explanation-based learning (EBL), a learning method we discuss in Section 17.6. PRODIGY can examine a trace of its own problem-solving behavior and try to explain why certain paths failed. The program uses those explanations to formulate control rules that help the problem solver avoid those paths in the future. So while SOAR learns primarily from examples of successful problem solving, PRODIGY also learns from its failures.

A major contribution of the work on EBL in PRODIGY [Minton,1988] was the identification of the utility problem in learning systems. While new search control knowledge can be of great benefit in solving future problems efficiently, there are also some drawbacks. The learned control rules can take up large amounts of memory and the search program must take the time to consider each rule at each step during problem solving. Considering a control rule amounts to seeing if its postconditions are desirable and seeing if its preconditions are satisfied. This is a time-consuming process. So while learned rules may reduce problem-solving time by directing the search more carefully, they may also increase problem-solving time by forcing the problem solver to consider them. If we only want to minimize the number of node expansions in the search space, then the more control rules we learn, the better. But if we want to minimize the total CPU time required to solve a problem, we must consider this trade-off.

PRODIGY maintains a utility measure for each control rule. This measure takes into account the average savings provided by the rule, the frequency of its application, and the cost of matching it. If a proposed rule has a negative utility, it is discarded (or "forgotten"). If not, it is placed in long-term memory with the other rules. It is then monitored during subsequent problem solving. If its utility falls, the rule is discarded. Empirical experiments have demonstrated the effectiveness of keeping only those control rules with high utility. Utility considerations apply to a wide range of learning systems. For example, for a discussion of how to deal with large, expensive chunks in SOAR, see Tambe and Rosenbloom [1989].
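
The utility measure just described can be sketched as a simple formula. The attribute names below are assumptions made for the illustration; PRODIGY's actual bookkeeping is more detailed.

    # Sketch of a utility measure for a learned control rule.
    def utility(avg_savings, application_freq, avg_match_cost):
        # benefit when the rule applies, minus the cost of testing it everywhere
        return avg_savings * application_freq - avg_match_cost

    def keep_useful_rules(rules):
        # rules: list of (rule, avg_savings, application_freq, avg_match_cost)
        return [r for (r, s, f, c) in rules if utility(s, f, c) > 0]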

17.5 Learning from Examples: Induction

Classification is the process of assigning, to a particular input, the name of a class to which it belongs. The classes from which the classification procedure can choose can be described in a variety of ways. Their definition will depend on the use to which they will be put.

Classification is an important component of many problem-solving tasks. In its simplest form, it is presented as a straightforward recognition task. An example of this is the question "What letter of the alphabet is this?" But often classification is embedded inside another operation. To see how this can happen, consider a problem-solving system that contains the following production rule:

If: the current goal is to get from place A to place B, and

there is a WALL separating the two places

then: look for a DOORWAY in the WALL and go through it.

To use this rule successfully, the system's matching routine must be able to identify an object as a wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be able to recognize a doorway.

Before classification can be done, the classes it will use must be defined. This can be done in a variety of ways, including:

  • Isolate, for each class, a set of relevant features and define a scoring function (for example, a weighted sum of feature values) that measures how well an object fits the class.
  • Describe each class structurally, as a set of required components and the relationships that must hold among them.

There are advantages and disadvantages to each of these general approaches. The statistical approach taken by the first scheme presented here is often more efficient than the structural approach taken by the second. But the second is more flexible and more extensible.

Regardless of the way that classes are to be described, it is often difficult to construct, by hand, good class definitions. This is particularly true in domains that are not well understood or that change rapidly. Thus the idea of producing a classification program that can evolve its own class definitions is appealing. This task of constructing class definitions is called concept learning, or induction. The techniques used for this task must, of course, depend on the way that classes (concepts) are described. If classes are described by scoring functions, then concept learning can be done using the technique of coefficient adjustment described in Section 17.4.1. If, however, we want to define classes structurally, some other technique for learning class definitions is necessary. In this section, we present three such techniques.

17.5.1 Winston's Learning Program

Winston [1975] describes an early structural concept learning program. This program operated in a simple blocks world domain. Its goal was to construct representations of the definitions of concepts in the blocks domain. For example, it learned the concepts House, Tent, and Arch shown in Figure 17.2. The figure also shows an example of a near miss for each concept. A near miss is an object that is not an instance of the concept in question but that is very similar to such instances.

Figure 17.2: Some Blocks World Concepts

The program started with a line drawing of a blocks world structure. It used procedures such as the one described in Section 14.3 to analyze the drawing and construct a semantic net representation of the structural description of the object(s). This structural description was then provided as input to the learning program. An example of such a structural description for the House of Figure 17.2 is shown in Figure 17.3(a). Node A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a Brick. Figures 17.3(b) and 17.3(c) show descriptions of the two Arch structures of Figure 17.2. These descriptions are identical except for the types of the objects on the top; one is a Brick while the other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects marry if they have faces that touch and they have a common edge. The marry relation is critical in the definition of an Arch. It is the difference between the first arch structure and the near miss arch structure shown in Figure 17.2.

The basic approach that Winston's program took to the problem of concept formation can be described as follows:

  1. Begin with a structural description of one known instance of the concept. Call that description the concept definition.
  2. Examine descriptions of other known instances of the concept. Generalize the definition to include them.
  3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.

Steps 2 and 3 of this procedure can be interleaved.
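
A drastically simplified sketch of this loop is shown below. It treats a description as a set of relational facts, generalizes by keeping only the facts shared with each new positive example, and specializes by turning the single fact that distinguishes the definition from a near miss into a required constraint. Winston's program actually worked over semantic nets, climbed isa hierarchies rather than dropping mismatched facts, and produced richer comparison notes.

    # Oversimplified sketch of Winston-style concept learning.
    def learn(positive_examples, near_misses):
        definition = set(positive_examples[0])       # step 1: first instance
        for example in positive_examples[1:]:        # step 2: generalize
            definition &= set(example)               # keep only shared facts
        required = set()
        for miss in near_misses:                     # step 3: specialize
            difference = definition - set(miss)
            if len(difference) == 1:                 # a single distinguishing fact
                required |= difference               # make it a MUST constraint
        return definition, required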

Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and differences between structures can be detected. This process must function in much the same way as does any other matching process, such as one to determine whether a given production rule can be applied to a particular problem state. Because differences as well as similarities must be found, the procedure must perform

Figure 17.3: Structural Descriptions

Figure 17.4: The Comparison of Two Arches

not just literal but also approximate matching. The output of the comparison procedure is a skeleton structure describing the commonalities between the two input structures. It is annotated with a set of comparison notes that describe specific similarities and differences between the inputs.

To see how this approach works, we trace it through the process of learning what an arch is. Suppose that the arch description of Figure 17.3(b) is presented first. It then becomes the definition of the concept Arch. Then suppose that the arch description of Figure 17.3(c) is presented. The comparison routine will return a structure similar to the two input structures except that it will note that the objects represented by the nodes labeled C are not identical. This structure is shown as Figure 17.4. The c-note link from node C describes the difference found by the comparison routine. It notes that the difference occurred in the isa link, and that in the first structure the isa link pointed to Brick, and in the second it pointed to Wedge. It also notes that if we were to follow isa links from Brick and Wedge, these links would eventually merge. At this point, a new description of the concept Arch can be generated. This description could say simply that node C must be either a Brick or a Wedge. But since this particular disjunction has no previously known significance, it is probably better to trace up the isa hierarchies of Brick and Wedge until they merge. Assuming that that happens at the node Object, the Arch definition shown in Figure 17.5 can be built.

Figure 17.5: The Arch Description after Two Examples

Next, suppose that the near miss arch shown in Figure 17.2 is presented. This time, the comparison routine will note that the only difference between the current definition and the near miss is in the does-not-marry link between nodes B and D. But since this is a near miss, we do not want to broaden the definition to include it. Instead, we want to restrict the definition so that it is specifically excluded. To do this, we modify the link does-not-marry, which until now may simply have been recording something that happened by chance to be true of the small number of examples presented. It must now say must-not-marry. The Arch description at this point is shown in Figure 17.6. Actually, must-not-marry should not be a completely new link. There must be some structure among link types to reflect the relationship between marry, does-not-marry, and must-not-marry.

Notice how the problem-solving and knowledge representation techniques we covered in earlier chapters are brought to bear on the problem of learning. Semantic networks were used to describe block structures, and an isa hierarchy was used to describe relationships among already known objects. A matching process was used to detect similarities and differences between structures, and hill climbing allowed the program to evolve a more and more accurate concept definition.

This approach to structural concept learning is not without its problems. One major problem is that a teacher must guide the learning program through a carefully chosen sequence of examples. In the next section, we explore a learning technique that is insensitive to the order in which examples are presented.

17.5.2 Version Spaces

Mitchell [1977; 1978] describes another approach to concept learning called version spaces. The goal is the same: to produce a description that is consistent with all positive examples but no negative examples in the training set. But while Winston's system did this by evolving a single concept description, version spaces work by maintaining a

Figure 17.6: The Arch Description after a Near Miss

Figure 17.7: An Example of the Concept Car

set of possible descriptions and evolving that set as new examples and near misses are presented. As in the previous section, we need some sort of representation language for examples so that we can describe exactly what the system sees in an example. For now we assume a simple frame-based language, although version spaces can be constructed for more general representation languages. Consider Figure 17.7, a frame representing an individual car.

Now, suppose that each slot may contain only the discrete values shown in Figure 17.8. The choice of features and values is called the bias of the learning system. By being embedded in a particular program and by using particular representations, every learning system is biased, because it learns some things more easily than others. In our example, the bias is fairly simple; e.g., we can learn concepts that have to do with car manufacturers, but not car owners. In more complex systems, the bias is less obvious. A clear statement of the bias of a learning system is very important to its evaluation.

Concept descriptions, as well as training examples, can be stated in terms of these slots and values. For example, the concept "Japanese economy car" can be represented as in Figure 17.9. The names x1, x2, and x3 are variables. The presence of x2, for example, indicates that the color of a car is not relevant to whether the car is a Japanese

Figure 17.8: Representation Language for Cars

Figure 17.9: The Concept "Japanese economy car"

economy car. Now the learning problem is: Given a representation language such as in Figure 17.8, and given positive and negative training examples such as those in Figure 17.7, how can we produce a concept description such as that in Figure 17.9 that is consistent with all the training examples?

Before we proceed to the version space algorithm, we should make some observations about the representation. Some descriptions are more general than others. For example, the description in Figure 17.9 is more general than the one in Figure 17.7. In fact, the representation language defines a partial ordering of descriptions. A portion of that partial ordering is shown in Figure 17.10.

The entire partial ordering is called the concept space, and can be depicted as in Figure 17.11. At the top of the concept space is the null description, consisting only of variables, and at the bottom are all the possible training instances, which contain no variables. Before we receive any training examples, we know that the target concept lies somewhere in the concept space. For example, if every possible description is an instance of the intended concept, then the null description is the concept definition since it matches everything. On the other hand, if the target concept includes only a single example, then one of the descriptions at the bottom of the concept space is the desired concept definition. Most target concepts, of course, lie somewhere in between these two extremes.

As we process training examples, we want to refine our notion of where the target concept might lie. Our current hypothesis can be represented as a subset of the concept space called the version space. The version space is the largest collection of descriptions that is consistent with all the training examples seen so far.

How can we represent the version space? The version space is simply a set of descriptions, so an initial idea is to keep an explicit list of those descriptions. Unfortunately, the number of descriptions in the concept space is exponential in the number of features and values. So enumerating them is prohibitive. However, it turns out that

Figure 17.10: Partial Ordering of Concepts Specified by the Representation Language

Figure 17.11: Concept and Version Spaces

the version space has a concise representation. It consists of two subsets of the concept space. One subset, called G, contains the most general descriptions consistent with the training examples seen so far; the other subset, called S, contains the most specific descriptions consistent with the training examples. The version space is the set of all descriptions that lie between some element of G and some element of S in the partial order of the concept space.

This representation of the version space is not only efficient for storage, but also for modification. Intuitively, each time we receive a positive training example, we want to make the S set more general. Negative training examples serve to make the G set more specific. If the S and G sets converge, our range of hypotheses will narrow to a single concept description. The algorithm for narrowing the version space is called the

candidate elimination algorithm.

Algorithm: Candidate Elimination

Given: A representation language and a set of positive and negative examples expressed in that language.

Compute: A concept description that is consistent with all the positive examples and none of the negative examples.

  1. Initialize G to contain one element: the null description (all features are variables).
  2. Initialize S to contain one element: the first positive example.
  3. Accept a new training example.

    If it is a positive example, first remove from G any descriptions that do not cover the example. Then, update the S set to contain the most specific set of descriptions in the version space that cover the example and the current elements of the S set.

    Figure 17.12: Positive and Negative Examples of the Concept "Japanese economy car"

    That is, generalize the elements of S as little as possible so that they cover the new training example.

    If it is a negative example, first remove from S any descriptions that cover the example. Then, update the G set to contain the most general set of descriptions in the version space that do not cover the example. That is, specialize the elements of G as little as possible so that the negative example is no longer covered by any of the elements of G.

  4. If S and G are both singleton sets, then if they are identical, output their value and halt. If they are both singleton sets but they are different, then the training cases were inconsistent. Output this result and halt. Otherwise, go to step 3.
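
A compact sketch of this algorithm, for the simple feature-vector language used here, is given below. It represents a variable slot as None, assumes (as in our language) that S always contains a single description, and omits some of the pruning and consistency checks of the full algorithm.

    # Sketch of candidate elimination for conjunctive feature-vector concepts.
    def covers(desc, example):
        return all(d is None or d == e for d, e in zip(desc, example))

    def candidate_elimination(examples):
        # examples: list of (feature_tuple, is_positive) pairs
        first_positive = next(e for e, positive in examples if positive)
        S = first_positive
        G = [tuple(None for _ in first_positive)]
        for example, positive in examples:
            if positive:
                G = [g for g in G if covers(g, example)]
                # generalize S as little as possible
                S = tuple(s if s == e else None for s, e in zip(S, example))
            else:
                new_G = []
                for g in G:
                    if not covers(g, example):
                        new_G.append(g)
                        continue
                    # specialize g minimally: fill one variable slot with the
                    # value from S that excludes the negative example
                    for i, (gi, si, ei) in enumerate(zip(g, S, example)):
                        if gi is None and si is not None and si != ei:
                            new_G.append(g[:i] + (si,) + g[i + 1:])
                G = new_G
        return S, G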

Let us trace the operation of the candidate elimination algorithm. Suppose we want to learn the concept of "Japanese economy car" from the examples in Figure 17.12. G and S both start out as singleton sets. G contains the null description (see Figure 17.11), and S contains the first positive training example. The version space now contains all descriptions that are consistent with this first example:[Footnote: To make this example concise, we skip slot names in the descriptions. We just list slot values in the order in which the slots have been shown in the preceding figures.]

G = {(x1,x2,x3,x4,x5)}

S = {(Japan, Honda, Blue,1980, Economy)}

Now we are ready to process the second example. The G set must be specialized in such a way that the negative example is no longer in the version space. In our representation language, specialization involves replacing variables with constants. (Note: The G set must be specialized only to descriptions that are within the current version space, not outside of it.) Here are the available specializations:

G = {(x1,Honda,x3,x4,x5), (x1,x2,Blue,x4,x5), (x1,x2,x3,1980,x5), (x1,x2,x3,x4,Economy)}

The S set is unaffected by the negative example. Now we come to the third example, a positive one. The first order of business is to remove from the G set any descriptions that are inconsistent with the positive example. Our new G set is:

G= {(x1,x2,Blue,x4,x5), (x1,x2,x3,x4,Economy)}

We must now generalize the S set to include the new example. This involves replacing constants with variables. Here is the new S set:

S = {(Japan, x2, Blue, x4, Economy)}

At this point, the S and G sets specify a version space (a space of candidate descrip- tions) that can be translated roughly into English as: "The target concept may be as specific as `Japanese, blue economy car,' or as general as either `blue car' or `economy car.'"

Next, we get another negative example, a car whose origin is USA. The S set is unaffected, but the G set must be specialized to avoid covering the new example. The new G set is:

G= {(Japan,x2,Blue,x4,x5), (Japan,x2,x3,x4,Economy)}

We now know that the car must be Japanese, because all of the descriptions in the version space contain Japan as origin. [Footnote: It could be the case that our target concept is "not Chrysler," but we will ignore this possibility because our representation language is not powerful enough to express negation and disjunction.] Our final example is a positive one. We first remove from the G set any descriptions that are inconsistent with it, leaving:

G = {(Japan, x2, x3, x4, Economy)}

We then generalize the S set to include the new example:

S = {(Japan, x2, x3, x4, Economy)}

S and G are both singletons, so the algorithm has converged on the target concept. No more examples are needed.

There are several things to note about the candidate elimination algorithm. First, it is a least-commitment algorithm. The version space is pruned as little as possible at each step. Thus, even if all the positive training examples are Japanese cars, the algorithm will not reject the possibility that the target concept may include cars of other origin, until it receives a negative example that forces the rejection. This means that if the training data are sparse, the S and G sets may never converge to a single description; the system may learn only partially specified concepts. Second, the algorithm involves exhaustive, breadth-first search through the version space. We can see this in the algorithm for updating the G set. Contrast this with the depth-first behavior of Winston's learning program. Third, in our simple representation language, the S set always contains exactly one element, because any two positive examples always have exactly one generalization. Other representation languages may not share this property.

The version space approach can be applied to a wide variety of learning tasks and representation languages. The algorithm above can be extended to handle continuously valued features and hierarchical knowledge (see Exercises). However, version spaces have several deficiencies. One is the large space requirements of the exhaustive, breadth-first search mentioned above. Another is that inconsistent data, also called noise, can cause the candidate elimination algorithm to prune the target concept from the version space prematurely. In the car example above, if the third training instance had been mislabeled (-) instead of (+), the target concept of "Japanese economy car" would never be reached. Also, given enough erroneous negative examples, the G set can be specialized so far that the version space becomes empty. In that case, the algorithm concludes that no concept fits the training examples.

One solution to this problem [Mitchell,1978] is to maintain several G and S sets. One G set is consistent with all the training instances, another is consistent with all but one, another with all but two, etc. (and the same for the S set). When an inconsistency arises, the algorithm switches to G and S sets that are consistent with most, but not all, of the training examples. Maintaining multiple version spaces can be costly, however, and the S and G sets are typically very large. If we assume bounded inconsistency, i.e., that instances close to the target concept boundary are the most likely to be misclassified, then more efficient solutions are possible. Hirsh [1990] presents an algorithm that runs as follows. For each instance, we form a version space consistent with that instance plus other nearby instances (for some suitable definition of nearby). This version space is then intersected with the one created for all previous instances. We keep accepting instances until the version space is reduced to a small set of candidate concept descriptions. (Because of inconsistency, it is unlikely that the version space will converge to a singleton.) We then match each of the concept descriptions against the entire data set, and choose the one that classifies the instances most accurately.

Another problem with the candidate elimination algorithm is the learning of disjunctive concepts. Suppose we wanted to learn the concept of "European car," which, in our representation, means either a German, British, or Italian car. Given positive examples of each, the candidate elimination algorithm will generalize to cars of any origin. Given such a generalization, a negative instance (say, a Japanese car) will only cause an inconsistency of the type mentioned above.

Of course, we could simply extend the representation language to include disjunctions. Thus, the concept space would hold descriptions such as "Blue car of German or British origin" and "Italian sports car or German luxury car." This approach has two drawbacks. First, the concept space becomes much larger and specialization becomes intractable. Second, generalization can easily degenerate to the point where the S set contains simply one large disjunction of all positive instances. We must somehow force generalization while allowing for the introduction of disjunctive descriptions. Mitchell [1978] gives an iterative approach that involves several passes through the training data. On each pass, the algorithm builds a concept that covers the largest number of positive training instances without covering any negative training instances. At the end of the pass, the positive training instances covered by the new concept are removed from the

Figure 17.13: A Decision Tree
training set, and the new concept then becomes one disjunct in the eventual disjunctive concept description. When all positive training instances have been removed, we are left with a disjunctive concept that covers all of them without covering any negative instances.

There are a number of other complexities, including the way in which features interact with one another. For example, if the origin of a car is Japan, then the manufacturer cannot be Chrysler. The version space algorithm as described above makes no use of such information. Also, in our example, it would be more natural to replace the decade slot with a continuously valued year field. We would have to change our procedures for updating the S and G sets to account for this kind of numerical data.

17.5.3 Decision Trees

A third approach to concept learning is the induction of decision trees, as exemplified by the ID3 program of Quinlan [1986]. ID3 uses a tree representation for concepts, such as the one shown in Figure 17.13. To classify a particular input, we start at the top of the tree and answer questions until we reach a leaf, where the classification is stored. Figure 17.13 represents the familiar concept "Japanese economy car." ID3 is a program that builds decision trees automatically, given positive and negative instances of a concept. [Footnote: Actually, the decision tree representation is more general: Leaves can denote any of a number of classes, not just positive and negative.]

ID3 uses an iterative method to build up decision trees, preferring simple trees over complex ones, on the theory that simple trees are more accurate classifiers of future inputs. It begins by choosing a random subset of the training examples. This subset is called the window. The algorithm builds a decision tree that correctly classifies all examples in the window. The tree is then tested on the training examples outside the window. If all the examples are classified correctly, the algorithm halts. Otherwise, it adds a number of training examples to the window and the process repeats. Empirical evidence indicates that the iterative strategy is more efficient than considering the whole training set at once.

So how does ID3 actually construct decision trees? Building a node means choosing some attribute to test. At a given point in the tree, some attributes will yield more information than others. For example, testing the attribute color is useless if the color of a car does not help us to classify it correctly. Ideally, an attribute will separate training instances into subsets whose members share a common label (e.g., positive or negative). In that case, branching is terminated, and the leaf nodes are labeled.
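
The usual way of measuring how much information an attribute yields is the reduction in entropy obtained by splitting on it, known as information gain; this is the measure ID3 uses, though the code below is only an illustrative sketch, not Quinlan's program.

    # Sketch of selecting the most informative attribute by information gain.
    from collections import Counter
    from math import log2

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in counts.values())

    def information_gain(examples, labels, attribute):
        # examples: list of dicts mapping attribute names to values
        before = entropy(labels)
        after = 0.0
        for value in set(ex[attribute] for ex in examples):
            subset = [lab for ex, lab in zip(examples, labels)
                      if ex[attribute] == value]
            after += (len(subset) / len(labels)) * entropy(subset)
        return before - after

    def best_attribute(examples, labels, attributes):
        return max(attributes, key=lambda a: information_gain(examples, labels, a))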

There are many variations on this basic algorithm. For example, when we add a test that has more than two branches, it is possible that one branch has no corresponding training instances. In that case, we can either leave the node unlabeled, or we can attempt to guess a label based on statistical properties of the set of instances being tested at that point in the tree. Noisy input is another issue. One way of handling noisy input is to avoid building new branches if the information gained is very slight. In other words, we do not want to overcomplicate the tree to account for isolated noisy instances. Another source of uncertainty is that attribute values may be unknown. For example, a patient's medical record may be incomplete. One solution is to guess the correct branch to take; another solution is to build special "unknown" branches at each node during learning.

When the concept space is very large, decision tree learning algorithms run more quickly than their version space cousins. Also, disjunction is more straightforward. For example, we can easily modify Figure 17.13 to represent the disjunctive concept "American car or Japanese economy car," simply by changing one of the negative (-) leaf labels to positive (+). One drawback to the ID3 approach is that large, complex decision trees can be difficult for humans to understand, and so a decision tree system may have a hard time explaining the reasons for its classifications.

17.6 Explanation-Based Learning

The previous section illustrated how we can induce concept descriptions from positive and negative examples. Learning complex concepts using these procedures typically requires a substantial number of training instances. But people seem to be able to learn quite a bit from single examples. Consider a chess player who, as Black, has reached the position shown in Figure 17.14. The position is called a "fork" because the white knight attacks both the black king and the black queen. Black must move the king, thereby leaving the queen open to capture. From this single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece x attacks both the opponent's king and another piece y, then piece y will be lost. We don't need to see dozens of positive and negative examples of fork positions in order to draw these conclusions. From just one experience, we can learn to avoid this trap in the future and perhaps to use it to our own advantage.

What makes such single-example learning possible? The answer, not surprisingly, is knowledge. The chess player has plenty of domain-specific knowledge that can be brought to bear, including the rules of chess and any previously acquired strategies. That knowledge can be used to identify the critical aspects of the training example. In the case of the fork, we know that the double simultaneous attack is important while the precise position and type of the attacking piece is not.

Much of the recent work in machine learning has moved away from the empirical, data-intensive approach described in the last section toward this more analytical,

Figure 17.14: A Fork Position in Chess

knowledge-intensive approach. A number of independent studies led to the characterization of this approach as explanation-based learning. An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept. The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.

Mitchell et al. [1986] and DeJong and Mooney [1986] both describe general frameworks for EBL programs and give general learning algorithms. We can think of EBL programs as accepting the following as input:

  • A Training Example-What the learning program "sees" in the world, e.g., the car of Figure 17.7
  • A Goal Concept-A high-level description of what the program is supposed to learn
  • An Operationality Criterion-A description of which concepts are usable
  • A Domain Theory-A set of rules that describe relationships between objects and actions in a domain
From this, EBL computes a generallzation of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion.

Let's look more closely at this specification. The training example is a familiar input: it is the same thing as the example in the version space algorithm. The goal concept is also familiar, but in previous sections, we have viewed the goal concept as an output of the program, not an input. The assumption here is that the goal concept is not operational, just like the high-level card-playing advice described in Section 17.3. An EBL program seeks to operationalize the goal concept by expressing it in terms that a problem-solving program can understand. These terms are given by the operationality criterion. In the chess example, the goal concept might be something like "bad position for Black," and the operationalized concept would be a generalized description of situations similar to the training example, given in terms of pieces and their relative positions. The last input to an EBL program is a domain theory, in our case, the rules of chess. Without such knowledge, it is impossible to come up with a correct generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. [1986]. It has two steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune away all the unimportant aspects of the training example with respect to the goal concept. What is left is an explanation of why the training example is an instance of the goal concept. This explanation is expressed in terms that satisfy the operationality criterion. The next step is to generalize the explanation as far as possible while still describing the goal concept. Following our chess example, the first EBL step chooses to ignore White's pawns, king, and rook, and constructs an explanation consisting of White's knight, Black's king, and Black's queen, each in their specific positions. Operationality is ensured: all chess-playing programs understand the basic concepts of piece and position. Next, the explanation is generalized. Using domain knowledge, we find that moving the pieces to a different part of the board is still bad for Black. We can also determine that other pieces besides knights and queens can participate in fork attacks.

In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not pursue this example further. Instead, let's look at a simpler case. Consider the problem of learning the concept Cup [Mitchell et al.,1986]. Unlike the arch-learning program of Section 17.5.1, we want to be able to generalize from a single example of a cup. Suppose the example is:

  • Training Example:

    owner(Object23, Ralph) ^ has-part(Object23, Concavity12) ^
    is(Object23, Light) ^ color(Object23, Brown) ^ ...

Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in this chapter, we have seen several methods for isolating relevant features. These methods all require many positive and negative examples. In EBL we instead rely on domain knowledge, such as:

  • Domain Knowledge:

    is(x, Light) ^ has-part(x, y) ^ isa(y, Handle) → liftable(x)

    has-part(x, y) ^ isa(y, Bottom) ^ is(y, Flat) → stable(x)

    has-part(x, y) ^ isa(y, Concavity) ^ is(y, Upward-Pointing) → open-vessel(x)

We also need a goal concept to operationalize:
  • Goal Concept: Cup

    x is a Cup iff x is liftable, stable, and open-vessel.

  • Operationality Criterion: Concept definition must be expressed in purely structural terms (e.g., Light, Flat, etc.).

Given a training example and a functional description, we want to build a general structural description of a cup. The first step is to explain why Object23 is a cup. We do this by constructing a proof, as shown in Figure 17.15. Standard theorem-proving

Figure 17.15: An Explanation

techniques can be used to find such a proof. Notice that the proof isolates the relevant features of the training example; nowhere in the proof do the predicates owner and color appear. The proof also serves as a basis for a valid generalization. If we gather up all the assumptions and replace constants with variables, we get the following description of a cup:

has-part(x, y) ^ isa(y, Concavity) ^ is(y, Upward-Pointing) ^

has-part(x, z) ^ isa(z, Bottom) ^ is(z, Flat) ^

has-part(x, w) ^ isa(w, Handle) ^ is(x, Light)

This definition satisfies the operationality criterion and could be used by a robot to classify objects.
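To make the two EBG steps concrete, here is a small Python sketch of the cup example. The representation (tuples for facts, a hand-coded search for the explanation, variablization of object constants only) is invented for illustration and is far simpler than the formulation of Mitchell et al. [1986], but it follows the explain-then-generalize outline given above. The part names Handle16 and Bottom19 are hypothetical additions to the training example.

    # Ground facts describing the training example.
    facts = {
        ("owner", "Object23", "Ralph"),
        ("color", "Object23", "Brown"),
        ("is", "Object23", "Light"),
        ("has-part", "Object23", "Handle16"),
        ("has-part", "Object23", "Bottom19"),
        ("has-part", "Object23", "Concavity12"),
        ("isa", "Handle16", "Handle"),
        ("isa", "Bottom19", "Bottom"),
        ("is", "Bottom19", "Flat"),
        ("isa", "Concavity12", "Concavity"),
        ("is", "Concavity12", "Upward-Pointing"),
    }

    def explain_cup(obj, facts):
        """Explain step: keep only the facts needed to prove liftable,
        stable, and open-vessel for obj, using the three domain rules."""
        parts = [p for (pred, o, p) in facts if pred == "has-part" and o == obj]

        def part_with(kind, prop=None):
            for p in parts:
                if ("isa", p, kind) in facts and \
                   (prop is None or ("is", p, prop) in facts):
                    return p
            return None

        handle = part_with("Handle")
        bottom = part_with("Bottom", "Flat")
        cavity = part_with("Concavity", "Upward-Pointing")
        if handle and bottom and cavity and ("is", obj, "Light") in facts:
            return [("is", obj, "Light"),
                    ("has-part", obj, handle), ("isa", handle, "Handle"),
                    ("has-part", obj, bottom), ("isa", bottom, "Bottom"),
                    ("is", bottom, "Flat"),
                    ("has-part", obj, cavity), ("isa", cavity, "Concavity"),
                    ("is", cavity, "Upward-Pointing")]
        return None

    def generalize(explanation):
        """Generalize step: replace object constants (anything appearing as
        the subject of a fact) with variables; property constants such as
        Light, Flat, or Handle are kept, as in the description above."""
        objects = {args[0] for (_, *args) in explanation}
        mapping, names = {}, iter("xwzy")
        def term(c):
            if c in objects and c not in mapping:
                mapping[c] = "?" + next(names)
            return mapping.get(c, c)
        return [(pred,) + tuple(term(a) for a in args)
                for (pred, *args) in explanation]

    print(generalize(explain_cup("Object23", facts)))

The owner and color facts never enter the explanation, so they disappear from the result, and the remaining constants Object23, Handle16, Bottom19, and Concavity12 become the variables of the generalized description.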

Simply replacing constants by variables worked in this example, but in some cases it is necessary to retain certain constants. To catch these cases, we must reprove the goal. This process, which we saw earlier in our discussion of learning in STRIPS, is called goal regression.

As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are examples needed at all? We could have operationalized the goal concept Cup without reference to an example, since the domain theory contains all of the requisite information. The answer is that examples help to focus the learning on relevant operationalizations. Without an example cup, EBL is faced with the task of characterizing the entire range of objects that satisfy the goal concept. Most of these objects will never be encountered in the real world, and so the result will be overly general.

Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn with very primitive relations. Instead, they create incomplete and inconsistent domain theories. For example, returning to chess, such a theory might include concepts like "weak pawn structure." Getting EBL to work in ill-structured domain theories is an active area of research (see, e.g., Tadepalli [1989]). EBL shares many features of all the learning methods described in earlier sections. Like concept learning, EBL begins with a positive example of some concept. As in learning by advice taking, the goal is to operationalize some piece of knowledge. And EBL techniques, like the techniques of chunking and macro-operators, are often used to improve the performance of problem-solving engines. The major difference between EBL and other learning methods is that EBL programs are built to take advantage of domain knowledge. Since learning is just another kind of problem solving, it should come as no surprise that there is leverage to be found in knowledge.

17.7 Discovery

Learning is the process by which one entity acquires knowledge. Usually that knowledge is already possessed by some number of other entities who may serve as teachers. Discovery is a restricted form of learning in which one entity acquires knowledge without the help of a teacher.[Footnote: Sometimes, there is no one in the world who has the knowledge we seek. In that case the kind of action we must take is called scientific discovery.] In this section, we look at three types of automated discovery systems.

17.7.1 AM: Theory-Driven Discovery

Discovery is certainly learning. But it is also, perhaps more clearly than other kinds of learning, problem solving. Suppose that we want to build a program to discover things, for example, in mathematics. We expect that such a program would have to rely heavily on the problem-solving techniques we have discussed. In fact, one such program was written by Lenat [1977; 1982]. It was called AM, and it worked from a few basic concepts of set theory to discover a good deal of standard number theory.

AM exploited a variety of general-purpose AI techniques. It used a frame system to represent mathematical concepts. One of the major activities of AM is to create new concepts and fill in their slots. An example of an AM concept is shown in Figure 17.16. AM also uses heuristic search, guided by a set of 250 heuristic rules representing hints about activities that are likely to lead to "interesting" discoveries. Examples of the kind of heuristics AM used are shown in Figure 17.17. Generate-and-test is used to form hypotheses on the basis of a small number of examples and then to test the hypotheses on a larger set to see if they still appear to hold. Finally, an agenda controls the entire discovery process. When the heuristics suggest a task, it is placed on a central agenda, along with the reason that it was suggested and the strength with which it was suggested. AM operates in cycles, each time choosing the most promising task from the agenda and performing it.
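The agenda-driven control loop itself is simple enough to sketch. Everything in the fragment below (the task representation, the worth scores, the heuristic interface) is a hypothetical simplification made for illustration; AM's actual machinery was considerably richer.

    import heapq

    class Agenda:
        """Tasks ordered by the strength ('worth') with which they were proposed."""
        def __init__(self):
            self._heap, self._counter = [], 0
        def propose(self, worth, task, reasons):
            heapq.heappush(self._heap, (-worth, self._counter, task, reasons))
            self._counter += 1        # tie-breaker so tasks never compare directly
        def pop_most_promising(self):
            neg_worth, _, task, reasons = heapq.heappop(self._heap)
            return -neg_worth, task, reasons
        def __bool__(self):
            return bool(self._heap)

    def discovery_loop(agenda, heuristics, concepts, max_cycles=1000):
        """Each cycle: take the most promising task, perform it (creating or
        filling in concept frames), then let each heuristic inspect the result
        and propose follow-up tasks, each with a worth and a reason."""
        for _ in range(max_cycles):
            if not agenda:
                break
            worth, task, reasons = agenda.pop_most_promising()
            result = task(concepts)
            for heuristic in heuristics:
                for new_worth, new_task, why in heuristic(task, result, concepts):
                    agenda.propose(new_worth, new_task, why)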

In one run, AM discovered the concept of prime numbers. How did it do that? Having stumbled onto the natural numbers, AM explored operations such as addition, multiplication, and their inverses. It created the concept of divisibility and noticed that some numbers had very few divisors. AM has a built-in heuristic that tells it to explore

Figure 17.16: An AM Concept: Prime Number

Figure 17.17: Some AM Heuristics

extreme cases. It attempted to list all numbers with zero divisors (finding none), one divisor (finding one: 1), and two divisors. AM was instructed to call the last concept "primes." Before pursuing this concept, AM went on to list numbers with three divisors, such as 49. AM tried to relate this property to other properties of 49, such as its being odd and a perfect square. AM generated other odd numbers and other perfect squares to test its hypotheses. A side effect of determining the equivalence of perfect squares with numbers with three divisors was to boost the "interestingness" rating of the divisor concept. This led AM to investigate ways in which a number could be broken down into factors. AM then noticed that there was only one way to break a number down into prime factors (known as the Unique Factorization Theorem).

Since breaking down numbers into multiplicative components turned out to be interesting, AM decided, by analogy, to pursue additive components as well. It made several uninteresting conjectures, such as that every number could be expressed as a sum of 1's. It also found more interesting phenomena, such as that many numbers were expressible as the sum of two primes. By listing cases, AM determined that all even numbers greater than 2 seemed to have this property. This conjecture, known as Goldbach's Conjecture, is widely believed to be true, but a proof of it has yet to be found.

AM contains a great many general-purpose heuristics such as the ones it used in this example. Often, different heuristics point to the same place. For example, while AM discovered prime numbers using a heuristic that involved looking at extreme cases, another way to derive prime numbers is to use the following two rules:

  • If there is a strong analogy between A and B but there is a conjecture about A that does not hold for all elements of B, define a new concept that includes the elements of B for which it does hold.
  • If there is a set whose complement is much rarer than itself, then create a new concept representing the complement.

There is a strong analogy between addition and multiplication of natural numbers. But that analogy breaks down when we observe that all natural numbers greater than 1 can be expressed as the sum of two smaller natural numbers (excluding the identity). This is not true for multiplication. So the first heuristic described above suggests the creation of a new concept representing the set of composite numbers. Then the second heuristic suggests creating a concept representing the complement of that, namely the set of prime numbers.
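These two heuristics are easy to make concrete. The sketch below is purely illustrative (AM worked with concept frames and LISP predicates, not Python sets): it checks that the additive conjecture holds where the multiplicative one does not, collects the exceptions as a new concept, and then names the rarer complement.

    def sum_of_two_smaller(n):
        """The conjecture about '+': n is a sum of two smaller naturals."""
        return any(a + b == n for a in range(1, n) for b in range(1, n))

    def product_of_two_smaller(n):
        """The analogous conjecture about '*': n = a * b with 1 < a, b < n."""
        return any(n % a == 0 for a in range(2, n))

    numbers = range(2, 50)
    assert all(sum_of_two_smaller(n) for n in numbers)    # holds for every n > 1

    # Heuristic 1: the analogy breaks down for '*', so define a new concept
    # containing those elements for which the multiplicative conjecture holds.
    composites = {n for n in numbers if product_of_two_smaller(n)}

    # Heuristic 2: the complement is much rarer than the set itself, so make
    # the complement a concept in its own right: the primes.
    primes = sorted(set(numbers) - composites)
    print(primes)    # [2, 3, 5, 7, 11, 13, ...]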

Two major questions came out of the work on AM. One question was: "Why was AM ever turned off?" That is, why didn't AM simply keep discovering new interesting facts about numbers, possibly facts unknown to human mathematics? Lenat [1983b] contends that AM's performance was limited by the static nature of its heuristics. As the program progressed, the concepts with which it was working evolved away from the initial ones, while the heuristics that were available to work on those concepts stayed the same. To remedy this problem, it was suggested that heuristics be treated as full-fledged concepts that could be created and modified by the same sorts of processes (such as generalization, specialization, and analogy) as are concepts in the task domain. In other words, AM would run in discovery mode in the domain of "Heuretics," the study of heuristics themselves, as well as in the domain of number theory. An extension of AM called EURISKO [Lenat, 1983a] was designed with this goal in mind.

The other question was: "Why did AM work as well as it did?" One source of power for AM was its huge collection of heuristics about what constitutes an interesting thing. But AM had another, less obvious source of power, namely, the natural relationship between number-theoretic concepts and their compact representations in AM [Lenat and Brown, 1983]. AM worked by syntactically mutating old concept definitions, stored essentially as short LISP programs, in the hope of finding new, interesting concepts. It turns out that a mutation in a small LISP program very likely results in another well-formed, meaningful LISP program. This accounts for AM's ability to generate so many novel concepts. But while humans interpreted AM as exploring number theory, it was actually exploring the space of small LISP programs. AM succeeded in large part because of this intimate relationship between number theory and LISP programs. When AM and EURISKO were applied to other domains, including the study of heuristics themselves, problems arose. Concepts in these domains were larger and more complex than number theory concepts, and the syntax of the representation

Figure 17.18: BACON Discovering the Ideal Gas Law

language no longer closely mirrored the semantics of the domain. As a result, syntactic mutation of a concept definition almost always resulted in an ill-formed or useless concept, severely hampering the discovery procedure.

Perhaps the moral of AM is that learning is a tricky business. We must be careful how we interpret what our AI programs are doing [Ritchie and Hanna, 1984]. AM had an implicit bias toward learning concepts in number theory. Only after that bias was explicitly recognized was it possible to understand why AM performed well in one domain and poorly in another.

17.7.2 BACON: Data-Driven Discovery

AM showed how discovery might occur in a theoretical setting. Empirical scientists see things somewhat differently. They are confronted with data from the world and must make sense of it. They make hypotheses, and in order to validate them, they design and execute experiments. Scientific discovery has inspired a number of computer models. Langley et al. [1981a] present a model of data-driven scientific discovery that has been implemented as a program called BACON, named after Sir Francis Bacon, an early philosopher of science.

BACON begins with a set of variables for a problem. For example, in the study of the behavior of gases, some variables are p, the pressure on the gas, V, the volume of the gas, n, the amount of gas in moles, and T, the temperature of the gas. Physicists have long known a law, called the ideal gas law, that relates these variables. BACON is able to derive this law on its own. First, BACON holds the variables n and T constant, performing experiments at different pressures p1, p2, and p3. BACON notices that as the pressure increases, the volume V decreases. Therefore, it creates a theoretical term pV. This term is constant. BACON systematically moves on to vary the other variables. It tries an experiment with different values of T, and finds that pV changes. The two terms are linearly related with an intercept of 0, so BACON creates a new term pV/T. Finally, BACON varies the term n and finds another linear relation between n and pV/T. For all values of n, p, V, and T, pV/nT = 8.32. This is, in fact, the ideal gas law. Figure 17.18 shows BACON's reasoning in a tabular format.
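The heuristics at work here (noticing a constancy, finding a linear relation, defining a new theoretical term) can be sketched directly. The data values and function names below are assumptions chosen so that pV = nRT with n = 1 and R ≈ 8.32; this is an illustration of the idea, not the implementation of Langley et al.

    def is_constant(values, tolerance=1e-3):
        """Heuristic: notice a constancy in a column of data."""
        spread = max(values) - min(values)
        return spread <= tolerance * max(abs(v) for v in values)

    def linear_fit(xs, ys):
        """Heuristic: find a linear relation y = a*x + b by least squares."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
        return a, mean_y - a * mean_x

    # Step 1: hold n and T fixed and vary the pressure; V falls as p rises,
    # so form the product term p*V and notice that it is constant.
    p = [100.0, 200.0, 300.0]
    V = [24.96, 12.48, 8.32]
    pV = [pi * vi for pi, vi in zip(p, V)]
    assert is_constant(pV)

    # Step 2: now vary T; p*V is linearly related to T with intercept ~0,
    # so form the ratio term p*V/T.
    T = [200.0, 250.0, 300.0]
    pV_at_T = [1664.0, 2080.0, 2496.0]
    slope, intercept = linear_fit(T, pV_at_T)
    print(round(slope, 2), round(intercept, 2))    # about 8.32 and 0.0

    # Step 3: varying n the same way yields pV/(nT) = 8.32, the ideal gas law.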

BACON has been used to discover a wide variety of scientific laws, such as Kepler's third law, Ohm's law, the conservation of momentum, and Joule's law. The heuristics BACON uses to discover the ideal gas law include noting constancies, finding linear relations, and defining theoretical terms. Other heuristics allow BACON to postulate intrinsic properties of objects and to reason by analogy. For example, if BACON finds a regularity in one set of parameters, it will attempt to generate the same regularity in a similar set of parameters. Since BACON's discovery procedure is state-space search, these heuristics allow it to reach solutions while visiting only a small portion of the search space. In the gas example, BACON comes up with the ideal gas law using a minimal number of experiments.

A better understanding of the science of scientific discovery may lead one day to programs that display true creativity. Much more work must be done in areas of science that BACON does not model, such as determining what data to gather, choosing (or creating) instruments to measure the data, and using analogies to previously understood phenomena. For a thorough discussion of scientific discovery programs, see Langley et al. [1987].

17.7.3 Clustering

A third type of discovery, called clustering, is very similar to induction, as we described it in Section 17.5. In inductive learning, a program learns to classify objects based on the labelings provided by a teacher. In clustering, no class labelings are provided. The program must discover for itself the natural classes that exist for the objects, in addition to a method for classifying instances.

AUTOCLASS [Cheeseman et al., 1988] is one program that accepts a number of training cases and hypothesizes a set of classes. For any given case, the program provides a set of probabilities that predict into which class(es) the case is likely to fall. In one application, AUTOCLASS found meaningful new classes of stars from their infrared spectral data. This was an instance of true discovery by computer, since the facts it discovered were previously unknown to astronomy. AUTOCLASS uses statistical Bayesian reasoning of the type discussed in Chapter 8.
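AUTOCLASS's Bayesian machinery is too involved to reproduce here, but the flavor of class discovery from unlabeled cases can be conveyed with a finite mixture model. The sketch below uses scikit-learn's GaussianMixture purely as a stand-in; the synthetic data and the choice of two components are assumptions of this illustration, not part of Cheeseman et al.'s system.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Unlabeled cases: rows are objects, columns are measured attributes
    # (in the stellar application, infrared spectral measurements).
    rng = np.random.default_rng(0)
    cases = np.vstack([rng.normal(0.0, 1.0, (50, 3)),
                       rng.normal(4.0, 0.5, (50, 3))])

    # Hypothesize classes without any labels.
    model = GaussianMixture(n_components=2, random_state=0).fit(cases)

    # For any given case, report the probability of membership in each
    # hypothesized class, as AUTOCLASS does.
    print(model.predict_proba(cases[:3]))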

17.8 Analogy

Analogy is a powerful inference tool. Our language and reasoning are laden with analogies. Consider the following sentences:

  • Last month, the stock market was a roller coaster.
  • Bill is like a fire engine.
  • Problems in electromagnetism are just like problems in fluid flow.

Underlying each of these examples is a complicated mapping between what appear to be dissimilar concepts. For example, to understand the first sentence above, it is necessary to do two things: (1) pick out one key property of a roller coaster, namely that it travels up and down rapidly, and (2) realize that physical travel is itself an analogy for numerical fluctuations (in stock prices). This is no easy trick. The space of possible analogies is very large. We do not want to entertain possibilities such as "the stock market is like a roller coaster because it is made of metal."

Figure 17.19: Transformational Analogy

Lakoff and Johnson [1980] make the case that everyday language is filled with such analogies and metaphors. An AI program that is unable to grasp analogy will be difficult to talk to and, consequently, difficult to teach. Thus, analogical reasoning is an important factor in learning by advice taking. It is also important to learning in problem solving.

Humans often solve problems by making analogies to things they already know how to do. This process is more complex than storing macro-operators (as discussed in Section 17.4.2) because the old problem might be quite different from the new problem on the surface. The difficulty comes in determining what things are similar and what things are not. Two methods of analogical problem solving that have been studied in AI are transformational and derivational analogy.

17.8.1 Transformational Analogy

Suppose you are asked to prove a theorem in plane geometry. You might look for a previous theorem that is very similar and "copy" its proof, making substitutions when necessary. The idea is to transform a solution to a previous problem into a solution for the current problem. Figure 17.19 shows this process.

An example of transformational analogy is shown in Figure 17.20 [Anderson and Kline, 1979]. The program has seen proofs about points and line segments; for example, it knows a proof that the line segment RN is exactly as long as the line segment OY, given that RO is exactly as long as NY. The program is now asked to prove a theorem about angles, namely that the angle BD is equivalent to the angle CE, given that angles BC and DE are equivalent. The proof about line segments is retrieved and transformed into a proof about angles by substituting the notion of line for point, angle for line segment, AB for R, AC for O, AD for N, and AE for Y. Carbonell [1983] describes one method for transforming old solutions into new solutions. Whole solutions are viewed as states in a problem space called T-space. T-operators prescribe the methods of transforming solutions (states) into other solutions. Reasoning by analogy becomes search in T-space: starting with an old solution, we use means-ends analysis or some other method to find a solution to the current problem.
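In this rendering, the transformation amounts to applying a substitution to a stored solution. The encoding below (proof steps as strings, a dictionary of correspondences) is a hypothetical toy, not the representation used by Anderson and Kline or by Carbonell, but it shows how little of the old solution has to change.

    import re

    # The stored proof that RN = OY, given RO = NY.
    old_proof = [
        "given: segment(R,O) = segment(N,Y)",
        "add segment(O,N) to both sides",
        "conclude: segment(R,N) = segment(O,Y)",
    ]

    # Correspondences between the old problem and the new one about angles,
    # where angle(AB,AD) stands for the angle BD.
    substitution = {
        "segment": "angle",
        "R": "AB", "O": "AC", "N": "AD", "Y": "AE",
    }

    def transform(proof, subst):
        """Copy the old solution, substituting terms for the new problem."""
        pattern = re.compile("|".join(sorted(map(re.escape, subst),
                                             key=len, reverse=True)))
        return [pattern.sub(lambda m: subst[m.group(0)], step) for step in proof]

    for step in transform(old_proof, substitution):
        print(step)
    # Final line: angle(AB,AD) = angle(AC,AE), i.e., angle BD = angle CE.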

Figure 17.20: Solving a Problem by Transformational Analogy

Figure 17.21: Derivational Analogy
17.8.2 Derivational Analogy

Notice that transformational analogy does not look at how the old problem was solved; it only looks at the final solution. Often the twists and turns involved in solving an old problem are relevant to solving a new problem. The detailed history of a problem-solving episode is called its derivation. Analogical reasoning that takes these histories into account is called derivational analogy (see Figure 17.21).

Carbonell [1986] claims that derivational analogy is a necessary component in the transfer of skills in complex domains. For example, suppose you have coded an efficient sorting routine in Pascal, and then you are asked to recode the routine in LISP. A line-by-line translation is not appropriate, but you will reuse the major structural and control decisions you made when you constructed the Pascal program. One way to model this behavior is to have a problem solver "replay" the previous derivation and modify it when necessary. If the original reasons and assumptions for a step's existence still hold in the new problem, the step is copied over. If some assumption is no longer valid, another assumption must be found. If one cannot be found, then we can try to find justification for some alternative stored in the derivation of the original problem. Or perhaps we can try some step marked as leading to search failure in the original derivation, if the reasons for that failure do not hold in the current derivation.
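A bare-bones sketch of the replay idea follows; all of the data structures and helper names are invented for illustration, and a real system would interleave replay with fresh problem solving rather than failing outright.

    def replay(derivation, still_holds, find_alternative):
        """Replay an old derivation in a new problem.

        `derivation` is a list of (step, assumptions) pairs recorded while
        the old problem was solved; `still_holds` tests whether an assumption
        is valid in the new problem; `find_alternative` searches for a
        substitute step when it is not, returning None on failure."""
        new_solution = []
        for step, assumptions in derivation:
            if all(still_holds(a) for a in assumptions):
                new_solution.append(step)          # reasons still hold: copy it
            else:
                substitute = find_alternative(step, assumptions)
                if substitute is None:
                    raise ValueError("no justified alternative; re-derive from here")
                new_solution.append(substitute)
        return new_solution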

Analogy in problem solving is a very open area of research. For a survey of recent work, see Hall [1989].

17.9 Formal Learning Theory

Like many other AI problems, learning has attracted the attention of mathematicians and theoretical computer scientists. Inductive learning in particular has received considerable attention. Valiant [1984] describes a "theory of the learnable," which classifies problems by how difficult they are to learn. Formally, a device learns a concept if it can, given positive and negative examples, produce an algorithm that will classify future examples correctly with probability at least 1 - 1/h. The complexity of learning a concept is a function of three factors: the error tolerance (h), the number of binary features present in the examples (t), and the size of the rule necessary to make the discrimination (f). If the number of training examples required is polynomial in h, t, and f, then the concept is said to be learnable.

Some interesting results have been demonstrated for concept learning. Consider the problem of learning conjunctive feature descriptions. For example, from the list of positive and negative examples of elephants shown in Figure 17.22, we want to induce the description "gray, mammal, large." It has been shown that in conjunctive learning the number of randomly chosen training examples is proportional to the logarithm of the total number of features [Haussler, 1988; Littlestone, 1988]. [Footnote: However, the number of examples must be linear in the number of relevant attributes, i.e., the number of attributes that appear in the learned conjunction.] Since very few training examples are needed to solve this induction problem, it is called learnable. Even if we restrict the learner to positive examples only, conjunctive learning can be achieved when the number of examples is linearly proportional to the number of attributes [Ehrenfeucht et al., 1988]. Learning from positive examples only is a phenomenon not modeled by least-commitment inductive techniques such as version spaces. The introduction of the error tolerance h makes this possible: after all, even if all the elephants in our training set are gray, we may later encounter a genuine elephant that happens to be white. Fortunately, we can extend the size of our randomly sampled training set to ensure that the probability of misclassifying an elephant as something else (such as a polar bear) is an arbitrarily small 1/h.
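The intuition behind these results can be seen in the standard elimination procedure for learning a conjunction from positive examples, sketched below. The feature names and data are made up (Figure 17.22's actual examples are not reproduced); the point is simply that each positive example can only delete literals, so the hypothesis converges after relatively few examples.

    def learn_conjunction(positive_examples):
        """Learn a conjunctive concept from positive examples only.

        Each example maps a feature name to True or False.  Start with the
        most specific conjunction (every literal of the first example) and
        delete any literal that a later positive example contradicts."""
        hypothesis = dict(positive_examples[0])
        for example in positive_examples[1:]:
            for feature in list(hypothesis):
                if example[feature] != hypothesis[feature]:
                    del hypothesis[feature]   # cannot be part of the concept
        return hypothesis

    elephants = [
        {"gray": True, "mammal": True, "large": True, "circus": True},
        {"gray": True, "mammal": True, "large": True, "circus": False},
    ]
    print(learn_conjunction(elephants))
    # {'gray': True, 'mammal': True, 'large': True}; 'circus' is dropped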

Formal techniques have been applied to a number of other learning problems. For example, given positive and negative examples of strings in some regular language, can we efficiently induce the finite automaton that produces all and only the strings in that language? The answer is no; an exponential number of computational steps is required [Kearns and Valiant, 1989].[Footnote: The proof of this result rests on some unproven hypotheses about the complexity of certain number-theoretic functions.] However, if we allow the learner to make specific queries (e.g., "Is string x in the language?"), then the problem is learnable [Angluin, 1987].

Figure 17.22: Six Positive and Negative Examples of the Concept Elephant

It is difficult to tell how such mathematical studies of learning will affect the ways in which we solve AI problems in practice. After all, people are able to solve many exponentially hard problems by using knowledge to constrain the space of possible solutions. Perhaps mathematical theory will one day be used to quantify the use of such knowledge, but this prospect seems far off. For a critique of formal learning theory, as well as of some of the inductive techniques described in Section 17.5, see Amsterdam [1988].

17.10 Neural Net Learning and Genetic Learning

The very first efforts in machine learning tried to mimic animal learning at a neural level. These efforts were quite different from the symbolic manipulation methods we have seen so far in this chapter. Collections of idealized neurons were presented with stimuli and prodded into changing their behavior via forms of reward and punishment. Researchers hoped that by imitating the learning mechanisms of animals, they might build learning machines from very simple parts. Such hopes proved elusive. However, the field of neural network learning has seen a resurgence in recent years, partly as a result of the discovery of powerful new learning algorithms. Chapter 18 describes these algorithms in detail.

While neural network models are based on a computational "brain metaphor," a number of other learning techniques make use of a metaphor based on evolution. In this work, learning occurs through a selection process that begins with a large population of random programs. Learning algorithms inspired by evolution are called genetic algorithms [Holland, 1975; de Jong, 1988; Goldberg, 1989].
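As a rough illustration of the selection metaphor, here is a minimal genetic algorithm over bit strings. The operators and parameter values are generic textbook choices rather than Holland's formulation, and the "one-max" fitness function (count the 1 bits) is only a toy.

    import random

    def genetic_search(fitness, length=20, population_size=50,
                       generations=100, mutation_rate=0.02):
        """Evolve a population of bit strings toward higher fitness."""
        population = [[random.randint(0, 1) for _ in range(length)]
                      for _ in range(population_size)]
        for _ in range(generations):
            # Selection: fitter individuals are more likely to reproduce.
            weights = [fitness(individual) for individual in population]
            parents = random.choices(population, weights=weights, k=population_size)
            # Crossover and mutation produce the next generation.
            next_population = []
            for i in range(0, population_size, 2):
                a, b = parents[i], parents[(i + 1) % population_size]
                cut = random.randrange(1, length)
                for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                    next_population.append(
                        [bit ^ 1 if random.random() < mutation_rate else bit
                         for bit in child])
            population = next_population[:population_size]
        return max(population, key=fitness)

    best = genetic_search(fitness=sum)
    print(sum(best), best)    # typically all (or nearly all) 1s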

17.11 Summary

The most important thing to conclude from our study of automated learning is that learning itself is a problem-solving process. We can cast various learning strategies in terms of the methods of Chapters 2 and 3.

  • Learning by taking advice
    • Initial state: high-level advice
    • Final state: an operational rule
    • Operators: unfolding definitions, case analysis, matching, etc.
  • Learning from examples
    • Initial state: collection of positive and negative examples
    • Final state: concept description
    • Search algorithms: candidate elimination, induction of decision trees
  • Learning in problem solving
    • Initial state: solution traces to example problems
    • Final state: new heuristics for solving new problems efficiently
    • Heuristics for search: generalization, explanation-based learning, utility considerations
  • Discovery
    • Initial state: some environment
    • Final state: unknown
    • Heuristics for search: interestingness, analogy, etc.

A learning machine is the dream system of AI. As we have seen in previous chapters, the key to intelligent behavior is having a lot of knowledge. Getting all of that knowledge into a computer is a staggering task. One hope of sidestepping the task is to let computers acquire knowledge independently, as people do. We do not yet have programs that can extend themselves indefinitely. But we have discovered some of the reasons for our failure to create such systems. If we look at actual learning programs, we find that the more knowledge a program starts with, the more it can learn. This finding is satisfying, in the sense that it corroborates our other discoveries about the power of knowledge. But it is also unpleasant, because it seems that fully self-extending systems are, for the present, still out of reach.

Research in machine learning has gone through several cycles of popularity. Timing is always an important consideration. A learning program needs to acquire new knowledge and new problem-solving abilities, but knowledge and problem solving are topics still under intensive study. If we do not understand the nature of the thing we want to learn, learning is difficult. Not surprisingly, the most successful learning programs operate in fairly well-understood areas (like planning), and not in less well-understood areas (like natural language understanding).

17.12 Exercises

  1. Would it be reasonable to apply Samuel's rote-learning procedure to chess? Why or why not?
  2. Implement the candidate elimination algorithm for version spaces. Choose a concept space with several features (for example, the space of books, computers, animals, etc.). Pick a concept and demonstrate learning by presenting positive and negative examples of the concept.
  3. In Section 17.5.2, the concept "Japanese economy car" was learned through the presentation of five positive and negative examples. Give a sequence of four examples that accomplishes the same goal. In general, what properties of a positive example make it most useful? What makes a negative example most useful?
  4. Recall the problem of learning disjunctive concepts in version spaces. We discussed learning a concept like "European car," where a European car was defined as a car whose origin was either Germany, Italy, or Britain. Suppose we expand the number of discrete values the slot origin might take to include the values Europe and Imported. Suppose further that we have the following isa hierarchy at our disposal:

    The diagram reflects facts such as "Japanese cars are a subset of imported cars" and "Italian cars are a subset of European cars." How could we modify the candidate elimination algorithm to take advantage of this knowledge? Propose new methods of updating the sets G and S that would allow us to learn the concept "European car" in one pass through a set of adequate training examples.

  5. AM exploited a set of 250 heuristics designed to guide AM's behavior toward interesting mathematical concepts. A classic work by Polya [1957] describes a set of heuristics for solving mathematical problems. Unfortunately, Polya's heuristics are not specified in enough detail to make them implementable in a program. In particular, they lack precise descriptions of the situations in which they are appropriate (i.e., the left sides if they are viewed as productions). Examine some of Polya's rules and refine them so that they could be implemented in a problem-solving program with a structure similar to AM's.
  6. Consider the problem of building a program to learn a grammar for a language such as English. Assume that such a program would be provided, as input, with a set of pairs, each consisting of a sentence and a representation of the meaning of the sentence. This is analogous to the experience of a child who hears a sentence and sees something at the same time. How could such a program be built using the techniques discussed in this chapter?
  7. Life is a one-player game invented by John Conway. The game is played on an infinite two-dimensional grid of square cells. At any given time, a cell is either living or dead. Patterns of cells transform themselves according to a simple set of rules:
    • If a cell is living, it continues to live if surrounded by exactly two or three living cells. If it is surrounded by more than three living cells, it dies of overcrowding; if fewer than two of its neighbors are alive, it dies of loneliness.
    • If a cell is dead, it becomes living only if it is surrounded by exactly three living cells. Otherwise, it remains dead.

        For example:

        1. Create input-output pairs for every possible configuration of a cell and its eight neighbors. There will be 512 (2^9) different input vectors. Associated with each input vector will be one output bit: 0 if the next state of the cell is dead, 1 if living. Use the rules above to compute the proper output for each input vector.
        2. Train a three-layer backpropagation network to learn the behavior of a Life cell. Use two hidden units.
        3. Print out the set of weights and biases learned by the network. Now derive a set of (symbolic) rules that concisely describes how the network is actually computing its output. Focus on the behaviors of the two hidden units: how do they respond to their inputs, and what effects do they have on the eventual output?
        4. Compare the rules you derived in part (c) with the rules you used to create the data in part (a).