Chapter 21 - Perception and Action


In the first chapter of this book, we proposed a definition of AI based on the nature of the problems it tackles, namely those for which humans currently outperform computers. So far, we have discussed primarily cognitive tasks, but there are many other tasks that also fall within this realm. In basic perceptual and motor skills, even lower animals possess phenomenal capabilities compared to computers.

Perception involves interpreting sights, sounds, smells, and touch. Action includes the ability to navigate through the world and manipulate objects. In order to build robots that live in the world, we must come to understand these processes. Figure 21.1 shows a design for a complete autonomous robot. Most of AI is concerned only with cognition, the idea being that when intelligent programs are developed, we will simply add sensors and effectors to them. But problems in perception and action are substantial in their own right and are being tackled by researchers in the field of robotics.

In the past, robotics and AI have been largely independent endeavors, and they have developed different techniques to solve different problems. We attempt to characterize the field of robotics at the end of this chapter, but for now, we should note one key difference between AI programs and robots: While AI programs usually operate in computer-simulated worlds, robots must operate in the physical world. As an example, consider making a move in chess. An AI program can search millions of nodes in a game tree without ever having to sense or touch anything in the real world. A complete chess-playing robot, on the other hand, must be capable of grasping pieces, visually interpreting board positions, and carrying on a host of other actions.

The distinction between real and simulated worlds has several implications:

Recent years have seen efforts to integrate research in robotics and AI. The old idea of simply attaching sensors and effectors to existing AI programs has given way to a serious rethinking of basic AI algorithms in light of the problems involved in dealing with the physical world. Research in robotics is likewise affected by AI techniques, since reasoning about goals and plans is essential for mapping perceptions onto appropriate actions. In this chapter, we explore the interface between robotics and AI. We do not delve too deeply into purely robotic issues, but instead focus on how the AI techniques we have seen in this book can be used and/or modified to handle problems that arise in dealing with the physical world.

At this point, one might ask whether physical robots are necessary for research purposes. Since current AI programs already operate in simulated worlds, why not build more realistic simulations that better model the real world? Such simulators do exist, for example, Carbonell and Hood [1986] and Langley et al. [1981b]. There are several advantages to using a simulated world: Experiments can be conducted very rapidly, conditions can easily be replicated, programs can return to previous states at no cost, and sensory input can be treated in a high-level fashion. Furthermore, simulators require no fragile, expensive mechanical parts. The major drawback to simulators is figuring out exactly which factors to build in. Experience with real robots continues to expose tough problems that do not arise even in the most sophisticated simulators. The world turns out, not surprisingly, to be an excellent model of itself, and a readily available one.

21.1 Real-Time Search

We now turn to heuristic search, as exemplified in AI by the A* algorithm. While A* is guaranteed to find an optimal path from the initial state to the goal state, the algorithm has a number of limitations in the real world. For one, the exponential complexity of A* limits the size of problems it can realistically solve and forces us to consider a limited search horizon. Also, having incomplete information about the world can further limit that search horizon. For example, consider the task of navigating from one room to another in an unfamiliar building. The search horizon is limited to how far one can (literally) see at any given time. It is necessary to take steps in the physical world in order to see beyond the horizon, despite the fact that those steps may be nonoptimal ones. Finally, real-time tasks like driving require continuous monitoring and reacting. Because heuristic search is time-consuming, we cannot afford to work out optimal solutions ahead of time.

There is a variation of A* that addresses these issues. It is called Real-Time-A* (RTA*) [Korf, 1988]. This algorithm commits to a real-world action every k seconds, where k is some constant that depends on the depth of the search horizon. Each time RTA* carries out an action, it restarts the search from that point. Thus, RTA* is able to make progress toward a goal state without having to plan a complete sequence of solution steps in advance. RTA* was inspired to a degree by work on computer games.

As we mentioned in Chapter 12, game-playing programs must commit to irrevocable moves because of time constraints.

Algorithm: Real-Time-A*

  1. Set NODE to be the start state.
  2. Generate the successors of NODE. If any of the successors is a goal state, then quit.
  3. Estimate the value of each successor by performing a fixed-depth search starting at that successor. Use depth-first search. Evaluate all leaf nodes using the A* heuristic function f = g + h', where g is the distance to the leaf node and h' is the predicted distance to the goal. Pass heuristic estimates up the search tree in such a way that the f value of each internal node is set to the minimum of the values of its children. [Footnote: It is possible to prune the search tree using a technique called alpha pruning, a single-agent analogue of alpha-beta pruning. Alpha pruning is a branch-and-bound technique of the type we encountered in Chapter 2.]
  4. Set NODE to the successor with the lowest score, and take the corresponding action in the world. Store the old NODE in a table along with the heuristic score of the second-best successor. (With this strategy, we can never enter into a fixed loop, because we never make the same decision at the same node twice.) If this node is ever generated again in step 2, simply look up the heuristic estimate in the table instead of redoing the fixed-depth search of step 3.
  5. Go to step 2.
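
A minimal sketch of this loop in Python appears below. It assumes a hypothetical problem object offering successors(state), is_goal(state), cost(a, b), and a heuristic h(state); these names are illustrative and are not part of Korf's formulation.

```python
# Sketch of RTA*: commit to the best-looking action, remember the second-best
# estimate for the node being left, and reuse stored estimates on revisits.

def lookahead(problem, state, depth):
    """Step 3: fixed-depth depth-first search backing up the minimum f = g + h'."""
    if depth == 0 or problem.is_goal(state):
        return problem.h(state)
    return min(problem.cost(state, s) + lookahead(problem, s, depth - 1)
               for s in problem.successors(state))

def rta_star(problem, start, depth=3):
    table = {}                               # step 4's table of stored estimates
    node, path = start, [start]
    while not problem.is_goal(node):
        succs = list(problem.successors(node))            # step 2
        goal = next((s for s in succs if problem.is_goal(s)), None)
        if goal is not None:
            return path + [goal]
        best_f = second_f = float('inf')
        best_succ = None
        for succ in succs:
            if succ in table:                # revisited node: reuse stored estimate
                h_est = table[succ]
            else:                            # step 3: fixed-depth lookahead
                h_est = lookahead(problem, succ, depth)
            f = problem.cost(node, succ) + h_est
            if f < best_f:
                best_f, second_f, best_succ = f, best_f, succ
            elif f < second_f:
                second_f = f
        table[node] = second_f               # step 4: store the second-best score
        node = best_succ                     # commit to the action in the world
        path.append(node)
    return path
```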

We can adjust the depth to which we search in step 3, depending on how much time we want to spend planning versus executing actions in the world. Provided that every part of the search space is accessible from every other part, RTA* is guaranteed to find a path to a solution state if one exists. The path may not be an optimal one, however. The deeper we search in step 3, the shorter our average solution paths will be. Of course, the task itself may impose limits on how deep we can search, as a result of incomplete information.

RTA* is just one example of a limited-horizon search algorithm. Another algorithm, due to Hansson and Mayer [1989], uses Bayesian inference. Dean and Boddy [1988] define a related notion, the anytime algorithm. An anytime algorithm is one that can be interrupted and queried at any point during its computation. The longer the algorithm runs, the more accurate its answer is.

Now we turn to more specific techniques aimed at various perceptual and motor problems. Later, we investigate architectures for integrating perception, action, and cognition. It should be noted that this is only a brief survey of a very large and active field of research. Those interested in investigating these issues more deeply should consult robotics texts such as Brady [1982] and Craig [1985].

21.2 Perception

We perceive our environment through many channels: sight, sound, touch, smell, taste. Many animals possess these same perceptual capabilities, and others are able to monitor entirely different channels. Robots, too, can process visual and auditory information, and they can also be equipped with more exotic sensors, such as laser rangefinders, speedometers, and radar.

Two extremely important sensory channels for humans are vision and spoken language. It is through these two faculties that we gather almost all of the knowledge that drives our problem-solving behaviors.

21.2.1 Vision

Accurate machine vision opens up a new realm of computer applications. These applications include mobile robot navigation, complex manufacturing tasks, analysis of satellite images, and medical image processing. In this section, we investigate how we can transform raw camera images into useful information about the world.

A video camera provides a computer with an image represented as a two-dimensional grid of intensity levels. Each grid element, or pixel, may store a single bit of information (that is, black/white) or many bits (perhaps a real-valued intensity measure and color information). A visual image is composed of thousands of pixels. What kinds of things might we want to do with such an image? Here are four operations, in order of increasing complexity:

  1. Signal Processing--Enhancing the image, either for human consumption or as input to another program.
  2. Measurement Analysis--For images containing a single object, determining the two-dimensional extent of the object depicted.
  3. Pattern Recognition--For single-object images, classifying the object into a category drawn from a finite set of possibilities.
  4. Image Understanding--For images containing many objects, locating the objects in the image, classifying them, and building a three-dimensional model of the scene.

See Niblack [1986] for algorithms that perform the first two operations. The third operation, pattern recognition, varies in its difficulty. It is possible to classify two-dimensional (2-D) objects, such as machine parts coming down a conveyor belt, but classifying 3-D objects is harder because of the large number of possible orientations for each object. Image understanding is the most difficult visual task, and it has been the subject of the most study in AI. While some aspects of image understanding reduce to measurement analysis and pattern recognition, the entire problem remains unsolved, because of difficulties that include the following:

As a result, 2-D images are highly ambiguous. Given a single image, we could construct any number of 3-D worlds that would give rise to the image. For example, consider the ambiguous image of Figure 21.2. It is impossible to decide what 3-D solid it portrays. In order to determine the most likely interpretation of a scene, we have to apply several types of knowledge.

For example, we may invoke knowledge about low-level image features, such as shadows and textures. Figure 21.3 shows how such knowledge can help to disambiguate the image. Having multiple images of the same object can also be useful for recovering 3-D structure. The use of two or more cameras to acquire multiple simultaneous views of an object is called stereo vision. Moving objects (or moving cameras) also supply multiple views. Of course, we must also possess knowledge about how motion affects the images that are produced. Still more information can be gathered with a laser rangefinder, a device that returns an array of distance measures, much as sonar does. While rangefinders are still somewhat expensive, integration of visual and range data will soon become commonplace. Integrating different sense modalities is called sensor fusion. Other image factors we might want to consider include shading, color, and reflectance.

High-level knowledge is also important for interpreting visual data. For example, consider the ambiguous object at the center of Figure 21.4(a). While no low-level image feature can tell us what the object is, the object's surroundings provide us with top-down expectations. Expectations are critical for interpreting visual scenes, but resolving expectations can be tricky. Consider the scene shown in Figure 21.4(b). All objects in this scene are ambiguous; the same shapes might be interpreted elsewhere as an amoeba, logs in a fireplace, and a basketball. As a result, there are no clear-cut top-down expectations. But the preferred interpretations of egg, bacon, and plate reinforce each other mutually, providing the necessary expectations.

So how can we bring all of this knowledge to bear in an organized fashion? One possible architecture for vision is shown in Figure 21.5. The very first step is to convert the analog video signal into a digital image. The next step is to extract image features like edges and regions. Edges can be detected by algorithms that look for sets of adjacent pixels with differing values.

Figure 21.3: Using Low-Level Knowledge to Interpret an Image

Figure 21.4: Using High-Level Knowledge to Interpret an Image

Since pixel values are affected by many factors, small edges with similar orientations must be grouped into larger ones [Ballard and Brown, 1982]. Regions, on the other hand, are found by grouping similar pixels together. Edge and region detection are computationally intensive processes, but ones that can be readily mapped onto parallel hardware. The next step is to infer 3-D orientations for the various regions. Texture, illumination, and range data are all useful for this task. Assumptions about the kinds of objects that are portrayed can also be valuable, as we saw in the Waltz labeling algorithm (Section 14.3). Next, surfaces are collected into 3-D solids. Small solids are combined into larger, composite objects. At this point, the scene is segmented into discrete entities. The final step involves matching these entities against a knowledge base in order to pick the most likely interpretations for them. Organizing such a knowledge base of objects is difficult, though the knowledge-structuring techniques we studied in Part II are useful. As we demonstrated above, it may be impossible to interpret objects in isolation. Therefore, higher-level modules can pass hypotheses back down to lower-level modules, which check for predictions made by the hypotheses.
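
As a concrete illustration of the edge-detection step, the following Python sketch computes a gradient magnitude at each pixel and thresholds it; the Sobel-style kernel and the threshold value are illustrative choices rather than part of any system described above.

```python
# Gradient-based edge detection: mark pixels where adjacent intensities differ sharply.
import numpy as np

def sobel_edges(image, threshold=100.0):
    """Return a boolean mask of edge pixels for a 2-D grayscale intensity array."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient kernel
    ky = kx.T                                                         # vertical gradient kernel
    h, w = image.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = image[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    magnitude = np.hypot(gx, gy)      # gradient strength at each pixel
    return magnitude > threshold      # strong gradients mark candidate edge points
```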

This is only one way of structuring an image understanding program. It highlights the spectrum of low- to high-level knowledge required for 3-D vision. As with other AI tasks, the success of a vision program depends critically on the way it represents and applies knowledge. For more information on computer vision, see Marr [1982], Ballard and Brown [1982], and Horn [1986].

21.2.2 Speech Recognition

Natural language understanding systems usually accept typed input, but for a number of applications this is not acceptable. Spoken language is a more natural form of communication in many human-computer interfaces. Speech recognition systems have been available for some time, but their limitations have prevented widespread use. Below are five major design issues in speech systems. These issues also provide dimensions along which systems can be compared with one another.

Existing speech systems make various compromises. Early systems, like DRAGON [Baker, 1975], HEARSAY [Lesser et al., 1975], and HARPY [Lowerre, 1976], dealt with single-user, continuous speech, and vocabularies of up to a thousand words. They achieved word accuracy rates of 84 to 97 percent. TANGORA [IBM speech recognition group, 1985] moved to speaker independence and a large, 20,000-word vocabulary, but sacrificed continuous speech. TANGORA is 97 percent accurate. A system built at Bell Labs for continuous, speaker-independent digit recognition (for phone numbers) has also produced 97 percent accuracy [Rabiner et al., 1988]. SPHINX [Lee and Hon, 1988] is the first system to achieve high accuracy (96 percent) on real-time, speaker-independent, continuous speech with a vocabulary of 1000 words.

What techniques do these systems use? HEARSAY used a blackboard architecture, of the kind we discussed in Chapter 16. Using this method, various knowledge sources enter positive and negative evidence for different hypotheses, and the blackboard integrates all the evidence. Low-level phonemic knowledge sources provide information that high-level knowledge sources can use to make hypotheses about what words appear in the input. The high-level knowledge sources can then generate expectations that can be checked by the low-level ones.

The HARPY system also used knowledge to direct its reasoning, but it precompiled all that knowledge into a very large network of phonemes. In the network model, an interpreter tries to find the path through the network that best matches the spoken input. This path can be found with any number of heuristic search techniques, for example, beam search. HARPY was much faster than HEARSAY, but the blackboard architecture that HEARSAY used was more general and easily extensible.

Most modern speech systems are learning systems. In other words, they accept sample inputs and interpretations, and modify themselves appropriately until they are able to transform speech waveforms into written words. So far, statistical learning methods have proven most useful for learning this type of transformation. The statistical method used in the SPHINX system is called Hidden Markov modeling. A hidden Markov model (HMM) is a collection of states and transitions. Each transition leaving a state is marked with (1) the probability with which that transition is taken, (2) an output symbol, and (3) the probability that the output symbol is emitted when the transition is taken. The problem of decoding a speech waveform turns into the problem of finding the most likely path (set of transitions) through an appropriate HMM. It is possible to tune the probabilities of an HMM automatically so that these paths correspond to correct interpretations of the waveform. The technique for doing this is called the forward-backward algorithm.
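
To make the path-finding step concrete, here is a minimal Viterbi sketch in Python for recovering the most likely state sequence from an HMM and a sequence of observed symbols. The dictionary-based model format is an illustrative simplification (it attaches emissions to states rather than to transitions) and is not the representation used in SPHINX.

```python
# Viterbi decoding: find the most probable state path for an observation sequence.

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the most likely path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states)
            best[t][s] = prob
            back[t][s] = prev
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))
```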

Connectionist systems also show promise as a learning mechanism for speech recognition. One problem with connectionist models is that they do not deal very well with time-varying data. New types of networks, such as recurrent and time-delay networks [Waibel et al., 1989], are being employed to overcome these difficulties.

In our discussion of vision in Section 21.2.1, we saw that higher-level sources of knowledge can be used to manage uncertainty at lower levels. Speech recognition also has sources of higher-level knowledge. We have already studied some of these in Chapter 15. Syntactic knowledge can be used to identify constituent phrases, semantic knowledge to disambiguate word senses, discourse knowledge to dereference pronouns, and so forth. Early speech recognition systems sought to make use of this higher-level knowledge in order to constrain the interpretation at the lower levels. As we saw in Chapter 14, a speech system that cannot decide between "the cat's cares are few" and "the cat scares are few" can invoke high-level knowledge to choose one alternative over the other.

However, modern speech systems perform fairly well without any sophisticated syntactic or semantic models of language. Instead, simple statistical models are used. For example, SPHINX uses a word-pair grammar, which tells it which words can legally appear adjacent to one another in the input. TANGORA uses a trigram grammar, which, given the previous two words in the input, yields the probability that a given word will occur next.

Still, no speech system is 100 percent accurate. There has recently been renewed interest in integrating speech recognition and natural language processing in order to overcome the final hurdle. For example, ATNs and unification-based grammars can be used to constrain the hypotheses made by a speech system. Thus far, integration has proved difficult, because natural language grammars do not offer much in the way of constraints.

In the speech recognition literature, there is a quantitative measure of grammar, called perplexity. Perplexity measures the number of words that can legally appear next in the input (on average). The telephone number recognition task has a perplexity of 10, because at any decision point, there are ten alternatives. On a sample 1000-word English task, a word-pair grammar may reduce the perplexity from 1000 down to 60. A bigram grammar may reduce it further, perhaps to 20 [Lee and Hon, 1988].
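
Perplexity can be computed directly from the probabilities a language model assigns to the words of a test text, as the inverse geometric mean of those probabilities. The sketch below uses illustrative inputs; a uniform choice among ten digits at every position yields a perplexity of 10, matching the telephone number task.

```python
# Perplexity as the average branching factor implied by a probabilistic grammar.
import math

def perplexity(word_probs):
    """word_probs: probability the model assigned to each successive word of a test text."""
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)

print(perplexity([0.1] * 7))   # ten equally likely digits at each position -> about 10.0
```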

While natural language grammars accurately predict word categories (such as noun and verb), they say nothing about which words within a category are likely to show up in the input. For example, given the word "the," a grammar might hypothesize that the next word is either an adjective or a noun. But this knowledge does us little good when there are thousands of possible adjectives and nouns to choose from. Thus, it is natural to turn to statistical, or collocational, facts about language. For example, if the word "doctor" is recognized, then one might expect to hear the word "nurse" later in the input, but not "Norse." Collocational data, unlike more complex syntactic and semantic structures, can be extracted automatically from large on-line bodies of text. Ultimately, we want to substitute semantic and discourse information for statistical data. If we know the conversation is about doctors, and if we know that doctors and nurses typically work together, then we should be able to generate the proper expectations. Such a strategy will require large knowledge bases and a deeper understanding of semantics and discourse.

21.3 Action

Mobility and intelligence seem to have evolved together. Immobile creatures have little use for intelligence, while it is intelligence that puts mobility to effective use. In this section, we investigate the nature of mobility in terms of how robots navigate through the world and manipulate objects.

21.3.1 Navigation

Navigation means moving around the world: planning routes, reaching desired destinations without bumping into things, and so forth.

Figure 21.6: A Path Planning Problem

Figure 21.7: Constructing a Visibility Graph

Like vision and speech recognition, navigation is something humans do fairly easily.

Many classic AI planning problems involve navigation. The STRIPS system, for example, gave high-level instructions to a robot moving through a set of rooms, carrying objects from one to another. Plans to solve goals like "move box A into room X" contained operators like MOVE(Y, X), meaning "move from room Y to room X." The planner did not concern itself with how this movement was to be realized in the world; from its perspective, the manner of movement was something akin to teleportation. A real robot, however, must consider the low-level details involved in getting from here to there.

Navigational problems are surprisingly complex. For example, suppose that there are obstacles in the robot's path, as in Figure 21.6. The problem of path planning is to plot a continuous set of points connecting the initial position of the robot to its desired position.

If the robot is so small as to be considered a point, the problem can be solved straightforwardly by constructing a visibility graph. Let S be the set consisting of the initial and final positions as well as the vertices of all obstacles. To form the visibility graph, we connect every pair of points in S that are visible from one another, as shown in Figure 21.7. We can then search the graph (perhaps using the A* algorithm) to find an optimal path for the robot.
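
A minimal sketch of this procedure appears below, assuming polygonal obstacles given as lists of (x, y) vertex tuples. The visibility test only checks for proper edge crossings, so grazing contacts and segments cutting through an obstacle's interior between its own vertices would need additional care in a full implementation; the search uses straight-line distance as the h' estimate.

```python
# Visibility-graph path planning for a point robot among polygonal obstacles.
import heapq
import itertools
import math

def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    def ccw(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (ccw(q1, q2, p1) * ccw(q1, q2, p2) < 0 and
            ccw(p1, p2, q1) * ccw(p1, p2, q2) < 0)

def visible(a, b, obstacles):
    """True if the segment a-b crosses no obstacle edge."""
    for poly in obstacles:
        for i in range(len(poly)):
            if segments_cross(a, b, poly[i], poly[(i + 1) % len(poly)]):
                return False
    return True

def visibility_graph(start, goal, obstacles):
    points = [start, goal] + [v for poly in obstacles for v in poly]
    graph = {p: [] for p in points}
    for a, b in itertools.combinations(points, 2):
        if visible(a, b, obstacles):
            d = math.dist(a, b)
            graph[a].append((b, d))
            graph[b].append((a, d))
    return graph

def shortest_path(graph, start, goal):
    """A*-style search over the graph, with straight-line distance as h'."""
    frontier = [(math.dist(start, goal), 0.0, start, [start])]
    visited = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nbr, d in graph[node]:
            heapq.heappush(frontier,
                           (g + d + math.dist(nbr, goal), g + d, nbr, path + [nbr]))
    return None
```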

Most robots have significant physical bulk, however, and we must take this into account when we plan paths. Consider the problem shown in Figure 21.8, where the robot has a pentagonal shape.

Figure 21.8: Another Path Planning Problem

Figure 21.9: Constructing Configuration Space Obstacles

Fortunately, we can reduce this problem to the previous path-planning problem. The algorithm is as follows: First choose a point P on the surface of the robot, then increase the size of the obstacles so that they cover all points that P cannot enter, because of the physical size and shape of the robot. Now, simply construct and search a visibility graph based on P and the vertices of the new obstacles, as in Figure 21.9. The basic idea is to reduce the robot to a point P and do path planning in an artificially constructed space, known as configuration space, or c-space [Lozano-Perez et al., 1984].
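
For a convex robot at a fixed orientation, the grown obstacle can be computed as a Minkowski sum: offset each obstacle vertex by the negated displacement of each robot vertex from the reference point P, then take the convex hull of the resulting points. The sketch below assumes convex polygons given as lists of (x, y) tuples; the function names are illustrative.

```python
# Growing a convex obstacle into its configuration-space counterpart for a
# fixed-orientation convex robot with reference point P.
from itertools import product

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def cspace_obstacle(obstacle, robot, reference):
    """Positions of P at which the robot would overlap the obstacle."""
    # Offsets P - v for each robot vertex v: the robot's shape negated about P.
    offsets = [(reference[0] - x, reference[1] - y) for x, y in robot]
    grown = [(ox + dx, oy + dy) for (ox, oy), (dx, dy) in product(obstacle, offsets)]
    return convex_hull(grown)
```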

If we want to allow rotations, we can represent the robot as a combination of the point P and some angle of rotation θ. The robot can now be considered as a point moving through three-dimensional space (x, y, θ). Obstacles can be transformed into three-dimensional c-space objects, and a visibility graph can again be created and searched.

An alternative approach to obstacle avoidance is the use of potential fields [Khatib, 1986]. With this technique, the direction of a moving robot is continually recomputed as a function of its current position relative to obstacles and its destination. The robot is essentially repelled by obstacles and attracted to the destination point. This approach is especially useful for correcting positioning errors that accumulate during a robot's journey and for dealing with unexpected obstacles. It can be combined with configuration space path planning to enable robust navigation [Krogh and Thorpe, 1986].
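
A minimal sketch of a potential-field controller appears below. It assumes point obstacles and returns a unit step direction for the robot; the gain constants and influence radius are illustrative, and a practical controller would also need to deal with local minima of the combined field.

```python
# Potential-field steering: attraction toward the goal, repulsion from obstacles.
import math

def potential_field_step(pos, goal, obstacles, k_att=1.0, k_rep=100.0, radius=5.0):
    """Return a unit (dx, dy) step for a robot at pos, given point obstacles."""
    # Attractive component pulls straight toward the goal.
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    # Repulsive component pushes away from each obstacle within the influence radius.
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0 < d < radius:
            push = k_rep * (1.0 / d - 1.0 / radius) / (d * d)
            fx += push * dx / d
            fy += push * dy / d
    # Normalize the combined force to a unit step direction.
    norm = math.hypot(fx, fy) or 1.0
    return fx / norm, fy / norm
```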

Road following is another navigational task that has received a great deal of attention. The object of road following is to steer a moving vehicle so that it stays centered on a road and avoids obstacles. Much of the problem comes in locating the edges of the road despite varying light, weather, and ground conditions. At present, this control task is feasible only for fairly slow-moving vehicles [Shafer and Whittaker, 1989]. Increases in speed demand more reactivity and thus more real-time computation.

21.3.2 Manipulation

Robots have found numerous applications in industrial settings. Robot manipulators are able to perform simple repetitive tasks, such as bolting and fitting automobile parts, but these robots are highly task-specific. It is a long-standing goal in robotics to build robots that can be programmed to carry out a wide variety of tasks.

A manipulator is composed of a series of links and joints, usually terminating in an end-effector, which can take the form of a two-pronged gripper, a humanlike hand, or any of a variety of tools. One general manipulation problem is called pick-and-place, in which a robot must grasp an object and move it to a specific location. For example, consider Figure 21.10, where the goal is to place a peg in a hole.

There are two main subtasks here.

Figure 21.10: A Pick-and-Place Task

The first is to design a robot motion that ends with the object stably grasped between the two fingers of the robot. Clearly some form of path planning, as discussed above, can be used to move the arm toward the object, but we need to modify the technique when it comes to the fine motion involved in the grasp itself. Here, uncertainty is a critical problem. Even with the vision techniques of Section 21.2.1, a robot can never be sure of the precise location of the peg or the arm. Therefore it would be a mistake to plan a grasp motion in which the gripper is spread only wide enough to permit the peg to pass, as in Figure 21.11(a). A better strategy is to open the gripper wide, then close gradually as the gripper nears the peg, as in Figure 21.11(b). That way, if the peg turns out to be located some small distance away from where we thought it was, the grasp will still succeed. Although this strategy depends less on precise vision, it requires some tactile sensitivity in order to terminate the grasp. Unless we take special care in designing grasping motions, uncertainty can lead to disasters. For example, should the left side of the gripper touch the peg one second before the right side does, the peg may fall, thus foiling the grasp. Brost [1988] and Mason et al. [1988] give robust algorithms for grasping a wide variety of objects.

After the peg is stably grasped, the robot must place it in the hole. This subtask resembles the path-planning problem, although it is complicated by the fact that moving the peg through 3-D space requires careful orchestration of the arm's joints. Also, we must seriously consider the problems introduced by uncertainty. Figure 21.12(a) shows a naive strategy for placing the peg. Failure will result from even a slight positioning error, because the peg will jam flatly on the outer surface. A better strategy is shown in Figure 21.12(b). We slide the peg along the surface, applying downward pressure so that the peg enters the hole at an angle. After this happens, we straighten the peg gradually and push it down into the hole.

This type of motion, which reacts to forces generated by the world, is called compliant motion.

Figure 21.11 : Naive and Clever Strategies for Grasping

Compliant motion is very robust in the face of uncertainty. Humans employ compliant motion in a wide variety of activities, such as writing on chalkboards.

So given a pick-and-place problem, how can we automatically generate a sequence of compliant motions? One approach [Lozano-Perez et al., 1984] is to use the familiar problem-solving process of backward chaining. Our initial and goal states for the peg-in-hole problem are represented as points in configuration space, as shown in Figure 21.13. First, we compute the set of points in c-space from which we are guaranteed to reach the goal state in a single compliant motion, assuming a certain degree of uncertainty in initial position and direction of movement and certain facts about relative friction. This set of points is called the goal state's strong pre-image. [Footnote: The set of points from which it is possible to reach the state in a single motion is called the state's weak pre-image.] In Figure 21.13, the strong pre-image of the goal state is shown in gray. Now we use backward chaining to design a set of motions that is guaranteed to get us from the initial state to some point in the goal state's strong pre-image. Recursively applying this procedure will eventually yield a set of motions that, while individually uncertain, combine to form a guaranteed plan.

21.4 Robot Architectures

Now let us turn to what happens when we put it all together---perception, cognition, and action. There are many decisions involved in designing an architecture that integrates all these capabilities, among them:

With these issues in mind, let's look briefly at a few existing robot architectures.

CODGER [Shafer et al., 1986] is an architecture for controlling vehicles in outdoor road-following tasks. CODGER uses a blackboard structure to organize incoming perceptual data. The system's control is centralized and hierarchical: all numerical data from sensors are fused in order to build up a consistent model of the world. This model is represented symbolically. CODGER has been used to build a system for driving the experimental NAVLAB [Shafer and Whittaker, 1989] vehicle, a commercial van that has been altered for computer control via electric and hydraulic sensors. The NAVLAB is completely self-contained, with room for several on-board computers and researchers.

Brooks [1986] describes the subsumption architecture for building autonomous robots for indoor navigation and manipulation. Behaviors are built up from layers of simple, numeric finite-state machines. Bottom layers consist of reactive, instinctual behaviors such as obstacle avoidance and wandering. Upper layers consist of behaviors like object identification and reasoning about object motions. The various behaviors operate in a decentralized fashion, computing independently, and suppressing or informing one another. Such an organization encourages reactivity; for example, high-level navigation behavior is suppressed abruptly when an obstacle moves to block a robot's path. In fact, the subsumption architecture takes reactivity to the extreme. Separate modules monitor only the sensors that affect their behavior, and there are no explicit goals, plans, or world models in these systems. They simply react to the situation at hand. For example, the task of one such robot is to wander the halls, picking up soda cans and depositing them in a bin. When the robot locates a can, several modules steer the robot toward it. Modules governing the robot arm continuously monitor the physical wheels of the robot. When the wheels stop, the arm extends to grasp the can. Notice that all these motions are decentralized and reactive; nowhere in the robot is there any explicit plan for how to pick up the soda can, or how to pick up soda cans in general.
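
The following is a much-simplified sketch of this style of behavior arbitration. The behaviors, sensor fields, and fixed priority ordering are purely illustrative; Brooks's architecture wires together networks of finite-state machines with suppression links rather than a simple priority list.

```python
# Toy behavior arbitration in the spirit of a layered, reactive controller.

def avoid_obstacles(sensors):
    if sensors.get("front_distance", float("inf")) < 0.5:
        return ("turn", 90)            # reactive bottom-layer behavior
    return None

def seek_can(sensors):
    if sensors.get("can_bearing") is not None:
        return ("turn", sensors["can_bearing"])   # steer toward a spotted can
    return None

def wander(sensors):
    return ("forward", 0.2)            # default behavior when nothing else fires

# Earlier entries take priority; the first behavior that produces a command
# suppresses those below it.
BEHAVIORS = [avoid_obstacles, seek_can, wander]

def control_step(sensors):
    for behavior in BEHAVIORS:
        command = behavior(sensors)
        if command is not None:
            return command

# Example: an obstacle directly ahead overrides can-seeking and wandering.
print(control_step({"front_distance": 0.3, "can_bearing": 15}))   # ('turn', 90)
```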

This kind of organization presents a perspective on problem solving similar to the one we described in Section 13.7. Advantages of the subsumption architecture include simplicity and speed, since programs for controlling such robots are simple enough that they can be rendered easily into hardware. Also, modeling the real world is a very difficult task, one that the subsumption architecture avoids. On the other hand, it is not clear that the subsumption architecture will scale up to complex planning problems. Subsumption robots tend to lack the flexibility that traditional problem solvers display in being able to reason about a wide variety of tasks. Also, they lack the ability to reflect on their own actions. For example, if the wheels of the soda can robot should stop turning because of a loose connection, the robot arm will mindlessly extend forward in search of a nonexistent can. While the CODGER architecture emphasizes data fusion, subsumption robots emphasize data fission. A series of subsumption robots have been built, and they demonstrate how reactive systems are capable of much more interesting and varied behavior than was previously thought. It is unknown whether these architectures are capable of achieving tasks that seem to require significant amounts of planning.

TCA [Simmons and Mitchell, 1989] is an architecture that combines the idea of reactive systems with traditional AI planning. TCA is a distributed system with centralized control, designed to control autonomous robots for long periods in unstructured environments, such as the surface of Mars. TCA particularly addresses issues that arise in the context of multiple goals and limited resources. The architecture provides mechanisms for hierarchical task management and allows action based on incomplete plans. Because robots gather new information by moving through the world, TCA permits plans to be terminated early should higher-priority goals arise. Some situations require highly reactive behavior. TCA achieves high-speed response by parallelizing planning and execution whenever possible. For example, in designing walking motions over rough terrain, TCA plans one step, initiates it, and then begins to plan the next step before the leg motion has been completed.

Another program for combining heuristic problem solving with reactivity is called THEO-Agent [Mitchell, 1990]. THEO-Agent contains two subsystems, a reactive engine and a general problem solver (called THEO [Mitchell et al., 1989]). When the reactive subsystem fails to suggest a course of action, the problem solver creates a plan for the robot. As it executes the plan, the robot uses explanation-based learning to create new reactive modules. Thus, the robot becomes increasingly reactive with experience. Robo-SOAR [Laird et al., 1989], an extension of the SOAR problem-solving system, is another learning robot architecture.

PRS [Georgeff and Lansky, 1987] is a symbolic robot planning system that interleaves planning and execution. In PRS, goals represent robot behaviors, not world states. PRS contains procedures for turning goals into subgoals or iterations thereof. A procedure can be invoked by either the presence of a goal or the presence of some sensory input. Thus, the robot is capable of goal-directed behavior but can also react when the world changes or when plans fail. Goals and procedures are represented symbolically, and a central reasoner uses a stack to oversee the invocation of procedures.

21.5 Summary

The field of robotics is often described as the subfield of AI that is concerned with perceptual and motor tasks. As Figure 21.1 suggests, the tables can easily be turned, and AI could well be the subfield of robotics that deals with cognition. Indeed, Brady [1985] has proposed a definition of robotics with this flavor:

Another definition, suggested by Grossman,[Footnote: David Grossman, after-dinner speech delivered at the 7th NSF Grantees Conference, Ithaca, NY, 1974.] reads as follows:

The word "surprisingly" suggests a moving-target definition. It should be noted that the first automatic dishwashing machines were called robots by their designers. But after a while, it became less surprising that a machine could wash dishes, and the term "robot" fell away. This characterization of robotics is similar to the one we proposed for AI in Chapter 1. There, we characterized AI as the study of problems in which humans currently perform better than computers. As a result, programs that solve calculus problems are no longer considered artificial intelligence. [Footnote: We must be careful here. When movable-type printing was first introduced, it was called artificial writing, because it seemed to be automating what scribes had been doing for previous centuries. Of course, printing only automates a small portion of the writing process. It is often more enlightening to view AI programs and robots as tools for enhancing human capabilities, rather than as independent, autonomous agents [Hill, 1989].]

These moving-target definitions accurately differentiate actual AI work and robotics work. AI tends to focus on uniquely human capabilities, while robotics aims to produce physical, animate behaviors. As we have seen in this chapter, however, many interesting problems lie at the intersection of AI and robotics, and only by combining techniques from both fields will we be able to design intelligent robots that live in the world.

21.6 Exercises

  1. Describe scenarios in which the following features are critical:
    1. Reactivity-The robot must react quickly to a changing environment.
    2. Robustness-The robot must act appropriately, in spite of incomplete or inexact sensory data.
    3. Recoverability-When a plan fails to bring about expected results, the robot must find another way to achieve its goal.

    Why aren't the planning techniques described in Chapter 13 sufficient to ensure these characteristics?

  2. Describe three different ways of combining speech recognition with a natural language understanding system. Compare and contrast them in terms of expected performance and ease of implementation.
  3. Say each of the following phrases very slowly, and write down the sounds you use. Then gradually speed up, and continue to write down the sounds. Finally, say them the way you would in ordinary speech. How do the sounds change as you move through each series? What are the implications of these changes for continuous speech recognition?
    1. could you
    2. boy's school
    3. the store, the elevator
    4. sharp point
    5. stop it
    6. want to go
  4. Create a search graph, labeled with heuristic estimates, that shows the RTA* algorithm entering the same node twice. Explain what would happen if RTA* did not keep track of previously visited states.
  5. In Section 21.1, we said that the RTA* algorithm is guaranteed to find a path to a solution state if such a path exists and if every part of the search space is accessible from every other part. Why is this second qualification necessary? Give an example in which, without it, a solution will not be found.
  6. Consider the following variation on the peg-in-hole problem:

    Explain, using the concept of a strong pre-image, why this problem is easier than the standard peg-in-hole problem of Figure 21.10.