To understand something is to transform it from one representation into another, where this second representation has been chosen to correspond to a set of available actions that could be performed and where the mapping has been designed so that for each event, an appropriate action will be performed. There is very little absolute in the notion of understanding. If you say to an airline database system "I need to go to New York as soon as possible," the system will have "understood" if it finds the first available plane to New York. If you say the same thing to your best friend, who knows that your family lives in New York, she will have "understood" if she realizes that there may be a problem in your family and you may need some emotional support. As we talk about understanding, it is important to keep in mind that the success or failure of an "understanding" program can rarely be measured in an absolute sense but must instead be measured with respect to a particular task to be performed. This is true both of language-understanding programs and also of understanders in other domains, such as vision.
For people, understanding applies to inputs from all the senses. Computer understanding has so far been applied primarily to images, speech, and typed language. In this chapter we discuss issues that cut across all of these modalities. In Chapter 15, we explore the problem of typed natural language in more detail, and in Chapter 21, we look at speech and vision problems. Although we have defined understanding above as the process of mapping into appropriate actions, we are not precluding a view of understanding in which inputs are simply interpreted and stored for later use. In such a system, the appropriate action is to store the proper representation. This view of understanding describes what occurs in most image-understanding programs and some language-understanding programs. Taking direct action describes what happens in systems in which language, either typed or spoken, is used in the interface between user and computer.
There are four major factors that contribute to the difficulty of an understanding problem:
1. The complexity of the target representation into which the input is being mapped
2. The type of mapping: one-to-one, many-to-one, one-to-many, or many-to-many
3. The level of interaction among the components of the input
4. The presence of noise in the input
A few examples will illustrate the importance of each of these factors.
Complexity of the Target Representation
Suppose English sentences are being used for communication with a keyword-based data retrieval system. Then an input sentence requesting information on some topic would need to be translated into a simple representation, such as a list of keywords to be matched against the database.
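As a concrete illustration (the specific sentence and target shown in the original display are not reproduced here; the function and stopword list below are hypothetical), such a front end might simply reduce a request to its content words:

```python
# Hypothetical sketch of a keyword-based front end's target
# representation: an English request is reduced to a flat set of
# content keywords. The stopword list is illustrative only.
STOPWORDS = {"i", "want", "need", "to", "read", "about", "the", "all"}

def to_keyword_query(sentence):
    """Map an English request to the set of its content keywords."""
    words = (w.strip(".,?!").lower() for w in sentence.split())
    return {w for w in words if w not in STOPWORDS}

print(to_keyword_query("I want to read all about the presidential election."))
# -> {'presidential', 'election'}
```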
But now suppose that English sentences are being used to provide input to a program that records events so that it can answer a variety of questions about those events and their relationships. For example, consider the following story:
Bill told Mary he would not go to the movies with her.
Her feelings were hurt.
The result of understanding this story could be represented, using the conceptual dependency model that we discussed in Chapter 10, as shown in Figure 14.1. This representation is considerably more complex than that for the simple query. All other things being equal, constructing such a complex representation is more difficult than constructing a simple one since more information must be extracted from the input sentences. Extracting that information often requires the use of additional knowledge about the world described by the sentences.
Type of Mapping
Recall that understanding is the process of mapping an input from its original form to a more useful one. The simplest kind of mapping to deal with is one-to-one (i.e., each different statement maps to a single target representation that is different from that arising from any other statement). Very few input systems are totally one-to-one. But as an example of an almost one-to-one mapping, consider the language of arithmetic expressions in many programming languages. In such a language, a mapping such as the following might occur:
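The original display is not reproduced here, but the kind of mapping meant can be sketched as follows, using Python's own expression parser as an illustrative stand-in for a programming-language front end: each arithmetic expression corresponds to essentially one parse tree.

```python
import ast

# Illustrative sketch: map an arithmetic expression to a nested
# prefix-form tuple. Each distinct expression yields essentially one
# target structure -- a (nearly) one-to-one mapping.
def to_tree(expr):
    def walk(node):
        if isinstance(node, ast.BinOp):
            op = {ast.Add: "+", ast.Sub: "-",
                  ast.Mult: "*", ast.Div: "/"}[type(node.op)]
            return (op, walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(to_tree("(A + B) * C"))   # -> ('*', ('+', 'A', 'B'), 'C')
```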
Although one-to-one mappings are, in general, the simplest to perform, they are rare in interesting input systems for several reasons. One important reason is that in many domains, inputs must be interpreted not absolutely, but relatively, with respect to some reference point. For example, when images are being interpreted, size and perspective will change as a function of the viewing position. Thus a single object will look different in different images. To see this, look at Figure 14.2, which shows two line drawings representing the same scene, one of which corresponds to a picture taken close to the scene and one of which represents a picture taken from farther away. A similar phenomenon occurs in English. The word "tall" specifies one height range in the phrase "a tall giraffe" and a different one in the phrase "a tall poodle."
A second reason that many-to-one mappings are frequent is that free variation is often allowed, either because of the physical limitations of the system that produces the inputs or because such variation simply makes the task of generating the inputs manageable. Both of these factors help to explain why natural languages, both in their spoken and their written forms, require many-to-one mappings. Examples from speech abound. No two people speak identically. In fact, one person does not always say a given word the
same way. Figure 14.3 illustrates this problem. It shows a spectrogram produced by the beginning of the utterance "Alpha gets alpha minus beta." A spectrogram shows how the sound energy is distributed over the auditory frequency range as a function of time. In this example, you can see two different patterns, each produced by the word "alpha." Even when we ignore the variability of the speech signal, natural languages admit variability because of their richness. This is particularly noticeable when mapping from a natural language (with its richness of structure and vocabulary) to a small, simple target representation. So, for example, we might find many-to-one mappings, such as the following one, occurring in the English front end to a keyword data retrieval system:
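The book's display is not reproduced here; an illustrative stand-in (the word list and function below are hypothetical) shows several distinct English requests collapsing to a single keyword query:

```python
# Hypothetical sketch of a many-to-one mapping: several paraphrases
# all reduce to the same keyword query. CONTENT_WORDS is illustrative.
CONTENT_WORDS = {"presidential", "election"}

def keywords(sentence):
    words = (w.strip(".,?!").lower() for w in sentence.split())
    return frozenset(w for w in words if w in CONTENT_WORDS)

requests = [
    "I want to read about the presidential election.",
    "Tell me everything about the presidential election.",
    "What happened in the presidential election?",
]
# All three requests map to the single target
# frozenset({'presidential', 'election'}):
assert len({keywords(r) for r in requests}) == 1
```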
Many-to-one mappings require that the understanding system know about all the ways that a target representation can be expressed in the source language. As a result, they typically require a structured analysis of the input rather than a simple, exact pattern match. But they often do not require much other knowledge.
One-to-many mappings, on the other hand, often require a great deal of domain knowledge (in addition to the input itself) in order to make the correct choice among the available target representations. An example of such a mapping (in which the input can be said to be ambiguous) is the following sentence:
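The original sentence display is not reproduced here; a standard stand-in with the same kind of prepositional-phrase ambiguity, chosen to fit the "boy in the park" example referred to below, is sketched here:

```python
# Illustrative stand-in (not the book's original sentence): one
# surface string admits two distinct target structures -- a
# one-to-many mapping that only context can resolve.
sentence = "John saw the boy in the park with a telescope"

readings = [
    # Reading 1: the telescope is the instrument of the seeing.
    ("saw", "John", ("the boy", ("in", "the park")),
     ("with", "a telescope")),
    # Reading 2: the telescope belongs with the boy.
    ("saw", "John", ("the boy", ("in", "the park"),
                     ("with", "a telescope"))),
]
```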
Notice that although this sentence, taken in isolation, is ambiguous, it would usually not be interpreted as being ambiguous by a human listener in a specific context. Clues, both from previous sentences and from the physical context in which the sentence occurs, usually make one of these interpretations appear to be correct. The problem, though, from a processing standpoint, is how to encode this contextual information and how to exploit it while processing each new sentence.
Notice that English, in all its glory, has the properties of both of these last two examples; it involves a many-to-many mapping, in which there are many ways to say the same thing and a given statement may have many meanings.
Level of Interaction among Components
In most interesting understanding contexts, each input is composed of several components (lines, words, symbols, or whatever). The mapping process is the simplest if each component can be mapped without concern for the other components of the statement. Otherwise, as the number of interactions increases, so does the complexity of the mapping.
Programming languages provide good examples of languages in which there is very little interaction among the components of an input. For example, Figure 14.4 shows how changing one word of a statement requires only a single change to one node of the corresponding parse tree.
In many natural language sentences, on the other hand, changing a single word can alter not just a single node of the interpretation, but rather its entire structure. An example of this is shown in Figure 14.5. (The triangles in the figure indicate substructures whose further decomposition is not important.) As these examples show, the components of an English sentence typically interact more heavily with each other than do the components of artificial languages, such as programming languages, that have been designed, among other things, to facilitate processing by computer.
Nonlocality can be a problem at all levels of an understanding process. In the "boy in the park" example, the problem is in how to group phrases together. But in perceptual understanding tasks, this same problem may make it difficult even to decide
on what the basic constituents are. Figure 14.6 shows a simplified example from speech understanding. Assuming that the sounds shown in the figure have been identified, the problem is to group them into words. But the correct grouping cannot be determined without looking at the larger context in which the sounds occurred. Either of the groupings shown is possible, as can be seen from the two sentences in the figure. Figure 14.7 shows an actual speech waveform, in which the lack of local clues, even for segmenting into individual sounds, can be seen.
In image-understanding problems as well, a similar problem involving local indeterminacy arises. Consider the situation shown in Figure 14.8. At this point, lines have been extracted from the original figure and the next task is to separate the figure into objects. But suppose we start at the left and identify the object labeled A. Does it end at the vertical line? It is not possible to tell without looking past the vertical object to see if there is an extension, which, in this case, there is.
Noise in the Input
Understanding is the process of interpreting an input and assigning it meaning. Unfortunately, in many understanding situations the input to which meaning should be assigned is not always the input that is presented to the understander. Because of the complex environment in which understanding usually occurs, other things often interfere with the basic input before it reaches the understander. In perceptual tasks, such as speech and image understanding, this problem is common. We rarely have the opportunity to listen to each other against a background of silence. Thus we must take an input signal and separate the speech component from the background noise component in order to understand the speech. The same problem occurs in image understanding. If you look out of your car window in search of a particular store sign, the image you will see of the sign may be interfered with by many things, such as your windshield wipers or the trees alongside the road. Although typed language is less susceptible to noise than is spoken language, noise is still a problem. For example, typing errors are common, particularly if language is being used interactively to communicate with a computer system.
Conclusion
The point of this section has been twofold. On the one hand, it has attempted to describe the sources of complexity in understanding tasks, in order to help you analyze new understanding tasks for tractability. On the other, it has tried to point out specific understanding tasks that turn out, unfortunately, to be hard (such as natural language understanding) but that are nevertheless important in the sense that it would be useful if we could perform them. It is to these understanding tasks that we will need to devote substantial research effort.
On the basis of a superficial analysis (such as the one in the last section), many understanding tasks appear impossibly complex. The number of interpretations that can be assigned to individual components of an input is large, and the number of combinations of those components is enormous. But a closer analysis often reveals that many of the combinations cannot actually occur. These natural constraints can be exploited in the understanding process to reduce the complexity from unmanageable to tractable. There are two important steps in the use of constraints in problem solving:
1. Analyze the problem domain to determine what the constraints are.
2. Apply the constraints to the analysis of a particular input.
In the rest of this section, we look at one example of the use of this approach, the Waltz algorithm for labeling line drawings. In Chapter 15 we then look in depth at the problem of natural language understanding and see how it too can be viewed as a constraint satisfaction process.
Consider the drawing shown in Figure 14.9. Assume either that you have been given this drawing as the input or that lower-level routines have already operated to extract these lines from an input photograph. The next step in the analysis process is to determine the objects described by the lines. To do this, we need first to identify each of the lines in the figure as representing either:
1. A boundary, separating an object from the background
2. A concave edge, at which two faces of an object meet forming an angle of more than 180 degrees measured through the object's interior
3. A convex edge, at which two faces meet forming an angle of less than 180 degrees measured through the object's interior
For more complex figures, other edge types, such as cracks between coplanar faces and shadow edges between shadows and the background, would also be required. The approach we describe here has, in fact, been extended to handle these other edge types. But to make the explanation straightforward, we consider only these three. In fact, we consider only figures composed exclusively of trihedral vertices, which are vertices at which exactly three planes come together. Figure 14.10 shows examples of trihedral figures. Figure 14.11 shows examples of nontrihedral figures.
Determining the Constraints
The problem we are trying to solve is how to recognize individual objects in a figure. To do that, we intend first to label all the lines in the figure so that we know which ones correspond to boundaries between objects. We use the three line types given above. For boundary lines, we also need to indicate a direction, telling which side of the line corresponds to the object and which to the background. This produces a set of four labels that can be attached to a given line. We use the conventions shown in Figure 14.12 to show line labelings. To illustrate these labelings, Figure 14.13 shows the drawing of Figure 14.9 with each of its lines correctly labeled.
Assuming these four line types, we can calculate that the number of ways of labeling a figure composed of N lines is 4^N. How can we find the correct one? The critical observation here is that every line must meet other lines at a vertex at each of its ends. For the trihedral figures we are considering, there are only four configurations that describe all the possible vertices. These four configurations are shown in Figure 14.14. The rotational position of the vertex is not significant, nor are the sizes of the angles it contains, except that the distinction between acute angles (< 90 degrees) and obtuse angles (> 90 degrees) is important to distinguish between a FORK and an ARROW. If there turn out to be constraints on the kinds of vertices that can occur, then there would be corresponding constraints on the lines entering the vertices, and thus the number of possible line labelings would be reduced.
To begin looking for such vertex constraints, we first consider the maximum number of ways that each of the four types of lines might combine with other lines at a vertex. Since an L vertex involves two lines, each of which can have four labels, there must be sixteen ways it could be formed. FORKs, Ts, and ARROWs involve three lines, so they could be formed in sixty-four ways each. Thus there are 208 ways to form a trihedral vertex. But, in fact, only a very small number of these labelings can actually occur in line drawings representing real physical objects. To see this, consider the planes on which the faces that form a vertex of a trihedral figure lie. These three planes must
divide 3-space into eight parts (called octants) since each individual face divides the space in half and none of the faces can be coplanar. Trihedral figures may differ in the number of octants that they fill and in the position (which must be one of the unfilled octants) from which they are viewed. Any vertex that can occur in a trihedral figure must correspond to such a division of space with some number (between one and eight) of octants filled, which is viewed from one of the unfilled octants. So to find all the vertex labelings that can occur, we need only consider all the ways of filling the octants and each of the ways of viewing those fillings, and then record the types of the vertices that we find.
To illustrate this process, consider the drawing shown in Figure 14.15, which occupies one of the eight octants formed by the intersection of the planes corresponding to the faces of vertex A. Imagine viewing this figure from each of the remaining seven octants and recording the configuration and the labeling of vertex A. Figure 14.16(a) shows the results of this. When we take those seven descriptions and eliminate rotational and angular variations, we see that only three distinct ones remain, as shown in Figure 14.16(b). If we continue this process for objects filling up to seven octants (there can be no vertices if all eight octants are filled), we get a complete list of the possible trihedral vertices and their labelings (equivalent to that developed by Clowes [1971]). This list is shown in Figure 14.17. Notice that of the 208 labelings that we said were theoretically possible, only eighteen are physically possible. Thus we have found a severe constraint on the way that lines in drawings corresponding to real figures can be labeled.
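The counting behind these numbers is simple enough to verify directly; a quick sketch (the variable names are ours, not the book's):

```python
# Check of the vertex-labeling combinatorics described above.
labels_per_line = 4          # +, -, and the two boundary directions

l_vertex = labels_per_line ** 2          # an L involves two lines: 16
fork = t = arrow = labels_per_line ** 3  # three lines each: 64 apiece

print(l_vertex + fork + t + arrow)       # 208 conceivable vertex labelings
# Only 18 of these are physically realizable (Figure 14.17). A whole
# figure with N lines likewise has 4**N conceivable labelings:
print(4 ** 9)                            # 262144 for a figure of 9 lines
```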
Of course, at this point we have only found a constraint on the ways in which simple trihedral vertices can be labeled. Many figures, such as those shown in Figure 14.11, contain nontrihedral vertices. In addition, many figures contain shadow areas, which can be of great use in analyzing the scene that is being portrayed. When these variations are considered, there are more than eighteen allowable vertex labelings. But when these variations are allowed, the number of theoretically possible labelings becomes much larger than 208, and, in fact, the ratio of physically allowable vertices to theoretically possible ones becomes even smaller than 18/208. Thus not only can this approach be extended to larger domains, it must be.
As a result of this analysis, we have been able to articulate one class of constraints that will be needed by a line-labeling procedure. These constraints are static (since the physical rules they are based on never change), and so they do not need to be represented explicitly as part of a problem state. They can be encoded directly into the line-labeling algorithm. The other class of constraints we will need contains the dynamic ones that describe the current options for the labeling of each vertex. These constraints will be represented and manipulated explicitly by the line-labeling algorithm.
Applying Constraints in Analysis Problems
Having analyzed the domain in which we are working and extracted a set of constraints that objects in the domain must satisfy, we need next to apply those constraints to the problem of analyzing inputs in the domain. To do this, we use a form of the constraint satisfaction procedure described in Section 3.5. It turns out that for this problem it is
not necessary to use the second part of our constraint satisfaction procedure (the one that makes guesses and results in search). The domain provides sufficiently powerful constraints that it is not necessary to resort to search. Thus the Waltz algorithm [Waltz, 1975], which we present here, omits that step entirely.
To label line drawings of the sort we are considering, we first pick one vertex and find all the labelings that are possible for it. Then we move to an adjacent vertex and find all of its possible labelings. The line that we followed to get from the first vertex to the second must end up with only one label, and that label must be consistent with the two vertices it enters. So any labelings for either of the two vertices that require the line to be labeled in a way that is inconsistent with the other vertex can be eliminated. Now another vertex, adjacent to one of the first two, can be labeled. New constraints will arise from this labeling, and these constraints can be propagated back to vertices that have already been labeled, so the set of possible labelings for them is further reduced. This process proceeds until all the vertices in the figure have been labeled.
As an example, consider the simple drawing shown in Figure 14.18(a). We can begin by labeling all the boundary edges, as shown in Figure 14.18(b). Suppose we then begin labeling vertices at vertex 1. The only vertex label that is consistent with the known line labels is 13. At vertex 2, the only consistent label is 6. At each of the remaining boundary vertices, there is also only one labeling choice. These labelings are shown in parentheses in Figure 14.18(c). Now consider vertex 7. Just looking at vertex 7 itself, it would appear that any of the five FORK labelings is possible. But from the only labeling we found for vertex 2, we know that the line between vertices 2 and 7 must be labeled +. This makes sense since it obviously represents a convex edge. Using this fact, we can eliminate four of the possible FORK labels. Only label 8 is now possible. The complete labeling just computed is shown in Figure 14.18(d). Thus we see that by exploiting constraints on vertex labelings, we have correctly identified vertex 7 as being formed by three convex edges.
We can now specify in more detail this particular version of constraint propagation.
Algorithm: Waltz
This algorithm will always find the unique, correct figure labeling if one exists. If a figure is ambiguous, however, the algorithm will terminate with at least one vertex still having more than one labeling attached to it.
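A minimal sketch of this propagation loop in Python may make it concrete. This is not Waltz's original program: the representation, the function names, and the toy example below are ours, and a real implementation would initialize each vertex's candidate set from the full table of eighteen physically possible junction labelings (Figure 14.17). Line labels are expressed relative to the line itself, so both of a line's endpoint vertices refer to the same label.

```python
def waltz(vertices, lines, candidates):
    """Prune vertex labelings by constraint propagation.

    vertices:   dict vertex -> ordered list of incident line ids
    lines:      dict line id -> (vertex, vertex) endpoints
    candidates: dict vertex -> set of labelings, where a labeling is a
                tuple giving one label per incident line, in the order
                listed in vertices[v]. Initialize from the table of
                physically possible junctions for each vertex's type.
    """
    def labels_at(v, line):
        i = vertices[v].index(line)
        return {lab[i] for lab in candidates[v]}

    changed = True
    while changed:                     # repeat until a fixed point
        changed = False
        for line, (a, b) in lines.items():
            # A line may keep only labels both endpoints still allow.
            allowed = labels_at(a, line) & labels_at(b, line)
            for v in (a, b):
                i = vertices[v].index(line)
                keep = {lab for lab in candidates[v] if lab[i] in allowed}
                if keep != candidates[v]:
                    candidates[v] = keep
                    changed = True     # propagate the new constraint
    return candidates

# Toy usage with made-up candidate sets (illustrative only):
vertices = {"A": ["e1"], "B": ["e1"]}
lines = {"e1": ("A", "B")}
candidates = {"A": {("+",), ("-",)}, "B": {("+",)}}
waltz(vertices, lines, candidates)
print(candidates["A"])   # {('+',)} -- the '-' option has been pruned
```

Because candidate sets only ever shrink, the loop is guaranteed to terminate, and on an unambiguous figure each vertex ends with exactly one labeling.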
Actually, this algorithm, as described by Waltz, was applied to a larger class of figures in which cracks and shadows might occur. But the operation of the algorithm is the same regardless of the size of the table of allowable vertex labelings that it uses. In fact, as suggested in the last section, the usefulness of the algorithm increases as the size of the domain increases and thus the ratio of physically possible to theoretically possible vertices decreases. Waltz's program, for example, used shadow information, which appears in the figure locally as shadow lines, as a way of exploiting a global constraint, namely that a single source of light produces consistent shadows.
In this chapter we outlined the major difficulties that confront programs designed to perform perceptual tasks. We also described the use of the constraint satisfaction procedure as one way of surmounting some of those difficulties.
Sometimes the problems of speech and image understanding are important in the construction of stand-alone programs to solve one particular task. But they also play an important role in the larger field of robotics, which has as its goal the construction of intelligent robots capable of functioning with some degree of autonomy. For such robots, perceptual abilities are essential. We will return to these issues in Chapter 21.