Natural Language Understanding
James Allen
University of Rochester

Acquisitions Editor: J. Carter Shanklin
Executive Editor: Dan Joraanstad
Editorial Assistant: Melissa Standen
Cover Designer: Yvo Riezebos Design
Technical Assistant: Peg Meeker
Production Editor: Ray Kanarr
Copy Editor: Barbara Conway
Proofreader: Joe Ruddiek
Design Consultant: Michael Rogondino

Macintosh is a trademark of Apple Computer, Inc. Word is a trademark of Microsoft, Inc. Canvas is a trademark of Deneba Software. Camera-ready copy for this book was prepared on a Macintosh with Microsoft Word and Canvas.

The programs and the applications presented in this book have been included for their instructional value. They have been tested with care but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

Copyright © 1995 by The Benjamin/Cummings Publishing Company, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. Published simultaneously in Canada.

Library of Congress Cataloging-in-Publication Data
Allen, James.
Natural language understanding / James Allen. — 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-8053-0334-0
1. Programming languages (Electronic computers)--Semantics. 2. Language and logic. 3. Artificial intelligence. I. Title.
QA76.7.A44 1994
006.3'6--dc20 94-18218 CIP

3 4 5 6 7 8 9 10 - MA - 98 97 96 95

The Benjamin/Cummings Publishing Company, Inc.
390 Bridge Parkway, Redwood City, CA 94065

Brief Contents

Preface
Chapter 1 Introduction to Natural Language Understanding
PART I Syntactic Processing
Chapter 2 Linguistic Background: An Outline of English Syntax
Chapter 3 Grammars and Parsing
Chapter 4 Features and Augmented Grammars
Chapter 5 Grammars for Natural Language
Chapter 6 Toward Efficient Parsing
Chapter 7 Ambiguity Resolution: Statistical Methods
PART II Semantic Interpretation
Chapter 8 Semantics and Logical Form
Chapter 9 Linking Syntax and Semantics
Chapter 10 Ambiguity Resolution
Chapter 11 Other Strategies for Semantic Interpretation
Chapter 12 Scoping and the Interpretation of Noun Phrases
PART III Context and World Knowledge
Chapter 13 Knowledge Representation and Reasoning
Chapter 14 Local Discourse Context and Reference
Chapter 15 Using World Knowledge
Chapter 16 Discourse Structure
Chapter 17 Defining a Conversational Agent
Appendix A An Introduction to Logic and Model-Theoretic Semantics
Appendix B Symbolic Computation
Appendix C Speech Recognition and Spoken Language
Bibliography
Index

Contents

Preface

Chapter 1 Introduction to Natural Language Understanding
1.1 The Study of Language
1.2 Applications of Natural Language Understanding
1.3 Evaluating Language Understanding Systems
1.4 The Different Levels of Language Analysis
1.5 Representations and Understanding
1.6 The Organization of Natural Language Understanding Systems

PART I SYNTACTIC PROCESSING

Chapter 2 Linguistic Background: An Outline of English Syntax
2.1 Words
2.2 The Elements of Simple Noun Phrases
2.3 Verb Phrases and Simple Sentences
2.4 Noun Phrases Revisited
2.5 Adjective Phrases
2.6 Adverbial Phrases

Chapter 3 Grammars and Parsing
3.1 Grammars and Sentence Structure
3.2 What Makes a Good Grammar
3.3 A Top-Down Parser
3.4 A Bottom-Up Chart Parser
3.5 Transition Network Grammars
o 3.6 Top-Down Chart Parsing
o 3.7 Finite State Models and Morphological Processing
o 3.8 Grammars and Logic Programming
(o—optional topic)

Chapter 4 Features and Augmented Grammars
4.1 Feature Systems and Augmented Grammars
4.2 Some Basic Feature Systems for English
4.3 Morphological Analysis and the Lexicon
4.4 A Simple Grammar Using Features
4.5 Parsing with Features
o 4.6 Augmented Transition Networks
o 4.7 Definite Clause Grammars
o 4.8 Generalized Feature Systems and Unification Grammars

Chapter 5 Grammars for Natural Language
5.1 Auxiliary Verbs and Verb Phrases
5.2 Movement Phenomena in Language
5.3 Handling Questions in Context-Free Grammars
o 5.4 Relative Clauses
5.5 The Hold Mechanism in ATNs
5.6 Gap Threading

Chapter 6 Toward Efficient Parsing
6.1 Human Preferences in Parsing
6.2 Encoding Uncertainty: Shift-Reduce Parsers
6.3 A Deterministic Parser
6.4 Techniques for Efficient Encoding of Ambiguity
6.5 Partial Parsing

Chapter 7 Ambiguity Resolution: Statistical Methods
7.1 Basic Probability Theory
7.2 Estimating Probabilities
7.3 Part-of-Speech Tagging
7.4 Obtaining Lexical Probabilities
7.5 Probabilistic Context-Free Grammars
7.6 Best-First Parsing
7.7 A Simple Context-Dependent Best-First Parser

PART II SEMANTIC INTERPRETATION

Chapter 8 Semantics and Logical Form
8.1 Semantics and Logical Form
8.2 Word Senses and Ambiguity
8.3 The Basic Logical Form Language
8.4 Encoding Ambiguity in the Logical Form
8.5 Verbs and States in Logical Form
8.6 Thematic Roles
8.7 Speech Acts and Embedded Sentences
o 8.8 Defining Semantic Structure: Model Theory

Chapter 9 Linking Syntax and Semantics
9.1 Semantic Interpretation and Compositionality
9.2 A Simple Grammar and Lexicon with Semantic Interpretation
9.3 Prepositional Phrases and Verb Phrases
9.4 Lexicalized Semantic Interpretation and Semantic Roles
9.5 Handling Simple Questions
9.6 Semantic Interpretation Using Feature Unification
o 9.7 Generating Sentences from Logical Form

Chapter 10 Ambiguity Resolution
10.1 Selectional Restrictions
10.2 Semantic Filtering Using Selectional Restrictions
10.3 Semantic Networks
10.4 Statistical Word Sense Disambiguation
10.5 Statistical Semantic Preferences
10.6 Combining Approaches to Disambiguation

Chapter 11 Other Strategies for Semantic Interpretation
11.1 Grammatical Relations
11.2 Semantic Grammars
11.3 Template Matching
11.4 Semantically Driven Parsing Techniques

Chapter 12 Scoping and the Interpretation of Noun Phrases
12.1 Scoping Phenomena
12.2 Definite Descriptions and Scoping
12.3 A Method for Scoping While Parsing
12.4 Co-Reference and Binding Constraints
12.5 Adjective Phrases
12.6 Relational Nouns and Nominalizations
12.7 Other Problems in Semantics

PART III CONTEXT AND WORLD KNOWLEDGE

Chapter 13 Knowledge Representation and Reasoning
13.1 Knowledge Representation
13.2 A Representation Based on FOPC
13.3 Frames: Representing Stereotypical Information
13.4 Handling Natural Language Quantification
o 13.5 Time and Aspectual Classes of Verbs
o 13.6 Automating Deduction in Logic-Based Representations
o 13.7 Procedural Semantics and Question Answering
o 13.8 Hybrid Knowledge Representations

Chapter 14 Local Discourse Context and Reference
14.1 Defining Local Discourse Context and Discourse Entities
14.2 A Simple Model of Anaphora Based on History Lists
14.3 Pronouns and Centering
14.4 Definite Descriptions
o 14.5 Definite Reference and Sets
14.6 Ellipsis
14.7 Surface Anaphora

Chapter 15 Using World Knowledge
15.1 Using World Knowledge: Establishing Coherence
15.2 Matching Against Expectations
15.3 Reference and Matching Expectations
15.4 Using Knowledge About Action and Causality
15.5 Scripts: Understanding Stereotypical Situations
15.6 Using Hierarchical Plans
o 15.7 Action-Effect-Based Reasoning
o 15.8 Using Knowledge About Rational Behavior

Chapter 16 Discourse Structure
16.1 The Need for Discourse Structure
16.2 Segmentation and Cue Phrases
16.3 Discourse Structure and Reference
16.4 Relating Discourse Structure and Inference
16.5 Discourse Structure, Tense, and Aspect
16.6 Managing the Attentional Stack
16.7 An Example

Chapter 17 Defining a Conversational Agent
17.1 What's Necessary to Build a Conversational Agent?
17.2 Language as a Multi-Agent Activity
17.3 Representing Cognitive States: Beliefs
17.4 Representing Cognitive States: Desires, Intentions, and Plans
17.5 Speech Acts and Communicative Acts
17.6 Planning Communicative Acts
17.7 Communicative Acts and the Recognition of Intention
17.8 The Source of Intentions in Dialogue
17.9 Recognizing Illocutionary Acts
17.10 Discourse-Level Planning

Appendix A An Introduction to Logic and Model-Theoretic Semantics
A.1 Logic and Natural Language
A.2 Model-Theoretic Semantics
A.3 A Semantics for FOPC: Set-Theoretic Models

Appendix B Symbolic Computation
B.1 Symbolic Data Structures
B.2 Matching
B.3 Search Algorithms
B.4 Logic Programming
B.5 The Unification Algorithm

Appendix C Speech Recognition and Spoken Language
C.1 Issues in Speech Recognition
C.2 The Sound Structure of Language
C.3 Signal Processing
C.4 Speech Recognition
C.5 Speech Recognition and Natural Language Understanding
C.6 Prosody and Intonation

Bibliography
Index

Preface

The primary goal of this book is to provide a comprehensive, in-depth description of the theories and techniques used in the field of natural language understanding. To do this in a single book requires eliminating the idiosyncratic complexities of particular approaches and identifying the underlying concepts of the field as a whole. It also requires developing a small number of notations that can be used to describe a wide range of techniques. It is not possible to capture the range of specific notations that are used in the literature, and attempting to do so would mask the important ideas rather than illuminate them.

This book also attempts to make as few assumptions about the background of the reader as possible. All issues and techniques are described in plain English whenever possible. As a result, any reader having some familiarity with the basic notions of programming will be able to understand the principal ideas and techniques used. There is enough detail in this book, however, to allow a sophisticated programmer to produce working systems for natural language understanding.

The book is intended both as a textbook and as a general reference source for those interested in natural language processing. As a text, it can be used both in undergraduate and graduate courses in computer science or computational linguistics. As a reference source, it can be used by researchers in areas related to natural language, by developers building natural language systems, and by individuals who are simply interested in how computers can process language.

Work in natural language understanding requires background in a wide range of areas, most importantly computer science, linguistics, logic, psycholinguistics, and the philosophy of language. As very few people have all this background, this book introduces whatever background material is required, as needed. For those who have no familiarity with symbolic programming or logical formalisms, two appendices are provided that describe enough of the basic ideas to enable nonprogrammers to read the book. Background in linguistics and other areas is introduced as needed throughout the book.

The book is organized by problem area rather than by technique. This allows the overlap in different approaches to be presented once, and facilitates the comparison of the different techniques.
For example, rather than having a chapter on context-free grammars, one on transition network grammars, and another on logic programming techniques, these three techniques are discussed in many chapters, each focusing on a specific set of problems. There is a chapter on basic parsing techniques, then another on using features, and another on handling movement phenomena in language. Each chapter lays out its problem, and then discusses how the problem is addressed in each approach. This way, the similarities and differences between the approaches can be clearly identified.

Why a Second Edition?

I have taught a course based on the first edition regularly since it first appeared. Given developments in the field, I began to find it lacking in some areas. In other areas the book had the right content, but not the right emphasis. Another motivation came from what I saw as a potential split in the field with the introduction of statistically-based techniques and corpus-based research. Claims have been made as to how the new techniques make the old obsolete. But on examining the issues carefully, it was clear to me that the old and new approaches are complementary, and neither is a substitute for the other. This second edition attempts to demonstrate this by including new material on statistical methods, with an emphasis on how they interact with traditional schemes.

Another motivation was to improve the pedagogical aspects of the book. While the original edition was intended for use by students and researchers at different levels of background and with different needs, its organization sometimes made this difficult. The second edition addresses this problem in several ways, as described below in the section on using this book. As a result, it should be accessible to students with little background—say, undergraduates in a computer science program with little linguistics background, or undergraduates in a linguistics program with only a basic course in programming—yet also comprehensive enough for more advanced graduate-level courses, and for use as a general reference text for researchers in the field.

There are many other changes throughout the book, reflecting how the field has changed since the first edition, or at least reflecting how my views have changed. Much of this is a change in emphasis. The sections on syntax, for instance, now use feature-based context-free grammars as the primary formalism for developing the ideas, rather than augmented transition networks. This better reflects current work in the field, and also allows for a better integration of insights from linguistics. Transition network formalisms are still covered in detail, however, as they underlie much work in the field. The sections on semantics have also changed substantially, focusing more on underlying principles and issues rather than specific computational techniques. The logical form language was updated to cover more complex sentences, and the primary method of semantic interpretation is a compositional rule-by-rule analysis. This allows for both a better discussion of the underlying problems in semantic interpretation and a better integration of insights from linguistics. New material has also been added on domain-specific semantic interpretation techniques that have been an active focus in practical systems in the last few years.
The material on contextual processing and discourse has generally been rewritten and updated to reflect new work and my better understanding of the principles underlying the techniques in the literature.

The other significant change is the development of software to accompany the book. Most of the principal examples and algorithms have been implemented and tested. This code is available so that students can run the examples to get a better understanding of the algorithms. It also provides a strong base from which students can extend algorithms and develop new grammars for assignments. The code also helps to clarify and simplify the discussion of the algorithms in the book, both because they can be more precise, being based on running programs, and because implementation details can be omitted from the text and still be available to the student in the code.

How to Use This Book

To cover the entire book in detail would require a two-semester sequence at the undergraduate level. As a graduate seminar, it is possible to move more quickly and cover most of the book in a single semester. The most common use of this book, however, is as an undergraduate-level course for a single semester. The book is designed to make this possible, and to allow the instructor to choose the areas on which to focus.

There are several mechanisms to help customize the book to your needs. First, each chapter is divided into core material and optional material. Optional sections are marked with an open dot (o). In general, if you are using a chapter, all of the core material should be covered. The optional material can be selected as best fits the design of your course. Additional optional material is included in boxes. Material in boxes is never required, but can be used to push a little deeper in some areas.

The chapters themselves have the general dependencies shown in the figure on the next page. This means that the student should cover the core material in the earlier chapters to best understand the later chapters. In certain cases, material in a section may require knowledge of a technique discussed earlier but not indicated in the dependencies shown here. For example, one of the techniques discussed in Chapter 11 uses some partial parsing techniques described in Chapter 6. In such a case, the earlier section will need to be covered even though the chapter as a whole is skipped.

Readers familiar with the first edition will notice that the two final chapters, dealing with question answering and generation, are not present in the second edition. In keeping with the organization by topic rather than technique, the material from these chapters has been integrated into other chapters. Sections dealing with generation appear as each topic area is discussed. For example, a head-driven realization algorithm is described in Chapter 9 on linking syntax and semantics, and the procedural semantics technique used in database query systems appears in Chapter 13 on knowledge representation techniques.

Software

The software accompanying this book is available via anonymous ftp from bc.aw.com in the subdirectory "bc/allen." All software is written in standard COMMON LISP, and should run with any COMMON LISP system. To retrieve the software, ftp to bc.aw.com by typing

ftp bc.aw.com

and log in as "anonymous." Change to the directory for this book by typing

cd bc/allen

Before retrieving the files, it is a good idea to look at the "readme" file to see if changes have been made since this book went to press.
To retrieve this file, type

get README

Quit ftp to log off and read the file. (Although the file can be read online, it is courteous not to tie up the login for reading.) Log back on when you are ready to download. You can also get a listing of filenames available using either the conventional UNIX "ls" command or the DOS "dir" command. Assuming that you are in the proper subdirectory, these commands require no arguments:

dir or ls

Using ftp and then de-archiving files can get complicated. Instructions vary as to whether you are downloading Macintosh, DOS, or UNIX files. More instructions are included in the README file. If you are new to using ftp, it is best to consult your favorite UNIX guide or wizard.

[Figure: chart of the dependencies among Chapters 1-17, referred to in the section on using this book.]

References

This book contains extensive references to literature in this field, so that the particular details of the techniques described here can easily be found. In addition, references are provided to other key papers that deal with issues beyond the scope of the book to help identify sources for research projects. References are chosen based on their general availability. Thus I have tried to confine citations to journal articles, books, and major conferences, avoiding technical reports and unpublished notes that are hard to find ten years after the fact. In general, most work in the field is published first in technical reports; more accessible publications often do not appear until many years after the ideas were introduced. Thus the year of publication in the references is not a reliable indicator of the time when the work was done, or when it started to influence other work in the field. While I've tried to provide a wide range of sources and to give credit to the ideas described in this book, I know I have omitted key papers that I will regret immensely when I remember them. To these authors, I extend my apologies.

Acknowledgments

This book would not have been completed except for a sabbatical from the University of Rochester, and for the continual support of the Computer Science Department over the last three years. I was provided with the extra resources needed to complete the project without question, and I am very grateful. I also must thank Peggy Meeker for her efforts in editing and preparing endless drafts as well as the final pages you see here. Peggy has had amazing patience and good humor through countless revisions, and has enforced a consistency of notation and grammatical style that appears to be foreign to my natural way of writing!

I have been fortunate to have a wide range of reviewers on various drafts, both those working for the publisher and many other friends who have taken much time to read and comment on drafts. The list is long, and each one has devoted such effort that listing their names hardly seems sufficient. But here goes.
Many thanks to Glenn Blank, Eugene Charniak, Michelle Degraff, George Ferguson, Mary Harper, Peter Heeman, Chung-Hee Hwang, Susan McRoy, Brad Miller, Johanna Moore, Lynn Peterson, Victor Raskin, Len Schubert, Mark Steedman, Mike Tanenhaus, David Traum, and the students in CSC 447 who used the book in the Spring of 1994. The book is significantly different and improved from the early drafts because of all the honest and critical feedback I received.

Finally, I must thank my wife, Judith Hook, for putting up with all this for a second time after experiencing the trauma of doing the first edition. Pushing the book to completion became an obsession that occupied me day and night for nearly a year. Thanks to Judi, and to Jeffrey, Daniel, and Michael, for putting up with me during this time and keeping an appropriate sense of humor about it.

James F. Allen
Rochester, New York

CHAPTER 1
Introduction to Natural Language Understanding

1.1 The Study of Language
1.2 Applications of Natural Language Understanding
1.3 Evaluating Language Understanding Systems
1.4 The Different Levels of Language Analysis
1.5 Representations and Understanding
1.6 The Organization of Natural Language Understanding Systems

This chapter describes the field of natural language understanding and introduces some basic distinctions. Section 1.1 discusses how natural language understanding research fits into the study of language in general. Section 1.2 discusses some applications of natural language understanding systems and considers what it means for a system to understand language. Section 1.3 describes how you might evaluate whether a system understands language. Section 1.4 introduces a few basic distinctions that are made when studying language, and Section 1.5 discusses how computational systems often realize these distinctions. Finally, Section 1.6 discusses how natural language systems are generally organized, and introduces the particular organization assumed throughout this book.

1.1 The Study of Language

Language is one of the fundamental aspects of human behavior and is a crucial component of our lives. In written form it serves as a long-term record of knowledge from one generation to the next. In spoken form it serves as our primary means of coordinating our day-to-day behavior with others. This book describes research about how language comprehension and production work. The goal of this research is to create computational models of language in enough detail that you could write computer programs to perform various tasks involving natural language. The ultimate goal is to be able to specify models that approach human performance in the linguistic tasks of reading, writing, hearing, and speaking. This book, however, is not concerned with problems related to the specific medium used, whether handwriting, keyboard input, or speech. Rather, it is concerned with the processes of comprehending and using language once the words are recognized. Computational models are useful both for scientific purposes—for exploring the nature of linguistic communication—and for practical purposes—for enabling effective human-machine communication.

Language is studied in several different academic disciplines. Each discipline defines its own set of problems and has its own methods for addressing them.
The linguist, for instance, studies the structure of language itself, considering questions such as why certain combinations of words form sentences but others do not, and why a sentence can have some meanings but not others. The psycholinguist, on the other hand, studies the processes of human language production and comprehension, considering questions such as how people identify the appropriate structure of a sentence and when they decide on the appropriate meaning for words. The philosopher considers how words can mean anything at all and how they identify objects in the world. Philosophers also consider what it means to have beliefs, goals, and intentions, and how these cognitive capabilities relate to language. The goal of the computational linguist is to develop a computational theory of language, using the notions of algorithms and data structures from computer science. Of course, to build a computational model, you must take advantage of what is known from all the other disciplines. Figure 1.1 summarizes these different approaches to studying language.

Linguists
  Typical problems: How do words form phrases and sentences? What constrains the possible meanings for a sentence?
  Tools: Intuitions about well-formedness and meaning; mathematical models of structure (for example, formal language theory, model-theoretic semantics)

Psycholinguists
  Typical problems: How do people identify the structure of sentences? How are word meanings identified? When does understanding take place?
  Tools: Experimental techniques based on measuring human performance; statistical analysis of observations

Philosophers
  Typical problems: What is meaning, and how do words and sentences acquire it? How do words identify objects in the world?
  Tools: Natural language argumentation using intuition about counterexamples; mathematical models (for example, logic and model theory)

Computational Linguists
  Typical problems: How is the structure of sentences identified? How can knowledge and reasoning be modeled? How can language be used to accomplish specific tasks?
  Tools: Algorithms, data structures; formal models of representation and reasoning; AI techniques (search and representation methods)

Figure 1.1 The major disciplines studying language

As previously mentioned, there are two motivations for developing computational models. The scientific motivation is to obtain a better understanding of how language works. It recognizes that none of the other traditional disciplines on its own has the tools to completely address the problem of how language comprehension and production work. Even if you combine all the approaches, a comprehensive theory would be too complex to be studied using traditional methods. But we may be able to realize such complex theories as computer programs and then test them by observing how well they perform. By seeing where they fail, we can incrementally improve them. Computational models may provide very specific predictions about human behavior that can then be explored by the psycholinguist. By continuing in this process, we may eventually acquire a deep understanding of how human language processing occurs. To realize such a dream will take the combined efforts of linguists, psycholinguists, philosophers, and computer scientists. This common goal has motivated a new area of interdisciplinary research often called cognitive science.

The practical, or technological, motivation is that natural language processing capabilities would revolutionize the way computers are used.
Since most of human knowledge is recorded in linguistic form, computers that could understand natural language could access all this information. In addition, natural language interfaces to computers would allow complex systems to be accessible to everyone. Such systems would be considerably more flexible and intelligent than is possible with current computer technology. For technological purposes it does not matter if the model used reflects the way humans process language. It only matters that it works.

BOX 1.1 Boxes and Optional Sections

This book uses several techniques to allow you to identify what material is central and what is optional. In addition, optional material is sometimes classified as advanced, indicating that you may need additional background not covered in this book to fully appreciate the text. Boxes, like this one, always contain optional material, either providing more detail on a particular approach discussed in the main text or discussing additional issues that are related to the text. Sections and subsections may be marked as optional by means of an open dot (o) before the heading. Optional sections provide more breadth and depth to chapters, but are not necessary for understanding material in later chapters. Depending on your interests and focus, you can choose among the optional sections to fill out the core material presented in the regular sections. In addition, there are dependencies between the chapters, so that entire chapters can be skipped if the material does not address your interests. The chapter dependencies are not marked explicitly in the text, but a chart of dependencies is given in the preface.

This book takes a middle ground between the scientific and technological goals. On the one hand, this reflects a belief that natural language is so complex that an ad hoc approach without a well-specified underlying theory will not be successful. Thus the technological goal cannot be realized without using sophisticated underlying theories on the level of those being developed by linguists, psycholinguists, and philosophers. On the other hand, the present state of knowledge about natural language processing is so preliminary that attempting to build a cognitively correct model is not feasible. Rather, we are still attempting to construct any model that appears to work.

The goal of this book is to describe work that aims to produce linguistically motivated computational models of language understanding and production that can be shown to perform well in specific example domains. While the book focuses on computational aspects of language processing, considerable space is spent introducing the relevant background knowledge from the other disciplines that motivates and justifies the computational approaches taken. It assumes only a basic knowledge of programming, although the student with some background in linguistics, artificial intelligence (AI), and logic will appreciate additional subtleties in the development.

1.2 Applications of Natural Language Understanding

A good way to define natural language research is to consider the different applications that researchers work on. As you consider these examples, it will also be a good opportunity to consider what it would mean to say that a computer system understands natural language. The applications can be divided into two major classes: text-based applications and dialogue-based applications.
Text-based applications involve the processing of written text, such as books, newspapers, reports, manuals, e-mail messages, and so on. These are all reading-based tasks. Text-based natural language research is ongoing in applications such as

• finding appropriate documents on certain topics from a database of texts (for example, finding relevant books in a library)
• extracting information from messages or articles on certain topics (for example, building a database of all stock transactions described in the news on a given day)
• translating documents from one language to another (for example, producing automobile repair manuals in many different languages)
• summarizing texts for certain purposes (for example, producing a 3-page summary of a 1000-page government report)

Not all systems that perform such tasks must be using natural language understanding techniques in the way we mean in this book. For example, consider the task of finding newspaper articles on a certain topic in a large database. Many techniques have been developed that classify documents by the presence of certain keywords in the text. You can then retrieve articles on a certain topic by looking for articles that contain the keywords associated with that topic (a small sketch of this kind of keyword matching appears at the end of this discussion). Articles on law, for instance, might contain the words lawyer, court, sue, affidavit, and so on, while articles on stock transactions might contain words such as stocks, takeover, leveraged buyout, options, and so on. Such a system could retrieve articles on any topic that has been predefined by a set of keywords. Clearly, we would not say that this system is understanding the text; rather, it is using a simple matching technique. While such techniques may produce useful applications, they are inherently limited. It is very unlikely, for example, that they could be extended to handle complex retrieval tasks that are easily expressed in natural language, such as the query Find me all articles on leveraged buyouts involving more than 100 million dollars that were attempted but failed during 1986 and 1990. To handle such queries, the system would have to be able to extract enough information from each article in the database to determine whether the article meets the criteria defined by the query; that is, it would have to build a representation of the information in the articles and then use the representation to do the retrievals. This identifies a crucial characteristic of an understanding system: it must compute some representation of the information that can be used for later inference.

Consider another example. Some machine translation systems have been built that are based on pattern matching; that is, a sequence of words in one language is associated with a sequence of words in another language. The translation is accomplished by finding the best set of patterns that match the input and producing the associated output in the other language. This technique can produce reasonable results in some cases but sometimes produces completely wrong translations because of its inability to use an understanding of content to disambiguate word senses and sentence meanings appropriately. In contrast, other machine translation systems operate by producing a representation of the meaning of each sentence in one language, and then producing a sentence in the other language that realizes the same meaning. This latter approach, because it involves the computation of a representation of meaning, is using natural language understanding techniques.
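To make the contrast concrete, here is a minimal sketch of this kind of keyword matching, written in COMMON LISP, the language of the software that accompanies the book. The topic table and the function name are invented for this sketch and are not from the book.

(defparameter *topic-keywords*
  '((law    lawyer court sue affidavit)
    (stocks stocks takeover buyout options))
  "Each retrieval topic is predefined by a list of keywords.")

(defun article-topics (article-words)
  "Return every topic whose keyword list overlaps the words of the article."
  (loop for (topic . keywords) in *topic-keywords*
        when (intersection keywords article-words)
          collect topic))

;; (article-topics '(the court reviewed the takeover of the firm))
;;    => (LAW STOCKS)

A system like this can report which articles mention topic-related words, but it builds no representation of who bought what, when, or for how much, which is exactly what a query such as the one about leveraged buyouts above would require.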
One very attractive domain for text-based research is story understanding. In this task the system processes a story and then must answer questions about it. This is similar to the type of reading comprehension tests used in schools and provides a very rich method for evaluating the depth of understanding the system is able to achieve.

Dialogue-based applications involve human-machine communication. Most naturally this involves spoken language, but it also includes interaction using keyboards. Typical potential applications include

• question-answering systems, where natural language is used to query a database (for example, a query system to a personnel database)
• automated customer service over the telephone (for example, to perform banking transactions or order items from a catalogue)
• tutoring systems, where the machine interacts with a student (for example, an automated mathematics tutoring system)
• spoken language control of a machine (for example, voice control of a VCR or computer)
• general cooperative problem-solving systems (for example, a system that helps a person plan and schedule freight shipments)

Some of the problems faced by dialogue systems are quite different from those faced by text-based systems. First, the language used is very different, and the system needs to participate actively in order to maintain a natural, smooth-flowing dialogue. Dialogue requires the use of acknowledgments to verify that things are understood, and an ability to both recognize and generate clarification sub-dialogues when something is not clearly understood. Even with these differences, however, the basic processing techniques are fundamentally the same.

It is important to distinguish the problems of speech recognition from the problems of language understanding. A speech recognition system need not involve any language understanding. For instance, voice-controlled computers and VCRs are entering the market now. These do not involve natural language understanding in any general way. Rather, the words recognized are used as commands, much like the commands you send to a VCR using a remote control. Speech recognition is concerned only with identifying the words spoken from a given speech signal, not with understanding how words are used to communicate. To be an understanding system, the speech recognizer would need to feed its input to a natural language understanding system, producing what is often called a spoken language understanding system. With few exceptions, all the techniques discussed in this book are equally relevant for text-based and dialogue-based language understanding, and apply equally well whether the input is text, keyboard, or speech. The key characteristic of any understanding system is that it represents the meaning of sentences in some representation language that can be used later for further processing.

1.3 Evaluating Language Understanding Systems

As you can see, what counts as understanding might vary from application to application. If this is so, how can you tell if a system works? One obvious way to evaluate a system is to run the program and see how well it performs the task it was designed to do. If the program is meant to answer questions about a database of facts, you might ask it questions to see how good it is at producing the correct answers. If the system is designed to participate in simple conversations on a certain topic, you might try conversing with it.
This is called black box evaluation because it evaluates system performance without looking inside to see how it works. While ultimately this method of evaluation may be the best test of a system's capabilities, it is problematic in the early stages of research because early evaluation results can be misleading. Sometimes the techniques that produce the best results in the short term will not lead to the best results in the long term. For instance, if the overall performance of all known systems in a given application is uniformly low, few conclusions can be drawn. The fact that one system was correct 50 percent of the time while another was correct only 40 percent of the time says nothing about the long-term viability of either approach. Only when the success rates become high, making a practical application feasible, can much significance be given to overall system performance measures.

An alternative method of evaluation is to identify various subcomponents of a system and then evaluate each one with appropriate tests. This is called glass box evaluation because you look inside at the structure of the system. The problem with glass box evaluation is that it requires some consensus on what the various components of a natural language system should be. Achieving such a consensus is an area of considerable activity at present.

Despite the difficulties, evaluations are of crucial importance to the field, as very little can be concluded from examining a few examples that a particular system can handle. If you cannot interact freely with the program yourself but must depend on transcripts of others' interactions, you can easily be misled into overestimating a program's abilities. It is very hard for people not to attribute much more intelligence to the program than is actually present.

Consider a concrete example—the ELIZA program that was developed in the mid-1960s at MIT. This program must be one of the most popular AI programs of all time, and versions of it exist for most machines, including most personal computers. ELIZA was never claimed to embody a theory of language comprehension and production, but it serves as an excellent example because its behavior initially seems impressive. The system plays the role of a therapist and, to obtain the best results, the user should correspondingly play the role of a patient. Figure 1.2 presents a transcript of this system in operation.

User: Men are all alike.
ELIZA: In what way?
User: They're always bugging us about something or other.
ELIZA: Can you think of a specific example?
User: Well, my boyfriend made me come here.
ELIZA: Your boyfriend made you come here.
User: He says I'm depressed much of the time.
ELIZA: I am sorry to hear you are depressed.
User: It's true, I am unhappy.
ELIZA: Do you think coming here will help you not to be unhappy?
User: I need some help; that much seems certain.
ELIZA: What would it mean to you if you got some help?
User: Perhaps I could learn to get along with my mother.
ELIZA: Tell me more about your family.
User: My mother takes care of me.
ELIZA: Who else in your family takes care of you?
User: My father.
ELIZA: Your father.
User: You are like my father in some ways.
ELIZA: What resemblance do you see?

Figure 1.2 A dialogue with ELIZA

Given this transcript, or even playing with the system yourself for a few minutes, ELIZA's performance certainly seems impressive.

Here is a simple description of how ELIZA works. There is a database of particular words that are called keywords. For each keyword, the system stores an integer, a pattern to match against the input, and a specification of the output.
The algorithm is as follows: Given a sentence S, find a keyword in S whose pattern matches S. If there is more than one keyword, pick the one with the highest integer value. Use the output specification that is associated with this keyword to generate the next sentence. If there are no keywords, generate an innocuous continuation statement, such as Tell me more or Go on.

Figure 1.3 shows a fragment of a database of keywords. In this database a pattern consists of words and variables. The prefix ? before a letter indicates a variable, which can match any sequence of words. For example, the pattern ?X are you ?Y would match the sentence Why are you looking at me?, where the variable ?X matches Why and ?Y matches looking at me. The output specification may also use the same variables. In this case, ELIZA inserts the words that match the variables in the input into the output, after making some minor changes in the pronouns (for example, replacing me with you). Thus, for the pattern above, if the output specification is Would you prefer it if I weren't ?Y?, the rule would generate the response Would you prefer it if I weren't looking at you? When the database lists multiple output specifications for a given pattern, ELIZA selects a different one each time a keyword rule is used, thereby preventing unnatural repetition in the conversation.

Word     Rank   Pattern          Outputs
alike    10     ?X               In what way?
                                 What resemblance do you see?
are      3      ?X are you ?Y    Would you prefer it if I weren't ?Y?
         3      ?X are ?Y        What if they were not ?Y?
always   5      ?X               Can you think of a specific example?
                                 When?
                                 Really, always?
what     2      ?X               Why do you ask?
                                 Does that interest you?

Figure 1.3 Sample data from ELIZA

Using these rules, you can see how ELIZA produced the first two exchanges in the conversation in Figure 1.2. ELIZA generated the first response from the first output of the keyword alike and the second response from the first output of the keyword always. This description covers all of the essential points of the program.
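To make the matching process concrete, here is a minimal sketch of this style of keyword pattern matching, written in COMMON LISP, the language of the software that accompanies the book. It is only an illustrative sketch, not the distributed code: the function names are invented for this example, bindings are kept in a simple association list, and the keyword ranking and pronoun-swapping steps are omitted.

(defun variable-p (x)
  "A pattern variable is a symbol whose name starts with ?."
  (and (symbolp x) (char= (char (symbol-name x) 0) #\?)))

(defun match (pattern input &optional (bindings nil))
  "Match PATTERN against INPUT, both lists of words. Return an alist
of variable bindings, or :fail if the pattern does not match."
  (cond ((null pattern) (if (null input) bindings :fail))
        ((variable-p (first pattern))
         ;; Let the variable absorb 0, 1, 2, ... words, then match the rest.
         (loop for i from 0 to (length input)
               for result = (match (rest pattern) (subseq input i)
                                   (acons (first pattern) (subseq input 0 i)
                                          bindings))
               unless (eq result :fail) return result
               finally (return :fail)))
        ((and input (eql (first pattern) (first input)))
         (match (rest pattern) (rest input) bindings))
        (t :fail)))

(defun instantiate (output bindings)
  "Fill an output specification with the words bound to its variables."
  (mapcan (lambda (item)
            (if (variable-p item)
                (copy-list (cdr (assoc item bindings)))
                (list item)))
          output))

;; Using the "are" entry from Figure 1.3:
;; (match '(?x are ?y) '(you are stubborn))
;;    => ((?Y STUBBORN) (?X YOU))
;; (instantiate '(what if they were not ?y)
;;              (match '(?x are ?y) '(you are stubborn)))
;;    => (WHAT IF THEY WERE NOT STUBBORN)

A full ELIZA loop would scan the input for stored keywords, pick the one with the highest rank, and cycle through that keyword's outputs on successive uses, as described above.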
You will probably agree that the program does not understand the conversation it is participating in. Rather, it is a collection of tricks. Why then does ELIZA appear to function so well? There are several reasons. Perhaps the most important reason is that, when people hear or read a sequence of words that they understand as a sentence, they attribute meaning to the sentence and assume that the person (or machine) that produced the sentence actually intended that meaning. People are extremely good at distinguishing word meanings and interpreting sentences to fit the context. Thus ELIZA appears to be intelligent because you use your own intelligence to make sense of what it says.

Other crucial characteristics of the conversational setting also aid in sustaining the illusion of intelligence. For instance, the system does not need any world knowledge because it never has to make a claim, support an argument, or answer a question. Rather, it simply asks a series of questions. Except in a patient-therapist situation, this would be unacceptable. ELIZA evades all direct questions by responding with another question, such as Why do you ask? There is no way to force the program to say something concrete about any topic.

Even in such a restricted situation, however, it is relatively easy to demonstrate that the program does not understand. It sometimes produces completely off-the-wall responses. For instance, if you say Necessity is the mother of invention, it might respond with Tell me more about your family, based on its pattern for the word mother. In addition, since ELIZA has no knowledge about the structure of language, it accepts gibberish just as readily as valid sentences. If you enter Green the adzabak are the a ran four, ELIZA will respond with something like What if they were not the a ran four? Also, as a conversation progresses, it becomes obvious that the program does not retain any of the content in the conversation. It begins to ask questions that are inappropriate in light of earlier exchanges, and its responses in general begin to show a lack of focus. Of course, if you are not able to play with the program and must depend only on transcripts of conversations by others, you would have no way of detecting these flaws, unless they are explicitly mentioned.

Suppose you need to build a natural language program for a certain application in only six months. If you start to construct a general model of language understanding, it will not be completed in that time frame and so will perform miserably on the tests. An ELIZA-like system, however, could easily produce behavior like that previously discussed with less than a few months of programming and will appear to far outperform the other system in testing. The differences will be especially marked if the test data only include typical domain interactions that are not designed to test the limits of the system. Thus, if we take short-term performance as our only criterion of progress, everyone will build and fine-tune ELIZA-style systems, and the field will not progress past the limitations of the simple approach. To avoid this problem, either we have to accept certain theoretical assumptions about the architecture of natural language systems and develop specific evaluation measures for different components, or we have to discount overall evaluation results until some reasonably high level of performance is obtained. Only then will cross-system comparisons begin to reflect the potential for long-term success in the field.

1.4 The Different Levels of Language Analysis

A natural language system must use considerable knowledge about the structure of the language itself, including what the words are, how words combine to form sentences, what the words mean, how word meanings contribute to sentence meanings, and so on. However, we cannot completely account for linguistic behavior without also taking into account another aspect of what makes humans intelligent—their general world knowledge and their reasoning abilities. For example, to answer questions or to participate in a conversation, a person not only must know a lot about the structure of the language being used, but also must know about the world in general and the conversational setting in particular.

The following are some of the different forms of knowledge relevant for natural language understanding:
Phonetic and phonological knowledge—concerns how words are related to the sounds that realize them. Such knowledge is crucial for speech-based systems and is discussed in more detail in Appendix C.

Morphological knowledge—concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language (for example, the meaning of the word friendly is derivable from the meaning of the noun friend and the suffix -ly, which transforms a noun into an adjective).

Syntactic knowledge—concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases.

Semantic knowledge—concerns what words mean and how these meanings combine in sentences to form sentence meanings. This is the study of context-independent meaning—the meaning a sentence has regardless of the context in which it is used.

Pragmatic knowledge—concerns how sentences are used in different situations and how use affects the interpretation of the sentence.

Discourse knowledge—concerns how the immediately preceding sentences affect the interpretation of the next sentence. This information is especially important for interpreting pronouns and for interpreting the temporal aspects of the information conveyed.

World knowledge—includes the general knowledge about the structure of the world that language users must have in order to, for example, maintain a conversation. It includes what each language user must know about the other user's beliefs and goals.

These definitions are imprecise and are more characteristics of knowledge than actual distinct classes of knowledge. Any particular fact might include aspects from several different levels, and an algorithm might need to draw from several different levels simultaneously. For teaching purposes, however, this book is organized into three parts, each describing a set of techniques that naturally cluster together. Part I focuses on syntactic and morphological processing, Part II focuses on semantic processing, and Part III focuses on contextual effects in general, including pragmatics, discourse, and world knowledge.

BOX 1.2 Syntax, Semantics, and Pragmatics

The following examples may help you understand the distinction between syntax, semantics, and pragmatics. Consider each example as a candidate for the initial sentence of this book, which you know discusses natural language processing:

1. Language is one of the fundamental aspects of human behavior and is a crucial component of our lives.
2. Green frogs have large noses.
3. Green ideas have large noses.
4. Large have green ideas nose.

Sentence 1 appears to be a reasonable start (I hope!). It agrees with all that is known about syntax, semantics, and pragmatics. Each of the other sentences violates one or more of these levels. Sentence 2 is well-formed syntactically and semantically, but not pragmatically. It fares poorly as the first sentence of the book because the reader would find no reason for using it. But however bad sentence 2 would be as a start, sentence 3 is much worse. Not only is it obviously pragmatically ill-formed, it is also semantically ill-formed. To see this, consider that you and I could argue about whether sentence 2 is true or not, but we cannot do so with sentence 3. I cannot affirm or deny sentence 3 in coherent conversation. However, the sentence does have some structure, for we can discuss what is wrong with it: Ideas cannot be green and, even if they could, they certainly cannot have large noses. Sentence 4 is even worse. In fact, it is unintelligible, even though it contains the same words as sentence 3. It does not even have enough structure to allow you to say what is wrong with it. Thus it is syntactically ill-formed.
Incidentally, there are cases in which a sentence may be pragmatically well-formed but not syntactically well-formed. For example, if I ask you where you are going and you reply "I go store," the response would be understandable even though it is syntactically ill-formed. Thus it is at least pragmatically well-formed and may even be semantically well-formed.

1.5 Representations and Understanding

As previously stated, a crucial component of understanding involves computing a representation of the meaning of sentences and texts. Without defining the notion of representation, however, this assertion has little content. For instance, why not simply use the sentence itself as a representation of its meaning? One reason is that most words have multiple meanings, which we will call senses. The word cook, for example, has a sense as a verb and a sense as a noun; dish has multiple senses as a noun as well as a sense as a verb; and still has senses as a noun, verb, adjective, and adverb. This ambiguity would inhibit the system from making the appropriate inferences needed to model understanding. The disambiguation problem appears much easier than it actually is because people do not generally notice ambiguity. While a person does not seem to consider each of the possible senses of a word when understanding a sentence, a program must explicitly consider them one by one.

To represent meaning, we must have a more precise language. The tools to do this come from mathematics and logic and involve the use of formally specified representation languages. Formal languages are specified from very simple building blocks. The most fundamental is the notion of an atomic symbol, which is distinguishable from any other atomic symbol simply based on how it is written. Useful representation languages have the following two properties:

• The representation must be precise and unambiguous. You should be able to express every distinct reading of a sentence as a distinct formula in the representation.
• The representation should capture the intuitive structure of the natural language sentences that it represents. For example, sentences that appear to be structurally similar should have similar structural representations, and the meanings of two sentences that are paraphrases of each other should be closely related to each other.

Several different representations will be used that correspond to some of the levels of analysis discussed in the last section. In particular, we will develop formal languages for expressing syntactic structure, for context-independent word and sentence meanings, and for expressing general world knowledge.

Syntax: Representing Sentence Structure

The syntactic structure of a sentence indicates the way that words in the sentence are related to each other. This structure indicates how the words are grouped together into phrases, what words modify what other words, and what words are of central importance in the sentence. In addition, this structure may identify the types of relationships that exist between phrases and can store other information about the particular sentence structure that may be needed for later processing. For example, consider the following sentences:

1. John sold the book to Mary.
2. The book was sold to Mary by John.

These sentences share certain structural properties. In each, the noun phrases are John, Mary, and the book, and the act described is some selling action. In other respects, these sentences are significantly different.
For instance, even though both sentences are always either true or false in the exact same situations, you could only give sentence 1 as an answer to the question What did John do for Mary? Sentence 2 is a much better continuation of a sentence beginning with the phrase After it fell in the river, as sentences 3 and 4 show. Following the standard convention in linguistics, this book will use an asterisk (*) before any example of an ill-formed or questionable sentence.

3. *After it fell in the river, John sold Mary the book.
4. After it fell in the river, the book was sold to Mary by John.

Many other structural properties can be revealed by considering sentences that are not well-formed. Sentence 5 is ill-formed because the subject and the verb do not agree in number (the subject is singular and the verb is plural), while 6 is ill-formed because the verb put requires some modifier that describes where John put the object.

5. *John are in the corner.
6. *John put the book.

Making judgments on grammaticality is not a goal in natural language understanding. In fact, a robust system should be able to understand ill-formed sentences whenever possible. This might suggest that agreement checks can be ignored, but this is not so. Agreement checks are essential for eliminating potential ambiguities. Consider sentences 7 and 8, which are identical except for the number feature of the main verb, yet represent two quite distinct interpretations.

7. Flying planes are dangerous.
8. Flying planes is dangerous.

If you did not check subject-verb agreement, these two sentences would be indistinguishable and ambiguous. You could find similar examples for every syntactic feature that this book introduces and uses. Most syntactic representations of language are based on the notion of context-free grammars, which represent sentence structure in terms of what phrases are subparts of other phrases. This information is often presented in a tree form, such as the one shown in Figure 1.4, which shows two different structures for the sentence Rice flies like sand.

Figure 1.4 Two structural representations of Rice flies like sand (the tree diagrams are shown here in bracketed form):
  (S (NP (N Rice) (N flies)) (VP (V like) (NP (N sand))))
  (S (NP (N Rice)) (VP (V flies) (PP (P like) (NP (N sand)))))

In the first reading, the sentence is formed from a noun phrase (NP) describing a type of fly, rice flies, and a verb phrase (VP) that asserts that these flies like sand. In the second structure, the sentence is formed from a noun phrase describing a type of substance, rice, and a verb phrase stating that this substance flies like sand (say, if you throw it). The two structures also give further details on the structure of the noun phrase and verb phrase and identify the part of speech for each word. In particular, the word like is a verb (V) in the first reading and a preposition (P) in the second.

The Logical Form

The structure of a sentence doesn't reflect its meaning, however. For example, the NP the catch can have different meanings depending on whether the speaker is talking about a baseball game or a fishing expedition. Both these interpretations have the same syntactic structure, and the different meanings arise from an ambiguity concerning the sense of the word catch. Once the correct sense is identified, say the fishing sense, there still is a problem in determining what fish are being referred to.
The intended meaning of a sentence depends on the situation in which the sentence is produced. Rather than combining all these problems, this book will consider each one separately. The division is between context-independent meaning and context-dependent meaning. The fact that catch may refer to a baseball move or the results of a fishing expedition is knowledge about English and is independent of the situation in which the word is used. On the other hand, the fact that a particular noun phrase the catch refers to what Jack caught when fishing yesterday is contextually dependent. The representation of the context-independent meaning of a sentence is called its logical form.

The logical form encodes possible word senses and identifies the semantic relationships between the words and phrases. Many of these relationships are often captured using an abstract set of semantic relationships between the verb and its NPs. In particular, in both sentences 1 and 2 previously given, the action described is a selling event, where John is the seller, the book is the object being sold, and Mary is the buyer. These roles are instances of the abstract semantic roles AGENT, THEME, and TO-POSS (for final possessor), respectively. Once the semantic relationships are determined, some word senses may be impossible and thus eliminated from consideration. Consider the sentence

9. Jack invited Mary to the Halloween ball.

The word ball, which by itself is ambiguous between the plaything that bounces and the formal dance event, can only take the latter sense in sentence 9, because the verb invite only makes sense with this interpretation. One of the key tasks in semantic interpretation is to consider what combinations of the individual word meanings can combine to create coherent sentence meanings. Exploiting such interconnections between word meanings can greatly reduce the number of possible word senses for each word in a given sentence.

The Final Meaning Representation

The final representation needed is a general knowledge representation (KR), which the system uses to represent and reason about its application domain. This is the language in which all the specific knowledge based on the application is represented. The goal of contextual interpretation is to take a representation of the structure of a sentence and its logical form, and to map this into some expression in the KR that allows the system to perform the appropriate task in the domain. In a question-answering application, a question might map to a database query; in a story-understanding application, a sentence might map into a set of expressions that represent the situation that the sentence describes. For the most part, we will assume that the first-order predicate calculus (FOPC) is the final representation language because it is relatively well known, well studied, and precisely defined. While some inadequacies of FOPC will be examined later, these inadequacies are not relevant for most of the issues to be discussed.

1.6 The Organization of Natural Language Understanding Systems

This book is organized around the three levels of representation just discussed: syntactic structure, logical form, and the final meaning representation. Separating the problems in this way will allow you to study each problem in depth without worrying about other complications. Actual systems are usually organized slightly differently, however. In particular, Figure 1.5 shows the organization that this book assumes.
Figure 1.5 The flow of information (the diagram is summarized here). Understanding: Words (Input) -> Parsing -> Syntactic Structure and Logical Form -> Contextual Interpretation -> Final Meaning -> Application Reasoning. Generation: Application Reasoning -> Meaning of Response -> Utterance Planning -> Syntactic Structure and Logical Form of Response -> Realization -> Words (Response).

As you can see, there are interpretation processes that map from one representation to the other. For instance, the process that maps a sentence to its syntactic structure and logical form is called the parser. It uses knowledge about words and word meanings (the lexicon) and a set of rules defining the legal structures (the grammar) in order to assign a syntactic structure and a logical form to an input sentence.

An alternative organization could perform syntactic processing first and then perform semantic interpretation on the resulting structures. Combining the two, however, has considerable advantages because it leads to a reduction in the number of possible interpretations, since every proposed interpretation must simultaneously be syntactically and semantically well formed. For example, consider the following two sentences:

10. Visiting relatives can be trying.
11. Visiting museums can be trying.

These two sentences have identical syntactic structure, so both are syntactically ambiguous. In sentence 10, the subject might be relatives who are visiting you or the event of you visiting relatives. Both of these alternatives are semantically valid, and you would need to determine the appropriate sense by using the contextual mechanism. However, sentence 11 has only one possible semantic interpretation, since museums are not objects that can visit other people; rather they must be visited. In a system with separate syntactic and semantic processing, there would be two syntactic interpretations of sentence 11, one of which the semantic interpreter would eliminate later. If syntactic and semantic processing are combined, however, the system will be able to detect the semantic anomaly as soon as it interprets the phrase visiting museums, and thus will never build the incorrect syntactic structure in the first place. While the savings here seem small, in a realistic application a reasonable sentence may have hundreds of possible syntactic structures, many of which are semantically anomalous.

Continuing through Figure 1.5, the process that transforms the syntactic structure and logical form into a final meaning representation is called contextual processing. This process includes issues such as identifying the objects referred to by noun phrases such as definite descriptions (for example, the man) and pronouns, the analysis of the temporal aspects of the new information conveyed by the sentence, the identification of the speaker's intention (for example, whether Can you lift that rock is a yes/no question or a request), as well as all the inferential processing required to interpret the sentence appropriately within the application domain. It uses knowledge of the discourse context (determined by the sentences that preceded the current one) and knowledge of the application to produce a final representation. The system would then perform whatever reasoning tasks are appropriate for the application. When this requires a response to the user, the meaning that must be expressed is passed to the generation component of the system.
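To make the flow just described more concrete, the following fragment sketches it as a chain of functions. This is purely an illustrative sketch and not anything presented in this book: Python is not a notation used here, the function and variable names are invented for the example, and each body is a trivial stand-in for the component it names.

# A minimal sketch of the flow of information in Figure 1.5.
# Every function is a hypothetical placeholder; only the shape of the
# data flow between the components matters.

def parse(words):
    # Parser: map the input words to a syntactic structure and logical form.
    return {"syntax": ("S", words), "logical_form": ("LF", words)}

def contextual_interpretation(analysis, discourse_context):
    # Contextual processing: map the logical form to a final meaning
    # representation, using the discourse context.
    return {"meaning": analysis["logical_form"], "context": list(discourse_context)}

def application_reasoning(meaning):
    # Domain reasoning: decide what needs to be said in response.
    return {"response_meaning": ("ACKNOWLEDGE", meaning["meaning"])}

def utterance_planning(response_meaning, discourse_context):
    # Generation: plan the syntactic structure and logical form of a response.
    return ("S", ["okay"])

def realization(planned_structure):
    # Realization: map the planned structure into words.
    return planned_structure[1]

discourse_context = []
analysis = parse(["John", "sold", "the", "book", "to", "Mary"])
meaning = contextual_interpretation(analysis, discourse_context)
response = application_reasoning(meaning)
planned = utterance_planning(response["response_meaning"], discourse_context)
print(realization(planned))    # prints ['okay']

The only point of the sketch is that each stage consumes the representation produced by the stage before it; the chapters that follow are about what the real versions of these components must know and how they work.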
The generation component uses knowledge of the discourse context, plus information on the grammar and lexicon, to plan the form of an utterance, which then is mapped into words by a realization process. Of course, if this were a spoken language application, the words would not be the final input and output, but rather would be the output of a speech recognizer and the input to a speech synthesizer, as appropriate.

While this text focuses primarily on language understanding, notice that the same levels of knowledge are used for the generation task as well. For instance, knowledge of syntactic structure is encoded in the grammar. This grammar can be used either to identify the structure of a given sentence or to realize a structure as a sequence of words. A grammar that supports both processes is called a bidirectional grammar. While most researchers agree that bidirectional grammars are the preferred model, in actual practice grammars are often tailored for the understanding task or the generation task. This occurs because different issues are important for each task, and generally any given researcher focuses just on the problems related to their specific task. But even when the actual grammars differ between understanding and generation, the grammatical formalisms used remain the same.

Summary

This book describes computational theories of natural language understanding. The principal characteristic of understanding systems is that they compute representations of the meanings of sentences and use these representations in reasoning tasks. Three principal levels of representation were introduced that correspond to the three main subparts of this book. Syntactic processing is concerned with the structural properties of sentences; semantic processing computes a logical form that represents the context-independent meaning of the sentence; and contextual processing connects language to the application domain.

Related Work and Further Readings

A good idea of work in the field can be obtained by reading two articles in Shapiro (1992), under the headings "Computational Linguistics" and "Natural Language Understanding." There are also articles on specialized subareas such as machine translation, natural language interfaces, natural language generation, and so on. Longer surveys on certain areas are also available. Slocum (1985) gives a survey of machine translation, and Perrault and Grosz (1986) give a survey of natural language interfaces. You can find a description of the ELIZA program that includes the transcript of the dialogue in Figure 1.2 in Weizenbaum (1966). The basic technique of using template matching was developed further in the PARRY system, as described in the paper by Colby in Schank and Colby (1973). That same book also contains descriptions of early natural language systems, including those by Winograd and by Schank. Another important early system is the LUNAR system, an overview of which can be found in Woods (1977). For another perspective on the AI approach to natural language, refer to the introduction in Winograd (1983).

Exercises for Chapter 1

1. (easy) Define a set of data rules for ELIZA that would generate the first seven exchanges in the conversation in Figure 1.2.

2. (easy) Discover all of the possible meanings of the following sentences by giving a paraphrase of each interpretation. For each sentence, identify whether the different meanings arise from structural ambiguity, semantic ambiguity, or pragmatic ambiguity.
a. Time flies like an arrow.
b. He drew one card.
c. Mr. Spock was charged with illegal alien recruitment.
d. He crushed the key to my heart.

3. (easy) Classify these sentences along each of the following dimensions, given that the person uttering the sentence is responding to a complaint that the car is too cold: (i) syntactically correct or not; (ii) semantically correct or not; (iii) pragmatically correct or not.
a. The heater are on.
b. The tires are brand new.
c. Too many windows eat the stew.

4. (medium) Implement an ELIZA program that can use the rules that you developed in Exercise 1 and run it for that dialogue. Without adding any more rules, what does your program do on the next few utterances in the conversation in Figure 1.2? How does the program do if you run it in a different context—say, a casual conversation at a bar?

PART I Syntactic Processing

As discussed in the introduction, this book divides the task of understanding sentences into three stages. Part I of the book discusses the first stage, syntactic processing. The goal of syntactic processing is to determine the structural components of sentences. It determines, for instance, how a sentence is broken down into phrases, how those phrases are broken down into sub-phrases, and so on, all the way down to the actual structure of the words used. These structural relationships are crucial for determining the meaning of sentences using the techniques described in Parts II and III.

There are two major issues discussed in Part I. The first issue concerns the formalism that is used to specify what sentences are possible in a language. This information is specified by a set of rules called a grammar. We will be concerned both with the general issue of what constitutes good formalisms for writing grammars for natural languages, and with the specific issue of what grammatical rules provide a good account of English syntax. The second issue concerns how to determine the structure of a given sentence once you know the grammar for the language. This process is called parsing. There are many different algorithms for parsing, and this book will consider a sampling of the techniques that are most influential in the field.

Chapter 2 provides a basic background to English syntax for the reader who has not studied linguistics. It introduces the key concepts and distinctions that are common to virtually all syntactic theories. Chapter 3 introduces several formalisms that are in common use for specifying grammars and describes the basic parsing algorithms in detail. Chapter 4 introduces the idea of features, which extend the basic grammatical formalisms and allow many aspects of natural languages to be captured concisely. Chapter 5 then describes some of the more difficult aspects of natural languages, especially the treatment of questions, relative clauses, and other forms of movement phenomena. It shows how the feature systems can be extended so that they can handle these complex sentences. Chapter 6 discusses issues relating to ambiguity resolution. Some techniques are aimed at developing more efficient representations for storing multiple interpretations, while others are aimed at using local information to choose between alternative interpretations while the parsing is in progress. Finally, Chapter 7 discusses a relatively new area of research that uses statistical information derived from analyzing large databases of sentences.
This information can be used to identify the most likely classes for ambiguous words and the most likely structural analyses for structurally ambiguous sentences.

CHAPTER 2
Linguistic Background: An Outline of English Syntax
2.1 Words
2.2 The Elements of Simple Noun Phrases
2.3 Verb Phrases and Simple Sentences
2.4 Noun Phrases Revisited
2.5 Adjective Phrases
2.6 Adverbial Phrases

This chapter provides background material on the basic structure of English syntax for those who have not taken any linguistics courses. It reviews the major phrase categories and identifies their most important subparts. Along the way all the basic word categories used in the book are introduced. While the only structures discussed are those for English, much of what is said applies to nearly all other European languages as well. The reader who has some background in linguistics can quickly skim this chapter, as it does not address any computational issues. You will probably want to use this chapter as a reference source as you work through the rest of the chapters in Part I.

Section 2.1 describes issues related to words and word classes. Section 2.2 describes simple noun phrases, which are then used in Section 2.3 to describe simple verb phrases. Section 2.4 considers complex noun phrases that include embedded sentences such as relative clauses. The remaining sections briefly cover other types of phrases: adjective phrases in Section 2.5 and adverbial phrases in Section 2.6.

2.1 Words

At first glance the most basic unit of linguistic structure appears to be the word. The word, though, is far from the fundamental element of study in linguistics; it is already the result of a complex set of more primitive parts. The study of morphology concerns the construction of words from more basic components corresponding roughly to meaning units. There are two basic ways that new words are formed, traditionally classified as inflectional forms and derivational forms.

Inflectional forms use a root form of a word and typically add a suffix so that the word appears in the appropriate form given the sentence. Verbs are the best examples of this in English. Each verb has a basic form that then is typically changed depending on the subject and the tense of the sentence. For example, the verb sigh will take suffixes such as -s, -ing, and -ed to create the verb forms sighs, sighing, and sighed, respectively. These new words are all verbs and share the same basic meaning. Derivational morphology involves the derivation of new words from other forms. The new words may be in completely different categories from their subparts. For example, the noun friend is made into the adjective friendly by adding the suffix -ly. A more complex derivation would allow you to derive the noun friendliness from the adjective form. There are many interesting issues concerned with how words are derived and how the choice of word form is affected by the syntactic structure of the sentence that constrains it.

Traditionally, linguists classify words into different categories based on their uses. Two related areas of evidence are used to divide words into categories. The first area concerns the word's contribution to the meaning of the phrase that contains it, and the second area concerns the actual syntactic structures in which the word may play a role.
For example, you might posit the class noun as those words that can be used to identify the basic type of object, concept, or place being discussed, and adjective as those words that further qualify the object, concept, or place. Thus green would be an adjective and book a noun, as shown in the phrases the green book and green books. But things are not so simple: green might play the role of a noun, as in That green is lighter than the other, and book might play the role of a modifier, as in the book worm. In fact, most nouns seem to be able to be used as a modifier in some situations. Perhaps the classes should be combined, since they overlap a great deal. But other forms of evidence exist. Consider what words could complete the sentence It's so .... You might say It's so green, It's so hot, It's so true, and so on. Note that although book can be a modifier in the book worm, you cannot say *It's so book about anything. Thus there are two classes of modifiers: adjective modifiers and noun modifiers.

Consider again the case where adjectives can be used as nouns, as in the green. Not all adjectives can be used in such a way. For example, the noun phrase the hot can be used, given a context where there are hot and cold plates, in a sentence such as The hot are on the table. But this refers to the hot plates; it cannot refer to hotness in the way the phrase the green refers to green. With this evidence you could subdivide adjectives into two subclasses—those that can also be used to describe a concept or quality directly, and those that cannot. Alternatively, however, you can simply say that green is ambiguous between being an adjective or a noun and, therefore, falls in both classes. Since green can behave like any other noun, the second solution seems the most direct.

Using similar arguments, we can identify four main classes of words in English that contribute to the meaning of sentences. These classes are nouns, adjectives, verbs, and adverbs. Sentences are built out of phrases centered on these four word classes. Of course, there are many other classes of words that are necessary to form sentences, such as articles, pronouns, prepositions, particles, quantifiers, conjunctions, and so on. But these classes are fixed in the sense that new words in these classes are rarely introduced into the language. New nouns, verbs, adjectives, and adverbs, on the other hand, are regularly introduced into the language as it evolves. As a result, these classes are called the open class words, and the others are called the closed class words.

A word in any of the four open classes may be used to form the basis of a phrase. This word is called the head of the phrase and indicates the type of thing, activity, or quality that the phrase describes. For example, with noun phrases, the head word indicates the general classes of objects being described. The phrases

the dog
the mangy dog
the mangy dog at the pound

are all noun phrases that describe an object in the class of dogs. The first describes a member from the class of all dogs, the second an object from the class of mangy dogs, and the third an object from the class of mangy dogs that are at the pound. The word dog is the head of each of these phrases.
Similarly, the adjective phrases

hungry
very hungry
hungry as a horse

all describe the quality of hunger. In each case the word hungry is the head. In some cases a phrase may consist of a single head. For example, the word sand can be a noun phrase, hungry can be an adjective phrase, and walked can be a verb phrase. In many other cases the head requires additional phrases following it to express the desired meaning. For example, the verb put cannot form a verb phrase in isolation; thus the following words do not form a meaningful sentence: *Jack put. To be meaningful, the verb put must be followed by a noun phrase and a phrase describing a location, as in the verb phrase put the dog in the house. The phrase or set of phrases needed to complete the meaning of such a head is called the complement of the head. In the preceding phrase put is the head and the dog in the house is the complement. Heads of all the major classes may require complements. Figure 2.1 gives some examples of phrases, with the head indicated by boldface and the complements by italics.

Figure 2.1 Examples of heads and complements (the boldface and italics of the original are not reproduced here)
  Noun Phrases: The president of the company; His desire to succeed; Several challenges from the opposing team
  Verb Phrases: looked up the chimney; believed that the world was flat; ate the pizza
  Adjective Phrases: easy to assemble; happy that he'd won the prize; angry as a hippo
  Adverbial Phrases: rapidly like a bat; intermittently throughout the day; inside the house

In the remainder of this chapter, we will look at these different types of phrases in more detail and see how they are structured and how they contribute to the meaning of sentences.

2.2 The Elements of Simple Noun Phrases

Noun phrases (NPs) are used to refer to things: objects, places, concepts, events, qualities, and so on. The simplest NP consists of a single pronoun: he, she, they, you, me, it, I, and so on. Pronouns can refer to physical objects, as in the sentence It hid under the rug, to events, as in the sentence Once I opened the door, I regretted it for months, and to qualities, as in the sentence He was so angry, but he didn't show it. Pronouns do not take any modifiers except in rare forms, as in the sentence He who hesitates is lost.

Another basic form of noun phrase consists of a name or proper noun, such as John or Rochester. These nouns appear in capitalized form in carefully written English. Names may also consist of multiple words, as in the New York Times and Stratford-on-Avon. Excluding pronouns and proper names, the head of a noun phrase is usually a common noun. Nouns divide into two main classes:

count nouns—nouns that describe specific objects or sets of objects.
mass nouns—nouns that describe composites or substances.

Count nouns acquired their name because they can be counted. There may be one dog or many dogs, one book or several books, one crowd or several crowds. If a single count noun is used to describe a whole class of objects, it must be in its plural form. Thus you can say Dogs are friendly but not *Dog is friendly. Mass nouns cannot be counted. There may be some water, some wheat, or some sand. If you try to count with a mass noun, you change the meaning. For example, some wheat refers to a portion of some quantity of wheat, whereas one wheat is a single type of wheat rather than a single grain of wheat. A mass noun can be used to describe a whole class of material without using a plural form.
Thus you say Water is necessary for life, not *Waters are necessary for life.

In addition to a head, a noun phrase may contain specifiers and qualifiers preceding the head. The qualifiers further describe the general class of objects identified by the head, while the specifiers indicate how many such objects are being described, as well as how the objects being described relate to the speaker and hearer. Specifiers are constructed out of ordinals (such as first and second), cardinals (such as one and two), and determiners. Determiners can be subdivided into the following general classes:

articles—the words the, a, and an.
demonstratives—words such as this, that, these, and those.
possessives—noun phrases followed by the suffix 's, such as John's and the fat man's, as well as possessive pronouns, such as her, my, and whose.
wh-determiners—words used in questions, such as which and what.
quantifying determiners—words such as some, every, most, no, any, both, and half.

A simple noun phrase may have at most one determiner, one ordinal, and one cardinal. It is possible to have all three, as in the first three contestants. An exception to this rule exists with a few quantifying determiners such as many, few, several, and little. These words can be preceded by an article, yielding noun phrases such as the few songs we knew. Using this evidence, you could subcategorize the quantifying determiners into those that allow this and those that don't, but the present coarse categorization is fine for our purposes at this time.

The qualifiers in a noun phrase occur after the specifiers (if any) and before the head. They consist of adjectives and nouns being used as modifiers. The following are more precise definitions:

adjectives—words that attribute qualities to objects yet do not refer to the qualities themselves (for example, angry is an adjective that attributes the quality of anger to something).
noun modifiers—mass or count nouns used to modify another noun, as in the cook book or the ceiling paint can.

Before moving on to other structures, consider the different inflectional forms that nouns take and how they are realized in English. Two forms of nouns—the singular and plural forms—have already been mentioned. Pronouns take forms based on person (first, second, and third) and gender (masculine, feminine, and neuter). Each of these distinctions reflects a systematic analysis that is almost wholly explicit in some languages, such as Latin, while implicit in others. In French, for example, nouns are classified by their gender. In English many of these distinctions are not explicitly marked except in a few cases. The pronouns provide the best example of this. They distinguish number, person, gender, and case (that is, whether they are used as possessive, subject, or object), as shown in Figures 2.2 through 2.4.

Figure 2.2 Pronoun system (as subject)
  singular: first person I; second person you; third person he (masculine), she (feminine), it (neuter)
  plural: first person we; second person you; third person they

Figure 2.3 Pronoun system (possessives)
  singular: first person my; second person your; third person his, her, its
  plural: first person our; second person your; third person their

Figure 2.4 Pronoun system (as object)
  singular: first person me; second person you; third person him, her, it
  plural: first person us; second person you; third person them

Figure 2.5 Basic moods of sentences
  declarative (or assertion): The cat is sleeping.
  yes/no question: Is the cat sleeping?
  wh-question: What is sleeping? or Which cat is sleeping?
  imperative (or command): Shoot the cat!
Figure 2.6 The five verb forms
  base: hit, cry, go, be (Hit the ball! I want to go.)
  simple present: hit, cries, go, am (The dog cries every day. I am thirsty.)
  simple past: hit, cried, went, was (I was thirsty. I went to the store.)
  present participle: hitting, crying, going, being (I'm going to the store. Being the last in line aggravates me.)
  past participle: hit, cried, gone, been (I've been there before. The cake was gone.)

2.3 Verb Phrases and Simple Sentences

While an NP is used to refer to things, a sentence (S) is used to assert, query, or command. You may assert that some sentence is true, ask whether a sentence is true, or command someone to do something described in the sentence. The way a sentence is used is called its mood. Figure 2.5 shows four basic sentence moods.

A simple declarative sentence consists of an NP, the subject, followed by a verb phrase (VP), the predicate. A simple VP may consist of some adverbial modifiers followed by the head verb and its complements. Every verb must appear in one of the five possible forms shown in Figure 2.6.

Figure 2.7 The basic tenses (tense, the verb sequence, example)
  simple present: simple present (He walks to the store.)
  simple past: simple past (He walked to the store.)
  simple future: will + infinitive (He will walk to the store.)
  present perfect: have in present + past participle (He has walked to the store.)
  future perfect: will + have in infinitive + past participle (I will have walked to the store.)
  past perfect (or pluperfect): have in past + past participle (I had walked to the store.)

Figure 2.8 The progressive tenses (tense, structure, example)
  present progressive: be in present + present participle (He is walking.)
  past progressive: be in past + present participle (He was walking.)
  future progressive: will + be in infinitive + present participle (He will be walking.)
  present perfect progressive: have in present + be in past participle + present participle (He has been walking.)
  future perfect progressive: will + have in infinitive + be in past participle + present participle (He will have been walking.)
  past perfect progressive: have in past + be in past participle + present participle (He had been walking.)

Verbs can be divided into several different classes: the auxiliary verbs, such as be, do, and have; the modal verbs, such as will, can, and could; and the main verbs, such as eat, ran, and believe. The auxiliary and modal verbs usually take a verb phrase as a complement, which produces a sequence of verbs, each the head of its own verb phrase. These sequences are used to form sentences with different tenses.

The tense system identifies when the proposition described in the sentence is said to be true. The tense system is complex; only the basic forms are outlined in Figure 2.7. In addition, verbs may be in the progressive tense. Corresponding to the tenses listed in Figure 2.7 are the progressive tenses shown in Figure 2.8.

Figure 2.9 Person/number forms of verbs
  singular: first person I am, I walk; second person you are, you walk; third person he is, she walks
  plural: first person we are, we walk; second person you are, you walk; third person they are, they walk

Each progressive tense is formed by the normal tense construction of the verb be followed by a present participle. Verb groups also encode person and number information in the first word in the verb group. The person and number must agree with the noun phrase that is the subject of the verb phrase.
Some verbs distinguish nearly all the possibilities, but most verbs distinguish only the third person singular (by adding an -s suffix). Some examples are shown in Figure 2.9.

Transitivity and Passives

The last verb in a verb sequence is called the main verb, and is drawn from the open class of verbs. Depending on the verb, a wide variety of complement structures are allowed. For example, certain verbs may stand alone with no complement. These are called intransitive verbs and include examples such as laugh (for example, Jack laughed) and run (for example, He will have been running). Another common complement form requires a noun phrase to follow the verb. These are called transitive verbs and include verbs such as find (for example, Jack found a key). Notice that find cannot be intransitive (for example, *Jack found is not a reasonable sentence), whereas laugh cannot be transitive (for example, *Jack laughed a key is not a reasonable sentence). A verb like run, on the other hand, can be transitive or intransitive, but the meaning of the verb is different in each case (for example, Jack ran vs. Jack ran the machine).

Transitive verbs allow another form of verb group called the passive form, which is constructed using a be auxiliary followed by the past participle. In the passive form the noun phrase that would usually be in the object position is used in the subject position, as can be seen by the examples in Figure 2.10. Note that tense is still carried by the initial verb in the verb group. Also, even though the first noun phrase semantically seems to be the object of the verb in passive sentences, it is syntactically the subject. This can be seen by checking the pronoun forms. For example, I was hit is correct, not *Me was hit. Furthermore, the tense and number agreement is between the verb and the syntactic subject. Thus you say I was hit by them, not *I were hit by them.

Figure 2.10 Active sentences with corresponding passive sentences
  Active: Jack saw the ball. Passive: The ball was seen by Jack.
  Active: I will find the clue. Passive: The clue will be found by me.
  Active: Jack hit me. Passive: I was hit by Jack.

Some verbs allow two noun phrases to follow them in a sentence; for example, Jack gave Sue a book or Jack found me a key. In such sentences the second NP corresponds to the object NP outlined earlier and is sometimes called the direct object. The other NP is called the indirect object. Generally, such sentences have an equivalent sentence where the indirect object appears with a preposition, as in Jack gave a book to Sue or Jack found a key for me.

Particles

Some verb forms are constructed from a verb and an additional word called a particle. Particles generally overlap with the class of prepositions considered in the next section. Some examples are up, out, over, and in. With verbs such as look, take, or put, you can construct many different verbs by combining the verb with a particle (for example, look up, look out, look over, and so on). In some sentences the difference between a particle and a preposition results in two different readings for the same sentence. For example, look over the paper would mean reading the paper, if you consider over a particle (the verb is look over). In contrast, the same sentence would mean looking at something else behind or above the paper, if you consider over a preposition (the verb is look). You can make a sharp distinction between particles and prepositions when the object of the verb is a pronoun.
With a verb-particle sentence, the pronoun must precede the particle, as in I looked it up. With the prepositional reading, the pronoun follows the preposition, as in I looked up it. Particles also may follow the object NP. Thus you can say I gave up the game to Mary or I gave the game up to Mary. This is not allowed with prepositions; for example, you cannot say *I climbed the ladder up.

Clausal Complements

Many verbs allow clauses as complements. Clauses share most of the same properties of sentences and may have a subject, indicate tense, and occur in passivized forms. One common clause form consists of a sentence form preceded by the complementizer that, as in that Jack ate the pizza. This clause will be identified by the expression S[that], indicating a special subclass of S structures. This clause may appear as the complement of the verb know, as in Sam knows that Jack ate the pizza. The passive is possible, as in Sam knows that the pizza was eaten by Jack.

Another clause type involves the infinitive form of the verb. The VP[inf] clause is simply a VP starting in the infinitive form, as in the complement of the verb wish in Jack wishes to eat the pizza. An infinitive sentence S[inf] form is also possible where the subject is indicated by a for phrase, as in Jack wishes for Sam to eat the pizza. Another important class of clauses are sentences with complementizers that are wh-words, such as who, what, where, why, whether, and how many. These question clauses, S[WH], can be used as a complement of verbs such as know, as in Sam knows whether we went to the party and The police know who committed the crime.

Prepositional Phrase Complements

Many verbs require complements that involve a specific prepositional phrase (PP). The verb give takes a complement consisting of an NP and a PP with the preposition to, as in Jack gave the book to the library. No other preposition can be used. Consider *Jack gave the book from the library. (OK only if from the library modifies book.) In contrast, a verb like put can take any PP that describes a location, as in

Jack put the book in the box.
Jack put the book inside the box.
Jack put the book by the door.

To account for this, we allow complement specifications that indicate prepositional phrases with particular prepositions. Thus the verb give would have a complement of the form NP+PP[to]. Similarly the verb decide would have a complement form NP+PP[about], and the verb blame would have a complement form NP+PP[on], as in Jack blamed the accident on the police. Verbs such as put, which take any phrase that can describe a location (complement NP+Location), are also common in English. While locations are typically prepositional phrases, they also can be noun phrases, such as home, or particles, such as back or here.

A distinction can be made between phrases that describe locations and phrases that describe a path of motion, although many location phrases can be interpreted either way. The distinction can be made in some cases, though. For instance, prepositional phrases beginning with to generally indicate a path of motion. Thus they cannot be used with a verb such as put that requires a location (for example, *I put the ball to the box). This distinction will be explored further in Chapter 4. Figure 2.11 summarizes many of the verb complement structures found in English. A full list would contain over 40 different forms.
Note that while the examples typically use a different verb for each form, most verbs will allow several different complement structures.

Figure 2.11 Some common verb complement structures in English (verb, complement structure, example)
  laugh: Empty (intransitive). Jack laughed.
  find: NP (transitive). Jack found a key.
  give: NP+NP (bitransitive). Jack gave Sue the paper.
  give: NP+PP[to]. Jack gave the book to the library.
  reside: Location phrase. Jack resides in Rochester.
  put: NP+Location phrase. Jack put the book inside.
  speak: PP[with]+PP[about]. Jack spoke with Sue about the book.
  try: VP[to]. Jack tried to apologize.
  tell: NP+VP[to]. Jack told the man to go.
  wish: S[to]. Jack wished for the man to go.
  keep: VP[ing]. Jack keeps hoping for the best.
  catch: NP+VP[ing]. Jack caught Sam looking in his desk.
  watch: NP+VP[base]. Jack watched Sam eat the pizza.
  regret: S[that]. Jack regretted that he'd eaten the whole thing.
  tell: NP+S[that]. Jack told Sue that he was sorry.
  seem: ADJP. Jack seems unhappy in his new job.
  think: NP+ADJP. Jack thinks Sue is happy in her job.
  know: S[WH]. Jack knows where the money is.

2.4 Noun Phrases Revisited

Section 2.2 introduced simple noun phrases. This section considers more complex forms in which NPs contain sentences or verb phrases as subcomponents. All the examples in Section 2.2 had heads that took the null complement. Many nouns, however, may take complements. Many of these fall into the class of complements that require a specific prepositional phrase. For example, the noun love has a complement form PP[of], as in their love of France, the noun reliance has the complement form PP[on], as in his reliance on handouts, and the noun familiarity has the complement form PP[with], as in a familiarity with computers. Many nouns, such as desire, reluctance, and research, take an infinitive VP form as a complement, as in the noun phrases his desire to release the guinea pig, a reluctance to open the case again, and the doctor's research to find a cure for cancer. These nouns, in fact, can also take the S[inf] form, as in my hope for John to open the case again.

Noun phrases can also be built out of clauses, which were introduced in the last section as the complements for verbs. For example, a that clause (S[that]) can be used as the subject of a sentence, as in the sentence That George had the ring was surprising. Infinitive forms of verb phrases (VP[inf]) and sentences (S[inf]) can also function as noun phrases, as in the sentences To own a car would be delightful and For us to complete a project on time would be unprecedented. In addition, the gerundive forms (VP[ing] and S[ing]) can also function as noun phrases, as in the sentences Giving up the game was unfortunate and John's giving up the game caused a riot.

Relative clauses involve sentence forms used as modifiers in noun phrases. These clauses are often introduced by relative pronouns such as who, which, that, and so on, as in

The man who gave Bill the money ...
The rug that George gave to Ernest ...
The man whom George gave the money to ...

In each of these relative clauses, the embedded sentence is the same structure as a regular sentence except that one noun phrase is missing. If this missing NP is filled in with the NP that the sentence modifies, the result is a complete sentence that captures the same meaning as what was conveyed by the relative clause.
The missing NPs in the preceding three sentences occur in the subject position, in the object position, and as object to a preposition, respectively. Deleting the relative pronoun and filling in the missing NP in each produces the following:

The man gave Bill the money.
George gave the rug to Ernest.
George gave the money to the man.

As was true earlier, relative clauses can be modified in the same ways as regular sentences. In particular, passive forms of the preceding sentences would be as follows:

Bill was given the money by the man.
The rug was given to Ernest by George.
The money was given to the man by George.

Correspondingly, these sentences could have relative clauses in the passive form as follows:

The man Bill was given the money by ...
The rug that was given to Ernest by George ...
The man whom the money was given to by George ...

Notice that some relative clauses need not be introduced by a relative pronoun. Often the relative pronoun can be omitted, producing what is called a base relative clause, as in the NP the man George gave the money to. Yet another form deletes the relative pronoun and an auxiliary be form, creating a reduced relative clause, as in the NP the man given the money, which means the same as the NP the man who was given the money.

2.5 Adjective Phrases

You have already seen simple adjective phrases (ADJPs) consisting of a single adjective in several examples. More complex adjective phrases are also possible, as adjectives may take many of the same complement forms that occur with verbs. This includes specific prepositional phrases, as with the adjective pleased, which takes the complement form PP[with] (for example, Jack was pleased with the prize), or angry with the complement form PP[at] (for example, Jack was angry at the committee). Angry also may take an S[that] complement form, as in Jack was angry that he was left behind. Other adjectives take infinitive forms, such as the adjective willing with the complement form VP[inf], as in Jack seemed willing to lead the chorus.

These more complex adjective phrases are most commonly found as the complements of verbs such as be or seem, or following the head in a noun phrase. They generally cannot be used as modifiers preceding the heads of noun phrases (for example, consider *the angry at the committee man vs. the angry man vs. the man angry at the committee). Adjective phrases may also take a degree modifier preceding the head, as in the adjective phrase very angry or somewhat fond of Mary. More complex degree modifications are possible, as in far too heavy and much more desperate. Finally, certain constructs have degree modifiers that involve their own complement forms, as in too stupid to come in out of the rain, so boring that everyone fell asleep, and as slow as a dead horse.

2.6 Adverbial Phrases

You have already seen adverbs in use in several constructs, such as indicators of degree (for example, very, rather, too), and in location phrases (for example, here, everywhere). Other forms of adverbs indicate the manner in which something is done (for example, slowly, hesitantly), the time of something (for example, now, yesterday), or the frequency of something (for example, frequently, rarely, never).
Adverbs may occur in several different positions in sentences: in the sentence initial position (for example, Then, Jack will open the drawer), in the verb sequence (for example, Jack then will open the drawer, Jack will then open the drawer), and in the sentence final position (for example, Jack opened the drawer then). The exact restrictions on what adverb can go where, however, are quite idiosyncratic to the particular adverb.

In addition to these adverbs, adverbial modifiers can be constructed out of a wide range of constructs, such as prepositional phrases indicating, among other things, location (for example, in the box) or manner (for example, in great haste); noun phrases indicating, among other things, frequency (for example, every day); or clauses indicating, among other things, the time (for example, when the bomb exploded). Such adverbial phrases, however, usually cannot occur except in the sentence initial or sentence final position. For example, we can say Every day John opens his drawer or John opens his drawer every day, but not *John every day opens his drawer.

Because of the wide range of forms, it generally is more useful to consider adverbial phrases (ADVPs) by function rather than syntactic form. Thus we can consider manner, temporal, duration, location, degree, and frequency adverbial phrases each as its own form. We considered the location and degree forms earlier, so here we will consider some of the others.

Temporal adverbials occur in a wide range of forms: adverbial particles (for example, now), noun phrases (for example, today, yesterday), prepositional phrases (for example, at noon, during the fight), and clauses (for example, when the clock struck noon, before the fight started).

Frequency adverbials also can occur in a wide range of forms: particles (for example, often), noun phrases (for example, every day), prepositional phrases (for example, at every party), and clauses (for example, every time that John comes for a visit).

Duration adverbials appear most commonly as prepositional phrases (for example, for three hours, about 20 feet) and clauses (for example, until the moon turns blue).

Manner adverbials occur in a wide range of forms, including particles (for example, slowly), noun phrases (for example, this way), prepositional phrases (for example, in great haste), and clauses (for example, by holding the embers at the end of a stick).

In the analyses that follow, adverbials will most commonly occur as modifiers of the action or state described in a sentence. As such, an issue arises as to how to distinguish verb complements from adverbials. One distinction is that adverbial phrases are always optional. Thus you should be able to delete the adverbial and still have a sentence with approximately the same meaning (missing, obviously, the contribution of the adverbial). Consider the sentences

Jack put the box by the door.
Jack ate the pizza by the door.

In the first sentence the prepositional phrase is clearly a complement, since deleting it to produce *Jack put the box results in a nonsensical utterance. On the other hand, deleting the phrase from the second sentence has only a minor effect: Jack ate the pizza is just a less general assertion about the same situation described by Jack ate the pizza by the door.
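Although this chapter deliberately avoids computational issues, the complement distinctions just described are exactly the kind of information later chapters will need to store with each word. As a purely illustrative preview (written in Python, which is not a notation used in this book, with the entries and names invented for the sketch), a fragment of a lexicon might simply pair each verb with the complement patterns of Figure 2.11:

# Hypothetical lexicon fragment: each verb is paired with the complement
# patterns it allows, written in the notation of Figure 2.11.
# The particular entries are assumptions made for this sketch only.
COMPLEMENTS = {
    "laugh": ["Empty"],                # intransitive: Jack laughed.
    "find":  ["NP"],                   # transitive: Jack found a key.
    "give":  ["NP+NP", "NP+PP[to]"],   # Jack gave Sue the paper / the book to the library.
    "put":   ["NP+Location"],          # Jack put the book inside.
    "know":  ["S[that]", "S[WH]"],     # Sam knows that / where ...
}

def allows(verb, pattern):
    # True if the verb is listed with the given complement pattern.
    return pattern in COMPLEMENTS.get(verb, [])

print(allows("give", "NP+PP[to]"))   # True
print(allows("laugh", "NP"))         # False: *Jack laughed a key.

Part I returns to this idea when features and the lexicon are introduced in Chapter 4.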
Summary

The major phrase structures of English have been introduced—namely, noun phrases, sentences, prepositional phrases, adjective phrases, and adverbial phrases. These will serve as the building blocks for the syntactic structures introduced in the following chapters.

Related Work and Further Readings

An excellent overview of English syntax is found in Baker (1989). The most comprehensive sources are books that attempt to describe the entire structure of English, such as Huddleston (1988), Quirk et al. (1972), and Leech and Svartvik (1975).

Exercises for Chapter 2

1. (easy) The text described several different example tests for distinguishing word classes. For example, nouns can occur in sentences of the form I saw the X, whereas adjectives can occur in sentences of the form It's so X. Give some additional tests to distinguish these forms and to distinguish between count nouns and mass nouns. State whether each of the following words can be used as an adjective, count noun, or mass noun. If the word is ambiguous, give all its possible uses.
milk, house, liquid, green, group, concept, airborne

2. (easy) Identify every major phrase (noun, verb, adjective, or adverbial phrases) in the following sentences. For each, indicate the head of the phrase and any complements of the head. Be sure to distinguish between complements and optional modifiers.
The man played his fiddle in the street.
The people dissatisfied with the verdict left the courtroom.

3. (easy) A very useful test for determining the syntactic role of words and phrases is the conjunction test. Conjunctions, such as and and or, tend to join together two phrases of the same type. For instance, you can conjoin nouns, as in the man and woman; noun phrases, as in the angry men and the large dogs; and adjective phrases, as in the cow, angry and confused, broke the gate. In each of the following sentences, identify the type of phrase being conjoined, and underline each phrase.
He was tired and hungrier than a herd of elephants.
We have never walked home or to the store from here.
The dog returned quickly and dropped the stick.

4. (easy) Explain in detail, using the terminology of this chapter, why each of the following sentences is ill formed. In particular, state what rule (given in this chapter) has been violated.
a. He barked the wrong tree up.
b. She turned waters into wine.
c. Don't take many all the cookies!
d. I feel floor today.
e. They all laughed the boy.

5. (easy) Classify the following verbs as being intransitive, transitive, or bitransitive (that is, it takes a two-NP complement). If the verb can be used in more than one of these forms, give each possible classification. Give an example sentence for each form to demonstrate your analysis.
a. cry
b. sing
c. donate
d. put

6. (easy) Using the verb to be, give example sentences that use the six basic tenses listed in Figure 2.7 and the six progressive tenses listed in Figure 2.8.

7. (easy) Using the verb donate, give examples of the passive form for each of the six basic tenses listed in Figure 2.7.

8. (easy) Classify the following verbs by specifying what different complement structures they allow, using the forms defined in Figure 2.11.
give, know, assume, insert
Give an example of an additional complement structure that is allowed by one of these verbs but not listed in the figure.
9. (easy) Find five verbs not discussed in this chapter that take an indirect object and, for each one, give a paraphrase of the same sentence using a prepositional phrase instead of the indirect object. Try for as wide a range as possible. Can you find one that cannot be paraphrased using either the preposition to or for?

10. (medium) Wh-questions are questions that use a class of words that includes what, where, who, when, whose, which, and how. For each of these words, give the syntactic categories (for example, verb, noun, noun group, adjective, quantifier, prepositional phrase, and so on) in which the words can be used. Justify each classification with some examples that demonstrate it. Use both positive and negative arguments as necessary (such as "it is one of these because . . .," or "it can't be one of these even though it looks like it might, because . . .").

CHAPTER 3
Grammars and Parsing
3.1 Grammars and Sentence Structure
3.2 What Makes a Good Grammar
3.3 A Top-Down Parser
3.4 A Bottom-Up Chart Parser
3.5 Transition Network Grammars
o 3.6 Top-Down Chart Parsing
o 3.7 Finite State Models and Morphological Processing
o 3.8 Grammars and Logic Programming

To examine how the syntactic structure of a sentence can be computed, you must consider two things: the grammar, which is a formal specification of the structures allowable in the language, and the parsing technique, which is the method of analyzing a sentence to determine its structure according to the grammar. This chapter examines different ways to specify simple grammars and considers some fundamental parsing techniques. Chapter 4 then describes the methods for constructing syntactic representations that are useful for later semantic interpretation.

The discussion begins by introducing a notation for describing the structure of natural language and describing some naive parsing techniques for that grammar. The second section describes some characteristics of a good grammar. The third section then considers a simple parsing technique and introduces the idea of parsing as a search process. The fourth section describes a method for building efficient parsers using a structure called a chart. The fifth section then describes an alternative representation of grammars based on transition networks. The remaining sections deal with optional and advanced issues. Section 3.6 describes a top-down chart parser that combines the advantages of top-down and bottom-up approaches. Section 3.7 introduces the notion of finite state transducers and discusses their use in morphological processing. Section 3.8 shows how to encode context-free grammars as assertions in PROLOG, introducing the notion of logic grammars.

3.1 Grammars and Sentence Structure

This section considers methods of describing the structure of sentences and explores ways of characterizing all the legal structures in a language. The most common way of representing how a sentence is broken into its major subparts, and how those subparts are broken up in turn, is as a tree. The tree representation for the sentence John ate the cat is shown in Figure 3.1. This illustration can be read as follows: The sentence (S) consists of an initial noun phrase (NP) and a verb phrase (VP). The initial noun phrase is made of the simple NAME John. The verb phrase is composed of a verb (V) ate and an NP, which consists of an article (ART) the and a common noun (N) cat.
In list notation this same structure could be represented as

(S (NP (NAME John))
   (VP (V ate)
       (NP (ART the)
           (N cat))))

Since trees play such an important role throughout this book, some terminology needs to be introduced. Trees are a special form of graph, which are structures consisting of labeled nodes (for example, the nodes are labeled S, NP, and so on in Figure 3.1) connected by links. They are called trees because they resemble upside-down trees, and much of the terminology is derived from this analogy with actual trees. The node at the top is called the root of the tree, while the nodes at the bottom are called the leaves.

Figure 3.1 A tree representation of John ate the cat

1. S -> NP VP          5. NAME -> John
2. VP -> V NP          6. V -> ate
3. NP -> NAME          7. ART -> the
4. NP -> ART N         8. N -> cat

Grammar 3.2 A simple grammar

We say a link points from a parent node to a child node. The node labeled S in Figure 3.1 is the parent node of the nodes labeled NP and VP, and the node labeled NP is in turn the parent node of the node labeled NAME. While every child node has a unique parent, a parent may point to many child nodes. An ancestor of a node N is defined as N's parent, or the parent of its parent, and so on. A node is dominated by its ancestor nodes. The root node dominates all other nodes in the tree.

To construct a tree structure for a sentence, you must know what structures are legal for English. A set of rewrite rules describes what tree structures are allowable. These rules say that a certain symbol may be expanded in the tree by a sequence of other symbols. A set of rules that would allow the tree structure in Figure 3.1 is shown as Grammar 3.2. Rule 1 says that an S may consist of an NP followed by a VP. Rule 2 says that a VP may consist of a V followed by an NP. Rules 3 and 4 say that an NP may consist of a NAME or may consist of an ART followed by an N. Rules 5-8 define possible words for the categories. Grammars consisting entirely of rules with a single symbol on the left-hand side, called the mother, are called context-free grammars (CFGs). CFGs are a very important class of grammars for two reasons: The formalism is powerful enough to describe most of the structure in natural languages, yet it is restricted enough so that efficient parsers can be built to analyze sentences. Symbols that cannot be further decomposed in a grammar, namely the words in the preceding example, are called terminal symbols. The other symbols, such as NP, VP, and S, are called nonterminal symbols. The grammatical symbols such as N and V that describe word categories are called lexical symbols. Of course, many words will be listed under multiple categories. For example, can would be listed under V and N.

Grammars have a special symbol called the start symbol. In this book, the start symbol will always be S. A grammar is said to derive a sentence if there is a sequence of rules that allow you to rewrite the start symbol into the sentence. For instance, Grammar 3.2 derives the sentence John ate the cat. This can be seen by showing the sequence of rewrites starting from the S symbol, as follows:

NP VP             (rewriting S)
NAME VP           (rewriting NP)
John VP           (rewriting NAME)
John V NP         (rewriting VP)
John ate NP       (rewriting V)
John ate ART N    (rewriting NP)
John ate the N    (rewriting ART)
John ate the cat  (rewriting N)

Two important processes are based on derivations. The first is sentence generation, which uses derivations to construct legal sentences.
A simple generator could be implemented by randomly choosing rewrite rules, starting from the S symbol, until you have a sequence of words. The preceding example shows that the sentence John ate the cat can be generated from the grammar. The second process based on derivations is parsing, which identifies the structure of sentences given a grammar. There are two basic methods of searching. A top-down strategy starts with the S symbol and then searches through different ways to rewrite the symbols until the input sentence is generated, or until all possibilities have been explored. The preceding example demonstrates that John ate the cat is a legal sentence by showing the derivation that could be found by this process. In a bottom-up strategy, you start with the words in the sentence and use the rewrite rules backward to reduce the sequence of symbols until it consists solely of S. The left-hand side of each rule is used to rewrite the symbol on the right-hand side. A possible bottom-up parse of the sentence John ate the cat is NAME ate the cat NAME V the cat NAME V ART cat NAME V ART N NP V ART N NP VNP NPVP S (rewriting John) (rewriting ate) (rewriting the) (rewriting cat) (rewriting NAME) (rewriting ART N) (rewriting V NP) (rewriting NP VP) A tree representation, such as Figure 3.1, can be viewed as a record of the CFG rules that account for the structure of the sentence. In other words, if you 44 CHAPTER 3 Grammars and Parsing 45 kept a record of the parsing process, working either top-down or bottom-up, it would be something similar to the parse tree representation. 3.2 What Makes a Good Grammar In constructing a grammar for a language, you are interested in generality, the range of sentences the grammar analyzes correctly; selectivity, the range of non-sentences it identifies as problematic; and understandability, the simplicity of the grammar itself. In small grammars, such as those that describe only a few types of sentences, one structural analysis of a sentence may appear as understandable as another, and little can be said as to why one is superior to the other. As you attempt to extend a grammar to cover a wide range of sentences, however, you often find that one analysis is easily extendable while the other requires complex modification. The analysis that retains its simplicity and generality as it is extended is more desirable. Unfortunately, here you will be working mostly with small grammars and so will have only a few opportunities to evaluate an analysis as it is extended. You can attempt to make your solutions generalizable, however, by keeping in mind certain properties that any solution should have. In particular, pay close attention to the way the sentence is divided into its subparts, called constituents. Besides using your intuition, you can apply a few specific tests, discussed here. Anytime you decide that a group of words forms a particular constituent, try to construct a new sentence that involves that group of words in a conjunction with another group of words classified as the same type of constituent. This is a good test because for the most part only constituents of the same type can be conjoined. The sentences in Figure 3.3, for example, are acceptable, but the following sentences are not: *I ate a hamburger and on the stove. *I ate a cold hot dog and well burned. *I ate the hot dog slowly and a hamburger. To summarize, if the proposed constituent doesn't conjoin in some sentence with a constituent of the same class, it is probably incorrect. 
Another test involves inserting the proposed constituent into other sentences that take the same category of constituent. For example, if you say that John's hitting of Mary is an NP in John's hitting of Mary alarmed Sue, then it should be usable as an NP in other sentences as well. In fact this is true—the NP can be the object of a verb, as in / cannot explain John's hitting of Mary as well as in the passive form of the initial sentence Sue was alarmed by John's hitting of Mary. Given this evidence, you can conclude that the proposed constituent appears to behave just like other NPs. ■ NP-NP: 1 ate a hamburger and a hot dog. VP-VP: I will eat the hamburger and throw away the hot dog. S-S: I ate a hamburger and John ate a hot dog. PP-PP: I saw a hot dog in the bag and on the stove. ADJP-ADJP: I ate a cold and well burned hot dog. AD VP-AD VP: I ate the hot dog slowly and very carefully. N-N: I ate a hamburger and hot dog. V-V: I will cook and burn a hamburger. AUX-AUX: I can and will eat the hot dog. ADJ-ADJ: I ate the very cold and burned hot dog (that is, very cold and very burned). Figure 3.3 Various forms of conjunctions As another example of applying these principles, consider the two sentences / looked up John's phone number and / looked up John's chimney. Should these sentences have the identical structure? If so, you would presumably analyze both as subject-verb-complement sentences with the complement in both cases being a PP. That is, up John's phone number would be a PP. When you try the conjunction test, you should become suspicious of this analysis. Conjoining up John's phone number with another PP, as in */ looked up John's phone number and in his cupboards, is certainly bizarre. Note that / looked up John's chimney and in his cupboards is perfectly acceptable. Thus apparently the analysis of up John's phone number as a PP is incorrect. Further evidence against the PP analysis is that up John's phone number does not seem usable as a PP in any sentences other than ones involving a few verbs such as look or thought. Even with the verb look, an alternative sentence such as *Up John's phone number, I looked is quite implausible compared to Up John's chimney, I looked. This type of test can be taken further by considering changing the PP in a manner that usually is allowed. In particular, you should be able to replace the NP John's phone number by the pronoun it. But the resulting sentence, / looked up it, could not be used with the same meaning as / looked up John's phone number. In fact, the only way to use a pronoun and retain the original meaning is to use / looked it up, corresponding to the form / looked John's phone number up. Thus a different analysis is needed for each of the two sentences. If up John's phone number is not a PP, then two remaining analyses may be possible. The VP could be the complex verb looked up followed by an NP, or it could consist of three components: the V looked, a particle up, and an NP. Either of these is a better solution. What types of tests might you do to decide between them? As you develop a grammar, each constituent is used in more and more different ways. As a result, you have a growing number of tests that can be performed to see if a new analysis is reasonable or not. Sometimes the analysis of a 46 CHAPTER 3 Grammars and Parsing 47 BOX 3.1 Generative Capacity Grammatical formalisms based on rewrite rules can be compared according to their generative capacity, which is the range of languages that each formalism can describe. 
This book is concerned with natural languages, but it turns out that no natural language can be characterized precisely enough to define generative capacity. Formal languages, however, allow a precise mathematical characterization. Consider a formal language consisting of the symbols a, b, c, and d (think of these as words). Then consider a language L1 that allows any sequence of letters in alphabetical order. For example, abd, ad, bcd, b, and abcd are all legal sentences. To describe this language, we can write a grammar in which the right-hand side of every rule consists of one terminal symbol possibly followed by one nonterminal. Such a grammar is called a regular grammar. For L1 the grammar would be

S -> a S1     S1 -> b S2     S2 -> c S3     S3 -> d
S -> b S2     S1 -> c S3     S2 -> d
S -> c S3     S1 -> d
S -> d

Consider another language, L2, that consists only of sentences that have a sequence of a's followed by an equal number of b's—that is, ab, aabb, aaabbb, and so on. You cannot write a regular grammar that can generate L2 exactly. A context-free grammar to generate L2, however, is simple:

S -> a b
S -> a S b

Some languages cannot be generated by a CFG. One example is the language that consists of a sequence of a's, followed by the same number of b's, followed by the same number of c's—that is, abc, aabbcc, aaabbbccc, and so on. Similarly, no context-free grammar can generate the language that consists of any sequence of letters repeated in the same order twice, such as abab, abcabc, acdabacdab, and so on. There are more general grammatical systems that can generate such sequences, however. One important class is the context-sensitive grammar, which consists of rules of the form

αAβ -> αψβ

where A is a symbol, α and β are (possibly empty) sequences of symbols, and ψ is a nonempty sequence of symbols. Even more general are the type 0 grammars, which allow arbitrary rewrite rules. Work in formal language theory began with Chomsky (1956). Since the languages generated by regular grammars are a subset of those generated by context-free grammars, which in turn are a subset of those generated by context-sensitive grammars, which in turn are a subset of those generated by type 0 grammars, they form a hierarchy of languages (called the Chomsky Hierarchy).

new form might force you to back up and modify the existing grammar. This backward step is unavoidable given the current state of linguistic knowledge. The important point to remember, though, is that when a new rule is proposed for a grammar, you must carefully consider its interaction with existing rules.

1. S -> NP VP          4. VP -> V
2. NP -> ART N         5. VP -> V NP
3. NP -> ART ADJ N

Grammar 3.4

3.3 A Top-Down Parser

A parsing algorithm can be described as a procedure that searches through various ways of combining grammatical rules to find a combination that generates a tree that could be the structure of the input sentence. To keep this initial formulation simple, we will not explicitly construct the tree. Rather, the algorithm will simply return a yes or no answer as to whether such a tree could be built. In other words, the algorithm will say whether a certain sentence is accepted by the grammar or not. This section considers a simple top-down parsing method in some detail and then relates this to work in artificial intelligence (AI) on search procedures. A top-down parser starts with the S symbol and attempts to rewrite it into a sequence of terminal symbols that matches the classes of the words in the input sentence.
The state of the parse at any given time can be represented as a list of symbols that are the result of operations applied so far, called the symbol list. For example, the parser starts in the state (S) and after applying the rule S -> NP VP the symbol list will be (NP VP). If it then applies the rule NP -» ART N, the symbol list will be (ART N VP), and so on. The parser could continue in this fashion until the state consisted entirely of terminal symbols, and then it could check the input sentence to see if it matched. But this would be quite wasteful, for a mistake made early on (say, in choosing the rule that rewrites S) is not discovered until much later. A better algorithm checks the input as soon as it can. In addition, rather than having a separate rule to indicate the possible syntactic categories for each word, a structure called the lexicon is used to efficiently store the possible categories for each word. For now the lexicon will be very simple. A very small lexicon for use in the examples is cried: V dogs: N, V the: ART With a lexicon specified, a grammar, such as that shown as Grammar 3.4, need not contain any lexical rules. Given these changes, a state of the parse is now defined by a pair: a symbol list similar to before and a number indicating the current position in the sentence. Positions fall between the Words, with 1 being the position before the first word. For example, here is a sentence with its positions indicated: 48 CHAPTER 3 Grammars and Parsing 49 1 The 2 dogs 3 cried 4 A typical parse state would be ((N VP) 2) indicating that the parser needs to find an N followed by a VP, starting at position two. New states are generated from old states depending on whether the first symbol is a lexical symbol or not. If it is a lexical symbol, like N in the preceding example, and if the next word can belong to that lexical category, then you can update the state by removing the first symbol and updating the position counter. In this case, since the word dogs is listed as an N in the lexicon, the next parser state would be ((VP) 3) which means it needs to find a VP starting at position 3. If the first symbol is a nonterminal, like VP, then it is rewritten using a rule from the grammar. For example, using rule 4 in Grammar 3.4, the new state would be ((V) 3) which means it needs to find a V starting at position 3. On the other hand, using rule 5, the new state would be ((VNP)3) A parsing algorithm that is guaranteed to find a parse if there is one must systematically explore every possible new state. One simple technique for this is called backtracking. Using this approach, rather than generating a single new state from the state ((VP) 3), you generate all possible new states. One of these is picked to be the next state and the rest are saved as backup states. If you ever reach a situation where the current state cannot lead to a solution, you simply pick a new current state from the list of backup states. Here is the algoridim in a little more detail. A Simple Top-Down Parsing Algorithm The algorithm manipulates a list of possible states, called the possibilities list. The first element of this list is the current state, which consists of a symbol lis! and a word position in the sentence, and the remaining elements of the search state are the backup states, each indicating an alternate symbol-list-word-position pair. 
For example, the possibilities list (((N)2) ((NAME) 1) ((ADJN)l)) indicates that die current state consists of the symbol list (N) at position 2, and that there are two possible backup states: one consisting of the symbol list (NAME) at position 1 and the other consisting of the symbol list (ADJ N) at position 1. Step Current State Backup States Comment 1. ((S) 1) initial position 2 ((NPVP)1) rewriting S by rule 1 3. ((ARTNVP)1) rewriting NP by rules 2 & 3 ((ART ADJ N VP) 1) 4. «NVP)2) matching ART with the ((ART ADJ N VP) 1) 5. ((VP) 3) matching N with dogs ((ART ADJ N VP) 1) 6. ((V) 3) rewriting VP by rules 5-8 ((VNP)3) ((ART ADJ N VP) 1) 7. the parse succeeds as V is matched to cried, leaving an empty grammatical symbol list with an empty sentence Figure 3.5 Top-down depth-first parse of \ The 2 dogs 3 cried 4 The algorithm starts with the initial state ((S) 1) and no backup states. 1. Select the current state: Take the first state off the possibilities list and call it C. If the possibilities list is empty, then the algorithm fails (that is, no successful parse is possible). 2. If C consists of an empty symbol list and the word position is at the end of the sentence, then the algorithm succeeds. 3. Otherwise, generate the next possible states. 3.1. If the first symbol on the symbol list of C is a lexical symbol, and the next word in the sentence can be in that class, then create a new state by removing the first symbol from the symbol list and updating the word position, and add it to the possibilities list. 3.2. Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a new state for each rule in the grammar that can rewrite that nonterminal symbol and add them all to the possibilities list. Consider an example. Using Grammar 3.4, Figure 3.5 shows a trace of the algorithm on the sentence The dogs cried. First, the initial S symbol is rewritten using rule 1 to produce a new current state of ((NP VP) 1) in step 2. The NP is then rewritten in turn, but since there are two possible rules for NP in the grammar, two possible states are generated: The new current state involves (ART N VP) at position 1, whereas the backup state involves (ART ADJ N VP) at position I. In step 4 a word in category ART is found at position 1 of the 50 CHAPTER 3 Grammars and Parsing 51 sentence, and the new current state becomes (N VP). The backup state generated in step 3 remains untouched. The parse continues in this fashion to step 5, where two different rules can rewrite VP. The first rule generates the new current state, while the other rule is pushed onto the stack of backup states. The parse completes successfully in step 7, since the current state is empty and all the words in the input sentence have been accounted for. Consider the same algorithm and grammar operating on the sentence 1 The 2 old 3 man 4 cried 5 In this case assume that the word old is ambiguous between an ADJ and an N and that the word maw is ambiguous.between an N and a V (as in the sentence The sailors man the boats'). Specifically, the lexicon is the: ART old: ADJ, N man:. N, V cried: V . The parse proceeds as follows (see Figure 3.6). The initial S symbol is rewritten by rule 1 to produce the new current state of ((NP VP) 1). The NP is rewritten in turn, giving the new state of ((ART N VP) 1) with a backup state of ((ART ADJ N VP) 1). The parse continues, finding the as an ART to produce the state ((N VP) 2) and then old as an N to obtain the state ((VP) 3). 
There are now two ways to rewrite the VP, giving us a current state of ((V) 3) and the backup states of ((V NP) 3) and ((ART ADJ N) 1) from before. The word man can be parsed as a V, giving the state (() 4). Unfortunately, while the symbol list is empty, the word position is not at the end of the sentence, so no new state can be generated and a backup state must be used. In the next cycle, step 8, ((V NP) 3) is attempted. Again man is taken as a V and the new state ((NP) 4) generated. None of the rewrites of NP yield a successful parse. Finally, in step 12, the last backup state, ((ART ADJ N VP) 1), is tried and leads to a successful parse. Parsing as a Search Procedure You can think of parsing as a special case of a search problem as defined in AI. In particular, the top-down parser in this section was described in terms of the following generalized search procedure. The possibilities list is initially set to the start state of the parse. Then you repeat the following steps until you have success or failure: 1. Select the first state from the possibilities list (and remove it from the list). 2. Generate the new states by trying every possible option from the selected state (there may be none if we are on a bad path). 3. Add the states generated in step 2 to the possibilities list. Step Current State 1. ((S) 1) 2. ((NPVP)1) 3. ((ARTN-VP) 1), 4. ((NVP)2) 5. ((VP) 3) 6. ((V)3) 7. (04) 8. ((VNP)3) 9. ((NP) 4) 10. ((ART N) 4) 11. ((ART ADJ N) 4) 12. ((ART ADJ N VP) 1) 13. ((ADJ N VP) 2) 14. ((NVP)3) 15. ((VP) 4) 16. ((V)4) 17. (05) Backup States ((ART ADJ N VP) 1) ((ART ADJ N VP) 1) ((ART ADJ N VP) 1) ((VNP)3) ((ART ADJ N VP) 1) ((VNP)3) ((ART ADJ N VP) 1) ((ART ADJ N VP) 1) ((ART ADJ N VP) 1) ((ART ADJ N) 4) ((ART ADJ N VP) 1) ((ART ADJ N VP) 1) ((VNP)4) Comment S rewritten to NP VP NP rewritten producing two new states the backup state remains the first backup is chosen looking for ART at 4 fails fails again now exploring backup state saved in step 3 Figure 3.6 A top-down parse ot \ The 2 old 3 man 4 cried 5 For a depth-first strategy, the possibilities list is a stack. In other words, step 1 always takes the first element off the list, and step 3 always puts the new states on the front of the list, yielding a last-in first-out (LIFO) strategy. Tn contrast, in a breadth-first strategy the possibilities list is manipulated as a queue. Step 3 adds the new positions onto the end of the list, rather than the beginning, yielding a first-in first-out (FIFO) strategy. 52 CHAPTER 3 Grammars and Parsing 53 1 ((S) 1) i 2 ((NPVP)l) 2 3 ((ART N VP) 1) 3 4 ((N VP) 2) 5 5 ((VP) 3) 7 6 ((V)3) 9 8 ((VNP)3) 10 12 ((ART ADJ N VP) 1) 4 13 ((ADJ N VP) 2) 6 14 ((N VP) 3) 8 15 ((VP) 4) 11 7 ((()4) 12 9 ((NP)4) 13 16 ((V)4) 14 ((VNP)4) 15 10 ((ART N) 4) 16 11 ((ART ADJ N) 4) 17 17 ((() 5) 18 Figure 3.7 Search tree for two parse strategies (depth-first strategy on left; breadth-first on right) We can compare these search strategies using a tree format, as in Figure 3.7. which shows the entire space of parser slates for the last example. Each node in the tree represents a parser state, and the sons of a node are the possible moves from that state. The number beside each node records when the node was selected to be processed by the algorithm. On the left side is the order produced by the depth-first strategy, and on the right side is the order produced by the breadth-first strategy. 
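To make the search procedure concrete, here is a minimal Python sketch of the depth-first version of this algorithm, using Grammar 3.4 and a small lexicon like the one above. The encoding (dictionaries for the grammar and lexicon, 0-indexed word positions, and the function name parse) is illustrative rather than anything defined in the text, and the order in which backup states are explored may differ slightly from the traces in Figures 3.5 and 3.6.

GRAMMAR = {                              # Grammar 3.4
    "S":  [["NP", "VP"]],
    "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICAL = {"ART", "ADJ", "N", "V"}       # the lexical symbols
LEXICON = {"the": ["ART"], "old": ["ADJ", "N"],
           "man": ["N", "V"], "dogs": ["N", "V"], "cried": ["V"]}

def parse(words):
    """Return True if the word sequence is accepted by the grammar."""
    possibilities = [(("S",), 0)]         # initial state: symbol list (S) at the first position
    while possibilities:
        symbols, pos = possibilities.pop()       # step 1: select the current state
        if not symbols:                           # step 2: empty symbol list at end of sentence
            if pos == len(words):
                return True
            continue
        first, rest = symbols[0], symbols[1:]
        if first in LEXICAL:                      # step 3.1: match a lexical symbol to the next word
            if pos < len(words) and first in LEXICON[words[pos]]:
                possibilities.append((rest, pos + 1))
        else:                                     # step 3.2: rewrite the nonterminal with every rule
            for rhs in GRAMMAR[first]:
                possibilities.append((tuple(rhs) + rest, pos))
    return False                                  # possibilities list exhausted: no parse

print(parse("the dogs cried".split()))            # True
print(parse("the old man cried".split()))         # True

Because the possibilities list is used here as a stack, this sketch implements the last-in first-out strategy; replacing the stack with a queue would give the breadth-first behavior discussed next.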
Remember, the sentence being parsed is

1 The 2 old 3 man 4 cried 5

The main difference between depth-first and breadth-first searches in this simple example is the order in which the two possible interpretations of the first NP are examined. With the depth-first strategy, one interpretation is considered and expanded until it fails; only then is the second one considered. With the breadth-first strategy, both interpretations are considered alternately, each being expanded one step at a time. In this example, both depth-first and breadth-first searches found the solution but searched the space in a different order. A depth-first search often moves quickly to a solution but in other cases may spend considerable time pursuing futile paths. The breadth-first strategy explores each possible solution to a certain depth before moving on. In this particular example the depth-first strategy found the solution in one less step than the breadth-first. (The state in the bottom right-hand side of Figure 3.7 was not explored by the depth-first parse.)

In certain cases it is possible to put these simple search strategies into an infinite loop. For example, consider a left-recursive rule that could be a first account of the possessive in English (as in the NP the man's coat):

NP -> NP 's N

With a naive depth-first strategy, a state starting with the nonterminal NP would be rewritten to a new state beginning with NP 's N. But this state also begins with an NP that could be rewritten in the same way. Unless an explicit check were incorporated into the parser, it would rewrite NPs forever! The breadth-first strategy does better with left-recursive rules, as it tries all other ways to rewrite the original NP before coming to the newly generated state with the new NP. But with an ungrammatical sentence it would not terminate because it would rewrite the NP forever while searching for a solution. For this reason, many systems prohibit left-recursive rules from the grammar. Many parsers built today use the depth-first strategy because it tends to minimize the number of backup states needed and thus uses less memory and requires less bookkeeping.

3.4 A Bottom-Up Chart Parser

The main difference between top-down and bottom-up parsers is the way the grammar rules are used. For example, consider the rule

NP -> ART ADJ N

In a top-down system you use the rule to find an NP by looking for the sequence ART ADJ N. In a bottom-up parser you use the rule to take a sequence ART ADJ N that you have found and identify it as an NP. The basic operation in bottom-up parsing then is to take a sequence of symbols and match it to the right-hand side of the rules. You could build a bottom-up parser simply by formulating this matching process as a search process. The state would simply consist of a symbol list, starting with the words in the sentence. Successor states could be generated by exploring all possible ways to

• rewrite a word by its possible lexical categories
• replace a sequence of symbols that matches the right-hand side of a grammar rule by its left-hand side symbol

1. S -> NP VP          4. NP -> ADJ N
2. NP -> ART ADJ N     5. VP -> AUX VP
3. NP -> ART N         6. VP -> V NP

Grammar 3.8 A simple context-free grammar

Unfortunately, such a simple implementation would be prohibitively expensive, as the parser would tend to try the same matches again and again, thus duplicating much of its work unnecessarily.
To avoid this problem, a data structure called a chart is introduced that allows the parser to store the partial results of the matching it has done so far so that the work need not be reduplicated. Matches are always considered from the point of view of one constituent, called the key. To find rules that match a string involving the key, look for rules that start with the key, or for rules that have already been started by earlier keys and require the present key either to complete the rule or to extend the rule. For instance, consider Grammar 3.8. Assume you are parsing a sentence that starts with an ART. With this ART as the key, rules 2 and 3 are matched because they start with ART. To record this for analyzing the next key, you need to record that rules 2 and 3 could be continued at the point after the ART. You denote this fact by writing the rule with a dot (o), indicating what has been seen so far. Thus you record 2'. y. NP NP ART ° ADJ N ART ° N If the next input key is an ADJ, then rule 4 may be started, and the modified rule 2'may be extended to give 2". NP ART ADJ ° N The chart maintains the record of all the constituents derived from the sentence so far in the parse. It also maintains the record of rules that have matched partially but are not complete. These are called the active arcs. For example, after seeing an initial ART followed by an ADJ in the preceding example, you would have the chart shown in Figure 3.9. You should interpret this figure as follows. There are two completed constituents on the chart: ART1 from position 1 to 2 and ADJ1 from position 2 to 3. There arc four active arcs indicating possible constituents. These are indicated by the arrows and are interpreted as follows (from top to bottom). There is a potential NP starting at position 1. which needs an ADJ starling at position 2. There is another potential NP stalling at position 1. which needs an N starting at position 2. There is a potential NP ART1 ADJ1 1. 2 NP -> ART ° ADJ N ^ NP —> ART° N ^ NP -> ADJ ° N NP -> ART ADJ o N Figure 3.9 The chart after seeing an ADJ in position 2 To add a constituent C from position pi to P2: 1. Insert C into the chart from position to P2. 2. For any active arc of the form X -> X, ... ° C ... Xn from position p0 to pl, add a new active arc X —> X i ... C 0 ... Xn from position pg to P2. 3. For any active arc of the form X —> X i ...Xn°C from position p0 to pi, then add a new constituent of type X from p0 to P2 to the agenda. Figure 3.10 The arc extension algorithm starting at position 2 with an ADJ, which needs an N starting at position 3. Finally, there is a potential NP starting at position 1 with an ART and then an ADJ, which needs an N starling at position 3. The basic operation of a chart-based parser involves combining an active arc with a completed constituent. The result is either a new completed constituent or a new active arc that is an extension of the original active arc. New completed constituents are maintained on a list called the agenda until they themselves are added to the chart. This process is defined more precisely by the arc extension algorithm shown in Figure 3.10. Given this algorithm, the bottom-up chart parsing algorithm is specified in Figure 3.11. As with the top-down parsers, you may use a depth-first or breadth-first search strategy, depending on whether the agenda is implemented as a stack or a queue. 
Also, for a full breadth-first strategy, you would need to read in the entire input and add the interpretations of the words onto the agenda before starting the algorithm. Let us assume a depth-first search strategy for the following example. Consider using the algorithm on the sentence The large can can hold the water using Grammar 3.8 with the following lexicon: 56 CHAPTER 3 Grammars and Parsing 57 Do until there is no input left: 1. If the agenda is empty, look up the interpretations for the next word in the input and add them to the agenda. 2. Select a constituent from the agenda (let's call it constituent C from position p, top,). 3. For each rule in the grammar of form X —> C Xy ... Xn, add an active arc of form X -> 0 C Xj ... Xn from position p, to p2. 4. Add C to the chart using the arc extension algorithm above. Figure 3.11 A bottom-up chart parsing algorithm the: large: can: hold: water: ART ADJ N, AUX, V N,V N, V To best understand the example, draw the chart as it is extended at each step of the algorithm. The agenda is initially empty, so the word the is read and a constituent ART1 placed on the agenda. Entering ART1: (the from 1 to 2) Adds active arc NP -> ART ° ADJ N from 1 Adds active arc NP -s> ART 0 N from 1 to 2 to 2 Both these active arcs were added by step 3 of the parsing algorithm and were derived from rules 2 and 3 in the grammar, respectively. Next the word large is read and a constituent ADJ1 is created. Entering ADJ1: (large from 2 to 3) Adds arc NP -> ADJ ° N from 2 to 3 Adds arc NP -> ART ADJ ° N from 1 to 3 The first arc was added in step 3 of the algorithm. The second arc added here is an extension of the first active arc that was added when ART1 was added to the chart using the arc extension algorithm (step 4). The chart at this point has already been shown in Figure 3.9. Notice that active arcs are never removed from the chart. For example, when the arc NP —> ART ° ADJ N from 1 to 2 was extended, producing the arc from 1 to 3, both arcs remained on the chart. This is necessary because the arcs could be used again in a different way by another interpretation, For the next word, can, three constituents, Nl, AUX1, and VI are created for its three interpretations. NP2 (rule 4) NP1 (rule 2) ART1 ADJ1 Nl AUX1 VI 1 the 2 large NP -» ART o ADJ 3 can NP -> ART °N NP H> ADJ ° N NP -» ART ADJ ° N NP ° VP S _> NP o VP VP^ AUX°VP VP -> V° NP Figure 3.12 After parsing the large can Entering Nl: (can from 3 to 4) No active arcs are added in step 2, but two are completed in step 4 by the arc extension algorithm, producing two NPs that are added to the agenda: The first, an NP from 1 to 4, is constructed from rule 2, while the second, an NP from 2 to 4, is constructed from rule 4. These NPs are now at the top of the agenda. Entering NP1: an NP (the large can from 1 to 4) Adding active arc S NP ° VP from 1 to 4 Entering NP2: an NP (large can from 2 to 4) Adding arc S -> NP ° VP from 2 to 4 Entering AUX1: (can from 3 to 4) Adding arc VP -> AUX ° VP from 3 to 4 Entering VI: (can from 3 to 4) Adding arc VP -> V°NP from 3 to 4 The chart is shown in Figure 3.12, which illustrates all the completed constituents (NP2, NPI. ART1, ADJ1. Nl. AUX I. VI) and all the uncompleted active arcs entered so far. The next word is can again, and N2, AUX2, and V2 are created. 58 CHAPTER 3 Grammars and Parsing 59 NP2(rule4) NPl (rale 2) Nl N2 VI V2 V3 ART1 ADJ1 AUX1 AUX2 N3 1 the 2 large 3 can 4 can _S-> NP°VP_^ S —> NP o VP ^ 5 hold VP -> AUX° VP.. VP-> V°NP>. VP —» AUX o VF VP-S. 
V°NP^ VP-> V°NPN Figure 3.13 The chart after adding hold, omitting arcs generated for the first NP Entering N2: (can from 4 to 5, the second can) Adds no active arcs Entering AUX2: (can from 4 to 5) Adds arc VP -» AUX ° VP from 4 to 5 Entering V2: (can from 4 to 5) Adds arc VP -> V ° NP from 4 to 5 The next word is hold, and N3 and V3 are created. Entering N3: (hold from 5 to 6) Adds no active arcs Entering V3: (hold from 5 to 6) Adds arc VP -> V ° NP from 5 to 6 The chart in Figure 3.13 shows all the completed constituents built so far, together with all the active arcs, except for those used in the first NP. Entering ART2: (the from 6 to 7) Adding arc NP -» ART ° ADJ N from 6 to 7 Adding arc NP -> ART ° N from 6 to 7 NP2(rule4) NPl (rale 2) Nl N2 MM mile 3 > VI V2 V3 V4 ART1 ADJ1 ATJX1 AUX2 N3 ART2 N4 1 the 2 large 3 can _S -> NP °VP S -> NP ° VP 5 hold 6 the 7 water 8 VP -> AUX jWP VP -» AUX ^VP Figure 3.14 The chart after all the NPs are found, omitting all but the crucial active arcs Entering N4: (water from 7 to 8) No active arcs added in step 3 An NP, NP3, from 6 to 8 is pushed onto the agenda, by completing arc NP -> ART ° N from 6 to 7 Entering NP3: (the water from 6 to 8) A VP, VP1, from 5 to 8 is pushed onto the agenda, by completing VP -» V ° NP from 5 to 6 Adds arc S -* NP0 VP from 6 to 8 The chart at this stage is shown in Figure 3.14, but only the active arcs to be used in the remainder of the parse are shown. Entering VP1: (hold the water from 5 to 8) A VP, VP2, from 4 to 8 is pushed onto the agenda, by completing VP -» AUX ° VP from 4 to 5 Entering VP2: (can hold the water from 4 to 8) An S, SI, is added from 1 to 8, by completing arc S -» NP ° VP from 1 to 4 A VP, VP3, is added from 3 to 8, by completing arc VP -» AUX ° VP from 3 to 4 An S, S2, is added from 2 to 8, by completing arc S -> NP ° VP from 2 to 4 Since you have derived an S covering the entire sentence, you can stop successfully. If you wanted to find all possible interpretations for the sentence, 60 CHAPTER 3 Grammars and Parsing 61 S1 (rule 1 with NP1 and VP2) S2 (rule 1 with NP2 and VP2) VP3 (rule 5 with AUX1 and VP2) NP2 (rule 4) VP2 (rule 5) NP1 (rule 2) VPl(rule6) Nl N2 NP3 (rule 3) VI V2 V3 V4 ART1 ADJ1 AUX1 AUX2 N3 ART2 N4 1 the 2 large 3 can 4 can 5 hold 6 the 7 water Figure 3.15 The final chart you would continue parsing until the agenda became empty. The chart would then contain as many S structures covering the entire set of positions as there were different structural interpretations. In addition, this representation of the entire set of structures would be more efficient than a list of interpretations, because the different S structures might share common subparts represented in the chart only once. Figure 3.15 shows the final chart. Efficiency Considerations Chart-based parsers can be considerably more efficient than parsers that rely only on a search because the same constituent is never constructed more than once. For instance, a pure top-down or bottom-up search strategy could require up to Cn operations to parse a sentence of length n, where C is a constant that depends on the specific algorithm you use. Even if C is very small, this exponential complexity rapidly makes the algorithm unusable. A chart-based parser, on the other hand, in the worst case would build every possible constituent between every possible pair of positions. This allows us to show that it has a worst-case complexity of K*n3, where n is the length of the sentence and K is a constant depending on the algorithm. 
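To make the chart machinery concrete, the following is a minimal Python sketch of the bottom-up chart parser of Figures 3.10 and 3.11, run on the example sentence. The data layout (tuples for constituents and active arcs, a list used as a stack for the agenda, 0-indexed positions) is an illustrative choice for this sketch, not the book's notation.

GRAMMAR = [                                  # Grammar 3.8
    ("S",  ["NP", "VP"]),
    ("NP", ["ART", "ADJ", "N"]),
    ("NP", ["ART", "N"]),
    ("NP", ["ADJ", "N"]),
    ("VP", ["AUX", "VP"]),
    ("VP", ["V", "NP"]),
]
LEXICON = {"the": ["ART"], "large": ["ADJ"], "can": ["N", "AUX", "V"],
           "hold": ["N", "V"], "water": ["N", "V"]}

def parse(words):
    chart = []       # completed constituents: (category, start, end)
    arcs = []        # active arcs: (lhs, rhs, dot, start, end)
    agenda = []      # completed constituents not yet added to the chart

    def add_constituent(cat, p1, p2):        # the arc extension algorithm (Figure 3.10)
        chart.append((cat, p1, p2))
        for (lhs, rhs, dot, a1, a2) in list(arcs):
            if a2 == p1 and rhs[dot] == cat:  # this arc needs the new constituent next
                if dot + 1 == len(rhs):       # the arc is completed
                    agenda.append((lhs, a1, p2))
                else:                         # the arc is extended
                    arcs.append((lhs, rhs, dot + 1, a1, p2))

    for i, word in enumerate(words):
        for cat in LEXICON[word]:            # step 1: add the word's interpretations
            agenda.append((cat, i, i + 1))
        while agenda:
            cat, p1, p2 = agenda.pop()       # step 2: select a constituent (stack = depth-first)
            for lhs, rhs in GRAMMAR:         # step 3: start new arcs beginning with it
                if rhs[0] == cat:
                    if len(rhs) == 1:
                        agenda.append((lhs, p1, p2))
                    else:
                        arcs.append((lhs, rhs, 1, p1, p2))
            add_constituent(cat, p1, p2)     # step 4: add it and extend existing arcs
    return [c for c in chart if c == ("S", 0, len(words))]

print(parse("the large can can hold the water".split()))   # [('S', 0, 7)]

A fuller implementation would also check for duplicate arcs and constituents before adding them. The top-down variant described in Section 3.6 uses exactly the same arc extension step; only the way new active arcs are introduced from the grammar changes.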
Of course, a chart parser involves more work in each step, so K will be larger than C. To contrast the two approaches, assume that C is 10 and that K is a hundred times worse, 1000. Given a sentence of 12 words, the brute force search might take 1012 operations (that is, 1,000,000,000,000), whereas the chart parser would take 1000 * 123 (that is. 1.728,000). Under these assumptions, the chart parser would be up to 500,000 times faster than the brute force search on some examples! i 3.5 Transition Network Grammars 50 far we have examined only one formalism for representing grammars, namely context-free rewrite rules. Here we consider another formalism that is useful in a wide range of applications. It is based on the notion of a transition network consisting of nodes, and labeled arcs. One of the nodes is specified as the initial state, or start state. Consider the network named NP in Grammar 3.16, with the initial state labeled NP and each arc labeled with a word category. Starting at the initial state, you can traverse an arc if the current word in the sentence is in the category on the arc. If the arc is followed, the current word is updated to the next word. A phrase is a legal NP if there is a path from the node NP to a pop arc (an arc labeled pop) that accounts for every word in the phrase. This network recog -nizes the same set of sentences as the following context-free grammar: NP -> ART NP1 NP1 -» ADINP1 NP1 -> N Consider parsing the NP a purple cow with this network. Starting at the node NP, you can follow the arc labeled art, since the current word is an article— namely, a. From node NP1 you can follow the arc labeled adj using the adjective purple, and finally, again from NP1, you can follow the arc labeled noun using the noun cow. Since you have reached a pop arc, a purple cow is a legal NP. Simple transition networks are often called finite state machines (FSMs). Finite state machines are equivalent in expressive power to regular grammars (see Box 3.2), and thus are not powerful enough to describe all languages that can be described by a CFG. To get the descriptive power of CFGs, you need a notion of recursion in the network grammar. A recursive transition network (RTN) is like a simple transition network, except that it allows arc labels to refer to other networks as well as word categories. Thus, given the NP network in Grammar 3.16, a network for simple English sentences can be expressed as shown in Grammar 3.17. Uppercase labels refer to networks. The arc from S to 51 can be followed only if the NP network can be successfully traversed to a pop arc. Although not shown in this example, RTNs allow true recursion—that is, a network might have an arc labeled with its own name. Consider finding a path through the S network for the sentence The purple cow ate the grass. Starting at node S, to follow the arc labeled NP, you need to traverse the NP network. Starting at node NP, traverse the network as before for the input the purple cow. Following the pop arc in the NP network, return to the S network and traverse the arc to node SI. From node SI you follow the arc labeled verb using the word ate. Finally, the arc labeled NP can be followed if ' you can traverse the NP network again. This time the remaining input consists of the words the grass. You follow the arc labeled art and then the arc labeled noun in the NP network; then take the pop arc from node NP2 and then another pop from node S3. 
Since you have traversed the network and used all the words in the sentence, The purple cow ate the grass is accepted as a legal sentence. 62 CHAPTER 3 Grammars and Parsing 63 Grammar 3.17 Arc Type Example How Used CAT noun succeeds only if current word is of the named category WRD of succeeds only if current word is identical to the label PUSH NP succeeds only if named network can be successfully traversed JUMP jump always succeeds POP pop succeeds and signals the successful end of the network Figure 3.18 The arc labels tor RTNs In practice, RTN systems incorporate some additional arc types that are useful but not formally necessary. Figure 3.18 summarizes the arc types, together with the notation that will be used in this book to indicate these arc types. According to this terminology, arcs that are labeled with networks are called push arcs, and arcs labeled with word categories are called cat arcs. In addition, an arc that can always be followed is called a jump arc. Top-Down Parsing with Recursive Transition Networks An algorithm for parsing with RTNs can be developed along the same lines as the algorithms for parsing CFGs. The state of the parse at any moment can be represented by the following: current position—a pointer to the next word to be parsed, current node—the node at which you are located in the network, return points—a stack of nodes in other networks where you will continue if you pop from the current network. Grammar 3.19 First, consider an algorithm for searching an RTN that assumes that if you can follow an arc, it will be the correct one in the final parse. Say you are in the middle of a parse and know the three pieces of information just cited. You can leave the current node and traverse an arc in the following cases: Case 1: If arc names word category and next word in sentence is in that category, Then (1) update current position to start at the next word; (2) update current node to the destination of the arc. Case 2: If arc is a push arc to a network N, Then (1) add the destination of the arc onto return points; (2) update current node to the starting node in network N. Case 3: If arc is a pop arc and return points list is not empty, Then (1) remove first return point and make it current node. Case 4: If arc is a pop arc, return points list is empty and there are no words left, Then (1) parse completes successfully. Grammar 3.19 shows a network grammar. The numbers on the arcs simply indicate the order in which arcs will be tried when more than one arc leaves a node. Grammars and Parsing 65 Current Current Return Arc to be Step Node Position Points Followed Comments 1. (S, 1, NIL) S/l initial position 2. (NP, 1, (SI)) NP/1 followed push arc to NP net- work, to return ultimately to SI 3. (NP1, 2, (SD) NP1/1 followed arc NP/1 (the) 4. (NP1, 3, (SD) NP1/2 followed arc NP1/1 (old) 5. (NP2, 4, (SD) NP2/2 followed arc NP1/2 (man) since NP1/1 is not applicable 6. (SI, 4, NIL) Sl/1 the pop arc gets us back to SI 7. (S2, 5, NIL) S2/1 followed arc S2/1 (cried) 8. parse succeeds on pop arc from S2 Figure 3,20 A trace of a top-down parse Figure 3.20 demonstrates that the grammar accepts the sentence 1 The 2 old 3 man 4 cried 5 by showing the sequence of parse states that can be generated by the algorithm. In the trace, each arc is identified by the name of the node that it leaves plus the number identifier. Thus arc S/l is the arc labeled 1 leaving the S node. If you start at node S, the only possible arc to follow is the push arc NP. 
As specified in case 2 of the algorithm, the new parse state is computed by setting the current node to NP and putting node SI on the return points list. From node NP, arc NP/1 is followed and, as specified in case 1 of the algorithm, the input is checked for a word in category art. Since this check succeeds, the arc is followed and the current position is updated (step 3). The parse continues in this manner to step 5, when a pop arc is followed, causing the current node to be reset to SI (that is, the NP arc succeeded). The parse succeeds after finding a verb in step 6 and following the pop arc from the S network in step 7. In this example the parse succeeded because the first arc that succeeded was ultimately the correct one in every case. However, with a sentence like The green faded, where green can be an adjective or a noun, this algorithm would fail because it would initially classify green as an adjective and then not find a noun following. To be able to recover from such failures, we save all possible backup states as we go along, just as we did with the CFG top-down parsing algorithm. Consider this technique in operation on the following sentence: 1 One 2 saw 3 the 4 man 5 The parser initially attempts to parse the sentence as beginning with the NP one saw, but after failing to find a verb, it backtracks and finds a successful parse starting with the NP one. The trace of the parse is shown in Figure 3.21, where at Step Current State Arc to be Followed Backup States 1. (S, 1, NIL) S/l NIL 2_ (NP, 1, (SI)) NP/2 (& NP/3 for backup) NIL 3. (NP1,2, (SI)) NP1/2 (NP2, 2, (SI)) 4. (NP2, 3, (SI)) NP2/1 (NP2, 2, (SI)) 5. (SI, 3, NIL) no arc can be followed (NP2, 2, (SI)) 6. (NP2, 2, (SI)) NP2/1 NIL 7. (SI, 2, NIL) Sl/1 NIL 8. (S2, 3, NIL) S2/2 NIL 9. (NP, 3, (S2)) NP/1 NIL 10. (NP1,4, (S2)) NP1/2 NIL 11. (NP2,5, (S2)) NP2/1 ML 12. (S2, 5, NIL) S2/1 NIL. 13. parse succeeds NIL Figure 3.21 A top-down RTN parse with backtracking each stage the current parse state is shown in the form of a triple (current node, current position, return points), together with possible states for backtracking. The figure also shows the arcs used to generate the new state and backup states. This trace behaves identically to the previous example except in two places. In step 2, two arcs leaving node NP could accept the word one. Arc NP/2 classifies one as a number and produces the next current state. Arc NP/3 classifies it as a pronoun and produces a backup state. This backup state is actually used later in step 6 when it is found that none of the arcs leaving node SI can accept the input word the. Of course, in general, many more backup states are generated than in this simple example. In these cases there will be a list of possible backup states. Depending on how this list is organized, you can produce different orderings on when the states are examined. An RTN parser can be constructed to use a chart-like structure to gain the advantages of chart parsing. In RTN systems, the chart is often called the well-formed substring table (WFST). Each time a pop is followed, the constituent is placed on the WFST, and every time a push is found, the WFST is checked before the subnetwork is invoked. If the chart contains constituent(s) of the type being pushed for, these are used and the subnetwork is not reinvoked. An RTN using a WFST has the same complexity as the chart parser described in the last section: K*n3, where n is the length of the sentence. 
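The RTN search procedure can also be sketched compactly. In the Python sketch below, the networks are flattened into a table of nodes, each holding a list of arcs of the form (type, label, destination); the particular network is a reconstruction of Grammar 3.19 from the traces in Figures 3.20 and 3.21, so its details, and all the names used, should be taken as assumptions made for illustration.

NODES = {
    # the S network (assumed shape, based on the traces)
    "S":   [("push", "NP", "S1")],
    "S1":  [("cat", "verb", "S2")],
    "S2":  [("pop", None, None), ("push", "NP", "S2")],
    # the NP network
    "NP":  [("cat", "art", "NP1"), ("cat", "number", "NP1"), ("cat", "pronoun", "NP2")],
    "NP1": [("cat", "adj", "NP1"), ("cat", "noun", "NP2")],
    "NP2": [("pop", None, None)],
}
LEXICON = {"one": ["number", "pronoun"], "saw": ["noun", "verb"],
           "the": ["art"], "old": ["adj", "noun"],
           "man": ["noun", "verb"], "cried": ["verb"]}

def parse(words):
    # A state is (current node, word position, stack of return points).
    possibilities = [("S", 0, ())]
    while possibilities:
        node, pos, returns = possibilities.pop()
        # arcs are pushed in reverse so that arc 1 is tried first (stack discipline)
        for arctype, label, dest in reversed(NODES[node]):
            if arctype == "cat":                     # case 1: current word is in the category
                if pos < len(words) and label in LEXICON[words[pos]]:
                    possibilities.append((dest, pos + 1, returns))
            elif arctype == "push":                  # case 2: enter the named subnetwork
                possibilities.append((label, pos, (dest,) + returns))
            else:                                    # a pop arc
                if returns:                          # case 3: return to the saved node
                    possibilities.append((returns[0], pos, returns[1:]))
                elif pos == len(words):              # case 4: pop with no input left
                    return True
    return False

print(parse("one saw the man".split()))    # True, after backtracking on "one"
print(parse("the old man cried".split()))  # True

As with the CFG parser, using the possibilities list as a stack gives depth-first search with backtracking; a well-formed substring table could be added by recording each popped constituent and consulting it before following a push arc.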
o 3.6 Top-Down Chart Parsing So far, you have seen a simple top-down method and a bottom-up chart-based method for parsing context-free grammars. Each of the approaches has its advantages and disadvantages. In this section a new parsing method is presented that 66 CHAPTER 3 Grammars and Parsing 67 actually captures the advantages of both. But first, consider the pluses and minuses of the approaches. Top-down methods have the advantage of being highly predictive. A word might be ambiguous in isolation, but if some of those possible categories cannot be used in a legal sentence, then these categories may never even be considered. For example, consider Grammar 3.8 in a top-down parse of the sentence The can holds the water, where can may be an AUX, V, or N, as before. The top-down parser would rewrite (S) to (NP VP) and then rewrite the NP to produce three possibilities, (ART ADJ N VP), (ART N VP), and (ADJ N VP). Taking the first, the parser checks if the first word, the, can be an ART, and then if the next word, can, can be an ADJ, which fails. Trying the next possibility, the parser checks the again, and then checks if can can be an N, which succeeds. The interpretations of can as an auxiliary and a main verb are never considered because no syntactic tree generated by the grammar would ever predict an AUX or V in this position. In contrast, the bottom-up parser would have considered all three interpretations of can from the start—that is, all three would be added to the chart and would combine with active arcs. Given this argument, the top-down approach seems more efficient. On the other hand, consider the top-down parser in the example above needed to check that the word the was an ART twice, once for each rule. This reduplication of effort is very common in pure top-down approaches and becomes a serious problem, and large constituents may be rebuilt again and again as they are used in different rules. In contrast, the bottom-up parser only checks the input once, and only builds each constituent exactly once. So by this argument, the bottom-up approach appears more efficient. You can gain the advantages of both by combining the methods. A small variation in the bottom-up chart algorithm yields a technique that is predictive like the top-down approaches yet avoids any reduplication of work as in the bottom-up approaches. As before, the algorithm is driven by an agenda of completed constituents and the arc extension algorithm, which combines active arcs with constituents when they are added to the chart. While both use the technique of extending arcs with constituents, the difference is in how new arcs are generated from the grammar, in the bottom-up approach, new active arcs are generated whenever a completed constituent is added that could be the first constituent of the right-hand side of a rule. With the top-down approach, new active arcs are generated whenever a new active arc is added to the chart, as described in the top-down arc introduction algorithm shown in Figure 3.22. The parsing algorithm is then easily stated, as is also shown in Figure 3.22. Consider this new algorithm operating with the same grammar on the same sentence as in Section 3.4, namely The large can can hold the water. Tn the initialization stage, an arc labeled S —> ° NP VP is added. Then, active arcs lor each rule that can derive an NP are added: NP h> ° ART ADJ N. NP -> ° ART N. Top-Down Arc Introduction Algorithm To add an arc S -»Q ... ° Q ... 
Cn ending at position j, do the following: For each rule in the grammar of form Q -> X i ... Xk, recursively add the new arc Q —> 0 Xi ... Xk from position j to j. Top-Down Chart Parsing Algorithm Initialization: For every rule in the grammar of form S -> X x ... Xk, add an arc-labeled S -» 0 X] ... Xk using the arc introduction algorithm. Parsing: Do until there is no input left: 1. If the agenda is empty, look up the interpretations of the next word and add them to the agenda. 2. Select a constituent from the agenda (call it constituent C). 3. Using the arc extension algorithm, combine C with every active arc on the chart. Any new constituents are added to the agenda. 4. For any active arcs created in step 3, add them to the chart using the top-down arc introduction algorithm. Figure 3.22 The top-down arc introduction and chart parsing algorithms ( \ s -> ° NP VP / NP^>°ADJN NP -> ° ART N NP -> ° ART ADJ N Figure 3.23 The initial chart and NP -> ° ADJ N are all added from position 1 to 1. Thus the initialized chart is as shown in Figure 3.23. The trace of the parse is as follows: Entering ART1 (the) from 1 to 2 Two arcs can be extended by the arc extension algorithm NP -> ART ° N from 1 to 2 NP -> ART ° ADJ N from 1 to 2 Entering ADJ1 (large) from 2 to 3 One arc can be extended NP -> ART ADJ ° N from 1 to 3 Entering AUX1 (can) from 3 to 4 No activity, constituent is ignored Entering VI (can) from 3 to 4 No activity, constituent is ignored 68 CHAPTER 3 Grammars and Parsing 69 NPl (rule 2) ART1 ADJ1 Nl. 1 the 2 large 3 can NP -» ART ° N^ NP -» ART ° AQJN NP -> ART ADJ 0 N > NP ° VP S-> NP NP ■ NP >NPVP >0 ADJ N >°ART N ► ° ART ADJ N VP -> ° AUX VP VP —> ° V NP Figure 3.24 The chart after building the first NP Entering Nl (caw) from 3 to 4 One arc extended and completed yielding NPl from 1 to 4 (the large can) Entering NPl from 1 to 4 One arc can be extended S -> NP ° VP from 1 to 4 Using the top-down rule (step 4), new active arcs are added for VP VP -> ° AUX VP from 4 to 4 VP -> ° V NP from 4 to 4 At this stage, the chart is as shown in Figure 3.24. Compare this with Figure 3.10. It contains fewer completed constituents since only those that are allowed by the top-down filtering have been constructed. The algorithm continues, adding the three interpretations of can as an AUX, V, and N. The AUX reading extends the VP -» ° AUX VP arc at position 4 and adds active arcs for a new VP starting at position 5. The V reading extends the VP -> 0 V NP arc and adds active arcs for an NP starting at position 5. The N reading does not extend any arc and so is ignored. After the two readings of hold (as an N and V) are added, the chart is as shown in Figure 3.25. Again, compare with the corresponding chart for the bottom-up parser in Figure 3.13. The rest of the sentence is parsed similarly, and the final chart is shown in Figure 3.26. In comparing this to the final chart produced by the bottom-up parser (Figure 3.15), you see that the number of constituents generated has dropped from 21 to 13. NPl (rule 2) V2 ART1 ADJ1 Nl. 
AUX2 V3 1 the 2 large 3 can S->NP°VP 4 can 5 hold 6 VP —> AUX ° VP VP->V°NP VP —> ° AUX VP VP —> ° V NP NP —> ° ART ADJ N NP -> ° ART N NP -> ° ADJ N Figure 3.25 The chart after adding hold, omitting arcs generated for the first NP S1 (rule 1 with NPl and VP2) VP2 (rule 5 with AUX2 and VP1) VP1 (rule 6 with V3 and NP2) NPl (rule 2) V2 NP2 (rule 3) ART1 ADJ1 Nl AUX2 V3 ART2 N4 1 the 2 large 3 can 4 can 5 hold 6 the 7 water 8 Figure 3.26 The final chart for the top-down filtering algorithm While it is not a big difference here with such a simple grammar, the difference can be dramatic with a sizable grammar. It turns out in the worst-case analysis that the top-down chart parser is not more efficient that the pure bottom-up chart parser. Both have a worst-case complexity of K*n3 for a sentence of length n. In practice, however, the top-down method is considerably more efficient for any reasonable grammar. 70 CHAPTER 3 BOX 3.2 Generative Capacity of Transition Networks Transition network systems can be classified by the types of languages they can describe. In fact, you can draw correspondences between various network systems and rewrite-rule systems. For instance, simple transition networks (that is, finite state machines) with no push arcs are expressively equivalent to regular grammars—that is, every language that can be described by a simple transition network can be described by a regular grammar, and vice versa. An FSM for the first language described in Box 3.1 is d jump jump Recursive transition networks, on the other hand, are expressively equivalent to context-free grammars. Thus an RTNcan be converted into a CFG and vice versa. A recursive transition network for the language consisting of a number of a's followed by an equal number of b's is 3.7 Finite State Models and Morphological Processing Although in simple examples and small systems you can list all the words allowed by the system, large vocabulary systems face a serious problem in representing the lexicon. Not only are there a large number of words, but each word may combine with affixes to produce additional related words. One way to address this problem is to preprocess the input sentence into a sequence of morphemes. A word may consist of single morpheme, but often a word consists of a root form plus an affix. For instance, the word eaten consists of the root form eat and the suffix -en, which indicates the past participle form. Without any preprocessing, a lexicon would have to list all the forms of eat, including eats, eating, ate, and eaten. With preprocessing, there would be one morpheme eat that may combine with suffixes such as -ing. -s, and -en, and one entry for the irregular form ate. Thus the lexicon would only need to store two entries (eat and Grammars and Parsing 71 Figure 3.27 A simple FST for the forms of happy ate) rather than four. Likewise the word happiest breaks down into the root form happy and the suffix -est, and thus does not need a separate entry in the lexicon. Of course, not all forms are allowed; for example, the word seed cannot be decomposed into a root form se (or see) and a suffix -ed. The lexicon would have to encode what forms are allowed with each root. One of the most popular models is based on finite state transducers (FSTs), which are like finite state machines except that they produce an output given an input. An arc in an FST is labeled with a pair of symbols. For example, an arc labeled i:y could only be followed if the current input is the letter i and the output is the letter y. 
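As a rough illustration of how such a transducer can be applied, the following Python sketch encodes a small arc table in the spirit of the happy network of Figure 3.27, which is described next. The state numbers, the handling of the -est arcs, and the dictionary encoding are assumptions made for this sketch, not details given in the text.

ARCS = {
    # arcs are keyed by state and input letter, giving (output, next state);
    # None as the input letter marks an empty (epsilon) transition
    1: {"h": ("h", 2)}, 2: {"a": ("a", 3)}, 3: {"p": ("p", 4)},
    4: {"p": ("p", 5)},
    5: {"y": ("y", "ACCEPT"), None: ("", 6)},     # accept "happy" or jump to the suffix states
    6: {"i": ("y", 7)},                           # the i:y arc
    7: {None: ("+", 8)},                          # output the morpheme boundary
    8: {"e": ("e", 9)},
    9: {"r": ("r", "ACCEPT"), "s": ("s", 10)},    # -er, or continue toward -est (assumed)
    10: {"t": ("t", "ACCEPT")},
}

def transduce(word, state=1, output=""):
    """Return the morpheme sequence for word, or None if it is not accepted."""
    if not word and state == "ACCEPT":
        return output
    if state == "ACCEPT" or state not in ARCS:
        return None
    arcs = ARCS[state]
    if word and word[0] in arcs:                  # try consuming one input letter
        out, nxt = arcs[word[0]]
        result = transduce(word[1:], nxt, output + out)
        if result is not None:
            return result
    if None in arcs:                              # otherwise try the epsilon transition
        out, nxt = arcs[None]
        return transduce(word, nxt, output + out)
    return None

print(transduce("happy"))     # happy
print(transduce("happier"))   # happy+er
print(transduce("happiest"))  # happy+est

Running the sketch on happier produces happy+er, and on happiest produces happy+est, which is the kind of morpheme sequence the lexicon transducers in this section are designed to output.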
FSTs can be used to concisely represent the lexicon and to transform the surface form of words into a sequence of morphemes. Figure 3.27 shows a simple FST that defines the forms of the word happy and its derived forms. It transforms the word happier into the sequence happy +er and happiest into the sequence happy +est. Arcs labeled by a single letter have that letter as both the input and the output. Nodes that are double circles indicate success states, that is, acceptable words. Consider processing the input word happier starting from state 1. The upper network accepts the first four letters, happ, and copies them to the output. From state 5 you could accept a y and have a complete word, or you could jump to state 6 to consider affixes. (The dashed link, indicating a jump, is not formally necessary but is useful for showing the break between the processing of the root form and the processing of the suffix.) For the word happier, you must jump to state 6. The next letter must be an i, which is transformed into a y. This is followed by a transition that uses no input (the empty symbol e) and outputs a plus sign. From state 8, the input must be an e, and the output is also e. This must be followed by an r to get to state 10, which is encoded as a double circle indicating a possible end of word (that is, a success state for the FST). Thus this FST accepts the appropriate forms and outputs the desired sequence of morphemes. The entire lexicon can be encoded as an FST that encodes all the legal input words and transforms Ihem into morphemic sequences. The FSTs for the 72 CHAPTER 3 Figure 3.28 A fragment of an FST defining some nouns (singular and plural) different suffixes need only be defined once, and all root forms that allow that suffix can point to the same node. Words that share a common prefix (such as torch, toss, and to) also can share the same nodes, greatly reducing the size of the network. The FST in Figure 3.28 accepts the following words, which all start with t: tie (state 4), ties (10), trap (7), traps (10), try (11), tries (15), to (16), torch (19), torches (15), toss (21), and tosses (15). In addition, it outputs the appropriate sequence of morphemes. Note that you may pass through acceptable states along the way when processing a word. For instance, with the input toss you would pass through state 15, indicating that to is a word. This analysis is not useful, however, because if to was accepted then the letters ss would not be accounted for. Using such an FST, an input sentence can be processed into a sequence of morphemes. Occasionally, a word will be ambiguous and have multiple different decompositions into morphemes. This is rare enough, however, that we will ignore this minor complication throughout the book. 3.8 Grammars and Logic Programming Another popular method of building a parser for CFGs is to encode the rules of the grammar directly in a logic programming language such as PROLOG. It turns but that the standard PROLOG interpretation algorithm uses exactly the same search strategy as the depth-first top-down parsing algorithm, so all that is needed is a way to reformulate context-free grammar rules as clauses in PROLOG. Consider the following CFG rule: S -> NPVP Grammars and Parsing 73 This rule can be reformulated as an axiom that says, "A sequence of words is a legal S if it begins with a legal NP that is followed by a legal VP." 
If you number each word in a sentence by its position, you can restate this rule as: "There is an S between position p1 and p3, if there is a position p2 such that there is an NP between p1 and p2 and a VP between p2 and p3." In PROLOG this would be the following axiom, where variables are indicated as atoms with an initial capitalized letter:

s(P1, P3) :- np(P1, P2), vp(P2, P3)

To set up the process, add axioms listing the words in the sentence by their position. For example, the sentence John ate the cat is described by

word(john, 1, 2)
word(ate, 2, 3)
word(the, 3, 4)
word(cat, 4, 5)

The lexicon is defined by a set of predicates such as the following:

isart(the)
isname(john)
isverb(ate)
isnoun(cat)

Ambiguous words would produce multiple assertions—one for each syntactic category to which they belong. For each syntactic category, you can define a predicate that is true only if the word between the two specified positions is of that category, as follows:

n(I, O) :- word(Word, I, O), isnoun(Word)
art(I, O) :- word(Word, I, O), isart(Word)
v(I, O) :- word(Word, I, O), isverb(Word)
name(I, O) :- word(Word, I, O), isname(Word)

1. s(P1, P3) :- np(P1, P2), vp(P2, P3)
2. np(P1, P3) :- art(P1, P2), n(P2, P3)
3. np(P1, P3) :- name(P1, P3)
4. pp(P1, P3) :- p(P1, P2), np(P2, P3)
5. vp(P1, P2) :- v(P1, P2)
6. vp(P1, P3) :- v(P1, P2), np(P2, P3)
7. vp(P1, P3) :- v(P1, P2), pp(P2, P3)

Figure 3.29 A PROLOG-based representation of Grammar 3.4

Using the axioms in Figure 3.29, you can prove that John ate the cat is a legal sentence by proving s(1, 5), as in Figure 3.30. In Figure 3.30, when there is a possibility of confusing different variables that have the same name, a prime (') is appended to the variable name to make it unique. This proof trace is in the same format as the trace for the top-down CFG parser, as follows. The state of the proof at any time is the list of subgoals yet to be proven. Since the word positions are included in the goal description, no separate position column need be traced. The backup states are also lists of subgoals, maintained automatically by a system like PROLOG to implement backtracking. A typical trace of a proof in such a system shows only the current state at any time.

Step  Current State                      Backup States              Comments
 1.   s(1, 5)
 2.   np(1, P2) vp(P2, 5)
 3.   art(1, P2') n(P2', P2) vp(P2, 5)   name(1, P2) vp(P2, 5)      fails as no ART at position 1
 4.   name(1, P2) vp(P2, 5)
 5.   vp(2, 5)                                                      name(1, 2) proven
 6.   v(2, 5)                            v(2, P2) np(P2, 5)         fails as no verb spans positions 2 to 5
                                         v(2, P2) pp(P2, 5)
 7.   v(2, P2) np(P2, 5)                 v(2, P2) pp(P2, 5)
 8.   np(3, 5)                           v(2, P2) pp(P2, 5)         v(2, 3) proven
 9.   art(3, P2) n(P2, 5)                name(3, 5)
                                         v(2, P2) pp(P2, 5)
10.   n(4, 5)                            name(3, 5)                 art(3, 4) proven
                                         v(2, P2) pp(P2, 5)
11.   proof succeeds                     name(3, 5)                 n(4, 5) proven
                                         v(2, P2) pp(P2, 5)

Figure 3.30 A trace of a PROLOG-based parse of John ate the cat

Because the standard PROLOG search strategy is the same as the depth-first top-down parsing strategy, a parser built from PROLOG will have the same computational complexity, Cn; that is, the number of steps can be exponential in the length of the input. Even with this worst-case analysis, PROLOG-based grammars can be quite efficient in practice. It is also possible to insert chart-like mechanisms to improve the efficiency of a grammar, although then the simple correspondence between context-free rules and PROLOG rules is lost. Some of these issues will be discussed in the next chapter.

It is worthwhile to try some simple grammars written in PROLOG to better understand top-down, depth-first search. By turning on the tracing facility, you can obtain a trace similar in content to that shown in Figure 3.30.

Summary

The two basic grammatical formalisms are context-free grammars (CFGs) and recursive transition networks (RTNs). A variety of parsing algorithms can be used for each. For instance, a simple top-down backtracking algorithm can be used for both formalisms and, in fact, the same algorithm can be used in the standard logic-programming-based grammars as well. The most efficient parsers use a chart-like structure to record every constituent built during a parse. By reusing this information later in the search, considerable work can be saved.

Related Work and Further Readings

There is a vast literature on syntactic formalisms and parsing algorithms. The notion of context-free grammars was introduced by Chomsky (1956) and has been studied extensively since in linguistics and in computer science. Some of this work will be discussed in detail later, as it is more relevant to the material in the following chapters. Most of the parsing algorithms were developed in the mid-1960s in computer science, usually with the goal of analyzing programming languages rather than natural language. A classic reference for work in this area is Aho, Sethi, and Ullman (1986), or Aho and Ullman (1972), if the former is not available. The notion of a chart is described in Kay (1973; 1980) and has been adapted by many parsing systems since. The bottom-up chart parser described in this chapter is similar to the left-corner parsing algorithm in Aho and Ullman (1972), while the top-down chart parser is similar to that described by Earley (1970) and hence called the Earley algorithm. Transition network grammars and parsers are described in Woods (1970; 1973), and parsers based on logic programming are described and compared with transition network systems in Pereira and Warren (1980). Winograd (1983) discusses most of the approaches described here from a slightly different perspective, which could be useful if a specific technique is difficult to understand. Gazdar and Mellish (1989a; 1989b) give detailed descriptions of implementations of parsers in LISP and in PROLOG. In addition, descriptions of transition network parsers can be found in many introductory AI texts, such as Rich and Knight (1992), Winston (1992), and Charniak and McDermott (1985). These books also contain descriptions of the search techniques underlying many of the parsing algorithms. Norvig (1992) is an excellent source on AI programming techniques.

The best sources for work on computational morphology are two books: Sproat (1992) and Ritchie et al. (1992). Much of the recent work on finite state models has been based on the KIMMO system (Koskenniemi, 1983). Rather than requiring the construction of a huge network, KIMMO uses a set of FSTs which are run in parallel; that is, all of them must simultaneously accept the input and agree on the output. Typically, these FSTs are expressed using an abstract language that allows general morphological rules to be expressed concisely. A compiler can then be used to generate the appropriate networks for the system. Finite state models are useful for a wide range of processing tasks besides morphological analysis. Blank (1989), for instance, is developing a grammar for English using only finite state methods. Finite state grammars are also used extensively in speech recognition systems.
Exercises for Chapter 3_ 1. (easy) a Express the following tree in the list notation in Section 3.1. the dog chased a toad the park b. Is there a tree structure that could not be expressed as a list structure? How about a list structure that could not be expressed as a tree? (easy) Given the CFG in Grammar 3.4, define an appropriate lexicon and show a trace in the format of Figure 3.5 of a top-down CFG parse of the sentence The man walked the old dog. (easy) Given the RTN in Grammar 3.19 and a lexicon in which green can be an adjective or a noun, show a trace in the format of Figure 3.21 of a top-down RTN parse of the sentence The green faded. (easy) Given the PROLOG-based grammar defined in Figure 3.29, show a trace in the format of Figure 3.30 of the proof that the following is a legal sentence: The cat ate John. (medium) Map the following context-free grammar into an equivalent recursive transition network that uses only three networks—an S, NP, and PP network. Make your networks as small as possible. S -» NPVP VP -» V VP -» VNP VP -» VPP NP -> ARTNP2 NP -» NP2 NP2 -> N NP2 -> ADJNP2 NP2 -> NP3 PREPS NP3 -> N PREPS -> PP PREPS -> PP PREPS PP -> NP 6. (medium) Given the CFG in Exercise 5 and the following lexicon, construct a trace of a pure top-down parse and a pure bottom-up parse of the sentence The herons fly in groups. Make your traces as clear as possible, select the rules in the order given in Exercise 5, and indicate all parts of the search. The lexicon entries for each word are the: ART herons: N fly: NVADJ in: P groups: NV 7. (medium) Consider the following grammar: S -> ADJS N S -> N ADJS -> ADJS ADJ ADJS -> ADJ Lexicon: ADJ: red, N: house a What happens to the top-down depth-first parser operating on this grammar trying to parse the input red red! In particular, state whether the parser succeeds, fails, or never stops. b. How about a top-down breadth-first parser operating on the same input red red? c How about a top-down breadth-first parser operating on the input red house'} d. How about a bottom-up depth-first parser on red house? e. For the cases where the parser fails to stop, give a grammar that is equivalent to the one shown in this exercise and that is parsed correctly. (Correct behavior includes failing on unacceptable phrases as well as succeeding on acceptable ones.) f. With the new grammar in part (e), do all the preceding parsers now operate correctly on the two phrases red red and red house? 78 CHAPTER 3 8. (medium) Consider the following CFG: S NPV S -» NPAUXV NP -> ARTN Trace one of the chart parsers in processing the sentence 1 The 2 man 3 is 4 laughing 5 with the lexicon entries: the: ART man: N is: AUX laughing: V Show every step of the parse, giving the parse stack, and drawing the chart each time a nonterminal constituent is added to the chart. 9. (medium) Consider the following CFG that generates sequences of letters: Grammars and Parsing 79 s -> a X c s —> b X c s —> b X d s -» b X e s -» c X e X -> f X x -> g a If you had to write a parser for this grammar, would it be better to use a pure top-down or a pure bottom-up approach? Why? b. Trace the parser of your choice operating on the input bffge. 10. (medium) Consider the following CFG and RTN: NP -> ART NPl NPl -» ADJNPPS PPS -> PP PPS -> PPPPS PP -> PNP a State two ways in which the languages described by these two grammars differ. For each, give a sample sentence that is recognized by one grammar but not the other and that demonstrates the difference. b. 
Write a new CFG equivalent to the RTN shown here. c Write a new RTN equivalent to the CFG shown here. 11. (hard) Consider the following sentences: List A List B i. Joe is reading the book. ii. Joe had won a letter. Hi. Joe has to win. iv. Joe will have the letter. v. The letter in the book was read. vi. The letter must have been in the book by Joe. vii. The man could have had one. i. *Joe has reading the book. ii. * Joe had win. Hi. *Joe winning. iv. *Joe will had the letter. v. *The book was win by Joe. vi. *Joe will can be mad. vii. *The man can have having one. Write a context-free grammar that accepts all the sentences in list A while rejecting the sentences in list B. You may find it useful to make reference to the grammatical forms of verbs discussed in Chapter 2. Implement one of the chart-based parsing strategies and, using the grammar specified in part (a), demonstrate that your parser correctly accepts all the sentences in A and rejects those in B. You should maintain enough information in each entry on the chart so that you can reconstruct the parse tree for each possible interpretation. Make sure your method of recording the structure is well documented and clearly demonstrated. List three (distinct) grammatical forms that would not be recognized by a parser implementing the grammar in part (a). Provide an example of your own for each of these grammatical forms. CHAPTER Features and Augmented Grammars 4.1 Feature Systems and Augmented Grammars 4.2 Some Basic Feature Systems for English 4.3 Morphological Analysis and the Lexicon 4.4 A Simple Grammar Using Features 4.5 Parsing with Features 4.6 Augmented Transition Networks 4.7 Definite Clause Grammars 4.8 Generalized Feature Systems and Unification Grammars Features and Augmented Grammars 83 Context-free grammars provide the basis for most of the computational parsing mechanisms developed to date, but as they have been described so far, they would be very inconvenient for capturing natural languages. This chapter describes an extension to the basic context-free mechanism that defines constituents by a set of features. This extension allows aspects of natural language such as agreement and subcategorization to be handled in an intuitive and concise way. Section 4.1 introduces the notion of feature systems and the generalization of context-free grammars to allow features. Section 4.2 then describes some useful feature systems for English that are typical of those in use in various grammars. Section 4.3 explores some issues in defining the lexicon and shows how using features makes the task considerably simpler. Section 4.4 describes a sample context-free grammar using features and introduces some conventions that simplify the process. Section 4.5 describes how to extend a chart parser to handle a grammar with features. The remaining sections, which are optional, describe how features are used in other grammatical formalisms and explore some more advanced material. Section 4.6 introduces augmented transition networks, which are a generalization of recursive transition networks with features, and Section 4.7 describes definite clause grammars based on PROLOG. Section 4.8 describes generalized feature systems and unification grammars. 4.1 Feature Systems and Augmented Grammars In natural languages there are often agreement restrictions between words and phrases. 
For example, the NP a men is not correct English because the article a indicates a single object while the noun men indicates a plural object; the noun phrase does not satisfy the number agreement restriction of English. There are many other forms of agreement, including subject-verb agreement, gender agreement for pronouns, restrictions between the head of a phrase and the form of its complement, and so on. To handle such phenomena conveniently, the grammatical formalism is extended to allow constituents to have features. For example, we might define a feature NUMBER that may take a value of either s (for singular) orp (for plural), and we then might write an augmented CFG rule such as NP -> ART N only when NUMBER! agrees with NUMBER 2 This rule says that a legal noun phrase consists of an article followed by a noun, but only when the number feature of the first word agrees with the number feature of the second. This one rule is equivalent to two CFG rules that would use different terminal symbols for encoding singular and plural forms of all noun phrases, such as NP-SING -> ART-SING N-SING NP-PLURAL -> ART-PLURAL N-PLURAL While the two approaches seem similar in ease-of-use in this one example, consider that all rules in the grammar that use an NP on the right-hand side would 84 CHAPTER 4 Features and Augmented Grammars 85 now need to be duplicated to include a rule for NP-SING and a rule for NP-PLURAL, effectively doubling the size of the grammar. And handling additional features, such as person agreement, would double the size of the grammar again "; and again. Using features, the size of the augmented grammar remains the same as the original one yet accounts for agreement constraints. j To accomplish this, a constituent is defined as a feature structure—a j mapping from features to values that defines the relevant properties of the -constituent. In the examples in this book, feature names in formulas will be written in boldface. For example, a feature structure for a constituent ART1 that represents a particular use of the word a might be written as follows: j ART1: (CAT ART i ROOTa j NUMBER s) i This says it is a constituent in the category ART that has as its root the word a i and is singular. Usually an abbreviation is used that gives the CAT value more prominence and provides an intuitive tie back to simple context-free grammars. - In this abbreviated form, constituent ART1 would be written as m I ART1: (ART ROOT a NUMBER s) J Feature structures can be used to represent larger constituents as well. To do this, 1 feature structures themselves can occur as values. Special features based on the integers—1, 2, 3, and so on—will stand for the first subconstituent, second subconstituent, and so on, as needed. With this, the representation of the NP 1 constituent for the phrase a fish could be '■ NP1: (NP NUMBER s j 1 (ART ROOTa | NUMBER s) j 2(N ROOT fish 1 NUMBER s)) i Note that this can also be viewed as a representation of a parse tree shown in i Figure 4.1, where the subconstituent features 1 and 2 correspond to the sub- j constituent links in the tree. j The rules in an augmented grammar are stated in terms of feature structures i rather than simple categories. Variables are allowed as feature values so that a ,| rule can apply to a wide range of situations. 
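As a rough illustration (a Python sketch, not the formalism used in this book), such a feature structure can be represented as a mapping from feature names to values, with the special features 1, 2, and so on holding the subconstituents; the names ART1, N1, and NP1 below mirror the constituents just discussed.

# Feature structures as nested Python dictionaries.  CAT is simply another
# feature, and the integer features 1, 2, ... hold the subconstituents.
ART1 = {'CAT': 'ART', 'ROOT': 'a', 'NUMBER': 's'}
N1   = {'CAT': 'N',   'ROOT': 'fish', 'NUMBER': 's'}

NP1 = {
    'CAT': 'NP',
    'NUMBER': 's',
    1: ART1,            # first subconstituent
    2: N1,              # second subconstituent
}

print(NP1['CAT'])         # NP
print(NP1[1]['ROOT'])     # a

Viewed this way, the structure is just the extended parse tree of Figure 4.1 written down as data.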
For example, a rule for simple noun m phrases would be as follows: IB (NP NUMBER ?n) (ART NUMBER ?n) (N NUMBER ?n) M This says that an NP constituent can consist of two subconstituents, the Wt first being an ART and the second being an N, in which the NUMBER feature in J3 all three constituents is identical. According to this rule, constituent NPI given Jm previously is a legal constituent. On the other hand, the constituent --sM NUMBER s s Figure 4.1 Viewing a feature structure as an extended parse tree *(NP 1 (ART NUMBER s) 2 (N NUMBER s)) is not allowed by this rule because there is no NUMBER feature in the NP, and the constituent *(NP NUMBERS 1 (ART NUMBER s) 2 (N NUMBER.p)) is not allowed because the NUMBER feature of the N constituent is not identical to the other two NUMBER features. Variables are also useful in specifying ambiguity in a constituent. For instance, the word fish is ambiguous between a singular and a plural reading. Thus the word might have two entries in the lexicon that differ only by the value of the NUMBER feature. Alternatively, we could define a single entry that uses a \ ariable as the value of the NUMBER feature, that is, (N ROOT fish NUMBER ?n) This works because any value of the NUMBER feature is allowed for the word fish. In many cases, however, not just any value would work, but a range of values is possible. To handle these cases, we introduce constrained variables, which are variables that can only take a value out of a specified list. For example, the variable ?n{s p) would be a variable that can take the value s or the value p. Typically, when we write such variables, we will drop the variable name altogether and just list the possible values. Given this, the word fish might be represented by the constituent (N ROOT fish NUMBER ?n{s p}) or more simply as (N ROOT fish NUMBER {s p}) Features and Augmented Grammars 87 BOX 4.1 Formalizing Feature Structures There is an active area of research in the formal properties of feature structures. This work views a feature system as a formal logic. A feature structure is defined as a partial function from features to feature values. For example, the feature structure ART1: (CAT ART ROOT a NUMBERS) is treated as an abbreviation of the following statement in FOPC: ART1( CAT) = ART a ART] (ROOT) = a a ART1 (NUMBER) = s Feature structures with disjunctive values map to disjunctions. The structure THE1: (CAT ART ROOT the NUMBER {s p}) would be represented as THE1( CAT) = ART a THE1 (ROOT) = the a(THE1(NUMBER) = sv THEl(NUMBER) = p) Given this, agreement between feature values can be defined as equality equations. There is an interesting issue of whether an augmented context-free grammar can describe languages that cannot be described by a simple context-free grammar. The answer depends on the constraints on what can be a feature value. If the set of feature values is finite, then it would always be possible to create new constituent categories for every combination of features. Thus it is expressively equivalent to a context-free grammar. If the set of feature values is unconstrained, however, then such grammars have arbitrary computational power. In practice, even when the set of values is not explicitly restricted, this power is not used, and the standard parsing algorithms can be used on grammars that include features. 
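Before turning to the particular feature systems used in this book, here is a minimal Python sketch of the agreement checking just described. It assumes the dictionary representation sketched earlier and represents a variable by the set of values it may take, so a constrained variable such as ?n{s p} is simply {'s', 'p'} and agreement is set intersection; the function name match_rule is an illustrative choice, not part of any system described here.

# Check whether a sequence of constituents can agree on one feature.
def match_rule(var, constituents, feature='NUMBER'):
    """Intersect var with each constituent's value for the feature.
    Returns the narrowed binding, or None if agreement fails."""
    for c in constituents:
        value = c.get(feature, var)        # an unspecified feature matches anything
        value = value if isinstance(value, set) else {value}
        var = var & value
        if not var:
            return None
    return var

the = {'CAT': 'ART', 'ROOT': 'the', 'NUMBER': {'s', 'p'}}
a   = {'CAT': 'ART', 'ROOT': 'a',   'NUMBER': 's'}
dog = {'CAT': 'N',   'ROOT': 'dog', 'NUMBER': 's'}
men = {'CAT': 'N',   'ROOT': 'man', 'NUMBER': 'p'}

print(match_rule({'s', 'p'}, [the, dog]))   # {'s'}  -- "the dog" is acceptable
print(match_rule({'s', 'p'}, [a, men]))     # None   -- "*a men" fails to agree

The same intersection behavior is what allows the single entry for fish, with NUMBER {s p}, to combine with either a singular or a plural determiner.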
4.2 Some Basic Feature Systems for English This section describes some basic feature systems that are commonly used in grammars of English and develops the particular set of features used throughout this book. Specifically, it considers number and person agreement, verb form features, and features required to handle subcategorization constraints. You should read this to become familiar with the features in general and then refer back to it later when you need a detailed specification of a particular feature. Person and Number Features In the previous section, you saw the number system in English: Words may be classified as to whether they can describe a single object or multiple objects. While number agreement restrictions occur in several different places in English, they are most importantly found in subject-verb agreement. But subjects and verbs must also agree on another dimension, namely with respect to the person. The possible values of this dimension are First Person (1): The noun phrase refers to the speaker, or a group of people including the speaker (for example, I, we, you, and I). Second Person (2): The noun phrase refers to the listener, or a group including the listener but not including the speaker (for example, you, all of you). Third Person (3): The noun phrase refers to one or more objects, not including the speaker or hearer. Since number and person features always co-occur, it is convenient to combine the two into a single feature, AGR, that has six possible values: first person singular (Is), second person singular (2s), third person singular (3s), and first, second and third person plural (lp, 2p, and 3p, respectively). For example, an instance of the word is can agree only with a third person singular subject, so its AGR feature would be 3s. An instance of the word are, however, may agree with second person singular or any of the plural forms, so its AGR feature would be a variable ranging over the values {2s lp 2p 3p}. Verb-Form Features and Verb Subcategorization Another very important feature system in English involves the form of the verb. This feature is used in many situations, such as the analysis of auxiliaries and generally in the subcategorization restrictions of many head words. As described in Chapter 2, there are five basic forms of verbs. The feature system for verb forms will be slightly more complicated in order to conveniently capture certain phenomena. In particular, we will use the following feature values for the feature WORM: base—base form (for example, go, be, say, decide) pres—simple present tense (for example, go, goes, am, is, say, says, decide) past—simple past tense (for example, went, was, said, decided) fin—finite (that is, a tensed form, equivalent to {pres past}) ing—present participle (for example, going, being, saying, deciding) pastprt—past participle (for example, gone, been, said, decided) inf—a special feature value that is used for infinitive forms with the word to 88 CHAPTER 4 Features and Augmented Grammars 89 Value Example Verb Example _none laugh Jack laughed. _np find Jack found a key. _np_np give Jack gave Sue the paper. _vp:inf want Jack wants to fly. _np_vp:inf tell Jack told the man to go. _vp:ing keep Jack keeps hoping for the best. _np_vp:ing catch Jack caught Sam looking at his desk. _np_vp:base watch Jack watched Sam look at his desk. Figure 4.2 The SUBCAT values for NP/VP combinations To handle the interactions between words and their complements, an addi - j tional feature, SUBCAT, is used. 
Chapter 2 described some common verb sub- j categorization possibilities. Each one will correspond to a different value of the =j SUBCAT feature. Figure 4.2 shows some SUBCAT values for complements -| consisting of combinations of NPs and VPs. To help you remember the meaning J of the feature values, they are formed by listing the main category of each part of -j the complement. If the category is restricted by a feature value, then the feature J value follows the constituent separated by a colon. Thus the value _np_vp:inf I will be used to indicate a complement that consists of an NP followed by a VP i with VFORM value inf. Of course, this naming is just a convention to help the < reader; you could give these values any arbitrary name, since their significance is j determined solely by the grammar rules that involve the feature. For instance, the i rule for verbs with a SUBCAT value of _np_vp:inf would be ; (VP) (V SUBCAT _np_vp:inf) i (NP) (VP VFORM inf) This says that a VP can consist of a V with SUBCAT value _np_vp:inf, followed by an NP, followed by a VP with VFORM value inf. Clearly, this rule could be rewritten using any other unique symbol instead of _np_vp:inf, as long as the j lexicon is changed to use this new value. • Many verbs have complement structures that require a prepositional phrase with a particular preposition, or one that plays a particular role. For example, the ; verb give allows a complement consisting of an NP followed by a PP using the preposition to, as in Jack gave the money to the bank. Other verbs, such as put, 5 require a prepositional phrase that describes a location, using prepositions such as -A in, inside, on, and by. To express (his within the feature system, we introduce a feature PFORM on prepositional phrases. A prepositional phrase with a PFORM " value such as TO must have the preposition to as its head, and so on. A prcposi- 1 tional phrase with a PFORM value LOC must describe a location. Another useful -1 PFORM value is MOT. used with verbs such as walk, which may lake a 3 Value Example Prepositions Example TO to I gave it to the bank. LOC in, on, by, inside, on top of I put it on the desk. MOT to, from, along,... 1 walked to the store. Figure 43 Some values of the PFORM feature for prepositional phrases Value Example Verb Example _np_pp:to give Jack gave the key to the man. _pp:loc be Jack is at the store. _np_pp:loc put Jack put the box in the corner. _pp:mot go Jack went to the store. _np_pp:mot take Jack took the hat to the party. _adjp be, seem Jack is happy. _np_adjp keep Jack kept the dinner hot. _s:that believe Jack believed that the world was flat. _s:for hope Jack hoped for the man to win the prize. Figure 4.4 Additional SUBCAT values prepositional phrase that describes some aspect of a path, as in We walked to the store. Prepositions that can create such phrases include to, from, and along. The LOC and MOT values might seem hard to distinguish, as certain prepositions might describe either a location or a path, but they are distinct. For example, while Jack put the box {in on by} the corner is fine, *Jack put the box {to from along} the corner is ill-formed. Figure 4.3 summarizes the PFORM feature. This feature can be used to restrict the complement forms for various verbs. 
Using the naming convention discussed previously, the SUBCAT value of a verb such as put would be _np_pp:loc, and the appropriate rule in the grammar would be (VP) (V SUBCAT _np_pp:loc) (NP) (PP PFORM LOC) For embedded sentences, a complementizer is often needed and must be subcategorized for. Thus a COMP feature with possible values for, that, and no-comp will be useful. For example, the verb tell can subcategorize for an S that has the complementizer that. Thus one SUBCAT value of tell will be _s:that. Similarly, the verb wish subcategorizes for an S with the complementizer for, as in We wished for the rain to stop. Thus one value of the SUBCAT feature for wish is _s:for..Figure 4.4 lists some of these additional SUBCAT values and examples for a variety of verbs. In this section, all the examples with the 90 CHAPTER 4 Features and Augmented Grammars 91 SUBCAT feature have involved verbs, but nouns, prepositions, and adjectives may also use the SUBCAT feature and subcategorize for their complements in the same way. Binary Features Certain features are binary in that a constituent either has or doesn't have the feature. In our formalization a binary feature is simply a feature whose value is restricted to be either + or -. For example, the INV feature is a binary feature that indicates whether or not an S structure has an inverted subject (as in a yes/no question). The S structure for the sentence Jack laughed will have an INV value whereas the S structure for the sentence Did Jack laugh? will have the INV value +. Often, the value is used as a prefix, and we would say that a structure has the feature +INV or -INV. Other binary features will be introduced as necessary throughout the development of the grammars. The Default Value for Features It will be useful on many occasions to allow a default value for features. Anytime a constituent is constructed that could have a feature, but a value is not specified, the feature takes the default value of -. This is especially useful for binary features but is used for nonbinary features as well; this usually ensures that any later agreement check on the feature will fail. The default value is inserted when the constituent is first constructed. 4.3 Morphological Analysis and the Lexicon Before you can specify a grammar, you must define the lexicon. This section explores some issues in lexicon design and the need for a morphological analysis component. The lexicon must contain information about all the different words that can be used, including all the relevant feature value restrictions. When a word is ambiguous, it may be described by multiple entries in the lexicon, one for each different use. Because words tend to follow regular morphological patterns, however, many forms of words need not be explicitly included in the lexicon. Most English verbs, for example, use the same set of suffixes to indicate different forms: -s is added for third person singular present tense, -ed for past tense, -ing for the present participle, and so on. Without any morphological analysis, the lexicon would have to contain every one of these forms. For the verb want this would require six entries, for want (both in base and present form), wants, wanting, and wanted (both in past and past participle form). In contrast, by using the methods described in Section 3.7 to strip suffixes there needs to be only one entry for want. 
The idea is to store the base form of the verb in the lexicon and use context-free rules to combine verbs with suffixes to derive the other entries. Consider the following rule for present tense verbs: (V ROOT ?r SUBCAT ?s WORM pres AGR 3s) -> (V ROOT ?r SUBCAT ?s VFORM base) (+S) where +S is a new lexical category that contains only the suffix morpheme -s. This rule, coupled with the lexicon entry want:(V ROOT want SUBCAT {_np _vp:inf _np_vp:inf] VFORM base) would produce the following constituent given the input string want -s want:(V ROOT want SUBCAT {_np _vp:inf _np_vp:inf} VFORM pres AGR 3s) Another rule would generate the constituents for the present tense form not in third person singular, which for most verbs is identical to the root form: (V ROOT ?r SUBCAT ?s VFORM pres AGR (Is 2s lp 2p 3p}) -4 (V ROOT ?r SUBCAT ?s VFORM base) But this rule needs to be modified in order to avoid generating erroneous interpretations. Currently, it can transform any base form verb into a present tense form, which is clearly wrong for some irregular verbs. For instance, the base form he cannot be used as a present form (for example, *We be at the store). To cover these cases, a feature is introduced to identify irregular forms. Specifically, verbs with the binary feature +IRREG-PRES have irregular present tense forms. Now the rule above can be stated correctly: (V ROOT ?r SUBCAT ?s VFORM pres AGR {Is 2s lp 2p 3p}) -> (V ROOT ?r SUBCAT ?s VFORM base IRREG-PRES -) Because of the default mechanism, the IRREG-PRES feature need only be specified on the irregular verbs. The regular verbs default to -, as desired. Similar binary features would be needed to flag irregular past forms (IRREG-PAST, such as saw), and to distinguish -en past participles from -ed past participles (EN-PASTPRT). These features restrict the application of the standard lexical rules, and the irregular forms are added explicitly to the lexicon. Grammar 4.5 gives a set of rules for deriving different verb and noun forms using these features. Given a large set of features, the task of writing lexical entries appears very difficult. Most frameworks allow some mechanisms that help alleviate these problems. The first technique—allowing default values for features—has already been mentioned. With this capability, if an entry takes a default value for a given feature, then it need not be explicitly stated. Another commonly used technique is 92 CHAPTER 4 Present Tense 1. (V ROOT ?r SUBCAT ?s VFORM pres AGR 3s) -> (V ROOT ?r SUBCAT ?s VFORM base IRREG-PRES -) +S 2. (V ROOT ?r SUBCAT ?s VFORM pres AGR {Is 2s lp 2p 3p}) -» (V ROOT ?r SUBCAT ?s VFORM base IRREG-PRES -) Past Tense 3. (V ROOT ?r SUBCAT ?s VFORM past AGR {Is 2s 3s lp 2p 3p}) -» (V ROOT ?r SUBCAT ?s VFORM base IRREG-PAST -) +ED Past Participle 4. (V ROOT ?r SUBCAT ?s VFORM pastprt) -> (V ROOT ?r SUBCAT ?s VFORM base EN-PASTPRT -) +ED 5. (V ROOT ?r SUBCAT ?s VFORM pastprt) -> (V ROOT ?r SUBCAT ?s VFORM base EN-PASTPRT +) +EN Present Participle 6. (V ROOT ?r SUBCAT ?s VFORM tag) -> (V ROOT ?r SUBCAT ?s VFORM base) +ING Plural Nouns 7. (N ROOT ?r AGR 3p) -» (N ROOT ?r AGR 3s IRREG-PL -) +S Grammar 4.5 Some lexical rules for common suffixes on verbs and nouns to allow the lexicon writing to define clusters of features, and then indicate a cluster with a single symbol rather than listing them all. Later, additional techniques will be discussed that allow the inheritance of features in a feature hierarchy. Figure 4.6 contains a small lexicon. 
It contains many of the words to be used in the examples that follow. It contains three entries for the word saw—as a noun, as a regular verb, and as the irregular past tense form of the verb see—as illustrated in the sentences

The saw was broken.
Jack wanted me to saw the board in half.
I saw Jack eat the pizza.

With an algorithm for stripping the suffixes and regularizing the spelling, as described in Section 3.7, the derived entries can be generated using any of the basic parsing algorithms on Grammar 4.5. With the lexicon in Figure 4.6 and Grammar 4.5, correct constituents for the following words can be derived: been, being, cries, cried, crying, dogs, saws (two interpretations), sawed, sawing, seen, seeing, seeds, wants, wanting, and wanted. For example, the word cries would be transformed into the sequence cry +s, and then rule 1 would produce the present tense entry from the base form in the lexicon.

a: (CAT ART ROOT A1 AGR 3s)
be: (CAT V ROOT BE1 VFORM base IRREG-PRES + IRREG-PAST + SUBCAT {_adjp _np})
cry: (CAT V ROOT CRY1 VFORM base SUBCAT _none)
dog: (CAT N ROOT DOG1 AGR 3s)
fish: (CAT N ROOT FISH1 AGR {3s 3p} IRREG-PL +)
happy: (CAT ADJ SUBCAT _vp:inf)
he: (CAT PRO ROOT HE1 AGR 3s)
is: (CAT V ROOT BE1 VFORM pres SUBCAT {_adjp _np} AGR 3s)
Jack: (CAT NAME AGR 3s)
man: (CAT N ROOT MAN1 AGR 3s)
men: (CAT N ROOT MAN1 AGR 3p)
saw: (CAT N ROOT SAW1 AGR 3s)
saw: (CAT V ROOT SAW2 VFORM base SUBCAT _np)
saw: (CAT V ROOT SEE1 VFORM past SUBCAT _np)
see: (CAT V ROOT SEE1 VFORM base SUBCAT _np IRREG-PAST + EN-PASTPRT +)
seed: (CAT N ROOT SEED1 AGR 3s)
the: (CAT ART ROOT THE1 AGR {3s 3p})
to: (CAT TO)
want: (CAT V ROOT WANT1 VFORM base SUBCAT {_np _vp:inf _np_vp:inf})
was: (CAT V ROOT BE1 VFORM past AGR {1s 3s} SUBCAT {_adjp _np})
were: (CAT V ROOT BE1 VFORM past AGR {2s 1p 2p 3p} SUBCAT {_adjp _np})

Figure 4.6 A lexicon

Often a word will have multiple interpretations that use different entries and different lexical rules. The word saws, for instance, transformed into the sequence saw +s, can be a plural noun (via rule 7 and the first entry for saw), or the third person present form of the verb saw (via rule 1 and the second entry for saw). Note that rule 1 cannot apply to the third entry, as its VFORM is not base.

The success of this approach depends on being able to prohibit erroneous derivations, such as analyzing seed as the past tense of the verb see. This analysis will never be considered if the FST that strips suffixes is correctly designed. Specifically, the word see will not allow a transition to the states that allow the -ed suffix. But even if this were produced for some reason, the IRREG-PAST value + in the entry for see would prohibit rule 3 from applying.

4.4 A Simple Grammar Using Features

This section presents a simple grammar using the feature systems and lexicon developed in the earlier sections. It will handle sentences such as the following:

The man cries.
The men cry.
The man saw the dogs.
He wants the dog.
He wants to be happy.
He wants the man to see the dog.
He is happy to be a dog.

It does not find the following acceptable:

*The men cries.
*The man cry.
*The man saw to be happy.
*He wants.
*He wants the man saw the dog.

Before developing the grammar, some additional conventions are introduced that will be very useful throughout the book. It is very cumbersome to write grammatical rules that include all the necessary features.
But there are certain regularities in the use of features that can be exploited to simplify the process of writing rules. For instance, many feature values are unique to a feature (for example, the value inf can only appear in the VFORM feature, and _np_vp:inf can only appear in the SUBCAT feature). Because of this, we can omit the feature name without introducing any ambiguity. Unique feature values will be listed using square parentheses. Thus (VP SUBCAT inf) will be abbreviated as VP[mf]. Since binary features do not have unique values, a special convention is introduced for them. For a binary feature B, the constituent C[+B1 indicates the constituent (C B +). Many features arc constrained so that the value on the mother must be identical to the value on its head subconstituent. These are called head features. For instance, in all VP rules the VFORM and AGR values are the same in the VP and the head verb, as in the rule 1 BOX 4.2 Systemic Grammar An important influence on the development of computational feature-based systems was systemic grammar (Halliday, 1985). This theory emphasizes the functional role of linguistic constructs as they affect communication. The grammar is organized as a set of choices about discourse function that determine the structure of the sentence. The choices are organized into hierarchical structures called systems. For example, the mood system would capture all the choices that affect the mood of the sentence. Part of this structure looks as follows: major indicative - imperative declarative interrogative. bound relative yes/no wh This structure indicates that once certain choices are made, others become relevant. For instance, if you decide that a sentence is in the declarative mood, then the choice between bound and relative becomes relevant. The choice between yes/no and wh, on the other hand, is not relevant to a declarative sentence. Systemic grammar was used in Winograd (1973), and Winograd (1983) contains a good discussion of the formalism. In recent years it has mainly been used in natural language generation systems because it provides a good formalism for organizing the choices that need to be made while planning a sentence (for example, see Mann and Mathiesson (1985) and Patten (1988)). (VP VFORM ?v AGR ?a) -4 (V VFORM ?v AGR ?a SUBCAT _np_vp:inf) (NP) (VP VFORM inf) If the head features can be declared separately from the rules, the system can automatically add these features to the rules as needed. With VFORM and AGR declared as head features, the previous VP rule can be abbreviated as VP -4 (V SUBCAT _np_vp:inf) NP (VP VFORM inf) The head constituent in a rule will be indicated in italics. Combining all the abbreviation conventions, the rule could be further simplified to VP V\ _np_vp:inf] NP VPfinf] A simple grammar using these conventions is shown as Grammar 4.7. Except for rules 1 and 2, which must enforce number agreement, all the rest of the feature constraints can be captured using the conventions that have been 96 CHAPTER 4 Features and Augmented Grammars 97 1. S[-inv] -> (NP AGR ?a) (VP[[pres past}]AGR ?a) 2. NP -> (ART AGR ?a) (NAGR ?a) 3. NP -> P/?0 4. VP->V[_«o«e] 5. VPV[_»/>]NP 6. VP —> V[_vp:inf] VPfinf] 7. VP —> VI_np_vp.-OT/] NP VP[infJ 8. VP^VLodjplADJP 9. VP[inf] -^TOVP[base] 10. ADJP —> ADJ 11. ADJP H> AD7[_vp.'m/l VP[inf] Head features for S, VP: VFORM, AGR Head features for NP: AGR Grammar 4.7 A simple grammar in abbreviated form 1. 
(S INV - VFORM ?v{pres past} AGR ?a) -> (NP AGR ?a) (VP VFORM ?v {prespast} AGR ?a) 2 (NP AGR ?a) -> (ART AGR ?a) (NAGR ?a) 3. (NP AGR ?a) h> (PRO AGR ?a) 4. (VP AGR ?a VFORM ?v) (V SUBCAT _none AGR ?a VFORM ?v) 5. (VP AGR ?a VFORM ?v) -H> (V SUBCAT _np AGR ?a VFORM ?v) NP 6. (VP AGR ?a VFORM ?v) -4 (VSUBCAT_vp:inf AGR ?a VFORM ?v) (VP VFORM inf) 7. (VP AGR ?a VFORM ?v) -h> (VSUBCAT_np_vp:inf AGR ?a VFORM ?v)NP(VP VFORM inf) 8. (VP AGR ?a VFORM ?v) (V SUBCAT _adjp AGR ?a VFORM ?v) ADJP 9. (VP SUBCAT inf AGR ?a VFORM inf) -> (TO AGÄ ?a VFORM inf) (VP VFORM base) 10. ADJP ADJ 11. ADJP -h> ADJ (SUBCAT Jnf) (VP VFORM inf) Grammar 4.8 The expanded grammar showing all features introduced. The head features for each category are declared at the bottom of the figure. This grammar is an abbreviation of Grammar 4.8. Consider how rule 1 in Grammar 4.7 abbreviates rule 1 in Grammar 4.8. The abbreviated mle is S [3s] / \ ART [3s] N L3s] V [pres, 3s, _none] the S [3s / \ NP [3s] VP [pres, 3s] PRO [3s] V [pres, 3s, _vp:inf] ^ VP [inf] He TO VP [base / \ V [base, _adjpl ADJP happy Figure 4.9 Two sample parse trees with feature values S[-inv] -> (NP AGR ?a) (VP[{prespast}] AGR ?a) The unique values can be expanded in the obvious way: the value [-inv] becomes (INV -) and the value [{pres past)] becomes (VFORM ?v{pres past)). The head features for S arc AGR and VFORM. so these features must be added to the S and VP head. The resulting rule is (S INV - VFORM ?v{pres past) AGR ?a) -> (NP AGR ?a) (VP VFORM ?v{pres past) AGR ?a) as shown in Grammar 4.8. The abbreviated form is also very useful for summarizing parse trees. For instance, Figure 4.9 shows the parse trees for two of the previous sample sentences, demonstrating that each is an acceptable sentence. 98 CHAPTER 4 Features and Augmented Grammars 99 Consider why each of the ill-formed sentences introduced at the beginning of this section are not accepted by Grammar 4.7. Both *The men cries and *Tiie man cry are not acceptable because the number agreement restriction on rule 1 is not satisfied: The NP constituent for the men has the AGR value 3p, while the VP cries has the AGR value 3s. Thus rule 1 cannot apply. Similarly, the man cry is not accepted by the grammar since the man has AGR 3s and the VP cry has as its AGR value a variable ranging over {Is 2s lp 2p 3p}. The phrase *the man saw to be happy is not accepted because the verb saw has a SUBCAT value _np. Thus only rule 5 could be used to build a VP. But rule 5 requires an NP complement, and it is not possible for the words to be happy to be a legal NP. The phrase He wants is not accepted since the verb wants has a SUBCAT value ranging over {_np _vp:inf _np_vp:inf}, and thus only rules 5, 6, and 7 could apply to build a VP. But all these rules require a nonempty complement of some kind. The phrase *He wants the man saw the dog is not accepted for similar reasons, but this requires a little more analysis. Again, rules 5, 6, and 7 are possible with the verb wants. Rules 5 and 6 will not work, but rule 7 looks close, as it requires an NP and a VP[inf]. The phrase the man gives us the required NP, but saw the dog fails to be a VP[inf]. In particular, saw the man is a legal VP, but its VFORM feature will be past, not inf. 4.5 Parsing with Features The parsing algorithms developed in Chapter 3 for context-free grammars can be extended to handle augmented context-free grammars. This involves generalizing the algorithm for matching rules to constituents. 
For instance, the chart-parsing algorithms developed in Chapter 3 all used an operation for extending active arcs with a new constituent. A constituent X could extend an arc of the form C^Cl ...CioX... Cn to produce a new arc of the form C^ CI ...CiXo ...Cn A similar operation can be used for grammars with features, but the parser may have to instantiate variables in the original arc before it can be extended by X. The key to defining this matching operation precisely is to remember the definition of grammar rules with features. A rule such as 1. (NP AGR ?a) -> o (ART AGR ?a) (N AGR ?a) says that an NP can be constructed out of an ART and an N if all three agree on the AGR feature. It does not place any restrictions on any other features that the NP, ART, or N may have. Thus, when matching constituents against this rule, the only thing that matters is the AGR feature. All other features in the constituent can be ignored. For instance, consider extending arc 1 with the constituent 2. (ART ROOT A AGR 3s) To make arc 1 applicable, the variable ?a must be instantiated to 3s, producing 3. (NP AGR 3s) h> o (ART AGR 3s) (N AGR 3s) This arc can now be extended because every feature in the rule is in constituent 2: 4. (NP AGR 3s) o (ART AGR 3s) o (N AGR 3s) Now, consider extending this arc with the constituent for the word dog: 5. (N ROOT DOG1 AGR 3s) This can be done because the AGR features agree. This completes the arc 6. (NP AGR 3s) ^ (ART AGR 3s) (N AGR 3s) ° This means the parser has found a constituent of the form (NP AGR 3s). This algorithm can be specified more precisely as follows: Given an arc A, where the constituent following the dot is called NEXT, and a new constituent X, which is being used to extend the arc, a Find an instantiation of the variables such that all the features specified in NEXT are found in X. b. Create a new arc A', which is a copy of A except for the instantiations of the variables determined in step (a). c. Update A' as usual in a chart parser. For instance, let A be arc 1, and X be the ART constituent 2. Then NEXT will be (ART AGR ?a). In step a, NEXT is matched against X, and you find that ?a must be instantiated to 3s. In step b, a new copy of A is made, which is shown as arc 3. In step c, the arc is updated to produce the new arc shown as arc 4. When constrained variables, such as ?a{3s 3p}, are involved, the matching proceeds in the same manner, but the variable binding must be one of the listed values. If a variable is used in a constituent, then one of its possible values must match the requirement in the rule. If both the rule and the constituent contain variables, the result is a variable ranging over the intersection of their allowed values. For instance, consider extending arc 1 with the constituent (ART ROOT the AGR ?v{3s 3p}), that is, the word the. To apply, the variable ?a would have to be instantiated to ?v{3s 3p}, producing the rule (NP AGR ?v{3s 3p}) -> (ART AGR ?v{3s 3p}) o (N AGR ?v{3s 3p}) This arc could be extended by (N ROOT dog AGR 3s), because ?v(3s 3p} could be instantiated by the value 3s. The resulting arc would be identical to arc 6. The entry in the chart for the is not changed by this operation. It still has the value ?v{3s 3p}. The AGR feature is restricted to 3s only in the arc. Another extension is useful for recording the structure of the parse. Subconstituent features (1, 2, and so on, depending on which subconstituent is being added) are automatically inserted by the parser each time an arc is extended. 
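The following is a minimal sketch of steps (a)-(c) in Python (not the book's notation). An arc is represented as a rule (a list of feature-structure patterns, with the mother at index 0), a dot position, and a dictionary of variable bindings; variables are written as strings beginning with '?'. Constrained variables, nested values, and the chart bookkeeping are all omitted, and the names match and extend_arc are illustrative. The subconstituent features just mentioned would be recorded at the point where the arc is extended.

def is_var(v):
    return isinstance(v, str) and v.startswith('?')

def match(next_pattern, constituent, bindings):
    """Step (a): find bindings so that every feature in NEXT is found in X."""
    bindings = dict(bindings)
    for feature, value in next_pattern.items():
        if is_var(value):
            value = bindings.get(value, value)     # use an existing binding if any
        actual = constituent.get(feature)
        if is_var(value):
            bindings[value] = actual               # instantiate the variable
        elif actual != value:
            return None                            # feature clash
    return bindings

def extend_arc(arc, constituent):
    """Steps (b) and (c): copy the arc with the new bindings, advance the dot."""
    rule, dot, bindings = arc
    new_bindings = match(rule[dot], constituent, bindings)
    if new_bindings is None:
        return None
    return (rule, dot + 1, new_bindings)

# Rule 1:  (NP AGR ?a) -> (ART AGR ?a) (N AGR ?a), with the dot before the ART
np_rule = [{'CAT': 'NP', 'AGR': '?a'},
           {'CAT': 'ART', 'AGR': '?a'},
           {'CAT': 'N', 'AGR': '?a'}]
arc1 = (np_rule, 1, {})
art1 = {'CAT': 'ART', 'ROOT': 'a', 'AGR': '3s'}

print(extend_arc(arc1, art1))   # the dot moves to 2, with ?a bound to 3s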
The values of these features name subconstituents already in the chart. 100 CHAPTER 4 Features and Augmented Grammars 101 SI CATS AGR 3s VFORM pres IN\ 1 NPl....... ...... 2 VP3 VP3 CAT VP VFORM pres AGR 3s 1 VI . 2 VP2 VP2 CAT VP VFORM inf 1 TOl 2 VP1 NPl CAT NP AGR 3s 1 PROl VP1 CAT VP VFORM base 1 V2 PROl CAT PRO AGR 3s VI CATV ROOT want VFORM pres AGR 3s SUBCAT {_np,_vp:inf, _np_vp:inf) TOl CAT TO V2 CATV ROOT cry VFORM base SUBCAT _none i He wants to cry " i Figure 4.10 The chart for He wants to cry. With this treatment, and assuming that the chart already contains two constituents, ART1 and Nl, for the words the and dog, the constituent added to the chart for the phrase the dog would be (NP AGR 3s 1 ART1 2 Nl) where ART1 = (ART ROOT the AGR {3s 3p}) and Nl = (N ROOT dog AGR {3s}). Note that the AGR feature of ART1 was not changed. Thus it could be used with other interpretations that require the value 3p if they are possible. Any of the chart-parsing algorithms described in Chapter 3 can now be used with an augmented grammar by using these extensions to extend arcs and build constituents. Consider an example. Figure 4.10 contains the final chart produced from parsing the sentence He wants to cry using Grammar 4.8. The rest of this section considers how some of the nonterminal symbols were constructed for the chart. Constituent NPl was constructed by rule 3, repeated here for convenience: 3. (NP AGR ?a) -> (PRO AGR ?a) To match the constituent PROl, the variable ?a must be instantiated to 3s. Thus the new constituent built is NPl: (CATNP AGR 3s lPROl) Next consider constructing constituent VP1 using rule 4, namely 4. (VP AGR ?a VFORM ?v) (V SUBCAT _none AGR ?a VFORM ?v) For the right-hand side to match constituent V2, the variable ?v must be instantiated to base. The AGR feature of V2 is not defined, so it defaults to -. The new constituent is VP1: (CAT VP AGR-VFORM base 1 V2) Generally, default values are not shown in the chart. In a similar way, constituent VP2 is built from TOl and VP1 using rule 9, VP3 is built from VI and VP2 using rule 6, and SI is built from NPl and VP3 using rule 1. o 4.6 Augmented Transition Networks Features can also be added to a recursive transition network to produce an augmented transition network (ATN). Features in an ATN are traditionally called registers. Constituent structures are created by allowing each network to have a set of registers. Each time a new network is pushed, a new set of registers is created. As the network is traversed, these registers are set to values by actions associated with each arc. When the network is popped, the registers are assembled to form a constituent structure, with the CAT slot being the network name. Grammar 4.11 is a simple NP network. The actions are listed in the table below the network. ATNs use a special mechanism to extract the result of following an arc. When a lexical arc, such as arc 1, is followed, the constituent built from the word in the input is put into a special variable named *. The action DET := 1 then assigns this constituent to the DET register. The second action on this arc, AGR := AGR* assigns the AGR register of the network to the value of the AGR register of the new word (the constituent in *). Agreement checks are specified in the tests. A test is an expression that succeeds if it returns a nonempty value and fails if it returns the empty set or nil. 
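As a very rough illustration of the register mechanism (a Python sketch under heavy simplification: it follows only the determiner-noun path through the NP network, and the function parse_np is an illustrative name, not how ATN interpreters are actually organized), consider the following, where word constituents are dictionaries and AGR values are sets:

def parse_np(words):
    """Follow an ART arc and then an N arc, setting registers, and pop an NP."""
    regs = {}
    if len(words) < 2 or words[0]['CAT'] != 'ART':
        return None
    star = words[0]                        # '*' holds the constituent just found
    regs['DET'] = star                     # DET := *
    regs['AGR'] = set(star['AGR'])         # AGR := AGR*
    star = words[1]
    if star['CAT'] != 'N' or not (regs['AGR'] & set(star['AGR'])):
        return None                        # the agreement test on the noun arc fails
    regs['HEAD'] = star                    # HEAD := *
    regs['AGR'] &= set(star['AGR'])        # AGR := intersection of AGR and AGR*
    return dict(CAT='NP', **regs)          # pop: assemble the registers into an NP

the = {'CAT': 'ART', 'ROOT': 'the', 'AGR': {'3s', '3p'}}
dog = {'CAT': 'N', 'ROOT': 'dog', 'AGR': {'3s'}}
print(parse_np([the, dog]))     # an NP whose AGR register has narrowed to {'3s'}

A genuine ATN interpreter would instead walk the network data structure itself, creating a fresh register set on each push and backtracking over alternative arcs.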
102 CHAPTER 4 Features and Augmented Grammars 103 art n pop^^ f NP J 1 ( NPH TnP2] name Arc 1 Test none Actions DET :=-*-- ......-AGR := AGR, 2 AGR nAGR* HEAD := * AGR := AGR n AGR* 3 none . NAME := * AGR := AGR, Grammar 4.11 A simple NP network NP v NP pop c S2 ) 6 f S3 J Arc Test Actions 4 none SUBJ := * 5 AGRSUBJn AGR, MAIN-V := * AGR := AGR suBj n AGR* 6 none OBJ := * j Grammar 4.12 A simple S network If a test fails, its arc is not traversed. The test on arc 2 indicates that the arc can be followed only if the AGR feature of the network has a non-null intersection with the AGR register of the new word (the noun constituent in *). Features on push arcs are treated similarly. The constituent built by traversing the NP network is returned as the value *. Thus in Grammar 4.12, the action on the arc from S to SI, SUBJ := * would assign the constituent returned by the NP network to the register SUBJ. The test on arc 2 will succeed only if the AGR register of the constituent in the SUBJ register has a non-null intersection with the AGR register of the new constituent (the verb). This test enforces subject-verb agreement. Trace of S Network Step Node Position Arc Followed Registers Set 1. S 1 arc 4 succeeds SUBJ <- (NP DET the (for recursive call HEAD dog see trace below) AGR 3s) 5. SI 3 arc 5 (checks if MAIN-V <— saw 3p n 3p) AGR 4- 3p 6. S2 4 arc 6 (for recursive OBJ <- (NP NAME Jack call trace, see below) AGR 3s) y. S3 5 pop arc succeeds returns (S SUBJ(NP DET the HEAD dog AGR 3s) MAIN-V saw AGR 3p OBJ(NP NAME Jack AGR 3s)) Trace of First NP Call: Arc 4 Step Node Position Arc Followed Registers Set 1 NP 1 1 DET _none SUBCAT MAlN-v '"i-.np-.np jump jump Actions SUBJ := * MOOD = DECL MAIN-V := * OBJ :=* SUBCAT := SUBCATmain-v ^ I SUBCAT :=_none IOBJ := OBJ OBJ :=* MODS .- Append(MODS,*) _np _np_np) Grammar 4.14 An S network for assertions word position, the arc that is followed from the node, and the register manipulations that are performed for the successful parse. It starts in the S network but moves immediately to the NP network from the call on arc 4. The NP network checks for number agreement as it accepts the word sequence The dog. It constructs a noun phrase with the AGR feature plural. When the pop arc is followed, it completes arc 4 in the S network. The NP is assigned to the SUBJ register and then checked for agreement with the verb when arc 3 is followed. The NP Jack is accepted in another call to the NP network. An ATN Grammar for Simple Declarative Sentences Here is a more comprehensive example of the use of an ATN to describe some declarative sentences. The allowed sentence structure is an initial NP followed by a main verb, which may then be followed by a maximum of two NPs and many PPs. depending on the verb. Using the feature system extensively, you can create a grammar that accepts any of the preceding complement forms, leaving the actual verb-complement agreement to the feature restrictions. Grammar 4.14 shows the S network. Arcs are numbered using the conventions discussed in Chapter 3. For instance, the arc S3/1 is the arc labeled 1 leaving node S3. The NP network in Grammar 4.15 allows simple names, bare plural nouns, pronouns, and a simple sequence of a determiner followed by an adjective and a head noun. 
Arc NP/1 NP/2 NP/3 NP/4 NP2/1 NP2/2 NP3/1 Test AGR, n3p AGR n AGR, Actions DET := * PRO := * HEAD := * AGR := AGR, NAME :=* ADJS :=Append(ADJS, *) HEAD := * AGR := AGR n AGR, MODS - Append(MODS, *) Grammar 4.15 The NP network Allowable noun complements include an optional number of prepositional phrases. The prepositional phrase network in Grammar 4.16 is straightforward. Examples of parsing sentences with this grammar are left for the exercises. Presetting Registers One further extension to the feature-manipulation facilities in ATNs involves the ability to preset registers in a network as that network is being called, much like parameter passing in a programming language. This facility, called the SENDR action in the original ATN systems, is useful to pass information to the network that aids in analyzing the new constituent. Consider the class of verbs, including want and pray, that accept complements using the infinitive forms of verbs, which are introduced by the word to. According to the classification in Section 4.2, this includes the following: 106 CHAPTER 4 prep NP V - pop ______ V ) 1 { PP2 ] Arc PP/1 PPl/1 Test Action P •=: * POBJ := * Grammar 4.16 The PP network _vp:inf _np_vp:inf Mary wants to have a party. Mary wants John to have a party. In the context-free grammar developed earlier, such complements were treated as VPs with the VFORM value inf. To capture this same analysis in an ATN, you would need to be able to call a network corresponding to VPs but preset the VFORM register in that network to inf. Another common analysis of these constructs is to view the complements as a special form of sentence with an understood subject. In the first case it is Mary who would be the understood subject (that is, the host), while in the other case it is John. To capture this analysis, many ATN grammars preset the SUBJ register in the new S network when it is called. o 4.7 Definite Clause Grammars You can augment a logic grammar by adding extra arguments to each predicate to encode features. As a very simple example, you could modify the PROLOG rules to enforce number agreement by adding an extra argument for the number on every predicate for which the number feature is relevant. Thus you would have rules such as those shown in Grammar 4.17. Consider parsing the noun phrase the dog cried, which would be captured by the assertions word(the, 1,2):-word(dog, 2, 3) :- With these axioms, when the word the is parsed by rule 2 in Grammar 4.17, the number feature for the word is returned. You can see this in the following trace of the simple proof of np(1, Number, 3) Using rule I, you have the following subgoals: Features and Augmented Grammars 107 1. 
np(P1, Number, P3) :-art(P1, Number, P2), n(2, Number, P3) art(l, Number, O) isart(a, 3s) :-isart(the, 3s) :-isart(the, 3p) :-n(l, Number, O) :-isnoun(dog, 3s) :-isnoun(dogs, 3p) -word(Word, I, O), isart(Word, Number) word(Word, I, O), isnounfWord, Number) Grammar 4.17 s.(P1, Number, s(Np, Vp), P3) :- np(P1, Number, Np, P2), vp(P2, Number, Vp, P3) np(P1, Number, np(Art, N), P3) :- art(P1, Number!, Art, P2), n(P2, Number2, N, P3) vp(P1, Number, vp(Verb), P2) :- v(P1, Verb, P2) art(l, Number, art(Word), O) :- word(Word, I, O), isart(Word, Number) n(l, Number, n(Word), O) :- word(Word, I, O), isnoun(Word, Number) v(l, Number, v(Word), O) :- word(Word, I, O), isverb(Word, Number) Grammar 4.18 art(1, Number, P2) n(P2, Number, 3) The first subgoal succeeds by using rule 2 and proving word(the, 1, 2) isart(the, 3s) which binds the variables in rule 2 as follows: Number <- 3s P2^2 Thus, the second subgoal now is n(2, 3s, 3) Using rule 6, this reduces to the subgoals word(Word, 2, 3) and isnoun(Word, 3s), which are established by the input and rule 7, respectively, with Word bound to dog. Thus the parse succeeds and the number agreement was enforced. The grammar can also be extended to record the structure of the parse by adding another argument to the rules. For example, to construct a parse tree, you could use the rules shown in Grammar 4.18. These rules would allow you to prove the following on the sentence The dog cried: 108 CHAPTER 4 Features and Augmented Grammars 109 Figure 4.19 A tree representation of the structure S(1, 3s, s(np(art(the), n(dog)), vp(v(cried))), 4) In other words, between positions 1 and 4 there is a sentence with number feature 3s and the structure s(np(art(the), n(dog)), vp(v(cried))) which is a representation of the parse tree shown in Figure 4.19. For specifying grammars, most logic-based grammar systems provide a more convenient format that is automatically converted into PROLOG clauses like those in Grammar 4.18. Since the word position arguments are on every predicate, they can be omitted and inserted by the system. Similarly, since all the predicates representing terminal symbols (for example, art, N, V) are defined systematically, the system can generate these rules automatically. These abbreviations are encoded in a format called definite clause grammars (DCGs). For example, you could write Grammar 4.18 by specifying just three DCG rules. (The symbol -> is traditionally used to signal DCG rules.) s(Number, s(Np, Vp)) -» np(N1, Np), vp(N2, Vp) np(Number, np(Art, N)) -> art(N1, Art), n(N2, N) vp(Number, vp(Verb)) -» v(Number, Verb) The result is a formalism similar to the augmented context-free grammars described in Section 4.3. This similarity can be made even closer if a feature/value representation is used to represent structure. For instance, let us replace all the argument positions with a single argument that is a feature structure. The DCG equivalent to the first five rules in Grammar 4.7 would be as shown in Grammar 4.20, where square parentheses indicate lists in PROLOG. Notice that rule 1 will not succeed unless the S, NP, and VP all agree on the agr feature, because they have the same variable as a value. Of course, this encoding will only work if every constituent specifies its feature values in the same order so that the feature lists unify. For example, if the 1. s([inv - agr Agr]) -» np([agr Agr]), vp([agr Agr vform pres]) 2. np([agr Agr2]) -> art([agr Agr2]), n([agr Agr2]) . 3. np([agr Agr3]) -> pro([agr Agr3]) 4. 
vp([agr Agr4 vform Vf3]) -» v([subcat _none agr Agr4 vform Vf3]) 5. vp([agr Agr4 vform Vf3]) -> v([subcat _np agr Agr4 vform Vf3]) np() Grammar 4.20 A DCG version of Grammar 4.7 S features consist of inv, agr, inv, and vform, every rule using an S must specify all these features in the same order. In a realistic grammar this would be awkward, because there will be many possible features for each constituent, and each one must be listed in every occurrence. One way to solve this problem is to extend the program that converts DCG rules to PROLOG rules so that it adds in any unspecified features with variable values. This way the user need only specify the features that are important. 4.8 Generalized Feature Systems and Unification Grammars You have seen that feature structures are very useful for generalizing the notion of context-free grammars and transition networks. In fact, feature structures can be generalized to the extent that they make the context-free grammar unnecessary. The entire grammar can be specified as a set of constraints between feature structures. Such systems are often called unification grammars. This section provides an introduction to the basic issues underlying such formalisms. The key concept of a unification grammar is the extension relation between two feature structures. A feature structure Fl extends, or is more specific than, a feature structure F2 if every feature value in Fl is specified in F2. For example, the feature structure (CATV ROOT cry) extends the feature structure (CAT V), as its CAT value is V as required, and the ROOT feature is unconstrained in the latter feature structure. On the other hand, neither of the feature structures (CATV ROOT cry) (CATV VFORM pres) extend the other, because both lack information required by the other. In particular, the first lacks the VFORM feature required by the second, and the second lacks the ROOT feature required by the first. 110 CHAPTER 4 Features and Augmented Grammars 111 Two feature structures unify if there is a feature structure that is an extension of both. The most general unifier is the rninimal feature structure that is an extension of both. The most general unifier of the above two feature structures is (CATV ROOT cry VFORM pres) Note that this is an extension of both original feature structures, and there is no smaller feature structure that is an extension of both. In contrast, the structures (CATV AGR 3s) (CATV AGR 3p) do not unify. There can be no FS that is an extension of both because they specify contradictory AGR feature values. This formalism can be extended to allow simple disjunctive values (for example, {3s 3p}) in the natural way. For example, (AGR 3s) extends (AGR {3s 3p}). The notion of unification is all that is needed to specify a grammar, as all feature agreement checks and manipulations can be specified in terms of unification relationships. A rule such as S -> NP VP in Grammar 4.7 could be expressed in a unification grammar as X0 XI X2 CATo = S CAT!= NP CAT2 =VP AGRq = AGR, = AGR2 VFORM o = VFORM2 This says that a constituent X0 can be constructed out of a sequence of constituents XI and X2 if the CAT of X0 is S, the CAT of XI is NP, the CAT of X2 is VP, the AGR values of all three constituents are identical, and the VFORM values of constituents X0 and X2 are identical. If the CAT value is always specified, such rules can be abbreviated as follows: S -» NP VP AGR= AGRi = AGR2 VFORM = VFORM 2 where the CAT values are used in the rule. Also, the 0 subscript is omitted. 
Using these abbreviations, a subpart of Grammar 4.7 is rewritten as the unification grammar in Grammar 4.21. Since such grammars retain the structure of context-free rules, the standard parsing algorithms can be used on unification grammars. The next section shows how to interpret the feature equations to build the constituents. t V. S^NPVP 2. NP —> ARTN 8'. VP -» V ADJP 10'. ADJP -» ADJ AGR = AGR, = AGR2 VFORM = VFORM 2 AGR = AGR, = AGR2 SUBCAT, =_adjp VFORM = VFORM i AGR = AGR, Grammar 4.21 A unification grammar Figure 4.22 Two noun phrase DAGs Formal Development: Feature Structures as DAGs The unification-based formalism can be defined precisely by representing feature structures as directed acyclic graphs (DAGs). Each constituent and value is represented as a node, and the features are represented as labeled arcs. Representations of the following two constituents are shown in Figure 4.22: Nl: (CATN ROOT fish AGR{3s3p}) N2: (CATN AGR 3s) The sources of a DAG are the nodes that have no incoming edges. Feature structure DAGs have a unique source node, called the root node. The DAG is said to be rooted by this node. The sinks of a DAG are nodes with no outgoing edges. The sinks of feature structures are labeled with an atomic feature or set of features (for example, {3s 3p}). The unification of two feature structures is defined in terms of a graph-matching algorithm. This takes two rooted graphs and returns a new graph that is 112 CHAPTER 4 Features and Augmented Grammars 113 To unify a DAG rooted at node n; with a DAG rooted at node Nj: 1. If N; equals Nj, then return Nj and succeed. 2 If both Nj andNj are sink nodes, then if their labels have a non-null intersection, return a new node with the intersection as its label. Otherwise, the DAGs do not unify. 3. If N; and Nj are not sinks, then create a new node N. For each arc labeled F leaving Nj to node NFj, 3a. If there is an arc labeled F leaving Nj to node NFj, then recursively unify NF; and NFj. Build an arc labeled F from N to the result of the recursive call. 3b. If there is no arc labeled F from Nj, build an arc labeled F from N to NF (. 3c. For each arc labeled F from Nj to node NFj where there is no F arc leaving Nj, create a new arc labeled F from N to NFj. Figure 4.23 The graph unification algorithm the unification of the two. The algorithm is shown in Figure 4.23. The result of applying this algorithm to nodes Nl and N2 in Figure 4.22 is the new constituent N3: (CATN ROOT fish AGR 3s) You should trace the algorithm on this example by hand to see how it works. This is a very simple case, as there is only one level of recursion. The initial call with the nodes Nl and N2 is handled in step 3, and each recursive call simply involves matching of sink nodes in step 2. With this algorithm in hand, the algorithm for constructing a new constituent using the graph unification equations can be described. Once this is developed, you can build a parser using any standard parsing algorithm. The algorithm to build a new constituent of category C using a rule with feature equations of form F; = V, where F, indicates the F feature of the i'th subconstituent, is shown as Figure 4.24. Consider an example. Assume the following two constituents shown in Figure 4.25 are defined already. In LISP notation, they are ART1: (CAT ART ROOT the AGR {3s 3p}) Nl: (CATN ROOT fish AGR{3s3p}) The new NP will be built using rule 2' in Grammar 4.21. 
The equations are CAT0 = NP CATi = ART CAT2=N AGR = AGRi : AGR2 The algorithm produces the constituent represented by the DAG in Figure 4.26. Given a rule XO -»XI ... Xn and set of feature equations of form Fj = V, where SCI,SCn are the subconstituents corresponding to XI,Xn, this algorithm builds a DAG that satisfies all the feature equations. Create a node CCO to be the root of the new feature structure. Make a copy of each DAG rooted by SC; (call the new root of each CC;), and add an arc labeled i from CCO to each CCj. For each feature equation of form F_ = V, where V is a value, follow the F link from node CCi to a node Ni, and unify Ni with V. For each feature equation (of form F_ = Gj), If there is an F link from CCj, and a G link from CCj, then i. follow the F link to node Nj and the G link to Nj; ii. unify N; and Nj, using the graph unification algorithm, to create new node X; hi. change all arcs pointing to either Nj or Nj to point to X; If there is no F link from CCj, but there is a G link from CCj to node Nj, create an F link from CC j to Nj; 4c. If there is no G link from CC:, but there is an F link from CCj to Nj, create a G 4a 4b. link from CCj to N Figure 4.24 An algorithm to build a new constituent using feature equations Figure 4,25 Lexical entries for the and fish Figure 4.26 The graph for the NP the fish 114 CHAPTER 4 Figure 4.27 The analysis of the VP is happy Continuing the example, assume that the VP is happy is analyzed similarly and is represented as in Figure 4.27. To simplify the graph, the CAT arcs are not shown. Figure 4.28 shows the analysis of The fish is happy constructed from rule 1 in Grammar 4.21. The value of the AGR slot is now the same node for SI, NP1, ART1, Nl, VP1, and VI. Thus the value of the AGR feature of CC1, for instance, changed when the AGR features of NP1 and VP1 were unified. So far, you have seen unification grammars used only to mimic the behavior of the augmented context-free grammars developed earlier. But the unification framework is considerably richer because there is no requirement that rules be based on syntactic categories. For example, there is a class of phrases in English called predicative phrases, which can occur in sentences of the form NPbe_ This includes prepositional phrases (He is in the house), noun phrases (He is a traitor), and adjective phrases (He is happy). Grammars often include a binary feature, say PRED, that is true of phrases that can be used in this way. In a standard CFG you would need to specify a different rule for each category, as in VP -» (V ROOT be) (NP PRED +) VP -» (V ROOT be) (PP PRED+) VP -> (V ROOT be) (ADJP PRED +) With the unification grammar framework, one rule handles all the categories, namely XO -> XI X2 CAT0 = VP CAT, = V ROOTi = be PRED2 = + in which any constituent X2 with the +PRED feature is allowed. Features and Augmented Grammars 115 : Figure 4.28 An analysis of The fish is happy. Of course, if the categories of constituents are not specified, it is not so clear how to adapt the standard CFG parsing algorithms to unification grammar. It can be shown that as long as there is a finite subset of features such that at least one of the features is specified in every constituent, the unification grammar can be converted into a context-free grammar. This set of features is sometimes called the context-free backbone of the grammar. Another powerful feature of unification grammars is that it allows much more information to be encoded in the lexicon. 
In fact, almost the entire grammar can be encoded in the lexicon, leaving very few rules in the grammar. Consider all the rules that were introduced earlier to deal with different verb subcate-gorizations. These could be condensed to a few rules: one for verbs that sub-categorize for one subconstituent, another for verbs that subcategorize for two, and so on. The actual category restrictions on the complement structure would be encoded in the lexicon entry for the verb. For example, the verb put might have a lexicon entry as follows: put: (CAT V SUBCAT (FIRST (CAT NP) SECOND (CAT PP LOC +))) 116 CHAPTER 4 Features and Augmented Grammars 117 The general rule for verbs that subcategorize for two constituents would be VP -> V X2 X3 2 = FIRSTsubcAT i 3=SECONDSUBCATi This rule would accept any sequence V X2 X3, in which X2 unifies with the FIRST feature of the SUBCAT feature of XI and X3 unifies with the SECOND feature of the SUBCAT feature of XI. If V is the verb put, this rule would require X2 to unify with (CAT NP) and X3 to unify with (CAT PP LOC +), as desired. Of course, this same rule would put completely different constraints on X2 and X3 for a verb like want, which would require X3 to unify with (CAT VP VFORM inf). Such techniques can dramatically reduce the size of the grammar. For instance, a reasonable grammar would require at least 40 rules to handle verb subcategorizations. But no rule involves more than three subconstituents in the verb complement. Thus these 40 rules could be reduced to four in the unification grammar, including a rule for null complements. In addition, these same rules could be reused to handle all nouns and adjective subcategorizations, if you generalize the category restrictions on XI. Summary This chapter has extended the grammatical formalisms introduced in Chapter 3 by adding features to each constituent and augmenting the grammatical rules. Features are essential to enable the construction of wide-coverage grammars of natural language. A useful set of features for English was defined, and morphological analysis techniques for mapping words to feature structures (that is, constituents) were developed. Several different forms of augmentation were examined. The first allowed context-free rules to specify feature agreement restrictions in addition to the basic category, producing augmented context-free grammars. The standard parsing algorithms described in Chapter 3 can be extended to handle such grammars. The second technique produced augmented transition networks, in which different procedural tests and actions could be defined on each arc to manipulate feature structures. The final method was based on the unification of feature structures and was used to develop unification grammars. Related Work and Further Readings Augmented CFGs have been used in computational models since the introduction of attribute grammars by Knuth (1968), who employed them for parsing programming languages. Since then many systems have utilized annotated rules of some form. Many early systems relied on arbitrary LISP code for annotations, although the types of operations commonly implemented were simple feature BOX 4.3 Lexical Functional Grammar A linguistic theory that has been influential in the development of computational formalisms is lexical functional grammar (Kaplan and Bresnan, 1982), usually abbreviated as LFG. LFG can be viewed as a type of unification grammar. A typical LFG rule is'as follows: NP (T SUBJ) = I VP t=4. 
The up arrow (T) indicates the constituent named on the left-hand side of the rule (the S constituent); the down arrow (I) indicates the constituent to which the annotation is attached. Thus this rule is equivalent to the following in the notation: S -» NPVP SUBJ= 1 0=2 Note the unification of the entire S and VP structure, making them the same constituent. Computationally, this can be viewed as an efficient way to transfer all the registers from the VP to the S, but it has linguistic implications as well, as discussed in Kaplan and Bresnan (1982). The effect is that any further modification to the S or VP structure will affect both, since they are now the same constituent. LFGs encode most of their information in the lexicon. In particular, lexical entries may indicate which slots they will fill in the constituent that contains them. For example, the entries for a, the, and bird might be as follows: the bird ART ART N (T SPEC) = INDEF (t AGR) = (3s) (T SPEC) = DEF (T AGR) = (3s) (T HEAD) = BIRD The up-arrow annotations actually fill in slots in the NP structure. Using the rule NP -» ART N on the phrase a bird would result in the AGR feature of the NP being set to 3s when the word a is parsed, and then this value is unified with 3s to check number agreement when the noun bird is parsed. With the article the, no number agreement is checked, since the word the does not set the AGR feature of its NP. tests and structure building along the lines of those discussed here. Examples of such systems are Sager (1981) and Robinson (1982). The particular features used in this chapter and the techniques for augmenting rules are loosely based on work in the generalized phrase structure grammar (GPSG) tradition (Gazdar, Klein, Pullum, and Sag, 1985). GPSG introduces a finite set of feature values that can be attached to any grammatical symbol. These play much the same role as the annotations developed in this book except that there are no feature names. Rather, since all values are unique, they define both 118 CHAPTER 4 Features and Augmented Grammars 119 the feature type and the value simultaneously. For example, a plural noun phrase in the third person with gender female would be of the grammatical type NP[PL,3,F] Since the number of feature values is finite, the grammar is formally equivalent to a context-free grammar with a symbol for every combination of categories and features. One of the important contributions of GPSG, however, is the rich structure it imposes on the propagation of features. Rather than using explicit feature equation rules, GPSG relies on general principles of feature propagation that apply to all rules in the grammar. A good example of such a general principle is the head feature convention, which states that all the head features on the parent constituent must be identical to its head constituent. Another general principle enforces agreement restrictions between constituents. Some of the conventions introduced in this chapter to reduce the number of feature equations that must be defined by hand for each rule are motivated by these theoretical claims. An excellent survey of GPSG and LFG is found in Sells (1985). The ATN framework described here is drawn from the work described in Woods (1970; 1973) and Kaplan (1973). A good survey of ATNs is Bates (1978). Logic programming approaches to natural language parsing originated in the early 1970s with work by Colmerauer (1978). 
This approach is perhaps best described in a paper by Pereira and Warren (1980) and in Pereira and Shieber (1987). Other interesting developments can be found in McCord (1980) and Pereira (1981). A good recent example is Alshawi (1992). The discussion of unification grammars is based loosely on work by Kay (1982) and the PATR-II system (Shieber, 1984; 1986). There is a considerable amount of active research in this area. An excellent survey of the area is found in Shieber (1986). There is also a growing body of work on formalizing different forms of feature structures. Good examples are Rounds (1988), Shieber (1992), and Johnson (1991). Exercises for Chapter 4___ 1. (easy) Using Grammar 4.7, draw the complete charts resulting from parsing the two sentences The man cries and He wants to be happy, whose final analyses are shown in Figure 4.9. You may use any parsing algorithm you want, but make sure that you state which one you are using and that the final chart contains every completed constituent built during the parse. 2. (medium) Define the minimal scl of lexicon entries for the following verbs so that, using the morphological analysis algorithm, all standard forms of the verb arc recognized and no illegal forms are inadvertently produced. Discuss any problems that arise and assumptions you make, and suggest modifications to the algorithms presented here if needed. Base Present Forms Past Past-Participle Present-Participle go sing bid go, goes sing, sings bid, bids went sang bid gone sung bidden going singing bidding 3. (medium) Extend the lexicon in Figure 4.6 and Grammar 4.7 so that the following two sentences are accepted: He was sad to see the dog cry. He saw the man saw the wood with the saw. Justify your new rules by showing that they correctly handle a range of similar cases. Either implement and test your extended grammar using the supplied parser, or draw out the full chart that would be constructed for each sentence by a top-down chart parser. 4. (medium) a Write a grammar with features that will successfully allow the following phrases as noun phrases: three o'clock ten minutes to six half past four but will not permit the following: half to eight ten forty-five after six quarter after eight seven thirty-five three twenty o'clock Specify the feature manipulations necessary so that once the parse is completed, two features, HOUR and MINUTES, are set in the NP constituent. If this requires an extension to the feature mechanism, carefully describe the extension you assume. b. Choose two other forms of grammatical phrases accepted by the grammar. Find an acceptable phrase not accepted by your grammar. If any nongrammatical phrases are allowed, give one example. 5. (medium) English pronouns distinguish case. Thus I can be used as a subject, and me can be used as an object. Similarly, there is a difference between he and him, we and us, and they and them. The distinction is not made for the pronoun you. Specify an augmented context-free grammar and lexicon for simple subject-verb-object sentences that allows only appropriate pronouns in the subject and object positions and does number agreement between the subject and verb. Thus it should accept / hit him, but not me love you. Your grammar should account for all the pronouns mentioned in this question, but it need have only one verb entry and need cover no other noun phrases but pronouns. 120 CHAPTER 4 6. 
(medium) Consider the following simple ATN: NP v np pop a Specify some words by category and give four structurally different sentences accepted by this network. b. Specify an augmentation for this network in the notation defined in this chapter so that sentences with the main verb give are allowed only if the subject is animate, and sentences with the main verb be may take either an animate or inanimate subject. Show a lexicon containing a few words that can be used to demonstrate the network's selectivity. (medium) Using the following unification grammar, draw the DAGs for the two NP structures as they are when they are first constructed by the parser, and then give the DAG for the complete sentence (which will include all subconstituents of S as well) The fish is a large one. You may assume the lexicon in Figure 4.6, but define lexical entries for any words not covered there. 1. • NPVP 2. NP -> ART N 3. NP -» ART ADJ N 4. VP->VNP INV=- VFORM2 =pres AGR= AGR] = AGR2 AGR = AGRi = AGR2 AGR = AGR] = AGR3 VFORM = VFORM 1 AGR = AGRi = AGR2 ROOT = BE1 Grammars for Natural Language 5.1 Auxiliary Verbs and Verb Phrases 52 Movement Phenomena in Language 53 Handling Questions in Context-Free Grammars 5.4 Relative Clauses 5.5 The Hold Mechanism in ATNs 5.6 Gap Threading Grammars for Natural Language 123 Augmented context-free grammars provide a powerful formalism for capturing many generalities in natural language. This chapter considers several aspects of the structure of English and examines how feature systems can be used to handle them. Section 5.1 discusses auxiliary verbs and introduces features that capture the ordering and agreement constraints. Section 5.2 then discusses the general class of problems often characterized as movement phenomena. The rest of the chapter examines various approaches to handling movement. Section 5.3 discusses handling yes/no questions and wh-questions, and Section 5.4 discusses relative clauses. The remaining sections discuss alternative approaches. Section 5.5 discusses the use of the hold mechanism in ATNs and Section 5.6 discusses gap threading in logic grammars. 5.1 Auxiliary Verbs and Verb Phrases English sentences typically contain a sequence of auxiliary verbs followed by a main verb, as in the following: I can see the house. I will have seen the house. I was watching the movie. I should have been watching the movie. These may at first appear to be arbitrary sequences of verbs, including have, be, do, can, will, and so on, but in fact there is a rich structure. Consider how the auxiliaries constrain the verb that follows them. In particular, the auxiliary have must be followed by a past participle form (either another auxiliary or the main verb), and the auxiliary be must be followed by a present participle form, or, in the case of passive sentences, by the past participle form. The auxiliary do usually occurs alone but can accept a base form following it (/ did eat my carrots!). Auxiliaries such as can and must must always be followed by a base form. In addition, the first auxiliary (or verb) in the sequence must agree with the subject in simple declarative sentences and be in a finite form (past or present). For example, */ going, *we be gone, and *they am are all unacceptable. This section explores how to capture the structure of auxiliary forms using a combination of new rules and feature restrictions. The principal idea is that auxiliary verbs have subcategorization features that restrict their verb phrase complements. 
To develop this, a clear distinction is made between auxiliary and main verbs. While some auxiliary verbs have many of the properties of regular verbs, it is important to distinguish them. For example, auxiliary verbs can be placed before an adverbial not in a sentence, whereas a main verb cannot: I am not going! You did not try it. - He could not have seen the car. 124 CHAPTER 5 Grammars for Natural Language 125 Auxiliary COMPFORM Construction Example modal base modal can see the house have pastprt perfect have seen the house be ing progressive is lifting the box be pastprt passive was seen by the crowd Figure 5.1 The COMPFORM restrictions for auxiliary verbs In addition, only auxiliary verbs can precede the subject NP in yes/no questions: Did you see the car? Can I try it? *Eat John the pizza? In contrast, main verbs may appear as the sole verb in a sentence, and if made into a yes/no question require the addition of the auxiliary do: I ate the pizza. Did I eat the pizza? The boy climbed in the window. Did the boy climb in the window? I have a pen. Do I have a pen? The primary auxiliaries are based on the root forms be and have. The other auxiliaries are called modal auxiliaries and generally appear only in the finite tense forms (simple present and past). These include the following verbs organized in pairs corresponding roughly to present and past verb forms do (did), can (could), may (might), shall (should), will (would), must, need, and dare. In addition, there are phrases that also serve a modal auxiliary function, such as ought to, used to, and be going to. Notice that have and be can be either an auxiliary or a main verb by these tests. Because they behave quite differently, these words have different lexical entries as auxiliaries and main verbs to allow for different properties; for example, the auxiliary have requires a past-participle verb phrase to follow it, whereas the verb have requires an NP complement. As mentioned earlier, the basic idea for handling auxiliaries is to treat them as verbs that take a VP as a complement. This VP may itself consist of another auxiliary verb and another VP, or be a VP headed by a main verb. Thus, extending Grammar 4.7 with the following rule covers much of the behavior: VP -> (AUX COMPFORM ?s) (VP VFORM ?s) The COMPFORM feature indicates the VFORM of the VP complement. The values of this feature for the auxiliaries are shown in Figure 5.1. There are other restrictions on the auxiliary sequence. In particular, auxiliaries can appear only in the following order: Modal + have+ be (progressive) + be (passive) The song might have been being played as they left. To capture the ordering constraints, it might seem that you need eight special rules: one for each of the four auxiliary positions plus four that make each optional. But it turns out that some restrictions can fall out from the feature restrictions. For instance, since modal auxiliaries do not have participle forms, a modal auxiliary can never follow have or be in the auxiliary sequence. For example, the sentence *He has might see the movie already. violates the subcategorization restriction on have. You might also think that the auxiliary have can never appear in its participle forms, as it must either be first in the sequence (and be finite) or follow a modal and be an infinitive. But this is only true for simple matrix clauses. 
If you consider auxiliary sequences appearing in VP complements for certain verbs, such as regret, the participle forms of have as an auxiliary can be required, as in I regret having been chosen to go. As a result, the formulation based on subcategorization features over-generates. While it accepts any legal sequence of auxiliaries, it would also accept *I must be having been singing. This problem can be solved by adding new features that further restrict the complements of the auxiliary be so that they do not allow additional auxiliaries (except be again for the passive). A binary head feature MAIN could be introduced that is + for any main verb, and - for auxiliary verbs. This way we can restrict the VP complement for be as follows: VP -> AUX[be] VP[ing, +main] The lexical entry for be would then have to be changed so that the original auxiliary rule does not apply. This could be done by setting its COMPFORM feature to -. The only remaining problem is how to allow the passive. There are several possible approaches. We could, for instance, treat the be in the passive construction as a main verb form rather than an auxiliary. Another way would be simply to add another rule allowing a complement in the passive form, using a new binary feature PASS, which is + only if the VP involves passive: VP -» A UX[be ] VP[ing, +pass] The passive rule would then be: VP[+pass] ->AUX[be] VP[pastprt, main] While these new rules capture the ordering constraints well, for the sake of keeping examples short, we will generally use the first rule presented for handling auxiliaries throughout. While it overgenerates, it will not cause any problems for the points that are illustrated by the examples. 126 CHAPTER 5 Grammars for Natural Language 127 can: (cat AUX could: (cat AUX modal + modal + worm pres vform {pres past} agr,{ls2s3slp2p3p} agr {Is 2s 3s lp 2p 3p) compform base) compform base) do: (cat AUX did: (cat AUX modal + modal + worm pres vform past agr{ls2slp2p3p} agr {Is 2s 3s Ip2p3p) compform base) compform base) be: (cat AUX have: (cat AUX worm base worm base root be root have compform ing) compform pastprt) Figure 5.2 Lexicon entries for some auxiliary verbs A lexicon for some of the auxiliary verbs is shown in Figure 5.2. All the irregular forms would need to be defined as well. o Passives The rule for passives in the auxiliary sequence discussed in the previous seclion solves only one of the problems related to the passive construct. Most verbs that include an NP in their complement allow the passive form. This form involves using the normal "object position" NP as the first NP in the sentence and either omitting the NP usually in the subject position or putting it in a PP with the preposition by. For example, the active voice sentences I will hide my hat in the drawer. I hid my hat in the drawer. I was hiding my hat in the drawer. can be rephrased as the following passive voice sentences: My hat will be hidden in the drawer. My hat was hidden in the drawer. My hat was being hidden in the drawer. The complication here is that the VP in the passive construction is missing the object NP. One way to solve this problem would be to add a new grammatical rule for every verb subcategorization that is usable only for passive forms, namely all rules that allow an NP to follow the verb. A program can easily be written that would automatically generate such passive rules given a grammar. A 1. S[-inv] -> (NP agr ?a) (VP \fin] AGR ?a) 1 VP —> (AUX COMPFORM ?v) (VP worm ?v) 3. VP -> AUX[ be ] VPfing, +main] 4. 
VP -> AUX[be] VP[ing, +pass] 5. VP[+pass] -> AUX [be] VPLpastprt, main, +passgap] 6. VP[-passgap, +mainl —> V[_none] 7. VP[-passgap, +main] -> V[_np] NP 8. VP[+passgap, +main] -> V[_np] 9. NP -» (ART agr ?a) (NAGR ?a) 10. NP -> NAME 11. NP->PKO Head features for S, VP: AGR and VFORM. Head features for NP: AGR Figure 5.3 A fragment handling auxiliaries including passives--------- new binary head feature, PASSGAP, is defined that is + only if the constituent is missing the object NP. As usual, this feature would default to - if it is not specified in the left-hand side of a rule. The rule for a simple _np subcategorization in Grammar 4.5 would then be realized as two rules in the new grammar: VP[-passgap] -» V[_np] NP VP[+passgapl -» V[_np] Since the PASSGAP feature defaults to the only way the PASSGAP feature can become + is by the use of passive rules. Similarly, VP rules with lexical heads but with no NP in their complement are -PASSGAP, because they cannot participate in the passive construction. Figure 5.3 shows a fragment of Grammar 4.5 extended to handle auxiliaries and passives. The rules are as developed above except for additional variables in each rule to pass features around appropriately. Figure 5.4 shows the analysis of the active voice sentence Jack can see the dog and the passive sentence Jack was seen. Each analysis uses rule 1 for the S and rule 7 for the VP. The only difference is that the active sentence must use auxiliary rule 2 and NP rule 9, while the passive sentence must use auxiliary rule 5 and NP rule 8. 5.2 Movement Phenomena in Language Many sentence structures appear to be simple variants of other sentence stmc -tures. In some cases, simple words or phrases appear to be locally reordered; ■ sentences are identical except that a phrase apparently is moved from its expected 128 CHAPTER 5 S[-inv] S[-inv] / \ / \ NP VP [pres, -pass] NP VP [past, +pass] 1 / \ | / \ Jack AUX[+modal] VP [base, -passgapl the dog AUX[be] VP [pastprt, +passgapl 1 / \ can V L_np» base] NP was V [_np, pastprt] see the dog seen Figure 5.4 An active and a passive form sentence position in a basic sentence. This section explores techniques for exploiting these generalities to cover questions in English. As a starting example, consider the structure of yes/no questions and how they relate to their assertional counterpart. In particular, consider the following examples: Jack is giving Sue a back rub. Is Jack giving Sue a back rub? He will run in the marathon next year. Will he run in the marathon next year? As you can readily see, yes/no questions appear identical in structure to their assertional counterparts except that the subject NPs and first auxiliaries have swapped positions. If there is no auxiliary in the assertional sentence, an auxiliary of root do, in the appropriate tense, is used: John went to the store. Did John go to the store? Henry goes to school every day. Does Henry go to school every day? Taking a term from linguistics, this rearranging of the subject and the auxiliary is called subject-aux inversion. Informally, you can think of deriving yes/no questions from assertions by moving the constituents in the manner just described. This is an example of local (or bounded) movement. The movement is considered local because the rearranging of the constituents is specified precisely within the scope of a limited number of rules. This contrasts with unbounded movement, which occurs in wh-questions. 
In cases of unbounded movement, constituents may be moved arbitrarily far from their original position. For example, consider the wh-questions that are related to the assertion The fat man will angrily put the book in the corner. If you are interested in who did the action, you might ask one of these questions: Grammars for Natural Language 129 Which man will angrily put the book in the corner? Who will angrily put the book in the corner? On the other hand, if you are interested in how it is done, you might ask one of the following questions: How will the fat man put the book in the corner? In what way will the fat man put the book in the comer? If you are interested in other aspects, you might ask one of these questions: What will the fat man angrily put in the corner? Where will the fat man angrily put the book? In what comer will the fat man angrily put the book? What will the fat man angrily put the book in? Each question has the same form as the original assertion, except that the part being questioned is removed and replaced by a wh-phrase at the beginning of the sentence. In addition, except when the part being queried is the subject NP, the subject and the auxiliary are apparently inverted, as in yes/no questions. This similarity with yes/no questions even holds for sentences without auxiliaries. In both cases, a do auxiliary is inserted: I found a bookcase. Did I find a bookcase? What did I find? Thus you may be able to reuse much of a grammar for yes/no questions for wh-questions. A serious problem remains, however, concerning how to handle the fact that a constituent is missing from someplace later in the sentence. For example, consider the italicized VP in the sentence What will the fat man angrily put in the corner! While this is an acceptable sentence, angrily put in the comer does not appear to be an acceptable VP because you cannot allow sentences such as */ angrily put in the corner. Only in situations like wh-questions can such a VP be allowed, and then it is allowed only if the wh-constituent is of the right form to make a legal VP if it were inserted in the sentence. For example, What will the fat man angrily put in the corner? is acceptable, but '"Where will the fat man angrily put in the corner? is not. If you constructed a special grammar for VPs in wh-questions, you would need a separate grammar for each form of VP and each form of missing constituent. This would create a significant expansion in the size of the grammar. This chapter describes some techniques for handling such phenomena concisely. In general, they all use the same type of approach. The place where a subconstituent is missing is called the gap, and the constituent that is moved, is called the filler. The techniques that follow all involve ways of allowing gaps in constituents when there is an appropriate filler available. Thus the analysis of the 130 CHAPTER 5 Grammars for Natural Language 131 VP in a sentence such as What will the fat man angrily put in the comer? is parsed as though it were angrily put what in the corner, and the VP in the sentence What will the fat man angrily put the book in? is parsed as though the VP were angrily put the book in what. There is further evidence for the correctness of this analysis. In particular, all the well-formedness tests, like subject-verb agreement, the case of pronouns (who vs. whom), and verb transitivity, operate as though the wh-term were actually filling the gap. For example, you already saw how it handled verb transitivity. 
The question What did you put in the cupboard? is acceptable even though put is a transitive verb and thus requires an object. The object is a gap filled by the wh-term, satisfying the transitivity constraint. Furthermore, a sentence where the object is explicitly filled is unacceptable: *What did you put the bottle in the cupboard? This sentence is unacceptable, just as any sentence with two objects for the verb put would be unacceptable. In effect it is equivalent to *You put what the bottle in the cupboard? Thus the standard transitivity tests will work only if you assume the initial wh-term can be used to satisfy the constraint on the standard object NP position. Many linguistic theories have been developed that are based on the intuition that a constituent can be moved from one location to another. As you explore these techniques further, you will see that significant generalizations can be made that greatly simplify the construction of a grammar. Transformational grammar (see Box 5.1) was based on this model. A context-free grammar generated a base sentence; then a set of transformations converted the resulting syntactic tree into a different tree by moving constituents. Augmented transition networks offered a new formalism that captured much of the behavior in a more computationally effective manner. A new structure called the hold list was introduced that allowed a constituent to be saved and used later in the parse by a new arc called the virtual (VIR) arc. This was the predominant computational mechanism for quite some time. In the early 1980s, however, new techniques were developed in linguistics that strongly influenced current computational systems. The first technique was the introduction of slash categories, which are complex nontenninals of the form X/Y and stand for a constituent of type X with a subconstituent of type Y missing. Given a context-free grammar, there is a simple algorithm that can derive new rules for such complex constituents. With this in hand, grammar writers have a convenient notation for expressing the constraints arising from unbounded movement. Unfortunately, the size of the resulting grammar can be significantly larger than the original grammar. A better approach uses the feature system. Constituents are stored in a special feature called GAP and are passed from constituent to subconstituent to allow movement. In this analysis the constituent S/NP is shorthand for an S constituent with the GAP feature NP. the resulting system does not require expanding the size of BOX 5.1 Movement in Linguistics The term movement arose in transformational grammar (TG). TG posited two distinct levels of structural representation: surface structure, which corresponds to the actual sentence structure, and deep structure. A CFG generates the deep structure, and a set of transformations map the deep structure to the surface structure. For example, the deep structure of Will the cat scratch John? would be: NB / \ ART N The AUX will scratch John The yes/no question is then generated from this deep structure by a transformation expressed schematically as follows: subj-aux inversion NP 2 With this transformation the surface form will be S V AUX NP /\ ART N Will the scratch John In early TG (Chomsky, 1965), transformations could do many operations on trees, including moving, adding, and deleting constituents. Besides subj-aux inversion, there were transformational accounts of passives, wh-questions, and embedded sentences. 
The modern descendants of TG do not use transformations in the same way. In government-binding (GB) theory, a single transformation rule. Move-a, allows any constituent to be moved anywhere! The focus is on developing constraints on movement that prohibit the generation of illegal surface structures. the grammar significantly and appears to handle the phenomena quite well. The following sections discuss two different feature-based techniques for context-free grammars and the hold-list technique for ATNs in detail. 132 CHAPTER 5 Grammars for Natural Language 133 BOX 5.2 Different Types of Movement While the discussion in this chapter will concentrate on wh-questions and thus will examine the movement of wh-terms extensively, the techniques discussed are also needed for other forms of movement as well. Here are some of the most common forms of movement discussed in the linguistics literature. For more details, see a textbook on linguistics, such as Baker (1989). wh-movement—move a wh-tenn to the front of the sentence to form a wh-question topicalization—move a constituent to the beginning of the sentence for emphasis, as in I never liked this picture. This picture, I never likedT adverb preposing—move an adverb to the beginning of the sentence, as in I will see you tomorrow. Tomorrow, I will see you. extraposition—move certain NP complements to the sentence final position, as in A book discussing evolution was written. A book was written discussing evolution. As you consider strategies to handle movement, remember that constituents cannot be moved from any arbitrary position to the front to make a question. For example, The man who was holding the two balloons will put the box in the corner. is a well-formed sentence, but you cannot ask the following question, where the gap is indicated by a dash: *What will the man who was holding — put the box in the corner? This is an example of a general constraint on movement out of relative clauses. 5.3 Handling Questions in Context-Free Grammars The goal is to extend a context-free grammar minimally so that it can handle questions. In other words, you want to reuse as much of the original grammar as possible. For yes/no questions, this is easily done. You can extend Grammar 4.7 with one rule that allows an auxiliary before the first NP and handles most examples: S [+inv] -> (AUXAGR ?a SUBCAT ?v) (NP AGR ?a) (VP VFORM ?v) This enforces subject-verb agreement between the AUX and the subject NP. and ensures thai Ihc VP has the right VFORM to follow the AUX. This one rule is all that is needed to handle yes/no questions, and all of the original grammar for assertions can be used directly for yes/no questions. As mentioned in Section 5.2, a special feature GAP is introduced to handle wh-questions. This feature is passed from mother to subconstituent until the appropriate place for the gap is found in the sentence. At that place, the appropriate constituent can be constructed using no input. This can be done by introducing additional rales with empty right-hand sides. For instance, you might have a rule such as (NP GAP ((CAT NP) (AGR ?a)) AGR ?a) -4 e which builds an NP from no input if the NP sought has a GAP feature that is set to an NP. Furthermore, the AGR feature of this empty NP is set to the AGR feature of the feature that is the value of the GAP. Note that the GAP feature in the mother has another feature structure as its value. 
To help you identify the structure of such complex values, a smaller font size is used for feature structures acting as values. You could now write a grammar that passed the GAP feature appropriately. This would be tedious, however, and would also not take advantage of some generalities that seem to hold for the propagation of the GAP feature. Specifically, there seem to be two general ways in which the GAP feature propagates, depending on whether the head constituent is a lexical or nonlexical category. If it is a nonlexical category, the GAP feature is passed from the mother to the head and not to any other subconstituents. For example, a typical S rule with the GAP feature would be (S GAP ?g) -» (NP GAP -) (VP GAP ?g) The GAP can be in the VP, the head subconstituent, but not in the NP subject. For rules with lexical heads, the gap may move to any one of the nonlexical subconstituents. For example, the rule for verbs with a _np_vp:inf complement, VP -> V[_np_vp:inf\ NP PP would result in two rules involving gaps: (VP GAP ?g) -> V[_np_vp:inf] (NP GAP ?g) (PP GAP -) (VP GAP ?g) -4 V[_np_vp:inf] (NP GAP -) (PP GAP ?g) In other words, the GAP may be in the NP or in the PP, but not both. Setting the GAP feature in all but one subconstituent to - guarantees that a gap can be used in only one place. An algorithm for automatically adding GAP features to a grammar is shown in Figure 5.5. Note that it does not modify any rule that explicitly sets the GAP feature already, allowing the grammar designer to introduce rules that do not follow the conventions encoded in the algorithm. In particular, the rule for subject-aux inversion cannot allow the gap to propagate to the subject NP. 134 CHAPTER 5 Grammars For Natural Language 135 For each rule Y—> X[ ... H;... Xn with head constituent 1. If the rule specifies a GAP feature in some constituent already, then skip. 2. If the head Hj is not a lexical category, then add a GAP feature to the head and the mother, and -GAP to the other subconstituents, producing a rule of form: (Y GAP ?g) -> (X: GAP-)... (Hi GAP ?g)... (X„ GAP -) 3. If the head Ht is a lexical category, then for each nonlexical subconstituent Xj, add a rule of the form: (Y GAP ?g) -»(X[ GAP-)... (Xj GAP ?g)... (X„ GAP -) Figure 5.5 An algorithm for adding GAP features to a grammar Using this procedure, a new grammar can be created mat handles gaps. All that is left to do is analyze where the fillers for the gaps come from. In wh-questions, the fillers are typically NPs or PPs at the start of the sentence and are identified by a new feature WH that identifies a class of phrases that can introduce questions. The WH feature is signaled by words such as who, what, when, where, why, and how (as in how many and how carefully). These words fall into several different grammatical categories, as can be seen by considering what type of phrases they replace. In particular, who, whom, and what can appear as pronouns and can be used to specify simple NPs: Who ate the pizza? What did you put the book in? The words what and which may appear as determiners in noun phrases, as in What book did he steal? Words such as where and when appear as prepositional phrases: Where did you put the book? The word how acts as an adverbial modifier to adjective and adverbial phrases: How quickly did he run? Finally, the word whose acts as a possessive pronoun: Whose book did you find? The wh-words also can act in different roles. 
All of the previous examples show their use to introduce wh-questions, which will be indicated by the WH feature value Q. A subset of them can also be used to introduce relative clauses, which will be discussed in Section 5.4. These will also have the WH feature value R. To make the WH feature act appropriately, a grammar should satisfy the following constraint: If a phrase contains a subphrase that has the WH feature, then the larger phrase also has the same WH feature. This way, complex phrases what: (CAT PRO when: (CAT PP-WRD WH Q WH {QR} AGR{3s3p.}) PFORM TIME) what: (CAT QDET who: (CAT PRO WHQ WH {QR} AGR {3s 3p}) AGR {3s.3p}) which: (CAT QDET where: (CAT PP-WRD WHQ WH {QR} AGR{3s3p}) PFORM {LOC MOT}) which: (CAT PRO whose: (CAT PRO WHR WH {QR} AGR {3s 3p)) POSS + AGR {3s 3p}) Figure 5.6 A lexicon for some of the wh-words 1. (NP POSS ?p WH ?w) -> (PRO POSS ?p WH ?w) 2. (NP WH ?w) -» (DET WH ?w AGR la) (CNP AGR ?a) 3. CNP —> N 4. CNP -> ADJ N 5. DET -»ART 6. (DET WH ?w) -> (NP[+POSS] WH ?a) 7. (DET WH ?w) -> (QDET WH ?w) 8. (PPWH?w) ->P(NPWH?w) 9. (PP WH ?w) -» (PP-WRD WH ?w) Head feature for NP, DET and CNP: AGR Head feature for PP: PFORM Grammar 5.7 A simple NP and PP grammar handling wh-words containing a subconstituent with a WH word also can be used in questions. For example, the sentence In what store did you buy the picture? is a legal question since the initial PP in what store has the WH value Q because the NP what store has the WH value Q (because the determiner what has the WH value Q). With this constraint, the final S constituent will also have the WH value Q, which will be used to indicate that the sentence is a wh-question. Figure 5.6 gives some lexical entries for some wh-words. Note the introduction of new lexical categories: QDET for determiners that introduce wh-terms, and PP-WRD for words like when that act like prepositional phrases. Grammar 5.7 gives some rules for NPs and PPs that handle the WH feature appropriately. 136 CHAPTER 5 Grammars for Natural Language 137 10. 11. 12. 13. 14. 15. 16. 17. 18. '9-20. (S[-inv] WH ?w) ^ (NP WH?w agr ?a) (VP[/w] AGR?a) (S[+inv] WH?w gap ?g) —> (AUX. COMPFORM ?s AGR ?a) (NP WH ?w agr ?a gap -) (VP vform ?s gap ?g) S —> (NP[Q,-gap] agr ?a) (S[+mv] G4P (M> AGR ?a)) s'_» (PP[Q-gap] pform ?p) (S[-Hro>] G4P (PPPFORM ?p)) VP -> (AUX COMPFORM ?s) (VP vform ?s) VP H> V[_none] VP->V[_np]NP VP >1* r/>.w/] \T|mf. VP Vl_np_vp:inf] NP VP[inf] VP[ inP > TO VP[base] VP -4 V[_np_pp:loc] NP PP[loc] Head features for S, VP: VFORM, AGR Grammar 5.8 The unexpanded S grammar for wh-questions With the WH feature defined and a mechanism for modifying a grammar to handle the GAP feature, most wh-questions can be handled by adding only two new rules to the grammar. Here are the rules for NP- and PP-based questions: S ^ (NP[Q,-gap] agr ?a) (S[+inv] gap (NP agr ?a)) S _» (PP[Q,-gap] pform ?p) (S[+inv] gap (PPPFORM ?p)) Both of these rules set the value of the GAP feature to a copy of the initial WH constituent so that it will be used to fill a gap in the S constituent. Since die S constituent must have the feature +INV, it must also contain a subject-aux inversion. Given the NP and PP rules in Grammar 5.7, Grammar 5.8 provides the S and VP rules to handle questions. Grammar 5.9 shows all the derived rules introduced by the procedure that adds the GAP features. Rules 11, 12, and 13 are not changed by the algorithm as they already specify the GAP feature. Parsing with Gaps A grammar that allows gaps creates some new complications for parsing algorithms. 
Parsing with Gaps

A grammar that allows gaps creates some new complications for parsing algorithms. In particular, rules that have an empty right-hand side, such as

(NP AGR ?a GAP (NP AGR ?a)) -> e

may cause problems because they allow an empty NP constituent anywhere. In a bottom-up strategy, for instance, this rule could apply at any position (or many times at any position) to create NPs that use no input. A top-down strategy fares better, in that the rule would only be used when a gap is explicitly predicted. Instead of using such rules, however, the arc extension algorithm can be modified to handle gaps automatically. This technique works with any parsing strategy.

Consider what extensions are necessary. When a constituent has a GAP feature that matches the constituent itself, it must be realized by the empty constituent. This means that an arc whose next constituent is a gap can be extended immediately. For example, consider what happens if the following arc is suggested by the parser:

(VP GAP (NP AGR 3s)) -> V[_np_pp:loc] o (NP GAP (NP AGR 3s)) PP[LOC]

The next constituent needed is an NP, but it also has a GAP feature that must be an NP. Thus the constituent must be empty. The parser inserts the constituent

(NP AGR 3s EMPTY +)

BOX 5.3 The Movement Constraints

In linguistics the principles that govern where gaps may occur are called island constraints. The term draws on the metaphor of constituent movement. An island is a constituent from which no subconstituent can move out (just as a person cannot walk off an island). Here are a few of the constraints that have been proposed.

The A over A Constraint—No constituent of category A can be moved out of a constituent of type A. This means you cannot have an NP gap within an NP, a PP gap within a PP, and so on, and provides justification for not allowing non-null constituents of the form NP/NP, PP/PP, and so on. This disallows sentences such as

*What book_1 did you meet the author of —_1?

Complex-NP Constraint—No constituent may be moved out of a relative clause or noun complement. This constraint disallows sentences like

*To whom_1 did the man who gave the book —_1 laughed?

where the PP to whom would have been part of the relative clause who gave the book to whom (as in The man who gave the book to John laughed).

Sentential Subject Constraint—No constituent can be moved out of a constituent serving as the subject of a sentence. This overlaps with the other constraints when the subject is an NP, but non-NP subjects are possible as well, as in the sentence For me to learn these constraints is impossible. This constraint eliminates the possibility of a question like

*What_1 is for me to learn —_1 impossible?
Wh-Island Constraint—No constituent can be moved from an embedded sentence with a wh-complementizer. For example, while Did they wonder whether I took the book? is an acceptable sentence, you cannot ask *What) did they wonder whether 1 took — j ? Coordinate Structure Constraint—A constituent cannot be moved out of a coordinate structure. For example, while Did you see John and Sam? is an acceptable sentence, you cannot ask *Who £ did you see John and — i ? Note that these constraints apply to all forms of movement, not just wh-questions. For example, they constrain topicalization and adverb preposing as well. to the chart, which can then extend the arc to produce the new arc (VP GAP (NP AGR 3s)) -» V[_np j>p:loc] (NP GAP (NP AGR 3s)) o PP[LOC] A more precise specification of the algorithm is given in Figure 5.10. Whenever an arc of the form X ... o (C Fl VI ... Fn Vn GAP (C Gl ?vgl ... Gm ?vgm))... is suggested by the parser, and the constituent pattern that is the GAP feature, that is, (C Gl ?vgl ... Gm ?vgm) matches the constituent itself (CF1 VI ... FnVn GAP (CGI VG1 ... GmVGm)) then add a new constituent (C Gl ?vgl... Gm ?vgm EMPTY +), with the variables bound as necessary, to the chart. Use this constituent to extend the original arc. Figure 5.10 The algorithm to insert empty constituents as needed Consider parsing Which dogs did he see? using the bottom-up strategy. Only the rules that contribute to the final analysis will be considered. Rule 7 in Grammar 5.7 applies to the constituent QDET1 (which) to build a constituent DET1, and rule 3 applies to the constituent Nl (dogs) to build a constituent CMP1, which then is combined with DET1 by rule 2 to build an NP constituent of the form NP1:(NP AGR 3p WHQ 1 QDET1 2 CNP1) This NP introduces an arc based on rule 12 in Grammar 5.8. The next word, did, is an AUX constituent, which introduces an arc based on rule 11. The next word, he, is a PRO, which creates an NP, NP2, using rule 1, and extends the arc based on rule 1. The chart at this stage is shown in Figure 5.11. The word see introduces a verb VI, which can extend rule 16. This adds the arc labeled (VP GAP ?g) -> VLnp] ° (NP GAP ?g) Since the GAP value can match the required NP (because it is unconstrained), an empty NP is inserted in the chart with the form EMPTY-NP1: (NP AGR ?a GAP (NP AGR ?a)) EMPTY +) This constituent can then be used to extend the VP arc, producing the constituent VP1: (VP VFORMinf GAP (NP AGR ?a) 1V1 2 EMPTY-NP 1) Now VP1 can extend the arc based on rule 11 to form the S structure: 140 CHAPTER 5 Grammars for Natural Language 141 NPl WHQ AGR 3p 1 DETl 2 CNPl DETl CNPl NP2 WHQ AGR 3p AGR 3s AGR 3p 1 Nl 1 PROl 1 QDET1 QDET1 WHQ AGR 3p Nl AGR3p AUX1 AGR 3s VFORM past SUBCAT base PROl AGR 3s Which dogs did he S -> NP[Q] o (S GAP (NP AGR 3p» ■- (S GAP ?g) -» AUX NP o (VP GAP ^) Figure 5.11 The chart after the word he SI: (S GAP (NPAGR ?a) INV + 1 AUX1 2NP2 3VP1) Finally, NPl and SI can combine using rule 12 to produce an S constituent. The final chart is shown in Figure 5.12. Thus, by using the WH and GAP features appropriately and by extending the parser to insert empty constituents as needed to fill gaps, the original grammar for declarative assertions can be extended to handle yes/no questions and wh-questions with only three new rules. The success of this approach depends on getting the right constraints on the propagation of the GAP feature. For instance, in a rule a gap should only be passed from the mother to exactly one of the subconstituents. 
If this constraint were not enforced, many illegal sentences would be allowed. There are some cases where this constraint appears to be violated. They are discussed in Exercise 6. With a bottom-up parsing strategy, many empty constituents are inserted by the parser, and many constants with gaps are added to the chart. Most of these are not used in any valid analysis of the sentence. The top-down strategy greatly reduces the number of empty constants proposed and results in a considerably smaller chart. S2 VFORM past 1 NPl 2 SI NPl WHQ AGR 3p 1 DETl 2 CNPl SI INV+ GAP (NP AGR 3p) VFORM past 1 AUX1 2 NP2 3 VP1 DETl WHQ . AGR 3p 1 QDETt CNPl AGR 3p 1 Nl NP2 AGR 3s 1 PROl VP1 VFORM inf GAP (NP AGR 3p) 1 VI 2 EMPTY-NP1 QDET1 WHQ AGR 3p Nl AGR 3p AUX1 AGR 3s . VFORM past SUBCAT base PROl AGR 3s Which dogs did he see Figure 5.12 The final chart for Which dogs did he see ? 5.4 Relative Clauses This section examines the structure of relative clauses. A relative clause can be introduced with a rule using a new category, REL: CNP -> CNP REL A large class of relative clauses can be handled using the existing grammar for questions as they involve similar constructs of a wh-word followed by an S structure with a gap. The main difference is that only certain wh-words are allowed (who, whom, which, when, where, and whose) and the S structure is not inverted. The special class of wh-words for relative clauses is indicated using the WH feature value R. Figure 5.6 showed lexicon entries for some of these words. Given this analysis, the following rules REL -» (NP WH R AGR ?a) (S[-inv, fin] GAP (NP AGR ?a)) REL -> (PP WH R PFORM ?p) (S[-inv, fin] GAP (PP PFORM ?p)) handle relative Clauses such as The man who we saw at the store The exam in which you found the error The man whose book you stole 142 CHAPTER 5 Grammars for Natural Language 143 BOX 5.4 Feature Propagation in GPSG GPSG (generalized phrase structure grammar) (Gazdar et al., 1985) formalizes all feature propagation in terms of a set of general principles. The GAP feature corresponds to the SLASH feature in GPSG. Their claim is that no special mechanism is needed to handle the SLASH feature beyond what is necessary for other features. In particular, the SLASH feature obeys two general GPSG principles. The head feature principle, already discussed, requires all head features to be shared between a head subconstituent and its mother constituent. The foot feature principle requires that if a foot feature appears in any subconstituent, it must be shared with the mother constituent. The constraints on the propagation of the SLASH feature result from it being both a head feature and a foot feature. GPSG uses meta-rules to derive additional rules from a basic grammar. Metarules only apply to rules with lexical heads. One meta-rule accounts for gaps by taking an initial rule, for example, VP -> V NP and producing a new rule VP/NP -»V NP[+NULL] where VP/NP is a VP with the SLASH feature NP, and the feature +NULL indicates that the constituent is "phonologically null," that is, not stated or present in the sentence. GPSG meta-rules are used for other phenomena as well. For example, one meta-rule generates rules for passive voice sentences from rules for active voice sentences, while another generates subject-aux inversion rules from noninverted S rules. Meta-rules play much the same role as transformations, but they are Umited to act on only a single rale at a time, and that rule must have a lexical head, whereas transformations may operate on arbitrary trees. 
The passive meta-rule looks something like this: VP -> V [TRAN S1T1VE] NP X => VP[PASSGAP] -4VX Here the symbol X stands for any sequence of constituents. Thus this meta-rule says that if there is a rule for a VP consisting of a transitive verb, an NP, and possibly some other constituents, then there is another rule for a passive VP consisting of the same verb and constituents, except for the NP. As long as there is an upper limit to the number of symbols the variable X can match, you can show that the language described by the grammar expanded by all possible meta-rule applications remains a context-free language. Because gaps don't propagate into the subject of a sentence, a new rule is needed to allow NPs like The man who read the paper, where the wh-word plays the role of subject. This can be covered by the rule REL -> NP[R] VPffin] In addition, there is a common class of relative clauses that consist of that followed by an S with an NP gap, or a VP, as in The man that we saw at the party and The man that read the paper. If you allow that to be a relative pronoun with WH feature R, then the above rules also cover these cases. Otherwise, additional rules must be added for that. Finally, there are relative clauses that do not start with an appropriate wh-phrasc but otherwise look like normal relative clauses. Examples are The paper John read The damage caused by the storm The issue creating the argument The latter two examples involve what are often called reduced relative clauses. These require two additional rules allowing a relative clause to be a finite S with an NP gap or a VP in a participle form: REL -» (S [fin] GAP (NP AGR 7a)) REL -» (VP VFORM {ing pastprt}) Notice that now there are two major uses of gaps: one to handle questions and the other to handle relative clauses. Because of this, you should be careful to examine possible interactions between the two. What happens to relative clauses within questions? They certainly can occur, as in Which dogi did the man who2 we saw —2 holding the bone feed — 1 ? The gaps are shown using dashes, and numbers indicate the fillers for the gaps. The existing grammar does allow this sentence, but it is not obvious how, since the GAP feature can only store one constituent at a time and the sentence seems to require storing two. The answer to this puzzle is that the way the GAP feature is propagated does not allow the initial Q constituent to move into the relative clause. This is because REL is not the head of the rule. In particular, the rule CNP -» CNP REL expands to the rule (CNP GAP ?g) -» (CNP GAP ?g) (REL GAP -) Thus the gap for the question cannot be found in the relative clause. As a result, when a new gap is proposed within the relative clause, there is no problem. So the mechanism works in this case, but does it capture the structure of English well? In particular, can a question gap ever appear within a relative clause? The generally accepted answer appears to be no. Otherwise, a question like the following would be acceptable: *Which dogi did the man2 we saw —2 petting — \ laughed? So the mechanism proposed here appears to capture the phenomena well. 
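The same sketch from Section 5.3 (with its invented rule encoding) makes the point directly: because the embedded CNP is the head of the rule CNP -> CNP REL, the only derived rule marks the REL subconstituent GAP -, so a gap proposed outside the noun phrase can never be threaded into its relative clause.

# Hypothetical encoding of CNP -> CNP REL, with the embedded CNP as head.
cnp_rule = {"mother": ("CNP", {}),
            "rhs": [("CNP", {}), ("REL", {})],
            "head": 0}

print(add_gap_features(cnp_rule))
# Yields the single derived rule (CNP GAP ?g) -> (CNP GAP ?g) (REL GAP -),
# matching the expansion discussed above.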
144 CHAPTER 5 Grammars for Natural Language 145 ari iiuuu Arc NP/1 NP/2 NP/3 NP2/1 NP3/1 Test AGRnAGR, WH«nR Actions DET := * AGR := AGR, DET := * WH := Q AGR := AGR, PRO := * WH:= WH* AGR := AGR, HEAD := * AGR := AGR n AGR, MOD := * Grammar 5.13 An NP network including wh-words 5.5 The Hold Mechanism in ATNs Another technique for handling movement was first developed with the ATN framework. A data structure called the hold list maintains the constituents that are to be moved. Unlike GAP features, more than one constituent may be on the hold list at a single time. Constituents are added to the hold list by a new action on arcs, the hold action, which takes a constituent and places it on the hold list. The hold action can store a constituent currently in a register (for example, the action HOLD SUBJ holds the constituent that is in the SUBJ register). To ensure that a held constituent is always used to fill a gap, an ATN system does not allow a pop arc to succeed from a network until any constituent held by an action on an arc in that network has been used. That is, the held constituent must have been used to fill a gap in the current constituent or in one of its subconstituents. Finally, you need a mechanism to detect and fill gaps. A new arc called VIR (for virtual) that takes a constituent name as an argument can be followed if a constituent of the named category is present on the hold list. If the arc is followed successfully, the constituent is removed from the hold list and returned as the value of the arc in the identical form that a PUSH arc returns a constituent. 5\ 3|. \aux ' --v, verb^ (VP-AUX)- NP/ Arc S/l S/2 S/3 WH-S/1 WH-S/2 . WH-S/3 VP/1 VP/2 S-INV/1 VP-AUX/1 VCOMP/1 VCOMP/2 VCOMP/3 Test WH, n {QR} VFORM, n {past pres) VFORM, n {past pres) AGRsuBj n AGR, VFORM * n {past pres) AGRSUBJnAGR, VFORM* n {past pres} AGRAUX n AGR, SUBCAT n FORM, SUBCATjyi^iN.v n _np SUBCAT main-v n -nP SUBCAT main-v n _none Actions INV:=-SUBJ := * WH:=WH* HOLD * INV := + AUX := * SUBJ := * SUBJ := * AUX:=* MAIN-V := * AUX:=* SUBJ := * MAEM-V := * OBJ := * OBJ := * Grammar 5.14 An S network for questions and relative clauses Grammar 5.13 shows an NP network that recognizes NPs involving wh-words as pronouns or determiners. As with CFGs, the feature WH is set to the value Q to indicate the NP is a wh-NP starting a question. It also accepts relative clauses with a push to the S network, shown in Grammar 5.14. This S network handles yes/no questions, wh-questions, and relative clauses. Wh-questions and relative clauses are processed by putting the initial NP (with a WH feature, Q, or 146 CHAPTER 5 Grammars for Natural Language 147 R) onto the hold list, and later using it in a VIR arc. The network is organized so that all +INV sentences go through node S-INV, while all -INV sentences go through node VP. All wh-questions and relative clauses go through node WH-S and then are redirected back into the standard network based on whether or not the sentence is inverted. The initial NP is put on the hold list when arc S/2 is followed. For noninverted questions, such as Who ate the pizza?, and for relative clauses in the NP of form the man who ate the pizza, the held NP is immediately used by the VIR arc WH-S/2. For other relative clauses, as in the NP the man who we saw, arc WH-S/1 is used to accept the subject in the relative clause, and the held NP is used later on arc VCOMP/2. For inverted questions, such as Who do I see?, arc WH-S/3 is followed to accept the auxiliary, and the held NP must be used later. 
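To make the hold list machinery concrete, the following sketch (in Python, not drawn from any actual ATN implementation) shows only the bookkeeping behind the hold action, the VIR arc, and the restriction on pop arcs; the class and method names are invented, and the rest of the ATN machinery (networks, registers, backtracking) is assumed to exist elsewhere.

# A sketch of hold-list bookkeeping for an ATN parser.
class HoldList:
    def __init__(self):
        self.items = []                       # constituents awaiting a gap

    def hold(self, constituent, level):
        # HOLD action: remember the constituent and the network level
        # (recursion depth) at which it was held.
        self.items.append((constituent, level))

    def follow_vir(self, category):
        # VIR arc: succeeds only if a held constituent of the named
        # category is available; it is removed and returned, just as a
        # PUSH arc returns a parsed constituent.
        for i, (con, _) in enumerate(self.items):
            if con["cat"] == category:
                return self.items.pop(i)[0]
        return None

    def may_pop(self, level):
        # A pop arc may not succeed while a constituent held at this
        # network level remains unused.
        return all(lvl != level for _, lvl in self.items)

# For "Who ate the pizza?", the NP for "who" is held when arc S/2 is
# followed and consumed immediately by the VIR arc WH-S/2:
hold = HoldList()
hold.hold({"cat": "NP", "root": "who", "wh": "Q"}, level=0)
subject = hold.follow_vir("NP")      # fills the subject gap
assert subject is not None and hold.may_pop(level=0)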
This network only accepts verbs with SUBCAT values _none and _np but could easily be extended to handle other verb complements. When extended, there would be a much wider range of locations where the held NP might be used. Figure 5.15 shows a trace of the sentence The man who we saw cried. In parsing the relative clause, the relative pronoun who is held in step 5 and used by a VIR arc in step 8. Note that this ATN would not accept *Who did the man see the boy?, as the held constituent who is not used by any VIR arc; thus the pop arc from the S network cannot be taken. Similarly, *The man who the boy cried ate the pie is unacceptable, as the relative pronoun is not used by a VIR arc in the S network that analyzes the relative clause. Comparing the Methods You have seen two approaches to handling questions in grammars: the use of the GAP feature and the use of the hold list in ATNs. To decide which is best, we must determine the criteria by which to evaluate them. There are three important considerations: coverage—whether the approach can handle all examples; selectivity—whether it rejects all ill-formed examples; and conciseness—how easily rules can be specified. Under reasonable assumptions both methods appear to have the necessary coverage, that is, a grammar can be written with either approach that accepts any acceptable sentence. Given that this is the case, the important issues become selectivity and conciseness, and here there are some differences. For instance, one of the underlying assumptions of the GAP feature theory is that a symbol such as NP with an NP gap must be realized as the empty string; that is. you could not have an NP with an NP gap inside it. Such a structure could be parsable using the ATN hold list mechanism. In particular, a sentence such as *Who did the man who saw hit the boy?, while not comprehensible, would be accepted by the ATN in Grammars 5.13 and 5.14 because the hold list would contain two NPs, one for each occurrence of who. In the relative clause, the relative pronoun would be taken as the subject and the initial query NP would be taken as the Trace of S Network Step Node Position 1. S 1 11. 12. VP VCOMP Arc Followed S/l (for recursive call see trace below) VP/1 VCOMP/3 succeeds since no words left Registers SUBJ <- (NP DET the HEAD man AGR 3s MOD (S who we saw)) MAIN-V <- cried returns (S SUBJ (NP DET the HEAD man AGR 3p MOD (S who we saw)) MAIN-V cried) Trace of First NP Call: Arc S/l Step Node Position Arc Followed Registers 1 NP 1 NP/1 DET <- the AGR <- {3s 3p) 3. NP2 2 NP2/1 HEAD <-man AGR <- 3s 4. NP3 3 NP3/1 (for recursive call see trace below) MOD<- (S WHR SUBJ we MAIN-V saw OBJ who) 10. NP4 6 NP4/1 pop returns (NP DET the Trace of Recursive Call to S on Arc NP3/1 Step Node Position Arc Followed 5. S 3 S/2 (call to NP network not shown) WH-S 4 VP 5 VCOMP 6 OBJ 6 WH-S/1 VP/1 VCOMP/2 (uses the NP on the hold list) OBJ/1 pop HEAD man AGR 3s MOD (S who we saw)) Registers WH<- {QR} HOLDING (NP PRO who WH {QR}) WH <— R SUBJ<- (NP PRO we ...) MAIN-V <- saw OBJ <- (NP PRO who...) returns (S WHR SUBJ we MAIN-V saw OBJ who) Figure 5.15 A trace of an ATN parse for l The 2 man 3 who 4 we 5 saw g cried 7 object. 
This sentence is not acceptable by any grammar using GAP features, however, because the GAP feature cannot be propagated into a relative clause since it is not the head constituent in the rule 148 CHAPTER 5 Grammars for Natural Language 149 CNP -> CNP REL Even if it were the head, however, the sentence would still not be acceptable, since the erroneous analysis would require the GAP feature to have two values when it starts analyzing the relative clause. The hold list mechanism in the ATN framework must be extended to capture these constraints, because there is no obvious way to prevent held constituents from being used anytime they are available. The only possible way to restrict it using the existing mechanism would be to use feature values to keep track of what held constituents are available in each context. This could become quite messy. You can, however, make a simple extension. You can introduce a new action—say, HIDE—that temporarily hides the existing constituents on the hold list until either an explicit action—say, UNHIDE—is executed, or the present constituent is completed. With this extension the ATN grammar to handle relative clauses could be modified to execute a HIDE action just before the hold action that holds the current constituent is performed. In this case, however, it is not the formalism itself that embodies the constraints. Rather, the formalism is used to state the constraints; it could just as easily describe a language that violates the constraints. The theory based on GAP features, on the other hand, embodies a set of constraints in its definition of how features can propagate. As a result, it would be much more difficult to write a grammar that describes a language that violates the movement constraints. While a true account of all the movement constraints is still an area of research, this simple analysis makes an important point. If a formalism is too weak, you cannot describe the language at all. If it is too strong, the grammar may become overly complicated in order to eliminate sentence structures that it can describe but that are not possible. The best solution, then, is a formalism that is just powerful enough to describe natural language. In such a language many of the constraints that first appear to be arbitrary restrictions in a language might turn out to be a consequence of the formalism itself and need no further consideration. While the technique of GAP feature propagation was introduced with context-free grammars, and the hold list mechanism was introduced with ATNs, these techniques are not necessarily restricted to these formalisms. For instance, you could develop an extension to CFGs that incorporates a hold list and uses it for gaps. Likewise, you might be able to develop some rules for GAP feature propagation in an ATN. This would be more difficult, however, since the GAP feature propagation rules depend on the notion of a head subconstituent, which doesn't have a direct correlate in ATN grammars. 5.6 Gap Threading A third method for handling gaps combines aspects of both the GAP feature, approach and the hold list approach. This technique is usually called gap threading. It is often used in logic grammars, where two extra argument 1. s(ln, Out, Fillersln, FillersOut):- np(ln, Ih1, Fillersln, Fillers"!), vp(ln1, Out, Fillersl, FillersOut) 2. vp(ln, Out, Fillersln, FillersOut):- v(ln, In1) 3. vp(ln, Out, Fillersln, FillersOut):- v(ln, In1), np(ln1, Out, Fillersln, FillersOut) 4. np(ln, Out, Fillers, Fillers):- art(ln, In1), cnp(ln1, Out) 5. 
np(In, Out, Fillers, Fillers) :- pro(In, Out)
6. cnp(In, Out) :- n(In, In1), np-comp(In1, Out)
7. np-comp(In, In)
(This covers the case where there is no NP complement.)
8. np-comp(In, Out) :- rel-intro(In, In1, Filler), s(In1, Out, Filler, nil)
(Here we hold the rel-intro constituent, and must use it in the following S.)
9. rel-intro(In, Out, [NP]) :- relpro(In, Out)
(where relpro accepts any pronoun with WH feature R)
10. np(In, In, [NP | Fillers], Fillers)
(This rule builds an empty NP from a filler.)

Grammar 5.16 A logic grammar using gap threading

positions are added to each predicate—one argument for a list of fillers that might be used in the current constituent, and one for the resulting list of fillers that were not used after the constituent is parsed. Thus the predicate

s(position-in, position-out, fillers-in, fillers-out)

is true only if there is a legal S constituent between position-in and position-out of the input. If a gap was used to build the S, its filler will be present in fillers-in but not in fillers-out. For example, an S constituent with an NP gap would correspond to the predicate s(In, Out, [NP], nil). In cases where there are no gaps in a constituent, the fillers-in and fillers-out will be identical.

Consider an example dealing with relative clauses. The rules required are shown in Grammar 5.16. The various feature restrictions that would be needed to enforce agreement and subcategorization are not shown, so as to keep the example simple. To see these rules in use, consider the parse of the sentence The man who we saw cried in Figure 5.17. The relative clause is analyzed starting with step 7. Using rule 9, the word who is recognized as a relative pronoun, and the variable Filler is bound to the list [NP]. This filler is then passed into the embedded S (step 9), to the NP (step 10), and then on to the VP (step 12), since it is not used in the NP. From there it is passed to the NP predicate in step 14, which uses the filler according to rule 10. Note that no other NP rule could have applied at this point, because the filler must be used, since the FillersOut variable is nil. Only rules that consume the filler can apply. Once this gap is used, the entire NP from positions 1 to 6 has been found and the rest of the parse is straightforward.

Step  State                                                    Next Operation
1.    s(1, 7, nil, nil)                                         applying rule 1
2.    np(1, In1, nil, Fillers1) vp(In1, 7, Fillers1, nil)       applying rule 4
3.    art(1, In2) cnp(In2, In1)                                 proved art(1, 2)
4.    cnp(2, In1)                                               applying rule 6
5.    n(2, In3) np-comp(In3, In1)                               proved n(2, 3)
6.    np-comp(3, In1)                                           applying rule 8
7.    rel-intro(3, In4, Filler) s(In4, In1, Filler, nil)        applying rule 9
8.    relpro(3, In4)                                            proved relpro(3, 4); proved rel-intro(3, 4, [NP])
9.    s(4, In1, [NP], nil)                                      applying rule 1
10.   np(4, In5, [NP], Fillers1) vp(In5, In1, Fillers1, nil)    applying rule 5
11.   pro(4, In5)                                               proved pro(4, 5); proved np(4, 5, [NP], [NP])
12.   vp(5, In1, [NP], nil)                                     applying rule 3
13.   v(5, In6) np(In6, In1, [NP], nil)                         proved v(5, 6)
14.   np(6, In1, [NP], nil)                                     proved np(6, 6, [NP], nil); proved vp(5, 6, [NP], nil); proved s(4, 6, [NP], nil); proved np-comp(3, 6); proved cnp(2, 6); proved np(1, 6, nil, nil)
15.   vp(6, 7, nil, nil)                                        applying rule 2
16.
v(6, 7) proved v(6, 7) proved vp(6, 7, nil, nil) proved s(1, 7, nil, nil) Figure 5.17 A trace of the parse of j The 2 man 3 who 4 we 5 saw 6 cried 7 Just as a convenient notation was designed for definite clause grammars, which then could be simply translated into a PROLOG program, a notation has been designed to facilitate the specification of grammars that handle gaps. In particular, there is a formalism called extraposition grammar, which, besides allowing normal context-free rules, allows rules of the form REL-MARK ... TRACE -> REL-PRO which essentially says that the constituent REL-MARK, plus the constituent TRACE later in the sentence, can be rewritten as a REL-PRO. Such a rule violates the tree structure of syntactic forms and allows the analysis shown in Figure 5.18 of the NP the mouse that the cat ate. Such rules can be compiled into a logic grammar using the gap-threading technique. ART N REL-PRO I The mouse that the cat ate Figure 5.18 A parse tree in extraposition grammar Consider how gap threading compares with the other approaches. Of course, by using additional arguments on predicates, any of the approaches could be implemented. If viewed as simply an implementation technique, then the gap-threading approach looks like a way to implement a hold list in a logic grammar. Since the grammar designer can decide to propagate the hold list or not in the grammar, it provides the flexibility to avoid some of the problems with the simple hold list mechanism. For instance, Grammar 5.16 does not pass the fillers from outside a noun phrase into its relative clause. So problems like those discussed in the last section can be avoided. It also would be possible to adopt the GAP feature introduction algorithm described in Section 5.3 so that it introduces gap-threading rules into a logic grammar. Thus, the approach provides the grammar writer with great flexibility. Like the hold list approach, however, the propagation constraints must be explicitly enforced rather than being a consequence of the formalism. Summary Many of the complex aspects of natural language can be treated as movement phenomena, where a constituent in one position is used to satisfy the constraints at another position. These phenomena are divided into two classes: bounded movement, including yes/no questions and passives; and unbounded movement, including wh-questions and relative clauses. The computational techniques developed for handling movement allow significant generalities to be captured 152 CHAPTER 5 Grammars for Natural Language 153 between all of these different sentential forms. By using the feature system carefully, a basic grammar of declarative sentences can be extended to handle the other forms with only a few rules. Three different techniques were introduced to handle such phenomena. The first was the use of the special feature propagation rules using a feature called GAP. The second was the hold list mechanism in ATNs, which involves adding a new action (the hold action) and a new arc type (the VTR arc). The third involved gap threading, with a hold-list-like structure passed from constituent to constituent using features. With suitable care, all of these approaches have been used successfully to build grammars of English with substantial coverage. Related Work and Further Readings The phenomena of unbounded dependencies has motivated a significant amount of research into grammatical formalisms. 
These fall roughly into several categories, depending on how close the grammar remains to the context-free grammars. Theories such as transformational grammar (Chomsky, 1965; Radford, 1981), for example, propose to handle unbounded dependencies completely outside the CFG framework. Theories such as lexical functional grammar (Kaplan and Bresnan, 1982) and generalized phrase structure grammar (GPSG) (Gazdar, 1982; Gazdar et al., 1985) propose methods of capturing the dependencies with the CFG formalism using features. There has also been considerable work in defining new formalisms that are slightly more powerful than context-free grammars and can handle long-distance dependencies such as tree-adjoining grammars (TAGs) (see Box 5.5) and combinatory categorial grammars (Steedman, 1987). The first reasonably comprehensive computational study of movement phenomena was in the ATN framework by Woods (1970). He went to some length to show that much of the phenomena accounted for by transformational grammar (Chomsky, 1965) could be parsed in ATNs using register testing and setting together with a hold list mechanism. The ATN section of this chapter is a simplified and cleaned-up account of that paper, incorporating later work by Kaplan (1973). Alternative presentations of ATNs can also be found in Bates (1978) and Winograd (1983). Similar augmentation techniques have been adapted to systems based on CFGs, although in practice many of these systems have accepted many structures that are not reasonable sentences. This is because the grammars are built so that constituents are optional even in contexts where there could be no movement. The parser then depends on some ad hoc feature manipulation to eliminate some of these false positives, or on semantic interpretation to reject the ill-formed sentences that were accepted. A good example of how much can be done with this approach is provided by Robinson (1982). BOX 5.5 Tree-Adjoining Grammar Another approach to handling unbounded dependencies is tree-adjoining grammars (TAGs) (Joshi, 1985). There are no grammar rules in this formalism. Rather, there is a set of initial tree structures that describe the simplest sentences of the language, and a tree operation, called adjoining, that inserts one tree into another to create a more complex structure. For example, a simple initial tree is VP More complex sentences are derived using auxiliary trees, which capture the minimal forms of recursion in the language. For example, an auxiliary tree for allowing an adverbial in a verb phrase could be The adjunction operation involves inserting an auxiliary tree-capturing recursion of some constituent C into another tree that contains a constituent C. Adjoining the auxiliary tree for VP above into the initial tree for S produces the new tree VP vp v np By basing the formalism on trees, enough context is provided to capture longdistance dependencies. In this theory there is no movement. Instead the constituents start off being close to each other, and then additional structure is inserted between them by the adjoining operation. The result is a formalism of slightly greater power than CFGs but definitely weaker than context-sensitive grammars. The work on handling movement using gap threading and the definition of extraposition grammars can be found in Pereira (1981). Similar techniques can be found in many current parsers (such as Alshawi, 1992). 
154 CHAPTERS Grammars for Natural Language 155 The section on using GAP feature propagation to handle movement is motivated from GPSG (Gazdar et al., 1985). In GPSG the propagation of features is determined by a set of general principles (see Box 5.4). Head-driven phrase structure grammar (Pollard and Sag, 1987; 1993) is a descendant of GPSG that is more oriented toward computational issues. It uses subcategorization information on the heads of phrases extensively and, by so doing, greatly simplifies the context-free grammar at the expense of a more complex lexicon. Exercises for Chapter 5_ 1. (easy) Distinguish between bounded (local) movement and unbounded (nonlocal) movement. What extensions were added to the augmentation system to enable it to handle unbounded movement? Why is wh-movement called unbounded movement? Give examples to support your claim. 2. (easy) Expand the following abbreviated constituents and rales into their full feature structure format. (S[-inv, Q] AGR ?a) NP[R, -gap] VP -> V[_np_s:inf] NP S[infj 3. (medium) Using the grammar developed in Section 5.3, show the analyses of the following questions in chart form, as shown in Figure 5.12: In which town were you born? Where were you born? When did they leave? What town were you born in? 4. (medium) GPSG allows certain rules to have multiple head subconsti-tuents. For instance, the rule for a conjunction between two verb phrases would contain two heads: VP -> VP and VP a How does the presence of multiple heads affect the algorithm that produces the propagation of the GAP feature? In order to answer this question, consider some examples, such as the following sentences: Who did you see and give the book to? What man did Mary hate and Sue love? Also consider that the following sentences are ill-formed: *Who did you see and give the book to John? *What man did Mary hate John and Sue love? b. Write out the VP rule showing the GAP features, and then draw the chart for the sentence Who did Mary see and Sue see? using Grammar 5.8 augmented with your rule. You need only show the constituents that are used in the final analysis, but be sure to show all the feature values for each constituent. (medium) Another phenomenon that requires a movement analysis is topi-calization, which allows sentences of the form John, Mary saw. To John, Mary told the story. a Argue that the same type of constraints that apply to wh-questions apply to topicalization. This establishes that the GAP feature propagation technique can be used. b. Do you need to introduce a new GAP-like feature to handle topicalization, or can the same GAP feature used for questions and relative clauses be re-used? c. Extend Grammar 5.8 to accept the forms of topicalization shown in the examples. (medium) Consider the following ATN grammar for sentences. On arc NP3/1 there is the action of holding the current noun phrase. Give a sequence of legal sentences that can be made arbitrarily long and still be accepted by this ATN. Which of the following sentences would be accepted by this ATN (assuming appropriate word categories for words)? For those that are accepted, outline the structure of the sentence as parsed by the network (that is, a plausible register value structure). i. The man Mary hit hit the ball. ii. The man the man hit the man hit. iii. The man hit hit. iv. Hit the man hit the man. 156 CHAPTER 5 (hard) The following noun phrases arise in some dialogues between a computer operator and various users of the computer system. Could you mount a magtape for me? 
No ring please. I am not exactly sure of the reason, but we were given a list of users we are not supposed to mount magtapes for, and you are on it. Could you possibly retrieve the following two files'! I think they were on our directory last night. Any chance I can recover from the most recent system dump ? Extend the grammar developed in Sections 5.3 and 5.4 so that it accepts these noun phrases. Give lexicon entries for all the new words in the noun phrases, and show the final charts that result from parsing each NP. 8. (hard) The grammar for parsing the auxiliary verb structure in English developed in Section 5.1 cannot handle phrases of the following form: Jack has to see a doctor. The cat had to be found. Joe has to be winning the race. The book would have had to have been found by Jack. Extend the grammar such that it accepts the preceding auxiliary sequences yet does not accept unacceptable sentences such as *Jack has to have to see a doctor. * Janet had to played the violin. *Will would to go to the movies. Perform a similar analysis, showing examples and counterexamples, of the use of phrases of the form be going to within the auxiliary system. CHAPTER Toward Efficient Parsing 6.1 Human Preferences in Parsing 6.2 Encoding Uncertainty: Shift-Reduce Parsers 6.3 A Deterministic Parser 6.4 Techniques for Efficient Encoding of Ambiguity 6.5 Partial Parsing Toward Efficient Parsing 159 All the parsing frameworks discussed so far have depended on complete search techniques to find possible interpretations of a sentence. Such search processes do not correspond well with many people's intuitions about human processing of language. In particular, human parsing seems closer to a deterministic process— that is, a process that doesn't extensively search through alternatives but rather uses the information it has at the time to choose the correct interpretation. While intuitions about our own perceptual processes are known to be highly unreliable, experimental evidence also suggests that people do not perform a complete search of a grammar while parsing. This chapter examines some ways of making parsers more efficient, and in some cases, deterministic. If syntactic parsing is not your focus, this entire chapter can be treated as optional material. There are two different issues that are of concern. The first involves improving the efficiency of parsing algorithms by reducing the search but not changing the final outcome, and the second involves techniques for choosing between different interpretations that the parser might be able to find. Both will be examined in this chapter. Section 6.1 explores the notion of using preferences in parsers to select certain interpretations, sometimes at the cost of eliminating other possible interpretations. Evidence for this type of model is drawn from psycholinguistic theories concerning the human parsing process. This provides the background for the techniques discussed in the rest of the chapter. Section 6.2 explores one method for improving the efficiency of parsers by pre-compiling much of the search into tables. These techniques are used in shift-reduce parsers, which are used for parsing programming languages. These techniques can be generalized to handle ambiguity and used for natural languages as well. Section 6.3 describes a fully deterministic parsing framework that parses sentences without a recourse to backtracking. This means that the parser might make mistakes and fail to find interpretations for some sentences. 
If the technique works correctly, however, only sentences that people tend to misparse will be misparsed by the system. Section 6.4 looks at a variety of techniques to efficiently represent ambiguity by "collapsing" interpretations, or by eliminating ambiguity by redefining the goals of syntactic parsing. Section 6.5 looks at techniques of partially parsing the input by analyzing only those aspects of a sentence that can be determined reliably. 6.1 Human Preferences in Parsing So far this book has discussed parsing in the abstract, without regard to any psychological evidence concerning how people parse sentences. FJycholinguists have conducted many investigations into this issue using a variety of techniques, from intuitions about preferred interpretations to detailed experiments monitoring moment-by-moment processing as subjects read and listen to language. These studies have revealed some general principles concerning how people resolve ambiguity. 160 CHAPTER 6 Toward Efficient Parsing 161 1.1 S^NPVP 1.4 NP -» ARTN 1.2 VP -> V NP PP 1.5 NP -> NP PP 1.3 VP -> VNP 1.6 PP-> PNP Grammar 6.1 A simple CFG NP the man VP NP PP NP the man VP NP kept the dog in the house kept NP PP the dog in the house Figure 6.2 The interpretation on the left is preferred by the minimal attachment principle The most basic result from these studies is that people do not give equal weight to all possible syntactic interpretations. This can be illustrated by sentences that are temporarily ambiguous, which cause a conscious feeling of having pursued the wrong analysis, as in the sentence The raft floated down the river sank. When you read the word sank you realize that the interpretation you have constructed so far for the sentence is not correct. In the literature, such sentences are often called garden-path sentences, based on the expression about leading someone down a garden path. Here are a few of the general principles that appear to predict when garden paths will arise. Minimal Attachment The most general principle is called the minimal attachment principle, which states that there is a preference for the syntactic analysis that creates the least number of nodes in the parse tree. Thus, given Grammar 6.1. the sentence 77k-man kept the dog in the house would be interpreted with the PP in the house modifying the verb rather than the NP the dog. These two interpretations are shown in Figure 6.2. The interpretation with the PP attached to the VP is derived using rules 1.1, 1.2, and 1.6 and three applications of rule 1.4 for the NPs. The parse tree has a total of 14 nodes. The interpretation with the PP attached to the NP is derived using rules 1.1, 1.3, 1.5, and 1.6 and three applications of rule 1.4, producing a total of 15 nodes in the parse tree. Thus this principle predicts that the first interpretation is preferred, which probably agrees with your intuition. This principle appears to be so strong that it can cause certain sentences to be almost impossible to parse correctly. One example is the sentence We painted all the walls with cracks. which, against all common sense, is often read as meaning that cracks were painted onto the walls, or that cracks were somehow used as an instrument to paint the walls. Both these anomalous readings arise from the PP being attached to the VP {paint) rather than the NP (the walls). Another classic example is the sentence The horse raced past the barn fell. 
which has a reasonable interpretation corresponding to the meaning of the sentence The horse that was raced past the barn fell. In the initial sentence, however, creating a reduced relative clause when the word raced is encountered introduces many more nodes than the simple analysis where raced is the main verb of the sentence. Of course, this second interpretation renders the sentence unanalyzable when the word fell is encountered. Right Association The second principle is called right association or late closure. This principle states that, all other things being equal, new constituents tend to be interpreted as being part of the current constituent under construction (rather than part of some constituent higher in the parse tree). Thus, given the sentence George said that Henry left in his car. the preferred interpretation is that Henry left in the car rather than that George spoke in the car. Both interpretations are, of course, syntactically acceptable analyses. The two interpretations are shown in Figure 6.3. The former attaches the PP to the VP immediately preceding it, whereas the latter attaches the PP to the VP higher in the tree. Thus the right association principle prefers the former. Similarly, the preferred interpretation for the sentence / thought it would rain yesterday is that yesterday was when it was thought to rain, rather than the time of the thinking. Lexical Preferences In certain cases the two preceding principles seem to conflict with each other. In the sentence The man kept the dog in the house, the principle of right association appears to favor the interpretation in which the PP modifies the dog, while the minimal attachment principle appears to favor the PP modifying the VP. You 162 CHAPTER 6 Toward Efficient Parsing 163 Henry Henry left left Figure 6.3 Two interpretations of George said that Henry left in his car. might suggest that minimal attachment takes priority over right association in such cases; however, the relationship appears to be more complex than that. Consider the sentences 1. I wanted the dog in the house. 2. I kept the dog in the house. 3. I put Hie dog in the house. The PP in the house in sentence 1 seems most likely to be modifying dog (although the other interpretation is possible, as in the sense / wanted the dog to be in the house). In sentence 2, the PP seems most likely to be modifying the VP (although modifying the NP is possible, as in I kept the dog that was in the house). Finally, in sentence 3, the PP is definitely attached to the VP, and no alternative reading is possible. These examples demonstrate that lexical items, in this case the verb used, can influence parsing preferences. In many cases, the lexical preferences will override the preferences based on the general principles. For example, if a verb subcategorizes for a prepositional phrase, then some PP must be attached to die VP. Other PPs might also be identified as having a strong preference for attachment within the VP. If neither of these cases holds, the PP will be attached according to the general principles. Thus, for the preceding verbs, want has no preference for any PPs, whereas keep might prefer PPs with prepositions in, on, or by to be attached to the VP. Finally, the verb put requires (subcategorizes for) a PP beginning with in, on, by. and so on, which must be attached to the VP. 
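A rough sketch of how such lexical preferences might be consulted during attachment decisions is given below. The table entries simply restate the preferences suggested above for want, keep, and put; the encoding, the function name, and the fallback to low attachment are simplifying assumptions of the sketch, not a claim about any particular parser.

# Sketch of verb-specific PP-attachment preferences.
PP_PREFERENCE = {
    "put":  {"subcategorized": True,  "vp_preferred": {"in", "on", "by"}},
    "keep": {"subcategorized": False, "vp_preferred": {"in", "on", "by"}},
    "want": {"subcategorized": False, "vp_preferred": set()},
}

def attach_pp(verb, prep):
    """Decide whether a PP headed by prep attaches to the VP or the NP."""
    entry = PP_PREFERENCE.get(verb,
                              {"subcategorized": False, "vp_preferred": set()})
    if entry["subcategorized"]:
        return "VP"          # put: some PP must attach to the VP
    if prep in entry["vp_preferred"]:
        return "VP"          # keep: PPs with in/on/by favor the VP
    return "NP"              # otherwise defer to the general principles,
                             # simplified here to low (NP) attachment

print(attach_pp("put", "in"))    # VP: I put the dog in the house
print(attach_pp("keep", "in"))   # VP: I kept the dog in the house
print(attach_pp("want", "in"))   # NP: I wanted the dog in the house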
This approach has promise but is 2.1 S-> NPVP 2.3 VP -> AUX V NP 2.2 NP -> ARTN 2.4 VP -> VNP i Grammar 6.4 A simple grammar with an AUX/V ambiguity hard to evaluate without developing some formal framework that allows such information to be expressed in a uniform way. Such techniques will be addressed in the next chapter. 6.2 Encoding Uncertainty: Shift-Reduce Parsers One way to improve the efficiency of parsers is to use techniques that encode uncertainty, so that the parser need not make an arbitrary choice and later backtrack. Rather, the uncertainty is passed forward through the parse to the point where the input eliminates all but one of the possibilities. If you did this explicitly at parse time, you would have an algorithm similar to the breadth-first parser described in Chapter 3. The efficiency of the technique described in this section arises from the fact that all the possibilities are considered in advance, and the information is stored in a table that controls the parser, resulting in parsing algorithms that can be much faster than described thus far. These techniques were developed for use with unambiguous context-free grammars—grammars for which there is at most one interpretation for any given sentence. While this constraint is reasonable for programming languages, it is clear that there is no unambiguous grammar for natural language. But these techniques can be extended in various ways to make them applicable to natural language parsing. Specifying the Parser State Consider using this approach on the small grammar in Grammar 6.4. The technique involves predetermining all possible parser states and determining the transitions from one state to another. A parser state is defined as the complete set of dotted rules (that is, the labels on the active arcs in a chart parser) applicable at that position in the parse. It is complete in the sense that if a state contains a rule of the form Y —>... ° X where X is a nonterminal, then all rules for X are also contained in the state. For instance, the initial state of the parser would include the rule S -> ° NP VP as well as all the rules for NP, which in Grammar 6.4 is only NP -> 0 ART N 164 CHAPTER 6 Toward Efficient Parsing 165 Thus the initial state, SO, could be summarized as follows: Initial State SO: S -> ° NP VP NP -> ° ART N In other words, the parser starts in a state where it is looking for an NP to start building an S and looking for an ART to build the NP. What states could follow this initial state? To calculate this, consider advancing the dot over a terminal or a nonterminal and deriving a new state. If you pick the symbol ART, the resulting state is State SI: NP -> ART ° N If you pick the symbol NP, the rule is S —> NP ° VP in the new state. Now if you expand out the VP to find all its possible starting symbols, you get the following: State S2: S -» NP 0 VP VP -» ° AUX V NP VP -v ° VNP Now, expanding SI, if you have the input N, you get a state consisting of a completed rule: State SI': NP -> ART N ° Expanding S2, a V would result in the state State S3: VP -> V ° NP NP -> ° ART N An AUX from S2 would result in the state State S4: VP AUX ° VNP and a VP from S2 would result in the state State S2': S -» NP VP ° Continuing from state S3 with an ART, you find yourself in state SI again, as you would also if you expand from SO with an ART. 
Continuing from S3 with an NP, on the other hand, yields the new state State S3': VP -> V NP ° Continuing from S4 with a V yields State S5: VP -> AUX V ° NP NP -4 ° ART N Figure 6.5 A transition graph derived from Grammar 6.1 and continuing from S5 with an ART would produce state SI again. Finally, continuing from S5 with an NP would produce the state State S5': VP -4 AUXVNP0 Now that this process is completed, you can derive a transition graph that can be used to control the parsing of sentences, as is shown in Figure 6.5. A Shift-Reduce Parser These states can be used to control a parser that maintains two stacks: the parse stack, which contains parse states (that is, the nodes in Figure 6.5) and grammar symbols; and the input stack, which contains the input and some grammar symbols. At any time the parser operates using the information specified for the top state on the parse stack. The states are interpreted as follows. The states that consist of a single rule with the dot at the far right-hand side, such as S2', S -> NP VP 0 indicate that the parser should rewrite die top symbols on the parse stack according to this rule. This is called a reduce action. The newly derived symbol (S in this case) is pushed onto the top of the input stack. Any other state not containing any completed rules is interpreted by the transition diagram. If the top input symbol matches an arc, then it and the new 166 CHAPTER 6 Toward Efficient Parsing 167 State Top Input Symbol Action GoTo SO ART Shift si so NP Shift S2 so S Shift SO' SO' e Succeed SI N Shift SI' sr Reduce by rule 2.2 S2 V Shift S3 S2 AUX Shift S4 S2 VP Shift S2' S2' Reduce by rule 2.1 S3 ART Shift SI S3 NP Shift S3' S3' Reduce by rule 2.4 S4 V Shift S5 S5 ART Shift SI S5 NP Shift S5' S5' Reduce by rule 2.3 Figure 6.6 The oracle for Grammar 6.4 state (at the end of the arc) are pushed onto the parse stack. This is called the shift action. Using this interpretation of states you can construct a table, called the oracle, that tells the parser what to do in every situation. The oracle for Grammar 6.4 is shown in Figure 6.6. For each state and possible input, it specifies the action and the next state. Reduce actions can be applied regardless of the next input, and the accept action only is possible when the input stack is empty (that is, the next symbol is the empty symbol e). The parsing algorithm tor using an oracle is specified in Figure 6.7. Consider parsing The man ate the carrot. The initial state of the parser is Parse Stack Input Stack (SO) (The man ate the carrot) Looking up the entry in the table in Figure 6.6 for state SO for the input ART (the category of the word the), you see a shift action and a move to state SI: Input Stack Parse Stack (SI ART SO) (man ate the carrot) Looking up the entry for state SI for the input N, you sec a shift action and a move to state SI': This algorithm uses the following information: • Action(S, W)—a function that maps a state and an input constituent to one of the values shift, reduce i, or accept • GoTo (S, W)—a function that maps a state and an input constituent to a new state • a parse stack of form (Sn C„ ... Si C[ S0), where S; are parse states and Q are constituents • an input stack of form (Wj .:. W„), where W, is a constituent symbol or word The parser operates by continually executing the following steps until success or failure: 1. 
If Action(S„, Wj)= Shift, and GoTo(Sn, Wj) = S, then remove W, from the input stack and push it on the parse stack, and then push S onto the parse stack, resulting in the following stacks: parse stack: (S W] Sn C„ ... St Ct S0) input stack: (W2 ...Wn) 2 If Action (S„,Wj)= Reduce i and grammar rule i has n constituents on its right-hand side, then remove 2n elements from the parse stack, and push the left-hand side of rule i onto the input stack. For example, if rule i were NP —> ART N, then the new state would be parse stack: (Sn_2 Cn_2... Sj Cl S0) input stack: (NP ... Wn) 3. If Action(S„, Wi) = Accept, then the parser has succeeded. 4. If Action (Sn, W;) is not defined, then the parser has failed. Figure 6.7 The parsing algorithm for a shift-reduce parser Parse Stack (ST N SI ART SO) Input Stack (ate the carrot) Looking up the entry for state SI', you then reduce by rule 2.2, which removes the SI', N, SI, and ART from the parse stack and adds NP to the input stack: Parse Stack (SO) Input Stack (NP ate the carrot) Again, consulting the table for state SO with input NP, you now do a shift and move to state S2: Parse Stack (S2.NP SO) Input Stack (ate the carrot) Next, the three remaining words all cause shifts and a move to a new state, ending up with the parse state: Parse Stack (Si' N SI ART S3 V S2 NP SO) Input Stack . . ( ) 168 CHAPTER 6 Toward Efficient Parsing 169 The reduce action by rule 2.2 specified in state SI' pops the N and ART from the stack (thereby popping SI and SI' as well), producing the state: Parse Stack Input Stack (S3 V S2 NP SO) (NP) You are now back at state S3, with an NP in the input, and after a shift to state; S3', you reduce by rule 2.4, producing: Parse Stack Input Stack (S2 NP SO) (VP) Finally, from state S2 you shift to state S2' and reduce by rule 2.1, producing: Parse Stack Input Stack (SO) (S) From this state you shift to state SO' and are in a position to accept the sentence. You have parsed the sentence without ever trying a rule in the grammar incorrectly and without ever constructing a constituent that is not part of the final analysis. Shift-Reduce Parsers and Ambiguity Shift-reduce parsers are efficient because the algorithm can delay decisions about which rule to apply. For example, if you had the following rules for parsing an NP, 1. NP -> ART N REL-PRO VP 2. NP -> ART N PP a top-down parser would have to generate two new active arcs, both of which would be extended if an ART were found in the input. The shift-reduce parser, on the other hand, represents both using a single state, namely NP1: NP -> ° ART N REL-PRO VP NP -> ° ART N PP If an ART is found, the next state will be NP2: NP -> ART ° N REL-PRO VP NP -> ART ° N PP Thus the ART can be recognized and shifted onto the parse stack without committing to which rule it is used in. Similarly, if an N is seen at state NP2, the state is NP3: NP -> ART N ° REL-PRO VP NP -> ART N ° PP PP —> ° P NP Now from NP3, you finally can distinguish the cases. If a REL-PRO is found next, the state is NP4: NP -» ART N REL-PRO ° VP VP -» ° VNP which leads eventually to a reduction by rule 1. If a P is found, you move to a different state, which eventually builds a PP that is used in the reduction by rule 2. Thus the parser delays the decision until sufficient information is available to choose between rules. Lexical Ambiguity This ability to postpone decisions can be used to deal with some lexical ambiguity by a simple extension to the parsing process. 
Whereas earlier you classified a lexical entry when it was shifted (that is, carrot was converted to N during the shift), you now allow ambiguous words to be shifted onto the parse stack as they are, and delay their categorization until a reduction involving them is made. To accomplish this extension, you must expand the number of states to include states that deal with ambiguities. For instance, if can could be a V or an AUX, the oracle cannot determine a unique action to perform from state S2—if it were a V you would shift to state S3 and if it were an AUX you would shift to state S4. Such ambiguities can be encoded, however, by generating a new state from S2 to cover both possibilities simultaneously. This new state will be the union of states S3 and S4: S3^1: VP -> AUX ° VNP VP -> V ° NP NP -> ° ART N In this case the next input should resolve the ambiguity. If you see a V next, you will move to S5 (just as you would from state S4). If you see an ART next, you will move to SI, and if you see an NP next, you will move to S3' (just as you would from S3). Thus the new state maintains the ambiguity long enough for succeeding words to resolve the problem. Of course, in general, the next word might also be ambiguous, so the number of new states could be quite large. But if the grammar is unambiguous—that is, there is only one possible interpretation of the sentence once it is completely parsed—this technique can be used to delay the decisions as long as necessary. (In fact, for those who know automata theory, this process is simply the construction of a deterministic simulation of a nondeter-ministic finite automata.) Ambiguous Parse States There arc. however, other forms of ambiguity that this parser, as it now stands, is not able to handle. The simplest examples involve the ambiguity that arises when one rule is a proper prefix of another, such as the rules 170 CHAPTER 6 Toward Efficient Parsing 171 NP NP ARTN ART N PP With these two rules, we will generate a parse state containing the two dotted rules NP -> ART ° N NP -> ART ° NPP If the next input in this state is an N, you would move to a state consisting of NP5: NP NP PP ■ ARTN 0 ■ ART N ° PP ° PPP But now there is a problem. If the next input is a P, eventually a PP will be built and you return to state NP5. But the parser cannot decide whether to reduce by rule 3, leaving the PP to be attached later, or to shift the PP (and then reduce by rule 4). In any reasonable grammar of English, of course, both choices might lead to acceptable parses of the sentence. There are two ways to deal with this problem. The first strategy maintains a deterministic parser and accepts the fact that it may misparse certain sentences. The goal here is to model human parsing performance and fail on only those that would give people trouble. One recent suggestion claims the following heuristics favor more intuitive interpretations over less intuitive ones: • favor shift operations over reduce operations • resolve all reduce-reduce conflicts in favor of the longest rule (that is, the one that uses the most symbols from the stack) Using these strategies, you could build a deterministic parser that picks the most favored rule and ignores the others. It has been claimed that the first heuristic corresponds to the right-association preference and the second to the minimal-attachment preference. 
While convincing examples can be shown to illustrate this, remember that the strategies themselves, and thus the examples, are highly sensitive to the form of the grammar used. The second approach to dealing with the ambiguous states is to abandon the deterministic requirement and reintroduce a search. In this case you could consider each alternative by either a depth-first search with backtracking or a breadth-first search maintaining several interpretations in parallel. Both of these approaches can yield efficient parsers. The depth-first approach can be integrated with the preference strategies so that the preferred interpretations are tried first. 6.3 A Deterministic Parser A deterministic parser can be built that depends entirely on matching parse states to direct its operation. Instead of allowing only shift and reduce actions, however, a richer set of actions is allowed that operates on an input stack called the buffer. The Parse Stack Top -> (S SUBJ (NP DET the HEAD cat)) The Buffer the fish Figure 6.8 A situation during a parse Rather than shifting constituents onto the parse stack to be later consumed by a reduce action, the parser builds constituents incrementally by attaching buffer elements into their parent constituent, an operation similar to feature assignment. Rather than shifting an NP onto the stack to be used later in a reduction S -> NP VP, an S constituent is created on the parse stack and the NP is attached to it. Specifically, this parser has the following operations: • Create a new node on the parse stack (to push the symbol onto the stack) • Attach an input constituent to the top node on the parse stack • Drop the top node in the parse stack into the buffer The drop action allows a completed constituent to be reexamined by the parser, which will then assign it a role in a higher constituent still on the parse stack. This technique makes the limited lookahead technique surprisingly powerful. To get a feeling for these operations, consider the situation in Figure 6.8, which might occur in parsing the sentence The cat ate the fish. Assume that the first NP has been parsed and assigned to the SUBJ feature of the S constituent on the parse stack. The operations introduced earlier can be used to complete the analysis. Note that the actual mechanism for deciding what operations to do has not yet been discussed, but the effect of the operations is shown here to provide intuition about the data structure. The operation Attach to MAIN-V would remove the lexical entry for ate from the buffer and assign it to the MAIN-V feature in the S on the parse stack. Next the operation Create NP would push an empty NP constituent onto the parse stack, creating the situation in Figure 6.9. Next the two operations Attach to DET Attach to HEAD would successfully build the NP from the lexical entries for the and fish. The input buffer would now be empty. 172 CHAPTER 6 Toward Efficient Parsing 173 The Parse Stack Top -»(NP) The Buffer (S SUBJ (NPDET the HEAD cat) MAIN-V ate) the fish Figure 6.9 After creating an NP The Parse Stack Top -> (S SUBJ (NPDET the HEAD cat) MAIN-V ate) The Buffer (NPDET the HEAD fish) Figure 6.10 After the drop action The operation Drop pops the NP from the parse stack and pushes it back onto the buffer, creating the situation in Figure 6.10. The parser is now in a situation to build the final structure with the operation Attach to OBJ which takes the NP from the buffer and assigns it to the OBJ slot in the S constituent. 
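The create, attach, and drop operations can be pictured with a few lines of code. The sketch below simply replays the steps just described for The cat ate the fish; representing constituents as Python dictionaries and features as keys is an assumption made for illustration only.

# Replaying Figures 6.8-6.10 with three operations on a parse stack and a buffer.
stack = [{"cat": "S", "SUBJ": {"cat": "NP", "DET": "the", "HEAD": "cat"}}]
buffer = ["ate", "the", "fish"]

def create(cat):              # push a new, empty constituent onto the parse stack
    stack.append({"cat": cat})

def attach(role):             # move the first buffer element into a role of the top node
    stack[-1][role] = buffer.pop(0)

def drop():                   # pop the top node and return it to the front of the buffer
    buffer.insert(0, stack.pop())

attach("MAIN-V")              # ate becomes the main verb of the S
create("NP")                  # start building the object NP (the situation in Figure 6.9)
attach("DET")
attach("HEAD")                # the and fish fill out the NP; the buffer is now empty
drop()                        # the completed NP drops back into the buffer (Figure 6.10)
attach("OBJ")                 # and is attached as the object of the S
print(stack[0])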
Three other operations prove very useful in capturing generalizations in natural languages: • Switch the nodes in the first two buffer positions • Insert a specific lexical item into a specified buffer slot • Insert an empty NP into the first buffer slot Each rule has a pattern that contains feature checks on the buffer to determine the applicability of the rule. Rules are organized into packets, which may be activated or deactivated during the parse. Additional actions are available for changing the parser state by selecting which packets to use. In particular, there are actions to • Activate a packet (that is, all its rules are to be used to interpret the next input) • Deactivate a packet Pattern Actions Priority Packet BUILD-AUX: 1. <=AUX, HAVEx=V, pastprt> Attach to PERF 10 2 <=AUX, BEx=V, ing> Attach to PROG 10 3. <=AUX, BE> <=V, pastprt> Attach to PASSIVE 10 4. <=AUX, -t-modal> <=V, inf> Attach to MODAL 10 5. <=AUX, DOx=V, inf> Attach to DO 10 6. Drop 15 Grammar 6.11 The rales for packet BUILD-AUX The active packets are associated with the symbols on the parse stack. If a new constituent is created (that is, pushed on the stack), all the active packets associated with the previous top of the stack become inactive until that constituent is again on the top of the stack. Consequently, the drop action will always deactivate the rules associated with the top node. Packets play a role similar to the states in the shift-reduce parser. Unlike the states in the shift-reduce parser, more than one packet may be active at a time. Thus there would be no need to create packets consisting of the union of packets to deal with word ambiguity, since both can be active simultaneously. Consider the example rules shown in Grammar 6.11, which deals with parsing the auxiliary structure. The pattern for each rule indicates the feature tests lhat must succeed on each buffer position for the rule to be applicable. Thus the pattern <=AUX, HAVE> <=V, pastprt> is true only if the first buffer is an AUX structure with ROOT feature value HAVE, and the second is a V structure with the VFORM feature pastprt. The priority associated with each rule is used to decide between conflicting rules. The lower the number, the higher the priority. In particular, rule 6, with the pattern , will always match, but since its priority is low, it will never be used if another one of the rules also matches. It simply covers the case when none of the rules match, and it completes the parsing of the auxiliary and verb structure. Figure 6.12 shows a parse state in which the state BUILD-AUX is active. It contains an AUXS structure on the top of the stack with packet BUILD-AUX active, and an S structure below with packets PARSE-AUX and CPOOL that will become active once the AUXS constituent is dropped into the buffer. Given this situation and the rules in Figure 6.11, the parser's next action is determined by seeing which rules match. Rules 1 and 6 succeed, and 1 is chosen because of its higher priority. Applying the actions of rule 1 produces the state in Figure 6.13. Now the rules in BUILD-AUX are applied again. This time only rule 6 succeeds, so the next action is a drop, creating the state in Figure 6.14. At this stage the rules in packets PARSE-AUX and CPOOL are active; they compete to determine the next move of the parser (which would be to attach the AUXS structure into the S structure). 
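Rule selection among the active packets amounts to collecting every rule whose buffer pattern matches and taking the one with the lowest priority number. The sketch below illustrates this with a two-rule stand-in for packet BUILD-AUX; the lowercase feature names and the dictionary encoding of buffer elements are simplifications of mine, not the parser's actual representation.

# Choosing among matching rules by priority (a simplified stand-in for Grammar 6.11).
def matches(pattern, buffer):
    # each pattern element is a set of feature tests on one buffer position
    if len(buffer) < len(pattern):
        return False
    return all(all(position.get(f) == v for f, v in tests.items())
               for tests, position in zip(pattern, buffer))

BUILD_AUX = [
    # rule 1: <=AUX, HAVE> <=V, pastprt>  ->  Attach to PERF   (priority 10)
    ([{"cat": "AUX", "root": "HAVE"}, {"cat": "V", "vform": "pastprt"}], "attach PERF", 10),
    # rule 6: the empty pattern always matches  ->  Drop        (priority 15)
    ([], "drop", 15),
]

def select(active_packets, buffer):
    candidates = [(priority, action) for packet in active_packets
                  for pattern, action, priority in packet if matches(pattern, buffer)]
    return min(candidates)[1] if candidates else None

buffer = [{"cat": "AUX", "root": "HAVE"},
          {"cat": "V", "root": "SEE", "vform": "pastprt"}]
print(select([BUILD_AUX], buffer))   # 'attach PERF': rule 1 outranks the default drop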
174 CHAPTER 6 Toward Efficient Parsing 175 The Parse Stack Nodes Top -> (AUXS) (S MOOD DECL SUBJ (NP NAME John)) Active Packets (BUILD-AUX) (PARSE-AUX CPOOI.) The Input Buffer (AUX ROOT HAVE (V ROOT SEE (ART ROOT A FORM pres FORM en) NUM {3s}) NUM {3s)) Figure 6.12 A typical state of the parser The Parse Stack Nodes Top -> (AUXS PERF has) (S MOOD DECL (S SUBJ (NP NAME John)) The Input Buffer Active Packets (BUILD-AUX) (PARSE-AUX CPOOL) (V ROOT SEE (ART ROOT A CN ROOT DAY VFORM pastprt) NUM {3s}) NUM {3s}) Figure 6.13 After rule 1 is applied The Parse Stack Nodes Top -»(S MOOD DECL SUBJ (NP NAME John)) The Input Buffer Active Packets (PARSE-AUX CPOOL) (AUXS PERF has) (V ROOT SEE (ART ROOT A VFORM pastprt) NUM {3s}) Figure 6.14 The parse slate after a drop action The lookahead rules are restricted to examining at most the first three buffer elements, with one exception that occurs when an NP subconstituent needs to be constructed. If the NP starts in the second or third buffer position, then while it is being parsed, that position is used as though it were the first buffer. Called attention shifting, this circumstance is the only exception to the three-buffer restriction. With this qualification a rule may still inspect only three buffer positions, but the starting position may be shifted. Attention shifting is restricted so that under no circumstances can a rule inspect a position beyond the first five. One of the more interesting things about this parser is the way it captures many linguistic generalizations in an elegant fashion by manipulating the input buffer. For example, consider what is needed to extend a grammar that parses assertions into one that parses yes/no questions as well. In Chapter 5 you handled yes/no questions by adding a few rules to the grammar that handled the initial AUX and NP, and then connecting back into the grammar for assertions. This present parser actually reuses the rules for the initial subject and auxiliary as well. In particular, it uses the switch action to reorder the subject NP and the AUX in the buffer, directly capturing the intuition of subject-aux inversion. Psychological Validity Because of the limits on the lookahead and the deterministic nature of this parser, you can examine in detail its limitations as well as its coverage. Some researchers have argued that the limitations of the mechanism itself account for various constraints on the form of language, such as the complex-NP constraint described in Chapter 5. Rather than having to impose such a constraint in the grammar, this parser, because of its limitations, could not operate in any other way. Because of the limited lookahead, this mechanism must commit to certain structural analyses before the entire sentence has been examined. In certain cases a sentence may begin in such a way that the wrong decision is made and the sentence becomes unparsable. These examples were referred to earlier as garden-path sentences. The interesting thing about this phenomenon is that it provides a concrete proposal that can be experimentally validated. In particular, you might ask if the sentence structures with which this parser has difficulty are the same ones with which people have trouble. While the answer is not yet clear, the fact that the question can be asked means that this framework can be investigated experimentally. In essence, the theory is that any sentence that retains an ambiguity over more than a three-constituent window may cause trouble with a reader. 
Note that the lookahead is three constituents, not words; thus an ambiguity might be retained for quite some time without causing difficulty. For example, the following two sentences are identical for the first seven words:

Have the students who missed the exam take it today.
Have the students who missed the exam taken it today?

The ambiguity between being an imperative sentence versus a yes/no question, however, never extends beyond three constituents, because six of the words are in a single noun phrase. Thus the parser will reach a state where the following two tests can easily distinguish the cases:

<=have> <=NP> <=V, VFORM=base> -> imperative
<=have> <=NP> <=V, VFORM=pastprt> -> yes/no question

On the other hand, a sentence such as

Have the soldiers given their medals by their sweethearts.

cannot be disambiguated using only three constituents, and this parser, like most people, will initially misinterpret the sentence as a yes/no question that is ill formed, rather than recognize the appropriate imperative reading corresponding to Have their sweethearts give the soldiers their medals.

6.4 Techniques for Efficient Encoding of Ambiguity

Another way to reduce ambiguity is to change the rules of the game by redefining the desired output. For instance, a significant amount of the ambiguity in sentences results from issues like attachment ambiguity and constructs such as coordination that often have many structural interpretations of essentially the same information. Figure 6.15 shows an example of the alternative interpretations that arise from PP attachment ambiguity in the VP saw the man in the house with a telescope. There are five interpretations arising from the different attachments, each having a different semantic interpretation. Figure 6.16 shows an example of ambiguities in noun-noun modification, such as pot holder adjustment screw, that would arise from a grammar such as

5. NP -> N2
6. N2 -> N
7. N2 -> N2 N2

Figure 6.15 Five interpretations of saw the man in the house with a telescope (five VP trees differing only in where the two PPs attach)

Figure 6.16 Five interpretations of pot holder adjustment screw (five binary N2 bracketings of the four nouns)

These complications of course interact with each other, so a VP with two prepositional phrases and an NP with a sequence of four nouns would have 25 interpretations, and so on. It is not unusual for a moderate-size sentence, say of 12 words, with a reasonable grammar, to have over 1000 different structural interpretations! Clearly some techniques must be introduced to help manage such complications. This section briefly examines techniques for representing such large quantities of interpretations efficiently. In fact, the chart data structure used so far is already a significant step towards this goal, as it allows constituents to be shared across interpretations.

Figure 6.17 The chart for the interpretations of saw the man in the house with a telescope: VP 21 (1,13), VP 20 (1,12), VP 19 (16,8), NP 18 (15,8), VP 17 (14,10), VP 16 (14,7), VP 15 (1,11), VP 14 (1,2), NP 13 (11,8), NP 12 (2,10), NP 11 (2,7), PP 10 (3,9), NP 9 (4,8), PP 7 (3,4), PP 8 (5,6), over the lexical constituents V 1 (saw), NP 2 (the man), P 3 (in), NP 4 (the house), P 5 (with), NP 6 (a telescope)
For example, the PP attachment interpretations shown in Figure 6.15 are represented in the chart shown in Figure 6.17. Each chart entry is numbered, and the subconstituents are listed in parentheses. For example, the NP covering the man in the house is constituent 11 and has subconstituents 2 and 7. While Figure 6.15 uses 32 nonlexical nodes, the chart representation of the same five interpretations uses only 21 nonlexical nodes. While the savings are considerable, and grow as the amount of ambiguity grows, they are often not sufficient to handle the problem. Another technique that is used is called packing. This technique takes advantage of a property that may have occurred to you in looking at Figure 6.17, VP 15(13,10) (14,8) (1,12) VP 14(1,11) (13,7) VP 13(1,2) NP 12(2,10) (11,8) NP 11(2,7) PP 10(3,9) NP 9(4,8) PP 7(3,4) PP 8(5,6) V 1 NP 2 P 3 NP 4 P 5 NP 6 Figure 6.18 A packed chart for the interpretations of saw the man in the house with a telescope namely that there are many constituents of the same type that cover exactly the same input. To a parser without features, each of these constituents would be treated identically by any rule that uses them as a subconstituent. A packed chart representation stores only one constituent of any type over the same input, and any others found are collapsed into the existing one. You can still maintain a record for each derivation found in the single constituent. The chart for the same sentence using the packing technique is shown in Figure 6.18 and uses only 15 nonlexical nodes. Packing produces quite efficient representations of interpretations without losing information. With grammars that use feature equations, however, the situation is more complicated. In particular, two VP interpretations covering the same input but having different feature values cannot be merged, because they might differ in how they combine to form larger constituents. You could modify the feature checking to allow success if any one of the possible constituents meets the requirements, but then the chart might overgenerate interpretations, and you would have to reparse the possible solutions at the end to verify that they are valid. Another way to reduce interpretations is to modify the grammar so that the ambiguities become semantic ambiguities rather than syntactic ambiguities. For example, with noun-noun modification, you might not use rules 5, 6, and 7 mentioned previously, which impose a binary structure onto the noun modifiers, but rather just use a rule that allows a sequence of noun modifiers in a flat structure. In an extended grammar that allows the Kleene + operator, one ruk would suffice: N2 -> N+ 180 CHAPTER 6 Toward Efficient Parsing 181 The N+ notation allows one or more repetitions of the symbol N. Without the + notation, the grammar would need a series of rules, up to some fixed number of modifiers that you would allow, such as N2 -> N N2 -> N N N2 N N N N2-4 NNNN While such a grammar looks more complicated, it is much more efficient to parse, as the structural ambiguity is eliminated. In particular, there is only one interpretation of the NP pot holder adjustment screw. Similar approaches have been suggested for attachment ambiguity problems. Rather than generating all interpretations, a single interpretation is constructed, say the one where the PP takes the widest scope. This is sometimes called the canonical interpretation. 
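Returning to packing, one way to picture the technique is as a table keyed on a constituent's category and span, with every derivation recorded under the single entry for that key. The data structure below is an illustrative sketch of mine, not the book's implementation.

# A sketch of a packed chart: one entry per (category, start, end), many derivations inside it.
from collections import defaultdict

chart = defaultdict(list)            # (category, start, end) -> list of derivations

def add(category, start, end, subconstituents):
    # identical constituents over the same span collapse into one packed entry
    chart[(category, start, end)].append(subconstituents)

# Two ways of building the VP over words 0-9 of "saw the man in the house with a telescope":
add("VP", 0, 9, [("VP", 0, 6), ("PP", 6, 9)])   # [saw the man in the house] [with a telescope]
add("VP", 0, 9, [("V", 0, 1), ("NP", 1, 9)])    # [saw] [the man in the house with a telescope]

print(len(chart))                    # 1: a single packed VP node ...
print(len(chart[("VP", 0, 9)]))      # 2: ... recording both derivations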
The semantic interpreter then needs to be designed so that it can attach the PP lower in the tree, as seems appropriate based on the semantic constraints. Another approach is called the D-theory or description-theory approach. In D-theory, the meaning of the output of the parser is modified. In the traditional view, used throughout this book, constituents are built by defining their immediate subconstituents. The mother constituent immediately dominates its sub-constituents. D-theory defines a transitive relation dominates as follows: A constituent M dominates a constituent C if C is an immediate subconstituent of M, of a subconstituent of a subconstituent of M, and so on. With this new terminology, you can say that the output of a traditional parser is a set of immediate dominance relationships that define the parse tree. In D-theory the output is taken as a set of dominance relationships. Thus, for example, the analysis might say that VP1 dominates PP1. This means that PP1 may be a subconstituent of VP1 but also allows that it might be a subconstituent of a subconstituent of VP1. Thus by changing the nature of the output, many forms of ambiguity can be captured concisely. 6.5 Partial Parsing The most radical approach to the ambiguity problem is to give up on producing complete parses of sentences and only look for fragments that can be reliably identified. A simple way to approximate such a parser would be to remove rules from a general grammar that lead to ambiguity problems. For instance, if you remove the rules VP -> VPPP NP -> NPPP from a grammar, the PP attachment problem basically disappears. Of course, you also cannot produce a full parse for most sentences. With this simplified gram- 1 1 VP 7(1,2) PP 8(3,4) PP 9(5.6) V 1 NP 2 P. 3 NP 4 P 5 NP 6 Figure 6.19 Chart for saw the man in the home with a telescope showing partial analysis mar, a bottom-up chart parser would construct a sequence of syntactic fragments of the sentence that could be useful for later processing. Figure 6.19 shows the final chart for the verb phrase saw the man in the park with a telescope using a grammar with the PP modifier rules removed. While much of the global structure is missing, all the local structure is unambiguously identified and the chart contains a lot of useful syntactic information. Of course, this approach in itself doesn't solve the problem; rather, it delays it until the semantic-interpretation phase. But it may be that the ambiguity problem is easier to solve at the semantic level, as there are more sources of information available. Certainly in limited domain systems this appears to be the case. But a detailed discussion of this must wait until semantic interpretation is introduced. These issues will be developed further in Chapter 10. For the moment, consider how much structure can be identified reliably. From experience in building such systems, we know that certain structures can be identified quite reliably. These include the noun group—consisting of noun phrases from the initial determiner through prenominal modifiers to the head noun, but not including postmodifiers such as prepositional phrases the verb group—consisting of the auxiliary verb sequence, some adverbials, up to the head verb proper noun phrases—including simple names such as John as well as more complex names like the New York Times. This is especially easy if the input follows standard capitalization conventions because, with capitalization, names can be parsed even if they are not in the lexicon. 
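A noun-group recognizer of the kind just described can be little more than a scan for maximal runs of determiners, prenominal modifiers, and nouns, assuming that the part of speech of each word is already given. The sketch below is such a simplification of mine; the tag set and the requirement that a group contain a head noun are illustrative assumptions.

# A finite-state style scan for noun groups over pre-tagged input (illustrative only).
NOUN_GROUP_TAGS = {"ART", "ADJ", "N"}

def noun_groups(tagged):
    # tagged is a list of (word, tag) pairs; returns maximal noun-group spans (start, end)
    spans, i = [], 0
    while i < len(tagged):
        if tagged[i][1] in NOUN_GROUP_TAGS:
            j = i
            while j < len(tagged) and tagged[j][1] in NOUN_GROUP_TAGS:
                j += 1                          # extend the run as far as possible
            if any(tag == "N" for _, tag in tagged[i:j]):
                spans.append((i, j))            # keep only runs that contain a head noun
            i = j
        else:
            i += 1
    return spans

sentence = [("the", "ART"), ("man", "N"), ("in", "P"), ("the", "ART"),
            ("house", "N"), ("with", "P"), ("a", "ART"), ("telescope", "N")]
print(noun_groups(sentence))   # [(0, 2), (3, 5), (6, 8)]: the man, the house, a telescope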
It might appear that other structures could be identified reliably as well, such as prepositional phrases, and verb phrases consisting of obligatory sub-categorized constituents. Such phrases can be generated, but they may not be truly part of the full parse of the sentence because of the limitations in handling noun phrases. In particular, a partial parser might suggest that there is a prepositional phrase by the leader in the sentence We were punished by the leader of the group. But the full parse of the sentence would not contain such a PP, as it would have a PP with the object of the preposition being the leader of the group. Since the partial parser is not able to make attachment decisions, it cannot create the appropriate reading. This leaves you with a choice. Some systems identify the 182 CHAPTER 6 Toward Effieient Pacing 183 (PRO "We") (VP (V "saw") (NP "the house boats")) (PP (P "near") (NP "the lake")) (V "sink") (?? "unexpectedly") (PP (P "at") (NP "dawn")) Figure 6.20 A partial parse of We saw the home boats near the lake sink unexpectedly at dawn. prepositions but leave the PPs unanalyzed, while others construct the incorrect interpretation and depend on the semantic interpretation process to correct it. Because of the limitations of their coverage, partial parsing systems can be based on regular grammars (or equivalently, finite state machines) rather than using the full power of context-free grammars. They also often depend on a subsystem to accurately predict the correct part of speech for each word. Such part-of-speech tagging systems are discussed in the next chapter. For the moment, you can simply assume that the correct part of speech is given with the input. The output of such systems is then a sequence of syntactic fragments, some as small as the lexical category of a function word, and others fairly complex verb sequences and NPs. When faced with an unknown word, the partial parser simply leaves it unanalyzed and moves on. Even with a limited grammar there is potential ambiguity, and partial parsers generally use heuristics to reduce the possibilities. One popular technique is to favor longer constituents over shorter ones of the same type. The heuristic would rather interpret The house boats as one NP than two {The house and boats), even though the latter would be possible. Figure 6.20 shows the sequence of constituents that might be built for such a system given the input We saw the house boats near the lake sink unexpectedly at dawn. Summary There are several ways to build efficient parsers. Some methods improve the efficiency of search-based parsing models, while others specify models of parsing that are inherently deterministic or only partially analyze the sentences. The methods use several different techniques: • encoding ambiguity within parse states • using lookahead techniques to choose appropriate rules • using efficient encoding techniques for encoding ambiguity • changing the definition of the output so that it ignores ambiguity Related Work and Further Readings There is a large literature in psycholinguistics on parsing preferences, starting with the work by Kimball (1973). Lexical preferences were suggested by Ford, Bresnan, and Kaplan (1982). An excellent collection of recent work in the field is Clifton, Frazier, and Rayner (in press). Some of the articles in this volume focus on exactly the parsing principles and preferences discussed in Section 6.1. 
The researchers seem to be converging on the conclusion that the human processing system takes into account the frequency with which different words and structures appear in different syntactic environments, and then evaluates the most likely possibilities using semantic information and knowledge of the discourse context. In this view, principles such as minimal attachment and right association provide useful descriptions of the preferences, but the preferences actually arise because of other factors. Computational models that have explicitly involved parsing preferences include Shieber (1984), Pereira (1985), and Schubert (1986). Additional computational models will be discussed in Chapter 7. As mentioned in the text, shift-reduce parsers have traditionally been used for parsing programming languages with unambiguous grammars. A good text discussing this work is Aho et al. (1986), who would classify the shift-reduce parser described in Section 6.2 as an LR(1) parser (that is, a left-to-right parser using a one-symbol lookahead). These techniques have been generalized to handle ambiguity in natural language by various techniques. Shieber (1984) introduces the techniques, described in Section 6.2, that generalize the treatment of terminal symbols to delay lexical classification of words, and that use parsing preferences to select between rules when more than one rule appears to be applicable. Tomita (1986) reincorporates a full search with an optimized parse-tree representation while taking advantage of the speed of the oracle-driven shift-reduce parsing. The section on deterministic parsing is based on Marcus (1980). Charniak (1983) developed a variant in which the parse states were automatically generated from a context-free grammar. The most comprehensive grammar within this formalism in the Fidditch parser, of which it is hard to find a description in the literature, but see Hindle (1989) for a brief description. Berwick (1985) used this framework in an investigation of language learning. The technique of packing is used in many systems. Two recent examples are Tomita (1986) and Alshawi (1992). The extent of ambiguity in natural language is explored systematically by Church and Patil (1982), who discuss techniques such as canonical forms and packing in detail. D-theory is described in Marcus et al. (1983) and has been formalized for use within the TAG formalism by Vijay-Shankar (1992). Fidditch (Hindle, 1989) is one of the best examples of a parser that only analyzes structure that can be reliably identified. Its output is a set of fragments rather than a complete interpretation. 184 CHAPTER 6 Toward Efficient Parsing 185 Exercises for Chapter 6 1. (easy) Give a trace of the shift-reduce parser using the oracle in Figure 6.6 as it parses the sentence The dog was eating the bone. 2. (medium) List and define the three parsing strategies that this text says people use. Discuss how the following sentences are ambiguous, and state which reading is preferred in the framework of these parsing strategies. a It flew past the geese over the field. b. The artist paints the scene in the park. c. He feels the pain in his heart. 3. (medium) Build an oracle for the following grammar: S -> NP VP NP -> ART N NP -» ART N PP VP -> V NP VP -> V NP PP Identify places where there is an ambiguity that the technique described cannot resolve, and use the heuristics at the end of Section 6.2 to select an unambiguous action for the oracle. 
Give an example of a sentence that will be incorrectly parsed because of this decision. 4. (medium) a Construct the oracle for the following VP grammar involving to-infinitives and prepositions: VP -> V NP VP -> VINF VP -» VPP INF -» to VP pp _> PNP NP -> DETN NP -> NAME b. Can the oracle correctly analyze both of the following VPs without resorting to guessing? Trace the parse for each sentence starting at the word walked. Jack walked to raise the money. Jack walked to the store. Consider allowing word ambiguity in the grammar. In particular, the word sand can be either a noun or a verb. What does your parser do with the VPs 5. (medium) Design a deterministic grammar that can successfully parse both I know 7. 8. and I know that boy I NP that is I V boys I NP lazy. I ADJ are lazy. I I V ADJ COMPLEMENTIZER You may use as much of the grammar presented in this chapter as you wish. Trace the parser on each of these sentences in the style found in this chapter. Describe the important points of your analysis. Discuss how general your solution is in dealing with the various uses of the word that. Show at least one further example of a sentence involving that and outline how your grammar accounts for it (them). (medium) Modify an existing chart parser so that it uses a packed chart, as described in Section 6.4. Perform a set of experiments with a grammar without features and compare the size of the chart with and without packing. (medium) Construct an example that shows what problems arise when you use a packed chart with a grammar with features. Devise a strategy for dealing with the problems. (medium) Assume a chart parser using a packed representation. Develop an upper limit on the number of operations the parser will have to do to parse a sentence of length S using a grammar that uses N nonterminals and T terminal symbols. Jack turned to sand. Jack turned to sand the board. CHAPTER Ambiguity Resolution: Statistical Methods 7.1 Basic Probability Theory 7.2 Estimating Probabilities 7.3 Part-of-Speech Tagging 7.4 Obtaining Lexical Probabilities 7.5 Probabilistic Context-Free Grammars 7.6 Best-First Parsing 7.7 A Simple Context-Dependent Best-First Parser Ambiguity Resolution: Statistical Methods 189 Chapter 6 suggested employing heuristics to choose between alternate syntactic analyses. Creating such heuristics, however, is difficult and time consuming. Furthermore, there is no systematic method for evaluating how well the heuristic rules work in practice. This section explores some techniques for solving these problems based on probability theory. Such approaches have become very popular in the past few years because large databases, or corpora, of natural language data have become available. This data allows you to use statistically based techniques for automatically deriving the probabilities needed. The most commonly available corpus, the Brown corpus, consists of about a million words, all labeled with their parts of speech. More recently, much larger databases have become available in formats ranging from raw text format (as in databases of articles from the AP news wire and the Wall Street Journal) to corpora with full syntactic annotation (such as the Penn Treebank). The availability of all this data suggests new analysis techniques that were previously not possible. Section 7.1 introduces the key ideas in probability theory and explores how they might apply to natural language applications. 
Section 7.2 considers some techniques for estimating probabilities from corpora, and Section 7.3 develops techniques for part-of-speech tagging. Section 7.4 defines the technique for computing lexical probabilities. Section 7.5 introduces probabilistic context-free grammars, and Section 7.6 explores the issue of building a probabilistically driven best-first parser. The final section introduces a context-dependent probability estimate that has significant advantages over the context-free method. 7.1 Basic Probability Theory Intuitively, the probability of some event is the likelihood that it will occur. A probability of 1 indicates that the event is certain to occur, while a probability of 0 indicates that the event definitely will not occur. Any number between 0 and 1 indicates some degree of uncertainty. This uncertainty can be illuminated by considering the odds of the event occurring, as you would if you were going to bet on whether the event will occur or not. A probability of .5 would indicate that the event is equally likely to occur as not to occur, that is, a "50/50" bet. An event with probability .5 would occur exactly half of the time. An event with probability .1 would occur once in every 10 opportunities (1/10 odds), whereas an event with probability .75 would occur 75 times out of 100 (3/4 odds). More formally, probability can be defined in terms of a random variable, which may range over a predefined set of values. While random variables may range over infinite sets and continuous values, here we will use only random variables that range over a finite set of values. For instance, consider tossing a coin. The random variable TOSS representing the result of a coin toss would range over two possible values: heads (h) or tails (t). One possible event is that the coin comes up heads—TOSS=h; the other is that the coin comes up tails— TOSS=t, No other value for TOSS is possible, reflecting the fact that a tossed coin always comes up either heads or tails. Usually, we will not mention the 190 CHAPTER 7 Ambiguity Resolution: Statistical Methods 191 random variable and will talk of the probability of TOSS=h simply as the probability of the event h. A probability function, PROB, assigns a probability to every possible value of a random variable. Every probability function must have the following properties, where ei,en are the possible distinct values of a random variable E: 1. PROB(e[)>0, for alii 2. PROB(&i) < 1, for all i 3. £i=lin«?0B(ei) = l Consider a specific example. A particular horse, Harry, ran 100 races in his career. The result of a race is represented as a random variable R that has one of two values, Win or Lose. Say that Harry won 20 times overall. Thus the probability of Harry winning the race is PROB(R=Win) = .2 and the probability of him losing is PROB(R=Lose) = .8. Note that these values satisfy the three constraints just defined. Of course, there may be many different random variables, and we are often interested in how they relate to each other. Continuing the racing example, consider another random variable W representing the state of the weather and ranging over the values Rain and Shine. Let us say that it was raining 30 times out of the 100 times a race was run. Of these 30 races, Harry won 15 times. Intuitively, if you were given the fact that it was raining—that is, that W=Rain— the probability that R=Win would be .5 (15 out of 30). 
This intuition is captured by the concept of conditional probability and is written as PROB (Win I Rain), which is often described as the probability of the event Win given the event Rain. Conditional probability is defined by the formula PROB(e\ e') = PROB(e & e') / PROB(e') where PROB(e & e') is the probability of the two events e and e' occurring simultaneously. You know that PROB(Rain) = .3 wdPROB(Win & Rain) = .15, and using the definition of conditional probability you can compute PROB(Wm I Rain) and see that it agrees with your intuition: PROB (Win I Rain) =PROB (Win & Rain) I PROB (Rain) = .15/.30 = .5 An important theorem relating conditional probabilities is called Bayes' rule. This rule relates the conditional probability of an event A given B to the -conditional probability of B given A: PROB (A I B) PROB(B\ A) * PROB(A) PROB(B) We can illustrate this rule using the race horse example. Using Bayes' rule we can compute the probability that it rained on a day that Harry won a race: PÄOS(RainlWin) : (PROB (Win I Rain) * PÄOß(Rain)) / PROB (Win) = (.5 * .3)/.2 = .75 which, of course, is the same value as if we calculated the conditional probability directly from its definition: PROB (Rain I Win) = PROB (Rain & Win) / PROB (Win) = .15/. 20 = .75 The reason that Bayes' rule is useful is that we usually do not have complete information about a situation and so do not know all the required probabilities. We can, however, often estimate some probabilities reasonably and then use Bayes' rule to calculate the others. Another important concept in probability theory is the notion of independence. Two events are said to be independent of each other if the occurrence of one does not affect the probability of the occurrence of the other. For instance, in the race horse example, consider another random variable, L, that indicates whether I took my lucky rabbit's foot to the race (value Foot) or not (Empty). Say I took my rabbit's foot 60 times, and Harry won 12 races. This means that the probability that Harry won the race on a day that I took the rabbit's foot is PROB(Win I Foot) = 12/60 = .2. Since this is the same as the usual probability of Harry winning, you can conclude that winning the race is independent of taking the rabbit's foot. More formally, two events A and B are independent of each other if and only if PROB(A\B) = PROB(A) which, using the definition of conditional probability, is equivalent to saying PROB (A & B) = PROB(A) * PROB(B) Note that the events of winning and raining are not independent, given that PROB(Wm & Rain) = .15 while PROB (Win) * PKOB(Rain) = .2 * .3 = .06. In other words, winning and raining occur together at a rate much greater than random chance. Consider an application of probability theory related to language, namely part-of-speech identification: Given a sentence with ambiguous words, determine the most likely lexical category for each word. This problem will be examined in detail in Section 7.3. Here we consider a trivial case to illustrate the ideas underlying probability theory. Say you need to identify the correct syntactic category for words that can be either nouns or verbs. This can be formalized using two random variables: one, C, that ranges over the parts of speech (N and V), and another, W, that ranges over all the possible words. Consider an example when W = flies. The problem can be stated as determining whether PROB(C=N I W'-flies) or PROB(C=V I W=flies) is greater. 
1Mote that, as mentioned earlier, the 192 CHAPTER 7 Ambiguity Resolution: Statistical Methods 193 random variables will usually be omitted from the formulas. Thus PROB(C=N I 1W=flies) will usually be abbreviated as PROB(N I flies). Given the definition of conditional probability, the probabilities for the categories of the word flies are calculated as follows: PROB (N I flies) = PROB(flies & N) / PROB (flies ) PROB(V I flies) = PROB(flies & V) / PROB (flies) So the problem reduces to finding which of PROB (flies & N) and PROB (flies & V) is greater, since the denominator is the same in each formula. How might you obtain these probabilities? You clearly cannot determine the true probabilities since you don't have a record of all text ever written, let alone the text that will be processed in the future. But if you have a large enough sample of data, you can estimate the probabilities. Let's say we have a corpus of simple sentences containing 1,273,000 words. Say we find 1000 uses of the word flies, 400 of them in the N sense, and 600 in the V sense. We can approximate the probabilities by looking at the ratios of the frequencies of the words in the corpus. For example, the probability of a randomly selected word being the word flies is PROB(flies) = 1000 / 1,273,000 = .0008 The joint probabilities for flies as a noun and flies as a verb are PROBiflies & N) = 400 / 1,273,000 = .0003 PROB(flies & V) = 600 / 1,273,000 = .0005 Thus our best guess is that each use of flies is a verb. We can compute the probability that flies is a verb using the formula PROB(\ I flies) = PROB(V & flies) I PROB(flies) = .0005 / .0008 = .625 Using this method, an algorithm that always asserts flies to be a verb will be correct about 60 percent of the time. This is clearly a poor strategy, but better than guessing that it is a noun all the time! To get a better method, you must consider more context, such as the sentence in which flies occurs. This idea is developed in Section 7.3. 7.2 Estimating Probabilities If you have all the data that would ever be relevant to a problem, you can compute exact probabilities for that data. For instance, say Hairy, the horse in the last section, ran only 100 races and then was put out to pasture. Now you can compute the exact probability of Harry winning any particular race someone might choose of the 100 possibilities. But, of course, this is not how probability is generally used. Typically, you want to use probability to predict future behavior; that: is, you'd use information on Harry's past performance to predict how likely he is Results HH HT TH TT Estimate of ProbQJ) 1.0 5 5 0 Acceptable Estimate? NO YES YES NO Figure 7.1 Possible estimates of Prob(H) given two flips to win his 101st race (which you are going to bet on). This is a real-life application of probability theory. You are in the same position when using probabilities to help resolve ambiguity, since you are interested in parsing sentences that have never been seen before. Thus you need to use data on previously occurring sentences to predict the interpretation of the next sentence. We will always be working with estimates of probability rather than the actual probabilities. As seen in the last section, one method of estimation is to use the ratios from the corpus as the probability to predict the interpretation of the new sentence. Thus, if we have seen the word flies 1000 times before, and 600 of them were as a verb, then we assume that PROB(V I flies) is .6 and use that to guide our guess with the 1001st case. 
This simple ratio estimate is called the maximum likelihood estimator (MLE). If you have many examples of the events you are estimating, this can be a very reliable estimate of the true probability. In general, the accuracy of an estimate increases as the amount of data used expands, and there is a theorem of statistics—the law of large numbers—that states that estimates can be made as accurate as desired if you have unlimited data. Estimates can be particularly unreliable, however, when only a small number of samples are involved. Consider trying to estimate the true probability of a fair coin coming up heads when it is flipped. For the sake of discussion, since we know the actual answer is .5, let us say the estimate is accurate enough if it falls between .25 and .75. This range will be called the acceptable margin of error. If you only do two trials, there are four possible outcomes, as shown in Figure 7.1: two heads, heads then tails, tails then heads, or two tails. This means that, if you flip a coin twice, half the time you will obtain an estimate of 1/2 or .5. The other half of the time, however, you will estimate that the probability of coming up heads is 1 (the two-heads case) or 0 (the two-tails case). So you have only a 50 percent chance of obtaining an estimate within the desired margin of error. With three flips, the chance of getting a reliable enough estimate jumps to 75 percent, as shown in Figure 7.2. With four flips, there is an 87.5 percent chance of the estimate being accurate enough, with eight flips a 93 percent chance, with 12 flips a 95 percent chance, and so on. No matter how long you flip, there is always a chance that the estimate found is inaccurate, but you can reduce the probability of this occurring to as small a number as you desire if you can do enough trials. 194 CHAPTER 7 Ambiguity Resolution: Statistical Methods 195 Results Estimate of Prob(H) Acceptable Estimate? HHH 1.0 NO HHT .66 YES HTH .66 YES HTT .33 YES THH .66 YES THT .33 YES TTH .33 YES TTT 0 NO Figure 1.2 Possible estimates of Prob(H) given three flips Almost any method of estimation works well when there is a lot of data. Unfortunately, there are a vast number of estimates needed for natural language applications, and a large proportion of these events are quite rare. This is the problem of sparse data. For instance, the Brown corpus contains about a million words, but due to duplication there are only 49,000 different words. Given this, you might expect each word to occur about 20 times on average. But over 40,000 of the words occur five times or less. With such few numbers, our estimates of the part of speech for such words may be highly inaccurate. The worst case occurs if a low-frequency word does not occur at all in one of its possible categories. Its probability in this category would then be estimated as 0; thus no interpretation using the word in any category would be possible, because the probability of the overall sentence containing the word would be 0 as well. Unfortunately, rare events are common enough in natural language applications that reliable estimates for these low-frequency words are essential for Ihc algorithms. There are other estimation techniques that attempt to address the problem of estimating probabilities of low-frequency events. To examine them, let's introduce a framework in which they can be compared. For a particular random variable X, all techniques start with a set of values, Vj, computed from the count of the number of times X = xj. 
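The percentages quoted above can be checked by enumerating every possible outcome sequence for a given number of flips and counting how many of them yield an estimate within the acceptable margin of error. The short computation below does exactly that; it reproduces the 50, 75, 87.5, and 93 percent figures and is a check of mine, not part of the text.

# Fraction of outcome sequences whose estimate of PROB(H) falls within .25-.75.
from itertools import product

def chance_of_acceptable_estimate(flips):
    outcomes = list(product("HT", repeat=flips))
    acceptable = sum(1 for o in outcomes if 0.25 <= o.count("H") / flips <= 0.75)
    return acceptable / len(outcomes)

for n in (2, 3, 4, 8):
    print(n, chance_of_acceptable_estimate(n))   # prints 0.5, 0.75, 0.875, and about 0.93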
The maximum likelihood estimation technique uses Vj = Ixj I; that is, Vj is exactly the count of the number of times X = xj . Once Vj is determined for each xj, the probability estimates are obtained by the formula PROB(X = x;) = Vi/ZjVi Note that the denominator guarantees that the estimates obey the three properties of a probability function defined in Section 7.1. One technique to solve the zero probability problem is to ensure that no Vj has the value 0. We might, for instance, add a small amount, say .5. to every count, such as in Vj — Ixjl -t- .5. This guarantees no zero probabilities yet retains the relative likelihoods for the frequently occurring values. This estimation technique is called the expected likelihood estimator (ELE). To see the difference between this and the MLE, consider a word w that happens not to occur in the corpus, and consider estimating the probability that w occurs in one of 40 word classes Li,L40. Thus we have a random variable X, where X = Xj only if w appears in word category Lj. The MLE for PROB(X = xj) will not be defined because the formula has a zero denominator. The ELE, however, gives an equally likely probability to each possible word class. With 40 word classes, for instance, each Vj will be .5, and thus PROB(L{ I w) = .5/20 = .025. This estimate better reflects the fact that we have no information about the word. On the other hand, the ELE is very conservative. If w appears in the corpus five times, once as a verb and four times as a noun, then the MLE estimate of PROB(N I w) would be .8, while the ELE estimate would be 4.5/25 = .18, a very small value compared to intuition. Evaluation Once you have a set of estimated probabilities and an algorithm for some particular application, you would like to be able to tell how well your new technique performs compared with other algorithms or variants of your algorithm. The general method for doing this is to divide the corpus into two parts: the training set and the test set. Typically, the test set consists of 10-20 percent of the total data. The training set is then used to estimate the probabilities, and the algorithm is run on the test set to see how well it does on new data. Running the algorithm on the training set is not considered a reliable method of evaluation because it does not measure the generality of your technique. For instance, you could do well on the training set simply by remembering all the answers and repeating them back in the test! A more thorough method of testing is called cross-validation,' which involves repeatedly removing different parts of the corpus as the test set, training on the remainder of the corpus, and then evaluating on the new test set. This technique reduces the chance that the test set selected was somehow easier than you might expect. 7.3 Part-of-Speech Tagging Part-of-speech tagging involves selecting the most likely sequence of syntactic categories for the words in a sentence. A typical set of tags, used in the Penn Treebank project, is shown in Figure 7.3. In Section 7.1 you saw the simplest algorithm for this task: Always choose the interpretation that occurs most frequently in the training set. Surprisingly, this technique often obtains about a 90 percent success rate, primarily because over half the words appearing in most corpora are not ambiguous. So this measure is the stalling point from which to evaluate algorithms that use more sophisticated techniques. Unless a method does significantly better than 90 percent, it is not working very well. 
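Returning to the two estimators defined above, both can be written in a few lines. The sketch below applies them to the example of a word seen five times, once as a verb and four times as a noun, with 40 word classes in all; the class labels L3 through L40 are placeholders of mine.

# Maximum likelihood (MLE) versus expected likelihood (ELE) estimates from raw counts.
def mle(counts):
    total = sum(counts.values())
    return {x: v / total for x, v in counts.items()}

def ele(counts):
    adjusted = {x: v + 0.5 for x, v in counts.items()}   # add .5 to every count
    total = sum(adjusted.values())
    return {x: v / total for x, v in adjusted.items()}

# A word seen 5 times: once as a verb, four times as a noun, never in the other 38 classes.
counts = {"N": 4, "V": 1, **{f"L{i}": 0 for i in range(3, 41)}}
print(round(mle(counts)["N"], 2))   # 0.8
print(round(ele(counts)["N"], 2))   # 0.18  (4.5 / 25, as in the text)
print(ele(counts)["L3"])            # 0.02: unseen classes no longer get probability 0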
196 chapter 7 Ambiguity Resolution: Statistical Methods 197 1. CC Coordinating conjunction 19. PP$ Possessive pronoun 2. CD Cardinal number 20. RB Adverb 3. DT Determiner 21. RBR Comparative adverb 4. FX Existential there 22. RBS Superlative Adverb 5. FW Foreign word " 23. RP Particle 6. IN Preposition / subord. conj 24. SYM Symbol (math or scientific) 7. JJ Adjective 25. TO to 8. JJR Comparative adjective 26. UH Interjection 9. JJS Superlative adjective 27. VB Verb, base form 10. LS List item marker 28. VBD Verb, past tense 11. MD Modal 29. VBG Verb, gerund/pres. participle 12. NN Noun, singular or mass 30. VBN Verb, past participle 13. NNS Noun, plural 31. VBP Verb, non-3s, present 14. NNP Proper noun, singular 32. VBZ Verb, 3s, present 15. NNPS Proper noun, plural 33. WDT Wh-determiner 16. PDT Predeterminer 34. WP Wh-pronoun 17. POS Possessive ending 35. WPZ Possessive wh-pronoun 18. PRP Personal pronoun 36. WRB Wh-adverb Figure 7.3 The Penn Treebank tagset The general method to improve reliability is to use some of the local context of the sentence in which the word appears. For example, in Section 7.1 you saw that choosing the verb sense of flies in the sample corpus was the best choice and would be right about 60 percent of the time. If the word is preceded by the word the, on the other hand, it is much more likely to be a noun. The technique developed in this section is able to exploit such information. Consider the problem in its full generality. Let wi,wt be a sequence of words. We want to find the sequence of lexical categories Ci, Cj that maximizes 1. PROB(C},Cj\w\,wt) Unfortunately, it would take far too much data to generate reasonable estimates for such sequences, so direct methods cannot be applied. There are, however, reasonable approximation techniques that produce good results. To develop them, you must restate the problem using Bayes' rule, which says that this conditional probability equals 2. (PROB(Cu CT) *PROB(wi,wT IC i,CT)) / PROBtyy,wT) As before, since we are interested in finding the Cj.....Cn that gives the maximum value, the common denominator in all these cases will not affect the answer. Thus the problem reduces to finding the sequence C], mizes the formula , Cn that maxi- 3. P/?OB(Ci,...,CT)*P/t05(wi,...,wTICi,...,CT) There are still no effective methods for calculating the probability of these long sequences accurately, as it would require far too much data. But the probabilities can be approximated by probabilities that are simpler to collect by making some independence assumptions. While these independence assumptions are not really valid, the estimates appear to work reasonably well in practice. Each of the two expressions in formula 3 will be approximated. The first expression, the probability of the sequence of categories, can be approximated by a series of probabilities based on a limited number of previous categories. The most common assumptions use either one or two previous categories. The bigram model looks at pairs of categories (or words) and uses the conditional probability that a category C; will follow a category Ci_i, written as PROB(C{ I Cj_i). The trigram model uses the conditional probability of one category (or word) given the two preceding categories (or words), that is, PROB(C[ I Q_2 Q_i). These models are called n-gram models, in which n represents the number of words used in the pattern. While the trigram model will produce better results in practice, we will consider the bigram model here for simplicity. 
Using bigrams, the following approximation can be used:

PROB(C1, ..., CT) = Πi=1,T PROB(Ci | Ci-1)

To account for the beginning of a sentence, we posit a pseudocategory ø at position 0 as the value of C0. Thus the first bigram for a sentence beginning with an ART would be PROB(ART | ø). Given this, the approximation of the probability of the sequence ART N V N using bigrams would be

PROB(ART N V N) = PROB(ART | ø) * PROB(N | ART) * PROB(V | N) * PROB(N | V)

The second probability in formula 3, PROB(w1, ..., wT | C1, ..., CT), can be approximated by assuming that a word appears in a category independently of the words in the preceding or succeeding categories. It is approximated by the product of the probability that each word occurs in the indicated part of speech, that is, by

PROB(w1, ..., wT | C1, ..., CT) = Πi=1,T PROB(wi | Ci)

With these two approximations, the problem has changed into finding the sequence C1, ..., CT that maximizes the value of

Πi=1,T PROB(Ci | Ci-1) * PROB(wi | Ci)

The advantage of this new formula is that the probabilities involved can be readily estimated from a corpus of text labeled with parts of speech. In particular, given a database of text, the bigram probabilities can be estimated simply by counting the number of times each pair of categories occurs compared to the individual category counts. The probability that a V follows an N would be estimated as follows:

PROB(Ci=V | Ci-1=N) = Count(N at position i-1 and V at i) / Count(N at position i-1)

Figure 7.4 gives some bigram frequencies computed from an artificially generated corpus of simple sentences. The corpus consists of 300 sentences but has words in only four categories: N, V, ART, and P. In contrast, a typical real tagset used in the Penn Treebank, shown in Figure 7.3, contains about 40 tags. The artificial corpus contains 1998 words: 833 nouns, 300 verbs, 558 articles, and 307 prepositions. Each bigram is estimated using the previous formula. To deal with the problem of sparse data, any bigram that is not listed here will be assumed to have a token probability of .0001.

Category   Count at i   Pair      Count at i,i+1   Bigram           Estimate
ø          300          ø, ART    213              PROB(ART | ø)    .71
ø          300          ø, N      87               PROB(N | ø)      .29
ART        558          ART, N    558              PROB(N | ART)    1
N          833          N, V      358              PROB(V | N)      .43
N          833          N, N      108              PROB(N | N)      .13
N          833          N, P      366              PROB(P | N)      .44
V          300          V, N      75               PROB(N | V)      .35
V          300          V, ART    194              PROB(ART | V)    .65
P          307          P, ART    226              PROB(ART | P)    .74
P          307          P, N      81               PROB(N | P)      .26
Figure 7.4 Bigram probabilities from the generated corpus

The lexical-generation probabilities, PROB(wi | Ci), can be estimated simply by counting the number of occurrences of each word by category. Figure 7.5 gives some counts for individual words from which the lexical-generation probability estimates in Figure 7.6 are computed. Note that the lexical-generation probability is the probability that a given category is realized by a specific word, not the probability that a given word falls in a specific category. For instance, PROB(the | ART) is estimated by Count(# times the is an ART) / Count(# times an ART occurs). The other probability, PROB(ART | the), would give a very different value. Given all these probability estimates, how might you find the sequence of categories that has the highest probability of generating a specific sentence? The brute force method would be to generate all possible sequences that could generate the sentence and then estimate the probability of each and pick the best one.
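Before turning to the search problem, note that the estimates in Figure 7.4 can be used directly to score any candidate category sequence, multiplying one conditional probability per position and falling back to the .0001 token probability for unlisted bigrams. A minimal sketch using the figure's numbers:

# Scoring a category sequence with the bigram estimates of Figure 7.4.
BIGRAMS = {("ø", "ART"): .71, ("ø", "N"): .29, ("ART", "N"): 1.0,
           ("N", "V"): .43, ("N", "N"): .13, ("N", "P"): .44,
           ("V", "N"): .35, ("V", "ART"): .65, ("P", "ART"): .74, ("P", "N"): .26}

def bigram(previous, category):
    return BIGRAMS.get((previous, category), 0.0001)   # unlisted bigrams get the token value

def sequence_probability(categories):
    probability, previous = 1.0, "ø"                   # ø is the pseudocategory at position 0
    for category in categories:
        probability *= bigram(previous, category)
        previous = category
    return probability

print(round(sequence_probability(["ART", "N", "V", "N"]), 3))   # .71 * 1 * .43 * .35 = .107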
The problem with this brute force method is that there are an exponential number of sequences: given N categories and T words, there are N^T possible sequences.

          N     V     ART   P     TOTAL
flies     21    23    0     0     44
fruit     49    5     1     0     55
like      10    30    0     21    61
a         1     0     201   0     202
the       1     0     300   2     303
flower    53    15    0     0     68
flowers   42    16    0     0     58
birds     64    1     0     0     65
others    592   210   56    284   1142
TOTAL     833   300   558   307   1998

Figure 7.5 A summary of some of the word counts in the corpus

PROB(the | ART)    .54       PROB(a | ART)      .360
PROB(flies | N)    .025      PROB(a | N)        .001
PROB(flies | V)    .076      PROB(flower | N)   .063
PROB(like | V)     .1        PROB(flower | V)   .05
PROB(like | P)     .068      PROB(birds | N)    .076
PROB(like | N)     .012

Figure 7.6 The lexical-generation probabilities

Luckily, you can do much better than this because of the independence assumptions that were made about the data. Since we are only dealing with bigram probabilities, the probability that the ith word is in a category Ci depends only on the category of the (i-1)th word, Ci-1. Thus the process can be modeled by a special form of probabilistic finite state machine, as shown in Figure 7.7. Each node represents a possible lexical category, and the transition probabilities (the bigram probabilities in Figure 7.4) indicate the probability of one category following another. With such a network you can compute the probability of any sequence of categories simply by finding the path through the network indicated by the sequence and multiplying the transition probabilities together. For instance, the sequence ART N V N would have the probability .71 * 1 * .43 * .35 = .107. This representation, of course, is only accurate if the probability of a category occurring depends only on the one category before it. In probability theory this is often called the Markov assumption, and networks like that in Figure 7.7 are called Markov chains.

Figure 7.7 A Markov chain capturing the bigram probabilities

The network representation can now be extended to include the lexical-generation probabilities as well. In particular, we allow each node to have an output probability, which gives a probability to each possible output that could correspond to the node. For instance, node N in Figure 7.7 would be associated with a probability table that indicates, for each word, how likely that word is to be selected if we randomly select a noun. The output probabilities are exactly the lexical-generation probabilities shown in Figure 7.6. A network like that in Figure 7.7 with output probabilities associated with each node is called a Hidden Markov Model (HMM). The word hidden in the name indicates that for a specific sequence of words, it is not clear what state the Markov model is in. For instance, the word flies could be generated from state N with a probability of .025 (given the values in Figure 7.6), or it could be generated from state V with a probability of .076. Because of this ambiguity, it is no longer trivial to compute the probability of a sequence of words from the network. If you are given a particular sequence, however, the probability that it generates a particular output is easily computed by multiplying the probabilities on the path times the probabilities for each output. For instance, the probability that the sequence N V ART N generates the output Flies like a flower is computed as follows. The probability of the path N V ART N, given the Markov model in Figure 7.7, is .29 * .43 * .65 * 1 = .081.
The probability of the output being Flies like a flower for this sequence is computed from the output probabilities given in Figure 7.6:

PROB(flies | N) * PROB(like | V) * PROB(a | ART) * PROB(flower | N)
  = .025 * .1 * .36 * .063
  = 5.4 * 10^-5

Multiplying these together gives us the likelihood that the HMM would generate the sentence, 4.37 * 10^-6. More generally, the formula for computing the probability of a sentence w1, ..., wT given a sequence C1, ..., CT is

Πi=1,T PROB(Ci | Ci-1) * PROB(wi | Ci)

Now we can resume the discussion of how to find the most likely sequence of categories for a sequence of words. The key insight is that because of the Markov assumption, you do not have to enumerate all the possible sequences. In fact, sequences that end in the same category can be collapsed together, since the next category only depends on the one previous category in the sequence. So if you just keep track of the most likely sequence found so far for each possible ending category, you can ignore all the other less likely sequences.

For example, consider the problem of finding the most likely categories for the sentence Flies like a flower, with the lexical-generation probabilities and bigram probabilities discussed so far. Given that there are four possible categories, there are 4^4 = 256 different sequences of length four. The brute force algorithm would have to generate all 256 sequences and compare their probabilities in order to find this one. Exploiting the Markov assumption, however, this set of sequences can be collapsed into a representation that considers only the four possibilities for each word. This representation, shown as a transition diagram in Figure 7.8, represents all 256 sequences.

Figure 7.8 Encoding the 256 possible sequences exploiting the Markov assumption

To find the most likely sequence, you sweep forward through the words one at a time, finding the most likely sequence for each ending category. In other words, you first find the four best sequences for the two words flies like: the best ending with like as a V, the best as an N, the best as a P, and the best as an ART. You then use this information to find the four best sequences for the three words flies like a, each one ending in a different category. This process is repeated until all the words are accounted for. This algorithm is usually called the Viterbi algorithm. For a problem involving T words and N lexical categories, it is guaranteed to find the most likely sequence using k*T*N^2 steps, for some constant k, significantly better than the N^T steps required by the brute force search! The rest of this section develops the Viterbi algorithm in detail.

Given word sequence w1, ..., wT, lexical categories L1, ..., LN, lexical probabilities PROB(wt | Li), and bigram probabilities PROB(Li | Lj), find the most likely sequence of lexical categories C1, ..., CT for the word sequence.
Initialization Step
  For i = 1 to N do
    SEQSCORE(i, 1) = PROB(w1 | Li) * PROB(Li | 0)
    BACKPTR(i, 1) = 0

Iteration Step
  For t = 2 to T do
    For i = 1 to N do
      SEQSCORE(i, t) = MAXj=1,N (SEQSCORE(j, t-1) * PROB(Li | Lj)) * PROB(wt | Li)
      BACKPTR(i, t) = index of j that gave the max above

Sequence Identification Step
  C(T) = i that maximizes SEQSCORE(i, T)
  For i = T-1 to 1 do
    C(i) = BACKPTR(C(i+1), i+1)

Figure 7.9 The Viterbi algorithm

The Viterbi Algorithm

We will track the probability of the best sequence leading to each possible category at each position using an N x T array, where N is the number of lexical categories (L1, ..., LN) and T is the number of words in the sentence (w1, ..., wT). This array, SEQSCORE(n, t), records the probability for the best sequence up to position t that ends with a word in category Ln. To record the actual best sequence for each category at each position, it suffices to record only the one preceding category for each category and position. Another N x T array, BACKPTR, will indicate for each category in each position what the preceding category is in the best sequence at position t-1. The algorithm, shown in Figure 7.9, operates by computing the values for these two arrays.

Let's assume you have analyzed a corpus and obtained the bigram and lexical-generation probabilities in Figures 7.4 and 7.6, and assume that any bigram probability not in Figure 7.4 has a value of .0001. Using these probabilities, the algorithm running on the sentence Flies like a flower will operate as follows. The first row is set in the initialization phase using the formula

SEQSCORE(i, 1) = PROB(Flies | Li) * PROB(Li | 0)

where Li ranges over V, N, ART, and P. Because of the lexical-generation probabilities in Figure 7.6, only the entries for a noun and a verb are greater than zero.

Figure 7.10 The results of the first two steps of the Viterbi algorithm

Thus the most likely sequence of one category ending in a V to generate flies has a score of 7.6 * 10^-6, whereas the most likely one ending in an N has a score of .00725. The result of the first step of the algorithm is shown as the left-hand side network in Figure 7.10. The probability of flies in each category has been computed.

The second phase of the algorithm extends the sequences one word at a time, keeping track of the best sequence found so far to each category. For instance, the probability of the state like/V is computed as follows:

PROB(like/V) = MAX(PROB(flies/N) * PROB(V | N), PROB(flies/V) * PROB(V | V)) * PROB(like | V)
             = MAX(.00725 * .43, 7.6 * 10^-6 * .0001) * .1
             = 3.12 * 10^-4

The difference in this value from that shown in Figure 7.10 is simply a result of the fact that the calculation here used truncated approximate values for the probabilities. In other words, the most likely sequence of length two generating Flies like and ending in a V has a score of 3.1 * 10^-4 (and is the sequence N V), the most likely one ending in a P has a score of 2.2 * 10^-4 (and is the sequence N P), and the most likely one ending in an N has a score of 1.3 * 10^-5 (and is the sequence N N). The heavier arrows in Figure 7.10 indicate the best sequence leading up to each node. The computation continues in the same manner until each word has been processed. Figure 7.11 shows the result after the next iteration, and Figure 7.12 shows the final result. The highest probability sequence ends in state flower/N.
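Figure 7.9 translates almost directly into code. The fragment below is a minimal sketch of that translation, not code from the text; it assumes the probabilities are supplied as functions like the bigram and lexgen sketches given earlier.

    def viterbi(words, categories, lexgen, bigram):
        """Return the most likely category sequence for words, where
        lexgen(word, cat) = PROB(word | cat) and bigram(cat, prev) = PROB(cat | prev)."""
        T, N = len(words), len(categories)
        seqscore = [[0.0] * T for _ in range(N)]
        backptr = [[0] * T for _ in range(N)]

        # Initialization step: SEQSCORE(i, 1) = PROB(w1 | Li) * PROB(Li | 0)
        for i, cat in enumerate(categories):
            seqscore[i][0] = lexgen(words[0], cat) * bigram(cat, "0")

        # Iteration step: extend the best sequence ending in each category
        for t in range(1, T):
            for i, cat in enumerate(categories):
                best_j, best = 0, 0.0
                for j, prev in enumerate(categories):
                    score = seqscore[j][t - 1] * bigram(cat, prev)
                    if score > best:
                        best_j, best = j, score
                seqscore[i][t] = best * lexgen(words[t], cat)
                backptr[i][t] = best_j

        # Sequence identification step: trace back from the best final category
        best_final = max(range(N), key=lambda i: seqscore[i][T - 1])
        path = [best_final]
        for t in range(T - 1, 0, -1):
            path.append(backptr[path[-1]][t])
        path.reverse()
        return [categories[i] for i in path]

With the probabilities of Figures 7.4 and 7.6, viterbi(["Flies", "like", "a", "flower"], ["N", "V", "ART", "P"], lexgen, bigram) should return the sequence N V ART N, matching the hand simulation above.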
It is simple to trace back from this final category, flower/N (using BACKPTR(1, 4) and so on), to find the full sequence N V ART N, agreeing with our intuitions.

Algorithms like this can perform effectively if the probability estimates are computed from a large corpus of data that is of the same style as the input to be classified. Researchers consistently report labeling with 95 percent or better accuracy using trigram models. Remember, however, that the naive algorithm picks the most likely category about 90 percent of the time. Still, the error rate is cut in half by introducing these techniques.

7.4 Obtaining Lexical Probabilities

Corpus-based methods suggest some new ways to control parsers. If we had some large corpora of parsed sentences available, we could use statistical methods to identify the common structures in English and favor these in the parsing algorithm. This might allow us to choose the most likely interpretation when a sentence is ambiguous, and might lead to considerably more efficient parsers that are nearly deterministic. Such corpora of parsed sentences are now becoming available. The first issue is what the input would be to such a parser.

BOX 7.1 Getting Reliable Statistics

Given that you need to estimate probabilities for lexical items and for n-grams, how much data is needed for these estimates to be reliable? In practice, the amount of data needed depends heavily on the size of the n-grams used, as the number of probabilities that need to be estimated grows rapidly with n. For example, a typical tagset has about 40 different lexical categories. To collect statistics on a unigram (a simple count of the words in each category), you would only need 40 statistics, one for each category. For bigrams, you would need 1600 statistics, one for each pair. For trigrams, you would need 64,000, one for each possible triple. Finally, for four-grams, you would need 2,560,000 statistics. As you can see, even if the corpus is a million words, a four-gram analysis would result in most categories being empty. For trigrams and a million-word corpus, however, there would be an average of 15 examples per category if they were evenly distributed. While the trigrams are definitely not uniformly distributed, this amount of data seems to give good results in practice.

One technique that is very useful for handling sparse data is called smoothing. Rather than simply using a trigram to estimate the probability of a category Ci at position i, you use a formula that combines the trigram, bigram, and unigram statistics. Using this scheme, the probability of category Ci given the preceding categories C1, ..., Ci-1 is estimated by the formula

PROB(Ci | C1, ..., Ci-1) = λ1 PROB(Ci) + λ2 PROB(Ci | Ci-1) + λ3 PROB(Ci | Ci-2 Ci-1)

where λ1 + λ2 + λ3 = 1. Using this estimate, if the trigram has never been seen before, the bigram or unigram estimates still would guarantee a nonzero estimate in many cases where it is desired. Typically, the best performance will arise if λ3 is significantly greater than the other parameters, so that the trigram information has the most effect on the probabilities. It is also possible to develop algorithms that learn good values for the parameters given a particular training set (for example, see Jelinek (1990)).
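As a concrete illustration of the smoothing scheme described in Box 7.1, the interpolated estimate is a one-line computation once the component estimates are available. This is a sketch only; the weight values and the function names are assumptions, not figures from the text.

    def smoothed_prob(cat, prev2, prev1, unigram, bigram, trigram,
                      l1=0.1, l2=0.3, l3=0.6):
        """Interpolated estimate of PROB(cat | prev2 prev1); l3 should dominate."""
        assert abs((l1 + l2 + l3) - 1.0) < 1e-9    # the weights must sum to 1
        return (l1 * unigram(cat)
                + l2 * bigram(cat, prev1)
                + l3 * trigram(cat, prev2, prev1))

Even when trigram(cat, prev2, prev1) is zero because the triple never occurred in the training data, the bigram and unigram terms keep the estimate nonzero.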
One simple approach would be to use a part-of-speech tagging algorithm from the last section to select a single category for each word and then start the parse with these categories. If the part-of-speech tagging is accurate, this will be an excellent approach, because a considerable amount of lexical ambiguity will be eliminated before the parser even starts. But if the tagging is wrong, it will prevent the parser from ever finding the correct interpretation. Worse, the parser may find a valid but implausible interpretation based on the wrongly tagged word and never realize the error. Consider that even at 95 percent accuracy, the chance that every word is correct in a sentence consisting of only 8 words is .67, and with 15 words it is .46, less than half. Thus the chances of this approach working in general look slim.

PROB(ART | the)   = .99       PROB(N | like)    = .16
PROB(N | flies)   = .48       PROB(V | like)    = .49
PROB(V | flies)   = .52       PROB(P | like)    = .34
PROB(ART | a)     = .995      PROB(N | flower)  = .78
PROB(N | a)       = .005      PROB(V | flower)  = .22

Figure 7.13 Context-independent estimates for the lexical categories

A more appropriate approach would be to compute the probability that each word appears in the possible lexical categories. If we could combine these probabilities with some method of assigning probabilities to rule use in the grammar, then we could develop a parsing algorithm that finds the most probable parse for a given sentence. You already saw the simplest technique for estimating lexical probability: counting the number of times each word appears in the corpus in each category. Then the probability that word w appears in a lexical category Lj out of possible categories L1, ..., LN could be estimated by the formula

PROB(Lj | w) = Count(Lj & w) / Σi=1,N Count(Li & w)

Using the data shown in Figure 7.5, we could derive the context-independent probabilities for each category and word shown in Figure 7.13. As we saw earlier, however, such estimates are unreliable because they do not take context into account. A better estimate would be obtained by computing how likely it is that category Li occurred at position t over all sequences given the input w1, ..., wt. In other words, rather than searching for the one sequence that yields the maximum probability for the input, we want to compute the sum of the probabilities for the input from all sequences. For example, the probability that flies is a noun in the sentence The flies like flowers would be calculated by summing the probability of all sequences that end with flies as a noun. Given the transition and lexical-generation probabilities in Figures 7.4 and 7.6, the sequences that have a nonzero value would be

The/ART flies/N    9.58 * 10^-3
The/N flies/N      1.13 * 10^-6
The/P flies/N      4.55 * 10^-9

which add up to 9.58 * 10^-3. Likewise, three nonzero sequences end with flies as a V, yielding a total sum of 1.13 * 10^-5. Since these are the only sequences that have nonzero scores when the second word is flies, the sum of all these sequences will be the probability of the sequence The flies, namely 9.591 * 10^-3.
We can now compute the probability that flies is a noun as follows:

PROB(flies/N | The flies) = PROB(flies/N & The flies) / PROB(The flies)
                          = 9.58 * 10^-3 / 9.591 * 10^-3
                          = .9988

Likewise, the probability that flies is a verb would be .0012.

Of course, it would not be feasible to enumerate all possible sequences in a realistic example. Luckily, however, the same trick used in the Viterbi algorithm can be used here. Rather than selecting the maximum score for each node at each stage of the algorithm, we compute the sum of all scores. To develop this more precisely, we define the forward probability, written as αi(t), which is the probability of producing the words w1, ..., wt and ending in state wt/Li:

αi(t) = PROB(wt/Li, w1, ..., wt)

For example, with the sentence The flies like flowers, α2(3) would be the sum of values computed for all sequences ending in V (the second category) in position 3 given the input The flies like. Using the definition of conditional probability, you can then derive the probability that word wt is an instance of lexical category Li as follows:

PROB(wt/Li | w1, ..., wt) = PROB(wt/Li, w1, ..., wt) / PROB(w1, ..., wt)

We estimate the value of PROB(w1, ..., wt) by summing over all possible sequences up to any state at position t, which is simply Σj=1,N αj(t). In other words, we end up with

PROB(wt/Li | w1, ..., wt) = αi(t) / Σj=1,N αj(t)

The first two parts of the algorithm shown in Figure 7.14 compute the forward probabilities using a variant of the Viterbi algorithm. The last step converts the forward probabilities into lexical probabilities for the given sentence by normalizing the values.

Initialization Step
  For i = 1 to N do
    SEQSUM(i, 1) = PROB(w1 | Li) * PROB(Li | 0)

Computing the Forward Probabilities
  For t = 2 to T do
    For i = 1 to N do
      SEQSUM(i, t) = Σj=1,N (PROB(Li | Lj) * SEQSUM(j, t-1)) * PROB(wt | Li)

Computing the Lexical Probabilities
  For t = 1 to T do
    For i = 1 to N do
      PROB(Ct = Li) = SEQSUM(i, t) / Σj=1,N SEQSUM(j, t)

Figure 7.14 The forward algorithm for computing the lexical probabilities

Consider deriving the lexical probabilities for the sentence The flies like flowers using the probability estimates in Figures 7.4 and 7.6. The algorithm in Figure 7.14 would produce the sums shown in Figure 7.15 for each category in each position, resulting in the probability estimates shown in Figure 7.16.

Figure 7.15 Computing the sums of the probabilities of the sequences (the initialization and iterations 1 through 3)

PROB(the/ART | the)                        = 1.0
PROB(flies/N | the flies)                  = .9988
PROB(flies/V | the flies)                  ≈ .0011
PROB(like/V | the flies like)              = .575
PROB(like/P | the flies like)              = .4
PROB(like/N | the flies like)              = .022
PROB(flowers/N | the flies like flowers)   ≈ .967
PROB(flowers/V | the flies like flowers)   ≈ .033

Figure 7.16 Context-dependent estimates for lexical categories in the sentence The flies like flowers

Note that while the context-independent approximation in Figure 7.13 slightly favors the verb interpretation of flies, the context-dependent approximation virtually eliminates it because the training corpus had no sentences with a verb immediately following an article. These probabilities are significantly different than the context-independent ones and much more in line with intuition.
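Figure 7.14 can also be turned into code directly. The sketch below makes the same assumptions as the earlier fragments (probabilities supplied as functions) and returns, for each position, the normalized probability of each category given the words seen so far; it is an illustration, not the book's implementation.

    def forward_lexical_probs(words, categories, lexgen, bigram):
        """Return probs, where probs[t][cat] = PROB(Ct = cat | w1 ... wt)."""
        T, N = len(words), len(categories)
        seqsum = [[0.0] * T for _ in range(N)]

        # Initialization: SEQSUM(i, 1) = PROB(w1 | Li) * PROB(Li | 0)
        for i, cat in enumerate(categories):
            seqsum[i][0] = lexgen(words[0], cat) * bigram(cat, "0")

        # Forward probabilities: sum over predecessors rather than taking the max
        for t in range(1, T):
            for i, cat in enumerate(categories):
                total = sum(bigram(cat, prev) * seqsum[j][t - 1]
                            for j, prev in enumerate(categories))
                seqsum[i][t] = total * lexgen(words[t], cat)

        # Lexical probabilities: normalize each position over all categories
        probs = []
        for t in range(T):
            column = sum(seqsum[j][t] for j in range(N)) or 1.0
            probs.append({cat: seqsum[i][t] / column
                          for i, cat in enumerate(categories)})
        return probs

Run on the flies like flowers with the estimates of Figures 7.4 and 7.6, this should reproduce values close to those in Figure 7.16, for example PROB(flies/N | the flies) of about .9988.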
Note that you could also consider the backward probability, βi(t), the probability of producing the sequence wt, ..., wT beginning from state wt/Li. These values can be computed by an algorithm similar to the forward probability algorithm, but starting at the end of the sentence and sweeping backward through the states. Thus a better method of estimating the lexical probabilities for word wt would be to consider the entire sentence rather than just the words up to t. In this case, the estimate would be

PROB(wt/Li | w1, ..., wT) = (αi(t) * βi(t)) / Σj=1,N (αj(t) * βj(t))

7.5 Probabilistic Context-Free Grammars

Just as finite state machines could be generalized to the probabilistic case, context-free grammars can also be generalized. To do this, we must have some statistics on rule use. The simplest approach is to count the number of times each rule is used in a corpus containing parsed sentences and use this to estimate the probability of each rule being used. For instance, consider a category C, where the grammar contains m rules, R1, ..., Rm, with the left-hand side C. You could estimate the probability of using rule Rj to derive C by the formula

PROB(Rj | C) = Count(# times Rj used) / Σi=1,m Count(# times Ri used)

Grammar 7.17 shows a probabilistic CFG with the probabilities derived from analyzing a parsed version of the demonstration corpus.

Rule               Count for LHS   Count for Rule   Probability
1. S -> NP VP      300             300              1
2. VP -> V         300             116              .386
3. VP -> V NP      300             118              .393
4. VP -> V NP PP   300             66               .22
5. NP -> NP PP     1023            241              .24
6. NP -> N N       1023            92               .09
7. NP -> N         1023            141              .14
8. NP -> ART N     1023            558              .55
9. PP -> P NP      307             307              1

Grammar 7.17 A simple probabilistic grammar

You can then develop algorithms similar in function to the Viterbi algorithm that, given a sentence, will find the most likely parse tree that could have generated that sentence. The technique involves making certain independence assumptions about rule use. In particular, you must assume that the probability of a constituent being derived by a rule Rj is independent of how the constituent is used as a subconstituent. For example, this assumption would imply that the probabilities of NP rules are the same whether the NP is the subject, the object of a verb, or the object of a preposition. We know that this assumption is not valid in most cases. For instance, noun phrases in the subject position are much more likely to be pronouns than noun phrases not in the subject position. But, as before, it might be that useful predictive power can be obtained using these techniques in practice.

With this assumption, a formalism can be developed based on the probability that a constituent C generates a sequence of words wi, wi+1, ..., wj, written as wi,j. This type of probability is called the inside probability because it assigns a probability to the word sequence inside the constituent. It is written as

PROB(wi,j | C)

Consider how to derive inside probabilities. The case for lexical categories is simple. In fact, these are exactly the lexical-generation probabilities derived in Section 7.3. For example, PROB(flower | N) is the inside probability that the constituent N is realized as the word flower, which for our hypothetical corpus was .06, given in Figure 7.6. Using such lexical-generation probabilities, we can then derive the probability that the constituent NP generates the sequence a flower as follows: There are only two NP rules in Grammar 7.17 that could generate a sequence of two words. The parse trees generated by these two rules are shown in Figure 7.18.

Figure 7.18 The two possible ways that a flower could be an NP (using rule 8, NP -> ART N, and rule 6, NP -> N N)
You know the likelihood of each rule, estimated from the corpus as shown in Grammar 7.17, so the probability that the constituent NP generates the words a flower will be the sum of the probabilities of the two ways it can be derived, as follows:

PROB(a flower | NP)
  = PROB(Rule 8 | NP) * PROB(a | ART) * PROB(flower | N)
    + PROB(Rule 6 | NP) * PROB(a | N) * PROB(flower | N)
  = .55 * .36 * .06 + .09 * .001 * .06
  = .012

This probability can then be used to compute the probability of larger constituents. For instance, the probability of generating the words A flower wilted from constituent S could be computed by summing the probabilities generated from each of the possible trees shown in Figure 7.19. Although there are three possible interpretations, the first two differ only in the derivation of a flower as an NP. Both these interpretations are already included in the preceding computation of PROB(a flower | NP). Thus the probability of a flower wilted is

PROB(a flower wilted | S)
  = PROB(Rule 1 | S) * PROB(a flower | NP) * PROB(wilted | VP)
    + PROB(Rule 1 | S) * PROB(a | NP) * PROB(flower wilted | VP)

Figure 7.19 The three possible ways to generate a flower wilted as an S

Using this method, the probability that a given sentence will be generated by the grammar can be computed efficiently. The method only requires recording the value of each constituent between each two possible positions. In essence, it is using the same optimization that was gained by a packed chart structure, as discussed in Chapter 6. We will not pursue this development further, however, because in parsing you are interested in finding the most likely parse rather than the overall probability of a given sentence. In particular, it matters which of the two NP interpretations in Figure 7.19 was used.

The probabilities of specific parse trees can be found using a standard chart parsing algorithm, where the probability of each constituent is computed from the probability of its subconstituents and the probability of the rule used. Specifically, when entering an entry E of category C using a rule i with n subconstituents corresponding to entries E1, ..., En, then

PROB(E) = PROB(Rule i | C) * PROB(E1) * ... * PROB(En)

For lexical categories it is better to use the forward probability than the lexical-generation probability. This will produce better estimates, because it accounts for some of the context of the sentence. You can use the standard chart parsing algorithm and add a step that computes the probability of each entry when it is added to the chart. Using the bottom-up algorithm and the probabilities derived from the demonstration corpus, Figure 7.20 shows the complete chart for the input a flower. Note that the most intuitive reading, that it is an NP, has the highest probability by far, .54. But there are many other possible interpretations because of the low probability readings of the word a as a noun and the reading of flower as a verb. Note that the context-independent probabilities for flower would favor the verb interpretation, but the forward algorithm strongly favors the noun interpretation in the context where it immediately follows the word a. The probabilities for the lexical constituents ART416, N417, V419, and N422 were computed using the forward algorithm. The context-independent lexical probabilities would be much less in line with intuition.

NP425   1 N422              .14
NP424   1 N417 2 N422       .00011
NP423   1 ART416 2 N422     .54
S421    1 NP418 2 VP420     3.2 * 10^-8
NP418   1 N417              .00018
VP420                       .00018
N417    .001        N422    .999
ART416  .99         V419    .00047

Figure 7.20 The full chart for a flower
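The chart-entry computation PROB(E) = PROB(Rule i | C) * PROB(E1) * ... * PROB(En), and the inside probability for a flower worked out above, can be sketched in a few lines. This is an illustration only; the dictionary encodings of Grammar 7.17 and Figure 7.6 below are assumptions about representation, not part of the text.

    # Rule probabilities from Grammar 7.17, keyed by (LHS, RHS)
    RULE_PROB = {("NP", ("ART", "N")): .55, ("NP", ("N", "N")): .09}

    # Lexical-generation probabilities (Figure 7.6, with flower rounded to .06)
    LEX_PROB = {("a", "ART"): .36, ("a", "N"): .001, ("flower", "N"): .06}

    def entry_prob(lhs, rhs, sub_probs):
        """PROB(E) = PROB(Rule | C) * PROB(E1) * ... * PROB(En)."""
        p = RULE_PROB[(lhs, rhs)]
        for sub in sub_probs:
            p *= sub
        return p

    # Inside probability of "a flower" as an NP: sum over its two derivations
    p_np = (entry_prob("NP", ("ART", "N"),
                       [LEX_PROB[("a", "ART")], LEX_PROB[("flower", "N")]])
            + entry_prob("NP", ("N", "N"),
                         [LEX_PROB[("a", "N")], LEX_PROB[("flower", "N")]]))
    print(round(p_np, 3))    # prints 0.012, matching the computation above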
Unfortunately, a parser built using these techniques turns out not to work as well as you might expect, although it does help. Some researchers have found that these techniques identify the correct parse about 50 percent of the time. It doesn't do better because the independence assumptions that need to be made are too radical. Problems arise in many guises, but one critical issue is the handling of lexical items. For instance, the context-free model assumes that the probability of a particular verb being used in a VP rule is independent of which rule is being considered. This means lexical preferences for certain rules cannot be handled within the basic framework. This problem then influences many other issues, such as attachment decisions.

For example, Grammar 7.17 indicates that rule 3 is used 39 percent of the time, rule 4, 22 percent of the time, and rule 5, 24 percent of the time. This means that, independent of whatever words are used, an input sequence of the form V NP PP will always be interpreted with the PP attached to the verb. The tree fragments are shown in Figure 7.21: any structure that attaches the PP to the verb will have a probability of .22 from this fragment, whereas the structure that attaches the PP to the NP will have a probability of .39 * .24, namely .093. So the parser will always attach a PP to the verb phrase independent of what words are used. In fact, there are 23 cases of this situation in the corpus where the PP attaches to an NP, and the probabilistic parser gets every one of them wrong.

Figure 7.21 The two structures affecting attachment decisions

This is true even with particular verbs that rarely take a _np_pp complement. Because of the context-free assumption made about probabilities, the particular verb used has no effect on the probability of a particular rule being used. Because of this type of problem, a parser based on this approach with a realistic grammar is only a slight improvement over the nonprobabilistic methods. For example, 84 sentences were selected from the corpus, each involving a PP attachment problem. With the standard bottom-up nonprobabilistic algorithm, using Grammar 7.17, the first complete S structure found was the correct answer on one-third of the sentences. By using probabilistic parsing, the highest probability S structure was correct half of the time. Thus, pure probabilistic parsing is better than guessing but leaves much to be desired. Note also that this test was on the same sentences that were used to compute the probabilities in the first place, so performance on new sentences not in the training corpus would tend to be even worse! Papers in the literature that use real corpora and extensive training report accuracy results similar to those described here. While pure probabilistic context-free grammars have their faults, the techniques developed here are important and can be reused in generalizations that attempt to develop more context-dependent probabilistic parsing schemes.

7.6 Best-First Parsing

So far, probabilistic context-free grammars have done nothing to improve the efficiency of the parser. Algorithms can be developed that attempt to explore the high-probability constituents first. These are called best-first parsing algorithms.
The hope is that the best parse can be found quickly and much of the search space, containing lower-rated possibilities, is never explored. It turns out that all the chart parsing algorithms in Chapter 3 can be modified fairly easily to consider the most likely constituents first. The central idea is to make the agenda a priority queue, a structure where the highest-rated elements are always first in the queue. The parser then operates by always removing the highest-ranked constituent from the agenda and adding it to the chart.

It might seem that this one change in search strategy is all that is needed to modify the algorithms, but there is a complication. The previous chart parsing algorithms all depended on the fact that the parser systematically worked from left to right, completely processing constituents occurring earlier in the sentence before considering later ones. With the modified algorithm, this is not the case. If the last word in the sentence has the highest score, it will be added to the chart first. The problem this causes is that you cannot simply add active arcs to the chart (and depend on later steps in the algorithm to extend them). In fact, the constituent needed to extend a particular active arc may already be on the chart. Thus, whenever an active arc is added to the chart, you must check to see if it can be extended immediately, given the current chart. Thus we need to modify the arc extension algorithm. The algorithm is as before (in Section 3.4), except that step 2 is modified to check for constituents that already exist on the chart. The complete algorithm is shown in Figure 7.22.

To add a constituent C from position p1 to p2:
1. Insert C into the chart from position p1 to p2.
2. For any active arc of the form X -> X1 ... ° C ... Xn from position p0 to p1, add a new active arc X -> X1 ... C ° ... Xn from position p0 to p2.

To add an active arc X -> X1 ... C ° C' ... Xn to the chart from p0 to p2:
1. If C is the last constituent (that is, the arc is completed), add a new constituent of type X to the agenda.
2. Otherwise, if there is a constituent Y of category C' in the chart from p2 to p3, then recursively add an active arc X -> X1 ... C C' ° ... Xn from p0 to p3 (which may of course add further arcs or create further constituents).

Figure 7.22 The new arc extension algorithm

Adopting a best-first strategy makes a significant improvement in the efficiency of the parser. For instance, using Grammar 7.17 and a lexicon trained from the corpus, the sentence The man put a bird in the house is parsed correctly after generating 65 constituents with the best-first parser. The standard bottom-up algorithm generates 158 constituents on the same sentence, only to obtain the same result. If the standard algorithm were modified to terminate when the first complete S interpretation is found, it would still generate 106 constituents for the same sentence. So best-first strategies can lead to significant improvements in efficiency.

Even though it does not consider every possible constituent, the best-first parser is guaranteed to find the highest probability interpretation. To see this, assume that the parser finds an interpretation S1 with probability p1.
The important property of the probabilistic scoring is that the probability of a constituent must always be lower than (or equal to) the probability of any of its subconstituents. So if there were an interpretation S2 that had a score p2, higher than p1, then it would have to be constructed out of subconstituents all with a probability of p2 or higher. This means that all the subconstituents would have been added to the chart before S1 was. But this means that the arc that constructs S2 would be completed, and hence S2 would be on the agenda. Since S2 has a higher score than S1, it would be considered first.

BOX 7.2 Handling Unknown Words

Another area where statistical methods show great promise is in handling unknown words. In traditional parsing, one unknown word will disrupt the entire parse. The techniques discussed in this chapter, however, suggest some interesting ideas. First, if you have a trigram model of the data, you may already have significant predictive power on the category of the unknown word. For example, consider a sequence w1 w2 w3, where the last word is unknown. If the two previous words are in categories C1 and C2, respectively, then you could pick the category C for the unknown word that maximizes PROB(C | C1 C2). For instance, if C2 is the category ART, then the most likely interpretation for C will probably be a noun (or maybe an adjective). Other techniques have been developed that use knowledge of morphology to predict the word class. For instance, if the unknown word ends in -ing, then it is likely to be a verb; if it ends in -ly, then it is likely to be an adverb; and so on. Estimates based on suffixes can be obtained by analyzing a corpus in the standard way to obtain estimates of PROB(word is category C | word ends in -ly), and so on. Techniques for handling unknown words are discussed in Church (1988), de Marcken (1990), and Weischedel et al. (1993).

While the idea of a best-first parser is conceptually simple, there are a few problems that arise when trying to apply this technique in practice. One problem is that if you use a multiplicative method to combine the scores, the scores for constituents tend to fall quickly as they cover more and more input. This might not seem problematic, but in practice, with large grammars, the probabilities drop off so quickly that the search closely resembles a breadth-first search: first build all constituents of length 1, then all constituents of length 2, and so on. Thus the promise of quickly finding the most preferred solution is not realized. To deal with this problem, some systems use a different function to compute the score for constituents. For instance, you could use the minimum score of any subconstituent and the rule used, that is,

Score(C) = MIN(Score(rule C -> C1 ... Cn), Score(C1), ..., Score(Cn))

This gives a higher (or equal) score than the first approach, but a single poorly rated subconstituent can essentially eliminate any constituent that contains it, no matter how well all the other constituents are rated. Unfortunately, testing the same 84 sentences using the MIN function leads to a significant decrease in accuracy to only 39 percent, not much better than a brute force search. But other researchers have suggested that the technique performs better than this in actual practice. You might also try other means of combining the scores, such as taking the average score of all subconstituents.
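The agenda change described above amounts to replacing the ordinary agenda with a priority queue. The control skeleton below is a sketch of that idea, not the book's implementation; the callback names and their exact responsibilities are assumptions.

    import heapq

    class Agenda:
        """Priority queue of scored constituents; highest score is removed first."""
        def __init__(self):
            self._heap = []
            self._counter = 0            # tie-breaker so entries remain comparable
        def add(self, score, constituent):
            heapq.heappush(self._heap, (-score, self._counter, constituent))
            self._counter += 1
        def pop(self):
            neg_score, _, constituent = heapq.heappop(self._heap)
            return -neg_score, constituent
        def __bool__(self):
            return bool(self._heap)

    def best_first_parse(agenda, add_to_chart, is_complete_s):
        """Repeatedly move the highest-rated constituent from the agenda to the chart.
        add_to_chart must implement the arc extension algorithm of Figure 7.22 and
        put any newly completed constituents back onto the agenda."""
        while agenda:
            score, constituent = agenda.pop()
            if is_complete_s(constituent):
                return score, constituent    # first complete S found is the most probable
            add_to_chart(score, constituent)
        return None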
Rule             the    house   peaches   flowers
NP -> N          0      0       .65       .76
NP -> N N        0      .82     0         0
NP -> NP PP      .23    .18     .35       .24
NP -> ART N      .76    0       0         0

Rule             ate    bloom   like      put
VP -> V          .28    .84     0         .03
VP -> V NP       .57    .1      .9        .03
VP -> V NP PP    .14    .05     .1        .93

Figure 7.23 Some rule estimates based on the first word

7.7 A Simple Context-Dependent Best-First Parser

The best-first algorithm leads to improvement in the efficiency of the parser but does not affect the accuracy problem. This section explores a simple alternative method of computing rule probabilities that uses more context-dependent lexical information. The idea exploits the observation that the first word in a constituent is often the head and thus has a dramatic effect on the probabilities of rules that account for its complement. This suggests a new probability measure for rules that is relative to the first word, PROB(R | C, w). This is estimated as follows:

PROB(R | C, w) = Count(# times rule R used for cat C starting with w) / Count(# times cat C starts with w)

The effect of this modification is that probabilities are sensitive to the particular words used. For instance, in the corpus, singular nouns rarely occur alone as a noun phrase (that is, starting rule NP -> N), whereas plural nouns rarely are used as a noun modifying another noun (that is, starting rule NP -> N N). This can be seen in the difference between the probabilities for these two rules given the words house and peaches, shown in Figure 7.23. With the context-free probabilities, the rule NP -> N had a probability of .14. This underestimates the rule when the input has a plural noun, and overestimates it when the input has a singular noun.

More importantly, the context-sensitive rules encode verb preferences for different subcategorizations. In particular, Figure 7.23 shows that the rule VP -> V NP PP is used 93 percent of the time with the verb put but only 10 percent of the time with like. This difference allows a parser based on these probabilities to do significantly better than the context-free probabilistic parser. In particular, on the same 84 test sentences on which the context-free probabilistic parser had 49 percent accuracy, the context-dependent probabilistic parser has 66 percent accuracy, getting the attachment correct on 14 sentences on which the context-free parser failed. The context-dependent parser is also more efficient, finding the answer on the sentence The man put the bird in the house generating only 36 constituents. The results of the three parsing strategies are summarized in Figure 7.24.

Strategy                            Accuracy on 84 PP         Size of Chart Generated for
                                    Attachment Problems       The man put the bird in the house
Full Parse (taking first S found)   33%                       158
Context-Free Probabilities          49%                       65
Context-Dependent Probabilities     66%                       36

Figure 7.24 A summary of the accuracy and efficiency of different parsing strategies

To see why the context-dependent parser does better, consider the attachment decision that has to be made between the trees shown in Figure 7.21 for the verbs like and put, say in the sentences The man put the bird in the house and The man likes the bird in the house. The context-free probabilistic parser would assign the same structure to both, getting the example with put right and the example with like wrong.

VP6840   1 V6812 2 NP6815 3 PP6822    .54
VP6828   1 V6812 2 NP6827             .0038
NP6827   1 NP6815 2 PP6822            .13
V6812    .99        NP6815   .76        PP6822   .76
         put        the bird            in the house

Figure 7.25 The chart for the VP put the bird in the house
The relevant parts of the charts are shown in Figures 7.25 and 7.26. In Figure 7.25 the probability of the rule VP -> V NP PP, starting with put, is .93, yielding a probability of .54 (.93 * .99 * .76 * .76) for constituent VP6840. This is far greater than the probability of .0038 given the alternative VP6828. Changing the verb to like, as shown in Figure 7.26, affects the probabilities enough to override the initial bias towards the V-NP-PP interpretation. In this case, the probability of the rule VP -> V NP PP, starting with like, is only .1, whereas the probability of rule VP -> V NP is .9. This gives a probability of .1 to VP6883, beating out the alternative.

VP6883   1 V6848 2 NP6873             .1
VP6881   1 V6848 2 NP6851 3 PP6858    .054
NP6873   1 NP6851 2 PP6858            .13
V6848    .94        NP6851   .76        PP6858   .76
         likes      the bird            in the house

Figure 7.26 The chart for the VP likes the bird in the house

The question arises as to how accurate a parser can become by using the appropriate contextual information. While 66 percent accuracy was good compared to the alternative strategies, it still leaves a 33 percent error rate on sentences involving a PP attachment decision. What additional information could be added to improve the performance? Clearly, you could make the rule probabilities relative to a larger fragment of the input, using a bigram or trigram at the beginning of the rule. This may help the selectivity if there is enough training data. Attachment decisions depend not only on the verb but also on the preposition in the PP. You could devise a more complex estimate based on the previous category, the head verb, and the preposition for rule VP -> V NP PP. This would require more data to obtain reliable statistics but could lead to a significant increase in reliability. But a complication arises in that you cannot compare across rules as easily now: the VP -> V NP PP rule is evaluated using the verb and the preposition, but there is no corresponding preposition with the rule VP -> V NP. Thus a more complicated measure would have to be devised.

In general, the more selective the lexical categories, the more predictive the estimates can be, assuming there is enough data. Earlier we claimed that basing the statistics on words would require too much data. But is there a subset of words that could profitably be used individually? Certainly function words such as prepositions seem ideally suited for individual treatment. There is a fixed number of them, and they have a significant impact on the structure of the sentence. Similarly, other closed-class words (articles, quantifiers, conjunctions, and so on) could all be treated individually rather than grouped together as a class. It is reasonable to assume we will be able to obtain enough data to obtain reliable estimates with these words. The complications arise with the open-class words. Verbs and nouns, for instance, play a major role in constraining the data, but there are far too many of them to consider individually. One approach is to model the common ones individually, as we did here. Another would be to try to cluster words together into classes based on their similarities. This could be done by hand, based on semantic properties. For instance, all verbs describing motion might have essentially the same behavior and be collapsed into a single class. The other option is to use some automatic technique for learning useful classes by analyzing corpora of sentences with attachment ambiguities.
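The estimate PROB(R | C, w) is again just a ratio of counts over a parsed corpus. A minimal sketch of collecting and using it follows; the corpus format and function names are assumptions for illustration, not part of the text.

    from collections import defaultdict

    def train_rule_probs(rule_uses):
        """rule_uses: iterable of (lhs, rhs_tuple, first_word) triples, one per
        rule use observed in a corpus of parsed sentences."""
        rule_count = defaultdict(int)    # Count(rule R used for cat C starting with w)
        start_count = defaultdict(int)   # Count(cat C starts with w)
        for lhs, rhs, first_word in rule_uses:
            w = first_word.lower()
            rule_count[(lhs, rhs, w)] += 1
            start_count[(lhs, w)] += 1

        def rule_prob(lhs, rhs, first_word):    # PROB(R | C, w)
            w = first_word.lower()
            if start_count[(lhs, w)] == 0:
                return 0.0               # in practice, back off to PROB(R | C) here
            return rule_count[(lhs, rhs, w)] / start_count[(lhs, w)]

        return rule_prob

With counts like those underlying Figure 7.23, rule_prob("VP", ("V", "NP", "PP"), "put") would come out near .93 while rule_prob("VP", ("V", "NP", "PP"), "like") would come out near .1, which is exactly the preference that lets the parser attach the PP differently for the two verbs.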
Summary

Statistically based methods show great promise in addressing the ambiguity resolution problem in natural language processing. Such techniques have been used to good effect for part-of-speech tagging, for estimating lexical probabilities, and for building probabilistically based parsing algorithms. For part-of-speech tagging, the Viterbi algorithm using bigram or trigram probability models can attain accuracy rates of over 95 percent. Context-dependent lexical probabilities can be computed using an algorithm similar to the Viterbi algorithm that computes the forward probabilities. These can then be used as input to a chart parser that uses a context-free probabilistic grammar to find the most likely parse tree. The efficiency of parsers can be dramatically improved by using a best-first search strategy, in which the highest-rated constituents are added to the chart first. Unfortunately, the context-free probabilistic framework often does not identify the intuitively correct parse tree. Context-dependent probability measures can be developed that significantly improve the accuracy of the parser over the context-free methods. This is an area of active research, and significant developments can be expected in the near future.

Related Work and Further Readings

The literature on statistical methods of natural language processing is quite young, with the techniques being developed in the last decade. An excellent source that describes the techniques in detail is Charniak (1993). Many of the basic algorithms were developed in areas outside natural language processing. For instance, the Viterbi algorithm was originally developed on Markov models (Viterbi, 1967) and is in extensive use in a wide range of applications. The technique for developing probability estimates for lexical classes is based on the forward and forward-backward algorithms by Baum (1972). The use of these techniques for speech recognition is described well in an article by Rabiner (1989) and in many of the papers in Waibel and Lee (1990).

The use of these techniques for part-of-speech tagging has been developed by many researchers, including Jelinek (1990), Church (1988), DeRose (1988), and Weischedel et al. (1993), to name a few. Many researchers have also developed parsers based on the context-independent rule probabilities described in Section 7.5, including Jelinek (1990). One of the major attractions of statistically based approaches is the ability to learn effective parameters from processing corpora. These algorithms start with an initial estimate of the probabilities and then process the corpus to compute a better estimate. This can be repeated until further improvement is not found. These techniques are guaranteed to converge but do not necessarily find the optimal values. Rather, they find a local maximum and are thus similar to many hill-climbing algorithms. An example of a technique to learn parameters for a grammar from only partially analyzed data is given in Pereira and Schabes (1992).

Best-first parsing algorithms have been in use for a long time, going back to work such as that by Paxton and Robinson (1973). Only recently have they been explored using data derived from corpora (as in Weischedel et al. (1993) and Chitrao and Grishman (1990)). A good example of a context-dependent method using trigrams is that of Magerman and Weir (1992).
In general, an excellent collection of recent work on statistical models can be found in a special issue of Computational Linguistics (Volume 19, Issues 1 and 2, 1993). Prepositional phrase attachment seems to be a problem well suited to statistical techniques. Whittemore et al. (1990) explored the traditional strategies such as right association and minimal attachment and found that they did not have good predictive power, whereas lexical preference models looked promising. Hindle and Rooth (1993) explored some techniques for gathering such data from corpora that are not fully annotated with syntactic analyses.

The Brown corpus is described in Francis and Kucera (1982) and has been widely used as a testbed since its release. A good source for linguistic data is the Linguistic Data Consortium (LDC) housed at the University of Pennsylvania. The LDC has many large databases of language, in written and spoken forms, and in a wide range of styles and formats. It also contains the Penn Treebank data, which is a large corpus of sentences annotated with parses. The Penn Treebank is described in Marcus et al. (1993).

There are many introductory books on probability theory available. One good one, by Ross (1988), introduces the basic concepts and includes brief discussions of Markov chains and information theory.

Exercises for Chapter 7

1. (easy) Prove Bayes' rule by using the definition of conditional probability. Also prove that if A and B are independent then PROB(A | B) = PROB(A).

2. (medium) Say we have an acceptable margin of error of between .4 and .6 for estimating the probability of a fair coin coming up heads. What is the chance of obtaining a reliable estimate for the probability with five flips?

3. (medium) Hand-simulate the Viterbi algorithm using the data and probability estimates in Figures 7.4-7.6 on the sentence Flower flowers like flowers. Draw transition networks as in Figures 7.10-7.12 for the problem, and identify what part of speech the algorithm identifies for each word.

4. (medium) Using the bigram and lexical-generation probabilities given in this chapter, calculate the word probabilities using the forward algorithm for the sentence The a flies like flowers (involving a very rare use of the word a as a noun, as in the a flies, the b flies, and so on). Remember to use .0001 as a probability for any bigram not in the table. Are the results you get reasonable? If not, what is the problem and how might it be fixed?

5. (medium) Using the tagged corpus provided, write code to collect bigram and lexical statistics, and implement a part-of-speech tagger using the Viterbi algorithm. Test your algorithm on the same corpus. How many sentences are tagged correctly? How many category errors did this involve? Would you expect to get better accuracy results on the sentences that involved an error if you had access to more data?

6. (medium) Consider an extended version of Grammar 7.17 with the additional rule

   10. VP -> V PP

   The revised rule probabilities are shown here. (Any not mentioned are the same as in Grammar 7.17.)

   VP -> V          .32
   VP -> V NP       .33
   VP -> V NP PP    .20
   VP -> V PP       .15

   In addition, the following bigram probabilities differ from those in Figure 7.4:

   PROB(N | V) = .53
   PROB(ART | V) = .32
   PROB(P | V) = .15

   a. Hand-simulate (or implement) the forward algorithm on Fruit flies like birds to produce the lexical probabilities.
   b. Draw out the full chart for Fruit flies like birds, showing the probabilities of each constituent.
7. (hard) Implement an algorithm to derive the forward and backward probabilities for lexical categories. Develop two lexical analysis programs: one that uses only the forward algorithm, and one that combines both techniques. Give a sentence for which the results differ substantially.

8. (hard) Generalize the Viterbi algorithm to operate based on trigram statistics. Since the probability of a state at position t now depends on the two preceding states, this problem does not fit the basic definition of the Markov assumption. It is often called a second-order Markov assumption. The best way to view this is as an HMM in which all the states are labeled by a pair of categories. Thus, the sequence ART N V N could be generated from a sequence of states labeled (0, ART), (ART, N), (N, V), and (V, N). The transition probabilities between these states would be the trigram probabilities. For example, the probability of the transition from state (0, ART) to (ART, N) would be the probability of category N following the categories 0 and ART. Train your data using the supplied corpus and implement a part-of-speech tagger based on trigrams. Compare its performance on the same data with the performance of the bigram model.

PART II  Semantic Interpretation

As discussed in the introduction, the task of determining the meaning of a sentence will be divided into two steps: first computing a context-independent notion of meaning, called the logical form, and then interpreting the logical form in context to produce the final meaning representation. Part II of the book concerns the first of these steps, which will be called semantic interpretation. Many actual systems do not make this division and use contextual information early in the processing. Nonetheless, the organization is very helpful for classifying the issues and for illustrating the different techniques. A similar division is often made in linguistics, where the study of context-independent meaning is often called semantics, and the study of language in context is called pragmatics.

Using the need for context as the test for dividing up problems, the following are some examples of issues that can be addressed in a context-independent way and hence will be addressed in Part II:

• eliminating possible word senses by exploiting context-independent structural constraints
• identifying the semantic roles that each word and phrase could play in the logical form, specifically identifying the predicate/argument and modification relations
• identifying co-reference restrictions derived from the structure of the sentence

Just as important, the following will not be studied in Part II but will be left for the discussion of contextual interpretation in Part III:

• determining the referents of noun phrases and other phrases
• selecting a single interpretation (and set of word senses) from those that are possible
• determining the intended use of each utterance

Chapter 8 discusses the distinction between logical form and the final meaning representation in more detail, and develops a logical form language that is used throughout the rest of the book. Chapter 9 then addresses the issue of how the logical form relates to syntactic structure and shows how the feature system in the grammar can be used to identify the logical form in a rule-by-rule fashion.
Chapter 10 discusses the important problem of ambiguity resolution and shows how semantic constraints or preferences can be used to identify the most plausible word senses and semantic structures. Chapter 11 discusses some alternate methods of semantic interpretation that have proven useful in existing systems and applications. Finally, Chapter 12 discusses a selection of more advanced issues in semantic interpretation, including the analysis of scoping dependencies, for the student with strong interests in semantic issues.

CHAPTER 8  Semantics and Logical Form

8.1 Semantics and Logical Form
8.2 Word Senses and Ambiguity
8.3 The Basic Logical Form Language
8.4 Encoding Ambiguity in the Logical Form
8.5 Verbs and States in Logical Form
8.6 Thematic Roles
8.7 Speech Acts and Embedded Sentences
° 8.8 Defining Semantic Structure: Model Theory

This chapter introduces the basic ideas underlying theories of meaning, or semantics. It introduces a level of context-independent meaning called the logical form, which can be produced directly from the syntactic structure of a sentence. Because it must be context independent, the logical form does not contain the results of any analysis that requires interpretation of the sentence in context.

Section 8.1 introduces the basic notions of meaning and semantics, and describes the role of a logical form in semantic processing. Section 8.2 introduces word senses and the semantic primitives, and discusses the problem of word-sense ambiguity. Section 8.3 then develops the basic logical form language for expressing the context-independent meaning of sentences. This discussion is extended in Section 8.4 with constructs that concisely encode certain common forms of ambiguity. Section 8.5 discusses the representation of verbs and introduces the notion of state and event variables. Section 8.6 then discusses thematic roles, or cases, and shows how they can be used to capture various semantic generalities across verb meanings. Section 8.7 introduces the notion of surface speech acts and discusses the treatment of embedded sentences in the logical form. This completes the description of the logical form language, which is concisely characterized in Figures 8.7 and 8.8. The optional Section 8.8 is for the student interested in defining a model-theoretic semantics for the logical form language itself. It describes a model theory for the logical form language and discusses various semantic relationships among sentences that can be defined in terms of entailment and implicature.

8.1 Semantics and Logical Form

Precisely defining the notions of semantics and meaning is surprisingly difficult because the terms are used for several different purposes in natural and technical usage. For instance, there is a use of the verb mean that has nothing to do with language. Say you are walking in the woods and come across a campfire that is just noticeably warm. You might say This fire means someone camped here last night. By this you mean that the fire is evidence for the conclusion or implies the conclusion. This is related to, but different from, the notion of meaning that will be the focus of the next few chapters. The meaning we want is closer to the usage when defining a word, such as in the sentence "Amble" means to walk slowly. This defines the meaning of a word in terms of other words.
To make this more precise, we will have to develop a more formally specified language in which we can specify meaning without having to refer back to natural language itself. But even if we can do this, defining a notion of sentence meaning is difficult. For example, I was at an airport recently and while I was walking towards my departure gate, a guard at the entrance asked, "Do you know what gate you are going to?" I interpreted this as asking whether I knew where I was going and answered yes. But this response was based on a misunderstanding of what the guard meant, as he then asked, "Which gate is it?" He clearly had wanted me to tell him the gate number. Thus the sentence Do you know what gate you are going to? appears to mean different things in different contexts.

Figure 8.1 Logical form as an intermediate representation (the syntactic analysis of The ball is red is mapped by semantic interpretation to the logical form, which is mapped by contextual interpretation to the final representation Red(B073))

Can we define a notion of sentence meaning that is independent of context? In other words, is there a level at which the sentence Do you know what gate you are going to? has a single meaning, but may be used for different purposes? This is a complex issue, but there are many advantages to trying to make such an approach work. The primary argument is modularity. If such a division can be made, then we can study sentence meaning in detail without all the complications of sentence usage. In particular, if sentences have no context-independent meaning, then we may not be able to separate the study of language from the study of general human reasoning and context. As you will see in the next few chapters, there are many examples of constraints based on the meaning of words that appear to be independent of context. So from now on, we will use the term meaning in this context-independent sense, and we will use the term usage for the context-dependent aspects. The representation of context-independent meaning is called the logical form. The process of mapping a sentence to its logical form is called semantic interpretation, and the process of mapping the logical form to the final knowledge representation (KR) language is called contextual interpretation. Figure 8.1 shows a simple version of the stages of interpretation. The exact meaning of the notations used will be defined later. For the moment let us assume the knowledge representation language is the first-order predicate calculus (FOPC). Given that assumption, what is the status of the logical form? In some approaches the logical form is defined as the literal meaning of the utterance, and the logical form language is the same as the final knowledge representation language. If this is to be a viable approach in the long run, however, it would mean that the knowledge representation must be considerably more complex than representations in present use in AI systems. For instance, the logical form language must allow indexical terms, that is, terms that are defined by context. The pronouns I and you are indexical because their interpretation depends on the context of who is speaking and listening. In fact most definite descriptions (such as the red ball) are indexical, as the object referred to can only be identified with respect to a context.
Many other aspects of language, including the interpretation of tense and determining the scope of quantifiers, depend on context as well and thus cannot be uniquely determined at the logical form level. Of course, all of this could be treated as ambiguity at the logical form level, but this would be impractical, as every sentence would have large numbers of possible logical forms (as in the sentence The red ball dropped, which would have a different logical form for every possible object that could be described as a ball that is red). But if the logical form language is not part of the knowledge representation language, what is its formal status? A promising approach has been developed in linguistics over the last decade that suggests an answer that uses the notion of a situation, which is a particular set of circumstances in the world. This corresponds reasonably well to the intuitive notion of the meaning of situation in English. For instance, when attending a class, you are in a situation where there are fellow students and an instructor, where certain utterances are made by the lecturer, questions asked, and so on. Also, there will be objects in the lecture hall, say a blackboard and chairs, and so on. More formally, you might think of a situation as a set of objects and relations between those objects. A very simple situation might consist of two objects, a ball B0005 and a person P86, and include the relationship that the person owns the ball. Let us encode this situation as the set {(BALL B0005), (PERSON P86), (OWNS P86 B0005)}.

Language creates special types of situations based on what information is conveyed. These issues will be explored in detail later, but for now consider the following to help your intuition. In any conversation or text, assume there is a discourse situation that records the information conveyed so far. A new sentence is interpreted with respect to this situation and produces a new situation that includes the information conveyed by the new sentence. Given this view, the logical form is a function that maps the discourse situation in which the utterance was made to a new discourse situation that results from the occurrence of the utterance. For example, assume that the situation we just encoded has been created by some preceding sentences describing the ball and who owns it. The utterance The ball is red might produce a new situation that consists of the old situation plus the new fact that B0005 has the property RED: {(BALL B0005), (PERSON P86), (OWNS P86 B0005), (RED B0005)}. Figure 8.2 shows this view of the interpretation process, treating the logical form as a function between situations.

Figure 8.2 Logical form as a function (semantic interpretation of The ball is red produces an assertion that maps the initial discourse situation {(BALL B0005), (PERSON P86), (OWNS P86 B0005)} to the updated discourse situation {(BALL B0005), (PERSON P86), (OWNS P86 B0005), (RED B0005)})

The two organizations presented in Figures 8.1 and 8.2 differ in that the latter might not include a single identifiable expression in the knowledge representation that fully captures the "meaning" of the sentence. Rather, the logical form might make a variety of changes to produce the updated situation. This allows other implications to be derived from an utterance that are not directly captured in the semantic content of the sentence, as the sketch below illustrates.
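To make the set-of-facts picture concrete, here is a minimal sketch (in Python, not from the text) of a discourse situation represented as a set of tuples and updated by the assertion carried by The ball is red. The fact encoding and the function name are illustrative assumptions, not part of the logical form language defined in this chapter.

```python
# A discourse situation modeled as a set of (relation, arg, ...) tuples.
initial_situation = {
    ("BALL", "B0005"),
    ("PERSON", "P86"),
    ("OWNS", "P86", "B0005"),
}

def assert_fact(situation, fact):
    """Return the updated situation produced by asserting one new fact.

    The logical form of an utterance is viewed here as a function from
    the current discourse situation to a new situation (Figure 8.2)."""
    return situation | {fact}

# "The ball is red" adds the fact that B0005 has the property RED.
updated_situation = assert_fact(initial_situation, ("RED", "B0005"))

print(("RED", "B0005") in updated_situation)            # True
print(len(initial_situation), len(updated_situation))   # 3 4
```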
Such issues will become important later when we discuss contextual interpretation. Even though much of language is highly context dependent, there is still considerable semantic structure to language that is context independent and lhat can be used in the semantic interpretation process to produce the logical form. Much of this semantic knowledge consists of the type of information you can find in dictionaries—the basic semantic properties of words (that is, whether they refer to relations, objects, and so on), what different senses are possible for each-word, what senses may combine to form larger semantic structures, and so on. Identifying these forms of information and using this information to compute a logical form arc the focus of Part II of the book. 8.2 Word Senses and Ambiguity To develop a theory of semantics and semantic interpretation, we need to develop a structural model, just as we did for syntax. With syntax we first introduced the notion of the basic syntactic classes and then developed ways to constrain how simple classes combine to form larger structures. We will follow the same basic strategy for semantics. You might think that the basic semantic unit could be the word or the morpheme, but that approach runs into problems because of the presence of ambiguity. For example, it is not unusual for the verb go to have more than 40 entries in a typical dictionary. Each one of these definitions reflects a different sense of the word. Dictionaries often give synonyms for particular word senses. For go you might find synonyms such as move, depart, pass, vanish, reach, extend, and set out. Many of these highlight a different sense of the verb go. Of course, if these are true synonyms of some sense of go, then the verbs themselves will share identical senses. For instance, one of the senses of go will be identical to one of the senses of depart. If every word has one or more senses then you are looking at a very large number of senses, even given that some words have synonymous senses. Fortunately, the different senses can be organized into a set of broad classes of objects by which we classify the world. The set of different classes of objects in a representation is called its ontology. To handle a natural language, we need a much broader ontology than commonly found in work on formal logic. Such classifications of objects have been of interest for a very long time and arise in the writings of Aristotle (384-322 B.C.). The major classes that Aristotle suggested were substance (physical objects), quantity (such as numbers), quality (such as bright red), relation, place, time, position, state, action, and affection. To this list we might add other classes such as events, ideas, concepts, and plans. Two of the most influential classes are actions and events. Events are things that happen in the world and are important in many semantic theories because they provide a structure for organizing the interpretation of sentences. Actions are things that agents do, thus causing some event. Like all objects in the ontology, actions and events can be referred to by pronouns, as in the discourse fragment We lifted the box. It was hard work. Here, the pronoun it refers to the action of lifting the box. Another very influential category is the situation. As previously mentioned, a situation refers to some particular set of circumstances and can be viewed as subsuming the notion of events. In many cases a situation may act like an abstraction of the world over some location and time. 
For example, the sentence We laughed and sang at the football game describes a set of activities performed at a particular time and location, described as the situation the football game. Not surprisingly, ambiguity is a serious problem during semantic interpretation. We can define a word as being semantically ambiguous if it maps to more than one sense. But this is more complex than it might first seem, because we need to have a way to deteririine what the allowable senses are. For example, 232 chapter 8 Semantics and Logical Form 233 intuitively the word kid seems to be ambiguous between a baby goat and a human child. But how do we know it doesn't have a single sense that includes both interpretations? The word horse, on the other hand, seems not to be ambiguous even though we know that horses may be subdivided into mares, colts, trotters, and so on. Why doesn't the word horse seem ambiguous when each time the word is used we might not be able to tell if it refers to a mare or a colt? A few linguistic tests have been suggested to define the notion of semantic ambiguity more precisely. One effective test exploits the property that certain syntactic constructs typically require references to identical classes of objects. For example, the sentence / have two kids and George has three could mean that' George and I are goat farmers or that we have children, but it can't mean a combination of both (I have goats and George has children). On the other hand-you can say / have one horse and George has two, even when I have a colt and George has mares. Thus this test provides a way to examine our intuitions about word senses. The word kid is ambiguous between two senses, BABY-GOAT1 and BABY-HUMAN1, whereas horse is not ambiguous between MARE1 and COLT1 but rather has a sense HORSE1 that includes both. This brings up the important point that some senses are more specific than others. This property is often referred to as vagueness. The sense HORSE1 is vague to the extent that it doesn't distinguish between mares and colts. Of course, the sense MARE1 is vague with respect to whether it is a large mare or a small one. Virtually all senses involve some degree of vagueness, as they might always allow some more precise specification. A similar ambiguity test can be constructed for verb senses as well. For example, the sentence / ran last year and George did too could mean that we both were candidates in an election or that we both ran some race, but it would be difficult to read it as a mixture of the two. Thus the word run is ambiguous between the senses RUN1 (the exercise sense) and RUN2 (the political sense). In contrast, the verb kiss is vague in that it does not specify where one is kissed. You can say / kissed Sue and George did too, even though I kissed her on the cheek and George kissed her hand. In addition to lexical ambiguity, there is considerable structural ambiguity at the semantic level. Some forms of ambiguity are parasitic on the underlying syntactic ambiguity. For instance, the sentence Happy cats and dogs live on the farm is ambiguous between whether the dogs are also happy or not (that is. is it happy cats and happy dogs, or happy cats and dogs of any disposition). Although this ambiguity does have semantic consequences, it is actually rooted in the syntactic structure; that is, whether the conjunction involves two noun phrases. (Happy cats) and (dogs), or the single noun phrase (Happy (cats and dogs)). 
But other forms of structural ambiguity are truly semantic and arise from a single syntactic structure. A very common example involves quantifier scoping. For instance, does the sentence Every boy loves a dog mean that there is a single dog that all boys love, or that each boy might love a different dog? The syntactic structure is the same in each case, but the difference lies in how the quantifiers are scoped. For example, the two meanings correspond roughly to the following statements in FOPC:
∃d . Dog(d) & ∀b . Boy(b) ⊃ Loves(b, d)
∀b . Boy(b) ⊃ ∃d . Dog(d) & Loves(b, d)
Thus, while Every boy loves a dog has a single syntactic structure, its semantic structure is ambiguous. Quantifiers also vary with respect to vagueness. The quantifier all is precise in specifying every member of some set, but a quantifier such as many, as in Many people saw the accident, is vague as to how many people were involved.

You might also think that indexical terms such as you, I, and here are ambiguous because their interpretations depend on context. But this is not the sense of ambiguity being discussed here. Note that the word dog is not considered ambiguous because there are many dogs to which the noun phrase the dog could refer. The semantic meaning of the dog is precisely determined. Likewise, the pronoun I may be unspecified as to its referent, but it has a single well-defined sense. While the referents of phrases are context dependent and thus beyond the scope of this discussion, there are context-independent constraints on reference that must be accounted for. Consider the sentence Jack saw him in the mirror. While it is not specified who was seen, you know that it wasn't Jack seeing himself. To express this meaning, you would have to use the reflexive pronoun, as in Jack saw himself in the mirror. Thus certain reference constraints arise because of the structure of sentences. This topic will be explored in Chapter 12.

A very important aspect of context-independent meaning is the co-occurrence constraints that arise between word senses. Often the correct word sense can be identified because of the structure and meaning of the rest of the sentence. Consider the verb run, which has one sense referring to the action you do when jogging and another referring to the action of operating some machine. The first sense is typically realized as an intransitive verb (as in Jack ran in the park), whereas the second can only be realized as a transitive verb (as in Jack ran the printing press for years). In other cases the syntactic structure remains the same, but the possible senses of the words can only combine in certain ways. For instance, consider Jack ran in the park versus Jack ran in the election. The syntactic structures of these sentences are identical, but different senses of the verb run must be selected because of the possible senses in the modifier. One of the most important tasks of semantic interpretation is to utilize constraints such as this to help reduce the number of possible senses for each word.

8.3 The Basic Logical Form Language

The last section introduced a primitive unit of meaning, namely the word sense.
This section defines a language in which you can combine these elements to form meanings for more complex expressions. This language will resemble FOPC, although there are many equivalent forms of representation, such as network-based representations, that use the same basic ideas. The word senses will serve as the atoms or constants of the representation. These constants can be classified by the types of things they describe. For instance, constants that describe objects in the world, including abstract objects such as events and situations, are called terms. Constants that describe relations and properties are called predicates.

Figure 8.3 Two possible network representations of Sue loves Jack (one with a LOVES1 node linked directly to nodes for SUE1 and JACK1, and one with a LOVES1 node linked to SUE1 and JACK1 by agent and theme arcs)

A proposition in the language is formed from a predicate followed by an appropriate number of terms to serve as its arguments. For instance, the proposition corresponding to the sentence Fido is a dog would be constructed from the term FIDO1 and the predicate constant DOG1 and is written as
(DOG1 FIDO1)
Predicates that take a single argument are called unary predicates or properties; those that take two arguments, such as LOVES1, are called binary predicates; and those that take n arguments are called n-ary predicates. The proposition corresponding to the sentence Sue loves Jack would involve a binary predicate LOVES1 and would be written as
(LOVES1 SUE1 JACK1)
You can see that different word classes in English correspond to different types of constants in the logical form. Proper names, such as Jack, have word senses that are terms; common nouns, such as dog, have word senses that are unary predicates; and verbs, such as run, love, and put, have word senses that correspond to n-ary predicates, where n depends on how many terms the verb subcategorizes for. Note that while the logical forms are presented in a predicate-argument form here, the same distinctions are made in most other meaning representations. For instance, a network representation would have nodes that correspond to the word senses and arcs that indicate the predicate-argument structure. The meaning of the sentence Sue loves Jack in a semantic network-like representation might appear in one of the two forms shown in Figure 8.3. For most purposes all of these representation formalisms are equivalent.

More complex propositions are constructed using a new class of constants called logical operators. For example, the operator NOT allows you to construct a proposition that says that some proposition is not true. The proposition corresponding to the sentence Sue does not love Jack would be
(NOT (LOVES1 SUE1 JACK1))
English also contains operators that combine two or more propositions to form a complex proposition. FOPC contains operators such as disjunction (∨), conjunction (&), what is often called implication (⊃), and other forms (there are 16 possible truth-functional binary operators in FOPC). English contains many similar operators including or, and, if, only if, and so on. Natural language connectives often involve more complex relationships between sentences. For instance, the conjunction and might correspond to the logical operator & but often also involves temporal sequencing, as in I went home and had a drink, in which going home preceded having the drink.
The connective but, on the other hand, is like and except that the second argument is something that the hearer might not expect to be true given the first argument. The general form for such a proposition is (connective proposition proposition). For example, the logical form of the sentence Jack loves Sue or Jack loves Mary would be
(OR1 (LOVES1 JACK1 SUE1) (LOVES1 JACK1 MARY1))
The logical form language will allow both operators corresponding to word senses and operators like & directly from FOPC. The logic-based operators will be used to connect propositions not explicitly conjoined in the sentence.

With the simple propositions we just defined, we can define the meaning of only a very limited subset of English, namely sentences consisting of simple verb forms and proper names. To account for more complex sentences, we must define additional semantic constructs. One important construct is the quantifier. In first-order predicate calculus, there are only two quantifiers: ∀ and ∃. English contains a much larger range of quantifiers, including all, some, most, many, a few, the, and so on. To allow quantifiers, variables are introduced as in first-order logic but with an important difference. In first-order logic a variable only retains its significance within the scope of the quantifier. Thus two instances of the same variable x occurring in two different formulas, say in the formulas ∃x . P(x) and ∃x . Q(x), are treated as completely different variables with no relation to each other. Natural languages display a different behavior. For instance, consider the two sentences A man entered the room. He walked over to the table. The first sentence introduces a new object to the discussion, namely some man. You might think to treat the meaning of this sentence along the lines of the existential quantifier in logic. But the problem is that the man introduced existentially in the first sentence is referred to by the pronoun He in the second sentence. So variables appear to continue their existence after being introduced. To allow this, each time a discourse variable is introduced, it is given a unique name not used before. Under the right circumstances, a subsequent sentence can then refer back to this term.

Quantifier   Use                              Example
THE          definite reference               the dog
A            indefinite reference             a dog
BARE         bare singular NP (mass terms)    water, food
BARE         bare plural NP (generics)        dogs
Figure 8.4 Some common quantifiers

Natural language quantifiers have restricted ranges and thus are more complex than those found in FOPC. In FOPC a formula of the form ∀x . Px is true if and only if Px is true for every possible object in the domain (that is, x may be any term in the language). Such statements are rare in natural language. Rather, you would say all dogs bark or most people laughed, which require constructs that are often called generalized quantifiers. These quantifiers are used in statements of the general form
(quantifier variable : restriction-proposition body-proposition)
For instance, the sentence Most dogs bark would have the logical form
(MOST1 d1 : (DOG1 d1) (BARKS1 d1))
This means that most of the objects d1 that satisfy (DOG1 d1) also satisfy (BARKS1 d1). Note that this has a very different meaning from the formula
(MOST1 d2 : (BARKS1 d2) (DOG1 d2))
which roughly captures the meaning of the sentence Most barking things are dogs. A very important class of generalized quantifiers corresponds to the articles the and a.
The sentence The dog barks would have a logical form (THE x : (DOG1 x) (BARKS 1 x)) which would be true only if there is a uniquely determined dog in context and that dog barks. Clearly, in any natural setting there will be many dogs in the world, so the use of context to identify the correct one is crucial for understanding the sentence. Since this identification process requires context, however, discussion of it is delayed until Part III. Here it suffices to have a way to write the logical form. The special set of quantifiers corresponding to the articles (or the absence of articles in bare noun phrases) is shown in Figure 8.4. More complex noun phrases will result in more complex restrictions. For instance, the sentence The happy dog barks will involve a restriction that is a conjunction, namely (THE x : (& (DOG1 x) (HAPPY x)) (BARKS 1 x)). This will be true only if there is a contextually unique x such that (& (DOG1 x) (HAPPY xi) is mie. and Ibis x barks. Another construct needs to be introduced to handle plural forms, as in a phrase such as the dogs bark. A new type of constant called a predicate operator is introduced that takes a predicate as an argument and produces a new predicate. For plurals the predicate operator PLUR will be used. If DOG1 is a predicate that is true of any dog, then (PLUR DOG1) is a predicate that is true of any set of dogs. Thus the representation of the meaning of the sentence The dogs bark would be (THE x : ((PLUR DOG 1) x) (BARKS 1 x)) Plural noun phrases introduce the possibility of a new form of ambiguity. Note that the natural reading of The dogs bark is that there is a specific set of dogs, and each one of them barks. This is called the distributive reading, since the predicate BARKS 1 is distributed over each element of the set. In contrast, consider the sentence The dogs met at the corner. In this case, it makes no sense to say that each individual dog met; rather the meeting is true of the entire set of dogs. This is called the collective reading. Some sentences allow both interpretations and hence are ambiguous. For instance, the sentence Two men bought a stereo can mean that two men each bought a stereo (the distributive reading), or that two men bought a stereo together (the collective reading). The final constructs to be introduced are the modal operators. These are needed to represent the meaning of verbs such as believe and want, for representing tense, and for many other constructs. Modal operators look similar to logical operators but have some important differences. Specifically, terms within the scope of a modal operator may have an interpretation that differs from the normal one. This affects what conclusions you can draw from a proposition. For example, assume that Jack is also known as John to some people. There are two word senses that are equal; that is, JACK1 = JOHN22. With a simple proposition, it doesn't matter which of these two constants is used: if (HAPPY JOHN22) is true then (HAPPY JACK1) is true, and vice versa. This is true even in complex propositions formed from the logical operators. If (OR (HAPPY JOHN1) (SAD JOHN1)) is true, then so is (OR (HAPPY JACK1) (SAD JACK1)), and vice versa. The same propositions within the scope of a modal operator such as BELIEVE1, however, are not interchangeable. 
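To see how the substitutivity restriction might be enforced mechanically, here is a small illustrative sketch (in Python, not from the text) that substitutes one constant for an equal constant throughout a proposition, except inside the scope of a modal operator such as BELIEVE. The tuple-based encoding of propositions, the operator set, and the function name are assumptions made for the example.

```python
# Propositions are nested tuples: ("HAPPY", "JACK1"), ("BELIEVE", "SUE1", (...)).
MODAL_OPERATORS = {"BELIEVE", "PAST", "PRES", "FUT"}

def substitute(prop, old, new):
    """Replace old with new in prop, but leave anything under a modal operator
    untouched (a simplification), since equal terms are not interchangeable
    within modal contexts."""
    if isinstance(prop, str):
        return new if prop == old else prop
    if prop[0] in MODAL_OPERATORS:
        return prop
    return tuple(substitute(part, old, new) for part in prop)

p = ("AND", ("HAPPY", "JOHN22"),
            ("BELIEVE", "SUE1", ("HAPPY", "JOHN22")))

# JACK1 = JOHN22, so substitution is licensed only outside the BELIEVE context.
print(substitute(p, "JOHN22", "JACK1"))
# ('AND', ('HAPPY', 'JACK1'), ('BELIEVE', 'SUE1', ('HAPPY', 'JOHN22')))
```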
For instance, if Sue believes that Jack is happy, that is, (BELIEVE SUE1 (HAPPY JACK1)) then it does not necessarily follow that Sue believes John is happy, that is, (BELIEVE SUE (HAPPY JOHN22)) because Sue might not know that JACK1 and JOHN22 are the same person. Thus you cannot freely substitute equal terms when they occur within the scope of a modal operator. This is often referred to as the failure of substitutivity in modal contexts. 238 CHAPTER 8 Semantics and Logical Form 239 An important class of modal operators for natural language are the tensed operators, PAST, PRES, and FUT. So far all examples have ignored the effect ofl tense. With these new operators, however, you can represent the difference in: meaning between John sees Fido, John saw Fido, sad: John will see Fido, namely" as the propositions (PRES (SEES1 JOHN1 FIDOl)) (PAST (SEES1 JOHN1 FIDOl)) (FUT (SEES1 IOHN1 FIDOl)) You can see that these are modal operators because they exhibit the failure off substitutivity. For example, consider the operator PAST, and assume two constants, say JOHN1 and PRESIDENT 1, that are equal now, indicating that lohn is_ currently the president. But in the past, lohn was not the president, so IOHN1 did not equal PRES1DENT1. Given this and the fact that John saw Fido in the past-CP AST (SEES1 JOHN1 FTDOl)), you cannot conclude that the president saw-Fido in the past, that is, (PAST (SEES1 PRESIDENT 1 FIDOl)), since John was not the president at that time. Note also that a proposition and its negation can-both be true in the past (but at.different times). Thus it is possible for both the sentences John was happy and John was not happy to be true; that is. (PAST (HAPPY JOHN1)) and (PAST (NOT (HAPPY JOHN1))) are both true. This completes the specification of the basic logical form language. The next sections discuss various extensions that make the language more convenient^ for expressing ambiguity and capturing semantic regularities. 8.4 Encoding Ambiguity in the Logical Form The previous sections defined many of the constructs needed to specify the logical form of a sentence, and if you were interested solely in the nature of logical: form, you could be finished. But representations for computational use have another important constraint on them, namely the handling of ambiguity. A typi- i cal sentence will have multiple possible syntactic structures, each of which might have multiple possible logical forms. In addition, the words in the sentence will? have multiple senses. Simply enumerating the possible logical forms will not be practical. Rather, we will take an approach where certain common ambiguities can be collapsed and locally represented within the logical form, and we will -develop techniques to incrementally resolve these ambiguities as additional con-; straints from the rest of the sentence and from context are brought into play.: Many researchers view this ambiguity encoding as a separate levelof represents- = tion from the logical form, and it is often referred to as the quasi-logical form. Perhaps the greatest source of ambiguity in the logical form comes from the fact that most words have multiple senses. Some of these senses have., different structural properties, so they can be eliminated given the context of the surrounding sentence. But often words have different senses that have identical structural constraints. 
At present, the only way to encode these would be to build" BOX 8.1 The Need for Generalized Quantifiers For the standard existential and universal quantifiers, there are formulas in standard FOPC equivalent to the generalized quantifier forms. In particular, the formula (EXISTS x : Px Qx) is equivalent to 3x.Px&Qx and the universally quantified form (ALLx:PxQx) is equivalent to - (?, These generalized quantifier forms can be thought of simply as abbreviations. But the other quantifiers do not have an equivalent form in standard FOPC. To see this, consider trying to define the meaning of Most dogs bark using the standard semantics for quantifiers. Clearly . Dog(x) id Bark(x) is too strong, and 3x . Dog(x) & Bark(x) is too weak. Could we define a. new quantifier M such that some formula of form Mx. Px has the right truth conditions? Say we try to define Mx . Px to be true if over half the elements of the domain satisfy Px. Consider Mx . Dog(x) & Bark(x). This will be true only if over half the objects in the domain satisfy Dog(x) & Bark(x). But this is unlikely to be true, even if all dogs bark. Specifically, it requires over half the objects in the domain to be dogs! Maybe Mx. Dog(x) z> Bark(x) works. This would be true if more than half the objects in the domain either are not dogs or bark. But this will be true in a model containing 11 objects, 10 seals that bark, and 1 dog that doesn't! You may try other formulas for Px, but no simple approach will provide a satisfactory account. a separate logical form for each possible combination of senses for the words in the sentence. To reduce this explosion of logical forms, we can use the same technique used to handle multiple feature values in the syntactic structure. Namely, anywhere an atomic sense is allowed, a set of possible atomic senses can be used. For example, the noun ball has at least two senses: BALL1, the object used in games, and BALL2, the social event involving dancing. Thus the sentence Sue watched the ball is ambiguous out of context. A single logical form can represent these two possibilities, however: 1. (THE bl : ({BALL1 BALL2} bl) (PAST CWATCH1 SUE1 bl))) This abbreviates two possible logical forms, namely 2. (THE bl : (BALL1 bl) (PAST (WATCH1 SUE! bl))) and (THE bl : (BALL2 bl) (PAST (WATCH1 SUE! bl))) 240 CHAPTER 8 Semantics and Logical Form 241 One of the most complex forms of ambiguity in logical forms arises from the relative scoping of the quantifiers and operators. You saw in Section 8.1 that a sentence such as Every boy loves a dog is ambiguous between two readings? depending on the scope of the quantifiers. There is no context-independent v method for resolving such issues, so the ambiguity must be represented in the final logical forms for the sentence. Rather than enumerating all possible scop-ings, which would lead to an exponentially growing number of interpretations?! based on the number of scoping constructs, we introduce another abbreviation" into the logical form language that collapses interpretations together. Specifi-; cally, the abbreviated logical form does not contain scoping information at all. Rather, constructs such as generalized quantifiers are treated syntactically like terms and appear in the position indicated by the syntactic structure of the sen-: tence. They are marked using angle brackets to indicate the scoping abbreviation. 
For example, the logical forms for the sentence Every boy loves a dog are captured by a single ambiguous form
(LOVES1 <EVERY b1 (BOY1 b1)> <A d1 (DOG1 d1)>)
This abbreviates an ambiguity between the logical form
(EVERY b1 : (BOY1 b1) (A d1 : (DOG1 d1) (LOVES1 b1 d1)))
and
(A d1 : (DOG1 d1) (EVERY b1 : (BOY1 b1) (LOVES1 b1 d1)))
While the savings don't amount to much in this example, consider that a sentence with four scoping constructs in it would have 24 (4 factorial) possible orderings, and one with five scoping constructs would have 120 orderings. The abbreviation convention allows all these forms to be collapsed to a single representation. Chapter 12 will explore some heuristic techniques for determining the scope of operators. For the time being, however, it is reasonable to assume that no context-independent scoping constraints need be represented. If the restriction in a generalized quantifier is a proposition involving a simple unary predicate, a further abbreviation will be used that drops the variable from the restriction. Thus the form <EVERY b1 (BOY1 b1)> will often be abbreviated as <EVERY b1 BOY1>.

A large number of constructs in natural language are sensitive to scoping. In particular, all the generalized quantifiers, including the, are subject to scoping. For example, in At every hotel, the receptionist was friendly, the preferred reading in almost any context has the definite reference the receptionist fall within the scope of every hotel; that is, there is a different receptionist at each hotel. In addition, operators such as negation and tense are also scope sensitive. For example, the sentence Every boy didn't run is ambiguous between the reading in which some boys didn't run and some did, that is,
(NOT (EVERY b1 : (BOY1 b1) (RUN1 b1)))
and the reading where no boys ran, that is,
(EVERY b1 : (BOY1 b1) (NOT (RUN1 b1)))
These two readings are captured by the single logical form
(<NOT RUN1> <EVERY b1 BOY1>)
where unscoped unary operators (for example, NOT, PAST, PRES, and so on) are wrapped around the predicate.

Finally, let us return to two constructs that need to be examined further: proper names and pronouns. So far we have assumed that each proper name identifies a sense that denotes an object in the domain. While this was a useful abstraction to introduce the basic ideas of logical form, it is not a general enough treatment of the phenomena. In fact, proper names must be interpreted in context, and the name John will refer to different people in different situations. Our revised treatment resolves these problems by using a discourse variable that has the property of having the specified name. We will introduce this construct as a special function, namely
(NAME <variable> <name-string>)
which produces the appropriate object with the name in the current context. Thus, the logical form of John ran would be (<PAST RUN1> (NAME j1 "John")). Arguments similar to those previously given for proper names can be made for pronouns and other indexical words, such as here and yesterday, and we will treat them using a special function of the form (PRO <variable> <restriction-proposition>). For example, the quasi-logical form for Every man liked him would be
(<PAST LIKE1> <EVERY m1 (MAN1 m1)> (PRO m2 (HE1 m2)))
HE1 is the sense for he and him, and formally is a predicate true of objects that satisfy the restrictions on any antecedent, that is, being animate-male in this case. As with generalized quantifiers, when the restriction is a simple unary predicate, the pro forms are often abbreviated. For example, the logical form for he will often be written as (PRO m2 HE1).

The constructs described in this chapter dramatically reduce the number of logical forms that must be initially computed for a given sentence. Not all ambiguities can be captured by the abbreviations, however, so there will still be sentences that will require a list of possible logical forms, even for a single syntactic structure.

8.5 Verbs and States in Logical Form

So far, verbs have mapped to appropriate senses acting as predicates in the logical form. This treatment can handle all the different forms but loses some generalities that could be captured. It also has some annoying properties.
The constructs described in this chapter dramatically reduce the number of logical forms that must be initially computed for a given sentence. Not all ambiguities can be captured by the abbreviations, however, so there will still be sentences that will require a list of possible logical forms, even for a single syntactic structure. 8.5 Verbs and States in Logical Form So far, verbs have mapped to appropriate senses acting as predicates in the logical form. This treatment can handle all the different forms but loses some generalities that could be captured. It also has some annoying properties. 242 CHAPTER 8 Semantics and Logical Form 243 Consider the following sentences, all using the verb break. John broke the window with the hammer. The hammer broke the window. The window broke. Intuitively, all these sentences describe the same type of event but in-varying detail. Thus it would be nice if the verb break mapped to the same sense in each case. But there is a problem, as these three uses of the verb seem to indicate verb senses of differing arity. The first seems to be a ternary relation' between John, the window, and the hammer, the second a binary relation between the hammer and the window, and the third a unary relation involving the windows It seems you would need three different senses of break, BREAK1, BREAK2. and BREAK3, that differ in their arity and produce logical forms such as 1. ( (NAME jl "John") ), 2. ( ), and 3. ( ) Furthermore, to guarantee that each predicate is interpreted appropriately, the representation would need axioms so that whenever 1 is true, 2 is true, and when-: ever 2 is true, 3 is true. These axioms are often called meaning postulates. Having to specify such constraints for every verb seems rather inconvenient. Davidson (1967) proposed an alternative form of representation to handle such cases as well as more complicated forms of adverbial modification. He suggested introducing events into the ontology, and treating the meaning of a sentence like John broke it along the following lines (translated into our notation): (3 el : (BREAK el (NAME jl "John") (PRO il IT1))) which asserts that el is an event of John breaking the indicated window. Nov\ the meaning of John broke it with the hammer would be (3el:(& (BREAK el (NAME jl "John") (PRO illTl)) (INSTR el ))) The advantage is that additional modifiers, such as with the hammer or on Tuesday or in the hallway, can be incrementally added to the basic representation : by adding more predications involving the event. Thus only one sense of the verb break need be defined to handle all these cases. Similar intuitions also motivated the development of case grammar in the early 1970s. While case grammar has been abandoned as originally proposed, many of its concepts have continued to influence current theories. One claim that remains influential is that there is a limited set of abstract semantic relationships that can hold between a verb and its arguments. These are often called thematic roles or case roles, and while different researchers have used different sets «>l roles, the number required has always remained small. The intuition is that John. the hammer, and the window play the same semantic roles in each of these sentences. John is the actor (the agent role), the window is the object (the theme role), and the hammer is the instrument (the instrument role) used in the act of breaking. 
This suggests a representation of sentence meaning similar to that used in semantic networks, where everything is expressed in terms of unary and binary relations. Specifically, using the three thematic roles just mentioned, the meaning of John broke the window would be (3 e (& (BREAK e) (AGENT e (NAME jl "John")) (THEME e ))) Because such constructs are so common, we introduce a new notation for them. The abbreviated form for an assertion of the form (3 e : (& (Event-p e) (Relation} e obji)... (Relationn e objn))) will be (Event-p e [Relationr obji]... [Relationn objn]) In particular, the quasi-logical form for the sentence John broke the window using this abbreviation is ( el [AGENT (NAME jl "John")] [THEME ]) It turns out that similar arguments can be made for verbs other than event verbs. Consider the sentence Mary was unhappy. If it is represented using a unary predicate as ( (NAME jl "Mary")) then how can we handle modifiers, as in Mary was unhappy in the meeting! Using the previous argument, we can generalize the notion of events to include states and use the same technique. For instance, we might make UNHAPPY a predicate that asserts that its argument is a state of unhappiness, and use the THEME role to include John, that is, (3 s s) (THEME s (NAME jl "Mary")) Of course, we could use the same abbreviation convention as previously defined for events and thus represent Mary was unhappy in the meeting as ( s [THEME (NAME jl "Mary")] [IN-LOC ]) It has been argued that having event variables as arguments to atomic predicates is still not enough to capture the full expressiveness of natural language. Hwang and Schubert (1993a), for instance, allow events to be defined by arbitrary sentences. For now, having events and states will be sufficient to develop the important ideas in this book. 244 CHAPTER 8 Semantics and Logical Form 245 In many situations using explicit event and state variables in formulas cumbersome and interferes with the development of other ideas. As a result, w*. will use different representations depending on what is best for the presentation For example, the logical form of Mary sees John will sometimes be written as (PRES (SEESl 11 [AGENT (NAME jl "Mary")] [THEME (NAME ml "John")])) which of course is equivalent to (PRES (3 11 (& (SEES 111) (AGENT 11 (NAME j 1 "Mary")) (THEME 11 (NAME ml "John"))))) Other times it will be written in predicate argument form: (PRES (SEESl (NAME jl "Mary") (NAME ml "John"))) There has been considerable debate in the literature about whether the thematic role analysis is necessary, or whether all the properties we want from thematic roles can be obtained in other ways from a predicate-argument slyle representation. While an argument that thematic roles are necessary is hard to construct, the representation is sufficiently helpful to be used in many semantic representation languages. 8.6 Thematic Roles This section examines theories based on the notion of thematic roles, or cases. One motivating example from the last section included the sentences John broke the window with the hammer. The hammer broke the window. The window broke. John, the hammer, and the window play the same semantic roles in each of these sentences. John is the actor, the window is the object, and the hammer is the instrument used in the act of breaking of the window. We introduced relations such as AGENT, THEME, and INSTR to capture these intuitions. 
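The bracketed role notation for events introduced in the last section can be expanded mechanically into the explicit existentially quantified form it abbreviates. The following is a minimal sketch (in Python, not from the text); the tuple encoding, the EXISTS label, and the function name are assumptions made for illustration.

```python
def expand_event(event_pred, event_var, roles):
    """Expand (event_pred event_var [role term] ...) into the explicit form
    (EXISTS event_var (& (event_pred event_var) (role1 event_var term1) ...))."""
    conjuncts = [(event_pred, event_var)]
    conjuncts += [(role, event_var, term) for role, term in roles]
    return ("EXISTS", event_var, ("&",) + tuple(conjuncts))

# John broke the window (the window term is abbreviated to a constant here):
lf = expand_event("BREAK1", "e1",
                  [("AGENT", ("NAME", "j1", "John")),
                   ("THEME", "WINDOW-TERM")])
print(lf)
# ('EXISTS', 'e1', ('&', ('BREAK1', 'e1'),
#                        ('AGENT', 'e1', ('NAME', 'j1', 'John')),
#                        ('THEME', 'e1', 'WINDOW-TERM')))
```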
But can we define these relations more precisely, and what other thematic roles have proved useful in natural language systems? These issues are explored in this section. Perhaps the easiest thematic role to define is the AGENT role. A noun phrase fills the AGENT role if it describes the instigator of the action described by the sentence. Further, this role may attribute intention, volition, or responsibility for the action to the agent described. One test for AGENT-hood involves adding phrases like intentionally' or in order to to active voice sentences. If the resulting sentence is well formed, the subject NP can fill the AGENT role. The following sentences are acceptable: John intentionally broke the window. John broke the window in order to let in some air. But these sentences are not acceptable: *The hammer intentionally broke the window. *The window broke in order to let in some air. Thus the NP John fills the AGENT role only in the first two sentences. Not all animate NPs, even in the subject position, fill the AGENT role. For instance, you cannot normally say *John intentionally died. *Mary remembered her birthday in order to get some presents. Of course, by adding the phrase intentionally to the first sentence, you may construct some plausible reading of the sentence (John killed himself), but this is a result of modifying the initial meaning of the sentence John died. NPs that describe something undergoing some change or being acted upon will fill a role called THEME. This usually corresponds to the syntactic OBJECT and, for any transitive verb X, is the answer to the question "What was Xed?" For example, given the sentence The gray eagle saw the mouse, the NP the mouse is the THEME and is the answer to the question "What was seen?" For intransitive verbs, the THEME role is used for the subject NPs that are not AGENTs. Thus in The clouds appeared over the horizon, the NP the clouds fills the THEME role. More examples follow, with the THEME NP in italics: The rock broke. John broke the rock. I gave John the book. A range of roles has to do with locations, or abstract locations. First, we must make the distinction mentioned earlier between relations that indicate a location or place and those that indicate motion or paths. The AT-LOC relation indicates where an object is or where an event takes place, as in Harry walked on the road. The chair is by the door. On the road describes where the walking took place, while by the door describes where the chair is located. Other phrases describe changes in location, direction of motion, or paths: I walked from here to school yesterday. It fell to the ground. The birds flew from the lake along the river gorge. There are at least three different types of phrases here: those that describe where something came from (the FROM-LOC role), such as from here; those that describe the destination (the TO-LOC role), such as to the ground; and those that describe the trajectory or path (the PATH-LOC role), such as along the gorge. 246 CHAPTER 8 Semantics and Logical Form 247 These location roles can be generalized into roles over arbitrary state^ values, called the AT role, and roles for arbitrary state change (the FROM. TO, and PATH roles). Thus AT-LOC is a specialization of the AT role, and so on. You can see other specializations of these roles when you consider the abstract ■ relation of possession: I threw the ball to John. I gave a book to John. I caught the ball from John. I borrowed a book from John. The box contains a ball. John owns a book. 
(the TO-LOC role) (the TO-POSS role) (the FROM-LOC role) (the FROM-POSS role) (the AT-LOC role) (the AT-POSS role) Similarly, you might define AT-TIME, TO-TIME, and FROM-TIME roles, as in I saw the car at 3 o 'clock. (the AT-TIME role) I worked from one until three, (the FROM-TIME and TO-TIME role) The roles apply to general state change as well, as with temperature in The temperature remains at zero. The temperaturemsz from zero. (AT-VALUE) (FROM-VALUE) Thus the notion of general value and change of value along many dimensions seems to be supported by the similarity of the ways of realizing these roles in sentences. Another role is motivated by the problem that, given the present taxonomy, you cannot easily classify the role of the NP in a sentence such as John believed that it was raining. The THEME role is filled with the clause that it was raining, since this is what is believed. John cannot be an AGENT because there is no intentionality in believing something. Thus you must introduce a new role, called EXPERI-ENCER, which is filled by animate objects that are in a described psychological state, or that undergo some psychological process, such as perception, as in the preceding sentence and as in John saw the unicorn. Another role is the BENEFICIARY role, which is filled by the animate person for whom a certain event is performed, as in I rolled on the floor for Lucy. Find me the papers! I gave the book to Jack for Susan. The last example demonstrates the need to distinguish the TO-POSS role (that is; to Jack) from the BENEFICIARY role. The INSTR role describes a tool, material, or force used to perform some event, as in Harry broke the glass with the telescope. The telescope broke the glass. I used some flour to make a cake. I made a cake with some flour. Depending on the verb, the INSTR role sometimes can be used as the surface subject when the AGENT role is not specified. Natural forces are also included in the INSTR category here, although you could argue for a different analysis. Thus the following are also examples of the INSTR role: The sun dried the apples. Jack used the sun to dry the apples. The AGENT and INSTR roles could be combined into a more general role named CAUSAL-AGENT. Other roles need to be identified before certain sentences can be analyzed. For example, some sentences describe situations where two people perform an act together: Henry lifted the piano with Jack. To handle this, you must introduce a role CO-AGENT to account for the PP with Jack. A more complicated case occurs in sentences involving exchanges or other complex interactions. For example, consider the sentences Jack paid SI to the man for the book. Jack bought the book from the man for $1. These sentences both describe a situation where Jack gives the man $1 and receives the book in exchange. In the first sentence, however, the $1 is the THEME and there is no role to account for the book. In the second sentence the situation is reversed: the book is the THEME and $1 is unaccounted for. To handle these cases you must add a role CO-THEME for the second object in an exchange. A more general solution to this problem would be to analyze such sentences as describing two events. The primary event is the one you have been considering so far, but a secondary event may be present. 
In this analysis you might analyze the first sentence as follows (the primary event being Jack paying the dollar, and the secondary being Jack receiving the book):
Jack: AGENT of both PRIMARY and SECONDARY event
$1: THEME of PRIMARY event
the man: TO-POSS of PRIMARY, FROM-POSS of SECONDARY
the book: THEME of SECONDARY event
This possibility will not be pursued further, however, since it leads into many issues not relevant to the remainder of this chapter.

Role and Subroles    Other Common Names    Definition
CAUSAL-AGENT                               the object that caused the event
  AGENT                                    intentional causation
  INSTR                                    force/tool used in causing the event
THEME                PATIENT               the thing affected by the event
EXPERIENCER                                the person involved in perception or a physical/psychological state
BENEFICIARY                                the person for whom an act is done
AT                                         the state/value on some dimension
  AT-LOC             LOCATION              current location
  AT-POSS            POSSESSOR             current possessor
  AT-VALUE                                 current value
  AT-TIME                                  current time
TO                                         final value in a state change
  TO-LOC             DESTINATION           final location
  TO-POSS            RECIPIENT             final possessor
  TO-VALUE                                 final value
FROM                                       original value in a state change
  FROM-LOC           SOURCE                original location
  FROM-POSS                                original possessor
  FROM-VALUE                               original value
PATH                                       path over which something travels
CO-AGENT                                   secondary agent in an action
CO-THEME                                   secondary theme in an exchange
Figure 8.5 Some possible semantic roles

Figure 8.5 provides a summary of most of the roles distinguished thus far and the hierarchical relationships between them. As you've seen, verbs can be classified by the thematic roles that they require. To classify them precisely, however, you must make a distinction between roles that are "intimately" related to the verb and those that are not. For example, almost any past tense verb allows an AT-TIME role realized by the adverb yesterday. Thus this role is apparently more a property of verb phrases in general than a property of any individual verb. However, other roles, namely those realized by constituents for which the verb subcategorizes, seem to be properties of the verb. For example, the verb put subcategorizes for a PP, and furthermore, this PP must realize the TO-LOC role. In verb classification this latter type of role is important, and these roles are called the inner roles of the verb. The preceding examples suggest one test for determining whether a given role is an inner role for a given verb: if the role is obligatory, it is an inner role. Other inner roles, however, appear to be optional, so other tests are also needed. Another test is based on the observation that all verbs may take at most one NP in any given inner role. If multiple NPs are needed, they must be related by a conjunction. Thus you can say John and I ran to the store, but not *John I ran to the store. Similarly, you can say I ran to the store and to the bank, but not *I ran to the store to the bank. Thus the AGENT and TO-LOC roles for the verb run are inner roles. Verbs typically specify up to three inner roles, at least one of which must always be realized in any sentence using the verb. Sometimes a particular role must always be present (for example, TO-LOC with put). Typically, the THEME role is also obligatory, whereas the AGENT role is always optional for any verb that allows the passive form. There are also syntactic restrictions on how various roles can be realized. Figure 8.6 shows a sample of ways that roles can be realized in different sentences.
The following are some sample sentences with each verb in italics and its argument, whether NP, PP, or embedded S, classified by its role in order of occurrence:
Jack ran.                                                  AGENT only
Jack ran with a crutch.                                    AGENT + INSTR
Jack ran with a crutch for Susan.                          AGENT + INSTR + BENEFICIARY
Jack destroyed the car.                                    AGENT + THEME
Jack put the car through the wall.                         AGENT + THEME + PATH
Jack sold Henry the car.                                   AGENT + TO-POSS + THEME
Henry pushed the car from Jack's house to the junkyard.    AGENT + THEME + FROM-LOC + TO-LOC
Jack is tall.                                              THEME
Henry believes that Jack is tall.                          EXPERIENCER + THEME
Susan owns a car.                                          AT-POSS + THEME
I am in the closet.                                        THEME + AT-LOC
The ice melted.                                            THEME
Jack enjoyed the play.                                     EXPERIENCER + THEME
The ball rolled down the hill to the water.                THEME + PATH + TO-LOC

Role           Realization
AGENT          as subject in active sentences; preposition by in passive sentences
THEME          as object of transitive verbs; as subject of nonaction verbs
INSTR          as subject in active sentences with no agent; preposition with
EXPERIENCER    as animate subject in active sentences with no agent
BENEFICIARY    as indirect object with transitive verbs; preposition for
AT-LOC         prepositions in, on, beyond, etc.
AT-POSS        possessive NP; as subject of sentence if no agent
TO-LOC         prepositions to, into
TO-POSS        preposition to; indirect object with certain verbs
FROM-LOC       prepositions from, out of, etc.
FROM-POSS      preposition from
Figure 8.6 Common realizations of the major roles

8.7 Speech Acts and Embedded Sentences

Sentences are used for many different purposes. Each sentential mood indicates a different relation between the speaker and the propositional content of the utterance. These issues will be examined in greater detail in later chapters. For now the logical form language is extended to capture the distinctions. Each of the major sentence types has a corresponding operator that takes the sentence interpretation as an argument and produces what is called a surface speech act. These indicate how the proposition described is intended to be used to update the discourse situation. They are indicated by new operators as follows:
ASSERT—the proposition is being asserted.
Y/N-QUERY—the proposition is being queried.
COMMAND—the proposition describes an action to perform.
WH-QUERY—the proposition describes an object to be identified.
For declarative sentences, such as The man ate a peach, the complete LF is
(ASSERT (<PAST EAT1> e1 [AGENT <THE m1 (MAN1 m1)>] [THEME <A p1 (PEACH1 p1)>]))
For yes/no questions, such as Did the man eat a peach?, the LF is
(Y/N-QUERY (<PAST EAT1> e1 [AGENT <THE m1 (MAN1 m1)>] [THEME <A p1 (PEACH1 p1)>]))
For commands, such as Eat the peach, the LF is
(COMMAND (EAT e1 [THEME <THE p1 (PEACH1 p1)>]))
For wh-questions, such as What did the man eat?, several additions need to be made to the logical form language. First, you need a way to represent the meaning of noun phrases involving wh-terms. A new quantifier WH is defined that indicates that the term stands for an object or objects under question. Thus a noun phrase such as what would be represented by a WH term with an essentially unrestricted predicate, which man as <WH m1 (MAN1 m1)>, and who as a WH term restricted to persons, giving an LF of the general form
(WH-QUERY (<PAST EAT1> e1 [AGENT <THE m1 (MAN1 m1)>] [THEME <WH w1 ...>]))
Embedded sentences, such as relative clauses, end up as complex restrictions within the noun phrase construction and thus do not need any new notation. For example, the logical form of the sentence The man who ate a peach left would be
(ASSERT (<PAST LEAVE1> l1 [AGENT <THE m1 (& (MAN1 m1) (<PAST EAT1> e2 [AGENT m1] [THEME <A p1 (PEACH1 p1)>]))>]))
Figure 8.7 gives a formal definition of the syntax of the logical form language introduced throughout this chapter.
Figure 8.8 gives the additional rules defining the syntax for the quasi-logical form language.

UTTERANCE -> (ASSERT PROPOSITION) | (Y/N-QUERY PROPOSITION) |
             (COMMAND PROPOSITION) | (WH-QUERY PROPOSITION)
PROPOSITION -> (n-ARY-OPERATOR PROPOSITION1 ... PROPOSITIONn) |
               (QUANTIFIER VARIABLE : PROPOSITION PROPOSITION) |
               (n-ARY-PREDICATE TERM1 ... TERMn) |
               (EVENT-STATE-PRED VARIABLE [ROLE-NAME TERM]1 ... [ROLE-NAME TERM]n)
TERM -> VARIABLE | (NAME VARIABLE NAME-STRING) | (PRO VARIABLE PROPOSITION)
1-ARY-OPERATOR -> NOT | PAST | PERF | PROG | ...
2-ARY-OPERATOR -> AND | BUT | IF-THEN | ...
QUANTIFIER -> THE | SOME | WH | ∃ | ...
VARIABLE -> b1 | man3 | ...
1-ARY-PREDICATE -> TYPE-PREDICATE | HAPPY1 | RED1 | ...
TYPE-PREDICATE -> EVENT-STATE-PRED | (PLUR TYPE-PREDICATE) | MAN1 | ...
EVENT-STATE-PRED -> RUN1 | LOVE3 | GIVE1 | HAPPY1 | ...
2-ARY-PREDICATE -> ROLE-NAME | ABOVE1 | ...
ROLE-NAME -> AGENT | THEME | AT-LOC | INSTR | ...
NAME-STRING -> "John" | "The New York Times" | ...

Figure 8.7 A formal definition of the syntax of the logical form language

TERM -> <QUANTIFIER VARIABLE PROPOSITION>
n-ARY-PREDICATE -> <n-ARY-OPERATOR n-ARY-PREDICATE>
n-ARY-OPERATOR -> {n-ARY-OPERATOR1 ... n-ARY-OPERATORm}
QUANTIFIER -> {QUANTIFIER1 ... QUANTIFIERm}
n-ARY-PREDICATE -> {n-ARY-PREDICATE1 ... n-ARY-PREDICATEm}
TYPE-PREDICATE -> {TYPE-PREDICATE1 ... TYPE-PREDICATEm}
EVENT-STATE-PRED -> {EVENT-STATE-PRED1 ... EVENT-STATE-PREDm}
ROLE-NAME -> {ROLE-NAME1 ... ROLE-NAMEm}

Figure 8.8 Additional rules defining the quasi-logical form

8.8 Defining Semantic Structure: Model Theory

So far, the term semantics has been used to refer to the representation of meanings for sentences. This was expressed in the logical form language. There is another meaning of semantics used in formal language theory that provides a meaning for the logical form language itself. It concentrates on distinguishing the different classes of semantic objects by exploring their model-theoretic properties, that is, by defining semantic units in terms of their mapping to set theory. While these techniques are most commonly used for logics, they can be applied to natural language as well. This section is optional and is not necessary for understanding the rest of the book. If you are not familiar with FOPC and its model-theoretic semantics, you should read Appendix A before reading this section.

The basic building block for defining semantic properties is the idea of a model, which informally can be thought of as a set of objects and their properties and relationships, together with a specification of how the language being studied relates to those objects and relationships. A model can be thought of as representing a particular context in which a sentence is to be evaluated. You might impose different constraints on what properties a model must have. For instance, the standard models for logic, called Tarskian models, are complete in that they must map every legal term in the language into the domain and assign every statement to be true or false. But various forms of partial models are possible that do not assign all statements to be either true or false. Such models are like the situations described in the last section, and in fact a mathematical theory of situations could be a formal model in the general sense we are using here. These issues will be discussed further as they become relevant to the discussion.
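One consequence of the ambiguity rules in Figure 8.8 is that an ambiguous quasi-logical form can be expanded mechanically into the fully specified logical forms it stands for. The following Python sketch illustrates that expansion under the assumption that a set of alternatives is written as an ("ALT", ...) tuple; this encoding and the expand function are illustrative, not the book's representation.

from itertools import product

def expand(form):
    """Return the list of fully specified forms an ambiguous form stands for."""
    if isinstance(form, tuple) and form and form[0] == "ALT":
        # a set of alternatives: expand every choice
        return [f for choice in form[1:] for f in expand(choice)]
    if isinstance(form, tuple):
        # expand each position and take all combinations of the results
        return [tuple(combo) for combo in product(*(expand(part) for part in form))]
    return [form]    # an atomic symbol has exactly one reading

# An invented ambiguous form for "He ran", with two senses of run:
ambiguous = (("ALT", "RUN1", "RUN2"), "r1", ("AGENT", ("PRO", "h1", "HE1")))
for reading in expand(ambiguous):
    print(reading)
# -> ('RUN1', 'r1', ('AGENT', ('PRO', 'h1', 'HE1')))
#    ('RUN2', 'r1', ('AGENT', ('PRO', 'h1', 'HE1')))

A form with several ambiguous positions expands into the product of the choices, so a single compact form can stand for many readings at once; that is the efficiency this notation is meant to provide.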
Model theory is an excellent method for studying context-independent meaning, because the meanings of sentences are not defined with respect to one specific model but rather by how they relate to any possible model. In other words, the meaning of a sentence is defined in terms of the properties it has with respect to an arbitrary model. Formally, a model m is a tuple <Dm, Im>, where Dm is the domain of interpretation (that is, a set of primitive objects), and Im is the interpretation function. To handle natural language, the domain of interpretation would have to allow objects of all the different types of things that can be referred to, including physical objects, times, locations, events, and situations. The interpretation function maps senses and larger structures into structures defined on the domain. For example, the following describe how an interpretation function will interpret the senses based on some lexical classes:

Senses of noun phrases refer to specific objects; the interpretation function maps each to an element of Dm.

Senses of singular common nouns (such as dog, idea, party) identify classes of objects in the domain; the interpretation function maps them to sets of elements from Dm (that is, subsets of Dm).

Senses of verbs identify sets of n-ary relations between objects in Dm. The arity depends on the verb. For instance, the exercising sense of run, RUN1, might map to a set of unary tuples <X>, where X runs; the usual sense of loves, LOVES1, to a set of binary tuples <X, Y>, where X loves Y; and the usual sense of put, PUT1, to a set of ternary tuples <X, Y, L>, where X puts Y in location L.

We can now define a notion of truth of a proposition in the logical form language, again relative to an arbitrary model m. A proposition of the form (Vn a1 ... an) is true with respect to a model m if and only if the tuple consisting of the interpretations of the ai's is in the set that is the interpretation of Vn, that is, the tuple <Im(a1), ..., Im(an)> is in the set Im(Vn). Following conventional notation, we will write the fact that a proposition P is true with respect to a model m as

m |= P

This is sometimes also read as "m supports P." For example, m supports (RUN1 JACK1) only if Im(JACK1) is in the set Im(RUN1), and m supports (LOVES1 JACK1 SUE1) only if <Im(JACK1), Im(SUE1)> is in the set Im(LOVES1).

The semantics of negation depends on the properties of the models used. In standard Tarskian semantics, where the models are complete, negation is defined in terms of the absence of the relation necessary to make the proposition true; that is, a model m supports (NOT P) if and only if it does not support P. In other models this may be too strong, as a proposition might be neither true nor false. Such models would have to maintain two disjoint sets for every relation name: the tuples for which the relation holds and the tuples for which it is false. Any tuple not in either of these sets would be neither true nor false.

The semantics of the quantifiers in FOPC is fairly simple. For example, the proposition ∀x . P(x) is true with respect to a model m if and only if P(x) is true for any value of x in Dm. The proposition ∃x . P(x), on the other hand, is true with respect to a model m if and only if P(x) is true for at least one value of x in Dm. For natural languages we have generalized quantifiers. The truth conditions for each quantifier specify the required relationship between the objects satisfying the two propositions.
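The definitions above can be made concrete with a toy model in Python. The domain, the senses, and the interpretation function below are invented for illustration; the supports check implements the stated condition that (Vn a1 ... an) is true exactly when the tuple of interpreted arguments is in Im(Vn).

# A toy model m = <Dm, Im>.  Dm is the domain; Im maps noun phrase senses
# to elements of Dm, common noun senses to subsets of Dm, and verb senses
# to sets of n-ary tuples over Dm.  All the particular names are invented.

Dm = {"jack", "sue", "fido"}

Im = {
    "JACK1":  "jack",
    "SUE1":   "sue",
    "DOG1":   {"fido"},
    "RUN1":   {("jack",), ("fido",)},
    "LOVES1": {("jack", "sue")},
}

def supports(proposition):
    """m |= (Vn a1 ... an) iff <Im(a1), ..., Im(an)> is in Im(Vn)."""
    predicate, *args = proposition
    return tuple(Im[a] for a in args) in Im[predicate]

print(supports(("RUN1", "JACK1")))            # True
print(supports(("LOVES1", "JACK1", "SUE1")))  # True
print(supports(("LOVES1", "SUE1", "JACK1")))  # False

Because this toy model is complete in the Tarskian sense, anything not supported can simply be treated as false; a partial model would instead keep separate positive and negative tuple sets, as noted above.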
For example, consider the proposition (MOST1 x (P x) (Q x)). This might be defined to be true with respect to a model m if and only if over half the objects in Im(P) are in Im(Q)- For example, (MOST1 x (DOG1 x) (BARKS I x)) would be true with respect to a model m if and only if more than half the elements of the set (x I (DOG1 x)} are also in the set {y I (BARKS 1 y)}, that is, only if over half the dogs in m bark. Semantic Relations Among Sentences With a semantic theory in hand, it is now possible to be more precise about certain inferential relationships among sentences. For instance, when you know that some sentence S is true, then some other sentences must also be true. For instance, if you know A red box is on the table then you also know that A box is on the table. This relationship between sentences is called entailment, and we say the sentence A red box is on the table entails the sentence A box is on the table. Formally, entailment can be defined in terms of the models that support the sentences. In particular, sentence S entails sentence S' if and only if every model that supports S also supports S'; that is, S entails S' if and only if for any model m, if m 1= S then m 1= S' Conversely, if there are no models that support both sentences simul- ~ taneously, the sentences are said to contradict each other; that is, there is no possible situation in which both statements can be true. Slightly modifying the previous example, we know that the sentences A red box is on the table and There is no box on the table are contradictory because there is no model that can support these two sentences simultaneously. Given that entailments must hold in all possible models, it might seem that there won't be very many entailments to be found. This turns out not to be the case as the models share more properties than you might first expect. In particular, all the context-independent properties of language will be shared by every model. Consider the above example. While the set of boxes and the set of red things could vary dramatically from model to model, the above entailment did not rely on the actual objects being referred to. Rather, it depended only on the general interpretation of the senses of red and box, and the set-theoretic property that given two sets X and Y, then the intersection of X and Y is contained in the set X (and Y for that matter). Thus, as long as the model interprets the NP A red box as selecting an object from the set RED1 intersected with the set BOX1, then that same object will be in the set BOX1. This is all that is required. Thus every model will include the entailment. Furthermore, Section 8.2 suggested that word senses could be organized into a hierarchy based on set inclusion. This defines an additional context-independent structure on the senses and thus will be included in all possible models. As a consequence, the sentence A mare ran away will entail the sentence A horse ran away, because every model must map MARE1 to a subset of HORSE1. While entailment is a useful property, natural language understanding typically deals in weaker connections between sentences, which we will call implications. A sentence S implies a sentence S' if S being true suggests or makes it likely that S' is also true. For example, the sentence / used to walk to school every day implies that I don't walk to school anymore but doesn't entail it, as I could say / used to walk to school every day. In fact, I still do! 
Given that these two sentences are not anomalous, there must be a model that assigns truth to both of them, and in this model it will not be true that I don't walk to school anymore. However, in a typical context, the models that are relevant to that context might all support that I don't walk to school anymore. One of the difficult problems in language understanding results from the fact that most inferences made are implications rather than entailments, and thus conclusions drawn at one time might need to be reconsidered and retracted at a later time.

BOX 8.2 Extensional and Intensional Readings

The semantic framework discussed in this chapter almost exclusively concerns what is called the extensional meaning of expressions. This is a way of saying that meaning is defined in terms of what the expressions denote in the world. For example, the word sense DOG1 obtains its meaning from the set of objects that it denotes, namely Im(DOG1). Likewise, a term such as the dog or John would derive its meaning from the object in the domain that it denotes. If two terms denote the same object, then they are semantically indistinguishable. The extensional view covers a wide range of language but cannot account for many expressions. Consider a classic example from Montague (1974): The temperature is rising. What is the meaning of the NP the temperature in this sentence? If it denotes a particular value, say 30°, then it could not be rising, as the sentence 30° is rising is nonsensical. Rather, the phrase the temperature must denote a function over time, say the function Temp that, given a time and a location, yields the temperature. Given this meaning, the predicate is rising would be true of any function that is increasing in value over time. Thus there seem to be two different meanings for the term the temperature. The extensional reading would be the actual value at a particular time and place, say 30°, while the intensional reading would be a function that takes a time and place and yields a value. Some verb phrases, such as is rising, require the intensional meaning, while others, such as in the sentence The temperature is over 90°, require the extensional meaning, as it would make no sense to say that a function has the property of being over 90°. Because of the complexities in dealing with intensional readings, we will only deal with extensional readings of expressions throughout this book.

Modal Operators and Possible Worlds Semantics

To provide a semantics for modal operators, more structure must be imposed on the model theory. In particular, the truth of a proposition will be defined with respect to a set of models rather than a single model. In addition, this set of models may have a rich structure and hence is called a model structure.

Consider the past tense operator PAST. Let us assume that each individual model m represents the state of the world at some point in time. Then a model structure would be the set of models describing one possible history of the world, in which models are related to each other by an operator <, where m1 < m2 means that m1 describes a state of the world before the state described by m2. The truth of propositions is now defined with respect to an arbitrary model structure Q that has the < relation between models.
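Here is a minimal Python sketch of such a model structure: a list of time-indexed models ordered so that earlier states precede later ones. The predicates and the history are invented for illustration, and the past check anticipates the truth condition for (PAST P) stated in the next paragraph.

# A model structure Q sketched as a list of models ordered in time:
# history[i] describes an earlier state than history[j] whenever i < j.
# Each model maps a predicate sense to the set of tuples for which it
# holds; arguments are given directly as domain objects to keep it short.

history = [
    {"RAINS1": set()},           # m0: earliest state, not raining
    {"RAINS1": {("here",)}},     # m1: raining
    {"RAINS1": set()},           # m2: the current state, not raining
]

def supports(model, proposition):
    predicate, *args = proposition
    return tuple(args) in model.get(predicate, set())

def past(history, now, proposition):
    """(PAST P) holds at history[now] iff P holds in some earlier model."""
    return any(supports(history[i], proposition) for i in range(now))

print(supports(history[2], ("RAINS1", "here")))  # False: not raining now
print(past(history, 2, ("RAINS1", "here")))      # True: it rained before now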
For example, we can now define a semantics for the expression (PAST P) with respect to a model m in a model structure Q:

m |=Q (PAST P) if and only if there is another model m' in Q such that m' < m and m' |=Q P

In other words, (PAST P) is true in a model m if and only if there is another model in the model structure that describes a time previous to m in which P is true. This type of analysis is called a possible worlds semantics.

Summary

This chapter presented a context-independent semantic representation called the logical form. Such a representation is desirable to simplify the process of computing semantic structures from the syntactic structure, and more generally to modularize the processing of sentences by separating out contextual effects. The logical form language uses many of the concepts from FOPC, including terms, predicates, propositions, and logical operators. The most significant extensions to basic FOPC were the following:

• generalized quantifiers, which indicate a relationship between two sets and correspond to words such as each, every, some, most, several, and so on
• modal operators, which identify different modalities in which to consider propositions and correspond to words such as believe and hope, as well as the tense operators
• predicate operators, which map one predicate into a new predicate, as with the operator for plural forms that takes a predicate describing individuals with a property P and produces a predicate that describes sets of individuals, each with property P

In addition, the logical form language can encode many common forms of ambiguity in an efficient manner by allowing alternative senses to be listed wherever a single sense is allowed. An important aspect of the logical form language is its use of event and state variables on predicates. This allows you to represent additional adverbial modifiers without having to introduce different predicates for each possible combination of arguments to the verb. A representation based on thematic roles was also introduced. While the roles may differ from system to system, this style of representation is very common in natural language systems.

Related Work and Further Readings

The logical form language developed here draws from many sources. The most influential early source is the representation language in the LUNAR system (Woods, 1978), which contains initial ideas for many of the details in this representation. More recent strong influences are the logical form languages developed by Schubert and Pelletier (1982), Hwang and Schubert (1993a), and Moore (1981), and the quasi-logical form in the Core Language Engine (Alshawi, 1992). Like many logical form languages, the one developed here is heavily based on techniques of introducing events into the ontology to handle optional arguments and other verb modifiers. This technique became influential following a landmark paper by Davidson (1967), who presented many arguments for the approach. Moore (1981), Alshawi (1992), and Hobbs et al. (1987) are all good examples of such event-based systems.

Work on thematic roles originates from work on case grammar by Fillmore (1968), who introduced six cases. These ideas have been adapted by many researchers since in linguistics and philosophy (for example, see Jackendoff (1972), Dowty (1989), and Parsons (1990)). Similar techniques can be found in many computational systems.
In these systems, thematic roles are often linked lo the underlying knowledge representation (for example, Charniak (1981)). The idea of encoding scoping ambiguity in the logical form has been used by many systems. Woods (1978) and Cooper (1983), for instance, suggested a two-part representation of logical form: one part for the predicate-argument structure and the other for the quantifier information. Encoding the scoped forms using a special syntax has been used in Schubert and Pelletier (1982), McCord (1986), Hobbs and Shieber (1987). and many recent systems such as the Core Language Engine (Alshawi, 1992). This encoding is not only important for encoding ambiguity, but also plays a key role in allowing the definition of compositional semantic interpretation theories, as will be discussed in Chapter 9. An excellent general source on semantics in linguistics can be found in Chierchia and McConnell-Ginet (1990). This book contains discussions on the nature of semantics and issues such as ambiguity and vagueness, indexicality. denotational semantics, generalized quantifiers, intensionality, and a host of other issues. McCawley (1993) is also an excellent and accessible source on the logical underpinnings of semantics. More detailed presentation of the mathematical underpinnings of semantic representations can be found in Partee et al. (1993). Generalized quantifiers are discussed in Barwise and Cooper (1981). Many of the semantic ideas discussed here were first developed within symbolic logic, for which there are many excellent texts (for example, Thomason (1970); Barwise and Etchemendy (1987)). The current form of semantics for natural language is most heavily influenced by the work of Montague (1974). Much of the current work in formal semantics is within the framework of situation semantics. The notion of a situation discussed in this chapter is drawn from that work. Situation semantics was developed by Jon Barwise and collaborators, and an initial version of the theory is described in Barwise and Perry (1983). An excellent discussion and formalization of situations in found in Devlin (1991). Exercises for Chapter 8_ 1. (easy) State why each of the following sentences are ambiguous or not. Specifically, state whether they are ambiguous because of their possible syntactic structures, their word senses, their semantic structures, or a combination of these factors. Give a paraphrase of each reading. A man slopped at every truck stop. Several people ate the pizza. We saw her duck. 2. (medium) For each of the following words, argue whether the word is ambiguous or vague. Give at least one linguistic test that supports your answer. Discuss any difficulties that arise in making your decision. face—as a noun, as in what's on your head, the front of a clock, and the side of a mountain bird—as a noun, as in animals that fly (such as a sparrow) and china figures of such animals record—as a noun, as in where a school keeps your grades or where the FBI writes down your movements 3. (easy) Identify the senses of the word can used in the following sentences, and specify whether the sense is a term, property (unary predicate), n-ary predicate, logical operator, generalized quantifier, predicate operator, or modal operator. Justify your classification and discuss alternate interpretations that could work just as well. Give an example logical form that uses each sense. The yellow can fell to the ground. He can see it. He wants to can the tomatoes. 4. 
(easy) Expand out the following ambiguous logical forms, giving a complete list of the full logical forms allowed by each ambiguous form.

({RUN1 RUN2} [AGENT (PRO h1 HE1)])
(SEES1 s1 [AGENT (PRO h1 HE1)] [ ])
(GIVES1 l1 [AGENT ] [THEME ])
( [AGENT ])

5. (medium) Specify a quasi-logical form for the following sentences. If the sentence is ambiguous, make sure you represent all the possibilities, either using ambiguous logical forms or by listing several logical forms.

George ate a pizza at every road stop.
Several employees from every company bought a pizza.
We saw John in the park by the beach.

6. (medium) Specify the roles for each NP in the following sentences. Give a plausible logical form for each sentence.

We returned the ring to the store.
We returned to the party.
The owner received a ticket.

7. (medium) a. Specify a formal semantic model that makes each of the following logical forms true.

(IN1 JOHN1 ROOM1)
(BOY1 JOHN1)
(EVERY b1 : BOY1 (EAT b1 PIZZA1))

Specifically, define a set of objects for the domain of the model, and specify the interpretation function that maps each of these symbols into an element or set in the domain. Demonstrate that each of these logical forms is assigned true by specifying their interpretations.

b. Does your model also assign true to

(EAT JOHN1 PIZZA1)

Could you build a model that does the opposite (assigns false if yours assigns true, or vice versa)? If so, give the model. If not, why not?

8. (medium) For each of the following lists of sentences, state whether the first sentence entails or implies each of the sentences that follow it, or whether there is no semantic relationship among them. Justify your answers using some linguistic tests.

a. John didn't manage to find the key.
   John didn't find the key.
   John looked for the key.
   The key is hard to find.

b. John was disappointed that Fido was last in the dog show.
   Fido was last in the dog show.
   Fido was entered in the dog show.
   John wanted Fido to win.
   Fido is a stupid dog.

CHAPTER 9
Linking Syntax and Semantics

9.1 Semantic Interpretation and Compositionality
9.2 A Simple Grammar and Lexicon with Semantic Interpretation
9.3 Prepositional Phrases and Verb Phrases
9.4 Lexicalized Semantic Interpretation and Semantic Roles
9.5 Handling Simple Questions
9.6 Semantic Interpretation Using Feature Unification
° 9.7 Generating Sentences from Logical Form

This chapter discusses a particular method for linking logical forms with syntactic structures. This will allow logical forms to be computed while parsing, a process called semantic interpretation. One version discussed also allows a syntactic tree to be generated from a specified logical form, a process called semantic realization. To fully couple syntax and semantics, there must be a well-formed meaning expression for every constituent. The relation between the meaning of a constituent and the meanings of its subconstituents can then be specified in the grammar using features. Because each syntactic rule has a corresponding semantic interpretation rule, this method is often referred to as a rule-by-rule style of semantic interpretation. Section 9.1 discusses the notion of compositionality and introduces the lambda calculus as a tool for building compositional theories. Sections 9.2 and 9.3 then examine some basic constructs in language and develop a grammar for a small fragment of English that computes the logical form of each constituent as it is parsed.
The logical form used in Sections 9.2 and 9.3 is a predicate argument structure. Section 9.4 shows how to generate a logical form using semantic roles and briefly discusses the need for hierarchical lexicons to reduce the amount of work required to specify the meanings of lexical items. Section 9.5 discusses how semantic interpretation relates to gaps and shows how to handle simple questions. Section 9.6 develops an alternative method of computing logical forms that uses additional features rather than lambda expressions, thus allowing us to express reversible grammars. Optional Section 9.7 discusses semantic realization, showing how to generate a sentence given a logical form and a reversible grammar. 9.1 Semantic Interpretation and Compositionality One of the principal assumptions often made about semantic interpretation is that it is a compositional process. This means that the meaning of a constituent is derived solely from the meanings of its subconstituents. Compositional theories have some attractive properties. In particular, interpretations can be built incrementally from the interpretations of subphrases. The context-free grammar model of syntax, for example, is a compositional theory of syntax. Rules apply based on the category of the subconstituents, without concern for their internal structure. The rule S -> NP VP, for instance, applies no matter what particular form of NP arises. By simply adding another NP rule to the grammar, say NP -> PRO, a whole new class of sentences becomes acceptable as well, namely any sentences with a pronoun in an acceptable NP position. This is a very attractive property that we want to have for semantic interpretation. In linguistics compositionality is often defined in terms of a strict criterion: One subconstituent must have as its meaning a function that maps the meanings of the other subconstituents into the meaning of the new constituent. In computational approaches, the requirement is often more relaxed and requires an incremental process of building meanings constituent by constituent, where the 264 CHAPTER 9 Linking Syntax and Semantics 265 meaning of a constituent is produced by some well-defined computational func-r tion (that is, a program), that uses the meanings of the subconstituents as inputs. Compositional models tend to make grammars easier to extend and main -: tain. But developing a compositional theory of semantic interpretation is harder than it looks. First, there seems to be a structural inconsistency between syntactic structure and the structure of the logical forms. For instance, a classic problem arises with quantified sentences. Consider the sentence Jill loves every dog. The syntactic structure of this sentence clusters the words into phrases in the obvious way: ((Jill) (loves (every dog))). But the unambiguous logical form of the sentence in a predicate-argument form would be something like (EVERY d : (DOG1 d) (LOVES111 (NAME jl "Jill") d)) This asserts that for every dog d there is an event II that consists of Jill loving d. There seems to be no simple one-to-one correspondence between parts of the logical form and the constituents in syntactic analysis. The phrase every dog, for instance, is a subconstituent of the VP loves every dog. Its semantic interpretation, however—the generalized quantified phrase (EVERY d : (DOG1 d)...)— seems to have the meaning of the verb phrase as a part of it. 
Worse than that, the interpretation of every dog seems to be split—it produces both the quantifier structure outside the predicate as well as an argument to the predicate. As a result, it is hard to see how the meaning of every dog could be represented in isolation and then used to construct the meaning of the sentence. This is one of the most difficult problems to be dealt with if we are to have a compositional theory. Note that introducing the unscoped logical form constructs provides one way around the problem. If we define the goal of semantic interpretation as producing an unscoped logical form, the unscoped version of this sentence would be (LOVES111 (NAME jl "Jill") ) which is much closer in structure to the syntactic form. Another challenge to compositional theories is the presence of idioms. For instance, you may say Jack kicked the bucket, meaning that Jack died. This interpretation seems to have no relation to the meanings of the verb kick and nothing to do with buckets. Thus the meaning of the sentence does not seem to be constructed out of the meaning of its subconstituents. One way to handle such cases is to allow semantic meanings to be assigned to entire phrases rather than build the meaning compositionally. So far, we have assumed that the primitive unit is the word (or morpheme). Idiomatic expressions suggest that this might be generalized so that complete phrases may have a primitive (that is, nonderived) meaning. In the previous example the verb phrase kick the bucket has a primitive meaning similar to the verb die. This approach is supported by the observation that certain syntactic paraphrases that are fine for sentences given their compositional meaning do not apply for idiomatic readings. For instance, the passive sentence The bucket was kicked by Jack could not mean that Jack died. ,1 Interestingly, Jack kicked the bucket is ambiguous. It would have.one meaning constructed compositionally from the meanings of each of the words, say (KICK1 kl (NAME jl "Jack") ), and another constructed from the meaning of the word Jack and the primitive meaning of the phrase kick the bucket, say (DIE1 dl (NAME jl "Jack")). Another approach to this problem is to introduce new senses of words that appear in idioms. For example, there might be a sense of kick that means DIE1, and that subcategorizes for an object of type BUCKET1. While idioms are a very interesting and important aspect of language, there will not be the space to deal with them in the next few chapters. For the purposes of this book, you can assume that the primitive meanings are always associated with the word. If the process of semantic interpretation is compositional, then you must be able to assign a semantic structure to any syntactic constituent. For example, you must be able to assign some uniform form of meaning to every verb phrase that can be used in any rule involving a VP as a subconstituent. Consider the simplest case, where the VP consists of an intransitive verb, as in a sentence such as Jack laughed. One suggestion is that the meaning of the verb phrase laughed is a unary predicate that is true of any object that laughed in the past. Does this approach generalize? In other words, could every VP have a meaning that is a unary predicate? Consider the sentence Jack kissed Sue, with the logical form (KISS1 kl (NAME jl "Jack") (NAME si "Sue")) What is the meaning of the VP kissed Sue ? Again, it could be a unary predicate that is true of any object that kissed Sue. 
But so far we have no way to express such complex unary predicates. The lambda calculus provides a formalism for this. In particular, the expression

(λ x (KISS1 k1 x (NAME s1 "Sue")))

is a predicate that takes one argument. You can view x as a parameter, and this predicate is true of any object O such that substituting O for x in the expression results in a true proposition. Like any other predicate, you can construct a proposition from a lambda expression and an argument. In the logical form language, the following is a proposition:

((λ x (KISS1 k1 x (NAME s1 "Sue"))) (NAME j1 "Jack"))

This proposition is true if and only if (NAME j1 "Jack") satisfies the predicate (λ x (KISS1 k1 x (NAME s1 "Sue"))), which by definition is true if and only if (KISS1 k1 (NAME j1 "Jack") (NAME s1 "Sue")) is true. We will often say that this last expression was obtained by applying the lambda expression (λ x (KISS1 k1 x (NAME s1 "Sue"))) to the argument (NAME j1 "Jack"). This operation is called lambda reduction.

BOX 9.1 The Lambda Calculus and Lambda Reduction

The lambda calculus is a powerful language based on a simple set of primitives. Formulas in the lambda calculus consist of equality assertions of the form expression1 = expression2. The most crucial axiom in this system for our purposes is

((λ x Px) a) = P{x/a}

where Px is an arbitrary formula involving x and P{x/a} is the formula where every instance of x is replaced by a. From this axiom, two principal operations can be defined: lambda reduction (moving from left to right across the axiom) and lambda abstraction (moving from right to left across the axiom). In general, lambda reduction is the principal concern to us because it tends to make formulas simpler. In fact, because lambda reduction simply replaces one formula with a simpler one that is equal to it, the operation is not formally necessary at all to account for semantic interpretation. Without using lambda reduction, however, the answers, though correct, would tend to be unreadable.

Given that we have had to introduce new concepts such as lambda expressions in order to establish a close syntactic-semantic coupling, you might be tempted to abandon this approach and develop some other method of semantic interpretation. As a grammar becomes larger and handles more complex phenomena, however, the compositional theory becomes more attractive. For instance, using this method, verb phrases can easily be conjoined even when they have different syntactic structures, as in the sentence Sue laughs and opens the door. There are two VPs here: laughs, which has a semantic interpretation as a unary predicate true of someone who laughs, say (λ a (LAUGHS1 l2 a)); and opens the door, a unary predicate true of someone who opens the door, namely

(λ a (OPENS1 o1 a <THE d1 DOOR1>))

These two unary predicates can be combined to form a complex unary predicate that is true of someone who both laughs and opens the door, namely,

(λ a (& (LAUGHS1 l2 a) (OPENS1 o1 a <THE d1 DOOR1>)))

This is in exactly the right form for a VP and can be combined with other constituents like any other VP. For instance, it can be applied to a subject NP with logical form (NAME s1 "Sue") to form the meaning of the original sentence:

(& (LAUGHS1 l2 (NAME s1 "Sue")) (OPENS1 o1 (NAME s1 "Sue") <THE d1 DOOR1>))

Consider another example. Prepositional phrase modifiers in noun phrases could be handled in many different ways. For instance, we might not have an independent meaning for the phrase in the store in the noun phrase The man in the store.
Rather, a special mechanism might be used to look for location modifiers in noun phrases and incorporate them into the interpretation. But this mechanism would then not help in interpreting sentences like The man is in the store or The man was thought to be in the store. If the prepositional phrase has an independent meaning, in this case the unary predicate (k o (IN-LOCl'o )) then this same interpretation can be used just as easily as a modifier to a noun phrase (adding a new restriction) or as the predicate of a sentence. The logical form of the noun phrase the man in the store would be )> while the logical form of the sentence The man is in the store would be (IN-LOC1 ) These are just two simple examples. There are many other generalities that also arise if you adopt a compositional approach to semantics. In general, each major syntactic phrase corresponds to a particular semantic construction. VPs and PPs map to unary predicates (possible complex expressions built out of lambda expressions), sentences map to propositions, and NPs map to terms. The minor categories map to expressions that define their role in building the major categories. Since every constituent in the same syntactic category maps to the same sort of semantic construct, these can all be treated uniformly. For example, you don't need to know the specific structure of a VP. As long as its meaning is a unary predicate, you can use it to build the meaning of another larger constituent that contains it. 9.2 A Simple Grammar and Lexicon with Semantic Interpretation This section constructs a simple grammar and lexicon to illustrate how the logical form can be computed using the features while parsing. To keep the examples simple, the logical form will be the predicate-argument structure used in the last section rather than the thematic role representation. This will allow all verbs with the same subcategorization structure to be treated identically. Section 9.4 will then discuss methods of generalizing the framework to identify thematic roles. The main extension needed is to add a SEM feature to each lexical entry and grammatical rule. For example, one rule in the grammar might be (S SEM (?semvp ?semnp)) -> (NP SEM ?semnp) (VP SEM ?semvp) Consider what this rule does given the NP subconstituent with SEM (NAME ml "Mary") and the VP subconstituent with SEM (k a (SEES1 e8 a (NAME jl "Jack"))) 268 chapter 9 S SEM (SEES 1 e8 (NAME ml "Mary") (NAME jl "Jack")) NP SEM (NAME ml "Mary") VP SEM (X a (SEES 1 e8 a NAME jl "Jack")) V SEM SEES1 NP SEM (NAME ml "Jack") Figure 9.1 a parse tree showing the sem features The SEM feature of the new S constituent is simply the expression ((X a (SEES1 e8 a (NAME jl "Jack"))) (NAME ml "Mary")) This expression can be simplified using lambda reduction to the formula (SEES1 e8 (NAME ml "Mary") (NAME jl "Jack")) which is the desired logical form for the sentence. Figure 9.1 shows the parse tree for this sentence giving the SEM feature for each constituent. In the lexicon the SEM feature is used to indicate the possible senses of each word. Generally, a word will have a different word sense for every possible subcategorization it has, since these will be different arity predicates. A sample lexicon is shown in Figure 9.2. When a word has a different SEM form depending on its syntactic features, multiple entries are required. For example, the verb decide has two entries: one for the case where the SUBCAT is _none, and one for the case where the SUBCAT is _pp:on, where the verb has an additional argument. 
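The Mary sees Jack derivation just given can be mimicked in a few lines of Python, with a VP meaning represented as a one-place function standing in for its lambda expression, so that the S rule's SEM equation is simply function application followed by the usual reduction. The tuple encoding and the helper names below are illustrative assumptions, not the book's implementation.

# NP meanings are terms; VP meanings are one-place functions over the
# subject meaning, standing in for lambda expressions.

def np_name(var, name):
    return ("NAME", var, name)                    # e.g. (NAME m1 "Mary")

def vp_transitive(sem_v, ev, sem_obj):
    """A transitive VP: a unary predicate that will apply to the subject."""
    return lambda subj: (sem_v, ev, subj, sem_obj)

def s_rule(sem_np, sem_vp):
    """The S rule above: apply the VP's SEM to the subject NP's SEM."""
    return sem_vp(sem_np)                         # lambda reduction happens here

mary = np_name("m1", "Mary")
jack = np_name("j1", "Jack")
sees_jack = vp_transitive("SEES1", "e8", jack)    # the VP "sees Jack"

print(s_rule(mary, sees_jack))
# -> ('SEES1', 'e8', ('NAME', 'm1', 'Mary'), ('NAME', 'j1', 'Jack'))

This mirrors the simplification of ((λ a (SEES1 e8 a (NAME j1 "Jack"))) (NAME m1 "Mary")) to (SEES1 e8 (NAME m1 "Mary") (NAME j1 "Jack")) shown above.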
Note also that the word fish also has two entries because its SEM depends on whether it is singular or plural. Consider Grammar 9.3, which accepts very simple sentence and verb phrases and computes their logical form. Note that another feature is introduced in addition to the SEM feature. The VAR feature is new and stores the discourse variable that corresponds to the constituent. It will be useful for handling certain forms of modifiers in die development that follows. The VAR feature is automatically generated by the parser when a lexical constituent is constructed from a word, and then it is passed up the tree by treating VAR as a head feature. It guarantees that the discourse variables are always unique. The lexical rules for morphological derivation imi.-t aKo be modified to handle the SEM feature. For instance, the rule that converts a singular noun into a plural noun takes the SEM of the singular noun and adds the PLUR operators: (N AGR 3p SEM (PLUR ?semn)) -> (N AGR 3s IRREG-PL - SEM ?semn) +S Linking Syntax and Semantics 269 a (art AGR 3s SEM INDEF1) can (aux SUBCAT base SEMCAN1) car (n SEM CAR1 AGR 3s) cry (v SEM CRY1 VFORM base SUBCAT .none) decide (v SEM DECIDES 1 VFORM base SUBCAT _none) decide (v SEM DECIDES-ON1 VFORM base SUBCAT _pp:on) dog (n SEM DOG1 AGR 3s) fish (n SEM FISH 1 AGR 3s) fish (n SEM (PLUR FISH1) AGR 3p) house (n SEM HOUSE! AGR 3s) has (aux VFORM pres AGR 3s SUBCAT pastprt SEM PERF) he (pro SEM HE1 AGR 3s) in (p PFORM {LOC MOT) SEM IN-LOC1) Jill (name AGR 3s SEM "Jill") man (n SEM MAN1 AGR 3s) men (n SEM (PLUR MAN1) AGR 3p) on (p PFORM LOC SEM ON-LOC1) saw (v SEM SEES1 VFORM past SUBCAT _np AGR ?a) see (v SEM SEES1 VFORM base SUBCAT _np IRREG-PAST + EN-PASTPRT +) she (pro AGR 3s SEM SHE1) the (art SEM THE AGR {3s 3p}) to (to AGR - VFORM inf) Figure 9.2 a small lexicon showing the sem features 1. (S SEM (?semvp ?semnp) -» (NP SEM ?semnp) (VP SEM ?semvp) 2 (VP VAR ?v SEM (X a2 (?semv ?v a2))) -> (V[_none] SEM ?semv) 3. (VP VAR ?v SEM (X a3 (?semv ?v a3 ?semnp))) -> (V[_np]SEM ?semv) (NP SEM ?semnp) 4. (NP WH - VAR 7v SEM (PRO ?v ?sempro)) -> (PRO SEM ?sempro) 5. (NP VAR ?v SEM (NAME ?v ?semname)) -> (NAME SEM ?semname) 6. (NP VAR ?v SEM ) -> (ART SEM ?semart) (CNP SEM ?semcnp) 7. (CNP SEM ?semn) -* (N SEM ?semn) Head features for S, VP, NP, CNP: VAR Grammar 93 a simple grammar with sem features" A similar technique is used to insert unscoped tense operators for the past and present tenses. The revised morphological rules are shown in Grammar 9.4. 270 CHAPTER 9 Linking Syntax and Semantics 271 Ll. L2. L3. L4. L5. L6. L7. (V VFORM pres AGR 3s SEM ) -> (V VFORM base IRREG-PRES - SEM ?semv) +S (V VFORM pres AGR {Is 2s lp 2p 3pj SEM ) -> (V VFORM base IRREG-PRES-SEM ?semv) (V VFORM past AGR {Is 2s 3s lp 2p 3p} SEM ) - (V VFORM base IRREG-PAST - SEM ?semv) +ED (V VFORM pastprt SEM ?semv) (V VFORM base EN-PASTPRT - SEM ?semv) +ED (V WORM pastprt SEM ?semv) (V VFORM base EN-PASTPRT + SEM ?semv) +EN (V VFORM ing SEM ?semv) (V VFORM base SEM ?semv) +ING (N AGR 3p SEM (PLUR ?semn)) -> (V AGR 3s IRREG-PL - SEM ?semn) +S Grammar 9.4 The morphological rules with semantic interpretation These are the same as the original rules given as Grammar 4.5 except for the addition of the SEM feature. Rule 1 was previously discussed. Rules 2 and 3 handle transitive and intransitive verbs and build the appropriate VP interpretation. Each takes the SEM of the verb, ?semv, and constructs a unary predicate that will apply to the subject. 
The arguments to the verb sense include an event variable, which is stored in the VAR feature, the subject, and then any additional arguments for subcategorized constituents. Rule 4 constructs the appropriate SEM structure for pronouns given a pronoun sense ?sempro, and rule 5 does the same for proper names. Rule 6 defines an expression that involves an unscoped quantified expression, which consists of the quantifier ?semart, the discourse variable ?v, and a proposition restricting the quantifier, which is constructed by applying the unary predicate ?semcnp to the discourse variable. For example, assuming that the discourse variable ?v is ml, the NP the man would combine the SEM of the, namely the operator THE, with the SEM of man, namely MAN1, to form the expression . Rule 7 builds a simple CNP out of a single N. Since the SEM of a common noun is a unary predicate already, this value simply is used as the SEM of the CNP, Only two Simple modifications need to be made to the standard chart parser to handle semantic interpretation: • When a lexical rule is instantiated for use, the VAR feature is set to a new discourse variable. • Whenever a constituent is built, the SEM is simplified by performing any lambda reductions that are possible. S SEM ( evl (NAME jl "Jill") ) VARevl NP SEM (NAME jl "Jill") VARjl VP SEM 0.x «PAST SEES1> evl x VAR dl VSEM NAME SEM "Jill" VAR evl VAR jl ART SEM THE the CNP1 SEM DOG1 VAR dl dog Figure 9.5 The parse of Jill saw the dog showing the SEM and VAR features With these two changes, the existing parser now will compute the logical form as it parses. Consider a trace of the parser on Jill saw the dog, whose parse tree is shown in Figure 9.5. The word Jill is parsed as a name using the lexical lookup. A new discourse variable, jl, is generated and set to the VAR feature. This constituent is used by rule 5 to build an NP. Since VAR is a head feature, the VAR feature of the NP will be jl as well, and the SEM is constructed from the equation in the obvious manner. The lexical entry for the word saw generates a V constituent with the SEM and a new VAR evl. Rule 6 combines the SEMs of the two entries for the and dog with the VAR from the noun to build an NP with the SEM . This is then combined with the SEM of the verb and its VAR by rule 3 to form a VP with the SEM (k x ( evl x (AUX SUBCAT ?v SEM ?semaux) (VP VFORM ?v SEM ?semvp) This rule inserts a modal operator in the appropriate place for the new VP., The SEM equation for the new VP is a little complicated, so consider it in moref detail. If ?semaux is a modal operator such as CAN1, and ?semvp is a lambda* expression such as (X x (LAUGHS 1 e3 x)), then, according to the auxiliary rule;; the SEM of the VP can laugh will be (X al (CAN1 ({X x (LAUGHS 1 e3 x)) al)) This can be simplified to (X al (CAN1 (LAUGHS 1 e3 al))) It might help to view this type of SEM equation as "lifting" the variable for the subject over the CAN1 operator. Starting with the VP interpretation (X x (LAUGHS 1 e3 x)), the equation builds a new formula containing the CAN1 operator yet still retaining the lambda variable for the subject on the outside of the formula. Note that, like all VPs, the new SEM is a unary predicate that applies to the subject, so the auxiliary rule could be used recursively to analyze more complex sequences of auxiliary verbs. To analyze prepositional phrases, it is important to realize that they play , two different semantic roles in sentences. In one analysis the PP is a modifier to as noun phrase or verb phrase. 
In the other use the PP is subcategorized for by a head word, and the preposition acts more as a flag for an argument position than as an independent predicate. Consider the modification case first. In these cases the SEM of the PP is a-unary predicate to be applied to whatever it eventually modifies. Thus the following rule would be appropriate for building a PP modifier: (PP SEM (X y (?semp y ?semnp))) h> (P SEM ?semp) (NP SEM ?semnp) Given the PP in the corner, if the SEM of the P is IN-LOC1, and the SEM of the NP is , then the SEM of the PP would be the unary predicate (X y (IN-LOC1 y )) Now you can consider the interpretation of the noun phrase the man in the: corner. A reasonable rule that incorporates the PP modifier would be (CNP SEM (X nl (& (?semcnp nl) (?sempp nl)))) (CNP SEM ?semcnp) (PP SEM ?sempp) Given that the SEM of the CNP man is the unary predicate MAN1 and the SEM of the PP in the corner is (X y (INI y )), the new SEM of the CNP, before simplification, would be (X nl (& (MAN1 nl) ((X y (INI y )) nl))) The subexpression ((X y (INI y )) nl) can then be simplified to (INI nl ). Thus the overall expression becomes (X nl (& (MAN1 nl) (INI nl ))) This is a unary predicate true of any man who is in the comer, the desired interpretation. Combining this with a quantifier such as the using rule 6 would form a SEM such as ))) m2)> which itself simplifies to ))> PPs can also modify verb phrases, as in the VP cry in the corner in Jill can cry in the corner. The syntactic rule that introduces the PP modifier would be VP -> VP PP You might think that the SEM feature equation can be treated in exactly a parallel way as with the noun phrase case, but there is a complication. Consider the desired behavior. The VP subconstituent cry has the logical form (X x (CRIES 1 el x)) and the PP has the logical form shown above. The desired logical form for the entire VP cry in the corner is (X a (& (CRIES1 el a) (IN-LOC1 el ))) The problem is that the unary predicate for the VP subconstituent is intended to apply to the subject, where the unary predicate for the PP modifier must apply to the discourse variable generated for the VP. So the technique used for rule 9 does not yield the correct answer. Rather, the SEM constructed from the PP must be applied to the discourse variable instead. In other words, the appropriate rule is (VP VAR ?v SEM (A, x (& (?semvp x) (?sempp ?v)))) (VP VAR ?v SEM ?semvp) (PP SEM ?sempp) The parse tree for the VP cry in the corner using this rule is shown in Figure 9.6. Prepositional phrases also appear as subcategorized constituents in verb phrases, and these cases must be treated differently. Specifically, it is the verb that determines how the prepositional phrase is interpreted. For example, the PP on a couch in isolation might indicate the location of some object or event, but with the verb decide, it can indicate the object that is being decided about. The 274 CHAPTER 9 Linking Syntax and Semantics 275 VP SEM (X x (& (CRIES el x) (IN-loc1 el ))) VAR el V SEM (^x (CRIES 1 elx)) VAR el PSEM VP SEM (X y (IN-LOC1 y )) NP SEM VARcl cry ART SEM THE N SEM CORNER1 VAR cl the Figure 9.6 Using the VAR feature for PP modifiers of VPs distinction between the two readings can be illustrated by considering that the sentence Jill decided on a couch is ambiguous between two readings: • Jill made a decision while she was on a couch. • Jill made a decision about a couch. The first case results from treating on a couch as an adverbial PP, which was discussed above. 
What is the semantic interpretation equation for the second case? The appropriate syntactic rule is VP -» V[_pp:on] NP PP[on] and the desired logical form of the final VP is (I s (DECIDES-ON1 dl s )) Note that there appears to be no semantic contribution from the word on in this case. Subcategorized PPs are handled in many systems by distinguishing between two different types of prepositional phrases. A new binary feature PRED is introduced, where + indicates that the prepositional phrase should be interpreted as a predicate, and - means it should be interpreted as an argument, that is, a term. This binary distinction allows us to specify two rules for prepositional phrases. These are rules 8 and 9 in Grammar 9.7. Rule 8 is restricted to prepositional phrases with a +PRED value, and the PP acts as a modifier. Rule 9 handles all the subcategorized prepositional phrases, with the -PRED feature, in which case the SEM of the PP is simply the SEM of the object NP. Grammar 9.7 also summarizes all me other rules developed in this section. Figure 9.8 shows the two readings of the VP decide on a couch given the grammar defined by Grammars 9.3 and 9.7. The ease in which the decision ite 8. (PP PRED + SEM (X x (?semp x ?semnp))) -> (P SEM ?semp) (NP SEM ?semnp) 9. (PP PRED - PFORM ?pf SEM ?semnp) -» (P ROOT ?pf) (NP SEM ?semnp) 10. (VP VAR*?v SEM (X agl (& (?semvp agl) (?sempp ?v)))) -> (VP SEM ?semvp) (PP PRED+ SEM ?sempp) 11. (VP VAR ?v SEM (X ag2 (?sem v ? v ag2 ?sempp))) -> (V[_np_pp:on SEM Tsemv) (PP PRED - PFORM on SEM ?sempp) 12. (VP SEM (X al (?semaux (?semvp al)))) -> (AUX SUBCAT ?v SEM ?semaux) (VP VFORM ?v SEM ?semvp) 13. (CNP SEM (X nl (& (?semcnp n 1) (?sempp nl)))) -> . (CNP SEM ?semcnp) (PP PRED + SEM ?sempp) Head features for PP: PFORM Head features for VP, CNP: VAR Grammar 9.7 Rules to handle PPs in verb phrases about a couch is shown in the upper half of the figure: the PP has the feature -PRED, as previously discussed, and its SEM is )). 9.4 Lexicalized Semantic Interpretation and Semantic Roles So far, the semantic forms of the lexical entries have only consisted of the possible senses for each word, and all the complexity of the semantic interpretation is encoded in the grammatical rules. While this is a reasonable strategy, many researchers use a different approach, in which the lexical entries encode the complexities and the grammatical rules are simpler. Consider the verb decide, say in its intransitive sense, DECIDES 1. The SEM of the verb is simply the sense DECIDES 1, and rule 2 builds the lambda expression (X y (DECIDES 1 el y)). An alternative approach would be to define the SEM of the lexical entry to be (X y (DECIDES 1 el y)), and then the SEM equation for rule 2 would just use the SEM of the verb as its SEM. Likewise, the SEM for the lexical entry for the transitive sense could be the expression (Xo(Xy (DECIDES-ON1 el y o))). The SEM equation for rule 3 would then apply this predicate to the SEM of the object to obtain the appropriate SEM, as before. There is a tradeoff here between the complexity of the grammatical rules and the complexity of the lexical entries. With the grammar we have developed so far, it seems better to stay with the simple lexicon. But if the logical forms become more complex, the alternative approach begins to look attractive. 
As an 276 CHAPTER 9 Linking Syntax and Semantics 277 VP SEM X a (DECIDES-ON1 el a ) V SEM DECIDES-ON1 VAR el PPSEM PRED -PFORM on decide PSEM ON-LOC1 NP SEM a couch VP SEM X a (& (DECIDES1 el a) (ON-LOC1 el )) VAR el VP SEM (Xy (DECIDES1 el y)) pp SEM X x (ON-LOC1 x ) VAR el PRED + V SEM DECIDES 1 VAR el decide PSEM ON-LOC1 NP SEM A a couch Figure 9.8 Two possible parse trees for the VP decide on a couch example, consider how to specify a grammar that produces a logical form based on thematic roles. Consider first what happens if the lexicon can only store atomic word senses. Whereas the earlier grammar had only one rule that could cover all transitive verbs, the new grammar would have to classify transitive verbs by the thematic roles they use, and a separate rule must be used for each. The verbs see and eat, for example, both have transitive forms where the subject fills the AGENT role and the object fills the THEME role. The verb break, on the other hand, also has a sense where the subject fills the INSTR role and the object fills the THEME role, as in The hammer broke the window. To handle these two cases, a new feature, say ROLES, would have to be added to the lexical entries that identifies the appropriate forms, and then a separate grammar rule added for_-each. These might look as follows: (VP VAR ?v SEM (A, a (?semv ?v [AGENT a] [THEME ?semnp]))) -> (V ROLES AG-THEME SEM ?semv) (NP SEM ?semnp) (VP VAR ?v SEM (X a (?semv ?v [INSTR a] [THEME ?semnp]))) ->• (V ROLES INSTR-THEME SEM ?semv) (NP SEM ?semnp) Additional rules would be added for all the other possible combinations of roles that can be used by the verbs. Clearly, this approach could be cumbersome. Since it requires adding thematic role information to the lexicon anyway (using the ROLES feature), it might be simpler just to encode the appropriate forms in the lexicon. For instance, if the lexical entries are see: (V VAR ?v SEM (X o (X a (SEES 1 ?v [AGENT a] [THEME ?o]).») break: (V VAR ?v SEM (Xo(Xa (BREAKS 1 ?v [INSTR a] [THEME ?o])))) then a single grammar rule, namely, (VP SEM (?semv ?semnp)) -> (V SEM ?semv) (NP SEM ?semnp) will cover all the cases. Consider the VP see the book, where the SEM of see is as above and the book is . The SEM for the VP would be {(Xo (X a (SEES1 bl [AGENT a] [THEME o]))) ) which can be simplified using one lambda reduction to (X a (SEES1 bl [AGENT a] [THEME ])) The same rule given the VP break the book would apply the SEM of break above to the SEM for the book to produce the reduced logical form (X a (BREAKS 1 bl [INSTR a] [THEME ])) ° Hierarchical Lexicons The problem with making the lexicon more complex is that there are many words, and specifying a lexicon is difficult even when the entries are simple. Just specifying the semantic interpretation rules for the most common sense is tedious, as there is a different semantic interpretation rule for each complement structure the verb allows. For example, the verb give allows the forms I gave the money. I gave John the money. I gave the money to John. The lexical entries for give would include the following: (V SUBCAT _np SEM XoX a (GIVE1 * [AGENT aj [THEME o])) 278 CHAPTER 9 Linking Syntax and Semantics 279 (V SUBCAT_np_np SEM X r X o X a (GIVE1 * [AGENT a] [THEME o] [TO-POSS r])) (V SUBCAT _np_pp:to ' SEM X oXrX a (GIVE1 * [AGENT a] [THEME o] [TO-POSS r|)) This is quite a burden if it must be repeated for every verb. 
Luckily, we can-do much better by exploiting some general regularities across verbs in English^ For example, there is a very large class of verbs, including most transitive verbs, that all use the same semantic interpretation rule for the _np SUBCAT form. This' class includes verbs such as give, take, see, find, paint, and so on—virtually all verbs that describe actions. The idea of a hierarchical lexicon is to organize verb' senses in such a way that their shared properties can be captured concisely. This depends on a technique called inheritance, where word senses inherit or acquire the properties of the abstract classes above them in the hierarchy. For example, a very useful lexical hierarchy can be based on the SUBCAT and SEM properties of verbs. Near the top of the hierarchy are abstract verb senses that define common verb classes. For example, the abstract class INTRANS-ACT defines a class of verbs that allow a SUBCAT _none and have the semantic interpretation rule X S (7PREDN * [AGENT s]) where 7PREDN is a predicate name determined by the verb. This fully specifies the semantic interpretation of intransitive verbs such as run, laugh, sit, and so on, except for the actual predicate name that still needs to be specified in the lexical entry for the word. Another common form is the simple transitive verb that describes an action, including the verbs listed above. This form, TRANS-ACT, has a SUBCAT _np and a SEM X o X a (7PREDN * [AGENT a] [THEME o]). We can define similar classes for all the common verb forms and then build an inheritance structure that relates verb senses to the forms they allow. Figure 9.9 shows a lexical hierarchy that encodes the definitions of four different verb senses. It is equivalent to the following entries specified without a hierarchy: run (in the intransitive exercise sense, RUN1): (SUBCAT _none SEM X a (RUN1 * [AGENT a])) run (in the transitive "operate" sense, OP1): (SUBCAT _np SEM X o X a (OP1 * [AGENT a] [THEME o])) donate (allowing the transitive and "to" form): (SUBCAT _np SEM XoX&(DONATE 1 * [AGENT a] [THEME o])) (SUBCAT _np_pp:to SEM X o X r X a (DONATE 1 * [AGENT a] [THEME o] [TO-POSS r])) INTRANS-ACT > SUBCAT jione SEM Xs (7PREDN * [AGENT s]); TRANS-ACT SUBCAT _np SEM X o A-a (7PREDN * [AGENT a] [THEME o]j, /'TO-ACT "> SUBCAT _np_pp:to SEM XoXr Xa (7PREDN * [AGENT a] [THEME o] [TO-POSS r]). V ROOT run PREDN OP1 V ROOT run PREDN RUN1 BITRANS-TO-ACT SUBCAT _np_np SEM X r X o Xa (7PREDN * [AGENT a] _ [THEME o] [TO-POSS r]J/ V ROOT give PREDN GIVE1 V ROOT donate PREDN DONATE1 Figure 9.9 Part of a lexical hierarchy for verbs based on SUBCAT and SEM features and, of course, give (in all its forms as previously discussed): (SUBCAT _np SEM XoXa. (GIVE1 * [AGENT a] [THEME o])) (SUBCAT _np_pp:to SEM XoXiXa (GIVE1 * [AGENT a] [THEME o] [TO-POSS r])) (SUBCAT _np_np SEM XrXoXa (GIVE1 * [AGENT a] [THEME o] [TO-POSS r])) You could implement a lexical hierarchy by adding another feature to each lexical entry called SUP (for superclass), which has as its value a list of abstract categories from which the constituent inherits properties. It is then relatively simple to write a program that searches up this hierarchy to find all the relevant feature values whenever the lexicon is accessed. The entry for the verb give might now look like this: give: (V ROOT give PREDN GIVE1 SUP (BITRANS-TO-ACT TRANS-ACT)) 280 CHAPTER 9 Linking Syntax and Semantics 281 14. 15. 16. 
14. (S INV - SEM (WH-query ?sems)) ->
      (NP WH Q AGR ?a SEM ?semnp)
      (S INV + SEM ?sems GAP (NP AGR ?a SEM ?semnp))
15. (S INV + GAP ?g SEM (?semaux (?semvp ?semnp))) ->
      (AUX AGR ?a SUBCAT ?s SEM ?semaux)
      (NP AGR ?a GAP - SEM ?semnp)
      (VP VFORM ?s GAP ?g SEM ?semvp)
16. (NP WH Q VAR ?v SEM <WH ?v ?sempro>) ->
      (PRO WH Q SEM ?sempro)

Grammar 9.10 Rules to handle simple wh-questions

9.5 Handling Simple Questions

The grammar developed so far deals only with simple declarative sentences. To extend the grammar to handle other sentence types requires adding rules to interpret wh-terms, inverted sentences, and the gap propagation required to handle wh-questions. This can be done quite directly with the mechanisms developed so far and requires no additional extensions to the parser. All you need to do is augment the S rules first developed for questions in Chapter 5 with appropriate SEM feature equations. The part that may need some explanation is how the SEM feature interacts with the GAP feature. As you recall, many wh-questions use the GAP feature to build the appropriate structures. Examples are

Who did Jill see?
Who did Jill want to see?
Who did Jill want to win the prize?

The rule introduced in Chapter 5 to account for these questions was

(S INV -) -> (NP WH Q AGR ?a) (S INV + GAP (NP AGR ?a))

That is, a wh-question consists of an NP with the WH feature Q followed by an inverted sentence with an NP missing that agrees with the first NP. To make the semantic interpretation work, we add the SEM feature to the features of the gap. This way it is passed into the S structure and can be used at the appropriate place when the gap is found. The revised rule is

(S INV - SEM (WH-query ?sems)) ->
  (NP WH Q AGR ?a SEM ?semnp)
  (S INV + SEM ?sems GAP (NP AGR ?a SEM ?semnp))

Grammar 9.10 gives the new rules required to handle this type of question. Rule 14 was just discussed. Rule 15 handles inverted sentences and could be used for yes/no questions as well as for the wh-questions discussed here. Rule 16 allows noun phrases to be built from wh-pronouns, such as who and what. The lexical entry for the wh-words would have to be extended with a SEM feature as well. For instance, the word who would have the entry

(PRO WH {Q R} SEM WHO1 AGR {3s 3p})

The predicate WHO1 would be true of any objects that are reasonable answers to such questions, including people and possibly other animate agents.

To see how the SEM feature and GAP features interact, consider a derivation of the parse tree for the question Who did Jill see?, as shown in Figure 9.11.

Figure 9.11 The parse tree for Who did Jill see?

Using rule 16, the word who would be parsed as the noun phrase

(NP WH Q AGR 3s SEM <WH p1 WHO1>)

This constituent can be used to start rule 14, and thus we need the following constituent to complete the rule:

(S INV + GAP (NP AGR 3s SEM <WH p1 WHO1>) SEM ?sems)

Rule 15 can be used to rewrite this, and the words did and Jill provide the first two subconstituents. The third subconstituent needed is

(VP VFORM base GAP (NP AGR 3s SEM <WH p1 WHO1>) SEM ?semvp)

This is a VP with an NP gap. Rule 3 from Grammar 9.3 applies to transitive verbs such as see. The GAP feature is used to fill a gap for the object of the verb, instantiating ?semnp to <WH p1 WHO1>.
Thus the SEM of the new VP is (1 a3 (SEES 1 si a3 )) The SEM of the initial word who has found its way to the appropriate place in the' sentence. The parse can now be completed as usual. This technique of passing in the SEM in the GAP is completely general and can be used to handle all of the question types discussed in Chapter 5. Prepositional Phrase Wh-Questions Questions can also begin with prepositional phrases, such as In which box did you put the book?, Where did you put the book?, and When did he disappear? The semantic interpretation of these questions will depend on whether the PP^ are subcategorized for the verb or are VP modifiers. Many such questions can be handled by a rule virtually identical to rule 14 for NP wh-questions, namely (SEW-SEM(WH-query ?sems)) -> (PP WH Q PRED ?p PTYPE ?pt SEM ?sempp) (SINV + SEM ?sems GAP (PP PRED ?p PTYPE ?pt SEM ?sempp) i To handle wh-terms like where appropriately, the following rule is also needed: (PP PRED ?pd PTYPE ?pt SEM ?sem) -> (PP-WRD PRED ?pd PTYPE ?pt SEM ?sem) The wh-term where would have two lexical entries, one for each PRED value: (PP-WRD PTYPE {LOC MOT} PRED SEM ) ■VAR ?v (PP PRED + VAR ?v SEM (k x (AT-LOC x ))) These rules would extend the existing grammar so that many such questions could be answered. Figure 9.12 shows part of the parse tree for the question Where did Jill go? Note that handling questions starting with +PRED prepositional phrases depends on having a solution to the problem concerning gap propagation first mentioned in Chapter 5. Specifically, the rule VP -> VP PP, treated in the normal way, would only pass the GAP into the VP subconstituent, namely the nonlexical head. Thus there seems no way to create a PP gap that modifies a verb phrase. But this was a problem with the syntactic grammar as well, and any solution at that level should carry over to the semantic level. S4 SEM (WH-QUERY (GOES1 gl (NAME jl "Jill"") )) PPI SEM S3 GAP (PPSEM) PRED - ' SEM (GOES1 gl (NAME jl "Jill") PTYPE MOT // V ) Where did SEM GOES1 VAR gl VP3 GAP (PP SEM ) SEM (X ag(GOESlgl ag ) PPSEM EMPTY + Figure 9.12 The parse tree for Where did Jill go? 9.6 Semantic Interpretation Using Feature Unification So far we have used lambda expressions and lambda reduction to drive the semantic interpretation. This provides a good framework for explaining and comparing techniques for semantic interpretation. However, many systems do not explicitly use lambda expressions and perform semantic interpretation directly using feature values and variables. The basic idea is to introduce new features for the argument positions that earlier would have been filled using lambda reduction. For instance, instead of using rule 1 in Grammar 9.3, namely (S SEM (?semvp ?semnp)) -> (NP SEM ?semnp) (VP SEM ?semvp) a new feature SUBJ is introduced, and the rule becomes (S SEM ?semvp) -> (NP SEM ?semnp) (VP SUBJ ?semnp SEM ?semvp) The SEM of the subject is passed into the VP constituent as the SUBJ feature and the SEM equations for the VP insert the subject in the correct position. The new version of rule 3 in Grammar 9.3 that does this is (VP VAR ?v SUBJ ?semsubj SEM (?semv ?v ?semsubj ?semnp)) (V[_none] SEM ?semv) (NP SEM ?semnp) Figure 9.13 shows how this rule builds the SEM of the sentence Jill saw the dog. Compare this to the analysis built using Grammar 9.3 shown in Figure 9.5. The differences appear in the treatment of the VP. 
Here the SEM is the full proposition with the subject inserted, whereas before the SEM was a lambda expression that would be applied to the subject later in the rule that builds the S.

Figure 9.13 The parse tree for Jill saw the dog using the SUBJ feature

1. (S SEM ?semvp) -> (NP SEM ?semsubj) (VP SUBJ ?semsubj SEM ?semvp)
2. (VP VAR ?v SUBJ ?semsubj SEM (?semv ?v ?semsubj)) -> (V[_none] SEM ?semv)
3. (VP VAR ?v SUBJ ?semsubj SEM (?semv ?v ?semsubj ?semnp)) -> (V[_np] SEM ?semv) (NP SEM ?semnp)
4. (NP VAR ?v SEM (PRO ?v ?sempro)) -> (PRO SEM ?sempro)
5. (NP VAR ?v SEM (NAME ?v ?semname)) -> (NAME SEM ?semname)
6. (NP VAR ?v SEM <?semart ?v ?semcnp>) -> (ART SEM ?semart) (CNP SEM ?semcnp)
7. (CNP VAR ?v SEM (?semn ?v)) -> (N SEM ?semn)
Head features for S, VP, NP, CNP: VAR

Grammar 9.14 A simple grammar with SEM features

Grammar 9.14 is a version of Grammar 9.3 reformulated using this technique. Besides the changes to rules 1, 2, and 3, rules 6 and 7 are also modified. Rule 7 uses the VAR value to build a full proposition (as opposed to a unary predicate in the old grammar), and rule 6 is changed appropriately to account for the change to rule 7.

BOX 9.2 Encoding Semantics Entirely in Features

A more radical departure from the approach described in this chapter encodes the entire logical form as a set of features. For instance, rather than the logical form previously developed, the sentence Jill saw the dog might have a SEM of

(PRED SEES1
 VAR s1
 TNS PAST
 AGENT (NAME j1 "Jill")
 THEME (SPEC THE1
        VAR d1
        RESTRICTION (PRED DOG1 ARG1 d1)))

Logical forms like this can easily be constructed using additional features for each semantic slot. For instance, here are new versions of the first three rules in Grammar 9.14 that compute this type of logical form:

1. (S VFORM past SEM ?semvp) -> (NP SEM ?semnp) (VP TNS past SUBJ ?semnp SEM ?semvp)
2. (VP TNS ?tns SUBJ ?semsubj SEM ?semv) -> (V[_none] VAR ?v SUBJ ?semsubj TNS ?tns SEM ?semv)
3. (VP TNS ?tns SUBJ ?subj SEM ?semv) -> (V[_np] VAR ?v SUBJ ?subj OBJVAL ?obj TNS ?tns SEM ?semv) (NP SEM ?obj)

Note that all the arguments are passed down into the lexical rule, which would assemble the appropriate semantic form. For instance, the lexical entry for saw might be

(V[_np] SUBJ ?s OBJVAL ?o VAR ?v TNS past
  SEM (PRED SEES1 VAR ?v TNS PAST AGENT ?s THEME ?o))

When this entry is unified with the V in rule 3, the appropriate SEM structure would be built and bound to the variable ?semv, as desired. One advantage of this approach is that no special mechanism need be introduced to handle semantic interpretation. In particular, there is no need to have a lambda reduction step. Everything is accomplished by feature unification. Another significant advantage is that a grammar specified in this form is reversible and hence can be used to generate sentences as well as parse them, as discussed in the next section.

Not all lambda expressions can be eliminated using these techniques, however. For instance, to handle conjoined subject phrases as in Sue and Sam saw Jack, the meaning of the verb phrase must still be a lambda expression. If the subject were inserted into the VP using a variable for the SUBJ feature, then this variable would have to unify with the SEMs of both Sue and Sam, which it can't do.
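To make the idea in Box 9.2 concrete, here is a small sketch, not from the book, of how a lexical entry with SUBJ and OBJVAL slots can assemble the whole logical form by simple variable binding, with no lambda reduction step. The functions and dictionary layout stand in for a real feature-structure unifier and are assumptions made purely for illustration.

```python
# Sketch (not the book's code): building a slot-style logical form by binding
# feature variables, in the spirit of Box 9.2. A lexical entry is modeled here as
# a function of the SUBJ and OBJVAL values that would be unified into it.

def saw_entry(subj, obj, var="s1"):
    """Lexical entry for 'saw' (V[_np]): the SEM is assembled from the slots passed in."""
    return {"PRED": "SEES1", "VAR": var, "TNS": "PAST", "AGENT": subj, "THEME": obj}

def vp_np_rule(verb_entry, obj_sem, subj_sem):
    """Rule 3: pass SUBJ and the object NP's SEM down into the lexical entry."""
    return verb_entry(subj_sem, obj_sem)

def s_rule(np_sem, vp_sem_builder):
    """Rule 1: the subject NP's SEM is passed into the VP as its SUBJ feature."""
    return vp_sem_builder(np_sem)

jill = ("NAME", "j1", "Jill")
the_dog = {"SPEC": "THE1", "VAR": "d1", "RESTRICTION": ("PRED", "DOG1", "ARG1", "d1")}

sem = s_rule(jill, lambda subj: vp_np_rule(saw_entry, the_dog, subj))
print(sem)   # the full SEES1 proposition, with agent and theme filled in
```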
286 CHAPTER 9 BOX 9.3 Relative Clauses The syntactic grammar for relative clauses described in Section 5.4 exploits the strong parallels between relative clauses and wh-questions. This similarity continues at the semantic level as well. The logical form of a relative clause will be a unary predicate, and the rule for introducing them is (CNP SEM (X x (& (?semcnp ?x) (?semrel ?x))) -> (CNP SEM ?semcnp) (REL SEM ?semrel) This rule takes two unary predicates (one from the CNP, the other from the REL) and combines them to make a new unary predicate formed by conjoining the two. Now consider one of the rules for the REL category. The linguistic insight needed here is that the wh-term that introduces a relative clause corresponds to the argument of the unary predicate. Thus, in building the SEM for the REL who Jill saw, rather than having as the SEM of who, as in questions, a variable is used. When the embedded sentence is completed, a lambda is wrapped around it to form the SEM of the relative clause, which for this example might be (X y ( (NAME jl "Jill") y)) The wh-term variable is specified in a feature called RVAR in following rule. (REL SEM (X ?v ?sems)) -> (NP WH R RVAR ?v AGR ?a SEM ?semnp) (S [fin] INV - GAP (NP AGR ?a SEM ?semnp) SEM ?sems) The following parse tree fragment shows the derivation of this relative clause: REL SEM (A y ( si (NAME jl "Jill"; y)) NP SEMy RVARy WHR S SEM( si (NAMEjl "Jill") y) GAP (NP SEM y) who NP SEM (NAME jl "Jill") VP SEM (X z ( si z y)) I I GAP (NPSEMy) Jill 9.7 Generating Sentences from Logical Form Intuitively, once you have a grammar for parsing, it should be easy to reverse it and use it for generation. Such a sentence generator would be given a constituent with its SEM feature set to a logical form, and then it would use the grammar to decompose this constituent into a series of lexical constituents that would have Linking Syntax and Semantics 287 the appropriate meaning. Not all grammars are reversible, however. In fact, Grammar 9.3 is not reversible because it uses lambda reduction. To see why, consider an example. Say you want to generate a sentence that has the meaning ( si (NAME jl "Jill") ) Grammar 9.3 has only one S rule, and if you try to unify the SEM value in rule 1 with this logical form it will fail. The pattern (?semvp ?semnp) would match any proposition built out of a unary predicate and one argument, but the logical form specified is a proposition with three arguments. The problem is that lambda reduction was used to convert the original logical form, which was ((la( si a e (NAME jl "Jill") si a si (NAME jl "Jill") o)) There is nothing in rule 1 that indicates which is the right abstraction, but only the second will yield an appropriate sentence. On the other hand, the approach using features, as in Grammar 9.14, is reversible because it retains in the features the information necessary to determine how the logical form was constructed. It is easiest to understand how it does this by considering an example. In many ways parsing and realization are very similar processes. Both can be viewed as building a syntactic tree. A parser starts with the words and tries to find a tree that accounts for them and hence determine the logical form of the sentence, while a realizer starts with a logical form and tries to find a tree to account for it and hence determine the words to realize it. This analogy suggests that the standard parsing algorithms might be adaptable to realization. 
For instance, you could imagine using the standard top-down algorithm, where rules are selected from the grammar based on whether they match the intended SEM structure, and the words in the sentence are selected in a left-to-right fashion. The problem is that this approach would be extremely inefficient. For example, consider using Grammar 9.14 to realize an S with the SEM ( si (NAME jl "Jill") ) Rule 1 matches trivially since its SEM feature has a variable as its value. ?semvp. There are no other S rules, so this one is instantiated. The standard top-down algorithm would now attempt to generate the two subconstituents, namely (NP SEM ?semsubj) (VP SUBJ ?semsubj SEM ( si (NAME jl "Jill"))) 288 CHAPTER 9 Linking Syntax and Semantics 289 Initialization: Set L to a list containing the constituent that you wish to generate. Do until L contains no nonlexical constituents: 1. If L contains a constituent C that is marked as a nonlexical head, 2. Then use a rule in the grammar to rewrite C. Any variables in C that are bound in the rewrite should be instantiated throughout the entire list. 3. Else choose a nonlexical constituent C, giving preference to one whose SEM feature is bound, if one exists. Use a rule in the grammar to rewrite C. Any variables in C that are bound in the rewrite should be instantiated throughout the entire list. Figure 9.15 A head-driven realization algorithm But the SEM of the NP is unconstrained. The realizer could only proceed by randomly generating noun phrases and then attempting to realize the VP with each one and backtracking until an appropriate one is found. Worse than that, with a more general grammar, the algorithm may fall into an infinite loop. Clearly, this is an unworkable strategy. One method for avoiding these problems is to expand the constituents in a different order. For instance, choosing what term will be the subject depends on decisions made about the verb and the structure of the verb phrase, such as whether the active or passive voice is used, and so on. So it makes sense to expand the verb phrase first and then generate the appropriate subject. In fact, a good general strategy is to expand the head constituents first and then fill in the others. Figure 9.15 gives a simple algorithm to do this. The realization algorithm operates on a list of constituents much like the basic top-down parser described in Chapter 3. It continues to rewrite constituents in this list until the list consists only of lexical constituents, at which point the words can be generated. This algorithm can easily be generalized so that it searches all possible realizations based on the grammar using a backtracking technique. The reason it works well is that, by expanding the head constituents first, the algorithm moves quickly down to the lexical level and chooses the words that have the most influence on the structure of the overall sentence. Once the lexical head is chosen, most of the structure of the rest of the constituent is determined. Consider this algorithm operating with Grammar 9.14 and the initial input (S SEM ( si (NAME jl "Jill") si (NAME jl 'Jill") )) The nonlexical head constituents are indicated in italics. Thus the VP is expanded next. Only rule 3 will match the SEM structure. 
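Before continuing the worked example, the control loop of Figure 9.15 can be sketched in Python. This is only an illustration of the head-driven strategy: the dictionary representation of constituents and the rule interface (a function that either rewrites a constituent or returns None) are assumptions, and a real realizer would use feature unification and backtrack over alternative rules.

```python
# Sketch (not the book's implementation) of the head-driven realization loop in
# Figure 9.15. Each constituent is a dict; a rule rewrites a constituent into a
# list of subconstituents or returns None if it does not apply.

def realize(constituents, rules):
    agenda = list(constituents)
    while any(not c.get("lexical") for c in agenda):
        # Steps 1-2: prefer a constituent marked as a nonlexical head.
        target = next((c for c in agenda if c.get("head") and not c.get("lexical")), None)
        if target is None:
            # Step 3: otherwise prefer a nonlexical constituent whose SEM is bound.
            target = next((c for c in agenda if not c.get("lexical") and c.get("sem") is not None), None)
        if target is None:
            target = next(c for c in agenda if not c.get("lexical"))
        for rule in rules:
            rewrite = rule(target)
            if rewrite is not None:
                i = agenda.index(target)
                agenda[i:i + 1] = rewrite        # replace C by its subconstituents
                break
        else:
            raise ValueError("no rule applies to %r" % target)
    return agenda

# A toy grammar: S -> NP VP (VP is the head), VP -> V, NP -> NAME.
def s_rule(c):
    if c["cat"] == "S" and c["sem"]:
        subj, pred = c["sem"]
        return [{"cat": "NP", "sem": subj, "lexical": False},
                {"cat": "VP", "sem": pred, "lexical": False, "head": True}]

def vp_rule(c):
    if c["cat"] == "VP":
        return [{"cat": "V", "sem": c["sem"], "lexical": True}]

def np_rule(c):
    if c["cat"] == "NP":
        return [{"cat": "NAME", "sem": c["sem"], "lexical": True}]

out = realize([{"cat": "S", "sem": ("Jill", "LAUGHS1"), "lexical": False}],
              [s_rule, vp_rule, np_rule])
print([(c["cat"], c["sem"]) for c in out])   # [('NAME', 'Jill'), ('V', 'LAUGHS1')]
```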
As a result of the match, the following variables are bound: ?semv^ ?v «-sl ?semsubj <- (NAME jl "fill") ?semnp <- Thus you obtain the following list of constituents after rewriting the VP and instantiating the variables throughout the list: (NP SEM (NAME jl "Jill")) (V[_np] SEM ) (NP SEM ) Since there is no nonlexical head, the algorithm now picks any nonlexical constituent with a bound SEM, say the first NP. Only rule 5 will match, yielding the lexical constituent (NAME SEM "Jill") and producing the constituent list (NAME SEM "Jill") (V[_np] SEM) (NP SEM ) The remaining NP is selected next. Rule 6 matches and the subconstituents (ART SEM THE) and (CNP SEM DOG1) are generated, producing the constituent list (NAME SEM "Jill") (V[_np] SEM ) (ART SEM THE) (CNP SEM DOG1) The CNP constituent is selected next and rewritten as a common noun with SEM DOG1, and the algorithm is complete. The constituent list is now a sequence of lexical categories: (NAME SEM "Jill") (V[_np] SEM ) (ART SEM THE) (N SEM DOG1) It is simple to produce the sentence Jill saw the dog from the lexical constituents. The grammar used in this example was very small, so there is only one possible realization of this sentence. With a larger grammar, a wider range of forms would be possible. For instance, if only the SEM feature is specified, the realization program would randomly pick between active and passive sentences when allowed by the verb. In other cases it might pick between different subcategorization structures. For instance, the same logical form might be realized as Jill gave the dog to Jack or Jill gave Jack the dog. Depending on the 290 CHAPTER 9 Linking Syntax and Semantics 291 number of word senses used and whether different words have senses in common, the realizer may also randomly pick between various lexical realizations of a logical form. For instance, there could be a logical form that could be realized by the sentence Jill gave the money to the Humane Society or Jack donated the money to the Humane Society. Each of these variations may have different effects in context, but these distinctions are not captured in the logical form. To force particular realizations, you would have to specify other features in addition to the SEM feature. For instance, you might set the VOICE feature to active to guarantee an active voice sentence. Summary This chapter developed a rule-by-rule model of semantic interpretation by introducing the new features SEM and VAR to encode semantic information. The representation depends on the use of lambda expressions, and every constituent has a well-formed semantic form. With two minor extensions to introduce unique VAR values and to perform lambda reduction, any chart parsing algorithm will compute a logical form as it builds the syntactic structure. A variant of this approach that uses additional features instead of lambda expressions was dis -cussed. This representation produces a reversible grammar, which can be used for semantic realization as well as semantic interpretation. A head-driven realization algorithm was used to generate sentences from a specified logical form. This expands the head constituents first to determine the overall structure of the sentence before filling in the details on the other constituents. 
Related Work and Further Readings Most work in semantics in the last two decades has been influenced by the work of Montague (1974), who proposed that the semantics for natural languages could be defined compositionally in the same way as the semantics for formal languages (1974). He introduced several key concepts, the most influential for our purposes being the adoption of the lambda calculus. A good reference for Montague style semantics is Chierchia and McConnell-Ginet (1990). Dowty et al. (1981) and Partee et al. (1993) give more detailed coverage. The compositional semantic interpreter described in this chapter draws from a body of work, with key references being Rosenschein and Shieber (1982), Schubert and Pelletier (1982), and GPSG (Gazdar et al., 1985). Other important references on compositional interpretation include Hirst (1987), McCord (1986). Saint-Dizier (1985), and Hwang and Schubert (1993b). The use of feature unification in place of lambda reduction is discussed by Pereira and Shieber (1987) and Moore (1989). A good example of this technique is (Alshawi, 1992). There is a substantial literature in natural language generation, an area that deals with two main issues: deciding on the content of what should be said, and then realizing that content as a sentence. The algorithm in Section 9.7 addresses the realization part only, as it assumes that a logical form has already been identified. By the time the logical form is identified, many of the most difficult problems in generation—such as deciding on the content and on the main predicates and mode of reference—must have already been resolved. For a good survey of work in generation, see the article by McDonald in Shapiro (1992) under the heading Natural Language Generation. For a good collection of recent work, see the special issue of Computational Intelligence 7, 4 (1991) and the papers in Dale et al. (1990). The head-driven algorithm described here is a simplified version of that described by Shieber et al. (1990). Exercises for Chapter 9 1. (easy) Simplify the following formulas using lambda reduction: cax(Px))A) (ax(xA))(?iy(Qy))) ((Xx(ay(?y))x))A) 2. (easy) Using the interpretation rules defined in this chapter and defining any others that you need, give a detailed trace of the interpretation of the sentence The man gave the apple to Bill. In particular, give the analysis of each constituent and show its SEM feature. 3. (medium) This question involves specifying the semantic interpretation rules to analyze the different senses of the verb roll, as in We rolled the log into the river. The log rolled by the house. The cook rolled the pastry with a large jar. The ball rolled around the room. We rolled the piano to the house on a dolly. a Identify the different senses of the verb roll in the preceding sentences, and give an informal definition of each meaning. (You may use a dictionary if you wish.) Try to identify how each different sense allows different conclusions to be made from each sentence. Specify the lexical entries for the verb roll. b. List the VP rules in this chapter that are needed to handle these sentences, and add any new rules as necessary. c. Given the rules outlined in Exercise 3b, and assuming appropriate rules exist for the NPs, draw the parse tree with the logical forms shown for each constituent. 4. (medium) Draw the parse trees showing the semantic interpretations for the constituents for the following questions. 
Give the lexical entries showing 292 CHAPTER 9 the SEM feature for each word used that is not defined in this chapter, and define any additional rules needed that are not specified in this chapter. Who saw the dog? Who did John give the book to? (medium) Consider the following rules that are designed to handle verbs" that take infinitive complements. 1. (VP VAR ?v SEM (X ag (?semv ag (?semvp ag)))) -> (v SUBCAT _vp-inf SEM ?semv) (VP WORM inf SEM ?semvp) 2 (VP VAR ?v SEM (X ag (?semv ag (?semvp ?semnp)))) -> (VSUBCAT_np_vp-inf SEM ?semv) (NP SEM ?semnp) (VP VFORM inf SEM ?semvp) 3. (VP VFORM inf ?v SEM ?semvp) -> TO (VP VFORM base SEM ?semvp) Note that the semantic structures built identify the subject of the embedded VP. For _vp-inf verbs, rule 1, the implicit subject is the subject of the sentence, whereas for _np_vp-inf verbs, rule 2, the implicit subject is the object of the main verb. Using these rules and the rules described in this chapter, what logical forms are produced for the following sentences? Jill wants to cry. Jill wants Jack to cry. Who does Jill want to see? (medium) Consider the following sentences. John is happy. John is in the corner. John is a man with a problem. Each of these asserts some property of John. This suggests that the phrases happy, in the corner, and a man with a problem all should have similar semantic structures, presumably a unary predicate. In some approaches, the PRED feature is extended to forms other than PPs as a way of capturing the similarities between all three of these cases. Assuming this approach, what should the interpretation of these three phrases be when they have the +PRED feature, as in this context? Write out the necessary rules to add to the grammar described in this chapter so that each sentence is handled appropriately. CHAPTER Ambiguity Resolution 10.1 Selectional Restrictions 10.2 Semantic Filtering Using Selectional Restrictions 10.3 Semantic Networks 10.4 Statistical Word Sense Disambiguation 10.5 Statistical Semantic Preferences 10.6 Combining Approaches to Disambiguation Ambiguity Resolution 295 The techniques developed in the last chapter didn't address the issue of resolving ambiguity. The word bridge, for example, has at least four distinct senses as a noun: the elevated structure that spans a body of water, the card game, the dental device, and the abstract notion of providing a connection. To distinguish between these different readings, different senses are introduced, say STRUCTURE 1, GAME27, DENTAL-DEV37, and CONNECT12. Any given sentence will allow only a subset of the possible senses. For example, the sentence The bridge crosses the river only makes sense with the STRUCTURE1 sense. But the mechanism for Semantic interpretation in the last chapter allows each of these readings, as there is nothing that defines what combinations of senses are coherent. This chapter develops some ideas and techniques that start to address the ambiguity problem. Section 10.1 discusses using selectional restrictions and type hierarchies to specify allowable combinations of word senses and develops a constraint propagation algorithm that eliminates impossible word senses. Section 10.2 then uses these techniques during parsing to filter out semantically impossible constituents. Section 10.3 discusses semantic network representations, which generalize the simple type hierarchies. 
Section 10.4 discusses some statistical methods for word sense identification using word collocations, and Section 10.5 looks at generalizing the notion of selectional restrictions as statistical preferences. Section 10.6 discusses some general issues in combining structurally-based techniques with statistically-based techniques. 10.1 Selectional Restrictions Word senses can be related in different ways based on the object classes they describe. Some senses are disjoint; that is, no object can be in both classes at the same time. DOG1 (the typical sense of dog) and CAT1 (the typical sense of cat) are disjoint senses. Other senses are subclasses of other senses. The class DOG1, for instance, would be a subclass of the class MAMMAL1, and a subclass of the class PET1 (house pets). Other senses will overlap, such as MAMMAL1 and PET1. All this general knowledge plays a role in semantic disambiguation. The subset relation defines an abstraction hierarchy on the word senses. This relation is important as it allows restrictions to be stated in terms of very broad classes of objects in a concise and intuitive way. For instance, an adjective such as purple only makes sense if it is modifying a physical object, such as a house or a piece of clothing. It doesn't make sense, for instance, to talk of purple ideas or purple events. This can be stated using one requirement, namely that purple must modify a physical object. All senses below physical object would thus be allowed. The modifier precise, on the other hand, only makes sense modifying an idea or action (or as a general characteristic of the way a person behaves), and the modifier unfortunate only makes sense modifying some event or situation. Figure 10.1 shows a fragment of the top of a type hierarchy that is useful for natural language. This is incomplete, and all the objects in the ontology 296 CHAPTER 10 Ambiguity Resolution 297 ■—TIME — NON-LIVING 1—PERSON — PHYSOBJ - 1—ANIMATE- — INDtVIDUAL- — LIVING-- — ANIMAL ALL—— — LOCATION — ABSTRACTOBJ — VEGETATIVE — ABSTRACTEVENT — SITUATION - — EVENT - — PHYSEVENT — COLLECTION Figure 10.1 A word sense hierarchy mentioned in the last few chapters would need to find a place here. Also note that hierarchies need not be tree structures; that is, senses may have multiple super-types. The classifications of MALE and FEMALE, for instance, apply at the same level as ANIMATE/VEGETATIVE and combine with all these subclasses across the entire subtree of LIVING individuals. With this, the sense MALE-PERSON would be a subtype of both PERSON and MALE. Given such a hierarchy, you can start exploring the restrictions that predi -cates put on their arguments. Consider the verb read. It has two principal arguments: the agent, which must be an object capable of reading (for example, something of type PERSON), and the theme, which must be an object that contains text (such as a book, newspaper, label, sign, and so on). To handle such verbs correctly, we might introduce a new type, say TEXTOBJ, under NONLIVING, which is a superset of BOOK1, ARTICLE/TEXT, and so on. These constraints allow you to select appropriate senses for the words in a sentence. For example, say the noun dishwasher has two senses, either a machine (D1SH-WASH/MACH1) or a person (DISHWASH/PERS), and the noun article can be a paper (ARTICLE/TEXT) or a part of speech (ARTICLE 1). These senses fit into the hierarchy as shown in Figure 10.2. 
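A type hierarchy like the ones in Figures 10.1 and 10.2 can be represented very directly. The following sketch is not from the book: the dictionary of supertype links and the particular placement of senses are assumptions made for illustration (note that a sense may have more than one supertype, as with MALE-PERSON).

```python
# Sketch (not the book's code) of a word sense hierarchy with multiple supertypes,
# plus a test for whether one sense falls under another.

SUPERTYPES = {
    "MALE-PERSON": ["PERSON", "MALE"],
    "DISHWASH/PERS": ["PERSON"],
    "DISHWASH/MACH1": ["MACHINE"],
    "ARTICLE/TEXT": ["TEXTOBJ"],
    "ARTICLE1": ["WORD-CLASS"],
    "PERSON": ["ANIMATE"],
    "ANIMATE": ["LIVING"],
    "LIVING": ["PHYSOBJ"],
    "MACHINE": ["NON-LIVING"],
    "TEXTOBJ": ["NON-LIVING"],
    "NON-LIVING": ["PHYSOBJ"],
    "WORD-CLASS": ["ABSTRACTOBJ"],
}

def is_a(sense, ancestor):
    """True if `sense` is the same as, or lies below, `ancestor` in the hierarchy."""
    if sense == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in SUPERTYPES.get(sense, []))

print(is_a("DISHWASH/PERS", "PERSON"))    # True  -> can fill the AGENT of READS1
print(is_a("DISHWASH/MACH1", "PERSON"))   # False
print(is_a("ARTICLE/TEXT", "TEXTOBJ"))    # True  -> can fill the THEME of READS1
```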
Since these two words are ambiguous, the sentence The dishwasher read the article might appear to have four distinct semantic meanings, one for each combination of senses. But only one reading makes sense, namely (READS 1 [AGENT ] [THEME ]) A semantic interpreter can perform this form of disambiguation by using= what are called selectional restrictions. These are specifications of the legal-combinations of senses that can Co-occur and can be applied to eliminate " incoherent formulas constructed by the parser. -NOUNl ,—ABSTRACTOBJ —WORD-CLASS '—PHYSOBJ - -c I-MACMlINt - I— TF.XTORI- ARTICLE1 ■MACHINE-DISHWASH/MACH |—NON-LIVING —I |—ARTICLE/TEXT TEXTOBJ ■ —book1 —PERSON— DISHWASH/PERS I—ANIMATE — 1—LIVING- I—ANIMAL —VEGETATIVE Figure 10.2 A fragment of the hierarchy To incorporate this technique into the semantic interpreter, you need a mechanism for extracting the type information inherent in the logical form. This is obtained by examining each term in the logical form in terms of the unary and binary predicates that involve it. For instance, consider the logical form of The dishwasher read the article before applying any selectional restrictions: (READS 1 rl [AGENT ] [THEME ]) Unpacking the notation, the following unary and binary relations are found: (READS 1 rl) ({DISHWASH/MACH 1 DISHWASH/PERS} dl) ({ARTICLE/TEXT ARTICLE1} pi) (AGENT rl dl) (THEME rl pi) Finding the allowable combinations can be viewed as a constraint satisfaction problem. In this case the constraints include the semantic relationships between unary predicates (subclass, disjoint, and so on), which are expressed in type hierarchies such as those in Figures 10.1 and 10.2, and selectional restrictions on the semantic types of the arguments to the binary relations. The selectional restrictions for READS 1 are expressed as follows: (AGENT READS1 PERSON)—the agent must be a person. (THEME READS1 TEXTOBJ)—the theme must be a TEXTOBJ object. 298 CHAPTER 10 Ambiguity Resolution 299 If these are the only allowed relations for the READS 1 sense, several of the previously mentioned possibilities for the terms in the logical form can be eliminated. For (AGENT rl dl) to be valid, dl must be a person. Thus the unary constraints on dl can be simplified from ({DISHWASH/MACH1 DISH-WASH/PERS} dl) to (DISHWASH/PERS dl). Similarly, the interpretation of pi is simplified to (ARTICLE/TEXT pi). By transferring these constraints back into the logical form, we end up with a single unambiguous reading, as desired. Of course, in many cases it is more complicated than this. The verb read, for instance, might have two senses: READS 1, and, say, READS2, as a form of understanding a person's intentions, as in Jill can read John's mind. The selectional restrictions for READS2 might be (AGENT READS2 PERSON) (THEME READS2 MENTAL-STATE) With this additional sense, the initial logical form of The dishwasher read the article is ({READS 1 READS2} rl [AGENT ] [THEME ]) This additional ambiguity does not affect the final result because the READS2 sense requires a MENTAL-STATE as a THEME, and neither of the senses of article is a subclass of mental state. Thus, there is no combination of interpretations that allows the READS2 sense. We also need to extend this technique to proper names, pronouns, and adjectives. We will have to assume that we have type information for what proper names may refer to. For example, the type of John might be MALE, that is, an animate object that is MALE. Unknown names might just default to having the property INDIVIDUAL. 
The type of pronominal phrases is defined by the pronoun sense, thus SHE1 should be a subclass of FEMALE, whereas IT1 would be anything but PERSON. Adjectives act like verbs in that they impose restrictions on their arguments. Such modifiers can be handled by introducing the state variable representation for each adjective modifier and a new thematic relation MOD. For example, for the phrase happy dishwasher, instead of using the predicate-argument form (HAPPY 1 dl), we will use the unary relation (HAPPY-STATE hi) and the binary relation (MOD hi dl). This way, adjectives are treated the same way as verbs. Thus the set of relations derived from the sentence The happy dishwasher read the paper would be (READS! rl) ((D1SHWASH/MACII1 DISHWASH/PERS} ({ARTICLE/TEXT ART1CLEI} pi) (HAPPY-STATE hi) dl) Initialization Step Assign typestvariablei) to the list of possible senses for variable i. Iteration Step Iterate through each binary relation (rel variable 1 variable2): 1. For each sense ] in typestvariablej): a. find all selectional restrictions (rel sensey sense2) where sense2 intersects with some sense in types(variable2). b. If none found, remove sensej from types(variablej). 2. Eliminate from types(variable2) any sense that did not match at least one restriction in step 1. Termination Step If any changes were made to the types of the variables in the last iteration, then perform the iteration step once again. Otherwise, if type(variablej) is empty for any i, then fail. Figure 10 J A simple constraint satisfaction algorithm (AGENT rl dl) (THEME rl pi) (MOD hi dl) The selectional restriction for the adjective happy would be (MOD HAPPY-STATE ANIMATE)—HAPPY-STATE must modify an animate object. Every time a logical form is produced, the set of unary and binary semantic relations it contains can be checked against the selectional constraints. If there is no interpretation that satisfies the constraints, then the interpretation is anomalous and the constituent can be discarded. Otherwise, the simplified logical form, with the impossible senses removed, can be used as the SEM of the constituent. In the remainder of this section, we explore the constraint satisfaction algorithm in a little more detail. The algorithm is shown in Figure 10.3. To implement the algorithm, you need a procedure that matches senses. A match fails if the two types have no intersection at all. Otherwise, it returns a type that is the intersection of the two. If one is a subtype of the other, the subtype is the result. Given the hierarchy in Figure 10.2, the result of Match(PERSON, DISHWASH/PERS) is DISHWASH/PERS. In other cases the types might overlap and return a sense that is the subtype of both senses. For example, with appropriate definitions, the result of Match(MALE, PERSON) would be MALE-PERSON I. The algorithm starts by making a list of all the possible senses of each discourse variable and then iterates through the binary relations to eliminate any unary constraints that cannot satisfy at least one binary constraint. If a change is 300 CHAPTER 10 Ambiguity Resolution 301 made, the binary constraints are rechecked again. This continues until no further changes are made. As an example, consider running this algorithm on the sentence The dishwasher read the article. The initialization step produces the following types: r types(rl)= READS1, READS2 types(pl) = ARTICLE/TEXT, ARTICLE 1 types(dl) = DISHWASH/PERS, DISHWASH/MACH1 The binary relations for this problem are (AGENT rl dl) and (THEME rl pi). 
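Before stepping through the iterations, here is a rough Python sketch of the algorithm in Figure 10.3. It is illustrative only: the hierarchy, the data layout, and the Match procedure are simplified assumptions (Match here just keeps the more specific of two compatible senses rather than computing a true intersection type), not the book's implementation.

```python
# Sketch (not the book's code) of the constraint propagation algorithm of Figure 10.3.
# types: variable -> set of candidate senses; relations: (rel, var1, var2) triples;
# restrictions: (rel, sense1, sense2) triples; isa: child sense -> parent sense.

def below(s, t, isa):
    while s is not None:
        if s == t:
            return True
        s = isa.get(s)
    return False

def match(s1, s2, isa):
    """Return the more specific sense if one subsumes the other, else None."""
    if below(s1, s2, isa):
        return s1
    if below(s2, s1, isa):
        return s2
    return None

def propagate(types, relations, restrictions, isa):
    changed = True
    while changed:
        changed = False
        for (rel, v1, v2) in relations:
            keep1, keep2 = set(), set()
            for s1 in types[v1]:
                for (r, r1, r2) in restrictions:
                    if r == rel and match(s1, r1, isa):
                        # senses of v2 compatible with this restriction's filler type
                        compatible = {s2 for s2 in types[v2] if match(s2, r2, isa)}
                        if compatible:
                            keep1.add(s1)
                            keep2 |= compatible
            if keep1 != types[v1] or keep2 != types[v2]:
                types[v1], types[v2] = keep1, keep2
                changed = True
    return types   # an empty set for any variable signals an anomalous reading

isa = {"DISHWASH/PERS": "PERSON", "DISHWASH/MACH1": "MACHINE",
       "ARTICLE/TEXT": "TEXTOBJ", "ARTICLE1": "WORD-CLASS",
       "READS1": "ACTION", "READS2": "ACTION"}
types = {"r1": {"READS1", "READS2"},
         "d1": {"DISHWASH/PERS", "DISHWASH/MACH1"},
         "p1": {"ARTICLE/TEXT", "ARTICLE1"}}
relations = [("AGENT", "r1", "d1"), ("THEME", "r1", "p1")]
restrictions = [("AGENT", "READS1", "PERSON"), ("THEME", "READS1", "TEXTOBJ"),
                ("AGENT", "READS2", "PERSON"), ("THEME", "READS2", "MENTAL-STATE")]
print(propagate(types, relations, restrictions, isa))
# -> {'r1': {'READS1'}, 'd1': {'DISHWASH/PERS'}, 'p1': {'ARTICLE/TEXT'}}
```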
The first iteration step runs as follows:

For (AGENT r1 d1), we iterate through the senses of r1:
READS1—we find selectional restriction (AGENT READS1 PERSON), and PERSON matches only DISHWASH/PERS (with result DISHWASH/PERS).
READS2—we find selectional restriction (AGENT READS2 PERSON), and PERSON matches only DISHWASH/PERS (with result DISHWASH/PERS).
Thus types(d1) becomes (DISHWASH/PERS); that is, DISHWASH/MACH1 has been eliminated because it cannot satisfy any binary constraint.

For (THEME r1 p1), we iterate through the senses of r1:
READS1—we find selectional restriction (THEME READS1 TEXTOBJ), because TEXTOBJ matches ARTICLE/TEXT (with result ARTICLE/TEXT).
READS2—we find no matching selectional restriction; that is, (THEME READS2 MENTAL-STATE) cannot be satisfied.
Thus types(r1) becomes (READS1); that is, READS2 is eliminated, and types(p1) becomes (ARTICLE/TEXT) because ARTICLE1 is eliminated.

Since changes were made, we iterate again. The second iteration step runs as follows:

For (AGENT r1 d1), only one sense of r1 remains:
READS1—we find selectional restriction (AGENT READS1 PERSON).
For (THEME r1 p1):
READS1—we find selectional restriction (THEME READS1 TEXTOBJ).

Since no changes were made this time, we are done. The final types are

types(r1) = READS1
types(p1) = ARTICLE/TEXT
types(d1) = DISHWASH/PERS

as desired.

Selectional restrictions are also very useful for further refining the type of unknown objects. For instance, the pronoun it inherently offers little restriction on the type of object it refers to except that it is not a person. The selectional restrictions on the verb, however, can give a strong indication of the type of the object it refers to. Consider the processing on the sentence He read it, assuming just the READS1 sense of the verb:

(READS1 r3 [AGENT (PRO i1 (HE1 i1))] [THEME (PRO n1 (IT1 n1))])

The unary and binary constraints on the objects are

(READS1 r3), (AGENT r3 i1), (THEME r3 n1), (MALE i1), (IT1 n1)

After applying the selectional restrictions for the sense READS1, the type of he will be constrained to be of type MALE-PERSON1 (the intersection of MALE and PERSON), and the type of it will be constrained to be a TEXTOBJ (the intersection of IT1 and TEXTOBJ). This additional information could be extremely useful as you move on to contextual processing, and the referents of these pronouns need to be identified. To maintain this information, it is added to the logical form. Thus, after applying the selectional restrictions, the logical form of the sentence would be

(READS1 r3 [AGENT (PRO i1 (& (MALE i1) (PERSON i1)))] [THEME (PRO n1 (& (IT1 n1) (TEXTOBJ n1)))])

Selectional restrictions provide an important technique for disambiguation, and are used in some form in almost every computational system. The approach has been criticized, however, and it is worth examining the criticisms and then considering why the methods remain so useful in computational systems despite the problems. The main problem is that semantic well-formedness is a continuous scale rather than an all-or-nothing decision. If you are forced to make absolute conditions on well-formedness, either you will eliminate many possible interpretations or you will have so abstract a constraint that it eliminates almost nothing. Consider the following sentences:

1. I ate the pizza.
2. I ate the box.
3. I ate the car.
4. I ate the thought.

Clearly, sentence 1 is good and sentence 4 has some serious problems. But what about the ones in between? If you had been asked to specify the selectional restrictions for the THEME role of eating events, you would probably have suggested a category like FOOD1. This would allow sentence 1 but make 2
If you had been asked to specify the selectional restrictions for the THEME role of eating events, you would probably have suggested a category like FOOD1. This would allow sentence 1 but make 2 302 CHAPTER 10 Ambiguity Resolution 303 ungrammatical. But clearly 2 is not ungrammatical. While not a normal thing ui do, I could eat a box, and 2 seems to be a perfectly good sentence. But what cate-2 gory includes FOOD and boxes (and paper, hats, and so on, that one might eat)? We might suggest PHYSOBJ, which would then accept sentences 1, 2, and 3. But- = in doing so we lose the information that foods are more likely to be eaten than? nonfoods, especially metal objects like cars. In particular, a word like chip that isi ambiguous between a food (potato chips) and some nonfoods (wood chipsJJ would remain ambiguous in / ate some chips. This means that selectional' restrictions will have little power for disambiguation (in this case distinguishing only between abstract and concrete objects). Z Selectional restrictions are also sensitive to the context of the proposition. For instance, in a negated context, restrictions can be violated. The sentence / could not eat a car, for example, should be perfectly fine even if the EATS 1 event requires a THEME of type FOOD1. Selectional restrictions also will clearly not apply for a sentence such as My car drinks gasoline where metaphor is used, although they still could be used to signal places where metaphor-based analysis should be attempted. Despite all these problems, selectional restrictions have turned out to be extremely useful in actual systems. One reason is that if a system operates in a limited domain, then unusual interpretations such as eating * boxes often do notarise, and thus the restriction (THEME EATS1 FOOD) may function perfecfly well. Of course, the technique may start to run into problems^ as the range of coverage grows and starts to allow such complications. One way to reconcile the approach with these problems is to view selectional restrictions as preferences rather than absolute requirements. In this case an interpretation that satisfies all the selectional restrictions will be preferred over ones that contain a violation, but if all interpretations violate some restrictions, then those that violate the least number are preferred. This type of scheme can be implemented by ranking each interpretation based on how well it satisfies the selectional restrictions and picking the one that violates the fewest restrictions. 10.2 Semantic Filtering Using Selectional Restrictions There are at least two ways that selectional restrictions can be added to a parser. The sequential model involves simply running the parser and then checking all -the complete S interpretations it finds. The incremental model involves checking each constituent as it is suggested by the parser and discarding it if it is seman-tically ill-formed. The incremental method can be significantly more efficient than the sequential method. This section demonstrates its efficiency using a simple example. In particular, we will assume that the word sense hierarchy is a simple tree. It is not difficult to generalize the model, but it would make the example more complicated and tend to mask the basic issues. Consider the following sentence, which syntactically has two PP attachment ambiguities: He booked a flight to the city for me. 
1. (S SEM ?semvp) -> (NP SEM ?semsubj) (VP SUBJ ?semsubj SEM ?semvp)
2. (VP SUBJ ?semsubj SEM ?semv) -> (V[_none] SUBJ ?semsubj SEM ?semv)
3. (VP SUBJ ?semsubj SEM ?semv) -> (V[_np] SUBJ ?semsubj OBJ ?semnp SEM ?semv) (NP SEM ?semnp)
4. (NP VAR ?v SEM (PRO ?v (?sempro ?v))) -> (PRO SEM ?sempro)
5. (NP VAR ?v SEM <?semart ?v ?semcnp>) -> (ART SEM ?semart) (CNP SEM ?semcnp)
6. (CNP VAR ?v SEM (?semn ?v)) -> (N SEM ?semn)
7. (CNP SEM (& ?semcnp ?sempp)) -> (CNP VAR ?v SEM ?semcnp) (PP PRED + ARGVAR ?v SEM ?sempp)
8. (VP SEM (& ?semvp ?sempp)) -> (VP VAR ?v SEM ?semvp) (PP PRED + ARGVAR ?v SEM ?sempp)
9. (PP PRED + ARGVAR ?v1 SEM (?semp ?v1 ?semnp)) -> (P SEM ?semp) (NP SEM ?semnp)
Head features for S, VP, NP, CNP: VAR

Grammar 10.4 A small grammar allowing PP attachment ambiguity

With a grammar that allows PPs to be attached to either VPs or CNPs, this sentence has five different structures: The PP to the city may modify the verb booked or the noun flight, and the prepositional phrase for me may modify the noun city, the noun flight, or the verb booked. There are five ways to combine these possibilities into a legal syntactic structure. If you consider the semantic meaning of the sentence, however, there is only one plausible reading: The flight is to the city and it was booked for me. This intuition can be captured by the selectional restrictions for the verb book and the nouns flight and city.

Grammar 10.4 is a small grammar that allows the different forms of PP attachment. It uses the techniques described in Section 9.6 and does not require the use of lambda reduction. Only the features relevant to semantic interpretation are shown. Rules 7 and 8 introduce the PP modifiers. The feature ARGVAR is used to pass a discourse variable into the PP so that the appropriate proposition can be constructed. Otherwise, this grammar is quite similar to Grammar 9.14. Figure 10.5 shows some lexical entries and a hierarchy of word senses for use in semantic checking. The predicates HE1 and ME1 encode the semantic restrictions for the pronouns he and me, respectively, which are approximated for this example by placing them below the sense PERSON. The selectional restrictions relevant for this section are the following:

(AGENT BOOKS1 PERSON1)
(THEME BOOKS1 FLIGHT1)
(BENEFICIARY ACTION1 PERSON1)
(DESTINATION FLIGHT1 CITY1)
(NEARBY PHYSOBJ PHYSOBJ)
(NEARBY ACTION PHYSOBJ)

Figure 10.5 A small lexicon and word sense hierarchy. The hierarchy places PHYSOBJ and ACTION under TOP, with PERSON (covering HE1 and ME1) and INANIMATE (covering FLIGHT1, COLLEGE1, and CITY1) under PHYSOBJ, and BOOKS1 under ACTION. The lexicon contains the following entries:

a (ART AGR 3s SEM INDEF1)
booked (V SUBCAT _np VFORM past SUBJ ?subj OBJ ?obj SEM (& (BOOKS1 *) (AGENT * ?subj) (THEME * ?obj)))
city (N AGR 3s SEM CITY1)
college (N AGR 3s SEM COLLEGE1)
flight (N AGR 3s SEM FLIGHT1)
for (P PFORM for SEM BENEFICIARY)
he (PRO AGR 3s SEM HE1)
me (PRO AGR 1s SEM ME1)
near (P PFORM near SEM NEARBY)
the (ART AGR {3s 3p} SEM THE)
to (P PFORM to SEM DESTINATION)

These require that the agent of a booking action must be a person, and the theme must be a flight. The third restriction states that the beneficiary relation can hold between any action and a person, and the fourth restriction allows the destination relation to relate flights to cities. The last two restrictions allow the nearby relation to hold between two physical objects and between an action and a physical object, as in He sighed near the boat. This is all the information needed to support the examples. Semantic filtering is added to the chart parser when an entry is to be added to the chart.
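The check described in the next paragraph—testing a proposed constituent against the selectional restrictions before it goes on the chart—might be hooked into a chart parser roughly as follows. This is only an illustrative sketch, not the book's implementation: the chart, constituent, and restriction representations are assumptions, and the sense names below are used consistently with the toy hierarchy rather than exactly as listed above. Relations that still contain unbound variables are simply skipped, as described next.

```python
# Sketch (not the book's code) of incremental semantic filtering for a chart parser.
# SEL holds (relation, head-sense, filler-sense) restrictions; ISA is the hierarchy.

ISA = {"HE1": "PERSON", "ME1": "PERSON", "PERSON": "PHYSOBJ",
       "FLIGHT1": "INANIMATE", "CITY1": "INANIMATE", "COLLEGE1": "INANIMATE",
       "INANIMATE": "PHYSOBJ", "PHYSOBJ": "TOP", "BOOKS1": "ACTION", "ACTION": "TOP"}

SEL = {("AGENT", "BOOKS1", "PERSON"), ("THEME", "BOOKS1", "FLIGHT1"),
       ("BENEFICIARY", "ACTION", "PERSON"), ("DESTINATION", "FLIGHT1", "CITY1"),
       ("NEARBY", "PHYSOBJ", "PHYSOBJ"), ("NEARBY", "ACTION", "PHYSOBJ")}

def subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in the ISA hierarchy."""
    while specific is not None:
        if specific == general:
            return True
        specific = ISA.get(specific)
    return False

def satisfies(relation, head_sense, filler_sense):
    """A relation instance is acceptable if some restriction covers both argument types."""
    return any(rel == relation and subsumes(h, head_sense) and subsumes(f, filler_sense)
               for (rel, h, f) in SEL)

def add_to_chart(chart, constituent):
    """Add a constituent only if its fully instantiated relations pass the restrictions."""
    for (rel, head_sense, filler_sense) in constituent.get("relations", []):
        if "?" in head_sense or "?" in filler_sense:   # skip relations with unbound variables
            continue
        if not satisfies(rel, head_sense, filler_sense):
            return False                                # discard: semantically anomalous
    chart.append(constituent)
    return True

# The VP "booked the flight to the city" with the PP attached to booked is rejected:
chart = []
bad_vp = {"relations": [("THEME", "BOOKS1", "FLIGHT1"), ("DESTINATION", "BOOKS1", "CITY1")]}
print(add_to_chart(chart, bad_vp))   # -> False
```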
Before it is added, a check is made to ensure that no relation present in the logical form violates the selectional restrictions. If the SEM value is fully instantiated, then the algorithm specified in the last section can be applied directly. If the SEM value contains variables, then only the relations in the SEM that do not contain variables are checked. If the SEM satisfies the restrictions, then the constituent is added as usual. If it doesn't satisfy the restrictions, the constituent is discarded. Not only does this eliminate the constituent from the chart, but it also means that any larger constituents that would have been built from this constituent will never be considered by the parser. Thus this technique can lead to significant efficiency improvements even with a bottom-up parsing technique. For example, given Grammar 10.4, consider the bottom-up chart parser on ■ the sentence He booked the flight to the city for me. Without semantic filtering, the parser finds five different interpretations and generates 52 constituents on thez chart. With semantic filtering, it finds one interpretation and generates only 33 constituents. Consider the first constituent suggested by the parser that is rejected by semantic filtering: (VP SEM (BOOKS 1 v258 , [AGENT ?semsubj] [THEME ] [DESTINATION ]) VAR v258 SUBJ ?semsubj) This constituent is suggested by rule 8, once constituents for the VP book the flight and the PP to the city are built. It is rejected because it violates the selectional restrictions on the DESTINATION predicate, which cannot relate a booking event to a city; that is, the triple (DESTINATION BOOKS 1 CITY1) does not match any selectional restriction. More specifically, the analysis is as follows. The unary constraints on the variables are v258 BOOKS 1 v260 FLIGHT 1 v263 CITY1 From these, the relation forms generated from the SEM are (THEME BOOKS1 FLIGHT 1) and (DESTINATION BOOKS1 CITY1). The latter form is the one that does not satisfy any selectional restriction. Note that this constituent can be eliminated even though the subject is not yet bound. The savings become even more significant as more complex sentences are considered. For instance, on the sentence He booked a flight to the city near the college for me. the parser using the sequential strategy finds 14 different interpretations and generates 116 constituents on the chart. The parser with incremental semantic filtering finds only 3 interpretations and generates 63 constituents. These 3 readings are all semantically plausible—in the first the city is near the college, in the second the flight to the city is near the college, and in the third the booking event occurred near the college. The first reading is more natural given no information about the context of the sentence, but the others would be plausible in certain contexts. Since we are not considering context until Part III of the book, this is the best that you can expect using context-independent techniques. 10.3 Semantic Networks In the last section you saw that semantic type hierarchies are very important for ambiguity resolution. This section generalizes these ideas further and develops a representation of lexical knowledge called a semantic network. 
Semantic networks ease the construction of the lexicon by allowing inheritance of properties, 306 CHAPTER 10 Ambiguity Resolution 307 (non-animate) r—-\ (^N-LIVING^) (^VEt^TABLE^) Figure 10.6 Part of a type hierarchy and they provide a richer set of semantic relationships between word senses to support disambiguation. A semantic network is a graph with labeled links between labeled nodes The nodes represent word senses or abstract classes of senses, and the linK represent semantic relationships between the senses. The simple abstraction hiei -archy shown in Figure 10.6 indicates some semantic relationships between classes of physical and abstract objects. The s arc indicates the subtype relation -ship. This is just an alternative representation for type hierarchies, as seen in the last two sections. The selectional restrictions for semantic relations can also be stored in a network form using arcs. Thus you might say that all actions have an agent case filled by an animate object using the network in Figure 10.7. Introduced here is > new node type, an existential node, depicted by a square, which represents a particular value. In this case the agent case is restricted to be an object of type ANIMATE. Many such representations allow other information to be stored about case values as well, such as whether they are obligatory or not and whethei they represent a single filler or a group of fillers. An important property of semantic networks is the inheritance of propci -ties. For example, given the network shown in Figure 10.8, the action class RUNS 1 would inherit the property that every instance has an AGENT role filled by an ANIMATE object. Typically, inheritance is implemented as a graph search starting at the most specific node. To find the restrictions for a certain role R. you first check the most specific node (such as RUNS1) for an R relation, and only if one is not found do you move up the s links (for example, at node ACTION). JI an R relation is not found there either, you continue to search up the s hici.iii liv until either an R relation is found or you hit the top of the s hierarchy. -1 ' 3 1 agent ANIMATE Figure 10.7 All actions have an animate agent f^^^tion^^ AUtUNT / \ V ANIMATE \ Figure 10.8 a network showing inheritance of roles Inheritance hierarchies are extremely useful for expressing selectional restrictions across broad classes of verbs. Figure 10.9 shows the selectional restrictions for a set of verb senses that are subclasses of ACTION. Using the inheritance mechanism, you can see that the action class TRANSFER-ACTION allows the semantic relations AGENT, AT-TIME, and AT-LOC, inherited from the class ACTION; the cases THEME and INSTR, inherited from the class OBJ/ACTION; and the case TO-POSS, which is explicitly defined for TRANSFER-ACTION. Note that this representation defines all the allowable semantic relations, not just ones explicitly subcategorized for by the verb. Often a verb imposes stricter constraints on its arguments than what would be inherited. For example, the sense READS 1 fits under the OBJ/ACTION hierarchy and thus could inherit the roles AGENT, AT-TIME, AT-LOC, THEME, and INSTR. But we would like the THEME redefined to be of type TEXTOBJ. The inheritance algorithm sketched above uses the most specific information it has for any verb, since it stops as soon as the first information is found. So this new THEME restriction overrides the more general inherited one. 
Semantic networks are also used to represent other forms of general knowledge about the structure of words besides the subtype and argument relationships. Another important hierarchy is the part-of hierarchy, in which 308 CHAPTER 10 Ambiguity Resolution 309 AGENT ANIMATE TIME LOCATION PHYSOBJ ANIMATE TEXTOBJ Figure 10.9 Action hierarchy with roles objects are related to their subparts. This relationship is often expressed in English using the preposition of, or in noun-noun modification, or sometimes with the possessive form, as in The desk drawer The man's head The handle of the drawer (The drawer that is part of the desk) (The head that is part of the man) (The handle that is part of the drawer) To recognize such relationships, and to disambiguate word senses based on these relationships, you need to encode information about the structure of objects in the semantic network. This is done by introducing a new link (the part-of link) that encodes this relationship. Thus you can represent that a house has rooms and doors as subparts and that doors have handles, as in Figure 10.10. In this figure, rather than putting the type of the existential node on the node, it is encoded using an isa arc, which is found in most semantic network representations. A complete representation system must be able to indicate whether a subpart is a unique object (the handle on a door) or a set of objects (the rooms in a house). For instance, if a person's body has the subparts head, body, arms, and; (^DO DOORHANDLE DOOR SUBPART HOUSE SUBPART SUBPART ROOM Figure 10.10 Some subpart relationships legs, then you need to represent that the head is unique and is connected to the body, that there are two arms connected to the body, and so on. All this would be part of the subpart classification of BODY. In addition, the system must be able to represent the spatial and other relationships that hold between subparts. Further details on these issues, though, would lead this book too far afield into knowledge representation. Besides providing inheritance of properties, semantic networks can also be used for other purposes as well. For instance, one property that will become useful later concerns the semantic closeness of two different word senses. As an initial attack on this problem, we could evaluate the semantic closeness of two objects by finding the distance between the two within the type hierarchy. For instance, with a slightly expanded hierarchy from Figure 10.6, we could find that DOG1 and CAT1 are closely related, as they share the common supertype ANIMATE (only one link away from each). DOG1 and CARROT 1, on the other hand, are less closely related, as they share a common supertype PHYSOBJ, several links away from each. DOG1 and EVENT would be even further away. This technique has a few difficulties in that it depends somewhat on the whim of the person who constructed the network, and a different organization might give a different set of answers. As a consequence, a different measure has been suggested that uses the size of the smallest common superset to give a measure of closeness. This way, the individual classification produced by the programmer is not so influential. Basically, the smaller the common set that includes both senses, the more closely related the senses are. 
In this approach the same results would be obtained for the example above, as the size of the set ANIMATE (the smallest common superset of CAT1 and DOG1) is definitely less than the size of the set PHYSOBJ (the smallest common superset of DOG1 and CARROT1), which in turn is definitely less than the size of the set ALL (the smallest common superset of DOG1 and EVENT). Of course, using the type hierarchy only captures one notion of semantic closeness. There are many other structures that would have to be encoded to capture the notion more adequately. For instance, you would expect an individual of type DOG1 to be semantically related to the event DOG-SHOW1, but this isn't captured by the type hierarchy. For this sort of association, you would need a closeness measure in terms of what objects typically participate in what events. Given the concepts developed so far, it might work to have a new semantic role, say PARTICIPANT, that relates events to the types of objects they involve. But the hierarchies developed in this section will be sufficient to illustrate the issues that follow.

10.4 Statistical Word Sense Disambiguation

Selectional restrictions provide only a coarse classification of acceptable and nonacceptable forms. As a result, many cases of sense ambiguity cannot be resolved. To better model human processing, more predictive techniques must be developed that give a preference for the common interpretations of senses over the rarer senses. This section explores the use of statistical methods to formalize the notion of common senses. As always, significant effort is necessary to balance the amount of information needed to calculate accurate statistics with the ability to obtain useful information.

The simplest techniques are based on simple unigram statistics. Given a suitably labeled corpus, we can collect information on the usage of the different senses of each word. For instance, we might find 5845 uses of the word bridge:

5651 uses of STRUCTURE1
194 uses of DENTAL-DEV37

Given this data, we would guess that bridge occurs in the STRUCTURE1 sense every time. If our training data is representative, this technique would give the right answer 97 percent of the time (5651 times out of 5845 tries). More generally, it has been estimated that this simple strategy would be correct about 70 percent of the time for a broad range of English. Of course, we would like to do much better than this by including some effect of context. Consider the rare sense DENTAL-DEV37. Although it occurs very rarely in the entire corpus, in certain texts (those on dentistry and orthodontics) it will be the most common sense of the word. If you could develop a method to reliably identify such contexts, you could select this sense and perform better than the naive algorithm.

It turns out that the actual words used in a text give a good indication of the topic. For instance, in articles on dentistry, there will be many uses of words such as teeth, dentist, cavity, braces, orthodontics, and so on. We want to prefer the DENTAL-DEV37 sense when it is found in the presence of such words. Such information is concerned with word collocations, that is, what words tend to appear together. You might consider bigram probabilities, trigrams, or larger groups, say the five surrounding words, the entire sentence, or even larger contexts (some studies have been done using the nearest 100 words). The amount of text examined for each word is called the window.
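As a rough illustration of how such windowed data might be collected, the following sketch (in Python; the sense-tagged corpus format is an assumption made for illustration) records, for each sense of a target word, how often each word appears in an n-word window around it:

# A rough sketch of collecting windowed co-occurrence counts from a
# sense-tagged corpus, represented here as a list of (word, sense) pairs.

from collections import Counter, defaultdict

def window_counts(tagged_corpus, target, n):
    """For each sense of target, count the words appearing in an n-word window."""
    words = [w for w, _ in tagged_corpus]
    sense_totals = Counter()
    cooccur = defaultdict(Counter)
    half = n // 2
    for i, (w, sense) in enumerate(tagged_corpus):
        if w != target:
            continue
        sense_totals[sense] += 1
        for neighbor in words[max(0, i - half): i] + words[i + 1: i + 1 + half]:
            cooccur[sense][neighbor] += 1
    return sense_totals, cooccur

corpus = [("the", None), ("dentist", None), ("fixed", None),
          ("the", None), ("bridge", "DENTAL-DEV37"), ("on", None),
          ("my", None), ("teeth", None)]
print(window_counts(corpus, "bridge", 5))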
One way to try to do this might be to adapt part-of-speech tagging techniques to use word senses rather than syntactic categories. For this method you need a corpus of words tagged with their senses. Then you could compute unigram statistics (for example, the probability that word w has sense S), bigram statistics, and so on. Of course, building such a corpus requires a considerable effort, but work is underway to make such corpora available. Two factors, however, make directly applying the tagging techniques impractical: First, there are many more senses than syntactic categories; and second, to obtain reasonable results you must use a larger context than simple bigrams or trigrams. Both of these problems mean that it is essentially impossible to obtain sufficient data for training and that a different approach must be used.

The basic idea is to estimate the probability of the senses of a word w relative to a window of words in the text centered on w. Given a window size of n centered on a word w, the words in the window are indicated as follows:

w1 w2 ... wn/2 w wn/2+1 ... wn-1

You want to compute the sense S of a word w that maximizes the formula

PROB(w/S | w1 w2 ... wn/2 w wn/2+1 ... wn-1)

To estimate these probabilities, we rewrite the formula using Bayes' rule and then make independence assumptions, as we have done several times before. The formula becomes

PROB(w1 ... wn-1 | w/S) * PROB(w/S) / PROB(w1 ... wn-1)

Since the denominator doesn't change for each sense, it can be ignored. Furthermore, assuming that each wi appears independently of the other words in the window, PROB(w1 ... wn-1 | w/S) can be approximated by

Πi=1,n-1 PROBn(wi | w/S)

where PROBn(wi | w/S) is the probability that word wi occurs in an n-word window centered on word w in sense S. Putting all this together, the best sense S will be the one that maximizes the formula

PROB(w/S) * Πi=1,n-1 PROBn(wi | w/S)

Because of the assumption that each event is independent of all the others, it turns out that the larger the window used, the less data is needed, since more data can be collected from each window. Given a corpus, you collect the data by considering the n-word window for each word, counting the number of times the word occurs in each of its senses, and recording all the words in the window. Consider the hypothetical information in Figure 10.11 for the word bridge, using a window size of 11 on a corpus of 10 million words.
Based on the definition of PROBn(wi | w/S), you can estimate its values as follows:

PROBn(wi | w/S) = Count(# times wi occurs in a window centered on w/S) / Count(# times w/S is the center of a window)

                      with bridge/     with bridge/     in any
                      STRUCTURE1       DENTAL-DEV37     window
teeth                      1                10              300
suspension               200                 1             2000
the                     5500               180          500,000
dentist                    2                35              900
total occurrences       5651               194          501,500

Figure 10.11 The counts for the senses of bridge in a hypothetical corpus

Given the data in Figure 10.11, you get the following estimates:

PROBn(teeth | bridge/STRUCTURE1) = 1/5651 = 1.77 * 10^-4
PROBn(teeth | bridge/DENTAL-DEV37) = 10/194 = .052
PROBn(suspension | bridge/STRUCTURE1) = 200/5651 = .035
PROBn(suspension | bridge/DENTAL-DEV37) = 1/194 = 5.15 * 10^-3
PROBn(the | bridge/STRUCTURE1) = 5500/5651 = .97
PROBn(the | bridge/DENTAL-DEV37) = 180/194 = .93
PROBn(dentist | bridge/STRUCTURE1) = 2/5651 = 3.54 * 10^-4
PROBn(dentist | bridge/DENTAL-DEV37) = 35/194 = .18

The context-independent probabilities of the word senses are easily estimated:

PROB(bridge/STRUCTURE1) = 5651/501500 = .113
PROB(bridge/DENTAL-DEV37) = 194/501500 = 3.87 * 10^-4

These two probabilities give the likelihood of each sense without considering any context, that is, using a window of length 1. Note that the probability estimates for the senses in a window that contains just the word the are very similar to the no-context estimate:

PROBn(the | bridge/STRUCTURE1) * PROB(bridge/STRUCTURE1) = .97 * .113 = .109
PROBn(the | bridge/DENTAL-DEV37) * PROB(bridge/DENTAL-DEV37) = .93 * 3.87 * 10^-4 = 3.6 * 10^-4

This shows that the word the has little discrimination power between the two senses, which is what you would expect since the is so commonly associated with nouns. In general, function words are so common that they do not help in word sense disambiguation. It is content words, like teeth in this example, that have the most dramatic effect. In fact, the probability estimates in a window that contains just the word dentist reverse the preference:

PROBn(dentist | bridge/STRUCTURE1) * PROB(bridge/STRUCTURE1) = 3.54 * 10^-4 * .113 = 4 * 10^-5
PROBn(dentist | bridge/DENTAL-DEV37) * PROB(bridge/DENTAL-DEV37) = .18 * 3.87 * 10^-4 = 6.97 * 10^-5

Of course, with a larger window, there are many more chances for content words that strongly affect the decision. For instance, consider the sentence The dentist put a bridge on my teeth. All the words except for dentist and teeth are common words that would co-occur about equally frequently with both senses. Thus they would not affect the relative probabilities for each sense. The words dentist and teeth together in the same window, however, combine to strongly prefer the rare sense of the word bridge. In fact, the estimate for the sense DENTAL-DEV37 would be 3.6 * 10^-6, considerably greater than the estimate of 7.08 * 10^-7 for STRUCTURE1. These values could be used to estimate the actual probabilities by normalizing them. Specifically, if there are only these two senses for the word bridge, then the probability that it is used in its DENTAL-DEV37 sense in this window would be 3.6 * 10^-6 / (3.6 * 10^-6 + 7.08 * 10^-7) = .84.

° Collocations and Mutual Information

Much of the work in this area uses collocations, which measure how likely two words are to co-occur in a window of text.
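The scoring procedure can be summarized in a small sketch (in Python, with the hypothetical counts of Figure 10.11 hard-coded; the encoding is an assumption made only for illustration). Each sense is scored by its context-independent estimate multiplied by the window probabilities of the content words that were observed:

# A small sketch of the windowed scoring described above, using the
# hypothetical counts of Figure 10.11.

COUNTS = {   # word -> (count with STRUCTURE1, count with DENTAL-DEV37)
    "teeth": (1, 10), "suspension": (200, 1), "the": (5500, 180), "dentist": (2, 35),
}
SENSE_TOTALS = {"STRUCTURE1": 5651, "DENTAL-DEV37": 194}
TOTAL_WINDOWS = 501500

def score(sense, window_words):
    """PROB(w/S) times the product of PROBn(wi | w/S) over the window words."""
    idx = 0 if sense == "STRUCTURE1" else 1
    p = SENSE_TOTALS[sense] / TOTAL_WINDOWS            # context-independent estimate
    for wi in window_words:
        if wi in COUNTS:                               # ignore words with no counts
            p *= COUNTS[wi][idx] / SENSE_TOTALS[sense]
    return p

window = ["dentist", "teeth"]   # content words from "The dentist put a bridge on my teeth"
for s in SENSE_TOTALS:
    print(s, score(s, window))
# The DENTAL-DEV37 score comes out higher, selecting the rare sense in this context.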
One way to compute such a measure is to consider a correlation statistic (where n is the window size)

Cn(w/S, w') = PROB(w/S and w' are in the same window) / (PROB(w/S in the window) * PROB(w' in the window))

Note that this differs from the estimates used above only in the denominator. If K is the number of windows in the corpus, then each of the probabilities above could be estimated as Count(# times event occurs in a window) / K. After substituting such estimates in for each probability used in Cn(w/S, w'), and simplifying, you get the formula

Cn(w/S, w') = K * Count(# times w/S and w' co-occur in a window) / (Count(# times w/S in a window) * Count(# times w' in a window))

In our sample corpus K is 10^7. Based on the data in Figure 10.11, the estimates for Cn are as follows:

Cn(bridge/STRUCTURE1, teeth) = (10^7 * 1)/(5651 * 300) = 5.9
Cn(bridge/DENTAL-DEV37, teeth) = (10^7 * 10)/(194 * 300) = 171.9
Cn(bridge/STRUCTURE1, suspension) = (10^7 * 200)/(5651 * 2000) = 17.7
Cn(bridge/DENTAL-DEV37, suspension) = (10^7 * 1)/(194 * 2000) = 2.5
Cn(bridge/STRUCTURE1, the) = (10^7 * 5500)/(5651 * 500,000) = 1.94
Cn(bridge/DENTAL-DEV37, the) = (10^7 * 180)/(194 * 500,000) = 1.84
Cn(bridge/STRUCTURE1, dentist) = (10^7 * 2)/(5651 * 900) = 3.9
Cn(bridge/DENTAL-DEV37, dentist) = (10^7 * 35)/(194 * 900) = 200

If the ratio is close to one, then the words co-occur at a rate expected by chance. If the ratio is greater than one, then they occur together at a rate better than chance. Note that the correlation figures for the two senses with the word the are 1.94 and 1.84. This indicates that both co-occur with the at a rate slightly better than chance, but the yields no preference for one sense or the other.

To better distinguish statistics based on ratios, work in this area is often presented in terms of the log of the ratio, which makes the numbers negative if less than one. This more intuitively reflects the preference or dispreference. For word ratios as described in this section, this measure is called the mutual information of the two words and is written as In(w1, w2):

In(w1, w2) = log Cn(w1, w2)

For the example involving the two senses of bridge, the mutual information statistics are

I3(bridge/STRUCTURE1, teeth) = 1.77
I3(bridge/DENTAL-DEV37, teeth) = 5.14
I3(bridge/STRUCTURE1, the) = .66

Note that words that have no association with each other and co-occur together according to chance will have a mutual information number close to zero. If words are anticorrelated, that is, they co-occur together at a rate less than chance, then the mutual information number will be negative. If you compare different senses using mutual information statistics, you would add the scores together from different words in the window rather than multiplying, in order to account for the fact that they are log values.

10.5 Statistical Semantic Preferences

The first section of this chapter discussed selectional restrictions and their use in disambiguation. One problem with that approach was that the all-or-nothing nature of the restrictions made it impossible to encode semantic preferences. The word bridge, for example, was ambiguous between the structure and the dental appliance. In the sentence I painted the bridge, however, it seems clear that the structure sense is intended. But the selectional restriction for PAINTS1 would presumably be that the THEME must be a physical object, and both senses of bridge fall in that category and satisfy the restriction.
So the current model offers no help in selecting the more natural interpretation. Another problem not addressed yet is structural ambiguity, such as with PP attachment. This section develops a model of semantic preference that can help resolve these problems. The general idea is to collect statistics on the frequency of semantic roles occurring between senses, so that the most common semantic combinations can be preferred when interpreting a sentence.

We can start with the selectional restrictions developed in Section 10.1. There, restrictions were stated in terms of unary and binary relations between the discourse variables in the logical form. With a corpus annotated with semantic information, you could collect statistics on the frequency of each of these unary and binary relations. For the moment, let us assume that there is enough data to obtain reliable estimates for every relation. This assumption will be relaxed later in the section.

Consider an example using the sentence I painted the bridge, which has the following initial logical form:

(PAINTS1 p1 [AGENT (PRO i1 I1)] [THEME <THE b1 {STRUCTURE1 DENTAL-DEV37}>])

The unary and binary relationships are as follows:

(PAINTS1 p1), (I1 i1), ({STRUCTURE1 DENTAL-DEV37} b1), (AGENT p1 i1), (THEME p1 b1)

Let us estimate the semantic likelihood of a particular interpretation by the product of the likelihood of each of its subcomponents; that is, if a logical form LF consists of n relations R1, ..., Rn, then

PROB(LF) = Πi=1,n PROB(Ri)

Of course, this estimate is correct only if all the relations occur independently of each other; clearly, this is not always a good assumption, but the technique is still useful. To elaborate on this, a method for computing the probability of the binary relations needs to be developed. We can estimate the probability of some triple (reln head arg) by comparing the number of times it appears to how many times the head appears. The standard maximum likelihood estimation is a reasonable start:

PROB((reln head arg) | head) = Count((reln head arg)) / Count(head)

Consider applying this method. There are two possible readings of I painted the bridge, one for each sense of bridge. Since the rest of the relations are the same in each case, the difference in probability between the two interpretations will reduce to the difference between the probabilities of (THEME PAINTS1 STRUCTURE1) versus (THEME PAINTS1 DENTAL-DEV37). Given that painting bridges (across bodies of water) is a reasonably common activity and occurs in the database a few times, and that painting dental bridges is a very unlikely activity and may not occur in the corpus at all, the former would be preferred.

Of course, sentences that are structurally ambiguous may have significantly different semantic structures to be compared. The same techniques can be used, however, to compare these structures. Consider the sentences He saw the man with a telescope and He saw the man with a hat. The logical forms for these two sentences are shown in Figure 10.12.

1. He saw the man with a telescope.
1a. (SEES1 p1 [EXPERIENCER (PRO h1 HE1)] [THEME <...>] [INSTR <...>])
1b. (SEES1 p1 [EXPERIENCER (PRO h1 HE1)] [THEME <...>])

2. He saw the man with a hat.
2a. (SEES1 p1 [EXPERIENCER (PRO h1 HE1)] [THEME <...>] [INSTR <...>])
2b. (SEES1 p1 [EXPERIENCER (PRO h1 HE1)] [THEME <...>])

Figure 10.12 The logical forms for two ambiguous sentences

Even though the semantic structures are different, they still share many of the same binary relations.
For instance, sentences 1a and 1b share the triples (EXPERIENCER SEES1 MALE1) and (THEME SEES1 MALE-PERSON1). They differ only in that one contains (INSTR SEES1 TELESCOPE1), presumably a commonly occurring pattern, while the other contains (WITHi MAN1 TELESCOPE1), where WITHi ranges over all the semantic relations that can be indicated by with (for example, part-of, accompanied-by, wearing, and so on). Since this also probably occurs frequently, both interpretations remain plausible. With sentence 2, comparing the two LFs reduces to comparing the likelihood of the relation (INSTR SEES1 HAT1), presumably a very rare relation if it occurs at all, with (WITHi MAN1 HAT1), which would occur frequently when WITHi is the ACCOMPANY (accompanied by) relation, since men wear hats. Thus, in the second case, interpretation 2b is selected and the interpretation of with is disambiguated in the process. Even though the co-occurrence statistics were not able to favor one interpretation over the other in the first case, the probabilistic model does allow you to quantify the degree of certainty (or ambiguity) in the sentence. In addition, these techniques can be added on as a disambiguation method that is applied to each major constituent as soon as its logical form is constructed. The value returned can be used to contribute to an overall score of the constituent in a best-first parsing strategy.

The principal problem with this approach, however, is obtaining an accurate set of statistics. If you remember, this was a problem even when you were only concerned with the syntactic class of words. When dealing with word senses, the problems are much worse, as there are many different senses per word. As a result, techniques for estimating probabilities must be used, because acceptable triples that never occurred in the training corpus will be common. If there is a predefined abstraction hierarchy, then the hierarchy can be used to provide an estimate for unobserved triples. The basic idea is to collect data on triples not only for the exact senses involved, but also for abstractions of the senses. Consider the hierarchy in Figure 10.13, which also indicates how likely an object of a superclass is to fall into each of the subclasses. According to the figure, the probability of a SMALL-PHYSOBJ being a DENTAL-APPLIANCE, PROB(DENTAL-APPLIANCE | SMALL-PHYSOBJ), is .005.

Figure 10.13 A type hierarchy with probabilities

Given such a hierarchy, if we observe a triple (THEME PAINTS1 STRUCTURE1), we not only record that this triple occurred, but also record an entry for the abstractions of STRUCTURE1, such as (THEME PAINTS1 LARGE-PHYSOBJ) and (THEME PAINTS1 PHYSOBJ). Thus, after analyzing an extensive corpus involving 1000 instances of PAINTS1, you might obtain the data and probability estimates shown in Figure 10.14. There are, of course, many other triples not listed here that were observed. But these four suffice for the example.

Triple                              Number of Occurrences    PROB
(THEME PAINTS1 NON-LIVING)                  960              .96
(THEME PAINTS1 SMALL-PHYSOBJ)               300              .3
(THEME PAINTS1 LARGE-PHYSOBJ)               650              .65
(THEME PAINTS1 STRUCTURE1)                   10              .01

Figure 10.14 Statistics for some semantic relations

When a triple not previously observed is found, say (THEME PAINTS1 DENTAL-DEV37), we can use the information about the supertypes to produce an estimate.
Consider again the sentence I painted the bridge. Since the DENTAL-DEV37 sense of bridge was never seen in the training data, the estimate of the probability of the triple (THEME PAINTS1 DENTAL-DEV37) would be 0, which would indicate that such a reading is impossible, contrary to intuition. But we can estimate a probability for this triple by considering abstractions of DENTAL-DEV37. Since DENTAL-DEV37 is a subclass of DENTAL-APPLIANCE, possibly we have an estimate for (THEME PAINTS1 DENTAL-APPLIANCE). If not, consider abstracting again. Since a DENTAL-APPLIANCE is a SMALL-PHYSOBJ, then see if we have an estimate for (THEME PAINTS1 SMALL-PHYSOBJ). We have an estimate for this type of triple, which is .3:

PROB((THEME PAINTS1 SMALL-PHYSOBJ) | PAINTS1) = .3

Since DENTAL-DEV37 is a SMALL-PHYSOBJ, this estimate has some relevance. However, the .3 estimate is clearly too optimistic to use directly, given that the set DENTAL-DEV37 is only a small subset of SMALL-PHYSOBJ. We can take this into account by considering the probability that an arbitrary SMALL-PHYSOBJ is a DENTAL-DEV37. Using the data from Figure 10.13 we get

PROB(DENTAL-DEV37 | SMALL-PHYSOBJ) = PROB(DENTAL-DEV37 | DENTAL-APPLIANCE) * PROB(DENTAL-APPLIANCE | SMALL-PHYSOBJ) = .005 * .2 = .001

Thus one estimate for PROB((THEME PAINTS1 DENTAL-DEV37) | PAINTS1) might be

PROB((THEME PAINTS1 SMALL-PHYSOBJ) | PAINTS1) * PROB(DENTAL-DEV37 | SMALL-PHYSOBJ) = .0003

This is just one example of using smoothing techniques to construct an estimate. You might find other techniques that produce better estimates by introducing additional multiplicative factors or combining other evidence.

10.6 Combining Approaches to Disambiguation

This chapter has considered two different approaches to resolving ambiguity of word senses and semantic relations. The approach based on selectional restrictions assumes a set of hard constraints on what senses of words are allowed as arguments to other senses. These are based on semantic considerations of the meanings of each sense. This approach can then be generalized so that the constraints are indicated as preferences, and the interpretation that violates the fewest constraints is the best interpretation. The other kind of approach is based on statistical analysis of corpora. Preference is given to combinations of senses that occur most frequently in the data. In a purely statistical approach, the decision has no inherent connection to the underlying meaning of the sentences. Attachment decisions are made based on what patterns of lexical items occur most frequently in what structures. As such, they seem to be quite different approaches, and many researchers seem to assume that this is an either-or choice. In fact, the two can be viewed as complementary approaches, and we would expect an approach that uses both types of techniques to perform better than an approach based on only one technique.

Section 10.5 described a technique that uses both approaches. It used statistical measures from a corpus, but the word senses were organized into a type hierarchy that captures their semantic relationships. This information then allowed statistics to be collected on semantically abstract classes that then could be used to provide estimates for triples that had not been observed before. Thus the semantic knowledge of the type hierarchy enables an interesting approach to handling sparse data.
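As a rough sketch of this sparse-data technique (in Python; the dictionary encoding is an assumption, and only the fragment of Figures 10.13 and 10.14 quoted in the text is included), an unseen triple is estimated by backing off to the most specific abstraction that was observed, discounted by the probability of the subclass:

# A rough sketch of the abstraction-based estimate for unseen triples,
# using the fragment of Figures 10.13 and 10.14 quoted in the text.

SUPERTYPE = {"DENTAL-DEV37": "DENTAL-APPLIANCE",
             "DENTAL-APPLIANCE": "SMALL-PHYSOBJ",
             "SMALL-PHYSOBJ": "NON-LIVING",
             "STRUCTURE1": "LARGE-PHYSOBJ",
             "LARGE-PHYSOBJ": "NON-LIVING"}

SUBCLASS_PROB = {("DENTAL-DEV37", "DENTAL-APPLIANCE"): .005,
                 ("DENTAL-APPLIANCE", "SMALL-PHYSOBJ"): .2}

TRIPLE_PROB = {("THEME", "PAINTS1", "SMALL-PHYSOBJ"): .3,
               ("THEME", "PAINTS1", "LARGE-PHYSOBJ"): .65,
               ("THEME", "PAINTS1", "NON-LIVING"): .96,
               ("THEME", "PAINTS1", "STRUCTURE1"): .01}

def estimate(reln, head, arg):
    """Back off from arg to its abstractions until an observed triple is found,
    discounting by PROB(subclass | superclass) along the way."""
    discount, cls = 1.0, arg
    while (reln, head, cls) not in TRIPLE_PROB:
        parent = SUPERTYPE.get(cls)
        if parent is None:
            return 0.0                       # ran off the top of the hierarchy
        discount *= SUBCLASS_PROB.get((cls, parent), 1.0)
        cls = parent
    return TRIPLE_PROB[(reln, head, cls)] * discount

print(round(estimate("THEME", "PAINTS1", "DENTAL-DEV37"), 6))  # .3 * .001 = .0003
print(estimate("THEME", "PAINTS1", "STRUCTURE1"))               # observed directly: .01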
You might expect that, given enough data, the statistical approach would eventually be able to capture all the information present in the semantically based approach; that is, that the common patterns of usage would generally reveal the semantic relations between words. If this is true, you could then build the equivalent of a type hierarchy based on statistical data alone. One issue that makes this unlikely is the problem of new words (or words new to a particular person). New words are constantly being encountered. If you are given a semantic definition of a word, then you can immediately use the word in appropriate circumstances, even though you have never encountered the word in that context before. As a simple example, say you learn that a preef is a type of sailboat. You can then understand the intended interpretation of the sentence We sailed the lake in my preef yesterday, namely that in my preef modifies the verb phrase rather than the lake. In contrast, if preef had been defined as an estate or other large area of land, then the reverse attachment would be preferred. These decisions are made in virtue of the newly acquired semantic knowledge of the word preef. A purely statistical approach would not be able to make a reasonable decision, as the PP in my preef has never been encountered before and there have not been enough observations of preef to associate it with a more abstract class, such as boats or estates, where you might have enough data.

Of course, once you assume the existence of a semantic hierarchy to structure your statistics, as in Section 10.5, statistical methods can be used effectively to determine the appropriate attachment. For example, consider again the word preef defined as a kind of boat. In this case you would probably compare the likelihood of a boat being the location of a sailing activity versus the likelihood that a lake is in a boat. But this just supports the point that effective techniques will combine both methods. This combined method could be generalized further by augmenting it with the techniques for PP attachment described in Chapter 7. A simple way would be to assume that the two approaches yield independent results and simply consider the product of the two estimates as the final estimate. Clearly, other ways of combining the results could be developed as well that attempt to exploit some of the similarities between the methods.

Summary

Ambiguity resolution is a crucial problem facing computational systems. This chapter explored several different methods for handling ambiguity. Disambiguation is typically driven by a hierarchical representation of word senses, often captured in semantic networks. Using these hierarchies, you can define selectional restrictions and use these constraints to reduce the possible word senses. In addition, statistical methods can be used to encode preferences for individual word senses, given the surrounding words, and for particular interpretations, based on the relative frequencies of semantic relations.

Related Work and Further Readings

Word sense disambiguation has been recognized as one of the most difficult problems facing computational models of language. As early as 1960, Bar-Hillel (1960) argued that it was the major stumbling block to creating natural language applications such as machine translation.
Selectional restrictions were proposed in linguistics by Katz and Fodor (1963), and variations on using semantic features have been used in virtually every computational model since. One of the earliest proponents of semantic preferences was Wilks (1975), who created a system that used semantic templates to interpret a representation of the sentence in which the basic phrase structure had been analyzed. This system used heuristic techniques based on minimizing the number of semantic constraint violations. Systems that have concentrated on the ambiguity problem have usually combined a range of techniques in addition to selectional restrictions. Most of these map directly to the underlying representation rather than to a logical form, but the restrictions specified in the knowledge representation act similarly to selectional restrictions. Hayes (1977) and Hirst (1987) used a combination of selectional restrictions and semantic closeness measures computed from a semantic network of word senses. The BORIS system (Dyer, 1983) used a combination of top-down expectations and selectional restrictions to understand complex text. Constraint satisfaction was first introduced in AI for interpreting visual scenes by Waltz (1975), and the formal underpinnings of the approach were outlined in detail by Freuder (1982) and Mackworth (1977). Mellish (1985) used constraint satisfaction techniques both to disambiguate word senses and to identify the referents of noun phrases.

Semantic networks for natural language were introduced by Quillian (1968) to represent word meanings and conceptual associations between words. The development of semantic networks sparked much work in network representations (for example, Woods (1975), Findler (1979), Sowa (1984; 1991)). Due to the attractiveness of the representation, semantic networks became general frameworks for representing knowledge and supporting general inference. The system presented here, however, is closer to its origins: the intention is to capture the conceptual associations between words. As a result, we have not tried to define any sort of general inference procedure using these networks, or even stated how to express specific facts about the world. This will be discussed in Chapter 13. A recent effort at building a comprehensive lexical database for English is WordNet (Miller, 1990), available by ftp from Princeton University.

Only very recently have statistical models been proposed to handle the ambiguity problem. Work on collocations was primarily an area of interest for lexicography. The concept of mutual information is developed in information theory (Fano, 1961), and has been used successfully in word sense disambiguation. A good example of current work is Yarowsky (1992). A good source of recent work in the use of corpora is the special issues of Computational Linguistics, Volume 19, Issues 1 and 2, 1993. Another approach to disambiguation is to use spreading activation methods in semantic networks, in which word senses that are related activate each other. An example of this approach can be found in Hirst (1987). Related to this are connectionist models in which the entire computational process is described in terms of the interactions in neural networks. Examples of this approach are Cottrell and Small (1983) and Pollack and Waltz (1985). A good introduction to distributed connectionist models is Rumelhart and McClelland (1986). There is a large literature on the attachment problems.
Ford, Bresnan, and Kaplan (1982) suggest that lexical preferences on the verb determine likely attachment decisions. Crain and Steedman (1985) suggest filtering interpretations based on whether the noun phrases proposed make sense in the domain. Hirst (1987) describes a system that uses both lexical preferences and semantic plausibility. Jensen and Binot (1987) use an on-line dictionary to find common patterns of prepositional phrase attachment. Dahlgren (1988) develops a set of preference rules based on the different prepositions and the semantic class of the object of the preposition. Recent work using statistical techniques similar to those described in Section 10.5 can be found in Grishman and Sterling (1992). Hindle and Rooth (1993) show how to collect statistics about PP attachment from unannotated text. Resnik (1993) has developed techniques for exploiting WordNet to gather statistics to apply to various forms of syntactic ambiguity.

Exercises for Chapter 10

1. (easy) List all the selectional restrictions for the verb sense PUTS1 given the hierarchy in Figure 10.9.

2. (easy) Using the Cn function described in Section 10.4, compute the score of each of the senses of the word bridge in the five-word window the suspension bridge the construction.

3. (easy) Using the technique for estimating the probability of triples in Section 10.5 and the data given in Figures 10.13 and 10.14, estimate the probability of the triple (THEME PAINTS1 HOUSE1), assuming it has not been seen before.

4. (medium) Implement a program to calculate the mutual information statistic between the words happy and dog, using a window of three words, where windows are allowed to cut across sentences, given the following minuscule corpus:

I saw a happy dog by the river.
I was never happy before I met my dog.
It was sad that the dog left.
A happy cat is a well-fed cat.
Cows sit in pastures all day and eat grass.
My dog is happy most of the time.

Do these words occur together at a rate better than chance? Be sure to describe any assumptions you make in calculating your probability estimates and the mutual information value.

5. (medium) Define a data structure to represent simple type networks with case value restrictions. Write a program that, given such a hierarchy, computes the complete set of case value restrictions for any given type in the hierarchy. Test this on the network shown in Figure 10.9, and retrieve the case value restrictions for PUTS1. Make sure your data structures and algorithms are well described in your documentation.

6. (medium) Implement a system that maintains a type hierarchy and allows you to test whether two types are compatible. In particular, your system should allow type information to be added with the following function:

(Add-Subtype T1 T2), which asserts that T1 is a proper subtype of T2

This information can then be used in the function

(Test-Intersection T1 T2)

which returns nil if the two types are not known to intersect and returns the name of the intersection if the types do intersect. In particular, if T1 is a subtype of T2, then T1 would be returned; if T3 is a subtype of both T1 and T2, then T3 is assumed to be the intersection and T3 is returned. Document your data structures and demonstrate the range of input and output that your system can handle.
7. (medium) Extend the grammar, lexicon, sense hierarchy, and selectional restrictions given in Section 10.2 as necessary to appropriately interpret the following sentences:

He gave the book to the college.
He knows the route to the college.

Describe why the erroneous interpretations would not be produced by the bottom-up chart parser using your data.

8. (medium) The technique for disambiguation in Section 10.5 was based only on the probability of the binary relations. How might you extend this to account for unary relations as well? Describe your algorithm in detail, and show how it would operate given the sentence He painted the suspension bridge at night. You may assume that all the words except for bridge are unambiguous. Make up any data that you need for your technique that is not present in Sections 10.4 or 10.5.

9. (hard) Using your solutions to Questions 5 and 6 as a component, implement the constraint satisfaction algorithm described in Section 10.1, and test it on a range of selectional restrictions generated from ambiguous sentences, including

The dishwasher read the article.
He painted the bridge.

Test your program using a type hierarchy that includes the fragments in Figures 10.1 and 10.2, the verb hierarchy in Figure 10.9, and extensions necessary to include the four different senses of the noun bridge, as well as the senses of the other words used in your test sentences. Construct two other sentences with ambiguous words that demonstrate features of your solution not illustrated by the above two sentences. As always, your program should be fully documented and tested. You do not need to start with the full logical forms for the sentences. Rather, your input may be the unary and binary type relations over the discourse variables. Thus the input for The dishwasher read the article would be the list of unary relations

(((DISHWASH1 DISHWASH2) d1) ((READS1 READS2) r1) ((ARTICLE1 ARTICLE2) a1))

and the binary relations

((AGENT r1 d1) (THEME r1 a1))

CHAPTER 11
Other Strategies for Semantic Interpretation

11.1 Grammatical Relations
11.2 Semantic Grammars
11.3 Template Matching
11.4 Semantically Driven Parsing Techniques

The previous chapters described methods of semantic interpretation in which the syntactic structure drove semantic interpretation in a rule-by-rule fashion. There are many other ways to organize a system that have other advantages. Some techniques allow for the relatively rapid development of systems for a specific application, whereas others provide interesting approaches to producing more robust systems that do not fail when faced with ungrammaticality. This chapter considers some examples of other ways of organizing semantic interpretation that are found in the literature. These techniques range from loosely coupled syntax and semantics to techniques that are essentially semantically driven and use minimal syntactic information.

The syntax-driven rule-by-rule approach to semantic interpretation discussed in Chapter 9 is attractive on theoretical grounds. It assumes a close linkage between syntax and semantics and allows both to be used simultaneously to build an interpretation of a sentence. It also is the approach most commonly used by researchers interested in formal semantics, because insights from linguistics can be readily incorporated into computational systems. But if you surveyed all the existing computational systems, you would find a wide range of other approaches used, depending on the goals of the project.
Is the project focusing on long-term research aimed at a comprehensive grammar for a language, or is it focusing on the shorter-term goal of producing a system that will work for a specific application? Generally, you will find that long-term research efforts tend to be based on rule-by-rule syntax-driven approaches, whereas shorter-term applied efforts use other techniques. The practical developer's approach is guided by two important concerns:

• Building a syntax-driven rule-by-rule system is complex, and many theoretical issues arise that will slow the development of the system.

• Purely syntax-driven systems have problems dealing with real language, which is often ungrammatical due to errors, metaphor, and many other phenomena. As a result, a syntax-driven system may not produce any analysis at all for a sentence when it cannot find a parse that covers the entire sentence.

There are many different ways to deal with these problems. One of the most radical is simply to throw out syntax altogether and build a system that produces a semantic interpretation directly from the sentence. The other end of the spectrum would retain the syntax-driven approach but generalize the syntactic processing so that it handles ungrammaticality more robustly. Midway positions include approaches that do a partial syntactic analysis to produce a structured input for subsequent semantic interpretation, and approaches that generalize the notion of a grammar so that it parses based on semantic structure rather than syntactic structure. This chapter gives examples of each of these positions. The presentation starts with the approaches that retain syntactic processing and then moves to approaches that use less and less syntax. Section 11.1 describes approaches in which the parser produces a representation that is an abstracted syntactic structure that is then used as input to semantic interpretation. Section 11.2 describes semantic grammars, a method of writing grammars in terms of semantic rather than syntactic concepts. Section 11.3 describes methods that use a partial syntactic analysis and then use a template-driven semantic analysis. Section 11.4 describes an approach that directly semantically interprets the sentences.

11.1 Grammatical Relations

The idea underlying this approach is that the parser produces an output that abstracts away the details of the actual sentence but retains the structure important for semantics as a set of grammatical relations or grammatical dependencies. The semantic interpreter then produces a meaning representation as a separate interpretation process that uses the grammatical relations as its input. You have already seen a hint of this type of analysis in the descriptions of ATN-based grammars in earlier chapters.

Jack bought a ticket.
    (s1 PRED BUYS1)
    (s1 TNS PAST)
    (s1 LSUBJ (NAME j2 "Jack"))
    (s1 LOBJ <...>)

A ticket was bought by Jill.
    (s2 PRED BUYS1)
    (s2 TNS PAST)
    (s2 LSUBJ (NAME j1 "Jill"))
    (s2 LOBJ <...>)

Jill gave Jack a book.
    (s3 PRED GIVES1)
    (s3 TNS PAST)
    (s3 LSUBJ (NAME j1 "Jill"))
    (s3 LOBJ <...>)
    (s3 IOBJ (NAME j2 "Jack"))

Jill gave a book to Jack.
    (s4 PRED GIVES1)
    (s4 TNS PAST)
    (s4 LSUBJ (NAME j1 "Jill"))
    (s4 LOBJ <...>)
    (s4 TO (NAME j2 "Jack"))

Jill thinks that Jack stole the book.
    (s5 PRED THINKS1)
    (s5 LSUBJ (NAME j1 "Jill"))
    (s5 LOBJ s6)
    (s6 PRED STEALS1)
    (s6 TNS PAST)
    (s6 LSUBJ (NAME j2 "Jack"))
    (s6 LOBJ <...>)

Figure 11.1 A representation based on grammatical relations
The grammatical relations are often relations like logical subject (LSUBJ), logical object (LOBJ), indirect object (IOBJ), and relations based on prepositional phrases. Figure 11.1 shows some simple sentences and their representations in terms of grammatical relations. Each relation is of the form (discourse-variable relation value), where the value may be another discourse variable or a SEM structure. Note that the active/passive distinction is removed in the grammatical relation representation because the LSUBJ is always set to the actor of the action, and the LOBJ is the object acted upon, whether it appears as the subject (in passive sentences) or the object (in active sentences).

It is not difficult to augment a context-free grammar to generate this type of representation as it parses. One method is to represent each of the grammatical relations as a feature. The resulting feature structure is easily converted into triples. The features

(PRED BUYS1 LSUBJ (NAME j2 "Jack") LOBJ <...>)

are easily converted into the triples

(s1 PRED BUYS1)
(s1 LSUBJ (NAME j2 "Jack"))
(s1 LOBJ <...>)

The semantic interpreter then may be a separate process that takes the representation of grammatical dependencies and produces a semantic representation. Let's assume that it produces a case-role-based logical form as used in earlier chapters. One common technique for building the representation uses patterns that relate the grammatical roles to the case roles used in the representation. Each verb may define its own mapping. Because the grammatical roles already capture part of the semantic structure, the mapping is often straightforward. For instance, Figure 11.2 shows the patterns needed for the verbs used in Figure 11.1, using the verb hierarchy shown in Figure 11.3. The rules consist of a pattern in which <T> matches any element of type T. The right side specifies the semantic interpretation, where the number n indicates the value of the n'th element in the pattern. For example, rule 2 contains the pattern (<VERB> AT <LOC>).

Pattern                      Logical Form
1. (<...> PRED <...>)        -> (3 1)
2. (<VERB> AT <LOC>)         -> (AT-LOC 1 3)
3. (<...> LSUBJ <...>)       -> (AGENT 1 3)
4. (<...> LOBJ <...>)        -> (THEME 1 3)
5. (<...> TO <...>)          -> (TO-POSS 1 3)
6. (<...> IOBJ <...>)        -> (TO-POSS 1 3)
7. (<...> LSUBJ <...>)       -> (EXPERIENCER 1 3)
8. (<...> LOBJ <...>)        -> (THEME 1 3)
9. (<...> TNS PAST)          -> (PAST 1)

Figure 11.2 Some patterns for interpreting grammatical relations

Applying these rules to the triples for Jack bought a ticket in Figure 11.1, rule 1 matches the PRED triple, rule 3 matches the LSUBJ triple, and rule 4 matches the LOBJ triple, producing the three logical form fragments

(BUYS1 s1)
(AGENT s1 (NAME j2 "Jack"))
(THEME s1 <...>)

These three fragments conjoined give the full logical form, which in the abbreviated format used throughout this text would be written as

(BUYS1 s1 [AGENT (NAME j2 "Jack")] [THEME <...>])

Similarly, the sentence Jill thinks that Jack stole the book would be parsed to the following dependencies, shown with the semantic translation generated by the patterns:

(s5 PRED THINKS1)              -> (THINKS1 s5)
(s5 LSUBJ (NAME j1 "Jill"))    -> (EXPERIENCER s5 (NAME j1 "Jill"))
(s5 LOBJ s6)                   -> (THEME s5 s6)
(s6 PRED STEALS1)              -> (STEALS1 s6)
(s6 TNS PAST)                  -> (PAST s6)
(s6 LSUBJ (NAME j2 "Jack"))    -> (AGENT s6 (NAME j2 "Jack"))
(s6 LOBJ <...>)                -> (THEME s6 <...>)

Merging these semantic translations would produce the following logical form in abbreviated form (ignoring tense):

(THINKS1 s5 [EXPERIENCER (NAME j1 "Jill")]
            [THEME (STEALS1 s6 [AGENT (NAME j2 "Jack")] [THEME <...>])])

As presented here, the syntactic and semantic processes are separate and run in sequence.
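A small sketch may help make the mapping step concrete. The following fragment (in Python; the encoding is an illustrative assumption, and only a few of the Figure 11.2 mappings are included, without the verb-class tests that would distinguish, say, AGENT from EXPERIENCER) rewrites grammatical-relation triples into logical form fragments:

# A small sketch of mapping grammatical-relation triples to case-role
# fragments.  The encoding is an illustrative assumption; a fuller version
# would also consult the verb hierarchy when choosing the case role.

def interpret(triple):
    var, reln, value = triple
    if reln == "PRED":
        return (value, var)               # rule 1: (s1 PRED BUYS1) -> (BUYS1 s1)
    if reln == "LSUBJ":
        return ("AGENT", var, value)      # rule 3
    if reln == "LOBJ":
        return ("THEME", var, value)      # rule 4
    if reln in ("TO", "IOBJ"):
        return ("TO-POSS", var, value)    # rules 5 and 6
    if reln == "TNS" and value == "PAST":
        return ("PAST", var)              # rule 9
    return None                           # no interpretation: flag as anomalous

triples = [("s1", "PRED", "BUYS1"),
           ("s1", "TNS", "PAST"),
           ("s1", "LSUBJ", '(NAME j2 "Jack")'),
           ("s1", "LOBJ", "<a ticket>")]   # <a ticket> stands in for the NP's SEM

for t in triples:
    print("(" + " ".join(interpret(t)) + ")")
# prints (BUYS1 s1), (PAST s1), (AGENT s1 (NAME j2 "Jack")), (THEME s1 <a ticket>)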
The assumption that the two processes run in sequence makes the presentation easier but is not necessary for the approach. It is possible to have the syntactic and semantic analysis occur concurrently as follows. Each time the parser builds a grammatical relation, it is passed to the semantic interpreter, which can then process it. If the grammatical relation has no semantic interpretation, this information is passed back to the parser, which then discards (or downgrades) the syntactic analysis that produced the relation, as it is semantically anomalous (or unlikely). Thus you can do semantic filtering on the constituents as they are produced. This technique provides the advantages of a close coupling of syntax and semantics, while retaining separate modules for each.

This section has ignored the issue of how noun phrases are interpreted. In many systems, a special process is used to interpret noun phrases (sometimes identifying the referent in context), and the result of this process is inserted in the dependency relations. It is also possible to extend the general approach to noun phrases, introducing dependency relations between adjectives and the nouns they modify, and so on. The semantic interpreter then must be extended so that it can perform semantic interpretation recursively on subconstituents and then use the results to construct the logical form for the current constituent.

Approaches based on grammatical relations are attractive because the representation provides a convenient interface between the syntactic processing and complex semantic interpretation procedures, allowing them to operate independently. This allows the mapping rules to be considerably more complex than can be easily specified in a rule-by-rule approach. The mapping process could, for instance, involve extensive type checking, inference, and discourse processing in order to disambiguate the intended meaning of the sentence. Of course, the grammatical relations themselves can still be constructed compositionally by the parser using a rule-by-rule approach. One way to view this approach is that the representation based on grammatical relations is an alternative to the logical form used here, and the mapping rules are performing contextual interpretation into the final meaning representation language.

11.2 Semantic Grammars

When building a system for a particular application, there are often techniques that can be used to improve the efficiency and performance of the parsing and semantic interpretation. These techniques take advantage of the predetermined context of the application in order to reduce or eliminate much of the processing required to handle ambiguity in general. This section describes a technique for building a custom-tailored grammar for the application.

A general grammar of English will contain many constructs that are necessary for wide coverage of the language but may not be needed in the application at hand. Certain constructs may appear only within a specific semantic context. In these circumstances the general syntactic rule might be replaced in the grammar with a more specific semantically motivated rule. Consider an application that supports queries to an airline database about flights. The following noun phrases referring to flights occur in this domain:

the flight to Chicago
the 8 o'clock flight
the first flight out
flight 457 to Chicago

To handle these noun phrases, a general grammar must contain the following rules. (Examples appear in parentheses.)
NP -> DET CNP                                (the flight)
CNP -> N                                     (flight)
CNP -> CNP PP                                (flight to Chicago)
CNP -> N PART                                (flight out)
CNP -> PRE-MOD CNP                           (8 o'clock flight)
NP -> N NUMB                                 (flight 457)

Now, for cities in this domain, we find the following types of noun phrases:

Chicago
the nearest city to Dallas

These phrases can be handled by the general grammar we just created, with the addition of one more rule to handle proper names. The problem with this is that now you have to restrict rules to apply to the appropriate categories. For example, the following noun phrases would not occur in this domain (or possibly in any other), even though the rules would suggest that they are legal:

*the city to Chicago
*the 8 o'clock city
*the first city out
*city 567

To handle such cases in a general grammar, you would need to add selectional restrictions and features on the words to restrict the possible syntactic structures and modifiers. In a limited domain, however, it is often simpler to introduce new specialized lexical categories based on their semantic properties, such as FLIGHT-N (that is, those nouns with senses that indicate flights). With such lexical categories, the general grammar might be rewritten as follows. (Examples are given in parentheses.)

FLIGHT-NP -> DET FLIGHT-CNP                  (the flight)
FLIGHT-CNP -> FLIGHT-N                       (flight)
FLIGHT-CNP -> FLIGHT-CNP FLIGHT-DEST         (flight to Chicago)
FLIGHT-CNP -> FLIGHT-CNP FLIGHT-SOURCE       (flight from Boston)
FLIGHT-CNP -> FLIGHT-N FLIGHT-PART           (flight out)
FLIGHT-CNP -> FLIGHT-PRE-MOD FLIGHT-CNP      (8 o'clock flight)
FLIGHT-NP -> FLIGHT-N NUMB                   (flight 457)
CITY-NP -> CITY-NAME                         (Boston)
CITY-NP -> DET CITY-CNP                      (the city)
CITY-CNP -> CITY-N                           (city)
CITY-CNP -> CITY-MOD CITY-CNP CITY-MOD-ARG   (nearest city to Dallas)

Of course, many other rules are needed, but these use the semantic categories as well. For instance, the FLIGHT-DEST category allows prepositional phrases that specify the destination cities of flights:

FLIGHT-DEST -> to CITY-NP
FLIGHT-DEST -> for CITY-NP

Higher-level syntactic structures can be similarly tailored to these categories, such as the rule

TIME-QUERY -> When does FLIGHT-NP FLIGHT-VP

A grammar that is cast in terms of the major semantic categories of the domain is called a semantic grammar. Clearly, there is no precise line between a syntactic grammar and a semantic grammar; rather, there is a continuum between the two extremes. While semantic grammars are considerably larger than their syntactic counterparts, it is generally simpler to define the rules because of the limited context, and complex features are not needed to sort out which forms can go with which semantic categories.

You can augment a semantic grammar to produce a logical form in the normal way. Typically, the interpretation rules will be straightforward because you know most of the semantic information needed directly from the structure of the rule. An alternative method, however, is simply to use the parse tree itself as the logical form. Since it contains the semantic information already, there is little advantage in converting to another notation. For instance, Figure 11.4 shows the parse tree for the query When does the flight to Chicago leave?

Figure 11.4 The parse tree for When does the flight to Chicago leave?
In LISP form this tree might be represented as

(TIME-QUERY
  (FLIGHT-NP (DET the)
             (FLIGHT-CNP (FLIGHT-CNP flight)
                         (FLIGHT-DEST to (CITY-NP Chicago))))
  (FLIGHT-EVENT leave))

This structure would be at least as easy to convert into a database query as the full logical form for the sentence. So there is no advantage to further semantic analysis. Semantic grammars combine aspects of syntax, semantics, and selectional restrictions in a simple uniform framework. As a result, semantic grammars have proven useful for the rapid development of parsers in limited application domains. The down side, however, is that they do not port well to new domains and cannot handle applications in broad domains. Generally, a new domain will require a completely new semantic grammar, whereas most of a syntactic grammar for one domain will apply to another domain.

11.3 Template Matching

In certain applications, the task is specific enough that special-purpose techniques that exploit the domain structure can be used. For example, you might be interested in a system that can create summaries of certain stock transactions described in newspaper stories. The information required might fall into a fixed format, such as who bought what from whom and for what price. Since the system is not required to understand anything in the text except when it concerns this specific information, techniques that extract information based on local information can be used. Such techniques, while limited to the specific task, often produce a more successful system than one based on general-purpose techniques, where each sentence would have to be completely parsed and interpreted before information can be extracted. This section explores some of these techniques.

The basic idea with such limited domain systems is that you can specify simple patterns that indicate key pieces of information in the domain. These patterns reliably indicate information that can be used to fill in templates that represent the task. In the stock transaction domain, for example, the preposition to usually indicates the recipient of the stock. In a different application you would find a different interpretation. In an airline schedule query domain, for example, the preposition to usually indicates the destination of the flight.

The task is typically defined as follows: The domain is specified as a set of templates, one for each possible class of input. In a domain where the system must generate summaries of articles reporting terrorist attacks in South America, the information desired might be summarized in a template such as that shown in Figure 11.5.

TERRORIST INCIDENT
  DATE                    date
  LOCATION                city/state/country
  TYPE                    e.g., bombing
  STAGE OF EXECUTION      e.g., accomplished, planned
  INSTRUMENT              e.g., bomb, gun
  PERPETRATOR NAME        e.g., "FMLN"
  PHYSICAL TARGET         e.g., car, house
  HUMAN TARGET            e.g., president
  NATIONALITY TARGET      e.g., San Salvador
  EFFECT                  e.g., no injury

Figure 11.5 A simplified template for summarizing terrorist attacks

The basic idea is to define patterns on the input that identify the different slots in the template. For instance, we might have a pattern such as

take <HUMAN> hostage -> (TERRORIST-INCIDENT HUMAN-TARGET 1)

which indicates that any sentence containing the sequence consisting of the verb take, followed by a phrase that describes a human, followed by the word hostage, supplies a value for the HUMAN-TARGET slot in a TERRORIST-INCIDENT template.
To make this approach viable, the input must be parsed at least to the extent of identifying noun phrases (for example, a noun phrase that could match <HUMAN> in our pattern) and producing canonical forms for words (so that all forms of take would match in our pattern). The partial parsing techniques described in Chapter 6 can be used here to good effect. Of course, other patterns may match other parts of the input and suggest other values for slots in the templates. Once all the possible patterns have been matched, the final stage of the analysis would merge the partial templates. The rest of this section explores each component of this approach in more detail.

Consider first the partial parsing. If we are introducing a parser to find a partial parse of the sentence, why not use a full parser and obtain as much information as possible? There are several reasons why this might not be a good idea in these applications. First, a full parser would be computationally very expensive, especially in the text-understanding task where sentences are typically 30 words or longer, and the system needs to process millions of words a day. Second, it is not currently possible to construct a complete grammar for realistic domains, so many sentences will be unparsable. In addition, even if the sentence is parsable, it will likely be ambiguous between many different interpretations. The partial parser takes advantage of the fact that certain parts of sentences can be parsed fairly reliably. Thus the grammar only handles these parts, and the rest of the sentence remains unanalyzed until the pattern-matching stage. As mentioned in Chapter 6, two major structures can be reasonably detected and parsed: the verb group, which consists of those words starting with the first auxiliary verb to the head verb, including adverbial modifiers; and the noun group, which consists of the beginning of noun phrases from the initial word up to the head word, including prenominal modifiers but not noun complements. In addition, the parser can reliably identify function words such as prepositions, particles, and conjunctions, and a few other important classes such as proper names, which are often indicated by capitalization. To support pattern matching based on semantic categories, the parser needs to compute the semantic type of the phrases it produces. These are typically taken from features on the head word. Grammars 11.6 and 11.7 give simple grammars for the verb groups and noun groups that compute a feature TYPE that is used in the pattern matching. Figure 11.8 gives a small lexicon for use in the examples.

Given these grammars, the partial parsing algorithm described in Chapter 6 could find all possible noun groups and verb groups, filtering out NG and VG constituents that are completely enclosed within a larger constituent of the same type. Words that are unknown or otherwise unanalyzable are ignored. Figure 11.9 shows all the constituents produced for the sentence Guerrillas attacked Merino's home in San Salvador five days ago with explosives.

The next stage of analysis uses the semantic patterns on the chart. The simplest form of pattern consists of a sequence of features, typically the TYPE feature, that may match anywhere in the input. If it matches, it produces a fragment of a template.
1. (VG TYPE ?typ LEX ?l) -> (VG1 VFORM {pres past} TYPE ?typ LEX ?l)
2. (VG TYPE ?typ LEX (?l1 ?l2)) -> (MODAL LEX ?l1) (VG1 VFORM base TYPE ?typ LEX ?l2)
3. (VG1 TYPE ?typ LEX (?l1 ?l2)) -> (ADV LEX ?l1) (VG1 TYPE ?typ LEX ?l2)
4. (VG1 TYPE ?typ LEX (?l1 ?l2)) -> (AUX ROOT have LEX ?l1) (VG1 VFORM pastprt TYPE ?typ LEX ?l2)
5. (VG1 TYPE ?typ LEX (?l1 ?l2)) -> (AUX ROOT be LEX ?l1) (VG1 VFORM {pastprt ing} TYPE ?typ LEX ?l2)
6. (VG1 TYPE ?typ LEX ?l) -> (V TYPE ?typ LEX ?l)

Head features for VG, VG1: VFORM

Grammar 11.6 A simple grammar for verb groups

7. (NG LEX ?l) -> (PRO LEX ?l)
8. (NG LEX ?l) -> (N TYPE {TIME LOC} LEX ?l)
9. (NG LEX ?l) -> (N AGR 3p LEX ?l)
10. (NG LEX (?l1 ?l2)) -> (DETP AGR ?a LEX ?l1) (N AGR ?a LEX ?l2)
11. (NG LEX (?l1 ?l2 ?l3)) -> (DETP AGR ?a LEX ?l1) (ADJP LEX ?l2) (N AGR ?a LEX ?l3)
12. (NG LEX ?l) -> (NAMEP LEX ?l)
13. (NG LEX ?l) -> (DATEP LEX ?l)
14. (DETP LEX ?l) -> (ART LEX ?l)
15. (DETP LEX (?l 's)) -> (NG LEX ?l) 's
16. ADJP -> ADJ
17. ADJP -> (V VFORM pastprt SUBCAT _np)
18. (NAMEP LEX ?l) -> (NAME LEX ?l)
19. (NAMEP LEX (?l1 ?l2)) -> (N TITLE + LEX ?l1) (NAMEP LEX ?l2)
20. (DATEP LEX (BEFORE-NOW ?l1 ?l2)) -> (NUMB LEX ?l1) (DATEUNIT LEX ?l2) AGO

Head features for NG: TYPE, LEX
Head features for NAMEP: TYPE

Grammar 11.7 A simple grammar for noun groups and names

ago           (AGO)
attacked      (V VFORM past TYPE attack)
days          (DATEUNIT LEX days)
explosives    (N AGR 3p TYPE WEAPON LEX explosives)
five          (NUMB LEX 5)
Guerrillas    (N AGR 3p TYPE HUMAN-GROUP LEX Guerrillas)
home          (N AGR 3s TYPE LOC LEX home)
in            (P TYPE IN)
Merino        (NAME TYPE person LEX Merino)
San-Salvador  (NAME TYPE LOC LEX San-Salvador)
with          (P TYPE WITH)

Figure 11.8 A sample lexicon

(NG TYPE HUMAN-GROUP LEX Guerrillas)
(VG TYPE ATTACK VFORM past LEX attacked)
(NG TYPE LOC LEX (Merino home))
(P TYPE IN)
(NG TYPE LOC LEX San-Salvador)
(NG TYPE DATE LEX (BEFORE-NOW 5 days))
(P TYPE WITH)
(NG TYPE WEAPON LEX explosives)

Figure 11.9 The constituents found by the partial parser

For example, the pattern

<IN LOC> -> (LOCATION 2)

would match the input anywhere a constituent of type IN is followed by a constituent of type LOC. If found, the input that was analyzed as the LOC constituent is used as the value of the LOCATION slot in the template. The four patterns required to analyze the example sentence are shown in Figure 11.10.

P1  <HUMAN-GROUP ATTACK LOC> -> (INCIDENT ATTACK PERP 1 TARGET 3)
P2  <IN LOC> -> (LOCATION 2)
P3  <DATE> -> (DATE 1)
P4  <WITH WEAPON> -> (INSTRUMENT 2)

Figure 11.10 A few sample patterns

These patterns would match against the chart to produce the following partial templates:

(INCIDENT ATTACK PERP "Guerrillas" TARGET "Merino's home")
(LOCATION "San Salvador")
(DATE "five days ago")
(INSTRUMENT "explosives")

In this case there are no conflicting partial templates, so the final analysis would be a template that includes all the information specified in the partial templates.
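To make the matching step concrete, here is a small sketch in Python that runs TYPE-sequence patterns like those in Figure 11.10 over the constituents of Figure 11.9 and merges the resulting partial templates. The tuple and dictionary representations, the slot/position encoding of the pattern actions, and the merge_templates helper are assumptions made for this sketch; they are not the notation of any particular extraction system.

constituents = [            # (TYPE, LEX) pairs, as in Figure 11.9
    ("HUMAN-GROUP", "Guerrillas"),
    ("ATTACK", "attacked"),
    ("LOC", "Merino's home"),
    ("IN", "in"),
    ("LOC", "San Salvador"),
    ("DATE", "five days ago"),
    ("WITH", "with"),
    ("WEAPON", "explosives"),
]

# Each pattern: a sequence of TYPEs plus (slot, position) pairs, where the
# position is the 1-based index of the matched constituent supplying the value.
patterns = [
    (["HUMAN-GROUP", "ATTACK", "LOC"], [("PERP", 1), ("TARGET", 3)]),   # P1
    (["IN", "LOC"], [("LOCATION", 2)]),                                 # P2
    (["DATE"], [("DATE", 1)]),                                          # P3
    (["WITH", "WEAPON"], [("INSTRUMENT", 2)]),                          # P4
]

def match_patterns(constituents, patterns):
    """Match each pattern anywhere in the constituent sequence."""
    partial_templates = []
    types = [t for t, _ in constituents]
    for seq, slots in patterns:
        for start in range(len(constituents) - len(seq) + 1):
            if types[start:start + len(seq)] == seq:
                partial_templates.append(
                    {slot: constituents[start + pos - 1][1] for slot, pos in slots})
    return partial_templates

def merge_templates(partials):
    """Combine partial templates, complaining about conflicting fillers."""
    merged = {}
    for t in partials:
        for slot, value in t.items():
            if slot in merged and merged[slot] != value:
                raise ValueError("conflicting fillers for slot " + slot)
            merged[slot] = value
    return merged

print(merge_templates(match_patterns(constituents, patterns)))

Running the sketch on the constituents above produces a single merged template with the PERP, TARGET, LOCATION, DATE, and INSTRUMENT slots filled, mirroring the analysis just described.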
Of course, sentences are not always this simple to analyze. As domains become more complicated, the patterns must be more selective. For instance, if the system also has to handle articles about weapons trading, then the pattern <WITH WEAPON> might not always refer to the instrument of an attack but might refer to the cargo of a shipment, as in A truck loaded with explosives was stopped at the border. To handle such cases, we would need more complex patterns that include the verb in the analysis. For instance, we could allow patterns to skip over constituents, using rules like

<ATTACK ... WITH WEAPON> -> (INSTRUMENT 3)

which would match any sentence with a verb of type attack, followed sometime later by the prepositional phrase. Then another pattern, such as

<VEHICLE LOAD WITH WEAPON> -> (CARRIER 1 CARGO 4)

would handle the other sense. Note that other features are allowed in patterns in addition to type restrictions. For example, you would not want the same pattern to match the sentences The guerrillas were attacked by the police and The guerrillas attacked the police.

As patterns become more complex, they also can start to make distinctions between information in relative clauses and information in the main clause, since each may describe separate incidents. For example, the sentence The weapons, which were smuggled in by the guerrillas, were destroyed yesterday should be analyzed to produce two incidents rather than one combined incident in which the guerrillas both smuggled in and destroyed the weapons. Some systems allow patterns to be specified as arbitrary finite state machines, giving significant power to the programmer to handle such complications. Once this step is made, the system looks quite similar to the semantic grammar approach, except that it is operating from preparsed constituents.

Of course, a practical system would have to do additional processing rather than simply filling in the slots with the words. For instance, it might reason about dates and, using the date of the article, calculate exactly what day the phrase five days ago refers to. In addition, it might construct richer semantic representations of noun phrases while parsing, and these analyses might be used as the values of the slots in the templates. You could also use these same techniques to build logical forms for sentences if templates are organized around verbs and their thematic roles, rather than around incidents.

Systems based on these techniques tend to be robust in that they produce some interpretation for nearly any input. On the other hand, their analyses can be fairly superficial and prone to error.

BOX 11.1 Evaluating Message-Understanding Systems

As mentioned at the beginning of this section, one of the principal applications of pattern-based techniques is information extraction from text, where the system has to process large quantities of text (say, the AP news wire) and extract templates for any articles that deal with a specific topic, such as stock transactions. Systems have been evaluated on this task and reported on in a series of annual conferences (the MUC conferences). The systems tested use different combinations of pattern-matching and general-purpose techniques to produce the analyses of the articles. Such systems can be evaluated on several dimensions, including

• system recall: how many articles on the topic it identifies and how much of the template it can fill in
• system precision: how often it produces an incorrect template or an incorrect entry in a template
• system efficiency: how long it takes to process a set of articles

In the 1993 evaluation the systems had 1500 texts to train on, all annotated with a template indicating the best answer. They were then tested on 100 messages not previously seen by the systems. The best systems obtained about 62 percent recall with 53 percent precision. The pattern-based systems are considerably more efficient than the general-purpose systems, being able to process thousands of words per minute. For more information on the MUC evaluations, see Chinchor et al. (1993).
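Box 11.1 describes recall and precision only informally; the following sketch shows one simplified way to compute slot-level recall and precision by comparing a system's filled template against an answer key. Real MUC scoring is considerably more involved (it aligns templates across articles and allows optional and partially correct fills), so the score_templates function and the example slot values here are illustrative assumptions only.

def score_templates(system, key):
    """Slot-level recall and precision for one filled template."""
    correct = sum(1 for slot, value in system.items()
                  if key.get(slot) == value)
    recall = correct / len(key) if key else 0.0           # fraction of key slots found
    precision = correct / len(system) if system else 0.0  # fraction of system slots right
    return recall, precision

key = {"PERP": "Guerrillas", "TARGET": "Merino's home",
       "LOCATION": "San Salvador", "DATE": "five days ago",
       "INSTRUMENT": "explosives"}
system = {"PERP": "Guerrillas", "LOCATION": "San Salvador",
          "INSTRUMENT": "explosives", "EFFECT": "no injury"}

r, p = score_templates(system, key)
print("recall = %.2f, precision = %.2f" % (r, p))   # recall = 0.60, precision = 0.75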
Given the state of research, however, such techniques will probably play a large role in practical systems for the foreseeable future. The big research question to be answered is what happens as the problems scale up. What if information needs to be extracted on a broad range of topics, or if more detailed summaries of the content are required that are not easily expressed as simple templates?

Some systems attempt to gain the best of both worlds: They first use a full parser and semantic interpreter, and only if that fails do they use pattern-matching techniques to extract what information they can from the sentence. This has the advantage of handling some sentences in detail while remaining robust when the full parse fails. The disadvantage involves computational efficiency. If you must process millions of sentences a day, partial parsing and template matching may be the only viable techniques.

11.4 Semantically Driven Parsing Techniques

The last section used a limited syntactic preprocessor to prepare the input for semantic interpretation. It is also possible to eliminate the syntactic processing stage altogether, except for basic morphological analysis, and do everything using semantically driven patterns. This section briefly describes such a technique.

In such systems the grammatical and semantic information is stored in the lexicon entries for the words. In particular, the lexicon will contain essential information such as the different senses possible for the word, including case-frame information for verbs and adjectives, and a specification of a procedure for disambiguating the word and integrating it into larger semantic structures by combining it with other words. Here you will consider a system where this information is stored as a set of pattern-action rules operating on a buffer similar to the deterministic parser described in Chapter 6. The buffer contains the input constituents and is updated by the pattern-action rules. Whenever you enter a new word, the rules in that word's lexicon entry are made active. For example, the rules associated with the word booked might be those shown in Figure 11.11.

BOOK.1  <ANIMATE> "book"      -> (RESERVING * [AGENT 1])
BOOK.2  <RESERVING> <FLIGHT>  -> 1 a (RESERVING * [THEME 2])
BOOK.3  <RESERVING> <HUMAN>   -> 1 a (RESERVING * [BENEFICIARY 2])
BOOK.4  "for"                 -> 1 a (RESERVING * [BENEFICIARY 2])
S.end   "."                   -> pop

Figure 11.11 Sample lexicon

Each rule contains a pattern that tests for semantic categories and individual words in the buffer. When a pattern matches, the buffer elements involved in the match are removed and replaced by the structures specified in the action part. In addition, the action part of the rule may activate and deactivate additional rules. For example, rule BOOK.1 in Figure 11.11 will match a buffer with an entry of type ANIMATE followed by a buffer containing the word book, and it will replace these two buffers with an entry of the form (RESERVING * [AGENT 1]), where * is instantiated to a new discourse variable. As in the last section, numbers in the interpretation rules are used to refer to the buffer entries that matched the pattern. Thus 1 refers to the value of the first buffer involved in the match. The operator a is used to construct a new interpretation by merging two partial descriptions.

Consider how the sentence John booked me a flight to Chicago is interpreted using the preceding rules. For the moment, assume that all NPs have been preprocessed before this parse commences.
The parser starts with the single rule S.end active (which checks for the end of the sentence) and with the first two buffers filled with the terms (NAME j1 HUMAN "John") and booked, respectively. The rules for booked are made active, and BOOK.1 succeeds. This replaces the first two buffers with the value

(RESERVING r1 [AGENT (NAME j1 "John")])

The next buffer is filled with the structure (PRO m1 HUMAN me). This time rule BOOK.3 succeeds. Its action merges a new structure with the value of the first buffer, producing the following structure as the new value of buffer 1:

(RESERVING r1 [AGENT (NAME j1 "John")]
              [BENEFICIARY (PRO m1 ME1)])

The next input is the analysis of the noun phrase a flight to Chicago. The rule BOOK.2 matches, and the result is again merged with the first buffer, producing the following analysis:

(RESERVING r1 [AGENT (NAME j1 HUMAN "John")]
              [BENEFICIARY (PRO m1 HUMAN "me")]
              [THEME <the analysis of a flight to Chicago>])

When the final period is read in, rule S.end fires and signals the completion of the parse with the preceding analysis.

To extend this interpreter to handle the analysis of noun phrases, you need to introduce a mechanism similar to the attention-shifting mechanism in the deterministic parser. In particular, the rule for a is

ART.1 "a" -> INDEF1

In the preceding example, if the noun phrase a flight to Chicago had not been preprocessed, the parser would have reached the following position:

Buffer 1: (RESERVING r1 [AGENT (NAME j1 "John")]
                        [BENEFICIARY (PRO m1 ME1)])
Buffer 2: a

None of the currently active rules (BOOK.2, BOOK.4, and ART.1) would match this configuration. However, rule ART.1 could match if patterns are allowed to start matching at positions other than the first. Such matches are allowed and create an attention shift. The rules that were active are temporarily removed from the active list, and the parse continues as though buffer 2 were the first buffer. The parser is now in a position to use the rules for flight, to, and Chicago shown in Figure 11.12 to construct the NP analysis. When the NP is completed, the original state of the parser is restored by a new action called pop.

ART.1      "a" -> INDEF1 (activate rules NP.end1 and NP.end2)
FLIGHT.1   "flight" -> <1 * (FLIGHT *)>
FLIGHT.2   "to" ->
MAN.1      "man" -> <1 * (MAN *)>
CHICAGO.1  "Chicago" -> (NAME c1 "Chicago")
ME.1       "me" -> (PRO * HUMAN "me")
NP.end1    "." -> pop (a period signals the end of an NP)
NP.end2    -> pop (a verb signals the end of the subject NP)

Figure 11.12 Some rules for noun phrase analysis

Trace the parse continuing from the point shown earlier, where the word a enters the second buffer. Rule ART.1 fires and an attention shift is made, resulting in rules BOOK.2 and BOOK.4 being deactivated temporarily, and buffer 2 is set up as the first buffer for matching. When you enter the word flight, you have the following situation:

Buffer 1: (RESERVING r1 [AGENT (NAME j1 "John")]
                        [BENEFICIARY (PRO m1 ME1)])
Buffer 2: INDEF1    **patterns start matching here**
Buffer 3: flight

The active rules are FLIGHT.1 and FLIGHT.2. FLIGHT.1 matches and replaces the contents of buffer 3 (the word flight) with a FLIGHT structure; the following buffers come to hold the word to and the name (NAME c1 "Chicago"). Rule FLIGHT.2 matches and replaces buffer 2 with the completed analysis of the noun phrase. Next a period is entered into buffer 3, and rule NP.end1 fires and executes a pop, resetting the parser to the state before the NP was begun. Rules BOOK.2 and BOOK.4 are reactivated and the parse continues as shown earlier.
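The first trace above can be followed in code. The sketch below, in Python, implements a stripped-down version of this style of buffer-and-rules interpretation for John booked me a flight, with the noun phrases assumed to be preprocessed as typed entries. The dictionary-based frames, the entry and has_type helpers, and the hard-coded rule order are assumptions made for the sketch, and attention shifting and rule activation are omitted.

def entry(types, value):
    """A preprocessed NP: a set of semantic types plus its analysis."""
    return {"types": set(types), "value": value}

def has_type(x, t):
    return isinstance(x, dict) and t in x.get("types", set())

def interpret(buffer):
    while len(buffer) > 1:
        b1, b2 = buffer[0], buffer[1]
        if has_type(b1, "ANIMATE") and b2 == "booked":        # cf. BOOK.1
            buffer[:2] = [{"pred": "RESERVING", "AGENT": b1["value"]}]
        elif isinstance(b1, dict) and b1.get("pred") == "RESERVING" \
                and has_type(b2, "HUMAN"):                     # cf. BOOK.3
            b1["BENEFICIARY"] = b2["value"]
            del buffer[1]
        elif isinstance(b1, dict) and b1.get("pred") == "RESERVING" \
                and has_type(b2, "FLIGHT"):                    # cf. BOOK.2
            b1["THEME"] = b2["value"]
            del buffer[1]
        elif b2 == ".":                                        # cf. S.end
            del buffer[1]
            break
        else:
            break
    return buffer[0]

buffer = [entry(["ANIMATE", "HUMAN"], '(NAME j1 "John")'),
          "booked",
          entry(["HUMAN"], "(PRO m1 ME1)"),
          entry(["FLIGHT"], "(FLIGHT f1 [DEST Chicago])"),
          "."]
print(interpret(buffer))

Running this yields a single RESERVING frame with the AGENT, BENEFICIARY, and THEME roles filled, corresponding to the structure built in the trace.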
While parsers can be built quickly in such a framework to handle some specific set of sentences, problems arise in viewing these systems as a general model of parsing. Since all the rules are indexed by individual words, it is difficult to capture linguistic generalizations in a convenient way. For example, rule BOOK.1 identifies the first noun phrase found as the AGENT of the action RESERVING, one sense of booked. But this seems to be an instance of a general rule that applies for all action verbs. When the rules are accessed solely through lexical entries, however, you have no choice but to specify such a rule with each action verb. Thus extensive grammars can be cumbersome to construct.

More importantly, these systems can use only local syntactic information, because the only state they maintain consists of the current case-frame structure being specified and the current input. Thus, to disambiguate a word, you can at best inspect a word or so before it and a few words after. In practice, as these rules become more complex, they apply to fewer and fewer situations, and more equally complex rules need to be added to handle simple syntactic variants. For example, the definition of the verb booked earlier considered only its use as the main verb of a sentence in the simple past. In fact, booked is also a past participle and thus can be used in a passive sentence and can introduce a relative clause, as in The flight booked by the travel agent leaves at three. Work aimed at remedying these deficiencies often reinstates a syntactic component that can be used to aid the interpretation in complex sentences.

You can maintain syntactic context by running a syntactic parser in tandem with this interpreter. The actions of the parser, however, will be suggested by the semantic analyzer. The syntactic parser can simply return whether the syntactic operation identified by the semantic analyzer is a possible next move or not. If it isn't, the semantic interpreter attempts to find a different analysis. It is not clear, however, whether the semantic analyzer can handle complex sentences involving movement in a clean way. When the syntactic analyzer is controlling the processing, the syntactic component can be used to eliminate these complexities before the semantic analyzer is invoked. With the control scheme suggested here, the semantic analyzer itself would have to be able to handle the complexities, and only then would it have the analysis verified by the syntactic component.

BOX 11.2 Conceptual Analysis: A Semantically Driven Parser

Many semantically driven parsers have been developed based on work by the Yale natural language understanding group. The analyzer described in Birnbaum and Selfridge (1981) is probably the best reference for an introduction to the approach. The analyzer is driven by a case-based representation called conceptual dependency. The key idea is that certain words, especially verbs, identify meaning structures that have cases, and the parser's main job is to identify these words and use the rest of the sentence to fill in the values of these cases. While the details of the parser are different in style from the semantically driven parser described in this section, the type of operations performed are similar. The system is organized as a set of pattern-action rules called requests that are associated with lexical items. When a word is read, its requests become active.
In general, a request can test for two types of information: It can check for a particular word or phrase in the input, or it can check for certain semantic properties on the structures in the buffer, which is called the C-LIST. The actions allowed include the actions implicit in our presentation. In particular, they may add a new item to the C-LIST, fill in a slot in some structure on the C-LIST (a function performed by merging structures), and activate or deactivate other requests. Summary There are many different strategies for combining syntactic and semantic information beyond the rule-by-rule approach discussed in Chapter 9. These approaches generally use pattern-matching techniques to semantically interpret sentences, and differ on how much explicit syntactic processing is done. One approach is to use a full syntactic processing to compute an intermediate representation between syntax and semantics called the grammatical form. This is then interpreted by a separate semantic interpretation process. Another approach is to use a syntactic preprocessor that produces a partial analysis of the sentence and then perform semantic interpretation on lhat output. Yet another approach dispenses with any syntactic processing altogether and works by dircctiy matching semantic patterns to the input. A slightly different approach uses a grammar specified directly in terms of semantic constructs, yielding a semantic grammar. There are advantages to each of these approaches. Generally, the less syntax used, the more domain-specific the system is. This allows you to construct a robust system relatively quickly, but many subtleties may be lost in the interpretation of the sentence. In addition, such systems typically do not generalize well to new domains. In some applications, however, the domain-dependent pattern-matching approach may be the only way to attain reasonable performance for the foreseeable future. Related Work and Further Readings_ The RUS system (Bobrow and Webber, 1980) describes an ATN-based grammar that constructs grammatical relations. It allows actions on the arcs that invoke a semantic interpreter that incrementally constructs a case-frame based semantic representation of the sentence. The semantic interpreter may accept or reject various arcs, thus affecting the behavior of the syntactic parse. Woods (1980) provides a more formal analysis of such interleaved parsers using a formalism called cascaded ATN grammars. Another ATN-based system with interleaved semantic processing is described in detail in Ritchie (1980). Hirst (1987) is a good example of an approach that performs semantic interpretation on a representation based on grammatical relations. It composi-tionally interprets grammatical relations into a frame-based representation. Another good example is the KERNEL system (Palmer et al., 1993), A linguistic 346 CHAPTER 11 Other Strategies for Semantic Interpretation 347 treatment of grammatical relations is lexical functional grammar (Kaplan andí Bresnan, 1982). In LFG the syntactic rules use features for grammatical relations, which are then used for semantic interpretation in a rule-by-rule fashion. * Semantic grammars were proposed in the 1970s as a technique to build aí fairly robust system that operates in a limited domain. The technique was first-used in the SOPHIE system (Brown and Burton, 1975), a tutorial system for debugging electronic circuits. The grammar had terminal types such as REQUEST, TRANSISTOR, JUNCTION/TYPE, and so on. 
Each rule was associated with a" function that would produce LISP code that would perform the appropriate operation. The LIFER system (Hendrix et al., 1978) applied the same techniques.' to database query applications, constructing a different grammar for each data-" base application using the concepts relevant to that database. One of the earliest uses of partial understanding techniques for text summarization was the FRUMP system (DeJong, 1982). A considerable literature has developed in the last few years as a result of a focus on message-understanding applications. The specific techniques described in Section 11.3 are based on the FASTUS system (Appelt et al., 1993). This system uses a series of finite state", machines to recognize noun and verb groups and to implement the patterns for extracting information. Another good example is the SCISOR system (Jacobs and Rau, 1990). Wilks (1975) provides a good example of template-driven semantic interpretation with minimal syntactic processing. The semantic information in this system consists of a set of semantic templates defining the possible semantic relationships in the system and a semantic formula for each lexical entry. The essential operation in this system is matching these templates to a representation of the sentence where the basic phrase structure has been extracted. The semanti-cally driven parser in Section 11.4 is modeled after the conceptual analyzer of Birnbaum and Selfridge (1981) (see Box 11.2), which itself is a descendant of the earlier parsers (Schank, 1975; Riesbeck and Schank, 1978). An interesting semantically driven parser that drives a syntactic component to verify the analysis is described by Lytinen (1986). Exercises for Chapter 11_ 1. (easy) Show the logical form generated from the grammatical dependency representation of the sentence A ticket was bought at the theater, using the data shown in Figures 11.1, 11.2, and 11.3. 2. (easy) Draw the parse tree produced by the semantic grammar described in Section 11.2 for the query When does the 8 p.m. train from Boston arrive in Chicago? Show the additional rules that you have to add to the grammar so that it will accept this sentence. 3. (medium) Extend the treatment based on grammatical relations described in Section 11.1 so that it handles the beneficiary case, as in Jack bought me a ticket. Jack bought a ticket for me. Give the parser output for each sentence as well as the final logical form produced by your patterns. 4. (medium) Complete the semantic grammar sketched in Section 11.2 so that it handles all of the example noun phrases found in that section. Give four other noun phrases (two well-formed and two ill-formed) that are structurally different that your grammar also handles correctly. Design a set of features that can duplicate the restrictions found in your semantic grammar while retaining a purely syntactic grammar. Give a syntactic grammar using these features that is equivalent to your semantic grammar. Which grammar would be easier to construct from scratch for a new domain? Which grammar would be easier to adapt to a new domain? 5. (medium) Extend Grammars 11.6 and 11.7 and give a set of pattern-based interpretation rules that can extract the necessary information from each of the following sentences in the terrorist application. You may assume that the templates given in Figure 11.5 specify all the information required for the application. 
For each, specify the sequence of constituents produced by your grammar in a format similar to that in Figure 11.9. Give the output constructed by your patterns for each of the sentences. Treat each sentence as a separate input, and not as part of the same paragraph. a Guerrillas used explosives to attack San Salvador yesterday. b. No one was injured when the guerrillas bombed the government buildings. c. In San Salvador several terrorists were thwarted in their plan to bomb the home of the president. Discuss at least one difficult problem that arose in defining these patterns, and describe the strengths and limitations of your solution. 6. (medium) Implement the pattern-matching component of a message-understanding system that can interpret sentences such as those in Exercise 5, given an input sequence of preparsed constituents. Test your system on the sentences in Exercise 5 plus three other sentences of your own choosing that demonstrate the scope of your implementation. 7. (medium) Design a set of pattern-action rules as described in Section 11.4 that perform the same task as described in Exercise 5 but do not use any syntactic preprocessing. What are the most difficult issues that need to be addressed that are easier with a syntactic preprocessor? Does the approach based on direct semantic interpretation have any advantages? CHAPTER Scoping and the Interpretation of Noun Phrases 12.1 Scoping Phenomena 12.2 Definite Descriptions and Scoping 12.3 A Method for Scoping While Parsing 12.4 Co-Reference and Binding Constraints 12.5 Adjective Phrases 12.6 Relational Nouns and Nominalizations 12.7 Other Problems in Semantics Scoping and the Interpretation of Noun Phrases 351 This chapter considers some additional issues in semantic interpretation, mostly those that affect the interpretation of noun phrases. In particular, it considers issues in the scoping of quantifiers and operators, the generation of co-reference constraints based on the structure of the sentence, and assorted issues in interpretation of adjectiye phrases and other modifiers in noun phrases. Section 12.1 examines the issues that affect quantifier scoping decisions. Section 12.2 extends this discussion further by examining properties of definite descriptions, which sometimes act like quantified phrases. Section 12.3 then presents a technique for computing likely scope ordering while parsing. Section 12.4 introduces the general issues that affect co-reference constraints between pronouns and other referring noun phrases, and describes a technique for generating such constraints while parsing. Section 12.5 addresses issues in semanti-cally interpreting adjective phrases and other noun modifiers, and Section 12.6 considers special classes of noun phrases involving relational nouns and nominalizations. Finally, Section 12.7 briefly describes some other important issues which have not yet been considered. 12.1 Scoping Phenomena This section discusses issues that affect scoping in language. Scope ambiguity occurs with quantifiers, logical operators, modal operators, and adverbials in sentences. Speaking loosely, we will refer to each construct that exhibits scoping behavior as an operator. Determining the scope of quantifiers and operators in the logical form is a very complex problem that not only involves using lexical, syntactic, and semantic information but also is strongly influenced by context. There are, however, some constraints and preferences for interpretations that arise without considering context. 
Scoping is a pervasive problem in semantic interpretation. Its presence can be illustrated by a series of sentences that are ambiguous due to scope phenomena. Any sentence that contains two or more quantifiers will likely be ambiguous. For example, A dog entered with every man. may describe at least three different situations: (1) There are many dogs, each accompanied by a man; (2) a single dog walked in when all the men walked in as a group; and (3) a single dog entered many times, each time with a different man. These readings arise from the relative scoping of the two quantifiers and the implicit existential quantifier associated with the event. Ambiguity also arises from the scoping of logical operators and quantifiers, as in We didn't see every dog. which may assert that we didn't see any dog, that is, (EVERY dl : (DOG dl) (NOT (SEE1 (PRO wl WEI) dl))) 352 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 353 or that there was at least one dog we didn't see, that is, (NOT (EVERY : dl (DOG dl) (SEE1 (PRO wl WEI) dl))) An example of an ambiguity with a logical connective is Everyone thought that Fido or Fifi would win. which has one reading where either everyone thought Fido would win or everyone thought Fifi would win, and another reading where each person thought that either Fido or Fifi might win. The difference here is the scoping between the universal quantifier and the disjunction. Ambiguities also arise with tense operators and adverbials, as in He will feed the hungriest dog tomorrow. in which it may be that the dog is hungriest now and will be fed later (possibly when it is not hungry), or that the dog will be the hungriest tomorrow when it is fed (and possibly is not hungry now). The sentence A fat dog always loses the race. displays a scope ambiguity on the adverbial. It may be a statement that the loser is always a fat dog or that there is a particular fat dog so slow that it loses every race. Classifying Quantifiers Noun phrases serve many different functions in language, and it is important to distinguish these functions when considering scoping issues. There are at least three major classes to consider. Those involving definite reference indicate that the hearer should in principle be able to identify the object or set. Definite reference occurs with determiners such as the, as in the dog (an individual) or the fat men (a specific set), and with possessive determiners such as their ideas or John's book. The second class of quantifiers are existential or indefinite, and they typically introduce new individuals or sets into the discourse. Typical examples of indefinite reference involve determiners such as a in a good book and indefinite sets constructed with quantifiers such as some and several. One test for existential quantifiers is whether they can appear in sentences of the form There are Q men who like golf. which allows Q to be any existential (plural) quantifier, such as some, many, a few, several, two, seven, no, and so on. Note that quantifiers such as all. each, every, and most would be ill-formed in this context. Such quantifiers form the third class, the universal quantifiers, which involve properties of all, or nearly all, members of a set. Another major distinction in those quantifiers that involve a set is the collective/distributive distinction. Collective interpretations treat the set as a unit, as in The men met at the comer, in which it is the set of men who met. It makes no sense to say that any individual man in this set met. 
Distributive interpretations range over the members of the set, as in Each man bought a suit, in which each man individually bought a suit, rather than buying one suit as a group, which would be the likely interpretation of The men bought a suit to give to Harry. Some quantifiers only support a particular reading (for example, each must be distributive), but most are ambiguous as to the distributive/collective distinction. Consider the following four sentences that involve sets:

Each man lifted the piano.
Every man lifted the piano.
All the men lifted the piano.
The men lifted the piano.

The first sentence has only a distributive reading, in which each man lifted the piano individually. The last strongly prefers a collective reading, in which the set of men lifted the piano together. The other cases fall in between these two readings and might be interpreted either way.

This tendency toward a distributive or collective reading affects the possible scoping decisions, because other quantifiers may depend on a noun phrase only when it is in the distributive reading. For example, consider the sentence Each man lifted a piano. There may be many different pianos referred to here, one for each man. We say that the noun phrase a piano depends on the noun phrase each man, or that each man takes wider scope than a piano, drawing from the way quantifiers are scoped in first-order logic. The dependency is possible only when the wider-scoped noun phrase is distributive. If the first noun phrase is collective, as in

Together, the men lifted a piano.

then there is no dependence and the phrase refers to a single piano. Of course, many constructs are ambiguous, such as

A piano was lifted by each man.

which plausibly has a reading in which there are many pianos or a reading in which there is a single piano that is lifted many times. Note that in both cases the noun phrase each man is distributive. But the noun phrase a piano depends on it only in the first reading. In the second the piano remains the same, but each distributes across a set of lifting events (one by each man).

Making Scoping Decisions

The quantifier scoping problem is typically presented as the problem of finding a full scoping of the operators in a nested form as found in FOPC. Note, however, that such a full scoping form does not entail dependence of a variable on the variables that outscope it. For instance, a sequence of indefinite noun phrases, as in a man lifted a piano, has the same meaning no matter how it is scoped. It is an artifact of the syntax of FOPC that you are forced to pick an ordering. Other representations are possible in which only the dependencies that are necessary are indicated, and variables that are unordered remain independent of each other. Such representations have the advantage of not making a sentence look more ambiguous than it really is. For the sake of maintaining a simple presentation, however, we will assume in the rest of this section that the task of scoping is to construct a full ordering on the quantifiers.

To examine scoping phenomena more closely, it will be useful to define the local domain of a constituent. The local domain of a constituent C is the set of constituents contained in the closest S or NP that contains C. Consider the tree in Figure 12.1.

Figure 12.1 A parse tree for demonstrating local domains: the sentence Jill read Mary's book about the depression, with NP1 the subject Jill, NP2 the object noun phrase, NP3 the possessive Mary's, and NP4 the depression
The local domain defined by the top S constituent contains NP| and NP2, while the local domain defined by constituent NP2 contains NP3 and NP4. The S or NP constituent defining the domain is called the dominating constituent of the constituents it contains. Two constituents are said to be horizontally related if they belong to the same local domain. Two constituents are said to be vertically related if the first is the dominating constituent of the local domain of the second. This terminology will be extended to quantifiers as well; that is, two quantifiers are horizontally related if the constituents that contain them are horizontally related. Horizontal Scoping One common approach to handling horizontal scoping is to compare each pair of operators and see if there is a pattern where one usually outscopes the other. This tendency is often expressed in terms of quantifier "strength," where the- "stronger" quantifier usually outscopes the "weaker." Many researchers have suggested the following hierarchy of quantifier strength: each > every > all, some, several, a Of great importance in question answering, wh-terms can also be given a strength. Most question-answering systems place wh between each and every, as shown by the sentences: Who saw every dog? Who saw each dog? The first sentence would be treated as asking for a single person who saw all the dogs; the second would be treated as asking for a list of people—one or more for each dog. It turns out, however, that such uniform preference schemes are fairly inaccurate in many cases because they do not take into account the structural relationships between the quantifiers. For example, the preferences might change depending on which quantifier occurs as the subject. Consider these sentences: Every man saw a dog. A man saw every dog. The preferred reading of the first sentence allows for a different dog to be seen by each man (that is, every outscopes a). The preferred reading of the second sentence, however, involves one man (that is, a outscopes every). Thus the position of quantifiers in the sentence also affects scoping preferences. The following general preferences for wide scope based on position have been suggested, although these preferences may not apply to particular quantifiers: preposed constituents > surface subjects > postposed adverbials > direct/indirect objects Resolving Horizontal Scoping Ambiguity Algorithms that generate fully scoped logical forms are often specified in terms of the operation of pulling out a quantifier from its unscoped form to its fully scoped form—a process that involves taking the quantified expression out of the formula, leaving the marker, and wrapping it around a proposition that originally contained the term. By changing the ordering in which quantifiers are pulled out, different scopings are generated. For example, given the unscoped form ( ) you could generate the preferred reading by pulling out dl to produce (A dl : (DOG dl) ( dl)) 356 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 357 and then pulling out ml to produce (EVERY ml: (MAN1 ml) (A dl : (DOG dl) 11 ml dl)) The tense operator could then be pulled out last, producing the fully scoped logical form (PRES (EVERY ml: (MAN1 ml) (A dl : (DOG1 dl) (LOVES 1 11 ml dl)))) If, on the other hand, ml were pulled out first and dl pulled out next, you would obtain a different reading. The challenge is to develop an algorithm that uses the weights to select the best interpretation. 
It will sometimes be easier to work with a different notation that concisely captures the ordering of the quantifiers. Specifically, we will use a list indicating the quantifier ordering that is attached to the unscoped logical form. Thus the scoping order in the fully scoped logical form would be written as [ml dl]. There are many algorithms suggested in the literature that attempt to combine quantifier strength and syntactic role preferences in a heuristic way. For instance, you might first list the quantifiers in the order in which they appear in the sentence to obtain a default order [Vj ... Vn]. Then variables are exchanged when a variable Vj+i has a stronger quantifier than the variable Vj. Such exchanges continue until no more exchanges can be made. Vertical Scoping Relations The scoping process becomes more complicated when you consider vertical relations. In particular, a quantifier in a nested constituent may be lifted vertically over its dominating operator and then need to be scoped with respect to the other quantifiers at its new level. Thus the vertical scoping algorithm may change the set of quantifiers that need to be scoped with the horizontal scoping algorithm. Whether or not a quantifier is lifted depends on the syntactic structure that leads to the vertical scoping relation. For example, the sentence Some man rewarded a boy who gave each dog a bone. has two finite clauses, as seen in its logical form: ( rl gl bl ))>) Based on the syntactic structure, there are two local domains of interest: one containing ml and bl, and the other containing dl and b2. Furthermore, dl and b2 are vertically related to the operator in their dominating constituent, bl, i ■ Full relative clauses are often called scope islands because they tend to prohibit quantifiers from moving out of the clause. In this example, for instance, the most intuitive reading involves one boy, rather than a different boy for each dog. If this constraint always held, then such sentences could be scoped by running the horizontal algorithm twice, once to scope the relative clause and once to scope the main clause. But relative clauses are not always scope islands, as seen in The dogs that won each race are hungry, which has a reading in which there are different dogs in each race (that is, each outscopes the). In this reading the embedded quantifier must be lifted out of its local domain into the next higher domain, where it is horizontally scoped with the quantifiers at that level. Specifically, ignoring tense for the moment, the unscoped logical form for the sentence is (HUNGRY 1 hi ))>) The top-level local domain just contains dl, and the local domain for the relative clause contains r2, which is also vertically related to dl. If the quantifier for r2 is not lifted over the quantifier for dl, the most likely fully scoped interpretation is (THE dl : (& ((PLUR DOG1) dl) (EACH r2 : (RACE1 r2) (RUNS-IN1 rl dl r2)) (HUNGRY hi dl))) If the quantifier is lifted, then it is horizontally scoped at the top level. In this case it would produce the interpretation (EACH r2 : (RACE1 r2) (THE dl : (& ((PLUR DOG1) dl) (RUNS-IN1 rl dl r2)) (HUNGRY 1 hi dl))) Note that if a quantifier is lifted up to the next horizontal level, it must take wider scope than the quantifier that used to dominate it. Otherwise, an ill-formed expression would result. For instance, if the quantifier for r2 were lifted over the quantifier for dl but then horizontally scoped narrower than dl, the result would be as follows: ??? 
(THE dl : (& ((PLUR DOG1) dl) (RUNS-IN1 rl dl r2)) (EACH r2 : (RACE1 r2) (HUNGRY 1 ml dl))) Note that an instance of the variable r2 appears outside the scope of its quantifier, creating an uninterpretable expression. So far, the local domains in the examples have been defined by S constituents, namely the main sentence and relative clauses. Local domains are also 358 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 359 defined by NPs, which becomes relevant when they have prepositional phras. modifiers. Consider the sentence The man in every boat rows. The unscoped logical form (ignoring tense) for this sentence would be (ROWS1 rl ) Z There are two local domains, one just containing ml and the other just containing5 bl. Furthermore, bl is vertically related to ml. If the quantifier for bl is not > lifted, it remains in the restriction of the definition description, and the final logical form would be (THEml:(& (MAN 1ml) (EVERYbl : (BOAT1 bl) (IN ml bl))) (ROWSI rl ml)) Here there is one man who somehow is in every boat. If the quantifier is lifted, then it must outscope the quantifier for ml, producing the logical form (EVERY bl : (BOATI bl) (THE ml : (& (MAN1 ml) (IN ml bl)) (ROWS1 rlml))) In this interpretation, there may be different men, one in each boat. This seems to be the preferred reading for this sentence without considering context. Different vertical embedding relationships have very different preferences with respect to lifting embedded quantifiers to the next higher level. The following ordering on embedding constructs has been suggested, based on the likelihood that embedded quantifiers will be vertically lifted: possessives > PP modifiers > reduced relative clauses > relative clauses To analyze vertical relations accurately, however, you have to examine ; each construct on a case-by-case basis and see how the different operators interact with it. Let's consider PP modification in more detail. When a distri- j butive quantifier, such as every, is embedded in a PP modifier, the tendency to be lifted out is quite strong, as in A man in every boat was singing. which has a strong preference for the distributive reading in which there are many men singing. But if the embedded quantifier is not distributive, such as a, , in the new approach a feature QS (for quantifiers) is set to and the SEM is set to the discourse variable wl. The techniques for constructing SEM forms from subconstiluenls is exactly as before, and the only extension needed is to define how the QS feature is constructed. Once all the quantifiers in the same local domain have beer gathered, they can be sorted and the fully scoped SEM form constructed. You could modify the parser to invoke the scoping algorithm each time an S or NP constituent is constructed. The method here, however, will leave greatet flexibility to the grammar designer. A new binary feature SCOPEPOS is defined such that, whenever it is +, the parser invokes a procedure to sort the quantifiers. This procedure decides which quantifiers to lift to a higher horizontal context, and then sorts the remaining quantifiers and inserts them into the SEM. For example, consider the question When does each plane fly? Initially, this question would be parsed to construct an unscoped representation of the form (ignoring tense operators for the moment): (S SCOPEPOS + QS ( ) SEM (& (FLEES 1 fl pi) (AT-TIME fl tl))) Since this representation has the SCOPEPOS feature, the quantifier scoping algorithm is invoked. 
This top-level S requires no quantifiers to be lifted, and after sorting, one of the new constituents generated might be (S SCOPEPOS -QS ml SEM (EACHfl : (PLANE 1 pi) (WH tl : (TIME tl) (& (FLIES 1 fl pi) (AT-TIME fl tl))))) If multiple interpretations are possible, then multiple new S constituents could be constructed by the scoping procedure and added to the chart. As another example, consider the treatment of relative clauses, as in the noun phrase The flights that each man took. The embedded sentence that is built as the relative clause might initially have the form (S SCOPEPOS + QS () SEM (TAKES 1 tl mix)) where x is the variable for the relative pronoun. If relative clauses are treated as absolute scope islands, then the only interpretation possible would insert the quantifier into the SEM, producing the constituent (S SCOPEPOS -QS nil SEM (EACH ml (MAN1 ml) (TAKES 1 tl ml x))) If the quantifier is allowed to be lifted, then another constituent is also possible, namely (S SCOPEPOS - QS () SEM (TAKES 1 tl mix)) The same method can be used for scoping quantifiers in NP-dominated local domains. To treat scoping differently depending on the form of the modifier, new features are introduced to collect embedded quantifiers—those that arise in PP modifiers (the feature QSPP) and those that are lifted out of relative clauses (the feature QSREL). This allows different scoping strategies to be used for each case. Note that quantifiers that are not lifted out of an NP context are inserted around the restriction of the quantifier and not into the SEM, as is done for sentences. 362 CHAPTER 12 For example, consider the NP constituent constructed for the flights to each city before scoping: (NP SCOPELOC + QS QSPP SEM fl) If the quantifier EACH1 is not lifted, the new NP constituent would be (NP SCOPELOC- QS<^THEfl(& (FLIGHT 1 fl) (EACH cl (CITY1 cl) (DEST fl cl)))> SEM fl) If it is lifted, then the new constituent would be (NP SCOPELOC- QS (ORDERED ) SEM (k ag (SEES1 si ag dl))) These would be passed up to the S structure, which for the sentence The man saw: a dog would be (S SCOPELOC + QS ( ) SEM (SEES1 si ml dl)) This could then be scoped in the normal fashion, inserting the PAST operator wherever the algorithm deems best. To enable this approach to scoping operators, the lexical entries and morphological rules must be modified to set the QS register appropriately. For example, the rule for regular past-tense verbs would set the QS register, as in (V QS SEM ?semv) (V SEM ?semv IRREG-PAST -) +ED Scoping and the Interpretation of Noun Phrases 363 An Example Consider using this technique in a simple database query application. Quantifiers are used extensively in questions to databases, so the domain is a good choice as a test case. The preferences for horizontal ordering in S-dominated domains will be indicated using a, weighting scheme. Each quantifier is assigned a particular weight that shows its tendency to take wide scope. All horizontally related quantifiers are then sorted according to these weights. The weights used will impose the ordering tense operators > die > each > wh > other quantifiers > negation Thus a query such as When does each plane receive maintenance? is interpreted as asking for a listing of the maintenance times for each plane individually, whereas When does every plane receive maintenance? is interpreted as asking for a time when all planes receive maintenance simultaneously. In a sentence, quantifiers that have the same weight are scoped in the order in which they appear. 
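The horizontal sorting step just described is easy to state procedurally. The sketch below, in Python, sorts the quantifiers gathered in the QS feature by strength, with a stable sort preserving surface order for quantifiers of equal weight; the numeric weights and the (quantifier, variable) pair representation are assumptions made for illustration, and the example corresponds roughly to a question like Which flights do not serve a meal?

WEIGHTS = {"PRES": 6, "PAST": 6,     # tense operators take widest scope
           "THE": 5, "EACH": 4, "WH": 3,
           "EVERY": 2, "A": 2, "SOME": 2,
           "NOT": 1}                 # negation takes narrowest scope

def scope_order(qs):
    """qs is a list of (quantifier, variable) pairs in surface order."""
    # sorted() is stable, so quantifiers of equal weight keep their surface order
    return sorted(qs, key=lambda q: -WEIGHTS.get(q[0], 2))

print(scope_order([("WH", "f1"), ("NOT", "neg1"), ("A", "m1")]))
# -> [('WH', 'f1'), ('A', 'm1'), ('NOT', 'neg1')]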
The tense operators are taken to have the widest scope, because sentences with tense operators with a narrow scope typically do not appear in a database query. The negation operator is taken to have the narrowest scope. This means a form such as (S QS ( ) SEM (SERVES 1 si fl ml)) will be resolved to the fully scoped form (WH fl : ((PLUR FLIGHT 1) fl) (Ami: (MEAL1 ml) (NOT (SERVES 1 si fl ml)))) The vertical relationships in NP-dominated domains are handled using simple strategies: All quantifiers within a PP modifier are scoped wider than the quantified phrase they modify, and all quantifiers in relative clauses remain within the clause except for the definite quantifier (that is, relative clauses are strong scope islands). These heuristics are a little too simple for a real database query system but will serve to illustrate the techniques. They could easily be extended using the same framework. Grammar 12.2 is a small grammar that can parse some example questions that involve horizontal relationships. You have seen variants of most of these rules before, in Chapter 9. The only real change is the addition of the QS features. Rule 1 is the standard S rule, except that the S is marked +SCOPELOC, so the scoping algorithm will be invoked on all the quantifiers gathered from the NP and the VP. Rule 2 handles questions that begin with wh-phrases such as when and where. Rule 3 handles inverted S forms needed in wh-questions. Rule 4 is the standard rule of transitive verbs, modified by adding the QS feature. Rule 5 handles noun phrases that start with a determiner. Note the use of the QSPP and 364 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 365 (S SCOPELOC + INV - QS (?qsnp ?qss) SEM (?semvp ?semnp)) -»• (NP QS ?qsnp SEM ?serrrnp) (VP QS ?qsvp SEM ?semvp) (S SCOPELOC + QS (?qspp ?qss) SEM (& (?sempp ?v) ?sems)) -> (PP WH Q QS ?qspp SEM ?sempp) (5 INV + VAR ?v QS ?qss SEM ?sems) (S INV + QS (?qsaux ?qsnp ?qsvp) SEM (?semaux (?semvp ?semnp))) h> (AUX QS ?qsaux SEM ?semaux) (NP WH Q QS ?qsnp SEM ?semnp) (VP QS ?qss SEM ?semvp) (VP VAR ?v QS (?qsv ?qs) SEM (1 a3 (?semv ?v a3 ?semnp))) -» (V[_np] QS ?qsv SEM ?semv) (NP QS ?qs SEM ?semnp) (NP SCOPELOC + VAR ?v SEM ?v QSPP ?qspp QSREL ?qsrel QS ) -> (DET SEM ?semart) (CNP QSPP ?qspp QSREL ?qsrel SEM ?semcnp) (CNP SEM ?semn) -> (NSEM ?semn) (PP WH Q QS ?qs SEM ?sem) -> (PP-WRD WH Q QS ?qs SEM ?sem) Head features for S, VP, NP, CNP: VAR Grammar 12.2 A simple grammar with QS and SEM features QSREL to store the embedded quantifier, and QS to store the quantifier for the dominating constituent (that is, the new NP). Rule 6 allows unmodified CNP structures, and rule 7 allows the PP-words such as when and where in questions. ;:. Figure 12.3 shows the parse tree for When does the flight leave each city? This sentence contains one nontrivial local domain, which contains all the quantifiers and the tense operator. Note how the operators are passed up the tree: to the S structure, where the scoping algorithm produces the fully scoped form. Grammar 12.4 shows some additional rules that handle PP modification and relative clauses. As previously mentioned, quantifiers in prepositional phrase modifiers are always lifted, whereas quantifiers in relative clauses arc never lifted. Rule 8 is the standard rule for predicative prepositional phrases augmented;, with the QS feature. Rule 9 handles PP modification of CNPs and sets the QSPP feature. Rule 10 handles relative clauses and sets the QSREL feature, while rule 11 handles the actual relative clause. 
8. (PP PRED + QS ?qs SEM (λ x (?semp x ?semnp))) ->
      (P SEM ?semp) (NP QS ?qs SEM ?semnp)
9. (CNP SEM (λ n1 (& (?semcnp n1) (?sempp n1))) QSPP ?qspp QS ?qscnp QSREL ?qsrel) ->
      (CNP QS ?qscnp QSREL ?qsrel SEM ?semcnp) (PP PRED + QS ?qspp SEM ?sempp)
10. (CNP QSPP ?qs QSREL ?qsrel SEM (λ n2 (& (?semcnp n2) (?semrel n2)))) ->
      (CNP QSPP ?qs SEM ?semcnp) (REL QS ?qsrel SEM ?semrel)
11. (REL QS ?qs RVAR ?v SEM (λ ?v ?sems)) ->
      (NP WH R VAR ?v SEM ?semnp) (S QS ?qs GAP (NP SEM ?semnp) SEM ?sems)

Head features for NP, CNP: VAR

Grammar 12.4 Rules to handle simple PP modification and relative clauses

Figure 12.3 The parse tree for When does the flight leave each city? The QS features gathered from the NP, AUX, and VP constituents are passed up to the S node, where the scoping algorithm produces the fully scoped SEM (PRES (THE f1 : (FLIGHT1 f1) (EACH c1 : (CITY1 c1) (WH t1 : (TIME t1) (& (LEAVES1 l1 f1 c1) (AT-TIME f1 t1))))))

Using this grammar, the question When does the flight to each city leave? would produce the logical form

(EACH c1 : (CITY1 c1)
  (THE f1 : (& (FLIGHT1 f1) (DEST f1 c1))
    (WH t1 : (TIME t1)
      (& (LEAVES1 l1 f1 c1) (AT-TIME f1 t1)))))

Specifically, the question is asking about many different flights, one to each city. This scoping is a result of the treatment of PP modifiers. In contrast, the question When did the flight that each man took leave? asks about one flight that all the men took. This reading is the result of the treatment of relative clauses.

12.4 Co-Reference and Binding Constraints

There are significant syntactic constraints on how noun phrases may co-refer, which to a first approximation means that they both refer to the same object. This is an area of active study in linguistics, and only some of the basic results will be discussed here. In particular, consider the following example sentences, where the subscripts on noun phrases indicate that they co-refer, and asterisks indicate ill-formed interpretations, as usual.

1a. *When Jack_i arrived at the party, she_i was drunk.
1b. When Jill_i arrived at the party, she_i was drunk.
2a. *He_i said Jack_i wants to leave.
2b. Jack_i said he_i wants to leave.
3a. *Jill_i saw her_i in the mirror.
3b. Jill_i saw herself_i in the mirror.
3c. *Jill_i thought that Jack saw herself_i.
3d. Jill_i thought that Jack saw her_i.
4a. *Jack_i thought the tired man_i was dying.
4b. Jack_i thought he_i was dying.

In case of co-reference, the first phrase is often called the antecedent and the second the anaphor. These examples illustrate intrasentential anaphora: two co-referring expressions occurring in the same sentence. The other major form of anaphora is discourse anaphora or intersentential anaphora, which will be discussed in Chapter 14. To see the difference, consider the ambiguity in the sentence Jack found his hat, in which the pronoun his may refer to Jack (intrasentential) or to someone else who was mentioned in a previous sentence (intersentential). Since we are not considering context yet, we only study the intrasentential cases here. In the next few paragraphs, some preliminary constraints are suggested and problems with them are discussed. This will motivate the development of a better set of constraints in the remainder of the section.

Sentences 1a and 1b can be accounted for by an obvious constraint: For two noun phrases to co-refer, they must agree in gender, number, and person.
Sentence 1a is ill-formed because Jack and she disagree in gender. One tempting constraint to explain sentences 2a and 2b might be that antecedents must precede the pronouns that refer to them. Thus 2a is ill-formed because the pronoun occurs first. Unfortunately, this constraint is not valid, as can be seen by the following examples:

In its_i dreams, the goldfish_i lives in the ocean.
Before she_i went to the party, Jill_i bought some wine.
Most of her_i friends think that Jill_i will win.

A better constraint will be developed later.

Sentences 3a through 3d reveal constraints on the use of the reflexive. It turns out that the notion of local domain defined earlier provides a good start here. A reflexive pronoun must co-refer with a phrase in the same local domain, whereas a nonreflexive pronoun must not co-refer with a phrase in the same local domain. Sentence 3a is ill-formed because Jill and her are in the same local domain, whereas 3c is ill-formed because Jill and herself are not in the same local domain. An example of the reflexive use not involving the subject of a sentence is Jill talked to Jack_i about himself_i. But the constraint as expressed does not explain why the following is ill-formed: *Jill talked to himself_i about Jack_i.

Finally, sentences 4a and 4b indicate that in some circumstances, pronouns must be used. A starting constraint here might be that nonpronominal noun phrases in the same sentence cannot co-refer. But this is too general, as the following sentence shows: After Jill_i had been questioned for hours, Sue took the tired witness_i out to lunch.

A better set of constraints can be obtained by using some additional concepts from linguistics. The definition of local domain from the previous sections is a start. In addition, a new concept called C-commanding will be used:

A constituent C C-commands a constituent X if and only if
1. C does not dominate X.
2. The first branching node that dominates C also dominates X.

To understand this definition, consider the syntactic trees in Figure 12.5.

Figure 12.5 Syntactic trees demonstrating the C-command relationships: on the left, Jill told Mary about her (NP1 = Jill, NP2 = Mary, NP3 = her); on the right, Jill read Mary's book about her (NP4 = Jill, NP5 = the object noun phrase, NP6 = Mary's, NP7 = her)

The first branching node that dominates NP1 is the S; thus NP1 C-commands NP2 and NP3. The first branching node dominating NP2, on the other hand, is the VP node, so NP2 C-commands NP3. In the second tree, NP4 C-commands NP5, NP6, and NP7, while NP6 C-commands NP7. The constraint on reflexivity can now be recast in terms of the C-command relationship:

1. A reflexive pronoun must refer to an NP that C-commands it and is in the same local domain.
2. A nonreflexive pronoun cannot refer to a C-commanding NP within the same local domain.

To explore the implications of these constraints, consider the sentences in Figure 12.5 and some variations. In the sentence on the left, the pronoun her cannot refer to either Jill or Mary because both C-command it and both are in the minimal S that contains the pronoun. In Jill told Mary about herself, however, the reflexive pronoun could refer to either Jill or Mary. In Jill told herself about Mary, the reflexive pronoun can only refer to Jill, since the pronoun is not C-commanded by Mary even though they share the same local domain. In the sentence on the right, the pronoun her can refer to Jill because, although NP4 C-commands NP7, it is not in NP7's local domain.
For NP7 to co-refer with Mary, on the other hand, the pronoun would have to be reflexive. This provides a reasonable account of the problems raised by reflexives, but we still have sentences 2a, 2b, 4a, and 4b to explain. These can be handled by a constraint on nonpronominal noun phrases as follows: 3. A nonpronominal NP cannot co-refer with an NP that C-commands it. This constraint accounts for all the previous examples. Sentence 2a is ill-formed because the pronoun He C-commands Jack. Sentence 4a is ill-formed for the same reason: The subject Jack C-commands the noun phrase the tired man. Note that with the sentence After Jilli had been cross-examined for hours, Sue took the tired witness i out to lunch, there is no C-command relation between the two noun phrases, so they can co-refer. Interactions with Universal Quantifiers Pronouns interact with quantifiers as well. Consider the sentence Every iriani thought he; would win the race. in which each man expects to win the race himself. In this case the pronoun he.*-falls within the scope of the universal quantifier and refers to each of the indi^" viduals being quantified over. This is called the bound variable interpretation of «| the pronoun. Such readings can appear in many constructs, such as " Scoping and the Interpretation of Noun Phrases 369 A reflexive pronoun must refer to an NP that C-commands it and is in the same local domain. A nonreflexive pronoun cannot refer to a C-commanding NP within the same local domain. A nonpronominal NP cannot co-refer with an NP that C-commands it. Bound variable constraint: a nonreflexive pronoun may be bound to the variable of a universally quantified NP only if the NP C-commands the pronoun. Two co-referential noun phrases must agree in number, gender, and person. Figure 12.6 A summary of the binding constraints Every catj ate its; dinner. Each manj shot himself;. Generally, a bound variable interpretation must be C-commanded by the universal quantifier as follows: 4. Bound variable constraint: A nonreflexive pronoun may be bound to the variable of a universally quantified NP only if the NP C-commands the pronoun. Nonuniversally quantified expressions, such as the existentials, and especially the definite quantifier THE, do not require this constraint and may serve as the antecedent of a pronoun in any situation allowed by names and other referring terms. Figure 12.6 summarizes the set of principles that have been discussed. There are still some counterexamples to these constraints that indicate that further refinement is necessary to capture all the cases. For instance, the sentence He saw a book about himself. appears to violate the reflexive constraint, as the local domain for the reflexive pronoun himself would be the noun phrase the book about himself. Note that in the sentence He saw John's book about himself. the reflexive pronoun would co-refer with John and not the subject of the sentence. One possibility is to refine the definition of the local domain so that it includes only certain noun phrase forms, such as those with possessive determiners, and excludes noun phrases with "simple" determiner structures. Such strategies have been explored extensively in the linguistics literature. 370 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 371 BOX 12.1 Donkey Sentences Some more complex constructions appear not to follow the general patterns outlined in this chapter. These are often called the "donkey sentences," from a set of examples that was used to illustrate the problem. 
Consider the sentence Every man who owns a donkey; beats it;. The natural reading here involves it being a bound variable co-referring to a donkey, which itself ranges over a .set of values as it is within the scope of the uni- _; versally quantified NP every man. The problem arises because the pronoun does not seem to fall within the scope of the noun phrase to which it is bound. To see this, consider the unscoped logical form (PRES (BEAT [AGENT )>] [THEME dl])) If the quantifier for dl is lifted over ml, there will only be one donkey that every man owns, not the intended interpretation. But if dl stays within the scope of ml then the quantifier is trapped within the relative clause and thus shouldn't be available for the bound variable interpretation of it. These examples have motivated much work (for example, Kamp (1981) and Heim (1982)). Computing Co-Reference Constraints To compute the co-reference constraints, you need to know the set of constituents that C-command each constituent, and of these, which ones are in the local domain. With this information, the co-reference constraints can be generated for the different types of noun phrases. The co-reference constraints will be encoded as additional restrictions in the logical form, using three new predicates. EQ-SET states that the first argument must be equal to one of the terms listed in the second argument. For example, (EQ-SET bl (gl dl)) asserts that bl must be equal to either gl or dl. NEQ-SET, on the other hand,; asserts that the first argument cannot be equal to any of the terms listed in the second argument. For example, (NEQ-SET bl(gldl)) asserts that bl cannot be equal to gl or to dl. Finally, the predicate BV-SET lists the C-commanding noun phrases, which will include all the possible bound variable interpretations for nonreflexive pronominal forms. (The possible bound variable interpretations for reflexive forms will already be included in the EQ-s SET for the pronoun.) —i These predicates serve to place restrictions in co-reference relations but do not completely define the possibilities. With nonreflexive pronouns, for instance, we only state what noun phrases the pronoun cannot co-refer with based on the syntactic structure. In particular, other intrasentential antecedents may be eliminated because of agreement restrictions. Since the antecedent may follow the pronoun, there Is not an easy way to generate all the possible intrasentential referents while parsing. Furthermore, since the agreement check needs to be done for discourse readings as well, it seems best to leave these cases for subsequent contextual processing. But pronouns can take a bound variable reading only when it is explicitly allowed in the restrictions. For example, the following are the logical forms that should be generated for the noun phrases in the sentence Every boy thought he saw him: Every boy: he: (PRO hi (& (HE1 hi) (BV-SET hi (bl)))) him: (PRO h2 (& (HE1 h2) (NEQ-SET h2 (hi)) (BV-SET h2(bl)))) In other words, hi may have a bound variable interpretation, co-referring with bl, or it may have a discourse intersentential reading. Pronoun h2 may also take a bound variable reading or a discourse reading, but in no case can it co-refer with hi. These restrictions have successfully captured the binding constraints imposed by the syntactic structure of the sentence. As a final example, consider the sentence The woman who owns every record bought it for herself. 
The reflexive pronoun must refer to the woman, but the pronoun it cannot have a bound variable reading with every record because the quantified phrase does not C-command it. The logical forms are as follows: every record: The woman who owns every record: ))> it: (PROil (IT1 il)) herself: (PRO hi (& (SHE1 hi) (EQ-SET hi (il wl)))) Note that there are no restrictions placed on il. It cannot, however, co-refer with rl because this would create a bound variable reading, which is not allowed unless explicitly stated in the restrictions. Note also that hi must co-refer with either il or wl. If the parser checked agreement restrictions, il could be eliminated here and hi would have to co-refer with wl. It is relatively straightforward to modify a grammar to keep track of the C-command relationships and local domains using two new features: CC, the discourse variables of all C-commanding constituents not in the current local domain; and LCC, the variables of the C-commanding constituents in the local domain. 372 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 373 The equations can be derived by considering two observations: • Since S and NP structures define new local domains, they "reset" the LCC variable. The values in the LCC feature become part of the CC value for the subconstituents. • No LCC or CC values generated within a subconstituent of an S or NP need to be passed back up outside the constituent. 12.5 Adjective Phrases Adjectives generally introduce restrictions on the object referred to in the NP, but in many cases the analysis is somewhat complex. Adjectives can be classified into two classes: intersective adjectives and nonintersective adjectives. Inter-sective adjectives are so called because they can be analyzed in terms of set intersection. For example, to generate the set of green balls, you could intersect the set of balls with the set of green objects. Such adjectives can be treated as predicates that are added as restrictions to the noun phrase. For example, the logical form for the green ball would be There are several classes of nonintersective adjectives. Examples of the first class are large, slow, and bright, which are relative to the noun phrase being modified. For instance, a slow dolphin travels considerably faster than a fast snail. Similarly, a person, when considered as a college professor, might he seen as a fast runner, but when considered as an athlete, might be seen as slow. Thus you cannot simply model slow as a one-place predicate true of all objects that are slow in some absolute sense. These constructs will be handled in the logical form by using predicate modifiers, which have been used once before to handle plural nouns. A predicate modifier is a function from one predicate to a new predicate. Semantically, these are functions from sets to sets. If DOLPHIN 1 is a predicate representing the set of dolphins, then SLOW1 takes a set such as DOLPHIN 1 and produces a subset consisting of those dolphins that are slow (relative to the set oi dolphins). Because they always produce a subset, these adjectives are often called subsective adjectives. Using the notation developed in Chapter 8 for predicate modifiers, the logical form for the slow dolphin would be A small grammar showing how intersective and subsective adjectives can be treated is shown in Grammar 12.7. 
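To make the distinction concrete before turning to the grammar rules, the following Python sketch treats an intersective adjective as a set to be intersected with the noun's extension, and a subsective adjective as a function from a set to one of its subsets; the extensions and the "slow" selection are invented for illustration:

# Invented extensions for BALL1, GREEN1, and DOLPHIN1.
BALL1    = {"b1", "b2", "b3"}
GREEN1   = {"b2", "b3", "t1"}        # intersective: just another set
DOLPHIN1 = {"d1", "d2", "d3"}

def intersective(adj_ext, noun_ext):
    # "green ball": intersect the adjective's set with the noun's set.
    return adj_ext & noun_ext

SLOW_RELATIVE_TO = {frozenset(DOLPHIN1): {"d3"}}   # stand-in domain knowledge

def SLOW1(noun_ext):
    # Subsective: a predicate modifier mapping a set to a subset of it,
    # relative to that set (a slow dolphin is slow *for a dolphin*).
    return SLOW_RELATIVE_TO.get(frozenset(noun_ext), set()) & noun_ext

print(intersective(GREEN1, BALL1))   # the green balls: {'b2', 'b3'} (order may vary)
print(SLOW1(DOLPHIN1))               # always a subset of DOLPHIN1: {'d3'}

Grammar 12.7 simply selects between these two modes of combination based on the adjective's type.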
The lexical entry for slow would be (ADJ ROOT slow ATYPE SUBSECTIVE SEM SLOW1) Rule 2 would then produce the following SEM for the CNP slow dolphin: (SI.OW1 DOLPHIN 1) which could be used in a standard NP rule to build this SEM for the slow dolphin 1. (CNP SEM (X x (& (?semadjp x) (?semcnp x)))j -» (ADJP ATYPE INTERSECTIVE SEM ?semadjp) (CNP SEM ?semcnp) 2. (CNP SEM (?semadjp ?semcnp)) -» (ADJP ATYPE SUBSECTIVE SEM ?semadjp) (CNP SEM ?semcnp) 3. (ADJP ATYPE ?a SEM ?sem) -» (ADJ ATYPE ?a SEM ?sem) Grammar 12.7 Some interpretation rules for adjectival modification The adjectives in the next class of nonintersective adjectives do not have a simple relation to the sets that they modify. These include adjectives such as average, toy, and alleged. An average grade may not be an actual grade, a toy gun is not really a gun, and an alleged murderer may not be a murderer. Rather, each of these modifiers relates to the property it modifies in a different way. As another example, the adjective former, modifying a property like astronaut, would produce a set of individuals who were once astronauts but are not now. The adjective toy modifying gun would produce a set of objects that are made to look or act like guns. An adequate semantics for these constructs is beyond the scope of this book, but we will consider one class of particular interest for computational applications such as a database query system. These adjectives define a quantity that is the result of a computation over the modified set, like average. To handle such constructs, an explicit set construction operator must be introduced into the logical form. This can be defined as follows: (SET (X x Px)) is the set of all objects that satisfy the predicate (X x Px) Given this, the rule for adjectives like average are handled as follows: (CNP SEM (?semadj (SET ?semcnp))) -> (ADJ ATYPE SET-PROPERTY SEM ?semadj) (CNP SEM ?semcnp) Given this rule, the logical form for the NP the average grade in the class is ))))) al)> Adjective phrases may also have modifiers and complements, although complements to adjectives when used as prenominal modifiers is quite rare. But adjective phrases such as light red arc common. Such adjectival modifiers can be treated as predicate modifiers and handled without much problem. Thus the rule for simple adjective modifiers would be (ADJP SEM (?semadv ?scmadjp)) -> (ADV ATYPE PREMOD SEM ?semadv) (ADJP SEM?semadjp) 374 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 375 If the SEM for the adjectival modifier light is LIGHT 1, then the logic form of the adjective phrase light red would be the predicate (LIGHT 1 RED1) This adjective phrase would then combine, with other constituents as usual. Fm example, the logical form for the light red bus would be Comparatives A class of adjective phrases important for computational applications is theZ comparative forms. These are complex constructions, and only the simple cases' will be considered here. Comparative adjective constructions appear in many; forms, such as He is larger than the boy is. He is larger than the boy. The more brilliant woman won the prize. A woman more brilliant than the first speaker has won the prize. I am happier than I have ever been. These cases should be distinguished from constructs such as I own more horses than George does. Jack gave as much money as he could. which are also called comparative constructs but are noun phrases in which the more ... than and as much ... as constructs are better treated as quantifiers than adjective phrases. 
These cases are considered in Box 12.2. Comparative adjective phrase constructs all implicitly identify some scale by which different objects can be compared. For example, happier identifies a scale of happiness, in which the adjective happy identifies a range at one end of the scale and sad a range at the other end. An object X is happier than an object Y only if X is higher on the scale (that is, closer to happy) than Y is. Note tbof being happier does not entail being happy. Both X and Y might be sad, but \ could be happier than Y. Of course, the adjectives may be ambiguous as to Lie actual scale used. For instance, the sentence Canada is larger than the U.S. true with respect to the area scale but false with respect to the population scale Different comparatives may identify the same scale. For example, larger and smaller refer to the same scale but differ in what direction on the scale the objects are being compared. With this background we define comparative adjectives as being binary predicates that order two objects on a specific scale. Given a scale S, the predicate (MORE S) is a binary predicate that is true when the first argument is greale: than the second on scale S. The SEM of heavier would be (MORE WEIGHT SCALE), the SEM of higher would be (MORE HEIGHT-SCALE), and the SEM of happier would be (MORE HAPPY-SCALE). Many comparative constructs take a complement that uses the complementizer than followed by an S/ADJP or an NP. A rule for the NP case would be (ADJP SEM Xx (?semadjp x ?semnp)) -» (ADJP ATYPE COMPARATIVE SEM ?semadjp) (THAN) (NP SEM ?semnp) in which (ADJP ATYPE COMPARATIVE) would be defined to include constructs such as more brilliant and happier. Given this, the logical form of the adjective phrase happier than John would be X x . ((MORE HAPPY-SCALE) x (NAME jl "John")) 12.6 Relational Nouns and Nominalizations Nouns like sister, boss, and author are quite different from other common nouns like dog or idea. In particular, they do not refer directly to a type of individual; rather, they are defined in terms of a relationship to other individuals. For example, it makes little sense to talk of the denotation of sister as the set of all people who are sisters of someone else. Rather, each time the noun sister is used, it is explicitly or implicitly defined in terms of some person whose sister we are considering. Standard constructions indicate this object, such as the possessive, as in John's sister, and a PP complement using of, as in the author of this book. This suggests that a common set of subcategorizations and semantic interpretation rules should be defined for such nouns. One approach is to define each relational noun by a nonrelational type and a binary relation that uses a subcategorized NP. For example, the meaning of author might be X b X p (& (PERSON p) (AUTHOR-OF b p)) where the first variable is bound to the book involved, and the second is the person who is the author. This could be used with a rule like rule 1 in Grammar 12.8 to give the following logical form to the CNP author of the book: X p (& (PERSON p) (AUTHOR-OF p)) The possessive form can be treated similarly by a special rule for relational nouns that overrides the standard rule for the possessive described earlier. In this case some feature would have to be introduced to indicate a subcategorized possessive NP, which would then have as its SEM the SEM of the NP embedded in it, just as subcategorized PPs took their SEM from their NP complement. 
If we use the PRED feature for this, then the rules for the possessive with relational nouns would be rules 2 and 3 in Grammar 12.8. The grammar also needs to 376 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 377 BOX 12.2 Numerical Quantifiers and More Numerical quantifiers play an important role in database query applications. For the most part they are straightforward, but there are some complexities that need to" be pointed out. Numerical quantifiers fall into the existential class of quantifiers and appear as in the following noun phrases: Three men rowed the boat. At least three men rowed the boat. More than three men rowed the boat These are handled by allowing more complex quantifiers such as (EXACTLY 3), : (AT-LEAST 3), (MORE-THAN 3), (LESS-THAN 3), and so on. Thus the logical form of the noun phrase Three men would be <(EXACTLY 3) ml (PLUR MAN1)> Noun phrases such as more men than were at the concert and less money than I expected are more complex. One treatment is to assume that the determiner structure is discontinuous; for example, the quantifier in the first example is more ... than were at the concert. The interpretation uses the operator MORE-THAN just defined, but in this case the cardinality of a set is specified as its argument rather than a simple number. Thus the logical form of more men than were at the concert would be something like <(MORE-THAN (SIZE (X x (& ((PLUR MAN1) x) (AT x )))) ml (PLUR MAN 1))> handle cases where there is no subcategorized constituent present, as in the noun phrase the author. To handle this case, we would introduce an anaphoric element to stand for the missing object. This is done in rule 4, where REL-N1 is a nev sense for this form of inserted anaphoric element. Given this grammar and the meaning of author above, the noun phrase thi book's author would have the logical form al))> whereas the noun phrase the author would have the logical form Words like murderer, actor, and thief, often ending in the suffixes -er oi -or, also fall into this class and identify people by the acts they perform. Thus a murderer performs one or more murders, an actor acts, and a thief steals things There are other nouns that also identify cases other than the AGENT case: Foi instance, the noun recipient indicates the TO case in transfer acts, donation _ identifies the THEME case in a donating act, and in general words with the suffix -ee, such as addressee and rentee, indicate cases filled by animate objects that arc not the AGENT in two-person interactions. These nouns also subcategorize foi 1. (CNP RELATIONAL - SEM (?semcnp ?sempp)) -> (CNP RELATIONAL + SEM ?semcnp) (PP PRED - PTYPE of SEM ?sempp) 2. (NP POSS + PRED - SEM ?sem) -> (NP POSS - SEM ?semp) 's 3. (NP RELATIONAL - VAR ?v SEM ) -» (NP POSS + PRED - SEM ?semnp) (CNP RELATIONAL + SEM ?semcnp) 4. (CNP RELATIONAL - SEM (?cnp (PRO x REL-N1))) -> (C2VP RELATIONAL + SEM ?cnp) Grammar 12.8 Rules for simple relational noun phrases constituents that identify other cases in the action. For example, in the murderer of John, the NP John intuitively fills the THEME case of the murder act. Thus an analysis of murderer in terms of the action of murdering seems to capture many generalities. Specifically, the meaning of words ending in -er could be systematically derived from the verb forms. 
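One way to picture this derivation is a lexicon in which the -er noun is built directly from the verb's semantic roles: the noun's referent fills one role, and the subcategorized of complement fills another. The sketch below is illustrative only; the entries and role inventory are assumptions, not the book's lexicon format:

# Hypothetical verb entries using the case labels from earlier chapters.
VERBS = {
    "murder": {"pred": "MURDER1", "roles": ["AGENT", "THEME"]},
    "own":    {"pred": "OWN1",    "roles": ["EXPERIENCER", "THEME"]},
}

def er_noun(verb, noun_role="AGENT"):
    # Derive the relational-noun meaning of an '-er' nominal: its referent
    # fills noun_role; the 'of' complement fills the remaining role.
    entry = VERBS[verb]
    remaining = [r for r in entry["roles"] if r != noun_role]
    return {"pred": entry["pred"], "noun_role": noun_role, "of_role": remaining[0]}

print(er_noun("murder"))
# {'pred': 'MURDER1', 'noun_role': 'AGENT', 'of_role': 'THEME'}  -- the murderer of John
print(er_noun("own", noun_role="EXPERIENCER"))
# owner: derived from a verb with no AGENT role (the role chosen here is an assumption)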
For example, assume that the verb murder has the SEM X o X a (MURDER 1 * [AGENT a] [THEME o]) This same form can be used for the relational noun murderer and, given the rules in Grammar 12.8, the logical form for the murderer of John would be Not all -er forms occur with action verbs with agents. For instance, owner would be defined in terms of the verb own, which does not have an agent role. The lexical entries for verbs would be extended to define the appropriate interpretation of the -er form. Another important class of nouns includes the nominalizations of verbs themselves, such as destruction, that can refer to either an act of destroying or to the result of a destroying act. Such nouns also allow PP complements with of, as in the noun phrase the destruction of the city, which would indicate the THEME role. Similarly, an AGENT case can be introduced using the preposition by, as in the destruction of the city by the Huns, which has one analysis that refers to a particular destroying act: ] [THEME ])> The possessive form can signify either the AGENT case (the Huns' destruction) or the THEME case (the city's destruction). 378 CHAPTER 12 12.7 Other Problems in Semantics This section briefly describes a few additional problems in semantic interpretation that are important but cannot be considered in detail in this book: mass terms, generics, intensional operators, and noun-noun modification. A thorough discussion of these issues would easily take another chapter or two. Mass Terms Nearly all the noun phrases used so far in the examples have been count noun phrases, such as the dog, a few clowns, and so on. These are called count noun phrases because they refer to specific individuals that can be counted. Thus you can have one dog, three dogs, or more dogs than someone else has. Singular noun phrases involving count nouns refer to individuals, while plural ones refer to sets. In contrast, mass noun phrases such as some water and the hot sand do not refer to an individual or to sets but rather to substances that occur in quantities. Mass terms are measurable in terms of units of measure, as in the noun phrases three quarts of water or five pounds of sand. Mass nouns do not occur in the plural form without a change of meaning and cannot be used in noun phrases with the quantifiers each and every. It is important to note that, while nouns are classified as either count nouns or mass nouns, most can be coerced into changing forms by the noun phrases in which they appear; For example, since English allows only singular bare noun phrases (such as sand, very clean water) in the mass interpretation, if you use a count noun in such a construct, you force a change of meaning into a mass term. If you say We ate deer for dinner, then the NP deer is a mass term, presumably referring to deer meat. Conversely, while water is a mass noun, the NP many, waters forces a count interpretation, presumably to a set of lakes. Generics Another complicated form of sentence involves the generic usage. Sentences involving generics make general claims about the world, as in the following sentences that can all be used to make a claim about lions in general: Lions are dangerous. The lion is a dangerous animal. A lion can be dangerous. In the generic interpretation the noun phrases neither refer to individuals nor quantify over a set of individuals. The sentence Lions are dangerous can be true even if there are some tame lions. 
In fact, a generic statement may be true even if most objects in the class do not have the property. For instance, Carlson I l'ri>i points out that Sea turtles lay approximately 100 eggs may be true even if most sea turtles are male and lay no eggs at all. Scoping and the Interpretation of Noun Phrases 379 1 Rather, generics seem to make claims about the type or kind of object being described. The discussion of how you may reason with such facts will be delayed until later chapters dealing with knowledge representation and reasoning. For the moment, however, we need to introduce the distinction into the logical form language. Although there are several different theories on how to do this, most involve introducing the notion of a kind into the ontology. A kind is separate from an individual or set of individuals and refers to the species, substance, or type of individuals. This allows kinds to have certain properties without committing to whether the individuals of that kind have the properties or not. Of course, it is often the case that if a kind has a property, all or most of the individuals of that kind also have the property. For example, if we know that Dogs have four legs, we can infer that most individual dogs have four legs; however, this doesn't exclude the possibility that there is a three-legged dog. Kinds may also have properties that do not hold of their individuals. For instance, saying Dodos are extinct does not mean that any individual dodo is extinct. Rather, being extinct is a property of the kind (in this case species) denoted by dodos. Note that all bare plural noun phrases are not necessarily treated as generics. A common distinction is found in sentences such as Large dogs have large noses. The property being asserted of the kind large dogs is not that each one has a number of large noses (a set reading), or that they possess the substance nose (the generic reading), but that they have a particular nose (an existential reading). Thus bare plural noun phrases may be interpreted differently depending on their functions. Many other syntactic factors can indicate generic readings besides the bare plural forms. In fact, as shown at the beginning of this section, generic sentences are possible using singular definite and indefinite noun phrases as well. Another common indicator of generic sentences is the use of the simple present tense. For instance, consider the difference between Large dogs run over the hill and Large dogs ran over the hill. The first is most likely a generic statement, while the second is most likely a description of a particular event. However, both of these could occur in the generic or nongeneric readings. Temporal modifiers that indicate extended duration also indicate the generic, as in Large dogs ran over the hill in those days, which is a generic statement about the past. Other adverbials, such as often and rarely, are also strong indicators of generic readings. Intensional Operators and Scoping A verb such as believe or want that takes a finite clause as a complement will interact with definite and indefinite descriptions in complicated ways. The sentence Jack believes a man found the ring, for example, is ambiguous between the case in which Jack knows of a particular man whom he believes found the ring and the case in which all Jack believes is that the ring was found by some man. These ambiguities are often captured by scoping differences. To see this, you must examine the logical form of such sentences in more detail. 
380 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 381 As described in Chapter 8, the semantics of such verbs is complex becauv they do not support the normal inference rule of substituting equals. For example, if John is the tallest man, and you are given the fact that John kissed Sue. you can conclude that the tallest man kissed Sue. A similar inference, however, does not work for belief sentences. Given that Sam believes John kissed Sue, you eanno. conclude that Sam believes the tallest man kissed Sue, since Sam may not know-that John is the tallest man. Worse yet, he may believe that George is the tallesi Because equal terms cannot be substituted for each other in the scope of such -operators, they are often called referentially opaque, while normal predicates are said to be referentially transparent. Opaque operators introduce a new form of scoping ambiguity: A quantifier can be inside the scope of the operator as well as being outside its scope. This form of scoping can be used to capture the ambiguities in John believes a man kissed Sue, which has the unscoped logical form (PRES (BELIEVE bl LEXPERIENCER (NAME jl "John")] [THEME (PAST (KISS kl (NAME si 'Sue")))])) The first reading involves John knowing who the man is; it could be para- ' phrased by There is a particular man whom John believes kissed Sue. Its form is ■-■ (Aml: (MAN ml) (PRES (BELIEVE bl [EXPERIENCER (NAME jl "John")] [THEME (PAST (KISS kl ml (NAME si "Sue")))]))) This involves what is often called a de re belief, that is, a belief about a particular man. The second case involves John simply believing that some man kissed Sue. though he doesn't know who. It would have the following form: (PRES (BELIEVE bl [EXPERIENCER (NAME jl "John")] [THEME (PAST (A ml: (MAN ml) (KISS kl ml (NAME si "Sue"))))])) This involves a de dicto belief, that is, belief about a proposition. There are many other examples of this phenomenon, some of which can be quite complicated. For example, some time ago I was at a party and someone walked into the kitchen and said I'm waiting for a phone call. This expression later led to some confusion because I had understood it in the usual interpretation that he was waiting for a call from someone in particular. It turned out, however, that the speaker simply wanted to hear the phone ring, so any call would do! Interestingly enough, the intent of the original sentence could have been conveyed unambiguously by the rather unusual sentence I'm waiting for any phone-= call. But this sentence is unusual simply because the contexts in which it is " appropriate are very rare. Noun-Noun Modifiers NPs with noun-noun modifiers such as car paint, soup pot handle, water glass, and computer evaluation are notoriously difficult to analyze. The syntactic form provides little guidance, and general semantic constraints like those used for analyzing multiple, adjectives do not seem to be present. There are two main problems. One is determining what modifies what, which is similar to the problem with multiple adjectives. The other problem is detecting the semantic relationship between the modifying noun and the modified noun. There are several common patterns on how the nouns are related. Some common relationships for an NP consisting of a sequence of two nouns, X and Y, are • Y is a subpart of X (pot handles). • Y is used for some activity involving X (car paint). • Y is made out of material X (stone wall). 
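A practical interpreter can do little more than enumerate candidate relations of the kinds just listed and, failing a match, record that some unspecified relation holds. A minimal sketch in Python, with lexical pairings invented for illustration:

# Invented pairings for the three relation patterns listed above.
SUBPART_OF = {("pot", "handle")}
USED_FOR   = {("car", "paint")}
MADE_OF    = {("stone", "wall")}

def nn_relations(modifier, head):
    # Return candidate relations for a noun-noun compound; if none of the
    # known patterns applies, fall back to an underspecified relation to be
    # resolved later from context.
    candidates = []
    if (modifier, head) in SUBPART_OF: candidates.append("SUBPART-OF")
    if (modifier, head) in USED_FOR:   candidates.append("USED-FOR")
    if (modifier, head) in MADE_OF:    candidates.append("MADE-OF")
    return candidates or ["UNSPECIFIED-RELATION"]

print(nn_relations("pot", "handle"))   # ['SUBPART-OF']
print(nn_relations("smoke", "box"))    # ['UNSPECIFIED-RELATION']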
Thus a pot handle is a handle that is part of a pot, car paint is paint used to paint cars, and a stone wall is a wall made out of stone. Nouns that have associated cases, or that identify the case of a particular verb, can be analyzed in terms of those cases. For example, with the NP the contest winner, the noun winner identifies the AGENT case of a WIN action, and contest can be seen to fit the THEME case. Thus this NP might be analyzed as ))> The NP the car evaluation, on the other hand, refers to an evaluation event directly, and car is seen to fit the THEME case. If the modifier can fill more than one case, the NP will be ambiguous. For example, the computer evaluation could be either the evaluation of the computer or the evaluation by the computer. In the final analysis many noun-noun modifications appear not to be resolvable by strategies based on reasoning, but rather are learned directly from use in context. In the logical form, unless there is strong evidence for a more particular relationship, an ambiguous predicate modifier N-N-MOD is used to capture the fact that there is some relationship, but it cannot be determined without context. For example, a noun phrase such as the smoke box, which has no obvious modification relation, would have the logical form and it would be left for contextual interpretation to determine the type of the object being described and its other properties. Summary This chapter considered many issues that were ignored in the developments in the earlier chapters. Many of these issues concern the scoping of operators and constraints on reference. We showed that the choice of quantifier and syntactic 382 chapter 12 Scoping and the Interpretation of Noun Phrases 383 structure can greatly affect the allowed scopings and the preferred interpretation of sentences. In addition, there are significant structural constraints on intra-sentential co-reference between pronouns and other referring expressions. Thel chapter also considered adjectival modification and showed that you must dis- = tinguish between several different classes of adjectives to obtain appropriate^ interpretations. Finally, we considered nouns that exhibit a complex semantic ~~ structure, defining them in terms of relations or in terms of related verb forms. Related Work and Further Readings_ Determining the scoping of quantifiers has been a central concern for those building natural language database query systems, and many of the heuristics =i3 described in this book have been used in work by Woods (1978), Pereira (1983), Saint-Dizier (1985), Grosz et al. (1987), and Alshawi (1992). VanLehn (1978) performed a set of experiments to test the preferences that have been suggested in --i the literature. While humans displayed clear preferences in some cases, many times it appeared that contextual effects dominated the process. Hobbs and Shieber (1987) describe an algorithm for generating all possible scopings, eliminating only those that are linguistically unacceptable in any context. More general treatments of scoping, including adverbs and sentential operators, have been examined by McCord (1986). One of the first systems to use quantificational structures separate from the semantic form during semantic interpretation was the LUNAR system (Woods, 1978) within the ATN framework, and it has been used in many approaches since. Cooper developed a theory within formal semantics using a technique called quantifier storage that uses the same intuitions. 
Pereira and Pollack (1991) provide an interesting formalization of this type of approach based on storing and '■ discharge assumptions about the discourse. There is little computational work specifically aimed at dealing with intrasentential anaphora. One of the few algorithms in the literature is that described by Hobbs (1978). This algorithm effectively uses the constraints in this chapter to eliminate candidates based on retlexivity constraints, and then imposes a heuristic ordering on the remaining candidates, roughly based on the proximity of the antecedent to the pronoun (that is, the closest antecedent is selected). There is a considerable literature on quantification, anaphora, and binding theory in linguistics. The best place to start is a textbook such as Chierchia and McConnell-Ginet (1990, Chapter 3). May (1985) is a good example of quantifier scoping using syntactic transformations (which he calls quantifier raising). Reinhart (1983) was one of the first to develop an analysis of anaphora based on binding constraints. A more general reference on the underlying linguistic theory. Government and Binding, can be found in Chomsky (1981). Computational work using this approach can be found in Berwick and Weinberg (1984) and Stabler_-(1992). A general discussion of the formal semantics of adjective phrases can be found in Chierchia and McConnell-Ginet (1990). More detailed discussions can be found in Kamp (1975) and Klein (1980). There is a good general discussion of comparative adjectives in McCawly (1993). Computational accounts of comparatives can be found in Ballard (1988) and Alshawi (1992). Relational no'uns are discussed in de Bruin and Scha (1988). Noun-noun modification has been considered by Finin (1980). Hobbs et al. (1993) use a vague predicate NN to encode noun-noun modification relationship, which is later resolved using contextual information. The notion of kinds and their use in generic sentences was introduced by Carlson (1979; 1982), and there has been considerable work in this area since. A good survey of several approaches can be found in Schubert and Pelletier (1987). There is an excellent collection on mass nouns in Pelletier (1979), and a good overview of some of these approaches to mass nouns and generics in McCawley (1993). Another important area of semantic interpretation not discussed here is the interpretation of phrases (mainly prepositional phrases) describing locations. A good reference for this area is Herskovits (1986). Exercises for Chapter 12__ 1. (easy) Specify all logically distinct interpretations of each of the following sentences by considering scoping ambiguity. If two different scopings give the same interpretation, however, list only one. Each man gave a boy a present. Some man from a club stole every present. Most people from all companies ate some peaches. 2. (easy) Specify all the possible antecedents for the pronouns with indices i or j in the following sentences, considering all the constraints specified in Section 12.4. Jack i realized that he [ would see Sanvj at the party3. After Jacki left the roonrj, Sam3 saw him; run to hisj car4. Even though Sami apologized, Jack2 wants to kill hinii. 3. (easy) Classify each of the following adjectives as intersective, subsective, or neither. Justify your answers. Give a plausible logical form for the object noun phrase in the corresponding sentences. 
fake I bought a fake turtle, sluggish I bought a sluggish turtle, expensive I bought an expensive turtle, dead I bought a dead turtle, smelly I bought a smelly turtle. 384 CHAPTER 12 Scoping and the Interpretation of Noun Phrases 385 4. (medium) Draw out the parse trees to the level of detail shown in Figure -12.3 for the sentences . . . . When does the flight to each city leave? When did the flight that each man took leave? using Grammars 12.2 and 12.4 and the scoping strategy described ins Section 12.3. 5. (medium) Give the semantic analysis of each noun phrase, including co-reference constraints, produced for the following sentences. Jack saw the man beside him. Sue saw Jill's book about herself. Every man gave his book to him. 6. (medium) a Write a program that, given a logical form, generates all possible quantifier scopings consistent with the constraints described in Section 12.1. Thus, given a representation of the following as input. (PRESfSITsl [AGENT ] [AT-LOC ))>])) the program would output (ml (kl hi)) ((kl hi) ml) (ml hi kl) (hlmlkl) (hi kl ml) Construct a set of test cases that demonstrate your algorithm's capabilities. b. Extend your program to eliminate logically redundant scopings to produce a smaller set of interpretations. Identify the equivalences you use to identify redundancy and justify that the alternate forms are really logically equivalent. 7. (medium) Consider the unscoped logical form for the sentence In every house, a man fixed his sink. (PAST (FIX1 fl [AGENT ] [THEME ] [AT-LOC ])) Draw the syntactic tree for this sentence, and identify the possible antecedents for each of the pronominal forms. What are the actual co-reference relations for the most intuitive reading of this sentence? Is the definite description treated referentially or existentially? 8. (medium) Section 12.5 discussed the use of adjectives, like average, that are functional on some set. Average can also be a noun, as in the class average or the average of the best students. Give an analysis of the properties of such nouns, and develop a set of semantic interpretation rules for them. PART III Context and World Knowledge 388 part in Part III concerns contextual processing of language. Topics range from knowledge representation and reasoning to discourse structure and pragmatics. The term discourse will be used to refer to any form of multi-sentence or multi-utterance language. The most important forms of discourse in computationnl applications are text, such as articles and books, and dialogue, which involves,i conversation between two or more conversants, whether spoken or by keyboard. Many discourse models apply equally to written and spoken language, so tlv words sentence and utterance will be used interchangeably and the terms speakei and hearer will be used for writer and reader, respectively. It will be useful to divide the notion of context into the situational context and the discourse context. Consider a conversation. The situational mnuvi consists of the specific setting, which includes the time and location and the objects visible to each conversant, and the general setting, which involves the conversants' background knowledge of the world. The discourse context consisK of the local discourse context (or just local context), which is detailed information about the preceding sentence, and the global discourse contcxi. which is information about the overall conversation, including what topics ate being addressed. 
The following table gives an example of some contextual information during a conversation between two agents, John and Mary, just after Mary says My coffee is cold. Specific/Local Context General/Global Context Situational Context Mary was the speaker. John was the hearer. It is 3 o'clock. They are in the student union. They each have a cup of coffee on the table between them. ... They are in the USA. The President is Bill Clinton. There is a single moon that rotates around the earth. Birds typically can fly. ... Discourse Context The last sentence parse tree is (S (NP my coffee) (VP is cold)). My coffee refers to the cup of coffee next to Mary. Mary asserted that the coffee is cold. John and Mary have been discussing a project for the physics class they arc working on. They then began discussing the quality of the food at union. Chapter 13 provides an introduction to the issues in knowledge representation and reasoning, supplying a foundation for the discussions of how knowledge and reasoning affect the interpretation of language. Chapter 14 addresses a central issue in contextual interpretation, namely the identification ol the referent of noun phrases such as pronouns and definite descriptions, primarily using the local discourse context. Chapter 15 addresses issues in the situational context, exploring methods of using knowledge of the world and the specific situation to interpret language. Chapter 16 addresses issues relevant to the global discourse context, focusing on identifying the structure of the discourse itselj Finally, Chapter 17 discusses issues in specifying a conversational agent, looking at how the underlying intentions of speakers can be identified from iheii utterances, and how intentions can be realized by linguistic means. CHAPTER Knowledge Representation and Reasoning . V 13.1 13.2 13.3 13.4 * o 13.5 3 o 13.6 1 ■ o 13.7 i o 13.8 Knowledge Representation and Reasoning 391 Many problems that arose in earlier chapters were not resolved because they required knowledge of context. Two important aspects of context are general knowledge about the world and specific knowledge about the situation in which the linguistic communication is occurring. To analyze these, you need a formalism for representing knowledge and reasoning. This area of study is called knowledge representation (KR). Knowledge representation means different things to different researchers. For some, knowledge representation concerns the structure of the language used to express the knowledge—whether it is in logic, semantic networks, frames, or other specially designed representational formalisms. For others, knowledge representation concerns the content of sentences—what predicates are needed and how are they organized. Both of these issues are important. Sometimes, what seems to be a vigorous debate about knowledge representation is actually the result of each of the debaters focusing on one of the aspects of representation without considering the concerns of the other. There is not the space here for an extended discussion of knowledge representation formalisms. Rather, the central concern is how knowledge and reasoning can be used to facilitate language understanding. Because of this, an abstracted representation based on the first-order predicate calculus will be used. This will enable the discussion of relevant representational issues with the minimum introduction of new material and notation. 
This does not mean that the underlying knowledge representation used in a system would have to directiy represent logical formulas or use theorem-proving techniques as a model of inference. The underlying representation system could be a semantic network, a description logic, a frame-based system, a connectionist model, or any other formalism, as long as the system has at least the expressive power described here. Many modern knowledge representation systems fulfill this requirement. Section 13.1 discusses some general issues in knowledge representation. The next two sections develop the abstract knowledge representation language that will be used throughout the rest of the book: Section 13.2 discusses a simple representation based on the FOPC, and Section 13.3 discusses a framelike representation as a way of clustering information and representing defaults about stereotypical objects and situations. Section 13.4 discusses an important issue in mapping the logical form language to the knowledge representation language, namely the representation of the complex quantifiers found in natural language. The rest of the chapter is optional but is important if your focus is knowledge representation. Section 13.5 examines the representation of temporal information in language, as revealed by tense and aspectual class. The remaining sections give examples of some different reasoning strategies that are used in knowledge representation systems. Section 13.6 discusses some basic concepts in automated deduction, which are used in systems based on deductive techniques as well as systems based on logic programming. Section 13.7 discusses an approach thai uses a procedural semantics and illustrates its use in a question-answering system. Section 13.8 discusses some issues in developing hybrid reasoning 392 CHAPTER 13 Knowledge Representation and Reasoning 393 systems, which allow different reasoning techniques to be used depending on what sort of information is required. This chapter assumes that the reader has a basic understanding of the first- -order predicate calculus, at least to the level introduced in Appendix A. 13.1 Knowledge Representation There are two forms of knowledge that are crucial in any knowledges representation system: general knowledge of the world and specific knowledge of ? the current situation. Some aspects of general world knowledge have already been considered, such as type hierarchies, part/whole relationships, and so on. This consists of information about general constraints on the world and the semantic definition of the terms in the language. For the most part, general knowledge is specified in terms of the types or kinds of objects in the world and does not concern information about specific individuals. For instance, it might encode that OWN! is a relation between people and objects but not than particular person, say John, owns a particular car. This latter information, knowledge about individuals, is equally important in language understanding and is a major component of what we call the specific setting of the sentence being understood. Virtually all knowledge representation systems support reasoning at both of these levels in one way or another. General world knowledge is essential for solving many language interpretation problems, one of the most important being disambiguation. 
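As a simple illustration of how such knowledge might be brought to bear, consider word-sense disambiguation with a selectional constraint checked against a type hierarchy. The hierarchy, senses, and constraint below are invented for the sketch:

# Toy type hierarchy (child -> parent) and a verb's selectional constraint.
ISA = {"bass-fish": "food", "bass-instrument": "artifact",
       "food": "physobj", "artifact": "physobj"}

def isa(t, ancestor):
    while t is not None:
        if t == ancestor:
            return True
        t = ISA.get(t)
    return False

def filter_object_senses(required_type, senses):
    # Keep only object senses compatible with the verb's constraint,
    # e.g. an eating event requires an edible THEME.
    return [s for s in senses if isa(s, required_type)]

print(filter_object_senses("food", ["bass-fish", "bass-instrument"]))
# ['bass-fish']  -- the instrument sense is ruled out by world knowledge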
For example, the proper attachment of the final PP in the following two sentences depends solely on the reader's background knowledge of the appropriate time needed for reading and for evolution: I read a story about evolution in ten minutes. I read a story about evolution in the last million years. Specific knowledge of the situation is important for many issues, including determining the referent of noun phrases and disambiguating word senses based on what makes sense in the current situation. We will sometimes talk informally of the knowledge representation as encoding the knowledge and beliefs of the understanding system. But since the terms knowledge and belief will be given more precise technical definitions later,, more neutral terminology is introduced. A knowledge representation consists of a database of sentences called the knowledge base (KB) and a set of inference techniques that can be used to derive new sentences given the current KB. A set of inference techniques is sound if it only derives true new sentences when the original sentences in the KB are all true. Not all useful inference techniques need]; be sound, however, as you will sec later. The language in which the sentences in the KB are defined is called lhe.,-5 knowledge representation language (KRL). The KRL could be the same as the.:; logical form language, but there are practical reasons why they often differ. The* -J* '.via two languages are driven by different needs. The logical form language must be very expressive in order to simplify the semantic interpretation process and to allow the effective resolution of ambiguity. The knowledge representation language, on the other hand, must support efficient and predictable reasoning within a specific domain. In other words, it should be relatively easy to define the set of sentences that can be inferred from a particular KB and to build computational models that perform such inferences in a reasonable amount of time. For example, consider the treatment of quantifiers. In the logical form language a wide range of quantifiers was introduced, closely corresponding to the different word senses of English quantifiers. This allows disambiguation techniques, such as those needed for scoping quantifiers, to use subtle differences between the actual quantifiers (say between each and every). In most current knowledge representation languages, however, there are usually only a few quantifiers and often only one—a construct allowing universal quantification. Thus inference processes can be easily defined. By keeping the languages separate and defining a mapping function between them, you can have the advantages of both. Of course, the success of this approach will depend on whether or not the mapping function can be effectively defined. Substantial research will be needed before the best way to satisfy the need for an expressive logical form and an effective knowledge representation can be determined. Given the current state of knowledge, maintaining a separate logical form and knowledge representation languages seems the best compromise. When a formula P must be true given the formulas in a KB, or given formulas representing the meaning of a sentence, then we say that the KB (or septence) entails P. Many of the conclusions that need to be drawn to understand language are not entailments, however, but are implications of the sentence. Implications are conclusions that can typically be drawn from a sentence but that could be explicitly denied in specific circumstances. 
For instance, the sentence Jack owns two cars entails that Jack owns a car (that is, this fact cannot be denied), but only implies that he doesn't own three cars, as you could continue by saying In fact, he owns three cars. KR systems must support both these forms of inference. Types of Inference Many different forms of inference are necessary to understand natural language. Inference techniques can be classified into deductive and nondeductive forms. Deductive forms of inference are justified by the logical notion of entailment. Given a set of facts, a deductive inference process will make only conclusions that logically follow from those facts. Nondeductive inference falls into several classes. Examples include inference techniques that involve learning generalities from examples (inductive inference) and techniques that involve inferring causes from effects (a form of abductive inference). 394 CHAPTER 13 Knowledge Representation and Reasoning 395 Abductive inference can be contrasted with deductive inference by= considering the axiom AdB Deductive inference would use this axiom to infer B when given A. AbductiveT inference would use it to infer A when given B, since A is a reason that B is irue: Many systems allow the use of default information. A default rule is an inference rule to which there may be exceptions; thus it is defeasible. If you. write default information using the notation A => B, then the default inference"-rule could be stated as follows: If A => B, and A is true, and -iB is not provable, then conclude B. It has been suggested that default rules may provide a good-account of generic sentences. For example, the meaning of the sentence Birds flyA could be represented by the FOPC formula V x BIRD(x) FLIES(x) This has the effect that whenever there is a bird B for which it is not provable that -•FL1ES(B), then it can be inferred that FLIES(B). In other words, a specific bird will be assumed to be able to fly unless it is explicitly stated that it cannot. Defeasible rules introduce a new set of complexities in a representation. Without such rules, most representations are monotonic, because adding new assertions only increases the number of formulas entailed. Specifically, in a monotonic representation, if the knowledge base KB1 entails a conclusion C. and if you add an additional formula to KB 1 to form a new consistent knowledge base KB2, then KB2 will also entail C. This is not true of a representation that uses default rules, and hence they are called nonmonotonic representations. For example, consider a knowledge base K consisting of the formulas Cat( Sampson) TabbyCat( Sampson) V c . Cat(c) => Purrsic) Sampson is a cat. Sampson is a tabby cat. Cats purr. Given this KB, you can conclude PurrsiSampson) using the default rule because" there is no information to contradict Purrs (S). On the other hand, if you add a new fact that no tabby cats purr, then the extended knowledge base would no longer entail that Sampson purrs. There are other useful techniques for introducing nonmonotonic conclusions besides default rules. For instance, the closed world assumption (CWA) asserts that the KB contains complete information about certain predicates. For example, for a predicate P for which the CWA holds, if a proposition involving P cannot be proven from a KB, then its negation is assumed to be true. Consider a database query application for airline schedules. 
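A toy illustration of the contrast, with a single rule (rain makes the grass wet) and invented facts; the propositions are assumptions made for the sketch:

def deduce(facts):
    # Deduction: given the rule and RAINED, conclude WET-GRASS (entailed).
    return {"WET-GRASS"} if "RAINED" in facts else set()

def abduce(observations):
    # Abduction: given the rule and the observation WET-GRASS, hypothesize
    # RAINED as a possible cause (plausible, not guaranteed).
    return {"RAINED"} if "WET-GRASS" in observations else set()

def induce(examples):
    # Induction: generalize from instances, e.g. every observed bird flew,
    # so hypothesize that birds fly.
    return "BIRD => FLIES" if examples and all(flies for _, flies in examples) else None

print(deduce({"RAINED"}))                            # {'WET-GRASS'}
print(abduce({"WET-GRASS"}))                         # {'RAINED'}
print(induce([("tweety", True), ("polly", True)]))   # BIRD => FLIES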
The KB stores information about flights that exist—say, that flight FDG100 flies from Rochester to Boston—but it doesn't explicitly contain negative information—say, that flight FDG100 doesn't fly to Chicago or that there is no flight FDG455. Such information can only be~' concluded if the inference process makes the closed world assumption on flights. BOX 13.1 A Semantics for Nonmonotonic Logic You can develop a model theoretic semantics for many nonmonotonic constructs using the concept of minimal models. For example, consider the closed world assumption for a predicate P. We can define an ordering on all the models of the KB as follows: ml 3 a . AGENT(x, a) & ANTMATE(a) 2. V a 3 x . ACTION(x) & AGENT(x, a) z> ANIMATE(a) 3. V x . OBJ/ACTION(x) 3 ACTION(x) 4. V x . OBJ/ACTION(x) d3o. THEME(x, o) & PHYSOBJ(o) 5. V o 3 x . OBJ/ACTION(x) & THEME(x, o) z> PHYSOBJ(o) Using these axioms, you can prove that the class OBJ/ACTION "inherits"' the AGENT role. In other words, for any object A such that OBJ/ACTlON(A) is true, you could prove that A has an agent role; that is, 3a.AGENT{A, a) & ANIMATE(a) '.J using axioms 3 and 1. I As described in Chapter 10, a procedural version of this would be a program that starts at the specified node OBJ/ACTION, finds all roles attached at that node, and then follows the S arc up to the supertype ACTION and finds all the roles attached there. The complete set of roles gathered by this procedure is the answer. Thus any OBJ/ACTION has an AGENT role inherited from the class ACTION. Both these techniques compute the same result, but the first does it by using deduction over logical formulas, while the second uses a program that performs a graph traversal. The first technique seems more rigorously defined, but the second is probably more efficient. In cases like this, in which you can prove that the two techniques obtain the same results, you can have the best of both approaches: a rigorously defined semantics and an efficient procedure to perform that form of inference. 13.2 A Representation Based on FOPC The KRL used in this book will be an extended version of the first-order predicate calculus. Note that by choosing the language, you are not committed to any particular form of inference. For example, later sections will show how the KRL can be used with both deductive and procedural inference techniques. The syntax of FOPC was introduced earlier and will not be presented again here. We will focus on the extensions to standard FOPC that are needed to represent the meaning of natural language sentences, and comment on the differences between this language and the logical form language. The terms of the language consist of constants (such as Johnl), functions (such as father{Johnl)), and variables (such as x and y). Note that the logical form language did not use constants. Rather, everything was expressed in terms of discourse variables to keep the representation context independent. In the KB, constants are used to represent the specific individuals. For example, the logical form term (NAME jl "John") represents the meaning of a phrase whose referent is named "John." The actual person referred to in a given context might be represented by the constant Johnl in the KB. It is convenient to use restricted quantification in the KRL, making it similar to the generalized quantifier notation in the logical form language. Restrictions follow the quantified variable separated by a colon. 
13.2 A Representation Based on FOPC

The KRL used in this book will be an extended version of the first-order predicate calculus. Note that by choosing the language, you are not committed to any particular form of inference. For example, later sections will show how the KRL can be used with both deductive and procedural inference techniques. The syntax of FOPC was introduced earlier and will not be presented again here. We will focus on the extensions to standard FOPC that are needed to represent the meaning of natural language sentences, and comment on the differences between this language and the logical form language.

The terms of the language consist of constants (such as John1), functions (such as father(John1)), and variables (such as x and y). Note that the logical form language did not use constants. Rather, everything was expressed in terms of discourse variables to keep the representation context independent. In the KB, constants are used to represent the specific individuals. For example, the logical form term (NAME j1 "John") represents the meaning of a phrase whose referent is named "John." The actual person referred to in a given context might be represented by the constant John1 in the KB.

It is convenient to use restricted quantification in the KRL, making it similar to the generalized quantifier notation in the logical form language. Restrictions follow the quantified variable, separated by a colon. As mentioned in Chapter 8, for the existential and universal quantifiers, this notation can be treated as an abbreviation and does not extend the expressive power of the language. Thus

∃ x : Man(x) Happy(x)    is equivalent to    ∃ x . Man(x) & Happy(x)

and

∀ x : Man(x) Happy(x)    is equivalent to    ∀ x . Man(x) ⊃ Happy(x)

We will also need the equality predicate, (a = b), which states that terms a and b have the same denotation. Given a simple proposition Pa involving a constant a, if (a = b) is true, then Pb must be true as well, where Pb is the same as Pa except that a has been replaced by b.

Many knowledge representation systems do not explicitly use quantifiers. They do include variables, however, which act like universally quantified variables with wide scope. For example, a formula such as (P ?x A) in a KB would correspond in meaning to the FOPC formula ∀ x . P(x, A). Existentially quantified variables are handled by a technique called skolemization, which replaces the variable with a new constant that has not been used before. For example, the formula ∃ y ∀ x . P(x, y) would be encoded in the KB in a formula such as (P ?x Sk1), where Sk1 is a new constant that has not been used before and that stands for the object that is known to exist. Quantifier scoping dependencies are indicated using new functions, called Skolem functions. For example, the formula ∀ y ∃ x . P(x, y) would be encoded as a formula such as (P (Sk2 ?y) ?y) in the KB, where Sk2 is a new function that produces a (potentially) different object for each value of ?y. Often, formulas will be written in a format merging these two approaches, where the universal quantifiers are still present but the existential variables have been skolemized. For example, the formula ∀ y ∃ x . P(x, y) may be written as ∀ y . P(Sk2(y), y). It can be proven that all these different forms of representation are equivalent.

Saying that the basic representation language is FOPC does not place many restrictions on the style of the representation. In particular, it says nothing about what the predicates are. There is a wide range of possibilities in selecting the predicates. At one end you could have a different predicate for each word sense, essentially the strategy used in the logical form language. At the other end you could have a preset set of predicates, called the primitives, and every word sense would have to be defined in terms of these primitives. Consider some of the advantages of each position. By allowing a predicate for each word sense, you are able to capture subtle differences between semantically close terms. For example, you might have information in the KB defining SAUNTERS1 as an action that involves walking slowly, using a manner that suggests a carefree state of mind. Thus the sentence Jack sauntered down the street might have different implications than Jack walked down the street. Of course, you pay for this power by having a wide range of predicates that for the most part have very similar axioms defining them. Specifically, the definitions for SAUNTERS1 and WALKS1 would overlap significantly. In an approach using primitives, on the other hand, both of these senses would be reduced to the predicate (or set of predicates) that captures the basic action, say MOVE-BY-FOOT. Inference rules are then defined only on the primitive predicates. This approach allows the commonalities between words to be captured very succinctly. Without inference rules on the word senses, however, it
is very difficult to capture the subtle distinctions between senses. Of course, voir" would have to define a new primitive to capture the distinction between saunter-g ing and walking, say a new primitive concerning state-of-mind. Sauntering mighty then be defined as MOVE-BY-FOOT and CAREFREE-STATE. The moftP complex the decomposition of the senses, however, the less advantageous the primitive representation becomes, because the number of primitives grows significantly, driven by examples. More crucially, inference rules would have to be based on complex clusters of primitives rather than single predicates, so the inference process is no longer so simply defined. As with many issues, there is considerable middle ground to be explored. Specifically, many of the advantages of a primitive-based representation can be captured by using type hierarchies. If you assert that SAUNTERS 1 and WALKS 1 are both subclasses of the more abstract action MOVE-BY-FOOT, then they could inherit most of their common properties from MOVE-BY-FOOTwithout the need for additional axioms. This still leaves you free, however, to add other axioms for SAUNTERS 1 to cover its special characteristics. This approach would also allow you to handle incomplete knowledge. Say the system only knows that SAUNTERS 1 is a type of walking. It doesn't know any additional information about the word but can still make most inferences required about it using information inherited from MOVE-BY-FOOT. In addition, it knows that SAUNTERS 1 is somehow different from WALKS 1, even though if doesn't know why. If, at a later stage, the system acquires additional knowledge about sauntering, this can be added incrementally. In addition to hierarchical relations, a knowledge representation should also be able to take advantage of other ways to define word senses. Sometimes a complete definition of a term is known. For example, you could define the predicate father as a male parent, that is, V x. FATHER(x) = 3y PARENT(x, y) & MALE(x) But most words are not so precisely defined. For instance, there is no set of properties that precisely defines most natural kinds, such as dogs, cats, chairs, and so on. These can be classified into type hierarchies, and axioms stating necessary conditions can be stated, but no absolute definition is possible. Viewed as FOPC axioms, this means that such definitions involve a one-way implication. For instance, an axiom for DOG 1 might be V x. DOG](x) 3 CANINE(x) & DOMESTIC-PET(x) where CANINE itself is defined as a type of MAMMAL, and so on. Such axioms capture much of the important properties of being a dog but do not define the concept completely. For instance, someone might have a pet wolf that satisfies all the properties of being a dog but still isn't a dog. From the point of view of generating sentences, the more the predicates in representation language are abstracted from the words in the language, the harder it is to produce sentences based on meanings. For instance, assume you are given the formula V p : ((MaleHuman p) & 3 c. Parent(p, c)). . MoveByCarip, LI) & Building (LI) & Used-for-teaching(Ll) 400 CHAPTER 13 Knowledge Representation and Reasoning 401 which has a natural realization as the sentence All fathers drove to the school. ToF generate such a sentence, the system would have to be able to realize the formula.' ((Male p)&3 c. 
Parent(p, c)), which literally might be realized as wale human?-who have a child, as the word father, and realize the proposition MoveByCar(p; LI) & Building (LI) & Used-for-teaching(Ll), which literally might be real ized as" moved by car to a building used for teaching, as the phrase drove to school.^ Clearly, this would require substantial knowledge about the meanings of theS specific words father and drive, and a complex process of matching formulas irS the KRL to these predicates. If there are no predicates corresponding to these word meanings in the KRL, then this process is especially complicated. If suchs predicates are included, the hierarchical organization would suggest methods for identifying possible realizations of a formula. Specifically, given an abstract-predicate, say MaleHuman, you could consider all the predicates below it in the abstraction hierarchy to see if any of them more concisely capture the desired" meaning. In this case Father would be a good choice as it not only entails^ MaleHuman but also another part of the meaning, namely 3 c. Parent(p, c). The trick in designing an effective knowledge representation is to choose the set of predicates so as to make the hierarchical relationships most effective. Often, the best representation will mirror linguistic generalizations that can be made. This aids both in interpreting sentences and in generating sentences from expressions in the KB. 13.3 Frames: Representing Stereotypical Information Much of the inference required for natural language understanding involves making assumptions about what is typically true of the objects or situations being discussed. Such information is often encoded in structures called frames. In its most abstract formulation, a frame is simply a cluster of facts and objects that) describe some typical object or situation, together with specific inference strategies for reasoning about the situation. The situations represented could range from visual scenes, to the structure of complex physical objects, to the typical; method by which some action is performed. Frame-based systems usually offer . facilities such as default reasoning, automatic inheritance of properties through hierarchies, and procedural attachment. In some implementations all reasoning is , accomplished by specialized inference procedures attached to the frame; in others the frames are mostly declarative in nature and are interpreted by a more uniform inference procedure. Either way, the key idea is the clustering of information to characterize the properties of commonly occurring objects and situations. The principal objects in a frame are assigned names, called slots or roles v (similar to the thematic roles in the logical form). For instance, the frame for a. house may have slots such as kitchen, living room, hallway, front door, and so_.i on. The frame also specifies the relationships between the slots and the object* represented by the frame. For example, the kitchen slot of the house frame has I _ be physically located within the house, and it contains various appliances needed j BOX 13.2 Conceptual Dependency: A Primitive-Based Representation Several very influential early semantic representations were based on small sets of primitives that were used to support a set of specialized reasoning techniques. One of the most influential was conceptual dependency (Schank, 1975; Schank and Riesbeck, 1981). This representation primarily focused on action verbs and posited a small set of action types. 
Specifically, the major action types included three notions of transfer: ATRANS—abstract transfer (as in transfer of ownership) PTRANS—physical transfer MTRANS—mental transfer (as in speaking) There were also primitives based on bodily activity, PROPEL (applying force) MOVE (moving a body part) GRASP INGEST EXPEL as well as the mental actions, CONC (conceptualize or think) MBU1LD (perform inference) These primitives, together with a set of case roles and a few causal connectives, essentially completed the representation. In early works, it was claimed that this representation was adequate to express the meaning of all action verbs, but in later work primitives were used as building blocks to construct larger structures to capture the meaning of verbs (for example, see Section 15.5). It was found that inference had to be specified in terms of these larger structures rather than in terms of the primitives. Thus the advantages of the primitive-based representation were lost. for preparing meals. You can view each of these slots as a function that takes an object described by the frame (an instance of the frame) and produces the appropriate slot value. Thus a particular instance of the house frame—say, HI— consists of a particular instance of a kitchen, which can be referred to as "the kitchen-slot of HI," or kitchen(Hl), plus particular instances of all the other slots as well. As an example, the definition of a frame type for personal computers might look as follows: Define Object Class PC(e): Roles: Keyb, Disk], MainBox Constraints: Keyboard(Keyb), DiskDrive(Diskl), CPU(MainBox) This structure means that all objects of type PC have slots of type keyboard, disk drive, and CPU (which are identified by the functions Keyb, Diskl, and MainBox, 402 CHAPTER 13 Knowledge Representation and Reasoning 403 Keyboard Disk Drive CPU With this definition, an instance of PC—say, PC4-KEY14, DD12, and CPU07, would be written as -with slot values Figure 13.2 A semantic network defining the slots of PC respectively). This is the same style of representation used in many semanfk network systems. In fact, you could easily represent this structure in a semanfk network notation as well, as shown in Figure 13.2. An instance of the type PC—say, PC3—having the subparts KEYS 13, DD11, and CPU00023 would be represented in the frame notation as (PC3 isa PC with Keyb = KEY13, Diskl = DD11, MainBox = CPU(XX)23 , _J This definition can be viewed as an abbreviation of the FOPC formula PC(PC3 &Keyb(PC3)= KEY13 & Diskl(PC3) = DD11 & MainBox(PC3) = CPU00023 Slots with Restrictions In general, you need more than a superficial knowledge of the structural components of a PC. For instance, the PC frame might contain more information about how the slot values typically interrelate. You might want to assert that each slot is a subpart and indicate how the parts are connected: the keyboard, foi example, as well as the disk drive, plug into the CPU box at the appropriate-connector. To assert this, you would have to define the CPU itself as a frame structure with slots such as KeyboardPlug, DiskPort, PowerPlug, and so on. The notation is extended as in the following example that redefines the class of PCs so that the keyboard and disk are subparts and are connected to the CPU. 
Define Object Class PC(p):
  Roles: Keyb, Disk1, MainBox
  Constraints: Keyboard(Keyb) & PART-OF(Keyb, p) &
               CONNECTED-TO(Keyb, KeyboardPlug(MainBox)) &
               DiskDrive(Disk1) & PART-OF(Disk1, p) &
               CONNECTED-TO(Disk1, DiskPort(MainBox)) &
               CPU(MainBox) & PART-OF(MainBox, p)

An instance such as

(PC4 isa PC with Keyb = KEY14, Disk1 = DD12, MainBox = CPU07)

now implies all the following information:

PC(PC4) & Keyb(PC4) = KEY14 & PART-OF(KEY14, PC4) &
CONNECTED-TO(KEY14, KeyboardPlug(CPU07)) &
Disk1(PC4) = DD12 & PART-OF(DD12, PC4) &
CONNECTED-TO(DD12, DiskPort(CPU07)) &
MainBox(PC4) = CPU07 & PART-OF(CPU07, PC4)

Since frames are a way of encoding knowledge about classes of objects, it makes sense that frame information should be inherited through the type hierarchy. For example, if you define a subtype of PCs called PC-With-Second-Disk, this type should inherit all the slots of PC. If you define a new slot for this type—say, Disk2—then all instances will have four slots: Keyb, Disk1, MainBox, and Disk2. Note that the information in a frame should be viewed as default conditions. For instance, it is possible to have a PC, say PC5, in which the keyboard is not connected to the computer. The fact that this property is violated does not make PC5 fall out of the class of PCs; it just isn't a typical PC.

Frame-based representations can be used to encode additional information about situations beyond their subcomponents. One of the most useful examples of this for natural language understanding occurs in representing actions. As you will see in later chapters, knowledge about the usual situations in which actions occur can be very useful in interpreting language. In particular, knowledge about causality—what effects an action typically has and what conditions are typically necessary for the action to occur—is very important. The slot notation is extended to allow relations between the instance of the frame and other propositions or events. For actions, the following relations are useful:

preconditions—properties that typically enable the action
effects—properties that are typically caused by the action
decomposition—the way in which an action is typically performed (usually defined in terms of a sequence of subactions)

For example, Figure 13.3 shows the definition of the action of buying something. The action involves four objects: the buyer, the seller, the object, and an amount of money equal to the price of the object. Furthermore, the definition states that a purchase action can occur only when the buyer has enough money and the seller has the object (the preconditions), and that typically at the end the buyer owns the object and the seller has the money (the effects). Finally, a typical way something is purchased involves the buyer giving the seller the money and the seller giving the buyer the object (the decomposition). While this might seem to be quite mundane everyday information, such knowledge is crucial for understanding the connections between actions and states described in sentences, which in turn are crucial for ambiguity resolution.

The Action Class BUY(b):
  Roles: Buyer, Seller, Object, Money
  Constraints: Human(Buyer), SalesAgent(Seller), IsObject(Object),
               Value(Money, Price(Object))
  Preconditions: OWNS(Buyer, Money)
                 OWNS(Seller, Object)
  Effects: ¬OWNS(Buyer, Money)
           ¬OWNS(Seller, Object)
           OWNS(Buyer, Object)
           OWNS(Seller, Money)
  Decomposition: GIVE(Buyer, Seller, Money)
                 GIVE(Seller, Buyer, Object)

Figure 13.3 The definition of BUY with its decomposition
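To make the frame notation concrete, here is a minimal sketch of how a frame class and one of its instances might be encoded, expanding the instance into its implied assertions much as was done for PC4 above. The dictionary encoding, constraint templates, and helper names are illustrative assumptions, not the notation of any particular frame system.

# A toy frame encoding: a class lists its role names and constraint
# templates; instantiating it expands into the implied assertions.
PC = {
    "name": "PC",
    "roles": ["Keyb", "Disk1", "MainBox"],
    "constraints": [
        "Keyboard({Keyb})", "PART-OF({Keyb}, {self})",
        "CONNECTED-TO({Keyb}, KeyboardPlug({MainBox}))",
        "DiskDrive({Disk1})", "PART-OF({Disk1}, {self})",
        "CONNECTED-TO({Disk1}, DiskPort({MainBox}))",
        "CPU({MainBox})", "PART-OF({MainBox}, {self})",
    ],
}

def instantiate(name, frame, fillers):
    """Expand an instance into the literals implied by its frame."""
    facts = ["%s(%s)" % (frame["name"], name)]
    for role in frame["roles"]:
        facts.append("%s(%s) = %s" % (role, name, fillers[role]))
    env = dict(fillers, self=name)
    facts += [c.format(**env) for c in frame["constraints"]]
    return facts

for fact in instantiate("PC4", PC,
                        {"Keyb": "KEY14", "Disk1": "DD12", "MainBox": "CPU07"}):
    print(fact)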
13.4 Handling Natural Language Quantification

With the basic KRL defined, you can now consider some issues in mapping the logical form language into the KRL. One of the most obvious differences between the two languages is the treatment of quantifiers. The logical form contains a wide range of quantificational forms corresponding to the English quantifiers, while the KRL allows only universal and existential quantification. Reconciling this difference seems almost hopeless at first glance. Significant progress can be made to reduce the differences, however, by extending the ontology of the KRL to allow sets as objects.

A set is a collection of objects viewed as a unit. While sets in general may be finite (such as the set consisting of John and Mary) or infinite (such as the set of numbers greater than 7), we will only use finite sets in the KRL. A set can be indicated by listing its members in curly brackets; for example, {John1 Mary1} refers to the set consisting of the denotation of John1 and the denotation of Mary1. The order doesn't matter; {John1 Mary1} = {Mary1 John1}. We also allow constants to denote sets. Thus S1 might be a set defined by the formula S1 = {John1 Mary1}. Full set theory would allow sets to be members of other sets. We will not use such sets in the KRL. Sets will usually be defined in terms of some property. This will be written in the form {y | Py}, which is the set of all objects that satisfy the expression Py. The set of all men is {y | Man(y)}. In addition, we introduce the following predicates to relate sets and individuals:

S1 ⊆ S2    iff all the elements of S1 are in S2
x ∈ S      iff x is a member of the set S

With setlike objects in the representation, we can produce an interpretation for Some men met at three, as follows:

∃ M : M ⊆ {x | Man(x)} . Meet1(M, 3PM)

that is, there is a subset of men M that met at three. By convention, we will always use uppercase names for variables ranging over sets. In principle, sets are allowed in all situations where individuals have been allowed. In practice, certain verbs require only sets or only individuals in certain argument positions. For example, the verb meet requires its agent to be a set with more than one element, as a single individual cannot meet. Other verbs require individuals and exclude sets, and others allow both sets and individuals as arguments.

Consider the different formulas that arise from the collective/distributive readings. There are two interpretations of the sentence Some men bought a suit, which has the following logical form (omitting the tense operator):

(SOME m1 : (PLUR MAN1) (A s1 : SUIT1 (BUY1 m1 s1)))

The collective reading would map to

∃ M1 : M1 ⊆ {z | Man(z)} ∃ s : Suit(s) . Buy1(M1, s)

that is, there is a subset of the set of all men who together bought a suit. The distributive reading involves some men individually buying suits and would be represented by

∃ M2 : M2 ⊆ {z | Man(z)} ∀ m : m ∈ M2 ∃ s : Suit(s) .
Buyl(m, s) Note that the collective and distributive readings both involve a common core meaning involving the subset of men. The only difference is whether you use the set as a unit or quantify over all members of the set. The set-based representation can also be used to ensure that more than one man bought a suit. To do this we introduce a new function that returns the cardinality of set. For any given set S, let ISI be the number of elements in S. Using arithmetic operators, we can now encode constraints on the size of sets. For example, the meaning of Three men entered the room would be as follows, again with tense information omitted, 3M: (Mc {y\ Man(y)}&\M = 3) V m: me M. Enterl(m, Rooml) By changing the restriction to IMI > 3, you get the meaning of At least three men entered the room, and so on. More problematic quantifiers can also be given an approximate meaning using sets. For instance, if we define most as being true if more than half of some set has a given property, then Most men laughed might have the meaning 406 CHAPTER 13 Knowledge Representation and Reasoning 407 3M: (M c [y\ Man(y)} & 1A« > lfyl Mf(V)") V m : m e M. Laughed{m) In an actual discourse, the interpretation of the quantified terms will usually bel relative to some previously defined set. For example, the sentence Most men-laughed typically will refer to most of the men in a previously mentioned se rather than to most of the men in the world. In other words, the sentence wouldi not claim that more than half of all men laughed, but that more than half the men" in a certain context (say in a given room) laughed. This type of interpretation wiU-i be discussed further in Chapter 14. -You have seen that by introducing sets as explicit objects in a representation, a wide range of quantificational constructs can be captured in an' intuitively satisfying way. While the development here was in terms of extea— sions to FOPC, similar capabilities are needed in any representation to capture the same phenomena. For example, assume you are using a semantic network representation. To handle quantification you must be able to have nodes that represent sets, be able to state cardinality restrictions on these nodes, and be abler to quantify over these sets to obtain the distributive reading. o 13.5 Time and Aspectual Classes of Verbs One of the central components of any knowledge representation that supports natural language is the treatment of verbs and time. Much of language involves time, including temporal information implicit in the tense and aspect of sentences and explicit temporal information conveyed by a wide range of temporal adver-bials (for example, for five minutes, yesterday, at 3 o 'clock, after they had left). In the logical form language, temporal information was handled in several-ways. There were modal operators to represent tense (for example, PAST. PRES, PROG, FUT) and temporal connectives (for example, BEFORE, DURING), and all predicates could take time arguments. To handle such phenomena, we need to introduce additional extensions to FOPC to represent time. There are several different types of times. A time point is an instantaneous time that is generally associated with some transition in the world, such as a light* turning on or someone finding a lost pen. An interval of time is an extended stretch of time over which some event occurs. All intervals have durations (for example, five minutes long), while points cannot have durations. Many predicates can be defined only over intervals. 
For example, consider the predicate that asserts that John drove his car to work at a certain time. This can be true only over an interval of time, because driving to a destination necessarily takes time; you cannot drive in a single point. Points and intervals have to be distinguished because different relationships-;: can hold between them. For example, two intervals may overlap, whereas points cannot overlap. In addition, two intervals may meet: One ends where the other ■ begins, but they do not overlap in time or have any time between them. A point or an interval may be contained within another interval, but nothing can be contained within a point. The following predicates are allowed for temporal relations: tl < t2 point/interval tl is before point/interval t2 tl : t2 , interval tl meets interval t2, or point tl defines the beginning of interval t2, or point t2 defines the end of interval tl tl c t2 point/interval tl is contained in interval t2 As previously mentioned, some predicates can be true only over intervals of times, whereas others can be true only at points, and others can be true at either. The classification of predicates corresponds with different aspectual classes of verb phrases. Sentences describe propositions that fall into at least three distinct classes: those that describe states (stative propositions), those that define ongoing activities (activity propositions), and those that define completed events (telic propositions). Stative propositions describe some property of the world that can hold for an instant or extend indefinitely, as in the sentences Jack is happy. I believe the world is flat. Stative propositions describe situations that lack a precisely defined ending point, and cannot appear in certain linguistic contexts. For instance, they do not naturally appear in the progressive form *Jack is being happy. *I am believing that the world is flat. Activity propositions describe activities that occur over an interval of time. Activities are often expressed using the progressive form, as in the sentences Jack is running. The door was swinging to and fro. Sentences describing states and activities do not usually allow temporal modifiers, such as in five minutes, but they do allow duration modifiers, such as for five minutes. Telic sentences describe events that are brought to completion, as in Jack fell asleep. Jack climbed the mountain. In both sentences, the event ends at some time (called the culmination point), and you know that some resulting property starts at the culmination point. For instance, with the first sentence you know that Jack is asleep at the end of the event, and with the second you know that Jack is at the top of the mountain. 408 CHAPTER 13 Knowledge Representation and Reasoning 409 Sentences describing telic propositions can include temporal modifiers such as /« an hour, as in They climbed the mountain in two days. Jack fell asleep in an hour. Telic eventualities are often broken down into two subclasses, depending on whether they essentially describe a transition only (the achievement class) ©i involve some activity leading up to the culmination (the accomplishment class). The previous examples describe accomplishments, whereas the following describe achievements: Jack recognized the man. Helen woke up. The four types of proposition classes can be distinguished by differenl types of temporal arguments. In particular, stative propositions can be true at a point or an interval. 
For example, it makes sense to speak of a ball being red at a particular instant of time or over an extended interval of time. Stative propositions are homogeneous—whenever they hold over an interval, they also hold over all subintervals of that interval.

Achievement sentences, such as Jack reached the summit or Helen closed the door, map to propositions that describe transitions. The first describes a transition after which Jack is at the summit, whereas the second describes a transition after which the door is closed. Such predicates cannot hold over intervals, but their definitions might include information about resulting states, such as

∀ a1, l1, t1 . Reach(a1, l1, t1) ⊃ ∃ T1 . t1 : T1 & At(a1, l1, T1)

that is, if an agent a1 reaches a location l1 at time point t1, then a1 is at l1 for some interval of time that starts at t1.

Propositions that describe processes correspond to activity verbs, such as Jack ran. Process predicates can occur only over intervals and tend to be homogeneous, although not in a strict way as with statives. In particular, it could be true that Jack was running between 2 and 3 o'clock, even if he stopped for a five-minute rest sometime during that time. Thus there can be a defeasible implication, but not an entailment, that if a process P occurs over an interval T1 then it is likely to have occurred over an interval T2 within T1.

Accomplishment sentences, such as Jack ran to the store, have a more complex structure that seems to combine several forms. We can handle this by mapping the logical form predicates to a more complex sentence in the KRL. In particular, the logical form for Jack ran to the store could map to a formula indicating that a process of running occurred that culminated in a state of being at the store, that is,

∃ T1, T2 . Running(Jack1, T1) & At(Jack1, Store1, T2) & T1 : T2

Figure 13.4 summarizes some distinguishing properties of the aspectual classes.

Aspectual Class     Can Be True at a Point?   Can Be True at an Interval?   Temporal Modifier in Phrase?
Stative             YES                       YES                           NO
Activity            NO                        YES                           NO
Achievement         YES                       NO                            YES
Accomplishment      NO                        YES                           YES

Figure 13.4 Different properties of the aspectual classes

Encoding Tense

Tense operators can also be represented directly in the temporal logic without the need for modal operators. The basic idea is to map tense operators to temporal relations with respect to some indexical term referring to the current time. For the following examples, let us assume that the constant NOW1 denotes the current time. Given this, we could map the PAST operator into a formula that existentially quantifies over a time before now; that is, the sentence John was happy would map to the KR expression

∃ T1 . T1 < NOW1 . Happy(John1, T1)

The same sentence in the simple present, John is happy, would map to

∃ T1 . NOW1 ⊆ T1 . Happy(John1, NOW1)

and the simple future, John will be happy, would map to

∃ T1 . T1 > NOW1 . Happy(John1, T1)

Note that there is ambiguity in the interpretation of tense, because some simple present sentences refer to the future, as in The flight arrives at noon, whereas some simple future sentences refer to the present, as in Jack will be in class by now. But we will ignore these complications in this development. Even without ambiguity, there are some difficult problems. For instance, there are two ways to assert that something was true in the past, corresponding to the simple past and the past perfect, for example,

Helen saw the books.
Helen had seen the books.

What is the difference between these two readings? As isolated sentences, it is hard to tell, but consider these forms in more complex sentences, such as

When Jack opened the door, Helen saw the books.
When Jack opened the door, Helen had seen the books.

In the first sentence the act of seeing is cotemporal with or immediately after the time Jack opened the door, whereas in the second the act of seeing preceded Jack's opening the door. The generally accepted account of this difference was proposed by Reichenbach (1947), who suggested the notion of reference time. In these examples the reference time for the main clause is the time that Jack opened the door. The simple past equates the time of the event (of seeing) with the reference time, whereas the past perfect asserts that the event precedes the reference time. Specifically, Reichenbach developed a theory that tense gives information about three times:

S — the time of speech
E — the time of the event/state
R — the reference time

The reference time can be provided by temporal adverbials, as above, or can often be determined by the discourse context, as will be discussed in Chapter 15. In the simple tenses the reference time is the same as the event time, that is, E = R. The three forms are generated by varying the relationship between R and S:

Jack sings            simple present:  S = R, E = R
Jack sang             simple past:     R < S, E = R
Jack will sing        simple future:   S < R, E = R

The perfect tenses, on the other hand, have the event time preceding the reference time, and differ, as before, in terms of how the reference time and speech time are related. Thus we have

Jack has sung         present perfect: S = R, E < R
Jack had sung         past perfect:    R < S, E < R
Jack will have sung   future perfect:  S < R, E < R

This analysis also provides an account of the posterior tenses, in which R < E:

Jack is going to sing          posterior present: S = R, R < E
Jack was going to sing         posterior past:    R < S, R < E
Jack will be going to sing     posterior future:  S < R, R < E

These orderings are shown graphically in Figure 13.5.

Figure 13.5 Some temporal configurations allowed by the tenses (timeline diagrams of S, E, and R for the simple, perfect, and posterior forms of the present, past, and future)
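These relationships can be tabulated directly. The following is a minimal sketch, in Python with an illustrative encoding, of the nine tense configurations as constraints among S, E, and R:

# Reichenbach's analysis as a lookup table: each tense is a pair of
# constraints, one relating R to S and one relating E to R.
TENSES = {
    "simple present":    ("S = R", "E = R"),
    "simple past":       ("R < S", "E = R"),
    "simple future":     ("S < R", "E = R"),
    "present perfect":   ("S = R", "E < R"),
    "past perfect":      ("R < S", "E < R"),
    "future perfect":    ("S < R", "E < R"),
    "posterior present": ("S = R", "R < E"),
    "posterior past":    ("R < S", "R < E"),
    "posterior future":  ("S < R", "R < E"),
}

def consistent_tenses(s, e, r):
    """Return the tenses whose constraints hold of numeric times s, e, r."""
    env = {"S": s, "E": e, "R": r}
    ok = lambda c: eval(c.replace("=", "=="), {}, dict(env))
    return [t for t, (rs, er) in TENSES.items() if ok(rs) and ok(er)]

# "Helen had seen the books": the event precedes the reference time
# (Jack opening the door), which itself precedes the speech time.
print(consistent_tenses(s=3, e=1, r=2))    # ['past perfect']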
13.6 Automating Deduction in Logic-Based Representations

The previous sections have developed the abstract knowledge representation language that will be used through the rest of this book. The remaining sections change the focus and look at selected reasoning strategies used in knowledge representation systems. This section describes some techniques for automated reasoning in knowledge representations. As mentioned earlier, reasoning systems fall into two main categories: the declarative techniques based on deductive proof techniques and the procedural approaches. This section considers some purely deductive techniques.

If you have ever studied proof methods such as natural deduction for FOPC, you know that finding proofs is complicated because there are many different ways to try to prove any given formula. In addition, there are often many syntactic ways to express the same content. Most work in automated reasoning attempts to reduce this complexity before starting the task. In particular, a normal form is used for formulas that uses a restricted subset of the full FOPC syntax. Many formulas that are logically equivalent but syntactically different have the same normal form. For example, the formulas P ⊃ Q, Q ∨ ¬P, and ¬(P & ¬Q) are all logically equivalent. In a representation based on conjunctive normal form (CNF), described later in this section, these all map to the same formula.

A second technique concerns the handling of variables. In general, with quantified variables, many particular instantiations of those variables could be used to establish a proof. Thus, when searching for a proof, there is ample opportunity to pick the wrong instantiation and have to reconsider later. The unification technique partially avoids this problem by always computing the most general solution possible for a given line of reasoning. If you are not already familiar with unification, see Appendices A and B.

Many different reasoning systems can be viewed within the framework of automatic proof systems, from simple pattern-based retrieval from databases, to Horn clause systems similar to PROLOG, to fully general theorem-proving systems. The differences arise from the form of expressions each system can represent and reason about. To see this, let's first develop a fully general normal form for FOPC. At the level of constants, functions, and atomic propositions, conjunctive normal form is identical to FOPC. A literal corresponds to an atomic proposition, possibly negated, such as the following:

Person(John1)
Car(Car1)
Owns(John1, Car1)
¬Happy(John1)

A clause is simply a disjunction of literals such as the following, which asserts that either Helen owns a particular car or she is not happy:

Owns(Helen1, Car2) ∨ ¬Happy(Helen1)

In many systems, clauses are written using a form of the implication operator instead, and the preceding clause would be written as

(Owns(Helen1, Car2) <- Happy(Helen1))

In general, a clause is written as

(P1, ..., Pn <- Q1, ..., Qm)

which states that if Q1, ..., Qm are all true, then at least one of P1, ..., Pn is true. Viewing a clause as a disjunctive formula, the Qi's are all the negated literals while the Pi's are all the positive literals. It can be shown that all formulas in FOPC have an equivalent clause form. Figure 13.6 gives some examples of propositional formulas using standard logical operators and their equivalent form in conjunctive normal form.

Formula in FOPC     Clause Form Equivalent
P                   (P <-)
¬P                  (<- P)
P & Q               two clauses: (P <-) and (Q <-)
P ∨ Q               (P Q <-)
Q ∨ ¬P              (Q <- P)
P ⊃ Q               (Q <- P)
¬(P ⊃ Q)            two clauses: (P <-) and (<- Q)
(P & Q) ∨ R         two clauses: (P R <-) and (Q R <-)

Figure 13.6 Formulas in clause form

Quantification is handled in clause-based systems using variables and skolemization, as discussed in Section 13.2. A very useful restriction of this type of expression is the Horn clause, which has exactly one literal on the left side, that is, exactly one positive literal. In a KB restricted to Horn clauses, the backward chaining strategy used in PROLOG is able to find a proof of any formula that logically follows from the KB.
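To make the Horn clause strategy concrete, here is a minimal sketch of backward chaining over ground (variable-free) Horn clauses; a real system would add unification to handle variables, which the next paragraphs turn to. The clause encoding and example predicates are illustrative only.

# Backward chaining over ground Horn clauses written as
# (head, [body1, ..., bodyn]); facts have an empty body.
# No variables, no unification, and no cycle checking in this sketch.
kb = [
    ("Bark(Fido1)", ["Dog(Fido1)"]),   # ground instance of (Bark(?x) <- Dog(?x))
    ("Dog(Fido1)",  []),               # a fact
]

def prove(goal, kb):
    """Prove the goal by finding a clause whose head matches it and
    recursively proving every literal in that clause's body."""
    for head, body in kb:
        if head == goal and all(prove(subgoal, kb) for subgoal in body):
            return True
    return False

print(prove("Bark(Fido1)", kb))   # True
print(prove("Meow(Fido1)", kb))   # False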
Reasoning with clauses makes extensive use of the unification algorithm. Consider first a very limited KR that allows only literals in the KB, similar to a relational database that allows quantification. The intuition we want is that a literal P follows from the KB if we can retrieve a formula in the KB that unifies with it. Consider how variables are treated in such a system. The FOPC formula ∀ y ∃ x . P(x, y) would correspond to the literal P(Sk2(?y), ?y). If the KB contains this literal, consider what would happen if you later want to see whether ∀ w ∃ z . P(w, z) follows from the KB.
If you convert it to clause form, you would obtain a literal such as P(Sk3(?w), ?w), which will not unify with the literal in the database since the Skolem functions are named differently. In pattern retrieval systems, this is handled by changing the interpretation of quantifiers when they are used as a query. Thus, as a query, Vy 3 x P(x y) would map to a literal P(?z, Sk4), which would unify with the literal for the same formula already in the knowledge base. At first glance, this technique of changing the interpretation of quantifiers for queries may seem rather arbitrary, but it actually follows from the underlying proof strategy being used. In particular, there are always two general methods for proving a formula P. The first is to build a proof of P directly from the KB using the rules of inference. The second is to show that ->P is inconsistent with the KB (and thus P must follow from the KB). The latter approach is called a refutation proof and turns out to be the most useful technique for automatic deduction. It forms the foundation for pattern-matching techniques as previously described, Horn clause proof strategies, and general resolution-based theorem-proving systems. All of these can be viewed as specialized implementations of a single rule of inference called the resolution rule. A simple case of the resolution rule resembles modus ponens. In particular, given a clause (2 <- P) (that is, P implies Q) and the clause (P <-) (that is, P is true) the resolution rule allows you to conclude (Q <-) (that is, Q is true). Another simple case of the resolution rule detects contradictions. Given (P <-) (that is, P is true) and (<- P) (that is, P is false), the resolution rule gives the empty clause (<-), which indicates that the database is inconsistent. The resolution rule is generalized to FOPC by using unification to instantiate the variables in the two clauses to make the X's identical. For example, consider a KB that includes clauses asserting that all dogs bark and that Fido is a dog: (Bark(?x) <- Dog(?x)) (Dog (Fidol) <-) The resolution rule would allow you to conclude that Fido barks, for after substituting Fidol for ?x in the two clauses, you can cancel the literal Dog(Fidol) from both clauses to obtain the resulting clause: (Bark (Fidol) <-) 414 CHAPTER 13 Knowledge Representation and Reasoning 415 With the resolution rule in hand, you can now consider the refutation prool strategy. Given a consistent KB in clause form, we can determine whether V formula P follows from the KB by negating P, converting it to clause form arid adding it to the KB, and then showing that we can derive the empty clause using the resolution rule. Since this indicates that the KB is now inconsistent, P musi' follow from the KB. It can be proven that the resolution strategy is complete?! the sense that if P does follow from the KB, then a proof can be found. However the converse is not true in the general case. If P does not follow from the KB. ih proof strategy may never be able to tell this fact. By limiting the form op th; clauses that can be used, you obtain different properties. For instance, with a KI consisting solely of Horn clauses, you can tell whether a formula P does or doe not follow from the KB in every circumstance. In deductively based systems, a common technique for introducing--, default mechanism is called proof by failure. A new operator called UNLESS i: introduced that recursively calls the theorem prover on the formula that is iti argument. 
If the recursive call to the theorem prover stops without proving its formula true, then the UNLESS formula is true. For example, the default rule that cats purr might be expressed as the following axiom:

∀ c . Cat(c) & Unless(¬Purr(c)) ⊃ Purr(c)

That is, you can conclude that a cat purrs except when you can prove it doesn't purr. Because of the potential expense of recursively calling the prover, such techniques are usually used only with restricted proof systems, such as in PROLOG-style Horn-clause representations. Another technique used in many such systems that is closely related to the closed world assumption is the negation as failure rule, where a proposition is false if it can't be proven to be true; that is, for any proposition P, Unless(P) ⊃ ¬P. Of course, you would have to be very careful if using default rules in a system that uses negation as failure. For instance, in a system using negation as failure, the previous default rule would state that you can conclude that cats purr only when you can conclude that cats purr—not a very useful rule!

The notions of clauses, unification, and refutation proofs provide the formal underpinnings of virtually every modern knowledge representation system; that is, any system that uses pattern matching with variables can be seen as a special case of the general technique. Of course, this does not mean that matching covers all the reasoning that a knowledge representation system can do, but it is a crucial part of every system.

13.7 Procedural Semantics and Question Answering

Procedurally based techniques are frequently used in database query applications, where there is a large difference in expressive power between the logical form language and the database language. Cast in terms of the formalism in the last section, the KB (that is, the database) consists only of positive literals, often without variables. Rather than convert the logical form language into extended FOPC as described in earlier sections, the logical forms are treated as expressions in a query language. Each logical form language construct corresponds to a particular procedure that performs the appropriate query. For example, the query Does every flight to Chicago serve breakfast? with the logical form

(EVERY f1 : (& (FLIGHT f1) (DEST f1 (NAME c1 "Chicago")))
    (SERVE-BREAKFAST f1))

would be interpreted as a procedure as follows:

1. Find all flights in the database with destination CHI (the database symbol for Chicago).
2. For each flight found, check if it serves breakfast. If all do, return yes; otherwise return no.

This section shows how to interpret logical form expressions as procedures, a method of interpretation often called procedural semantics. To make the development concrete, consider the very simple database retrieval system shown in Figure 13.7. The database consists of a set of positive literals containing no variables.

(FLIGHT F1)    (DTIME F1 BOS 1600HR)    (ATIME F1 CHI 1700HR)
(FLIGHT F2)    (DTIME F2 BOS 900HR)     (ATIME F2 CHI 1000HR)
(FLIGHT F3)    (DTIME F3 BOS 800HR)     (ATIME F3 CHI 900HR)
(FLIGHT F4)    (DTIME F4 CHI 1600HR)    (ATIME F4 BOS 1700HR)
(AIRPORT BOS)
(AIRPORT CHI)

Figure 13.7 A simple database of airline schedules

Times are indicated in international notation; for example, 1700HR is 5:00 PM. The relation (ATIME f c t) indicates that flight f arrives at airport c at time t, and (DTIME f c t) indicates that flight f leaves from airport c at time t.
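To make this concrete, the following minimal sketch encodes the database of Figure 13.7 as tuples and implements the kind of pattern-matching retrieval that the query functions described next assume; a Test function would simply check that at least one set of bindings survives. The tuple encoding and the "?" prefix convention for variables are illustrative assumptions only.

# Figure 13.7 as a list of ground literals; query variables are strings
# beginning with "?".
db = [
    ("FLIGHT", "F1"), ("FLIGHT", "F2"), ("FLIGHT", "F3"), ("FLIGHT", "F4"),
    ("AIRPORT", "BOS"), ("AIRPORT", "CHI"),
    ("DTIME", "F1", "BOS", "1600HR"), ("ATIME", "F1", "CHI", "1700HR"),
    ("DTIME", "F2", "BOS", "900HR"),  ("ATIME", "F2", "CHI", "1000HR"),
    ("DTIME", "F3", "BOS", "800HR"),  ("ATIME", "F3", "CHI", "900HR"),
    ("DTIME", "F4", "CHI", "1600HR"), ("ATIME", "F4", "BOS", "1700HR"),
]

def match(pattern, literal, bindings):
    """Match one query pattern against one ground literal."""
    if len(pattern) != len(literal):
        return None
    bindings = dict(bindings)
    for p, v in zip(pattern, literal):
        if p.startswith("?"):
            if bindings.get(p, v) != v:
                return None
            bindings[p] = v
        elif p != v:
            return None
    return bindings

def retrieve(var, patterns, db):
    """Return every value of var for which all patterns are in the database."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b2 for b in solutions for lit in db
                     if (b2 := match(pattern, lit, b)) is not None]
    return sorted({b[var] for b in solutions})

print(retrieve("?x", [("FLIGHT", "?x"), ("ATIME", "?x", "CHI", "1000HR")], db))
# ['F2']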
The database system provides a simple interface based on pattern matching of literals, where the query may contain variables. Two database query functions are assumed: (Test i „)—returns true if there is some binding of the variables such that each literal is found in the database. (Retrieve i n)—like Test, but if it succeeds it returns every instance of the indicated variable that provides a solution. For example, given the database in Figure 13.7, the query (Retrieve ?x (FLIGHT ?x) (ATIME ?x CHJ lOOOilR)) 416 CHAPTER 13 Knowledge Representation and Reasoning 417 would return the list (F2) because F2 is the only binding of ?x where both these literals are in the database. All expressions in the logical form language must be interpreted in a way that reduces eventually to these two query forms on the database. The way this is done is by mapping the logical form into a procedure that performs the appropriate queries on the database. Thus answering a question is done in two steps: translating the logical form into a program and then executing that program to compute the answer. Consider the translation step first. For any logical form expression E, the translation of E in the database query language will be indicated as T(E). The translation of expressions varies depending on the constructs. For instance, expressions such as (NAME cl "Chicago") will be translated into the appropriate database constant, in this case CHI. But in addition, the symbol cl must be stored with the constant CHI on a structure called the symbol table, so that if cl is found again in another part of the logical form, it can also be replaced by its value CHI. Some logical form relations will translate directly into database relations, whereas others will translate into more complex expressions. For instance, the logical form relation DEST is not used in the database; rather, the destination of a flight is encoded in the ATIME relation that includes both the flight's destination and its arrival time. Thus the logical form relation (DEST fl (NAME cl "Chicago")), where fl has already been associated with a variable ?f, would translate into the database relation (ATIME ?f CHI ?t) Since the time is not included in the DEST relation, it is interpreted as an unconstrained variable in the translation. In general, the translation of each relation in the logical form must be specified. The procedural semantics approach gets more interesting as it interprets logical connectives and quantifiers, which of course have no corresponding constructs in the relational database. The logical operators are interpreted as follows: Conjunctions: (& Ri R„)—will translate into a program of the form (CHECK-ALL-TRUE T(Ri) T(Rn)), which when executed will successively query each T(Rj) to make sure it is true and pass on the variable bindings to the queries that follow. If there is a set of variable bindings such that querying each T(Rj) succeeds, then the program succeeds; otherwise it fails. Disjunctions: (OR Rj Rn)—will translate into a program of the form (FIND-ONE-TRUE T(Ri) .... T(Rn)), which when executed will successively query each T(Rj) until one of the R; succeeds, in which case the program succeeds. If no Rj succeeds, then the program fails. - IIa The procedure for negation assumes the closed world assumption on all relations in the database, and uses proof by failure: (NOT Retranslates into a program of the form (UNLESS T(R)), which succeeds only if querying T(R) fails. The most complex'translations occur with quantifiers. 
Each quantifier translates to a program that does the appropriate operations on the database. Because of the limitations of the database language, only the distributive readings of plural quantifiers are usually supported. Consider three quantifiers important in question-answering applications: THE, EACH and WH. (THE x : Rx Px)—translates into a program (FIND-THE ?x T(R?X) T(P?X)), which first does a retrieval to find all ?x that satisfy T(R?X)> that is, (Retrieve ?x T(R?X)). If a single answer is found, then that answer is substituted for ?x in the entire expression, and T(P?X) is executed to provide the answer for the entire expression. If no object is found when querying T(R?X), then there is a presupposition violation that might be handled by the question-answering system in a special way, say, notifying the user that there is no such object. If multiple answers are found, the designer of the system must decide what is best to do. Some systems allow this situation and execute T(PX) for each of the values; other systems treat it as a failure. (EACH x : Rx Px)—translates to a program (ITERATE ?x T(R?X) T(P?X)), which also starts by doing a retrieval to find all ?x that satisfy T(R9X). It then iteratively executes T(P9X) for each value found and succeeds only if each of these queries succeeds. (WH x : Rx Px)—translates into a program (PRINT-ALL ?x T(R?X) T(P?x))> which retrieves all objects that satisfy the translations of Rx and Px, that is, (Retrieve ?x T(R?X) T(P?X)), and then prints out the results. Determining the best format for printing the answers, especially determining whether additional information should be provided, is a complex issue. Here we assume it simply prints the answers found. This is enough mechanism to show some examples using the database in Figure 13.7. The query Which flight to Chicago leaves at 4PM? would have the logical form (after scoping) (WH fl : (& (FLIGHT fl) (DEST fl (NAME cl "Chicago"))) (LEAVE 11 (NAME tl "4PM"))) This would translate into a query of the form (PRINT-ALL ?f (FLIGHT ?f) (ATIME ?f CHI ?t) (DTIME ?f ?s 1600HR)) 418 CHAPTER 13 Knowledge Representation and Reasoning 419 Here, the DEST relation maps to an ATIME relation as previously described, and the LEAVE predicate maps into the DTIME relations. Note that the departure location was not specified in the logical form and so is treated as a variable here. In a real application the departure city would be determined by context or by default. With the small database shown in Figure 13.7, however, there is only one flight matching the current description, namely Fl, so it works in this case. Consider a more complex example that involves iteration, as in the request Give the departure time of each flight to Chicago with the logical form (EACH fl: (& (FLIGHT fl) (DEST fl (NAME cl "Chicago"))) (THE tl : (DEPART-TIME fl tl) (GIVE-SPECIFY 1 gl))) This would translate into the query (ITERATE ?fl (CHECK-ALL-TRUE (FLIGHT ?fl) (ATIME ?f CHI ?tl)) (FIND-THE ?tl (DTIME ?fl ?city ?tl) (PRINT ?tl))) In this case, the interpretation of the verb give simply involves printing out its argument, i.e., the departure time. The execution of the expression then proceeds as follows: 1. The first part of the ITERATE step is to find all ?f 1 satisfying the restriction. The CHECK-ALL-TRUE procedure succeeds for ?fl only if both (FLIGHT ?fl) and (ATIME ?f CHI ?tl) are in the database. This step returns the flights Fl, F2, and F3. 2. 
The second part of the step is to execute (FIND-THE ?tl (DTIME ?fl ?city ?tl) (PRINT ?tl)) for each of the three values. Consider the execution with the first value, Fl. The expression is (FIND-THE ?tl (DTIME Fl ?city ?tl) (PRINT ?tl)) The program for FIND-THE first performs the query (Retrieve ?tl (DTIME Fl ?city ?tl)). This returns a unique answer, namely the time 1700HR. The second step of the FIND-THE program executes PRINT on this value, causing 1700HR to be printed. The second and third iterations print the values 1000HR and 900HR, respectively. Many natural language database query systems use the procedural semantics technique. It provides a convenient way to capture the appropriate behavior for many constructs whose meaning cannot be expressed within the limited language of the database system. Because of the nature of database applications, the limitations of these techniques do not appear to be a problem in practice. For instance, database systems don't typically encode information that would make queries using collective interpretations of quantifiers necessary. In addition, since m ■ Wik BOX 13.3 LUNAR: A Natural Language Database Query System With the discussion of procedural semantics, you have now seen most of the central components of the LUNAR system. LUNAR, developed in the 1970s, acted as a front-end query system to a database containing information about the rock samples brought back from the Apollo missions to the moon. It was the first natural language system to demonstrate extensive coverage in a realistic application domain, and many of the techniques that are common in the field today either originated or were first developed to an advanced stage in this system. The system used an ATN parser (see Section 4.6) that produced a representation based on grammatical relations, which was then interpreted by a semantic interpretation module that used a recursive pattern-matching technique to produce an expression in a meaning representation, as in Section 11.1. Quantifier information was maintained separately from the rest of the semantic representation and was then ordered using heuristics similar to those described in Section 12.3. The result was a final meaning representation expressed in a meaning representation language similar to our fully scoped logical form. This was then executed using a procedural semantics approach as described in this section. Some examples of queries that LUNAR could handle are Give me all lunar samples with magnetite. In which samples has apatite been identified? What is the specific activity of A126 in soil? What is the average concentration of olivine in brecchias? In which brecchias is the average concentration of titanium greater than 6 percent? For more information on LUNAR, see Woods (1970; 1977; 1978). the database does not contain disjunctive information, the limited forms for disjunctive queries also do not pose a problem. Procedural semantic techniques can also be used with Horn-clause-based databases as well. Most of the procedural definitions of constructs can be defined by Horn clause axioms, with the addition of an ability to recursively invoke the prover to perform tasks such as finding all objects that satisfy some set of literals. With extensions to handle finite sets, such a representation can handle a wide range of quantifiers procedurally using the encoding techniques described in Section 13.4. 
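Returning to the quantifier translations described earlier in this section, the following minimal sketch shows how the PRINT-ALL, ITERATE, and FIND-THE programs might be realized on top of the toy database and retrieve function sketched after Figure 13.7. The function names, argument conventions, and error handling are illustrative assumptions, not the behavior of any particular query system.

def print_all(var, restriction, body, db):
    # (WH x : Rx Px): print every value satisfying both parts.
    for value in retrieve(var, restriction + body, db):
        print(value)

def iterate(var, restriction, body_fn, db):
    # (EACH x : Rx Px): run the body for every value satisfying the
    # restriction; succeed only if every call succeeds.
    return all(body_fn(value) for value in retrieve(var, restriction, db))

def find_the(var, restriction, body_fn, db):
    # (THE x : Rx Px): expect exactly one value satisfying the
    # restriction; anything else is simply treated as failure here.
    values = retrieve(var, restriction, db)
    return body_fn(values[0]) if len(values) == 1 else False

# "Which flight to Chicago leaves at 4PM?" from the earlier example:
print_all("?f", [("FLIGHT", "?f"), ("ATIME", "?f", "CHI", "?t")],
          [("DTIME", "?f", "?s", "1600HR")], db)
# prints F1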
With their additional expressive power, Horn-clause-based databases are a very attractive generalization to the traditional relational database for supporting natural language query systems. 13.8 Hybrid Knowledge Representations Even if a knowledge representation language remained first-order, general search strategies in theorem proving would usually be loo inefficient for practical systems. The theoretical cleanness of viewing inference as theorem proving, 420 CHAPTER 13 Knowledge Representation and Reasoning 421 --■—1 > ANIMAL Ts MAMMAL ts DOG : T isa Fidol Figure 13.8 A small type hierarchy however, has many attractive properties. Hybrid KR systems attempt to gain the* advantages of using efficient procedural inference for some tasks while retaining* the theoretical framework of theorem-proving systems. As a start, the ideas of unification and refutation proof can be carried over into most systems. A hybrid system, however, does not depend entirely on these techniques. Rather, certain forms of inference are accomplished using special-purpose techniques that can be considerably more efficient. For example, consider the implementation of type hierarchies in a KR system. You saw earlier that type hierarchies can be encoded as axioms (for example, V x . DOG(x) n MAMMAL(x)), or as graphs, as in semantic networks. These techniques may be formally equivalent, but they can produce radically different computational properties. One way to combine these techniques is to assume a typed logic resembling the restricted quantification logic developed in Section 13.2. The type hierarchy is predefined in a semantic network structure in which DOG is a subtype of MAMMAL which is a subtype of ANIMAL, and Fidol is predefined to be a member of the set DOG. Given this general knowledge, the KB encoding the assertion All animals have a mother would be (MOTHER (?x:ANIMAL, Skl(?x)) <-) where the notation ?x:ANIMAL indicates a variable ranging over type ANIMAhk Such expressions could be reasoned about by extending the unification algorithm^ so that two terms may unify only if they are of compatible types; that is#-?x:ANTMAL and Fidol will unify only if Fidol is a member of the set ANIMAL: This constraint can be checked procedurally using the semantic network shown in Figure 13.8. Now the query as to whether Fido has a mother—thai MO'THFR(Fidol. ?y)—can be proved using a single unification step. The procedural approach allows you to write highly optimized procedure that are significantly faster than would be possible doing the same work UMiig axioms. The hybrid representation also allows for a more intuitive encoding of -the information, using a semantic network for the type information, and also could permit other nondeductive algorithms to be performed on the srm.intk _ network. - Another example of a specialized reasoner that can be put to very effective use concerns equality reasoning. It is very difficult to axiomatize equality directly into a theorem-proving system because it is hard to encode the equivalence of formulas that differ only in using two different names for the same object. Very efficient algorithms exist, however, for maintaining equality information between ground terms based on equivalence classes. If such techniques are built into the unifier, then no explicit axioms about equality need be encoded in the system. Rather, the extended unification algorithm would use the procedures defined for equality to check whether two terms are equal and thus can be unified. 
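A minimal sketch of the type-sensitive check such an extended unifier might perform, using the small hierarchy discussed above in which DOG is a subtype of MAMMAL, which is a subtype of ANIMAL; the dictionary encoding and function names are illustrative assumptions only.

# The hierarchy of Figure 13.8, plus the membership fact for Fido1.
# A typed variable is a pair (name, type); the unifier binds it to a
# constant only if the constant belongs to that type.
supertype = {"DOG": "MAMMAL", "MAMMAL": "ANIMAL", "ANIMAL": None}
instance_of = {"Fido1": "DOG"}

def is_a(constant, wanted_type):
    """Walk up the isa/subtype links to check type membership."""
    t = instance_of.get(constant)
    while t is not None:
        if t == wanted_type:
            return True
        t = supertype.get(t)
    return False

def unify_typed_var(var, constant, bindings):
    name, var_type = var
    if not is_a(constant, var_type):
        return None                     # type clash: unification fails
    bindings = dict(bindings)
    bindings[name] = constant
    return bindings

# ?x:ANIMAL unifies with Fido1 because Fido1 isa DOG, a subtype of ANIMAL.
print(unify_typed_var(("?x", "ANIMAL"), "Fido1", {}))   # {'?x': 'Fido1'}
print(unify_typed_var(("?x", "AIRPORT"), "Fido1", {}))  # None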
Other forms of specialized reasoning systems can be integrated by defining procedures that establish the truth of particular predicates. This technique is called procedural attachment. For instance, consider temporal reasoning. While it is possible to use an axiomatization of time to drive temporal reasoning, the resulting system would be very inefficient. There are, however, specialized reasoning techniques that can manage temporal information quite effectively. Such systems can be integrated into a hybrid system using special predicates. To see this, consider what roles propositions play in a reasoning system. There are generally three different operations applicable to propositions:

Assert that it is true (that is, add it to the KB)
Query whether it is true (that is, invoke the theorem prover on it)
Retract it (that is, remove it from the KB)

While each of these operations was defined in terms of a theorem-proving system, this does not have to be the only way such operations are accomplished. In fact, you could define arbitrary procedures to perform each of the tasks. For instance, consider a specialized temporal reasoning system that maintains a graph of temporal relations and uses graph search techniques to establish temporal relations. Assume that there is a predicate BEFORE in the KB that indicates that one time precedes another. When a proposition such as (BEFORE t1 t2) is to be added to the KB, the specialized temporal reasoner is invoked to add the information to its temporal graph. When the same proposition is queried (either directly by the user, or as a substep of a more complex proof), the specialized temporal reasoner is called to establish it. As a result, the specialized temporal reasoner can be fully integrated into the theorem prover and used whenever temporal information is required. Of course, to be fully integrated, the specialized reasoner would have to be able to handle variables and return results equivalent to unification. For instance, if the theorem prover needed to establish (BEFORE t1 ?x), the temporal reasoner would have to return a binding for ?x. In addition, it would need to be able to handle backtracking when alternate solutions need to be explored. Hybrid reasoning systems offer an attractive way to integrate specialized reasoning algorithms into a uniform framework.

Summary

Natural language understanding requires a capability to represent and reason about knowledge of the world. While there are many different techniques for representing knowledge, every representation sufficient for general language understanding must at least support the following general capabilities:

• a full range of logical operators and logical quantification, as found in FOPC
• a way to represent default, stereotypical information about the objects and situations that occur in the domain
• a way to explicitly represent and reason about finite sets
• a method of representing and reasoning about temporal information

These are representative but by no means exhaust the areas of concern. In a fully general system, for instance, you would also explore the spatial information in language, and the representation of mental attitudes. This chapter developed an abstract representation language, combining the techniques of FOPC and frame-based systems, that satisfies the requirements just listed.
This abstract representation could be realized within a wide range of knowledge representation systems using different techniques. Any knowledge representation system must support basic capabilities for pattern matching, for which the notion of unification and inference based on refutation provide a' formal basis. Many systems also specify specialized procedures for some or all of the reasoning tasks. A system that uses a mix of techniques, including deductive and procedural techniques, is called a hybrid system. Related Work and Further Readings Knowledge representation is a highly diverse area in artificial intelligence and is fundamental to many problems beyond language understanding. A good inm> duction to the field is the collection of papers in Brachman and LeveSque (1985); A good sample of current work in the field can be found in the proceedings of the Conferences on Knowledge Representation and Reasoning (for example. Brachman, Levesque, and Reiter (1989); Allen, Fikes, and Sandewall (1991); and Nebel, Rich, and Swartout (1992)). Norvig (1992) has written an excellent text that discusses different implementation techniques for knowledge representation systems. The introduction of frames by Minsky (1975) produced a large body of subsequent work in representation. One of the first knowledge representation systems based on these ideas was KRL (Bobrow and Winograd. 1977). Hayes (1979) performed an analysis of KRL in terms of FOPC, Most modern representation systems, such as the systems described in Brachman and Levesque i I'lSS) can be seen as combinations of frame systems, semantic networks, and deductive logic. A large class of systems organize knowledge around descriptions of categories of objects and are called term subsumption languages. There are good examples in Brachman and Levesque (1985). A good reference for semantic network-based systems is Sowa (1991). The strongest proponents of knowledge representations based on decompo -sitions into a small set of primitives have been Schank (1975) and Wilks (1975). Most current representation systems, however, Use abstraction as the organizational tools in representations, and are often based on semantic networks and frame systems. There have also been many proposals for decomposition in linguistics (such as Dowty (1979) and Jackendoff (1990)). The treatment of quantifiers by using explicit sets is common in computational systems (for example, Woods (1977); Warren and Pereira (1982); and Alshawi (1992)) and in linguistics (for example, McCawley (1993)). The study of tense and aspect is a very active area of research in linguistics, philosophy, and computational linguistics. A good reference on tense and aspect in the computational literature is a special issue of Computational Linguistics (1988). An excellent place to start in the linguistics literature is with Dowty (1979; 1986) and Bach (1986). More recent work in the area includes Parsons (1990) and Pustejovsky (1991). The classic reference for tense is Reichenbach (1947). There is also a large literature on the treatment of tense as a modal operator (for example, Prior (1967)). McCawley (1993) contains a good introduction to linguistic issues in dealing with tense and aspect. Allen (1984) describes a temporal logic that explicitly involves predicates in three different aspectual categories. Davis (1990) describes a logic-based representation that includes specialized representations for time, space, and many other aspects of the world. 
Most deductive techniques have evolved from work in resolution theory proving as introduced by Robinson (1965). Robinson introduced the resolution rule and proved that the resolution refutation proof technique was complete. The technique of proof by failure was used in the early AI programming language PLANNER (Hewitt, 1971), and was formalized by Clark (1978). Much of the work on default logics stems from work by Reiter (1980). Etherington and Reiter (1983) used this formalism to define inheritance formally with exceptions in type hierarchies. There is a large body of literature on representation using semantic techniques based on minimal models. A good general overview can be found in Genesereth and Nilsson (1987). Procedural semantics was extensively used in early systems, notably Winograd (1973) and Woods (1978), and is still a common technique in database query systems. The CHAT-80 system (Warren and Pereira, 1982) used similar techniques within a PROLOG-based representation to produce an elegant and quite powerful query mechanism. For a brief survey of question-answering techniques, see Webber (1992). A good example of a more current question-answering system is the TEAM system (Grosz et al., 1987). TEAM was aimed at being transportable, which means it can be relatively easily adapted to a different 424 CHAPTER 13 Knowledge Representation and Reasoning 425 database without having to rewrite the grammar and semantic interpreter. To do this, it uses a context-independent logical form similar to the approach used in Chapter 8. Exercises for Chapter 13_____^ 1. (easy) Give a plausible logic-based representation for the meaning of the following sentences, focusing on the interpretation of the quantifiers. If the sentence has a collective/distributive ambiguity, give both interpretations. Several men cried. Seven men in the book met in the park. All but three men bought a suit. 2. (easy) For each of the following lists of sentences, state whether the first sentence entails or implies each of the sentences that follow it, or that there is no semantic relationship between them. Justify your answers using sortie linguistics tests. a John didn't manage to find the key. John didn't find the key. John looked for the key. The key is hard to find. b. John was disappointed that Fido was last in the dog show. Fido was last in the dog show. Fido was entered in the dog show. John wanted Fido to win. Fido is a stupid dog. 3. (easy) Classify the indicated verb phrases in the following sentences as to whether they describe a state, an activity, ah achievement, or an accomplishment. Justify your answers with some examples that demonstrate their linguistic behavior. Discuss any problems that arise in your classification. Jack ran to the store. Jack was running to the store. Jack hated running. Jack runs every day. Jack stopped running when he broke his leg. 4. (medium) One of the classic examples of decompositional semantics is the encoding of the verb kill using a causation operator and a predicate DIE. In particular, the meaning of the sentence John killed Sam would be CA USE(Johnl, DIE1 (Sam)) Does this decomposition completely capture the meaning of the verb kill'1. Consider whether John killed Sam and John caused Sam to die are equiva- ■1 lent in all situations. Given your position on this issue, would it be better for a KR to decompose all instances of kill to this form, or to use a meaning postulate and retain a predicate KILL in the KR? Justify your answer. 
5. (medium) Using the translation of inheritance networks into logic described in Section 13.1, give the axioms for the simple network below.

[inheritance network for this exercise: a DRIVE action defined under a general class OBJ/ACTION, with AGENT and THEME roles; the THEME of DRIVE is restricted to CAR]

Note that the definition of the drive action places a more restrictive constraint on the type of the THEME role. Does the axiomatization do the right thing; that is, does it show that the resulting axioms are consistent and that, for any drive action D, ∃ o . THEME(D, o) ∧ CAR(o)? Did the definition of the THEME role on the general class OBJ/ACTION interfere with this in any way? In answering these questions, explicitly specify any assumptions you need to make about the type hierarchy to do each proof.

6. (medium) Using the frame-based representation described in this chapter, define an action class DRIVE that corresponds to a sense of the verb drive in

i. I drove to school today.

In particular, your definition should contain enough detail so that each of the following statements could be concluded from sentence i.

ii. I was inside the car at some time.
iii. I had the car keys.
iv. The car was at school for some time.
v. I opened the car door.

For each of these sentences, discuss in detail how the necessary knowledge is represented (as a precondition, effect, decomposition, and so on) and what general principle justifies it being a conclusion of sentence i. Identify three other conclusions that can be made from sentence i, and discuss how the required knowledge to make each conclusion is encoded. Should any of the definitions be considered to be default knowledge? If so, why? If not, identify one further conclusion that could be made if some default knowledge were used.

7. (medium) Consider a question-answering system on a fixed database restricted to ground literals as shown below and using the closed world assumption:

TYPE(FORD, COMPANY), TYPE(GM, COMPANY)
TYPE(CAR1, AUTO), TYPE(CAR9, AUTO)
MADE-BY(CAR1, FORD), COLOR(CAR1, WHITE)
MADE-BY(CAR2, FORD), COLOR(CAR2, WHITE)
MADE-BY(CAR3, FORD), COLOR(CAR3, BLACK)
MADE-BY(CAR4, FORD), COLOR(CAR4, WHITE)
MADE-BY(CAR5, GM), COLOR(CAR5, RED)
MADE-BY(CAR6, GM), COLOR(CAR6, WHITE)
MADE-BY(CAR7, GM), COLOR(CAR7, BLUE)
MADE-BY(CAR8, GM), COLOR(CAR8, RED)
MADE-BY(CAR9, GM), COLOR(CAR9, BLUE)

a. Specify a procedural semantics for the English quantifier most appropriate for database queries. Discuss any complications that arise, and any assumptions that you need to make.
b. Give the two different interpretations, due to quantifier scoping, of the sentence Most cars made by some company are white. Give an informal trace of the question-answering process in each of the interpretations. What is the answer in each case?
c. What knowledge of the domain could be brought to bear to select the most likely interpretation in part b? Informally describe a disambiguation process that would select the appropriate interpretation in such cases, using the analysis in part b as an example.

CHAPTER 14
Local Discourse Context and Reference

14.1 Defining Local Discourse Context and Discourse Entities
14.2 A Simple Model of Anaphora Based on History Lists
14.3 Pronouns and Centering
14.4 Definite Descriptions
14.5 Definite Reference and Sets
14.6 Ellipsis
14.7 Surface Anaphora

This chapter considers a set of issues related to reference and the local discourse context. This includes the issues of anaphoric reference, ellipsis, VP anaphora, and one-anaphora.
In order to explore anaphoric processing in more detail, a simple model of global discourse structure called the history list is introduced. This will allow a range of issues relating to reference to be discussed here, rather than wMting until global discourse context is considered in detail in Chapter 16. Section 14.1 explores the notion of local context and introduces the idea of discourse entities to serve as the referents for referential expressions. Section 14.2 develops the history list representation and explores its use in the interpretation of definite noun phrases. Section 14.3 looks at pronominal reference in more detail and develops some structural preferences for pronoun reference using a notion of a discourse center. Sections 14.4 and 14.5 explore definite descriptions: The first considers singular definite descriptions and the second considers plural descriptions and sets. Section 14.6 examines ellipsis, where the local context is used to fill in missing information in the current sentence. Finally, Section 14.7 looks at surface anaphora, a phenomena that falls somewhere between pronominal reference and ellipsis. 14.1 Defining Local Discourse Context and Discourse Entities As a start, think of the local context as including the syntactic and semantic struc ture of the preceding sentence, together with a list of objects mentioned in the sentence that could be antecedents for subsequent pronouns and other definite noun phrases. The local context is useful for many devices. For instance, it may contain antecedents of pronouns, as in la. Jackj lost his walletj in his car. lb. Hei looked for itj for several hours. In addition, the local context defines the structures most useful for interpreting sentences that use verb phrase ellipsis, such as 2a. Jack forgot his wallet. 2b. Sam did too. Such verb phrase ellipsis typically refers to the event described in the previous sentence, as in 3a. Jack forgot his wallet. 3b. He looked for someone to borrow money from. 3c. Sam did too. In this discourse, sentence 3c cannot mean that Sam forgot his wallet as well. But there are problems with using the sentence as the unit for local context, due to the presence of conjunctions. For instance, discourse 3 could be modified 430 CHAPTER 14 Local Discourse Context and Reference 431 so that sentences 3a and 3b are conjoined into one sentence, but this doesn't change the possible readings: 4a. Jack forgot his wallet, so he looked for someone to borrow money from. 4b. Sam did too. It is hard to interpret 4b as meaning Sam forgot his wallet. In contrast, a single sentence with a conjunction supports VP ellipsis between the conjuncts, as in 5. Jack forgot his wallet, and Sam did too. As a further complication, the following example involving conjunction between verb phrases indicates that some conjunctions do create contexts that support ellipsis, as in 6a. Jack forgot his wallet and lost his credit cards. 6b. Sam did too. In this case, 6b means that Sam also forgot his wallet and lost his credit cards. Given these examples, a reasonable start is to consider the local context to be derived from the preceding major clause rather than the preceding sentence, For instance, because conjunctions such as and can conjoin major clauses, examples like sentence 5 can be accounted for: The first conjunct provides the local context for the second. VP conjunction occurs within a single major clause, so the local context for 6b is all of 6a, as desired. 
Subordinate clauses are included as part of the major cause they modify so should not create a new local context. This is seen in the following example: 7a. Jack forgot his wallet when he went out to the movies. 7b. Sam did too. Sentence 7b can be interpreted to mean that Sam lost his wallet (or that he lost his wallet when he went out to the movies) but not to mean that he went out to the movies. An important part of the local context is a list of possible antecedents fot pronouns, which we will call the discourse entity (DE) list. The DE list is a set of constants defined in the KB that represent die objects that were mentioned in the last major clause and can subsequently be referred to by a pronoun. Sometimes a DE is not explicitly mentioned in the previous clause but is implicitly introduced by it. To allow for such cases, we will usually talk of the objects that a sentence evokes, which includes those mentioned explicitly as well as the implicit ones, to be discussed later in this chapter. When we say that a pronoun has a certain antecedent, we really mean thai the pronoun and antecedent both refer to the same object; that is, die pronoun and its antecedent co-refer. It is important to note that a pronoun and its antecedent may co-refer to the same object X even though neither the speaker nor the hearei could identify X in any concrete way. It simply means that they refer to the same object, no matter what that object is. Consider the discourse fragment 8a John bought a car; yesterday. 8b. It; was very expensive. This can be said and understood without either the speaker or hearer ever having seen the car or having any means to identify it. The indefinite description a car asserts the existence of a car and the pronoun it is used to add additional information about that car. This discourse is perfectly well understood even though no one knows what car is being talked about in the real world. So how can this referent be represented as a constant in the KB when we don't know which object is referred to? Faced with problems like this, some researchers argue for a completely different level of representation for discourse entities. But we will take a simpler approach and use Skolem functions as the discourse entities. Remember that a Skolem function (or Skolem constant) is simply a new term introduced into the language. When possible, we will use the discourse marker introduced in the logical form as the new Skolem constant. Thus if a car had the logical form , the Skolem constant generated would be CI. Co-reference is indicated by using equality. Thus, if the pronoun it has the logical form (PRO il IT1), then the fact that it co-refers with Cl would be captured by asserting (II = Cl). Generating Discourse Entities As a clause is interpreted, a discourse entity is typically generated for each noun phrase. Different NPs place different constraints on the discourse entities they evoke. An indefinite NP typically evokes a new discourse entity, as previously described, and often need not be further identified in the KB. A proper name, on the other hand, generally describes some object already defined in the KB that is associated with the name. A definite NP (including pronouns) typically refers to an object previously mentioned in the discourse, often a discourse entity in the local context. Plural NPs evoke sets of objects, A complex NP involving conjunction evokes a set consisting of all the conjuncts. 
For example, the NP John and Mary evokes three DEs, John1, Mary1, and the set {John1 Mary1}.

The rest of this section considers the discourse entities evoked by indefinite NPs. This class is very important because indefinite NPs introduce many new discourse entities, and provide much of the background needed for handling definite reference (discussed in later sections). To compute the set of discourse entities, we first convert the logical form into the representation of quantifiers described in Section 12.4, reducing all the natural language quantifiers to universal and existential quantification using explicit sets. Also, for the simple examples here, we assume that all proper names and definite NPs are replaced with KB constants representing their referents.

Plural noun phrases evoke the same set whether interpreted collectively or distributively. In collective readings, the set introduced is used directly in the proposition as an argument, whereas in distributive readings, the set is used as the range of the universal quantifier. To define the set, the modifiers must be divided into two classes in the translation: the restrictions on the set (SR) and the restrictions on each individual in the set (IR). For example, the translation of the noun phrase Three boys in the sentence

9. Three boys lifted Fido.

would involve an SR that states that the set has three elements, and an IR that asserts that each element is a boy. Figure 14.1 summarizes the translation of some indefinite noun phrases.

Example          Discourse Entity (DE)   Set Restriction (SR)   Individual Restriction (IR)
a man            M1                      none                   MAN1(M1)
three women      W1                      |W1| = 3               W1 ⊆ {w | WOMAN1(w)}, or equivalently, ∀ w . w ∈ W1 ⊃ WOMAN1(w)
some black cats  C1                      |C1| > 1               C1 ⊆ {c | CAT1(c) & BLACK1(c)}

Figure 14.1 The translation of indefinite noun phrases
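The SR/IR split in Figure 14.1 is straightforward to operationalize. The sketch below builds a discourse entity record for plural indefinites only (singular indefinites, which evoke an individual rather than a set, are not covered), and the string-based encoding of the restrictions, the marker-generation scheme, and the function names are assumptions made for illustration.

# A sketch of generating discourse entities for plural indefinite NPs,
# following the SR/IR division of Figure 14.1.

from dataclasses import dataclass

@dataclass
class DiscourseEntity:
    name: str                    # discourse marker, e.g. "W1"
    set_restriction: str         # SR: constraint on the set itself
    individual_restriction: str  # IR: constraint on each member

_counters = {}
def new_marker(prefix):
    """Generate a fresh discourse marker such as W1, C1, ..."""
    _counters[prefix] = _counters.get(prefix, 0) + 1
    return f"{prefix}{_counters[prefix]}"

def indefinite_np(prefix, restriction, cardinality=None):
    """Build the DE for an indefinite NP such as 'three women' or 'some black cats'."""
    de = new_marker(prefix)
    sr = f"|{de}| = {cardinality}" if cardinality is not None else f"|{de}| > 1"
    ir = f"{de} subset-of {{x | {restriction}}}"
    return DiscourseEntity(de, sr, ir)

print(indefinite_np("W", "WOMAN1(x)", cardinality=3))   # the entry for "three women"
print(indefinite_np("C", "CAT1(x) & BLACK1(x)"))        # the entry for "some black cats"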
Indefinite Noun Phrases Within the Scope of Universals

Complications arise when an indefinite noun phrase appears within the scope of a universal quantifier arising from a distributive interpretation. For example, a singular indefinite noun phrase within the scope of a universal evokes a set rather than an individual. Consider the following discourse:

10a. Three boysi each bought a pizzaj.
10b. Theyi ate themj in the park.

Note that Theyi ate itj in the park is not an allowable continuation of sentence 10a. What properties are needed to define the discourse entity generated from a pizza in 10a? The set {x | Pizza(x)} is far too general, as the themj in 10b couldn't refer to the set of all pizzas. A better treatment of a pizza is a new discourse entity P1 denoting a subset of {x | Pizza(x)}. With this interpretation, themj in 10b could refer to the right set of pizzas, namely those introduced in the last sentence. But this treatment would still lose information known about the contents of P1, namely that it is the set consisting of the pizzas bought by the boys mentioned in 10a. A representation of P1 can be derived by considering the original formula for 10a:

∀ b : b ∈ B1 . ∃ p : Pizza(p) . Buy(b, p)
where |B1| = 3 & B1 ⊆ {x | Boy(x)}

The skolemized form is

∀ b : b ∈ B1 . Pizza(sk4(b)) & Buy1(b, sk4(b))

where sk4(b) is a new function that yields the pizza that each boy bought. Given this, the set P1 could be defined as the set of pizzas generated by this new function:

P1 = {x | Pizza(x) & ∃ y : y ∈ B1 . x = sk4(y)}

This exactly picks out the pizzas that the boys bought. The entire analysis of this sentence is summarized in Figure 14.2.

Sentence: Three boys each bought a pizza.
Initial Translation:
  ∃ B : |B| = 3 & B ⊆ {x | Boy(x)} . ∀ b : b ∈ B . ∃ p : Pizza(p) . Buy(b, p)
Discourse Entities:
  B1 : |B1| = 3 & B1 ⊆ {x | Boy(x)}
  P1 : P1 = {x | Pizza(x) & ∃ y : y ∈ B1 . x = sk4(y)}
Semantic Content:
  ∀ b : b ∈ B1 . sk4(b) ∈ P1 & Buy1(b, sk4(b))

Figure 14.2 The discourse entities generated from Three boys each bought a pizza.

14.2 A Simple Model of Anaphora Based on History Lists

This section considers a simple technique used for identifying the antecedents of pronouns. The basic structure is the history list, which is a list of discourse entities generated by the preceding sentences, with the most recent listed first. Viewed in terms of the previous section, the history list is a sequence of structures corresponding to the discourse entities in the prior local contexts. The entities from the current local context (that is, the entities generated by the previous clause) are listed first, then the entities in the local context generated by the sentence before that, and so on.

In Chapter 12 you saw how pronouns may refer to objects within the same sentence, and a set of co-reference constraints were derived that indicated which objects in the same sentence could or could not be the antecedent of each pronoun. These constraints affect the intersentential cases as well. For example, the reflexivity constraint holds even if there is an intersentential occurrence of the same object, as in the discourse

11a. Jack saw Sam at the party.
11b. Sam gave him a drink.

In this discourse the reflexivity constraint indicates that him and Sam in 11b cannot co-refer. This also prohibits the pronoun him from co-referring with the discourse entity evoked by Sam in 11a.

The possible antecedents for pronouns are not restricted to appearing in the local context, but the local context is very important for resolving pronominal reference. A large majority of antecedents for pronouns are found in the same sentence or in the local context. The further back in the discourse an antecedent was last mentioned, the less likely it is to be referred to again by a pronoun.

Once the algorithm for producing the discourse entities generated by a sentence is defined, the history list idea is quite simple. The history list consists of all the discourse entities that have been evoked in the reasonably recent past. Some systems allow just the last one or two local contexts, while others let the history list grow unboundedly. Given the history list, the algorithm for finding an antecedent proceeds as follows: Check the most recent local context for an antecedent that matches all the constraints related to the pronoun. Constraints may come from any source. For example, reflexivity constraints will prohibit some objects from being the antecedent, gender and number will eliminate others, and constraints derived from imposing selectional restrictions may introduce further restrictions. If no antecedent is found in the current local context, then move down the history list to the next most recent local context and search there. This algorithm implements what is often called the recency constraint, which states that the antecedent should be the most recently mentioned object that satisfies all the constraints.

Consider a discourse concerning a sailboat race:

12a. The companies had a lot of money and spent lavishly on their boat.
12b. The boys, in contrast, built their boat on a tight budget.
12c. They knew they would win the race easily.

The they in 12c most likely refers to the boys, even though on purely semantic grounds it would be more likely that the companies felt they would win the race. The history list for sentence 12c might look like that in Figure 14.3. To find the antecedent for the pronoun they, you search the history list for the first object that satisfies the pronoun's constraints. In this case it would be an object x satisfying THEY1(x), which would be true of any plural set. The first entity tested, B2, would succeed and be selected as the referent. The same basic technique can often be used for definite descriptions as well. For instance, if 12c were They knew the boat would win easily, the antecedent of the NP the boat would be the first object x that satisfies the constraint BOAT(x), which would be B3. History lists are the basic underpinnings of many computational approaches. The technique must be refined further, however, to capture certain cases.
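The recency-based search is simple to sketch in code. In the sketch below each local context is a list of (entity, properties) pairs mirroring Figure 14.3; the property-set encoding of the constraints is an assumption made for the example, since a real system would test the pronoun's constraints against the KB.

# A sketch of the recency-constraint search over a history list.
# Each local context is a list of (entity, set-of-properties) pairs,
# most recent context first.

history = [
    [   # local context from sentence 12b
        ("B2", {"PLURAL", "BOYS"}),
        ("B3", {"BOAT"}),
        ("B4", {"BUDGET"}),
    ],
    [   # local context from sentence 12a
        ("C1", {"PLURAL", "COMPANIES"}),
        ("M1", {"MONEY"}),
        ("B1", {"BOAT"}),
    ],
]

def find_antecedent(history, required):
    """Return the most recently mentioned entity that satisfies all constraints."""
    for local_context in history:          # most recent local context first
        for entity, props in local_context:
            if required <= props:          # the entity satisfies every constraint
                return entity
    return None

print(find_antecedent(history, {"PLURAL"}))  # "B2": they in 12c -> the boys
print(find_antecedent(history, {"BOAT"}))    # "B3": the boat -> the boys' boat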
The they in 12c most likely refers to the boys, even though on purely semantic grounds it would be more likely that the companies felt they would win the race The history list for sentence 12c might look like that in Figure 14.3. To find the antecedent for the pronoun they, you search the history list for the first object that satisfies the pronoun's constraints. In this case it would be an object > satisfying THEYl(x), which would be true on any plural set. The first cntih tested, B2, would succeed and be selected as the referent. The same basic tecli nique can often be used for definite descriptions as well. For instance, if 12c weie They knew the boat would win easily, the antecedent of the NP the boat would tithe first object x that satisfies the constraint BOAT(x), which would be B3. History lists are the basic underpinnings of many computational ap proaches. The technique must be refined further, however, to capture certain Local Discourse Context and Reference 435 3» Sentence Sentence 12b Sentence 12a Discourse Entities Generated B2: B2 cz{x\Boy(x)) B3 : Boat(B3) & BuiltBy(B3,B2) B4 : Budget(B4) & Limited-Funds(B3) CI : CI cz{c\ Company(c)} Ml: MoneyiMl) Bl : Boat(B2) & OwnedBy{B2, CI) Figure 143 The history list generated by discourse 12 cases. The next two sections look at issues in interpreting pronouns and definite descriptions in more detail. History lists will also be reconsidered in Chapter 16, when issues related to global discourse structure are discussed in more detail. 14.3 Pronouns and Centering The recency constraint imposes a preference between the local contexts. Are there preferences between DEs within a single local context? The answer is yes. There do seem to be some preferences for the objects playing the central roles in the major clause over the discourse entities generated in adjunct phrases and subordinate clauses. In some cases, these preferences are strong enough to interfere with logically possible interpretations. Consider the discourse (adapted from Wilks (1975)) Jack drank the wine on the table. It was brown and round. Even though recency and the semantic constraints suggest that it refers to the table, most people have trouble finding this interpretation, and many consider the second sentence anomalous, or at least humorous, because it appears to refer to the wine. Preferences between the major arguments in the main clause are more subtle, but consider the discourse 13a. Jack saw Sam at the party. 13b. He went back to the bar to get another drink. While either Jack or Sam are logically possible antecedents for the he in 13b, there seems to be a preference for he to be referring to Jack. This preference is not strong enough to override semantic and contextual influences, however, as seen in 14a. Jack saw Sam at the party. 14b. He clearly had drunk too much. It is hard to interpret he in 14b as Jack, for with that interpretation it is not obvious how the two sentences are related to each other. 436 CHAPTER 14 There seems to be another factor that affects the interpretation as well. T> is based on the notion of a discourse focus or center. The intuition behind these theories is that most discourse is organized around an object that the discoursi about. This object, called the center, tends to remain the same for a few sentences and then shift to a new object. The second key intuition is that the center of a • sentence is typically pronominalized. 
This affects the interpretation of pronouns because once a center is established, there will be a strong preference for subsequent pronouns to continue to refer to the center. For example:

15a. Jack left for the party late.
15b. When he arrived, Sam met him at the door.
15c. He decided to leave early.

Semantically, 15c makes sense with either Jack or Sam as the antecedent, and the structural preferences favor Sam because he plays a central role in the major clause in 15b. Centering theory, however, would predict that Jack is the antecedent because Jack was referred to pronominally in 15b and thus is the center of 15b, and nothing in 15c indicates that the center has changed. More precisely, two interacting structures are used in the centering theory:

• The discourse entities in the local context, which we will call the potential next centers (or forward-looking centers). These are listed in an order reflecting structural preferences: subject first, direct object next, indirect object, and then the other discourse entities in the sentence. The first one on the list is called the preferred next center, written as Cp.
• The center, written as Cb (for backward-looking center), which is what the current sentence is about. The Cb is one of the potential next centers, and typically it is pronominalized.

The constraints between the center and pronominalization can be stated as follows:

Centering Constraint 1—If any object in the local context is referred to by a pronoun in the current sentence, then the center of that sentence must also be pronominalized.

Centering Constraint 2—The center must be the most preferred discourse entity in the local context that is referred to by a pronoun.

Centering Constraint 3—Continuing with the same center from one sentence to the next is preferred over changing the center.

Note that by constraint 1, if there is only one pronoun in a sentence, then it identifies the center unambiguously. By constraint 2, this means that if the next sentence also contains a single pronoun, and it is contextually reasonable that the two pronouns refer to the same object, then they will co-refer. For example, given the discourse

16a. Jack1 saw him2 in the park3.
16b. He4 was riding a bike5.

Let DR1, DR2, and DR3 be the three discourse entities produced for Jack, him, and the park respectively. DR2 is the center (the Cb) of sentence 16a, and the potential next center list, in order of preference, is DR1, DR2, DR3. The first element of this list, DR1, is the preferred next center (the Cp). Given this local context, now consider 16b. On purely semantic grounds he could refer back to DR1 or to DR2. Given the centering constraints, however, there should be a preference that DR4=DR2, continuing with the same center. Given this interpretation, DR4 would be the center of 16b as well as its preferred next center.

A sentence containing multiple pronouns is more complex because there is ambiguity with respect to which pronoun corresponds to the center. For example, the centering constraints given so far would indicate no preferences in the following, where he in 17a is interpreted intrasententially, referring to Jack.

17a. While Jack1 was walking in the park2, he1 met Sam3.
17b. He4 invited him5 to the party6.

By constraint 1, the center of 17a is unambiguously DR1, that is, Jack, since there is only one pronoun. Sentence 17b contains two pronouns, however, so the centering constraints can be satisfied with either DR4=DR1 or DR5=DR1; that is, there is no preference for whether He4 in 17b refers to Jack or Sam. This may or may not be a problem depending on your intuitions. But for many, there is a preference for He4 to refer to Jack.

To explore this concept, note that there are four different combinations for how two sentences relate to each other based on the center and preferred next center. These are shown in Figure 14.4, in which Cb1 and Cb2 are the centers of the two sentences, and Cp2 is the preferred next center of the second sentence.

              Cb2 = Cp2               Cb2 ≠ Cp2
Cb1 = Cb2     Continuing              Retaining
Cb1 ≠ Cb2     Shifting to preferred   Shifting to nonpreferred

Figure 14.4 The types of movement for centers

When the center remains the same between the two sentences, the transition is either a continuation or a retention, depending on whether or not the center is now the preferred next center as well. If the center shifts, there are two cases, depending on whether or not the new center is the preferred next center. These distinctions are used in a revised constraint 3:

Centering Constraint 3'—Continuing is preferred over retaining, retaining over shifting to preferred, and shifting to preferred over shifting to nonpreferred.
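The four-way classification in Figure 14.4 and the preference ordering of constraint 3' translate directly into code. The following sketch is illustrative only; the function names and the tuple encoding of candidate analyses are assumptions made for the example.

# A sketch of classifying centering transitions (Figure 14.4) and ranking
# candidate analyses by constraint 3'.

PREFERENCE = ["continuing", "retaining", "shift to preferred", "shift to nonpreferred"]

def transition(cb1, cb2, cp2):
    """Classify the movement of the center between two adjacent sentences."""
    if cb1 == cb2:
        return "continuing" if cb2 == cp2 else "retaining"
    return "shift to preferred" if cb2 == cp2 else "shift to nonpreferred"

def preferred(cb1, candidates):
    """Pick the (Cb2, Cp2) candidate whose transition is most preferred."""
    return min(candidates, key=lambda c: PREFERENCE.index(transition(cb1, c[0], c[1])))

print(transition("Jack1", "Jack1", "Jack1"))   # continuing
print(transition("Jack1", "Jack1", "Sam1"))    # retaining
print(transition("Jack1", "Sam1", "Sam1"))     # shift to preferred
print(preferred("Jack1", [("Jack1", "Sam1"), ("Jack1", "Jack1")]))   # the continuing reading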
By constraint 1, the center of 17a is unambiguously DRi, that is, Jack, since there is only one pronoun. Sentence 17b contains two pronouns, however, so the centering constraints can be satisfied with either DR4=DR 1 or DR5 =DR 1; that is, there is no preference for whether He4 in 17b refers to Jack or Sam. This may or may not be a problem depending on your intuitions. But for many, there is a preference for He4 to refer to Jack. To explore this concept, note that there are four different combinations for how two sentences relate to each other based on the center and preferred next center. These are shown in Figure 14.4, in which Cbt and O02 are the centers of the two sentences, and Cp2 is the preferred next center of the second sentence. When the center remains the same between the two sentences, the transition is either a continuation or a retention, depending on whether or not the center is now the preferred next center as well. If the center shifts, there are two cases, depending on whether or not the new center is the preferred next center. These distinctions are used in a revised constraint 3: Centering Constraint 3'—Continuing is preferred over retaining, retaining over shifting to preferred, and shifting to preferred over shifting to nonpreferred. 438 CHAPTER 14 Local Discourse Context and Reference 439 Sentence 17a: While Jack j was walking in the parkj, he i met Sam 3. Discourse Entities: Cp: DRi=Jackl, Others: DR3=Saml, DR2=Park7 Discourse Center: Cb: Jackl Sentence 17b: He4 invited him5 to the party6. Interpretation 1: (Continuing) Discourse Entities: Cp: DR4=Jackl, Others: DR5=Saml, DR6=Partyl Discourse Center: Cb: Jackl Interpretation 2: (Retaining) Discourse Entities: Cp: DR4=Saml, Others: DR5=Jackl, DR6=Partyl Discourse Center: Cb: Jackl Figure 14.5 Continuation versus retention of the center Consider again the interpretation of discourse 17, which is summarized in Figure 14.5. Since both Jack and Sam are referred to pronominally in 17b, constraints requires that the center be Jackl. But this in itself doesn't determine the referent for He^ as there are two possible antecedents: DRi, the center in the local context, indicating a continuation of the center, and DR3, indicating a shift to the preferred next center. Constraint 3' prefers interpretation 1, and He4 refers to Jack. How do centering preferences interact with the possibility of intrasentential interpretations? Determining what technique is best must await further development and evaluation of the possible algorithms. Currently, some algorithms always prefer intrasentential referents, while others favor the reverse. An interesting combination is to prefer any interpretation that assigns a pronoun to the center, but failing that, to prefer intrasentential readings over intersentential readings. Whatever the strategy, it is important to remember that it is more general contextual factors that ultimately determine the best interpretation. Any algorithm based solely on structural properties will not be foolproof. This suggests that these structurally based pronoun resolution algorithms will be most useful if they produce a list of possible referents in preferred order for each pronoun and leave the final decision to the general reasoning system. Finding the Likely Antecedents What follows is an example algorithm that produces an ordered list of possible referents for each pronoun based on the local discourse context, the co-reference restrictions, and a subset of the centering constraints just discussed. 
Unfortunately, because of co-reference restrictions, there is no way to independently process each pronoun. For instance, in 17b, if the two pronouns are interpreted independently, the preferred referent for each would be the center in the local context, namely Jack. But we know both cannot refer to Jack because of the reflexivity constraints. There is no way to avoid this problem at this stage of the processing except by enumerating each combination of referent assignments to each pronoun. In the following algorithm, we avoid the issue by passing on the problem. The highest-rated referent for both pronouns will be Jack, and we leave it to the reasoning system to apply the co-reference restrictions and select the interpretation. Thus centering constraint 3 may need to be considered by the reasoning system when it makes the final choice. There are three steps to the algorithm:

1. Generate a ranked list of possible antecedents for each pronoun.
2. Use general reasoning to select the appropriate antecedents.
3. Use the results of step 2 to define the Cb for the sentence to be used as part of the local context of the next sentence.

Step 2 will be discussed in Chapter 15. Figure 14.6 gives an algorithm for step 1.

For each pronoun (PRO p S R) with antecedent constraints CRx derived from S and R:
1. Create an ordered list L of possible antecedents, consisting of (in order) the Cb and Cp from the local context, the possible referents from intrasentential processing, and the rest of the discourse entities in the local context.
2. Remove any referent r in L that makes CRr false.
3. Add (REF-LIST p L) to the restrictions in the pronoun's logical form.

Figure 14.6 An algorithm for identifying possible antecedents for nonreflexive pronouns

Consider the algorithm running on discourse 17. The local context generated from 17a was shown in Figure 14.5. The logical form of 17b, He invited him to the party, might be

(PAST (INVITE1 i1
  [AGENT (PRO h1 (& (HE1 h1) (≠ h1 h2)))]
  [THEME (PRO h2 (& (HE1 h2) (≠ h1 h2)))]
  [PURPOSE ]))

Consider the pronoun (PRO h1 (& (HE1 h1) (≠ h1 h2))). The initial list of candidates consists of the Cb (Jack1), the Cp (Jack1 again), the intrasentential antecedents (none), and the other discourse entities (Sam1, Park1), yielding the list (Jack1 Sam1 Park1). Only the first two of these satisfy the constraints on h1, so the logical form is updated to

(PRO h1 HE1 (≠ h1 h2) (REF-LIST h1 (Jack1 Sam1)))

The second pronoun is handled similarly and is updated to

(PRO h2 HE1 (≠ h1 h2) (REF-LIST h2 (Jack1 Sam1)))

This information would be passed on to the general reasoning system in step 2, which, if it agrees with intuition, will identify the antecedent of h1 as Jack1 and h2 as Sam1. With these results, you can then identify the center of the current sentence, and then use it as part of the local context for the next sentence. The algorithm for step 3, shown in Figure 14.7, implements the centering constraints discussed earlier.

If the sentence contains no pronouns, then the new Cb is NIL. Otherwise, do the following:
1. Construct a list L consisting of the referents of each pronoun.
2. If the old Cb is in L, the new Cb = the old Cb (a continuation or retention).
3. Otherwise, pick the element of L that was highest ranked in the old discourse entity list and make it the new Cb (a shift).

Figure 14.7 An algorithm for identifying the new Cb

In the previous example, this algorithm would identify Jack1 as the new Cb.
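The algorithms in Figures 14.6 and 14.7 can be realized quite directly, as in the sketch below. The dictionary layout of the local context and the stand-in constraint test are assumptions made for the example; in the text the constraints come from the pronoun's logical form and from the co-reference restrictions of Chapter 12.

# A sketch of the ranked-antecedent algorithm (Figure 14.6)
# and the new-Cb computation (Figure 14.7).

def ranked_antecedents(constraints, local_context, intrasentential=()):
    """Figure 14.6: build an ordered REF-LIST for one nonreflexive pronoun."""
    ordered = ([local_context["cb"], local_context["cp"]]
               + list(intrasentential)
               + local_context["others"])
    ref_list = []
    for entity in ordered:
        if entity is not None and entity not in ref_list and constraints(entity):
            ref_list.append(entity)
    return ref_list

def new_cb(pronoun_referents, old_cb, old_entity_ranking):
    """Figure 14.7: compute the center of the current sentence."""
    if not pronoun_referents:
        return None
    if old_cb in pronoun_referents:
        return old_cb                                    # continuation or retention
    return min(pronoun_referents, key=old_entity_ranking.index)   # a shift

# Local context from 17a: Cb = Jack1, Cp = Jack1, others = [Sam1, Park1].
context = {"cb": "Jack1", "cp": "Jack1", "others": ["Sam1", "Park1"]}
animate = lambda e: e in {"Jack1", "Sam1"}     # stand-in for the HE1 constraint

print(ranked_antecedents(animate, context))    # ['Jack1', 'Sam1']
print(new_cb(["Jack1", "Sam1"], "Jack1", ["Jack1", "Sam1", "Park1"]))   # 'Jack1'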
14.4 Definite Descriptions A definite description refers to an object that is usually uniquely determined in the context. To identify the referent, you may have to use information from any of the four types of context. A description may uniquely describe an object in the general or the specific setting, or it may describe an object in the local or global discourse context. There is an important distinction in the use of definite noun phrases, depending on whether they are used existentially or refercntially. These readings have been discussed previously, but to remind you, the existential reading asserts the existence of a unique object satisfying the description, and the referential reading uses the description to refer to a previously known object, just as names and pronouns do. With the referential reading, the hearer must identify the referent, not merely accept that the referent exists. For instance, if we live on a dairy farm and, unbeknownst to me, exacdy one of the cows is sick, you cannot simply start a conversation with the sentence The cow is sick, unless some preceding context allows me to infer which cow is being talked about. This sentence would be inappropriate as a way of asserting that exactly one cow is sick, as would be suggested by the existential reading. In referential readings, definite descriptions generally must identify objects that already exist in context or the sentence is considered defective. Note that a definite description does not need to identify an object in some absolute sense, as the discourse context can be used to constrain the possible referents. In fact, a referential noun phrase can successfully refer to an object that is not identifiable in the world by the speaker or hearer. For example, consider the discourse 18a. Helen bought a carj and a boat yesterday. 18b. She paid too much for the car;. 4* •sir 5 BOX 14.1 Generating Referring Expressions Discourse models are crucial for enabling a natural language generation system to appropriately select referring terms. Consider that a generation system starts with a meaning representation that involves constants, without a mode of reference. For instance, consider generating sentences from the following two meanings: 1. ThinHWl, Won(Wl))& Woman(Wl) 2 ThinU. W2, Won(WI))& Woman(Wl) & Woman( W2)&W1*W2 Given no prior discourse context, the generation system might generate something like A woman thought she won for the first and A woman thought a woman won for the second. Note that the second sentence could not be used to realize meaning 1, because it would violate the co-reference constraints from Section 12.4 that the two instances of the noun phrase a woman cannot co-refer. The centering constraints also affect how a sentence is realized given the context of the previous sentence. Consider what happens if the system must realize meaning 1 after it has just generated the sentence Jane guessed a number from the expression 3. Guess-Number(Wl) & NameiWl, "Jane") Following this, meaning 1 would best be realized as She thought she won. The alternative realization Jane thought she won would be acceptable but stilted, while Jane thought that Jane won and Jane thought that the woman won would suggest incorrect readings. On the other hand, consider the context generated by the discourse 4a. Jane guessed a number. 4b. She picked the one Sue had suggested. 
If meaning 2 needs to be realized next, where W2 represents Sue, then Sue thought she won is acceptable but She thought Jane won is not, because it violates centering constraints. A good example of a generation system that uses discourse constraints is McKeown (1985).

This is understandable without either the speaker or hearer being able to identify the car in the real world. The definite description the car uniquely identifies an object in the KB, namely the car that Helen bought, evoked in the previous sentence. Such examples suggest close parallels between definite descriptions and pronouns, although there are some differences. Definite descriptions, for instance, do not have such a strong tendency to have antecedents in the immediate local context. In addition, as discussed in Chapter 12, they cannot co-refer with many constituents in the same sentence. Rather, they typically refer back to objects introduced earlier in the history list.

Definite descriptions may refer to objects that have not been introduced in an earlier discourse context. These objects are unique within some assumed setting for the discourse. For example, usually the definite description the moon refers to the moon that orbits the earth, although other moons exist and can be referred to in an appropriate context. But without a specific context that makes some other moon relevant, the description refers to the standard object. Definite descriptions can also be used to refer to objects that are visually present to the participants but have not been mentioned. For example, if I walked up to a house, I could ask someone to open the door (referring to the door in front of us).

We can specify the constraints on successful definite reference by defining the concept of contextual uniqueness. An object O is contextually unique with respect to a description Dx, a situation S that includes both the specific and general setting, and a discourse context consisting of a sequence of local discourse contexts C1, ..., Cn, if and only if

1. There is a number k such that
   a. for all i < k, there is no object x in Ci such that Dx is true, and
   b. O is the only object in Ck such that Dx is true.
2. Otherwise, there is a unique object O in S such that Dx is true.

This definition suggests an algorithm for handling referential noun phrases that modifies the basic history list search described in Section 14.2. Given a definite description whose descriptive content is the formula Dx, test each discourse entity r in the local context to see if Dr is true. If Dr is true for exactly one r, then that is the referent. If more than one r is found, then the description is faulty in some way. If no discourse entity satisfies Dx, then repeat the same procedure on the previous discourse context. If no referent has been found in any local context, test whether Dx is true for exactly one object in S. As with pronoun interpretation, the final decision on the referent will have to be made by the general reasoning system. Thus, the algorithm just outlined should be used to suggest a list of possible referents in order of preference. This information could be added as an additional restriction in the logical form, as was done for pronouns.

To identify the referents of complex noun phrases, all the referential terms in the noun phrase must be resolved simultaneously. For instance, consider the noun phrase the cow in the field. To find the referent of this expression, it might seem that you must first find the referent of the NP the field. But if the referents are sought in this order, some problems arise. For instance, consider a situation in which there are two cows, C1 and C2, and two fields, F1 and F2. Furthermore, cow C1 is in field F1, and cow C2 is in a barn, B1. The situation is described by the following formulas:

Cow(C1), Cow(C2), Field(F1), Field(F2), Barn(B1)
In(C1, F1), In(C2, B1)
To find the referent of this expression, it might seem that you must first find the referent of the NP the field. But if the referents are sought in this order, some problems arise. For instance, consider a situation in which there are two cows, Cl and C2, and two fields, Fl and F2. Furthermore, cow Cl is in field Fl, and cow C2 is in a barn, Bl. The situation is described In the following formulas: Cow(Cl), Cow(C2), Field(Fl), Field(F2), Barn(Bl) ln(Cl, Fl), In(C2, Bl) 1 3sS. •si 11 The NP the cow in the field clearly refers to Cl here, but the NP the field is ambiguous between Fl and F2. Thus you would fail to find a unique referent for the field, and so the whole NP would not be interpretable. To avoid this problem, you must search for both simultaneously to find an object that satisfies the formula 3 c : Cow (c) 3 /:, Field(f). In(c,f) There is only a single solution to this query, namely where c is instantiated with Cl and f is instantiated with Fl. Existential Readings and Indirect Reference Existential readings typically identify a unique object defined by the description. But being unique is not sufficient, as seen by the example The cow is sick, where even if you accept that exactly one cow in your herd is ill, the sentence is defective. Rather, existential readings typically can only introduce an object by defining it in terms of some other object that can be identified referentially. Consider 19. The winner in the 10K race was American. Neither the speaker nor hearer need to know who won the 10K race, they only have to agree that there is a single winner of the race, and whoever it was, that person was American. In this case, a general knowledge that races typically have exactiy one winner would allow the hearer to verify the uniqueness condition. In other cases, however, the uniqueness can only be inferred from the use of the definite noun phrase. Consider the difference in the interpretations of The drawer of my dresser is stuck and A drawer in my dresser is stuck. The hearer may know nothing about the dresser before the sentence. To interpret either sentence, the hearer must assume that there is a dresser. Furthermore, the hearer will probably infer that the dresser has exactly one drawer if given the first sentence, and that it has more than one drawer if given the second. These examples show that definite descriptions do not always refer to objects already in the discourse history. They can also introduce objects. This process is often called accommodation, because the hearer accommodates the speaker by introducing objects and properties that are required for the sentence to be understood. A new quantifier is defined for unique existence: 3\x:Rx.Px This formula is true only if there is exactly one x that satisfies Rx and if Px is true of that object. Given this, the translation of sentence 19, assuming the 1 OK race refers to the object RacelOK, would be 3! w : Win(w, RacelOK). American(w) The discourse entities evoked by existential definite noun phrases arc generated just as with indefinites. The only difference is that the referent is precisely 444 CHAPTER 14 Local Discourse Context and Reference 445 defined by the description. In particular, the discourse entity for the winner in the. 10K race in sentence 19 would be a discourse entity Wl, uniquely defined by Win(Wl, RacelOK). 
Contrast this to an indefinite noun phrase such as in A runner in the 10K race was American, in which the discourse entity, Rl, would be defined by RunIn(Rl, RacelOK) &American(Rl) Some existential readings do not explicitly include referential terms that can be used to identify them uniquely. In these cases the object and relationship must be inferred from the discourse context. Consider the discourse 20a. My club held a raffle. 20b. The winner won a car. There is no complement in the noun phrase the winner to indicate what it is defined with respect to. But if the same semantic interpretation is given to the lexical item winner as in sentence 19, this suggests that the interpretation of the winner should be 3! w : Win(w, *PRO*) where *PRO* is an anaphoric term that must be contextually determined. Evidence for this analysis can be shown by creating an example where there is no contextually unique contest. For instance, consider 21a. Both my club and John's club held raffles last week. 2 lb. *The winner won a car. Sentence 21b is either ill-formed because there is no referent or very awkwarc because it entails an interpretation where a single individual won both contests. Examples of indirect reference also arise with words that you might not usually interpret as a function on some other object. For instance, consider 22a. Jack brought a pencil to class. 22b. But he found that the lead was broken. The NP the lead in 22b, which intuitively refers to the lead of the pencil that Jacl brought to class, is the problem. This type of reference can be characterized b) positing some relation R and an anaphoric term "'PRO*. To interpret this noun phrase correctly, an antecedent for the pro-form must be found and the relatior *R* identified. The noun phrase the lead would then map to an existential reading of the form 3! /: LeadQ) & *R*(l, *PRO*) where *R* and *PRO* need to be identified from the context. In this example the resolved form would be as follows, where Pencill is the discourse entit) generated from a pencil in 22a: Note also that since the word lead is ambiguous between, say, the lead of a pencil, a fishing lead, or a piece of metal, this relation also needs to be identified in order to identify the correct word sense. While it is unlikely that there is a fixed set of relationships that can be used for *R*, some relations are quite common. For example, the subpart relation is often seen, as in discourse 22. The subpart relation might also be used to cover cases involving typical contents, such as in the sentence When we entered the kitchen, we noticed that the stove had been left on. Here you must use information that kitchens generally contain stoves. Another common class of examples involves events and their roles, as in The day after we sold our car, the buyer returned and wanted his money back. These can often be identified because the description uses a role noun, which may already have a pro-form included in its logical form from the semantic interpretation. In systems that use a frame-based knowledge representation, one technique is to allow any slot relation to be used in such cases. To analyze indirect referential forms, the system would check the definitions of the objects on the history list to see if the new noun phrase describes an object that could fill a slot. This representation could handle the range of examples discussed earlier, since subparts could be slots, and case roles would correspond to slots in a frame representing an event described by a verb phrase. 
This technique provides a way to limit the possible relations that need to be searched, but it is hard to see how it would handle all examples. Consider the sentence When we entered the kitchen, we saw that the gas had been left on. In this case the NP the gas is presumably the gas used by the stove that is part of the kitchen. It seems that handling a case like this would require some highly sophisticated inference capabilities. o 14.5 Definite Reference and Sets Additional complications arise with descriptions that involve sets, or refer to objects in previously defined sets. You have already seen indefinite descriptions referring to sets such as three men and some men. Definite versions of these descriptions are the three men and the men and can be interpreted either as direct or indirect reference. Consider an example. Remember that modifiers in noun phrases that describe sets may describe either properties of the set itself or properties of individuals in the set. Thus the description The three men who left the party describes a set that has a cardinality of 3, in which every member is a man who left the party. The existential reading of this definite noun phrase would be something like 3! /: Lead(l)& SubpartOfil, Pencill) 3! M : \M\=3 &M=[m\ Man(m) & Leave(m, Party 1)} 446 CHAPTER 14 The fact that this set is unique means that exactly three men left the party; otherwise there would be several possible sets fitting this description. Compare this to the indefinite description, Three men who left the party, which would have the reading 3M: \M=3&Mcz{m\ Man(m) &Leave(m, Partyl)} Since this set need not be unique, more than three men may have left the party. Plural noun phrases without any complement typically involve indncu reference, often to subsets of already introduced sets. For example, considci 23a. Some boys and girls came to the party. 23b. The boys left at 8PM. The noun phrase The boys does not describe a set that is explicitly present in the local context. Rather, it describes a set indirectly, namely the set of all the boys in the previously mentioned set. To allow such interpretations, a translation of 23b could be Left(Bl, 8PM) where Bl = {x I Boy(x)} n *PROSET* where *PROSET* must be a contextually determined set. If you let S2 be the discourse entity generated from Some boys and girls in sentence 23a, this could be resolved to Left(Bl, 8PM) where Bl = {x\Boy(x)} n S2 For another example, consider the discourse 24a. Jack, Mary, Sam and Helen went to the beach. 24b. The boys were too chicken to go in the water. In sentence 24b, the description The boys refers to a set consisting of Jack and Sam, which is a unique set defined by the intersection of the set of boys with the set {Jackl Maryl Saml Helenl} introduced by the conjunctive noun phrase. This technique will handle many cases, but problematic cases remain when even the referent set is not present in the local context. Consider 25a. Jack met Sam at the beach. 25b. The two boys then went to the store. Here there is no explicit set mentioned in 25a. One proposal is that the set [Jack Sam) becomes relevant via reasoning about the situation. In this case we might use general knowledge that when two people meet, they are together and naturally form a new set. But this technique won't work for other examples, such as 26a. Jack lived in San Francisco, while Sam lived in New York. 26b. The two boys never met. Handling such cases in a general way remains problematic. . 
BOX 14.2 Planning Definite Descriptions

An important problem in natural language generation is planning definite noun phrases. The generation system must first decide whether it is best to use a pronominal form, a proper name if available, or a definite description. Some issues relating to this decision were described in Box 14.1. If the decision is to generate a definite description, however, many problems remain, as there are many different ways to describe objects. Some possible descriptions can be eliminated by considering whether the description provides sufficient information to make the referent contextually unique given the discourse context. But even with this constraint, many possibilities remain. Another constraint is conciseness. A generation strategy that simply includes everything known about an object in the description would likely produce an incomprehensible sentence. A good description is one that is short yet still contextually unique. For instance, given a context with two red boxes, one large and one small, a good description for the small box is the small box. The description the small red box also uniquely identifies its referent, but contains unnecessary information. The description the red box, on the other hand, is not a good description, as it does not uniquely describe its referent.

In many cases, there will still be many ways to describe an object that satisfy the two properties above. Consider a KB that includes the following information:

SellEvent(s1) & Agent(s1) = M1 & Recip(s1) = J1 & Theme(s1) = P1 &
Person(M1) & Named(M1, "Mary") & LivesIn(M1, CITY1) &
Person(J1) & Named(J1, "John") & President(J1, COMP1) &
Plans(P1) & Secret(P1) & Named(COMP1, "XTRA Corp")

Consider the context in which the sentence Mary sold the secret plans has already been generated to realize event s1, and consider the following possible realizations of the proposition Recip(s1) = J1:

1. John is the president of XTRA Corp.
2. The buyer was the president of XTRA Corp.
3. The buyer was John.

Logically, all three realize the content and uniquely identify the terms. Sentence 1, for instance, uses the uniquely identifying name John to realize the term Recip(s1) and the uniquely identifying description the president of XTRA Corp to realize the term J1. It does not, however, provide necessary information on how to link the sentence into the context. Thus it is not a good realization. Sentences 2 and 3 fare better because they provide a connection to the discourse context via the description the buyer, identifying a role in the selling event. Since the noun phrases John and the president of XTRA Corp both uniquely describe J1, they might seem equivalent, but they are not. Which of these two is appropriate depends on the overall goals of the speaker. If, for instance, the speaker is attempting to reveal some act of industrial espionage, then sentence 3 is not a good realization, as it does not provide the key information that Mary sold the plans to another company. A good example of a system that generates definite descriptions is Dale (1992).
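The conciseness and uniqueness constraints discussed in Box 14.2 suggest an incremental strategy for building descriptions, roughly in the spirit of Dale's work: keep adding properties of the target until no other object in the context satisfies the description. The sketch below is only illustrative; the attribute names, the preference order, and the object encoding are invented.

# A sketch of choosing a short but contextually unique description. The
# objects, attributes, and preference order below are assumptions.

def distinguishing_description(target, context, preferred_attrs):
    """Add property values of the target, in order of preference, until no
    other object in the context satisfies all of them."""
    description = {}
    distractors = [o for o in context if o is not target]
    for attr in preferred_attrs:
        if attr not in target:
            continue
        description[attr] = target[attr]
        distractors = [o for o in distractors if o.get(attr) == target[attr]]
        if not distractors:
            return description            # contextually unique
    return None                           # no unique description found

box1 = {"type": "box", "size": "small", "color": "red"}
box2 = {"type": "box", "size": "large", "color": "red"}
print(distinguishing_description(box1, [box1, box2], ["type", "size", "color"]))
# {'type': 'box', 'size': 'small'}  -- i.e., "the small box"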
Reference to Elements of Sets

When a set is in the discourse context, descriptions can also refer to elements of the set. Many of these examples use specific devices that signal this behavior. For instance, the word one can be used for indirect reference to a set, as in NPs like one and a large one, which are indefinite descriptions, and the one I saw and the large one, which are definite descriptions. Consider

27a. At the zoo, a monkey scampered between two elephants.
27b. One snorted at it.
27b'. The large one snorted at it.

The appropriate behavior of one can be obtained if we assume it translates to

∃ x : x ∈ *PROSET* ...

where *PROSET* is referentially determined. In 27b the antecedent would be the set of elephants described in 27a, say ELS1, and the form for 27b is

∃ x : x ∈ ELS1 . SnortAt(x, monkey1)

The definite reference case shown in 27b' is an indirect reference with the form

∃! x : x ∈ ELS1 & Large(x) . SnortAt(x, monkey1)

More complex examples arise in which there is no explicit signal that the referent is defined contextually by a set. Consider

28a. Jack used two scuba tanks on his trip.
28b. He preferred the 1600psi tank.

In this case there may be no object in the discourse context that has the property of being a tank rated at 1600psi. Rather, this noun phrase has selected one element from the previously mentioned set and asserted additional properties of it (that presumably make it unique). Thus, assuming the discourse entity for two scuba tanks is TNKS1, the representation of sentence 28b would be

∃! t : t ∈ TNKS1 & Rated(t, 1600psi) . Prefer(Jack1, t)

Because such cases contain properties that are not known to the hearer before the utterance, they are often difficult to detect.

Universal Quantification and Sets

Universal quantifiers such as each and every display an interesting behavior. They are syntactically singular, and support intrasentential singular pronouns, as in When each boy_i saw Helen, he_i was angry. Intersententially, however, they evoke a set as a discourse entity, as in

29a. Each boy_i in the park saw Helen.
29b. *He_i was angry at her.
29b'. They_i were angry at her.

Note that He in sentence 29b cannot refer to one of the boys introduced in 29a, whereas They in sentence 29b' can refer to the boys in 29a. So the discourse entity evoked by Each boy in the park is the set {x | Boy(x) & In(x, Park1)}.

These noun phrases often involve an implicit definite reference to a set that limits the range of the quantifier. For example, in isolation the discourse entity generated for each girl in the sentence Each girl saw Helen would be simply the set {x | Girl(x)}. This is a very unlikely interpretation, though, as it claims that every girl who exists saw Helen. Rather, the range of each girl is contextually defined, as in the discourse

30a. Several girls arrived at the party.
30b. Each girl saw Helen.

When processing 30a, the noun phrase several girls would cause the creation of a constant, say G1, that is a set of girls. With G1 in the local context for sentence 30b, each girl is most naturally interpreted as ranging over all the girls in G1, and the initial translation of 30b is

∀ g : g ∈ G1 . See1(g, Helen1)

This analysis essentially treats the noun phrase each girl as a paraphrase of each of the girls, and every girl as every one of the girls. The set of girls must be determined referentially, and then is used as the scope of the quantifier. In other examples the set is defined using indirect reference. For example, consider 30b in the context created by the sentence Several boys and girls arrived at the party, where a discourse entity BG1 was created for the set of boys and girls evoked. In this case, the range is created by intersection, as seen before with 23b. The initial translation of 30b would be

∃! G3 : G3 = {x | Girl(x)} ∩ BG1, ∀ g : g ∈ G3 . See1(g, Helen1)
14.6 Ellipsis

As described in Section 14.1, ellipsis involves the use of clauses that are not syntactically complete sentences. Often the parts that are missing can be recovered from the previous major clause. Ellipsis occurs across conjunctions, as in

31. Helen saw the movie and Mary did too.

where it is understood that Mary saw the movie. Other cases involve sentence pairs, as in

32a. Some think that Jack will win the race next week.
32b. But he never will.

Of importance in question-answering applications, ellipsis may occur in a series of questions, such as the following dialogue between two conversants A and B:

33a. A: Did Sam find the bananas?
33b. B: Yes.
33c. A: The peach?

Here speaker A's second utterance can be understood only in the context of A's first question and is understood as an elliptical form of Did Sam find the peach?

BOX 14.3 Reference to Objects Not Evoked by NPs

All the examples of referential readings in this chapter have involved antecedents evoked by noun phrases. There are many examples of reference to other objects evoked in the sentence. In most of these cases, the content of the referring expression or the selectional restrictions explicitly identify that the referent is of a special type. In general, the referent for these constructions is found in the local context. Consider some example continuations following the sentence Jack went to New Orleans last year. You can refer to the event described, as in the sentences The trip changed his life and It changed his life. You can refer to the action that Jack did, as in the sentences He does that every year and Sam did it last year as well. You can refer to the fact that Jack went to New Orleans, as in That really surprised Helen. You can also refer to the event's time and location with the special pronouns then and there: Mardi Gras was on then. He loves to go there.

There are two ways to handle these. The set of discourse entities generated can be extended to include any action, event, time, location, or proposition implicitly described in the clause, or techniques can be developed to derive the referents from the local context when needed. Usually the surface form and selectional restrictions signal the use of such referring expressions, so both methods work equally well.

Contextual Clause: Helen saw the movie.
Syntactic Structure: (S (NP Helen) (VP saw (NP the movie)))
Semantic Form: See1(Helen1, Movie24)
Elliptical Clause: Mary did too.
Syntactic Structure: (S (NP Mary) (VP did))

Figure 14.8 A simple example of VP ellipsis

Syntactic Constraints on Ellipsis

The hypothesis underlying most analyses of ellipsis is that the completed phrase structure of the elliptical clause corresponds to the structure of the previous clause. Once the structural correspondence between the two clauses is identified, the semantic interpretation of the previous clause can be updated with the new information in the elliptical clause to produce the new interpretation. Consider sentence 31, whose initial analysis is shown in Figure 14.8. Matching the syntactic structures, a correspondence would be found between the two subject noun phrases, Helen and Mary. These determine the transformations required on the semantic form as follows.
First, Helen1 is abstracted out of the semantic form of the first clause to produce

λp . See1(p, Movie24)

This form is then applied to the new information, namely the interpretation of Mary, to produce the semantic form of the elliptical clause:

See1(Mary1, Movie24)

This technique works with more complex examples as well. For instance, a classic problem in the literature (often called the sloppy identity problem) involves accounting for the ambiguity in discourses such as

34a. Jack kissed his wife.
34b. Sam did too.

Sentence 34b can mean that Sam kissed Jack's wife (the strict reading) or that Sam kissed his own wife (the sloppy reading). The syntactic structure matching would proceed as before and find a correspondence between Jack and Sam. Assume that the semantic form of 34a is

Kiss1(Jack1, wifeOf(Jack1))

To generate the semantic form of 34b, this form needs to be abstracted. But there are two possible ways to abstract Jack1 from the expression, producing the two forms

λp . Kiss1(p, wifeOf(Jack1))   and   λp . Kiss1(p, wifeOf(p))

Applying the first to Sam1 produces Kiss1(Sam1, wifeOf(Jack1)), while applying the second produces Kiss1(Sam1, wifeOf(Sam1)). Thus two readings can be derived, as required by intuition.

An Algorithm Based on Syntax

The crux of this technique is clearly finding the structural correspondent between the two clauses. This process is most strongly influenced by syntactic factors. The input for the algorithm is the syntactic structure in the local context, and a partial syntactic structure generated for the elliptical clause. Note that a bottom-up chart parser can produce this analysis despite the fact that a full sentential analysis may not be derivable. With some constructs such as VP ellipsis, the correspondence is unambiguously signaled. A verb phrase consisting of the auxiliary do with no complement, together with modifiers such as too, as well, and also, indicates VP ellipsis, and the correspondence is between the two subjects. In others, finding the correspondence is more problematic. For instance, consider dialogue 33, in which the NP in the elliptical clause corresponds to the object NP in the previous clause.

Potential correspondences can be found by a search using a pattern derived from the input fragment. To find fragments that are as syntactically parallel in structure as possible, start with a pattern that contains all the information in the input fragment except for the content words such as nouns and adjectives. Function words (such as prepositions and articles) will be retained. If this match fails to produce any potential target fragments, you can try again using a less specific pattern (for example, by deleting the number restrictions, removing specific articles and prepositions, removing modifiers, and so on). Consider processing dialogue 33. The input to the algorithm is shown in Figure 14.9. The initial pattern will be (NP[3s] (DET the) (CNP)). The syntactic structure of the contextual clause is searched for a constituent that matches, and none is found. If the number restriction in the pattern is relaxed, yielding (NP (DET the) (CNP)), a match is found between the bananas and the peach, and the semantic form can be derived as described before. If this pattern had not matched, you could continue to relax constraints, trying (NP (DET) (CNP)), and if that failed, finally trying (NP).
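A minimal sketch of this relaxation search is given below. The feature names, the particular relaxation order, and the flat constituent encoding are assumptions made for illustration; the grammar of earlier chapters would of course use richer structures.

# A sketch of pattern generation and progressive relaxation for ellipsis.
# Constituents are encoded as flat feature dictionaries for illustration only.

def subsumes(pattern, constituent):
    """A pattern matches a constituent if every feature it mentions agrees."""
    return all(constituent.get(f) == v for f, v in pattern.items())

def find_correspondent(fragment, prev_constituents, relax_order=("agr", "det")):
    """Try the most specific pattern first, then progressively relax it."""
    pattern = dict(fragment)
    pattern.pop("head", None)             # drop content words, keep function words
    while True:
        matches = [c for c in prev_constituents if subsumes(pattern, c)]
        if matches:
            return matches
        for feature in relax_order:       # relax the next remaining constraint
            if feature in pattern:
                del pattern[feature]
                break
        else:
            return []                     # nothing matched even the bare category

# Dialogue 33: "Did Sam find the bananas?"  ...  "The peach?"
previous = [
    {"cat": "NP", "head": "Sam1"},
    {"cat": "NP", "agr": "3p", "det": "the", "head": "bananas"},
]
fragment = {"cat": "NP", "agr": "3s", "det": "the", "head": "peach"}
print(find_correspondent(fragment, previous))
# [{'cat': 'NP', 'agr': '3p', 'det': 'the', 'head': 'bananas'}]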
You would want to define the series of relaxations in a way such that the first pattern to match captures the most important structural properties.

Contextual Clause: Did Sam find the bananas?
Syntactic Structure:
(S (AUX did)
   (S (NP1 Sam1) (VP find) (NP2[3p] (DET the) (CNP bananas))))
Semantic Form: ? Find1(Sam1, Bananas1)
Elliptical Clause: The peach?
Syntactic Structure: (NP[3s] (DET the) (CNP peach))

Figure 14.9 A simple example of ellipsis

Elliptical forms that consist of a sequence of constituents are more complicated. Consider the example

35a. A: Did the clerk put the bananas on the shelf?
35b. B: Yes.
35c. A: The ice cream in the refrigerator?

The correct interpretation of 35c involves two independent constituents rather than a single constituent, namely the NP the ice cream and the PP in the refrigerator (which will match as part of the complement of the verb put). The pattern matcher can be extended to search for each fragment individually, with the additional constraint that the order of the constituents in the input fragment should be the same as the order of the target constituents in the initial question. Furthermore, the sequence of target fragments should all be subparts of the same constituent. For example, consider dialogue 35. The syntactic forms of A's initial utterance and the input fragment are shown in Figure 14.10, together with the pattern sequence generated from the input fragment. The pattern generated from NP6 would successfully match NP3, NP4, and NP5, whereas the pattern from PP3 would initially fail, but after relaxing the constraint that the preposition be in, it would successfully match PP2. Only the pair NP4 and PP2 occur within a single constituent (VP1) and appear in the original sentence in the appropriate order. Thus NP6 corresponds to NP4 and PP3 corresponds to PP2 when constructing the new interpretation. The semantic form would then have to be abstracted at the argument positions for NP4 and PP2, and the result applied to the interpretation of the new constituents NP6 and PP3.

Contextual Clause: Did the clerk put the bananas on the shelf?
Syntactic Structure:
(S (AUX did)
   (NP3 (DET the) (CNP clerk))
   (VP1 (V put)
        (NP4 (DET the) (CNP bananas))
        (PP2 (P on) (NP5 (DET the) (CNP shelf)))))
Elliptical Clause: The ice cream in the refrigerator?
Syntactic Structure: two constituents
(NP6 (DET the) (CNP ice-cream))
(PP3 (P in) (NP7 (DET the) (CNP refrigerator)))

Figure 14.10 An example with multiple constituents

Semantic Preferences

The algorithm given earlier will sometimes not uniquely identify the appropriate target fragment. For example, consider the dialogue

Did the clerk put the ice cream in the refrigerator?
No.
The TV dinners?

In this case the pattern generated from the input fragment will match the NPs the clerk, the ice cream, and the refrigerator equally well. Semantic information has to be used to select the ice cream as the appropriate target fragment. Similarly, a continuation of The manager? should identify the clerk as the appropriate target fragment, and a continuation of The freezer? should identify the refrigerator as the target fragment. A technique that can account for each of the preceding examples is to compute a semantic similarity measure between the input fragment and the potential target fragments and then select the closest one. The techniques introduced in Chapter 10 when considering word-sense disambiguation will work well.
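One simple way to realize such a measure is sketched below: count the number of isa links separating two types in a taxonomy and prefer the closest candidate. The toy hierarchy used here only loosely mirrors Figure 14.11 (it omits intermediate nodes, so the absolute distances differ from those quoted in the example that follows), but it produces the same ranking.

# A sketch of the semantic-closeness preference using graph distance in a
# type hierarchy. The hierarchy below is a simplified, assumed version of
# the one in Figure 14.11.

PARENT = {
    "CLERK": "HUMAN", "MANAGER": "HUMAN", "HUMAN": "ANIMATE",
    "ANIMATE": "PHYSOBJ", "INANIMATE": "PHYSOBJ",
    "FOOD": "INANIMATE", "APPLIANCE": "INANIMATE",
    "TV-DINNER": "FOOD", "ICE-CREAM": "FOOD",
    "REFRIG": "APPLIANCE", "TOASTER": "APPLIANCE",
}

def path_to_root(t):
    path = [t]
    while t in PARENT:
        t = PARENT[t]
        path.append(t)
    return path

def distance(t1, t2):
    """Number of isa links between two types via their closest common ancestor."""
    p1, p2 = path_to_root(t1), path_to_root(t2)
    for i, node in enumerate(p1):
        if node in p2:
            return i + p2.index(node)
    return float("inf")

def closest_target(input_type, candidate_types):
    return min(candidate_types, key=lambda t: distance(input_type, t))

print(closest_target("TV-DINNER", ["CLERK", "ICE-CREAM", "REFRIG"]))  # ICE-CREAM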
For example, assume the system has a taxonomic hierarchy as shown in Figure 14.11. Let us use the simplest measure of similarity: the graph distance between each sense. Consider the case where the input fragment is The TV dinners? and the possible target fragments are the clerk, the ice cream, and the refrigerator. You can compute a semantic closeness measure between these by counting the number of steps between the nodes in the hierarchy. The results are

TV-DINNER and CLERK: 7 (via PHYSOBJ)
TV-DINNER and ICE-CREAM: 4 (via FOOD)
TV-DINNER and REFRIG: 6 (via INANIMATE)

Figure 14.11 A type hierarchy (nodes include PHYSOBJ, ANIMATE, HUMAN, CLERK, MANAGER, APPLIANCE, REFRIG, FREEZER, TOASTER, BUILDING, STORE, PREPARED, SWEETS, TV-DINNER, and ICE-CREAM)

Thus the preferred target fragment is the ice cream, and the appropriate analysis is generated. Of course, semantic closeness is not foolproof, and it may select a target that turns out to be inappropriate. If the semantic interpreter is unable to analyze the resulting structure successfully, then other possible target choices can be attempted. Consider the following dialogue:

A: Did you see the clerk in the store?
B: Yes.
A: The toaster oven?

This case would have the following semantic closeness checks:

TOASTER and CLERK: 7 (via PHYSOBJ)
TOASTER and STORE: 5 (via INANIMATE)

Thus the semantic closeness check would select the store as the target fragment and construct a new interpretation corresponding to the sentence Did you see the clerk in the toaster oven? Hopefully, the semantic interpretation rules are rich enough to suggest that this reading is unlikely. The ellipsis algorithm would then suggest the next alternative, with the clerk being the target fragment, and construct an interpretation corresponding to the sentence Did you see the toaster oven in the store?, as desired.

14.7 Surface Anaphora

There is a whole class of referring expressions that introduce new objects related to objects in the discourse context. These cases are often called surface anaphora because many theories account for them in terms of the surface form of the previous sentence. In contrast, all the cases you have seen so far refer to objects in the discourse context and are called deep anaphora. Consider some typical examples of surface anaphora:

36a. Tell me John's grade in CSC271.
36b'. Give me it in MTH444 as well.
36b". Give me Mike's in MTH444 too.

The objects referred to in 36b' and 36b" are not mentioned in 36a. But the context created by 36a allows you to interpret each one as referring to some other grade (John's grade in MTH444 and Mike's grade in MTH444, respectively). Note that there are specific syntactic devices that signal surface anaphora. One large class involves noun phrases with a missing head word, such as Mike's — in MTH444. Others involve a pronoun, but with a complement phrase, as in it in MTH444. Furthermore, many cases involve the use of adverbials like too, also, and as well, which explicitly indicate a comparison to something in the discourse context.

There are several methods proposed for dealing with surface anaphora. The first approach uses syntactic/semantic structure matching as used for handling ellipsis. The idea is that the surface anaphoric phrase is an incomplete constituent.
The phrase is processed by searching the syntactic tree of the previous sentence for a constituent whose structure matches, and then merging that information with the new information to build a new phrase that can be parsed as usual to produce the interpretation. Consider this technique using example 36b", a phrase with a missing head noun. The phrase Mike's in MTH444 might produce an NP constituent as follows:

(NP (DET (NP (NAME m1 "Mike")) 's)
    (CNP (CNP *missing*)
         (PP (P in) (NP (NAME m2 "MTH444")))))

This constituent is used to generate a pattern as described in the last section. The pattern is then matched against all the subconstituents of the previous sentence. In discourse 36, there is only one constituent that matches, namely the NP John's grade in CSC271. This constituent is then copied, replacing old information with new whenever it is specified. The new NP would be

(NP (DET (NP (NAME m1 "Mike")) 's)
    (CNP (CNP GRADE1)
         (PP (P in) (NP (NAME m2 "MTH444")))))

This new NP can be inserted into the current tree and will produce the correct interpretation. Examples that utilize pronouns are treated similarly; the pronoun simply fills a missing constituent.

The problem with structural approaches is that they are very sensitive to exactly the way things are phrased. For example, given the context generated by 36a, the technique just described would not work for Give me it for Mike as well, as it couldn't detect the parallelism between the possessive determiner John's and the PP complement for Mike.

The other approach is semantically based. It assumes that the missing information is characterized by some set previously mentioned. To make this work, you must allow abstractions of the sets mentioned. Sentence 36a would evoke the singleton set of John's grades in CSC271, that is, {g | Grade(g, John, CSC271)}. An abstraction of this might be the set of anyone's grades in CSC271, {g | ∃ p : Person(p) . Grade(g, p, CSC271)}, or the set of John's grades in any course, {g | ∃ c : Course(c) . Grade(g, John, c)}, or the set of anyone's grades in any course, {g | ∃ p : Person(p) ∃ c : Course(c) . Grade(g, p, c)}. The surface anaphora would then pick out one of these sets and add back in the necessary properties based on the new information in the phrase. To make this work, however, you would need a way to incorporate additional modifiers into the set. For instance, consider Give me it for Mike as well again. The appropriate abstraction would be the set of grades in CSC271, that is, {g | ∃ p : Person(p) . Grade(g, p, CSC271)}. You now would have to interpret the modifier for Mike as restricting the person to be Mike. But if you do not use the lexical information that the head word in 36a was grade, it is not clear how this could be done. Thus an approach based purely on set abstractions is difficult to make work in such cases.

Summary

In discourse the preceding major clause has a substantial effect on the interpretation of the next major clause. The information in the preceding clause defines the local discourse context of the next clause. One important component of the local context is the set of discourse entities evoked by the clause. These serve as the most likely antecedents of pronouns in the next clause. Discourse entities evoked by singular noun phrases represent individuals, whereas those evoked by plural noun phrases are sets.
A singular noun phrase within the scope of a universal quantifier also evokes a set, namely the set of all objects generated for each value of the universally quantified variable. The history list is a sequence of local contexts generated by the preceding clauses. The simplest algorithms for finding antecedents for definite NPs (including pronouns) simply scan back through the history list looking for the first (most recent) discourse entity that satisfies the restrictions.

Preferences for the antecedents of pronouns within the local discourse context are affected by the discourse focus or center. If a sentence contains at least one pronoun, then one of the pronouns must refer to the center. Constraints on how the center can change from sentence to sentence determine the preferred interpretations of pronouns.

The local context also contains the syntactic and semantic structure of the previous clause, which is crucial for interpreting ellipsis and surface anaphora. These phenomena are typically handled using pattern-matching techniques to find syntactic similarities between the previous clause and the fragments in the current clause. Once a correspondence is found, the new interpretation can be constructed by substituting the information in the elliptical sentence into the structure of the previous clause and then reinterpreting the revised structure.

Related Work and Further Readings

Early natural language systems used linear history lists with heuristic techniques to handle common pronouns and verb phrase ellipsis (for example, Winograd (1972) and Woods (1977)). These systems depended strongly on semantic filtering to identify the most appropriate referent. Webber (1983) produced the first computationally motivated study of the range of discourse entities a sentence evokes, with special emphasis on definite and indefinite phrases within the scope of universals. The material in Section 14.1 is based on this work and subsequent improvements by Ayuso (1989).

Most work on anaphora in linguistics has focused on constraints on intrasentential cases. There is some work on intersentential cases, which introduced the notion of discourse entities (for example, Karttunen (1976) called them discourse referents). The most influential works in formal linguistic models that use these ideas are by Kamp (1981) and Heim (1982). Kamp's discourse representation theory (DRT) uses discourse entities to provide a uniform account of pronouns, whether they appear as bound anaphora, intrasentential, or intersentential anaphora. This work has influenced much subsequent work in linguistics and computational linguistics (for example, Harper (1992)).

Sidner (1983) developed a computational model of focus and its effect on pronoun interpretation. Her hypothesis was that the focus is selected according to preference heuristics based on thematic roles. The most preferred discourse entity was selected as the antecedent for a pronoun unless some semantic or pragmatic contradiction prevented it. This model evolved into the centering model for definite noun phrases proposed by Grosz, Joshi, and Weinstein (1983). The centering model has sparked considerable work. The material in Section 14.3 adapts an algorithm based on centering constraints developed by Brennan, Friedman, and Pollard (1987). Walker (1989) compared the Brennan et al. algorithm with the Hobbs algorithm (Hobbs, 1978), an algorithm based on syntactic structure and recency.
The comparison is interesting since the Hobbs algorithm always prefers an intrasentential reading if possible, whereas centering always prefers an intersentential center-based reading. Both algorithms performed about the same, correctly identifying the antecedents about 90 percent of the time in written text, but only about 50 percent of the time in dialogue.

In linguistics, many different theories have been proposed that capture notions related to discourse focusing behavior. These theories typically divide the information in a sentence into two parts, often indicating what part of the sentence is conveying new information and what is providing background information. Influential theories have involved distinctions between topic and comment, theme and rheme, and given and new (for example, see Halliday (1967), Chafe (1975), Clark and Haviland (1977), and Prince (1981)). But there is an important distinction to remember when looking at different theories. Computational models are usually based on a cognitive model of the speaker; that is, the center is a focus of attention that affects inference. In linguistics the same terms may be used to refer to structural properties of a sentence. While these may coincide much of the time, they also could be quite different.

Definite descriptions have been the focus of extensive study in philosophy and linguistics. Russell and Whitehead (1925) introduced the notion of definite descriptions as special existential operators. Donnellan (1966) introduced the distinction between referential and attributive uses of definite descriptions. Lewis (1979) and Stalnaker (1974) developed the concept of accommodation. This is a central component in the work by Kamp (1981) and Heim (1982). For more information, consult McCawley (1993) or Chierchia and McConnell-Ginet (1990). Most computational systems handle definite descriptions using a linear history list as described in this chapter, allowing also for reference to uniquely described objects in the KB (for example, Winograd (1972) and Woods (1977)). Grosz (1977) showed that in extended discourse, a linear history list is insufficient to handle many cases. These issues will be considered in Chapter 16.

Many computational models have used syntactically based analyses of ellipsis, where once a structural correspondence is found, a new syntactic structure is built that then is semantically interpreted from scratch. Techniques based on semantic closeness are used in a wide range of systems, especially those providing natural language interfaces to databases, where ellipsis is very common (for example, Hendrix et al. (1978), Bates et al. (1986), and Grosz et al. (1987)). The analysis of ellipsis in Section 14.6 draws on this work but follows the approach described in Dalrymple, Shieber, and Pereira (1991).

Surface anaphora was first addressed in a computational system in the LUNAR system (Woods, 1977). These techniques are important for question-answering applications where such cases arise frequently. Webber (1983) provides a detailed analysis of the phenomena and how they can be handled. Hankamer and Sag (1976) draw the distinction between deep and surface anaphora for linguistic analysis, and argue that making the distinction is essential for handling a wide range of linguistic phenomena.

Exercises for Chapter 14

1. (medium) Identify the discourse entities evoked by the following two sentences and give their semantic form. For each, assume a set of races R1 was introduced by the previous discourse.
In every race, the winner was American.
In every race, some runner was American.

2. (medium) Consider the sentence Two boys ate three pizzas, which has at least two readings: the two boys together ate three pizzas, or the two boys each ate three pizzas. For each of these readings, specify the discourse entities generated and the sentence meaning. Discuss any problems that arise.

3. (medium) Use the centering theory to explain why discourse a is coherent but discourse b seems awkward:

a1. Sue got up late.
a2. Helen didn't have time to drive her to school.
a3. She had to walk.

b1. Helen had to be at work early.
b2. She didn't have time to drive Sue to school.
b3. She had to walk.

4. (medium) For each sentence in the following discourse, give its center and preferred next center. State what centering theory predicts as the antecedent for each pronoun based on the four constraints and the preferences described in Section 14.3. Based on your analysis, classify each transition as a center continuation, retention, or one of the shifts. Discuss whether the predicted interpretations agree with your intuitions. If they don't, which constraint or preference seems to cause the problem?

a. Sue got up late the day that Helen was supposed to drive her to school.
b. So she had to walk.
c. Mary saw her walking down the street.
d. She stopped to pick her up.

5. (medium) In Section 12.6, a treatment of relational nouns was suggested for semantic analysis. Show how such a representation can be used with the history list mechanism to identify the referent of NPs involving such nouns. In particular, for each NP in the following text, give a reasonable logical form and the current state of the history list, and discuss the process of identifying the referent.

The museum received a large grant from a local corporation. The donation will allow the recipient to stay open through the summer.

6. (medium) One suggestion for handling reference to subsets of sets created from NP conjunctions would be that every subset of the set is introduced as a discourse entity. For example, the sentence

Jack, Sam, and Helen went to the beach.

would generate the sets {Jack1}, {Sam1}, {Helen1}, {Jack1 Sam1}, {Jack1 Helen1}, {Sam1 Helen1}, and {Jack1 Sam1 Helen1}. Now, given the sentence

The boys were too chicken to go in the water.

the appropriate referent is already defined as a discourse entity. Is this a reasonable general approach to the problem? Compare it to the methods described in Section 14.5, and give arguments for why it is worse or better than that approach.

7. (medium) Give the syntactic structures for each of these sentences and show in detail how the one-anaphora in the second sentence is resolved.

I bought a ticket for the early show on Tuesday. Jack bought one for the late show.

8. (medium) Using the semantic preference technique with the hierarchy in Figure 14.11, show in detail how the interpretation is constructed for the third sentence. Discuss the result.

A: Did Jack give the toaster to the manager?
B: No.
A: The clerk?
CHAPTER 15 Using World Knowledge

15.1 Using World Knowledge: Establishing Coherence
15.2 Matching Against Expectations
15.3 Reference and Matching Expectations
15.4 Using Knowledge About Action and Causality
15.5 Scripts: Understanding Stereotypical Situations
15.6 Using Hierarchical Plans
15.7 Action-Effect-Based Reasoning
15.8 Using Knowledge About Rational Behavior

Language cannot be understood without considering the everyday knowledge that all speakers share about the world. Some general world knowledge has been used in previous chapters. For instance, knowledge about type restrictions on predicate argument relations was used to support word sense disambiguation in Chapter 11. But these techniques were limited and did not try to relate what was said to some evolving situation in the world that was being described. This chapter considers how general knowledge of the world is used to interpret language.

Section 15.1 discusses the notion of coherence, which provides the motivation and justification for all the techniques in the chapter. A general framework for using knowledge of the world is developed in Section 15.2, which develops the notion of expectations produced from the previous discourse. If a sentence matches an expectation, then the result of the match produces the connections that make the discourse appear coherent. This idea is further developed in Section 15.3, where it is shown that some problems in definite reference can be treated within the same framework. The remainder of the chapter explores different ways of producing expectations. Section 15.4 discusses the use of knowledge about action and causality to generate expectations. Section 15.5 discusses script-based processing, in which the system is driven by large-scale action descriptions that characterize the different situations that the discourse may be about. Sections 15.6 and 15.7 discuss plan-based techniques, where knowledge of causality and intentions is used to generate the expectations. Section 15.8 looks at more complex issues in using knowledge of an agent's problem-solving behavior to generate expectations.

15.1 Using World Knowledge: Establishing Coherence

In many examples considered so far in this book, the final decision about the interpretation of a sentence has been left to contextual processing. This chapter attempts to define some techniques that automate this process. Doing this, however, will require a better understanding of what it means for a sentence to make sense in a context. There is clearly a relationship between logical consistency and making sense. A reading that is inconsistent will not make sense in any context. But more often than not, there are many logically consistent readings, only some of which make sense. So making sense is more a notion of coherence than logical consistency. Certain readings seem coherent, whereas others are not coherent at all.

A discourse is coherent if you can easily determine how the sentences in the discourse are related to each other. A discourse consisting of unrelated sentences would be very unnatural. To understand a discourse, you must identify how each sentence relates to the others and to the discourse as a whole. It is this assumption of coherence that drives the interpretation process. Consider the following discourse:

1a. Jack took out a match.
1b. He lit a candle.
There are several conclusions about sentence 1b, motivated by the assumption of coherence between 1a and 1b, that can be classified into several categories:

reference — He refers to Jack1 (as suggested by centering)
disambiguation — lit refers to igniting rather than illuminating the candle
implicature — the instrument of the lighting is the match introduced in 1a

Note that to identify how sentences 1a and 1b are related, you must know that a typical way to light a candle involves using a match. This general knowledge about the world allows you to draw connections between the sentences. In other cases there may be almost no apparent relationship between two sentences, but assumptions of coherence still impose some relationship. For example, consider

2a. Jack took out a match.
2b. The sun set.

While it might seem that there is no identifiable relationship between 2a and 2b, in fact you assume a temporal relationship, that is, that Jack took out the match before (or while) the sun set. While only a minimal connection, it is still enough to construct a coherent situation that 2a and 2b describe.

Note that many conclusions drawn from the coherence assumption are implicatures, not entailments. They can be overridden by later discussion and thus are defeasible. As a result, the inference process of drawing such conclusions will be complex. The techniques considered in this chapter will all be cast in terms of matching possible interpretations against expectations generated from the previous discourse. This is discussed in detail in the next section.

15.2 Matching Against Expectations

We assume that the specific setting created by a discourse is represented by the content of the previous sentences and any inferences made when interpreting those sentences. This information is used to generate a set of expectations about plausible eventualities that may be described next. Later sections will explore different techniques for generating expectations. This section examines the problem of matching possible interpretations to expectations, assuming the expectations have already been generated.

More formally, the problem is the following: Given a set of possible expectations E1, ..., En, and a set of possible interpretations I1, ..., Ik for the sentence, determine the set of pairs Ej, Ij such that Ej and Ij match. If there is a unique expectation/interpretation pair that matches, then the interpretation of the sentence would be the result of matching the two. If there are multiple possible matches, then some other process must determine which is the best match.

Consider the example of discourse 1. By methods to be discussed later, sentence 1a would sanction an expectation that Jack is going to use the match to light something. Using our event-based logic, this expectation might be written as an event E1 defined as:

Expectation 1: LIGHT1(E1) & Agent(E1) = Jack1 & Instrument(E1) = Match33

where Match33 would be a new constant introduced by the indefinite NP a match in sentence 1a. Sentence 1b is ambiguous between two readings: one in which someone ignites a candle and the other in which someone illuminates a candle (as in He lit the room with a lamp). For the moment, assume that he has already been resolved to Jack1.
The two interpretations are captured by two different events, E2 and E3:

Interpretation 1: LIGHT1(E2) & Agent(E2) = Jack1 & Theme(E2) = Candle1
Interpretation 2: ILLUMINATE1(E3) & Agent(E3) = Jack1 & Theme(E3) = Candle1

The constant Candle1 is introduced for the indefinite noun phrase a candle. The matching process must somehow use the expectation to select the first interpretation and conclude that Jack lit the candle with the match.

The remainder of this section explores two methods for defining the matching process. First, however, we should make sure that a simple deductive solution will not suffice. There are two ways deduction could be used to select an interpretation. We could see whether an interpretation could be proved from an expectation, or we could see whether an interpretation could be used to prove an expectation. Neither will work for discourse 1.

Attempt 1: Proving an Interpretation from an Expectation

This technique would involve finding an expectation Ej that entails one of the interpretations Ij. This technique is too weak to handle discourse 1 because expectation 1 entails neither interpretation. In particular, the expectation makes no claims about the theme being Candle1. In fact, since Candle1 is a Skolem constant introduced by sentence 1b, no expectation could possibly anticipate its use.

Attempt 2: Proving an Expectation from an Interpretation

This approach is the mirror image of the first one. The goal is to select an interpretation Ij such that there is an Ej where Ij entails Ej. This technique is also too weak to be useful. It requires the input sentence to fully anticipate all the information in an expectation; thus the expectation would not contribute any information not already present in the sentence. This technique fails on discourse 1 as well. Specifically, interpretation 1 does not entail expectation 1, since it doesn't entail that the instrument used was the match in sentence 1a.

Figure 15.1 The expectation and two interpretations

Equality-Based Techniques

The problem with attempts 1 and 2 is that both the interpretation and the expectation contain information that is not contained in the other, and thus neither will entail the other. The intuition behind the nonmonotonic equality techniques is that two formulas match if there is a consistent set of equality assumptions that can be made such that an Ei then entails an Ij. For example, with discourse 1, the expectation and two equality assumptions entail interpretation 1:

Expectation: LIGHT1(E1) & Agent(E1) = Jack1 & Instrument(E1) = Match33
Assumptions: Theme(E1) = Candle1 & E1 = E2

In contrast, there is no set of equality assertions that allows expectation 1 to entail interpretation 2.

This technique can be implemented as a graph-matching process as follows. Each formula is encoded as a semantic network graph rooted at a node representing the eventuality. The expectation and two interpretations for discourse 1 are represented as graphs in Figure 15.1. The graph matching algorithm is specified as follows: For any node N, let Label(N) be the term that labels the node, T(N) be the type of the node (found by following the isa link), and LL(N) be the set of outgoing links from N. For any arc A, let END(A) be the node to which the arc points. Then two nodes N1 and N2 match if and only if the following three conditions hold:

1. T(N1) and T(N2) are the same, or one is a subtype of the other.
2. Label(N1) and Label(N2) are not known to refer to distinct objects.
3. For all links L1 in LL(N1),
   a. there is no link with the same label leaving N2, or
   b. there is a link L2 with the same label leaving N2, and END(L1) and END(L2) match recursively.

This is the same general algorithm used with graph-based unification except that nodes match based on the type and equality information rather than by unifying feature lists. If two graphs match, then the result is a graph constructed by collapsing together all the nodes that match. For example, nodes E1 and E3 will not match since their types are not compatible. Nodes E1 and E2, on the other hand, will match, and produce the resulting merged graph shown in Figure 15.2. This is the desired interpretation for sentence 1b.

Figure 15.2 The result of matching E1 and E2
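A rough sketch of this matching procedure follows. The node encoding, the toy isa table, and the distinctness facts are assumptions made for illustration; the sketch only tests whether two nodes can match and collects the label equalities that the match would assume, rather than building the merged graph itself.

# A sketch of the equality-based graph matching test (conditions 1-3 above).
# The type table, distinctness facts, and example graphs are assumptions.

ISA = {"LIGHT1": "EVENT", "ILLUMINATE1": "EVENT",
       "PERSON": "PHYSOBJ", "MATCH": "PHYSOBJ", "CANDLE": "PHYSOBJ"}
DISTINCT = {frozenset(["Jack1", "Sam1"])}        # constants known to be distinct

class Node:
    def __init__(self, label, typ, links=None):
        self.label, self.typ, self.links = label, typ, dict(links or {})

def compatible(t1, t2):
    """Condition 1: same type, or one is a subtype of the other."""
    def ancestors(t):
        out = [t]
        while t in ISA:
            t = ISA[t]
            out.append(t)
        return out
    return t1 in ancestors(t2) or t2 in ancestors(t1)

def match(n1, n2, assumptions):
    """Conditions 1-3; on success, record the equality assumptions needed."""
    if not compatible(n1.typ, n2.typ):
        return False
    if frozenset([n1.label, n2.label]) in DISTINCT:
        return False
    for role, end1 in n1.links.items():          # condition 3
        if role in n2.links and not match(end1, n2.links[role], assumptions):
            return False
    if n1.label != n2.label:
        assumptions.append((n1.label, n2.label))
    return True

# Expectation E1 and the two interpretations of sentence 1b
jack = Node("Jack1", "PERSON")
match33 = Node("Match33", "MATCH")
candle = Node("Candle1", "CANDLE")
e1 = Node("E1", "LIGHT1", {"agent": jack, "instrument": match33})
e2 = Node("E2", "LIGHT1", {"agent": jack, "theme": candle})
e3 = Node("E3", "ILLUMINATE1", {"agent": jack, "theme": candle})

a = []
print(match(e1, e2, a), a)      # True [('E1', 'E2')]
print(match(e1, e3, []))        # False -- incompatible event types

The merged graph of Figure 15.2 would then be obtained by additionally copying the unmatched links (here the theme and instrument roles) onto the collapsed node.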
Note that this method is still not foolproof, however, as it may be that each pair of nodes matched is not known to be distinct, but as a whole the entire new graph is inconsistent. Consider a KB with three constants, Jack1, Sam1, and Person1, where Jack1 ≠ Sam1. Figure 15.3 shows two graphs corresponding to an expectation that someone will shave himself, and the interpretation that Jack shaved Sam. The matching algorithm will succeed in matching E4 and E5, because Person1 and Jack1 are not known to be distinct, and Person1 and Sam1 are not known to be distinct. The overall result, however, would entail that Jack1 = Sam1, which is known not to be the case.

Figure 15.3 A problem case

To handle this case, the algorithm would need to verify the global consistency of the resulting graph. In essence, what the graph algorithm does is a heuristic verification of whether two eventualities can be equal; for example, matching E1 and E2 was a way to verify whether it was consistent that E1 = E2, using heuristics based on typing and explicit distinctness information. By checking all the outgoing arcs, case relations involving E1 and E2 were also checked. This procedure does not guarantee consistency given a match, but it does detect most common cases of inconsistency. In fact, since determining consistency of a set of formulas in FOPC is not decidable, there can be no general algorithm that will give guaranteed results. But most knowledge representations provide a heuristic facility for checking consistency of facts. Given such a KR system, you can check whether it is consistent that two constants C1 and C2 are equal by doing the following:

1. Find all propositions in the KB that involve C2.
2. For each, add a new proposition with C2 replaced by C1 throughout.
3. If the resulting KB appears consistent given the limits of the KR, then allow C1 = C2.

Of course, this could be a very expensive operation in principle. In practice, however, remember that many of the constants being matched will be new Skolem constants, so very little will be known about them. In such cases there are only a few propositions that need to be tested to check consistency.

Abduction-Based Techniques

The abductive approach is based on the intuition that we must find an explanation for why the current sentence is true. In other words, we must use the specific situation and expectations to construct an argument that proves the current sentence. Assumptions must be made that do not follow from the KB in order to construct the proof. More formally, the approach is as follows.
Given a specific situation S (including expectations) and possible interpretations I1, ..., Ik, consider what assumptions Aj must be added to S such that

1. S & Aj entails Ij.
2. S & Aj is consistent, for each Ij.

Choose the interpretation Ij that requires the minimum set of assumptions Aj. Of course, this requires having some metric on assumptions so that the minimal set can be defined. Simply counting the number of different atomic propositions in each Aj will not work in general, as some propositions are much less likely than others. Abductive theories are very general and thus do not make very specific claims until some measure on assumptions has been defined. The method subsumes more specialized techniques like the equality-based technique previously defined. To see this, you can reformulate the equality-based method above as an abductive method with the following restrictions:

3. The only assumptions allowed are equality assumptions together with exactly one expectation.
4. The Aj with the least number of equality assumptions is preferred.

For sentence 1b, the minimal set of assumptions added to expectation 1 would be simply that E1 = E2 and Theme(E1) = Candle1. These two assumptions and expectation 1 entail interpretation 1. Clearly, much more can be done with the abductive technique if additional types of assumptions are allowed, but this is sufficient mechanism to formalize matching.

15.3 Reference and Matching Expectations

In the last section, the matching algorithm assumed that all definite NPs had been resolved to their referents in the KB. This section relaxes this assumption and uses expectation matching to identify referents. The technique is simple. If you generate a new Skolem constant for every definite NP, then when the interpretation is matched against an expectation, an equality assumption will be needed to enable the match. In this case the equality assumption identifies the referent.

Consider sentence 1b again, this time without the referent of the pronoun he being resolved. A Skolem constant, H1, is generated, and the first interpretation is

Interpretation 1: LIGHT1(E2) & Agent(E2) = H1 & Theme(E2) = Candle1

Matching interpretation 1 against the expectation will proceed as before. But now there are three nodes that are merged, corresponding to the equality assumptions E1 = E2, Theme(E1) = Candle1, and H1 = Jack1. Thus the referent of the pronoun is identified during the match. In this case the referent is identical to the preferred interpretation given the centering constraints.

Consider two examples that show how general reasoning can augment the centering technique:

3a. Jack poisoned Sam.
3b. He died within a week.

In this case the most natural reading of 3b has Sam dying. On the other hand, consider sentence 3a followed by a different continuation:

4a. Jack poisoned Sam.
4b. He was arrested within a week.

The most natural reading of 4b has Jack being arrested. Centering theory would either predict the same antecedent in 3b and 4b, or not choose between them in either case. The expectation-matching approach can readily identify the correct referent, however.
For example, assume that the following expectations are generated given the assertion that Jack poisoned Sam:

Expectation 1: DIE(E4) & Theme(E4) = Sam1
Expectation 2: IS-ILL(E5) & Theme(E5) = Sam1
Expectation 3: ARREST(E6) & Theme(E6) = Jack1

The interpretation of 3b (with the temporal expression ignored) would be

DIE(E7) & Theme(E7) = H1

E7 will clearly match E4, expectation 1, and thus H1 will equal Sam1. The interpretation of 4b, on the other hand, would be

ARREST(E8) & Theme(E8) = H2

E8 will match E6, expectation 3, and H2 will equal Jack1.

The same technique works for definite noun phrases. For instance, consider

5a. Jack poisoned Sam.
5b. The villain was arrested within a week.

Even though it was not known previously that Jack was a villain, the same techniques will work. The interpretation of 5b would be

ARREST(E9) & Theme(E9) = V1 & Villain(V1)

Given suitable knowledge that a villain is a type of person, and that Jack is a person, this will match expectation 3 with the assumption that V1 = Jack1, as desired. A consequence of this is that Jack is claimed to be a villain.

Thus, expectation matching can be a powerful tool for resolving reference, but it does not replace the other techniques described earlier. For instance, constraints from local context may be important even when there would seem to be an unambiguous referent using only semantic information. Recall the example

6. Jack walked over to the table and drank the wine. It was brown and round.

Even though the table is the only object that could be brown and round, this discourse is awkward and defective. Preferences from local context are also important in other cases where there appears to be an ambiguity given the expectations. Consider

7a. Jack poisoned the man from whom George stole the jewels.
7b. He was arrested within a week.

In this case there could be expectations about being arrested for both Jack and George, since both were involved in illegal activity. Based on expectation matching, H3 = Jack1 and H3 = George1 would be equally preferred. Preferences based on centering, however, would strongly prefer the former reading, which agrees with intuition. Further evidence that this is a structural preference rather than something to do with expectations can be seen by considering a paraphrase of 7a, which changes the preferred reading:

8a. George stole the jewels from the man that Jack poisoned.
8b. He was arrested within a week.

In this case it is George who is arrested, in agreement with the centering constraints. To incorporate these constraints into the expectation-matching procedure, you simply need to add the condition that if multiple interpretations match into the expectations, then the interpretation that agrees with the preferences from local context will be preferred.

Expectation matching is a very powerful technique, but it is only as good as the process that generates the expectations in the first place. This issue is discussed in the rest of this chapter.

15.4 Using Knowledge About Action and Causality

Much discourse involves action and causality. This section examines some of the issues involved in representing actions and how this information can be used to generate expectations. Before getting to the details, consider how you might define action. This issue has occupied philosophers for thousands of years, so do not expect to find a simple answer. But you can make some distinctions. First, there is a distinction between the sense of action performed by intentional agents and the sense of action that arises in physics, where concepts such as force and inertia provide a reasonable theory. We will be concerned primarily with the former sense here. While actions have been discussed earlier as separate from events, what is of interest here are the events that consist of actions being performed. We will loosely refer to such an event as an action, where more precisely we should refer to it as the event of the action's performance.

For current purposes, knowing about actions is important in order to identify causal relationships that can be used to generate expectations. For instance, if you are told that Jack walked to the store, you would next expect him to be at the store (a direct effect of the action), and probably to buy something (an activity suggested by knowledge about normal intentional behavior).
First, there is a distinction between the sense of action performed by intentional agents and the sense of action that arises in physics, where concepts such as force and inertia provide a reasonable theory. We will be concerned primarily with the former sense here. While actions have been discussed earlier as separate from events, what is of interest here are the events that consist of actions being performed. We will loosely refer to such an event as an action, where more precisely we should refer to it as the event of the action's performance. For current purposes, knowing about actions is important in order to identify causal relationships that can be used to generate expectations. For instance, if you are told that Jack walked to the store, you would next expect him to be at the store (a direct effect of the action), and probably to buy something (an activity suggested by knowledge about normal intentional behavior). Many forms 474 CHAPTER 15 Using World Knowledge 475 of causality are needed to understand discourse. These include the following two classes relating actions to states: Effect causality—Every action has a set of effects that are typically caused by the action and usually hold after the action is completed (or while the action is in execution). Effects can be divided into intended effects-—those that the action is performed for—and side effects—those that are caused by the action but not intended. If an action is attempted but fails, it still may have side effects. An action fails when the intended effects are not caused. Precondition causality—Every action has a set of conditions that typically must hold just before the action starts (or during the action) in order for there to be a reasonable chance that the action will succeed. There are also important relationships between two actions that generate expectations. These include the following: Enablement—One action enables another if the effects of the first establish the preconditions of the second. More usually, the first action establishes only a subset of the preconditions, and the others are established by other means. Decomposition—One action is a subpart (or substep) of another action if the first is one of a sequence of substeps that constitute the execution of the other action. For example, to perform the action of entering my car, I might unlock the door, open the door, and then get inside. Each one of these actions is a substep of entering my car, and if all three are performed correctly, then the action of entering my car is accomplished. Generation—One action generates another if executing the first under a certain set of conditions counts as executing the second. For instance, assuming I have a working electrical system and light bulb in my room, then the action of flipping the switch generates the action of turning on the light. Some researchers treat generation as a degenerate case of decomposition where there is a single substep. There are natural language constructs that explicitly identify each of these relationships. For instance, the connective by typically indicates a generation relationship, as in / turned on the light by flipping the switch, or sometimes an enablement relation, as in / bought some milk by going to the store. The connective in order to indicates the same relationships, as in / flipped the switch in order to turn on the light, and I went to the store in order to buy some milk. 
Of course, verbs such as enable and cause explicitly indicate relationships, as in The explosion caused the window to break, and Being at the store earlier enabled me to buy many items on sale.

BOX 15.1 Defining Causality

There are several commonsense notions concerning how actions relate to each other and the world around them. Many of these fall under the notion of causality. One action causes another to occur, or causes some effect, if the performance of that action somehow leads to that effect. Intuitively, you might want to say that the effect would not have arisen if the action had not occurred. While this seems straightforward, it is extremely difficult to define causality in terms of simpler concepts. For example, you might try to model the claim that the action of dropping the glass caused the glass to break by an axiom in FOPC of the form

DROP(Sam, Glass) ⊃ BROKEN(Glass)

But this formula would be trivially true if Sam did not drop the glass because, given the definition of logical implication, a false antecedent makes the implication true. If you then augment this condition by adding that the antecedent is true (that is, Sam did drop the glass), then it is logically equivalent to the conjunction

DROP(Sam, Glass) & BROKEN(Glass)

which doesn't capture the notion of causality either. You might augment this further by saying that if Sam had not dropped the glass, it would not have broken. This gets closer to a definition but introduces further problems. First, the preceding statement is a counterfactual statement; that is, it makes a claim about how the world would be if something that is actually true were not true. Such statements cannot be modeled in simple FOPC but require the introduction of modal operators and an appropriately more complex semantic model (Lewis, 1973). Even if this were done, however, it still wouldn't precisely define the notion of causality, because if Sam had not dropped the glass, some other event could have occurred causing the glass to break (maybe Sam dropped the glass because someone had thrown a stone at him, and if he had not dropped the glass, it would have been hit by the stone). Given all these difficulties, many theories take the notion of causality as primitive and do not try to decompose it further.

In order to identify and use such relationships, the KR system must be able to represent general knowledge about actions and causality. Most representations organize such information around each predicate defining an action occurrence. Figure 15.4 shows the definitions of two actions, BUY and PURCHASE-TICKET. The definition of BUY is the same as presented in Chapter 13. The roles define the important objects involved in an action, and the constraints indicate the typical properties of the role values. As usual, the preconditions indicate what typically is true before the action can occur, and the effects indicate how the world changes after it occurs. The decomposition indicates a typical way to accomplish the action in terms of a sequence of subactions.
The Action Class BUY(e):
  Roles: Buyer, Seller, Object, Money
  Constraints: Human(Buyer), SalesAgent(Seller), IsObject(Object), Value(Money, Price(Object))
  Preconditions: AT(Buyer, Loc(Seller)), OWNS(Buyer, Money), OWNS(Seller, Object)
  Effects: ¬OWNS(Buyer, Money), ¬OWNS(Seller, Object), OWNS(Buyer, Object), OWNS(Seller, Money)
  Decomposition: GIVE(Buyer, Seller, Money), GIVE(Seller, Buyer, Object)

The Action Class PURCHASE-TICKET(e):
  Roles: Agent, Clerk, Ticket, Booth, Money, Station
  Constraints: Human(Agent), Clerk(Clerk), IsTicket(Ticket), TicketBooth(Booth), At(Clerk, Booth), Value(Money, Price(Ticket)), In(Booth, Station)
  Preconditions: OWNS(Agent, Money)
  Effects: OWNS(Agent, Ticket)
  Decomposition: GOTO(Agent, Booth), BUY(Agent, Clerk, Ticket)

Figure 15.4 The definitions of BUY and PURCHASE-TICKET

The actions in the figure display differing levels of detail. BUY is a general action allowing almost any object to be bought and giving the details of the exchange of money and the object. It has preconditions that the buyer has the money and the seller has the object, and effects that change the possession of the money and object. It is accomplished by two GIVE actions. PURCHASE-TICKET is restricted to a particular setting of buying tickets at a ticket booth, and only the important details of the transaction, such as the agent having a ticket afterwards, are provided. PURCHASE-TICKET uses BUY as one of its substeps.

This information can be used in several ways. First, given a sentence that asserts the occurrence of such an action, the preconditions and effects suggest some implications of the sentence. Given Jack bought a stereo at the mall, a system could conclude that Jack now owns a stereo, that he no longer has the money, and that there was a store (at the mall) that once owned the stereo but no longer does. These implicatures could also be used as expectations, as the speaker might refer to one of these conditions next. But expectations based on effects or preconditions are usually too obvious to be very useful for interpreting the following sentences. The discourse

9a. Jack bought a stereo at the mall.
9b. He now owns it.

while understandable, is boring because 9b is such an obvious consequence of 9a that there seems no point in saying it. Other more interesting expectations arise from actions that are enabled by the buying action or that the buying action is part of. Potentially, any action in the KB with a precondition that matches an effect of the buy action is a possible expectation. For instance, consider

10a. Jack bought a stereo at the mall.
10b. Now he can disturb his neighbors late at night.

The connection is that buying a stereo enables Jack playing the stereo loudly, which in turn can generate the action of annoying Jack's neighbors. Clearly, there is a vast range of actions that could be enabled by buying a stereo, so in practice it is not feasible to generate all the expectations in advance. The next few sections describe some techniques for controlling the generation of expectations.

15.5 Scripts: Understanding Stereotypical Situations

One way to control the generation of expectations is to store larger units of information, often called scripts, that identify common situations and scenarios in the domain of interest. Then, rather than generating expectations from first principles using causality reasoning, the expectations are encoded in the script.
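To make the shape of these definitions concrete, here is one way you might encode an action class as a simple data structure. This is only an illustrative sketch in Python; the class and field names are assumptions, and a real system would store the constraint and effect formulas in the KR language of Chapter 13 so that the matcher of Section 15.2 could operate on them.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionClass:
    # Formulas are written here as plain strings over the role names;
    # this is a simplification of the representation used in the text.
    name: str
    roles: List[str]
    constraints: List[str] = field(default_factory=list)
    preconditions: List[str] = field(default_factory=list)
    effects: List[str] = field(default_factory=list)
    decomposition: List[str] = field(default_factory=list)

# The BUY action of Figure 15.4 transcribed into this format.
BUY = ActionClass(
    name="BUY",
    roles=["Buyer", "Seller", "Object", "Money"],
    constraints=["Human(Buyer)", "SalesAgent(Seller)", "IsObject(Object)",
                 "Value(Money, Price(Object))"],
    preconditions=["AT(Buyer, Loc(Seller))", "OWNS(Buyer, Money)",
                   "OWNS(Seller, Object)"],
    effects=["not OWNS(Buyer, Money)", "not OWNS(Seller, Object)",
             "OWNS(Buyer, Object)", "OWNS(Seller, Money)"],
    decomposition=["GIVE(Buyer, Seller, Money)", "GIVE(Seller, Buyer, Object)"],
)

The same record shape serves equally well for small actions such as BUY and for the larger script-like actions discussed next.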
To be useful in controlling inference, scripts must involve many different actions, possibly by different agents, and specify a specific commonly occurring chain of causality between them. The representation of actions developed so far is general enough to be used to represent scripts as well. Figure 15.5 shows a TRAVEL-BY-TRAIN action (or script) that describes a typical scenario in which a passenger takes a train trip. It includes the steps of going to the station and buying the ticket, and then going to the appropriate departure location and boarding the train, and so on. The script also contains a large number of constraints detailing how the objects involved relate to each other.

TRAVEL-BY-TRAIN(e):
  Roles: Actor, Clerk, SourceCity, DestCity, Train, Station, Booth, Ticket, Money
  Constraints: Person(Actor), Person(Clerk), City(SourceCity), City(DestCity), TrainStation(Station), TicketBooth(Booth), In(Station, SourceCity), In(Booth, Station), At(Clerk, Booth), DepartCity(Train, SourceCity), Destination(Train, DestCity), TicketFor(Ticket, SourceCity, DestCity), Value(Money, Price(Ticket))
  Preconditions: Owns(Actor, Money), At(Actor, SourceCity)
  Effects: ¬Owns(Actor, Money), ¬At(Actor, SourceCity), At(Actor, DestCity)
  Decomposition: GoTo(Actor, Station), Purchase-Ticket(Actor, Clerk, Ticket, Station), GoTo(Actor, Loc(Train)), GetOn(Actor, Train), Travel(Train, SourceCity, DestCity), Arrive(Train, DestCity), GetOff(Actor, Train)

Figure 15.5 A script to travel by train

Two main problems arise when you use script-based knowledge in language understanding. The first problem is script selection: How can the system tell which script is relevant? The second problem is keeping track of what part of the script is currently being described. You can make an analogy between this problem and that of trying to follow a play using a playwright's script. You would first have to identify which play is in progress; then you would have to find the place in the script for that play that corresponds to what is currently being performed. This place is called the now point, since it indicates the present time from the perspective of the script.

You match sentences into the script structures in different ways depending on what the sentence describes. For instance, sentences describing goals are used to identify the relevant scripts. Sentences describing actions are used to update the progress of a script. More specifically, there are three ways to introduce a script if none is currently in progress:

1. Describe an action as a goal, as in Jack wanted to take the train to Rochester; in these cases the desired action serves to name the script.
2. Describe a state as a goal, as in Jack needed to be in Rochester; in these cases the content is matched against the effects of the scripts to find ones that could achieve the desired state.
3. Describe an action in execution, as in Jack walked up to the ticket booth; in these cases the content is matched against the decompositions of the scripts to find ones that could have this action as a step.

Once a script is selected, it is updated when descriptions of actions, such as in Jack walked up to the ticket booth, are given. In these cases the now point of the script is updated to be just after the described action.
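As a rough illustration of this bookkeeping, the following sketch shows how script selection and the now point might be managed. The matches argument stands in for the equality-based matcher of Section 15.2, and the attribute names assume the illustrative ActionClass structure sketched above; none of this is intended as the actual implementation.

class ScriptInstance:
    def __init__(self, script):
        self.script = script                     # an ActionClass-style definition
        self.steps = list(script.decomposition)  # substeps in order
        self.now_point = 0                       # index of the next expected step

    def expectations(self):
        # Steps at or beyond the now point are the active expectations.
        return self.steps[self.now_point:]

    def update(self, described_action, matches):
        # Move the now point to just after the first step matching the
        # action described by the new sentence.
        for i in range(self.now_point, len(self.steps)):
            if matches(described_action, self.steps[i]):
                self.now_point = i + 1
                return True
        return False

def select_script(content, library, matches):
    # The three ways to introduce a script: the content names the script as a
    # goal, matches one of its effects, or matches one of its decomposition steps.
    for script in library:
        if (matches(content, script.name)
                or any(matches(content, e) for e in script.effects)
                or any(matches(content, s) for s in script.decomposition)):
            return ScriptInstance(script)
    return None

All of the interesting work is hidden inside the matcher; the script machinery itself only limits which expectations are considered at any point.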
Script and Role Instantiation

Consider how the simple TRAVEL-BY-TRAIN script could be used to analyze the following story fragment:

11a. Sam wanted to take a train to Rochester.
11b. He purchased a ticket at the station.

To handle 11a, the system has to recognize that a sentence of the form "agent wants action" is a statement of a goal to perform an action—in this case taking the train. Assuming that the verb take with appropriate arguments can be used to identify TRAVEL-BY-TRAIN as the relevant script, an instantiation of the script, T1, is created to describe the current situation. Consider this in detail. The goal described in 11a could be

Travel(E1) & Agent(E1) = Sam1 & Instrument(E1) = TR1 & IsTrain(TR1) & Dest(TR1, ROC)

Matching this to T1, an instantiation of the TRAVEL-BY-TRAIN action, using appropriate translations between the verb case roles and the script roles, would produce the following equality assumptions:

Actor(T1) = Agent(E1) = Sam1
Train(T1) = Instrument(E1) = TR1
DestCity(T1) = ROC

The instantiated decomposition of T1 using these assumptions is shown in Figure 15.6.

GoTo(Sam1, Station(T1))
PurchaseTicket(Sam1, Clerk(T1), Ticket(T1), Station(T1))
GoTo(Sam1, Loc(TR1))
GetOn(Sam1, TR1)
Travel(TR1, SourceCity(T1), ROC)
Arrive(TR1, ROC)
GetOff(Sam1, TR1)

Figure 15.6 The instantiated decomposition of T1 given sentence 11a

The system now uses this instantiated script to interpret 11b. You need to find a set of equality assumptions such that some substep matches the semantic content of the new sentence, which in this case might be

PurchaseTicket(E2) & Agent(E2) = H1 & Theme(E2) = TIC3 & At-Loc(E2) = STAT3 & Ticket(TIC3) & Station(STAT3)

where TIC3 is a new constant generated for a ticket, and H1 is the constant generated for the pronoun he. This will match the second step in the decomposition of T1, which looks as follows in the event-based form:

PurchaseTicket(P1) & Agent(P1) = Sam1 & Clerk(P1) = Clerk(T1) & Ticket(P1) = Ticket(T1) & Station(P1) = Station(T1)

The match would produce the following equality assumptions:

H1 = Sam1
TIC3 = Ticket(P1) = Ticket(T1)
STAT3 = Station(P1) = Station(T1)

As a result of this match, the pronoun he is resolved to Sam1 as desired, and a unique referent for the definite description the station is identified as the train station Station(T1), even though it has not been mentioned previously in the discourse. In addition, there are a large number of implicatures that can be drawn from the constraints defined for the script. We can infer that the station is in the city that Sam is leaving from, that the ticket introduced in 11b is a train ticket to Rochester, and that Sam went to the station. Also, we have expectations about Sam boarding the train, as well as the other actions involved in taking the trip.

Several fairly obvious generalizations of the script structure would be needed to handle more realistic examples. The decomposition structure would need to be extended to allow a partial ordering on the subactions. Another important extension would be to allow the representation of choices between actions in a script. This would allow a script to follow different paths depending on the state of the world when the actions are executed. Some stories explicitly refer to choices in giving explanations of why some action was performed or was successful. For instance, a script for entering a house may have a different action sequence depending on whether the doors are locked.
If they are locked, the agent must have a key and use it to unlock the door. Even with such generalizations, however, script-based techniques only work in limited domains where the actions that can be described in the discourses can be enumerated in advance.

15.6 Using Hierarchical Plans

A script can successfully model the content of a discourse only if it proceeds along lines anticipated by the script. A richer framework is needed to model the content of a discourse in which unexpected connections need to be made between actions. Plan-based models were developed to address this deficiency. The same representation of actions is used, but the actions defined tend to be smaller-scale than those in scripts. A plan is a set of actions, related by equality assertions and causal relations, that if executed would achieve some goal. A goal is a state that an agent wants to make true or an action that an agent wants to execute. The classic planning problem in AI takes a goal and a set of possible actions as input and produces a plan, typically a sequence of actions, that will achieve the goal. The reasoning needed in language understanding is different: It involves inferring the plans of other agents based on their actions. This process is generally called plan recognition or plan inference. The input to a plan inference process is a list of the goals that an agent might plausibly be pursuing and a set of actions that have been described or observed. The task is to construct a plan involving all the actions in a way that contributes toward achieving one of the goals. By forcing all the actions to relate to a limited number of goals, or to a single goal, the plan-based model constrains the set of possible expectations that can be generated.

The script-based system can be seen as a simple plan-based system. Each script defines both a plausible goal and the way to achieve it. The plan inference task involves selecting the appropriate plan that matches the input. A plan-based system, however, is not limited to selecting from a predetermined set of plans. It can use causal reasoning to construct new plans out of any of the actions defined in the domain. Thus the plan inference framework has the potential to model much richer discourse.

Let us start with a simple case of plans called hierarchical plans. Hierarchical plans are built out of a set of actions defined in the KB, called the action library. The most important aspect of each action definition is its decomposition relation, as this defines a hierarchical relationship between actions. A subset of the actions in the library are identified as plausible goals. A hierarchical plan is defined as a tree of actions, rooted by an action that is a plausible goal, where the mother/daughter relation indicates a decomposition relationship. The entailments of a plan are the consequences of each action definition in the tree, together with the equality relations resulting from the matches to the input, and matches of actions into decompositions of other actions. Figure 15.7 shows a simple hierarchical plan constructed by combining the TRAVEL-BY-TRAIN and PURCHASE-TICKET actions defined previously, and assuming that the TRAVEL-BY-TRAIN action is identified as a plausible goal.

Figure 15.7 A plan to travel by train: a decomposition tree rooted at TRAVEL-BY-TRAIN(T2), with PURCHASE-TICKET(P1) among its substeps, which in turn has the substep GOTO(G5), where Agent(G5) = Agent(P1) and ToLoc(G5) = Booth(P1)
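Before turning to a worked example, here is a minimal sketch of how top-down, breadth-first decomposition chaining might generate and test expectations from such a plan. The library argument is assumed to map action names to ActionClass-style definitions, and matches again stands in for the matcher of Section 15.2; this is an illustrative sketch, not the algorithm as implemented in any particular system.

from collections import deque

def action_name(step):
    # Hypothetical helper: "GoTo(Actor, Station)" -> "GoTo"
    return step.split("(", 1)[0].strip()

def chain_to_input(goal_action, described_action, library, matches, max_depth=4):
    # Breadth-first search through decompositions, rooted at the plausible goal,
    # for a step that matches the action described by the new sentence.
    # Returns the chain of steps from the goal down to that step, or None.
    queue = deque([(goal_action, [goal_action])])
    while queue:
        step, chain = queue.popleft()
        if len(chain) > max_depth:
            continue
        definition = library.get(action_name(step))
        if definition is None:
            continue                              # primitive action: no substeps
        for substep in definition.decomposition:
            if matches(described_action, substep):
                return chain + [substep]          # expectation satisfied
            queue.append((substep, chain + [substep]))
    return None

For the discourse analyzed below, the goal node is TRAVEL-BY-TRAIN and the described action is the GoTo event of sentence 12b; the chain found runs through PURCHASE-TICKET to its GOTO substep, mirroring the order in which the expectations are listed in the example.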
Assume for the moment that the plan in Figure 15.7 is already constructed. It then can be used to account for the coherence of the following discourse:

12a. Sue wanted to take a train to Rochester.
12b. She walked up to the ticket clerk.

Sentence 12a serves to identify Sue's goal as traveling to Rochester by train. Utterances that describe goals must match an action that is the root node of the plan. This would match the root of the plan in the same way described with scripts in the last section and produce the equality assumptions

Actor(T2) = Sue1
Train(T2) = TR37
DestCity(T2) = ROC

Sentence 12b, with the content

GoTo(E3) & Agent(E3) = PRO1 & ToLoc(E3) = LOC1 & At(C1, LOC1) & TicketClerk(C1)

matches the first step of the decomposition of P1, given the equality assumptions

G5 = E3
Agent(G5) = PRO1 (= Agent(P1) = Actor(T2) = Sue1)
ToLoc(G5) = LOC1 (= Booth(P1))
Clerk(P1) = C1

With these equality assumptions, the plan entails the contents of 12b. You can see this by considering part of the definition of P1 after the roles are instantiated, as shown in Figure 15.8. Only the formulas used to entail the contents of 12b are shown.

PURCHASE-TICKET(P1):
  Roles: Agent = Sue1, Clerk = C1, Booth = LOC1
  Constraints: At(C1, LOC1)
  Decomposition: GOTO(Sue1, LOC1), BUY(Sue1, C1, Ticket(P1))

Figure 15.8 The parts of the instantiated action P1 that entail sentence 12b

Of course, the success of this method depends on having an algorithm that can construct such a plan given the input. A technique called decomposition chaining can be used here, which involves searching through decomposition relations until a chain of actions is found where the first matches the input and the last matches a plausible goal. When the goal is already defined, top-down methods usually prove to be most effective. For example, after processing 12a, the plan would consist of the single node TRAVEL-BY-TRAIN(T2). A top-down, breadth-first search through the decompositions could be used to generate expectations to match against the content of 12b. This strategy would generate the following expectations (in the order listed):

First check all the immediate substeps of T2:
  GOTO(G1)—fails to match, as LOC1 ≠ Station(T2)
  PURCHASE-TICKET(P1)—fails to match because of type incompatibility
  GOTO(G2)—fails to match because LOC1 ≠ Loc(TR37)
  GET-ON(G3), TRAVEL(T3), ARRIVE(A1), GET-OFF(G4)—all fail to match because of type incompatibility
Next check the substeps of G1: assume there is no decomposition.
Next check the substeps of P1:
  GOTO(G5)—matches with the equalities defined above.

Several difficult problems were skipped over in this example, however, regarding how this particular set of matches was found. For example, why doesn't 12b match step 1 of T2, GoTo(Actor, Station), as shown in Figure 15.5? The answer must lie in the subsystem that reasons about locations. It would have to conclude that the location of the clerk, LOC1, does not equal the location of the station. Such issues are typically handled heuristically, and a thorough study of how location reasoning integrates with plan-based matching remains to be done. The other issue is what to do when the goal is not defined. You can either search top-down from all plausible goals looking for a match, or use a bottom-up technique, matching into the substeps of any action, and then use the actions found to match into the substeps of other actions. Complications arise either way, as there will often be many possible matches.
For instance, assume the first sentence was 12b. How might the goal be identified? There will likely be many actions in the KB that involve a GOTO step. Perhaps the KB has a GO-FOR-A-WALK action that is a plausible goal in addition to TRAVEL-BY-TRAIN and that probably contains a GOTO action. How might a system decide between the two? There is no foolproof answer. Some systems use heuristic preferences for one reading over another. Another strategy is to not decide yet, but wait for additional input that might affect the decision.

15.7 Action-Effect-Based Reasoning

While decomposition chaining on hierarchical plans is an effective technique that constrains the search space, it is often not powerful enough to interpret many discourses. The problem is that not all causal relations are easily encoded as decomposition relationships. This section considers other methods for inferring more general plans that greatly expand the range of situations a system can interpret, at the price of expanding the search space.

Consider a simple example involving the action-effect relationship. The discourse

13a. Jack needed to be in Rochester by noon.
13b. He bought a ticket at the station.

can be interpreted only if the system realizes that the first sentence describes the effect of performing the action TRAVEL-BY-TRAIN to Rochester, and the second is a substep of that plan. To do this, states must be allowed as goals as well as actions, and the plan inference algorithm must search not only from actions to other actions via decompositions but also from actions to states via the actions' effects, and vice versa. For example, sentence 13a would be analyzed initially as a goal state At(Jack1, ROC). The logical form of 13b would not match directly into the expectation. To find the connection, the system would first have to match the appropriate effect of the TRAVEL-BY-TRAIN action to this goal state and then use decomposition chaining to find the match to the PURCHASE action described in 13b. The resulting plan is shown in Figure 15.9. Nodes representing actions are drawn as ovals, and those representing states as rectangles. Three types of causal links are used:

has-as-step—relates an action to an act in its decomposition
enables—relates a state to an action that has it as a precondition
has-effect—relates an action to one of its effects

Furthermore, nodes that have matched the logical form of a sentence are labeled in boldface, whereas the others are labeled in plain text. For example, the graph in Figure 15.9 indicates that an effect of the TRAVEL-BY-TRAIN action is the state At(Jack1, ROC).

TRAVEL-BY-TRAIN(T3), Actor(T3) = Jack1, DestCity(T3) = ROC
  has-effect: At(Jack1, ROC)
  has-as-step: PURCHASE-TICKET(P1), Buyer(P1) = Jack1, Seller(P1) = Clerk(T3), Object(P1) = Ticket(T3)
    has-as-step: BUY(B1), Buyer(B1) = Jack1, Seller(B1) = Clerk(P1), Object(B1) = Ticket(T3)

Figure 15.9 The plan recognized from discourse 13

More complex examples of action-effect-based reasoning arise when two actions are related, not by both being steps of some other action, but by one action enabling the other to occur (that is, one action's effects satisfy the preconditions of the other). For example, consider the discourse

14a. Jack bought a CD player.
14b. He played his favorite CDs all night long.

OWNS(Jack1, CD3) enables PLAY-CD(F2), Agent(F2) = Jack1, Instrument(F2) = CD3

Figure 15.10 The plan involving an enablement relation from discourse 14
Let's assume that there is no pre-existing action in the library that involves buying a CD player and then using it to play CDs, but there are definitions of the individual actions. Then the action of playing a CD could have a precondition that the agent owns a CD player, and the action of buying something could have an effect that the agent now owns the thing bought. This precondition and effect can match to identify an enabling relation between the two actions. The resulting plan is shown in Figure 15.10.

A Plan Inference Algorithm

This section formulates a plan inference algorithm in a more precise manner. The plan structure that serves to generate expectations will be called the E-plan. In general the system might have to maintain many different E-plans to handle ambiguity, though here the algorithm will assume that it always starts with a single E-plan. Generalizing to a set of E-plans is straightforward. The new algorithm will still involve decomposition chaining, but it also will check for immediate action-effect and action-precondition connections. In stating the algorithm, some additional definitions will be helpful. In particular, the nodes in an E-plan P that have matched a logical form are called the realized nodes, whereas those not yet matched directly to a logical form are called expectation nodes. Expectation nodes are further classified into the following sets:

... ∃p AtHome(p), to IPLAN-18d
  Goal Hierarchy: Inside-House(Jack1), EnterViaDoor(Jack1, Door1)
  Actions: Knock(Jack1), ∃p UnlockDoor(p) ** FAILED
via update 18d, precondition failure, to IPLAN-18e
  Goal Hierarchy: Inside-House(Jack1), EnterViaDoor(Jack1, BackDoor1), At(Jack1, BackDoor1)
  Actions: GoTo(Jack1, BackDoor1)

Figure 15.14 A trace of the iplans for discourse 18

definition for the OpenDoor action, which requires the door to be unlocked. Clause 18c is then interpreted as a response to a failed action, which in this case is an alternate way to achieve the active goal. Clause 18d indicates that this attempt also failed due to precondition failure, producing IPLAN-18d. Finally, 18e is interpreted as indicating an abandonment of the active goal and the development of a new way to achieve the higher level goal of Inside-House(Jack1).

There are many constructs in language that explicitly refer to concepts and actions that need to be defined by such a model. For example, discourse 18 showed two uses of the connective but, in each case treated as having a sense indicating a precondition failure. Other expressions are easy to find. Consider the discourse

19a. Sue had to find a way to get enough money for the ticket.
19b. She tried to use the automatic teller machine, but it was broken.
19c. She eventually gave up and walked to the bus station and took a bus.

The phrases find a way, try, and give up all directly relate to activities that can be defined within the iplan model.

The iplan model is just a start, however, and many other complications arise that indicate a need for still more general models. For instance, plans that involve repetition of some activity, such as pounding in a nail with a hammer, or cyclic events, such as walking to school each day, cannot be represented in any general fashion. For instance, the system might need to recognize that an agent bought a bus pass so that he or she could take the bus into work each day.
The one action of buying the pass serves as an enabling condition for a whole series of bus trips. Important aspects of a story could be missed if the action of buying a bus pass was linked only to a single occurrence of the TAKE-BUS action. Other goals are difficult to represent even though they don't involve repetition. For instance, what is the goal of the action of going to the theater? Intuitively you know it is to see the play and ultimately to enjoy yourself. You would have to represent this goal explicitly to understand a story in which an evening out at the theater was ruined because your wallet was stolen. To handle this, you would need a theory of what makes activities worthwhile and enjoyable. Complications also arise because people usually don't have a single goal, and when considering situations where multiple goals are described, the system needs to understand situations where these goals conflict or are in concord. For example, a discourse might describe a person's dilemma in deciding between studying for an exam tomorrow (and thus passing) or going out to see a movie. The system would need to be able to represent this dilemma and reason about such things as why the goals conflict. It would also have to understand descriptions of how the person attempts to avoid the conflict and achieve both goals. Another discourse might describe a situation where two people have conflicting goals (say both want to own a particular race horse) and their respective actions would need to be interpreted in light of this conflict. Similar cases occur when goals complement each other and when people are cooperating toward mutually compatible goals. Summary Knowledge about causality and everyday activities is essential for understanding much discourse. Indeed, you cannot correctly handle many word ambiguities and examples of reference without using such knowledge. This knowledge is used to generate expectations that are matched into possible interpretations of the input. Significant effort has been made to find ways to control the generation of expectations. Techniques in this area range from the fairly large-scale script structures to more flexible plan inference systems that use general knowledge about actions and goals. Significant work remains to be done before these techniques can be applied successfully in realistic domains. Related Work and Further Readings The notion of coherence is crucial to motivate techniques for understanding discourse. It is important to realize that this involves finding semantic connections between sentences rather than finding structural properties. For instance, Halhday 496 CHAPTER 15 and Hasan (1976) claim that discourse coheres as a result of certain linguistic devices, such as reference—using pronouns and definite descriptions to refer back to objects mentioned in other sentences (deep and surface forms, one-anaphora, and so on) ellipsis—using the structure of one sentence to "fill in" another conjunction—connecting sentences by and, further, but, and so on lexical connections—using words that are related to each other, such as opposites, paraphrase, repetition, and so on This approach is dramatically different from the computational theories presented in this chapter. Halliday and Hasan claim that these phenomena cause cohesion, whereas the computational approaches say that the content and topic flow of a discourse make it coherent and the phenomena listed above arise because of coherence. 
In other words, referential connections do not make a text coherent; rather, a coherent text enables referential connections. This distinction is considered at length in Hobbs (1979).

Almost every language understanding system that uses knowledge of the world to interpret sentences uses some form of matching. Often, the formal properties of the matching algorithm are not explored, though it seems clear that the formulation in Section 15.2 captures the essential parts of most of the algorithms used. The formalization of equality-based matching mostly follows Charniak (1988), who also discusses using the technique for handling definite reference. Kautz (1990) presents a similar analysis within a general framework for plan recognition. One of the strongest proponents of general abductive techniques for natural language understanding is Hobbs (Hobbs et al., 1993).

One of the earliest works that used general world knowledge to identify the connections between sentences was by Schank and Rieger (1974). They identified 16 different classes of connections, most of which have since been reformulated into the plan-based reasoning systems. Rieger's system involved searching through all possible inferences from each sentence, and thus was quite inefficient. Much of the work that has followed has been aimed at removing the inefficiencies by using more specialized representations of knowledge about actions. Schank and Abelson (1977) introduced the idea of scripts, and Cullingford (1981) built the first script-based system. Section 15.5 describes this work in some detail but recasts it in a more conventional action representation. DeJong (1979) used scripts to extract partial information from newspaper stories. These systems all used forms of matching and inference to resolve anaphoric reference.

The work on representing action draws from the large literature in AI dealing with planning. Two classic foundational papers are McCarthy and Hayes (1969) and Fikes and Nilsson (1971). A good overview of planning can be found in the articles in Allen, Hendler, and Tate (1990). These representations were adapted for use in plan inference systems by Allen and Perrault (1980). The formalism used here is similar to that proposed by Litman and Allen (1987; 1990). Precisely defining the causal relationships between actions is quite tricky. A good reference from philosophy is that of Goldman (1970), who introduced the generation relation and distinguished it from enablement and decomposition. The work was adapted and extended for computational systems by Pollack (1990).

Plan recognition has been studied as a problem in AI for some time (for example, Schmidt et al. (1978)), and has similarities to work in automatic diagnosis and automatic tutoring. It was applied to language understanding by Wilensky (1983) and Allen and Perrault (1980). Hierarchical plans were developed by Sacerdoti (1977). Plan recognition techniques can be used both for understanding the content of language, as in this chapter, and for understanding the speaker's purpose in speaking, as will be described in Chapters 16 and 17. Carberry (1991) contains a good survey of the use of plan recognition in language understanding. Charniak and Goldman (1993) describe techniques for building a probabilistic plan recognition system. The algorithms described in Sections 15.6 and 15.7 draw ideas from Allen and Perrault (1980), Kautz (1990), and Ferguson and Allen (1993).
Wilensky (1983) was the first to extensively explore the use of knowledge about intentional behavior for interpreting stories. Section 15.8 describes some of these techniques in terms of the iplan construct. Wilensky covers a wide range of other problems as well, including goal conflict and subsumption, not covered here. The other influence on Section 15.8 is the development of reactive planning and intelligent agent architectures in the planning literature. A few papers on these topics can be found in Allen, Hendler, and Tate (1990). These issues will be explored again in more detail in Chapter 17.

Exercises for Chapter 15

1. (easy) For each of the following sentences, state some facts about typical behavior that could be used to justify them as expectations generated from the sentence Jack walked to school.
a. He is going to attend class.
b. His car is broken.
c. He will be tired at school.

2. (easy) Express the following two formulas in graph form. What knowledge about types, equality, and inequality is needed to allow the two to match? What equality assumptions are generated by the match? State the entailments resulting from the match.

Acquire(E1) & Agent(E1) = Jack1 & Theme(E1) = Book1 & AtLoc(E1) = Store34
Buy(E2) & Agent(E2) = He1 & Theme(E2) = It1 & Price(E2) = $10

3. (easy) State the most likely causal relationship (if any) and the temporal relationship between each of the following sentences if said after the sentence Jack went to the store.
a. He took his car from the garage.
b. He had taken the money from Sue's room.
c. He bought some peaches.
d. Now he has plenty of beer.
e. His car broke down along the way.

4. (medium) Consider the following discourse:
a. Jack bought some roses at Honest John's Flower Mart.
b. He paid thirty dollars for them
c. and Honest John swore that they were fresh.
d. But they wilted as soon as he left the store.
For each sentence give the following:
• a representation of the meaning using the event-based KRL
• the equality assumptions needed to match an expectation generated earlier (except for the first sentence)
• the expectation that would have to be generated from the sentence in order to account for the later sentences
• a commonsense rule that could be used to generate the expectation from the sentence

5. (medium) Implement a version of the matching algorithm described in Section 15.2. You will need to implement a subsystem to reason about type hierarchies, as described in Chapter 13, and another subsystem to keep track of what constants are not equal. Given the type hierarchy and inequality information, your program should take two graphs and return a set of equality assumptions if the two graphs match. Demonstrate your system on the examples shown in Section 15.2, as well as some additional examples that show the different capabilities of your program.

6. (medium) Assuming the action definitions in Figures 15.4 and 15.5, give a definition of the GetOn action that includes an interaction with the train conductor. With these definitions, describe in detail the steps taken by a plan inference algorithm that uses decomposition chaining to recognize the following sentence, given that the goal is known to be the action TRAVEL-BY-TRAIN. List each expectation in the order that it is generated and state why the input matches or fails to match.
Jack gave the conductor a ticket.
Draw the resulting decomposition chain showing the values of each role.
7. (medium) Trace the full plan inference algorithm, as described in Section 15.7, as it processes the following two sentences:
Jack needed to have a ticket.
He walked to the ticket clerk.
In particular, define a meaning for each sentence, and trace each step of the plan inference algorithm as it searches for a solution. Discuss any problems you have with ambiguity and suggest ways to decide on the appropriate interpretation. Draw the final plan constructed to connect the two sentences.

8. (hard) Implement a plan recognition system that uses the technique of decomposition chaining. It should use definitions of actions with decompositions, but need not explicitly represent role values. In other words, each action is represented as an atom, like the example shown in Figure 15.11. Test your system on some sample discourses, where each sentence has been converted to your representation by hand.

CHAPTER 16
Discourse Structure

16.1 The Need for Discourse Structure
16.2 Segmentation and Cue Phrases
16.3 Discourse Structure and Reference
16.4 Relating Discourse Structure and Inference
16.5 Discourse Structure, Tense, and Aspect
16.6 Managing the Attentional Stack
16.7 An Example

This chapter examines techniques for representing and reasoning about extended discourse beyond finding local connections between sentences as seen in Chapter 14. Extended discourse cannot be viewed simply as a linear sequence of sentences. Rather, in many cases the utterances cluster together into units, called segments, that have a hierarchical structure. Section 16.1 discusses the problems in viewing discourse as a linear sequence of sentences. Section 16.2 then defines the notions of discourse segments and cue phrases that signal segmental structure. Section 16.3 shows how the segment structure affects the interpretation of referential expressions, especially pronouns. Section 16.4 discusses how segments interact with inference to facilitate an understanding of the content of the discourse. Section 16.5 discusses tense and aspect, and shows how the interpretation of the temporal and causal connections between eventualities requires the use of segmental structure. Section 16.6 puts all the different components discussed earlier together to specify a model of discourse understanding. Section 16.7 presents an example that illustrates the issues discussed in the chapter.

16.1 The Need for Discourse Structure

The reference mechanisms presented in Chapter 14 were based on the structure of the previous sentence and on recency constraints. In dialogues where the topic may shift and change, however, you can see that these techniques are inadequate. Consider the following fragment, which could occur near the end of a dialogue between two persons, E and A, while E helps A assemble a lawn mower:

1a. E: So you have the engine assembly finished.
1b. Now attach the rope to the top of the engine.
1c. By the way, did you buy gasoline today?
1d. A: Yes. I got some when I bought the new lawn mower wheel.
1e. I forgot to take my gas can with me, so I bought a new one.
1f. E: Did it cost much?
1g. A: No, and I could use another anyway to keep with the tractor.
1h. E: OK.
1i. Have you got it attached yet?

The antecedent of it in sentence 1i is the rope last mentioned seven sentences earlier in 1b, even though objects mentioned since then, such as the gas can in sentence 1e, would satisfy any of the selectional restrictions that would be derived for it.
Thus the history list mechanism fails to make the correct predictions in this dialogue. In fact, no simple generalization based on a linear ordering of discourse entities can provide a satisfactory solution. Intuitively, though, it is clear what is going on. Sentences 1c through 1g are a subdialogue incidental to the interaction involving attaching the rope. In 1h, E indicates that the current topic is completed by using the phrase OK. Thus in the interpretation of 1i, the relevant previous context is based on 1b. An account of this structure requires a notion of discourse segments—stretches of discourse in which the sentences are addressing the same topic—and a generalization of the history list structure that takes the segments into account.

You might think that a generalization of the plan inference models derived in the last chapter might be useful for identifying the segments. Using such techniques, the system might be able to recognize that 1c is not a possible continuation of the plan to attach the rope, and thus represents a digression. Once the digression is completed, the plan recognizer could analyze 1i as querying the status of the action introduced in 1b. But trying to do all the work within the plan recognizer would be difficult. Whenever there is a shift of topic, such as at 1c and 1h, the plan reasoner would have to fail to find a connection between the old sentence and the new, and on the basis of this failure, initiate a new topic. This could be quite expensive, and might not be possible in some cases, since there might be an obscure interpretation that would allow a sentence such as 1c to be viewed as a continuation of the action described in 1b (for instance, the gasoline might be used to clean the engine before attaching the rope). Furthermore, you no doubt recognize that E explicitly told A that the topic had changed in 1c by using the phrase By the way. Such phrases, known as cue phrases, play an important role in signaling topic changes in discourse. In addition, a plan-based model may not be appropriate in other conversational settings, such as debates, where intersentential relationships such as "sentence X supports the claim in sentence Y" or "sentence X contradicts the claim in sentence Y" may be relevant. Yet the same cue phrases could be used in this setting. These arguments lead to the conclusion that a theory of discourse structure cannot be explained solely in terms of action reasoning.

This chapter examines a model of discourse structure that allows each of the techniques discussed in the last two chapters to be generalized and integrated. The key idea is that a discourse can be broken down into discourse segments, each of which is a coherent unit and analyzable using techniques similar to those already presented.

16.2 Segmentation and Cue Phrases

While the need for segmentation of discourse is almost universally agreed upon, there is no consensus on what the segments of a particular discourse should be or how segmentation could be accomplished. One reason for this lack of consensus is that there is no precise definition of what a segment is beyond the intuition that certain sentences naturally group together. Notwithstanding these difficulties, a good model of segmentation is essential to understanding discourse. It divides the problem into two major subproblems: (1) what techniques are needed to analyze the sentences within a segment, and (2) how segments can be related to each other. For practical purposes, a discourse segment consists of a sequence of clauses that display local coherence. The following properties should hold within a segment:
For practical purposes, a discourse segment consists of a sequence .ofs clauses that display local coherence. The following properties should hold withirE a segment: Discourse Structure 505 Event Described Informational Relation Communicative Goal El: Jack goes to store Describe El as start of story E2: Jack drives car E2 part of El Elaborate on El E3: Jack buys lobsters t E2 before E3,E3 Elaborate on El part of El E4: Jack gets home E4 provides temporal Elaborate story after El setting for E5 E5: Jack prepares E5 follows E4, E4 Elaborate story after E4 for feast enables E5 Figure 16.1 Informational relations versus communicative goals • Some technique based on recency (for example, a history list) should be usable for referential analysis and the handling of ellipsis. • A fixed time and location characterize the clauses, or there is a simple progression of time and location (as in simple narratives). • A fixed set of speakers and hearers are participating. • A fixed set of background assumptions is relevant. The last property requires that the modality of the text remains constant. For example, the text cannot switch from describing a sequence of actual events to describing a hypothetical event within a single segment. These are the structural requirements on a segment. There are two approaches to characterizing what defines a segment. The intentional view is that all the sentences in a segment contribute to a common discourse purpose; that is, the same communicative goal motivates the speaker to say each sentence in the segment. The informational view is that all the sentences in a segment are related to each other by some temporal, causal, or rhetorical relations. For example, in narratives the sentences in a single segment should combine together to describe a coherent event or situation. There is often a close correspondence between these two definitions. For instance, in most narratives, the writer's discourse intentions closely correspond to the informational level analysis. Consider the following start of a story: 2a. Jack shopped early in the day. 2b. He took his car 2c. and he bought a dozen live lobsters. 2d. When he got home, 2e. he spent the day preparing the feast. Figure 16.1 shows an analysis at both the informational level and the intentional level. The informational relations tend to describe the "fine structure" of the 506 CHAPTER 16 discourse, primarily how the events are causally and temporally related. The intentional level tends to address more global structural issues in terms of the communicative goal of relating the story. While it seems that there is always an intentional level analysis, there are discourse situations where there is no informational analysis. In discourse I, for example, sentence lc does not have any informational relationship to the content of la and lb. Rather, it is introducing a new topic for discussion. This can be analyzed at the intentional level as a change to a new topic. Examples like this motivate some researchers to argue that the intentional analysis is primary. But both levels are essential, and which approach seems more important will depend on the form of the discourse studied. In dialogues and debates, the intentional analysis seems most informative. In narratives and descriptive essays, on the other hand, the informational view seems most informative. Each view provides a useful analysis of the discourse, and neither can replace the other. With this background, the issue of segmentation can be explored in more detail. 
We define a segment as a sequence of clauses, possibly interrupted by subsegments, forming a hierarchical structure. In some views, each clause forms its own primitive segment, and these segments combine to form larger segments. For reasons to be seen later, we will not take this view. A clause does not necessarily form a segment by itself, although it can do so under the right circumstances.

The most important aspect of segments is that they have a hierarchical structure. The explanation of why the pronoun it in sentence 1i cannot refer to the gas can mentioned in 1e depended on this fact. Since 1i is not in the segment defined by sentences 1c-1h, the gas can is not available as a discourse entity for 1i. Rather, sentences 1a, 1b, and 1i define a segment. The history list generated from these sentences correctly predicts the antecedent for it in 1i. This example shows an important function of discourse segments: They define the local context for the interpretation of referential expressions. The hierarchical structure then controls the availability of various different local contexts that might be used to process the current sentence.

The second important function of segments is to organize the information conveyed in a way that facilitates the identification of the relationship between a new sentence and the prior discourse. To support this identification process there must be some representation of the semantic content of each segment constructed by inferential processing. Each segment is associated with a local discourse state (or simply discourse state), which consists of (at least) the following:

• the sentences that are in the segment
• the local discourse context, generated from the sentences in the segment, using techniques described in Chapter 14
• the semantic content of the sentences in the segment together with the semantic relationships that make the segment coherent
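A minimal sketch of such a record might look as follows. The field names are illustrative assumptions rather than any particular implementation; the attentional stack introduced below is then simply a stack of these states.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DiscourseState:
    segment_id: str
    sentences: List[str] = field(default_factory=list)    # clauses in the segment
    local_context: Dict[str, object] = field(default_factory=dict)  # center, Cp, other entities
    history: List[str] = field(default_factory=list)       # discourse entities, most recent first
    content: List[str] = field(default_factory=list)       # semantic content and coherence relations

# The attentional stack of the next subsection is then just a stack of states:
attentional_stack: List[DiscourseState] = [DiscourseState(segment_id="SEG1")]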
The states on the discourse stack correspond 508 CHAPTER 16 SEGl(a) After a SEGl(a, b, f) After f SEGl(a,b) After b SEG3(h) SEGl(a, b, f, g) After h SEG2(c) SEGl(a, b) After c SEG3(h, i, j, k) SEGl(a, b, f, g) After k SEG2(c, d,e» SEGl(a, b) After e SEGlta, b, f, g, \) After 1 Figure 16.4 Part of the sequence of discourse stacks for the same discourse to the set of segments that could be extended by the next clause. The top state on the stack corresponds to the most deeply embedded segment that can-be-intended. Each state on the stack corresponds to a segment that contains the segments of the states above it. To begin a new segment in the discourse, a new slate must be pushed on the stack. To extend a segment corresponding to a state lower on the stack, all the states above it must be popped from the stack. This might sound complicated, but the stack model is quite simply related to the hierarchical segment structure. In particular, if you consider the sequence of discourse stacks, one after each clause, the sequence resembles a depth-first traversal through the tree of discourse segments. Figure 16.4 shows snapshots from the sequence of -discourse stacks for the discourse shown in Figures 16.2 and 16.3. To show the relationship between the discourse states and their corresponding segments, the ; discourse states are labeled with the segment name followed by a list of the 7 clauses seen so far in the segment. This provides a unique name for each of the » discourse states produced during a discourse. For example, discourse state* SEGl(a, b) is the discourse state corresponding to segment SEG1 at the point when clauses a and b have been processed. While not shown in the figure to save -space, segment names will also need to include the completed subsegments so that the state of a segment can be tracked over multiple subsegments. Thus the -full name for the state after clause f would be SEGl(a, b, seg2, f). One of the more important indicators of the structure of a discourse is the use of cue phrases to signal the relationship of the next clause to the preceding? discourse. Depending on the goals of their research, different researchers use different sets of cue phrases, but they can all be divided into two broad classes -depending on what they signal. The first class identifies semantic relationships.* between clauses or states, and the second class indicates discourse structure directly without identifying a semantic relationship. To introduce the first class, consider the two sentences Jack went to the store. Sam stayed home. Cue Phrases Typical Use Cue Phrases for Typical Use for Structure Semantic Relations anyway end digression and continuation by the way start digression because causation/reason bye end dialogue but contrast first intro. subtopic furthermore new subtopic (itemization) however contrast incidentally start digression meanwhile new topic last new subtopic (at same time) (itemization) so conclusion next new subtopic then causal/ temporal (itemization) therefore summary now intro. subtopic though contrast OK close topic Figure 16,5 Some cue phrases and their uses Taken as presented, there is no obvious relationship between the two sentences except an implied temporal overlap. But many words could be added to explicitly indicate the intended relationship between the events. For instance, you could indicate that the reason that Jack went to the store was because Sam stayed home, Jack went to the store because Sam stayed home. 
or that Sam stayed home because Jack went to the store, Jack went to the store. So Sam stayed home. Jack went to the store. Therefore, Sam stayed home. or that Sam stayed home even though you would have expected him not to, Jack went to the store but Sam stayed home. Jack went to the store. However, Sam stayed home. or that these two events are both evidence for some other conclusion, Jack went to the store. Furthermore, Sam stayed home, or finally, that a certain temporal relationship holds, Jack went to the store. Meanwhile, Sam stayed home. The second class of cue phrases signals the discourse structure directly without necessarily indicating a semantic relationship. Typically, they indicate segment boundaries. They include phrases used to end the current topic under discussion (such as OK, fine), to end the discourse itself (such as bye, thanks), to signal a digression (such as by the way, incidentally), to signal the end of a digression (such as anyway), or to indicate a particular discourse organization, such as itemization (for example, first, second, next, last). Figure 16.5 lists some cue phrases in these two broad classes together with some of their typical uses. 510 CHAPTER 16 Discourse Structure 511 SEGl(la, lb): Center: 0, Cp: Rl, Others: Tl, El Content: PullRope(Rl), TopOffTl, El), Engine(El) Figure 16.6 The discourse stack after sentence lb 16.3 Discourse Structure and Reference The example at the beginning of this chapter used a reference problem u> motivate the need for a hierarchical discourse structure. The attentional stacks described in the last section provide the mechanism to account for this problem. The digression creates a new discourse state that temporarily hides the original discourse state when it is pushed on the stack. When the digression ends, its st,iii. is popped off the stack, and the original state becomes available. Consider this example in more detail. Figure 16.6 shows the discourse stack at the end ul sentence lb, Now attach the rope to the top of the engine. The current discourse entities and their properties are shown as part of the local discourse state fori lie segment. In particular, entities Rl (the rope), Tl (the top), and El (the engine) are available for subsequent reference, and Rl is the preferred next center. In sentence lc, the cue phrase By the way signals a digression, so a new discouw state is pushed on the stack. This top state is then extended by utterances I.I through lg, resulting in the stack shown in Figure 16.7. The discourse entitle-, available for reference describe the new gas can, G3, and the tractor, T2. The clue word OK in lh indicates the end of the digression, and the discourse state for SEG2 is popped off the stack, resulting in the discourse stack shown in Figuie 16.8. Note that this state is the same as the one in Figure 16.6 that arose afu i sentence lb. Thus when utterance li is processed, the pronoun it will refer to Rl, as expected. It is not possible for it to refer to the gas can, G3, because thai discourse entity is no longer available in the discourse state. Pronouns play an important role in most arguments about discouw structure because, as you saw in Chapter 14, there are strong constraints on where the antecedent can appear. In particular, in most cases the antecedent is in the previous sentence and is subject to recency constraints. That is why example-, such as the pronoun it in sentence li pose such a problem. 
The hierarchical discourse model provides an elegant solution to the problem that retains intuition-, about the importance of recency. The analysis for digressions is fairly straightforward because cue phrase-typically signal the segment boundaries. Some other forms of discourse also h.iu-a clearly defined segment structure. For example, itemization constructs clien u-f\ SEGl(15a) Jack goes to Helen's past V SEGl(15a, 15c) Jack goes to Helen's Jack drops roses . past^/ SEG3(15d) Carpet is cleaned • pasty perf SEGl(15a, 15c) Jack goes to Helen's Jack drops roses , past^ After 15b After 15c After 15d Figure 16.16 Tense trees and segmentation for discourse 15 If these utterances all fall in the same segment and thus fit into the same tense tree, the events in 15b and 15d will both be assigned to the same node in the tense tree, with the consequence that the event of buying the roses will orient the event of Helen cleaning the carpet—clearly an unintuitive analysis, as they seem unrelated. This analysis is avoided if each segment can be associated with exactly one node in the tense tree. A sentence that shifts to a tense below the tense of the present segment signals a push, and a sentence that uses a tense above the tense of the current segment signals a pop and resumption. With this, the analysis of 15 appears as shown in Figure 16.16. Sentence 15b signals a push as it moves to the past perfect tense. Sentence 15c, on the other hand, resumes the simple past, signaling a pop and resumption of segment SEGl. Sentence 15d, also in the past perfect, signals a segment push, creating a new segment not related to SEG2. This produces the desired interpretation. 524 CHAPTER 16 SEG2(16b, 16c) Jack buys roses Jack pays $30 pas^/ • SEGl(16a) Jack walks to Helen's • past^ Figure 16.17 The attentional stack after sentence 16c Unfortunately, tracking tense is not always so simple. Sometimes a speaker uses the past perfect tense to signal a shift to an event back in time, but then elaborates on that event using the simple past, as in the following discourse: 16a. Jack was walking over to Helen's house. 16b. He had bought some roses at Honest John's Flower Mart. 16c. He paid thirty dollars for them 16d. and Honest John swore they were fresh. 16e. But they wilted before he got to Helen's. Here, 16b signals a new segment in the past of the event in 16a, and 16c and 16d further elaborate on the event in 16b. But these two sentences are in the simple past. What has happened is that there has been a shift in temporal perspective.. This might be handled by allowing a modification of the tense tree associated with the segment. The attentional stack after sentence 16c would be as shown in Figure 16.17. Given this context, if the next sentence is in the simple past, it will be ambiguous as to whether it continues the top segment or resumes the lower segment. Such issues would have to be decided based on inferential processing, as they cannot be determined by the tense information. 16.6 Managing the Attentional Stack The previous sections have explored different issues affecting discourse structure. This section brings these issues together to specify a simple model for the on-line processing of discourse. In order to be specific, it will concentrate on a particular form of discourse, namely that which describes events and situations in the world. 
This includes much narrative, instructional text, and dialogues about specific tasks—essentially any discourse where the inferential component; is based on causal and temporal reasoning. These forms of discourse will he referred to as causally related discourse. With the form of inference defined, the discourse purposes of a segment. can be explored in more detail. In particular, each segment is motivated by the Local Discourse Context: Center: Jack 1, Cp: Jack 1, Others: Roses 1 Last Eventuality: Eb (Jackl buys Rosesl) • Situation Es: past jr Events: Ea (Jackl goes to Storel), Ea c Es Eb (Jackl buys Rosesl), Eb c Es S Relations: Ea <: Eb Ea,Eb Ea enables Eb Figure 16.18 SEGl(17a, 17b): the discourse state after sentence 17b goal to describe some situation. The situation might be an event that happened or a state that holds, and it might be arbitrarily complex. For example, it might describe an interacting set of actions by different agents. As long as the actions are causally or temporally connected, they can be grouped together into a single situation. We will assume that each clause describes an eventuality (event, process, or state, depending on what the clause describes). These eventualities are actually simple situations, and are incorporated into larger situations by finding a causal or temporal connection to the other eventualities already in the situation. We say that a situation S dominates a situation S' if S' represents a subpart of the situation represented by S. Given this background, a local discourse state associated with a segment will contain the following information: • the current local discourse context (derived primarily from the last sentence in the segment, as described in Chapter 14) • a history list generated from earlier sentences in the segment • the tense tree for the segment, labeled with all the eventualities in the segment • the eventuality last described in the segment • a situation that captures all that has been described in the segment so far Figure 16.18 shows the discourse state after processing 17b in discourse 17: 17a. Jack went to the store. 17b. He bought some roses. 17c. He wanted to surprise Helen. The local discourse context and last eventuality reflect the last sentence in the segment. Here you see that the center is Jackl, the preferred next center is also Jackl, and the other possible next center is Rosesl. The last eventuality is Eb, described by sentence 17b. The situation being built for the segment is Es, and it currently consists of Ea and Eb, where Ea enables (and precedes) Eb. Both Ea and Eb occur within Es. The tense tree encodes that the two events mentioned were described in the simple past tense, and Ea orients Eb. 526 CHAPTER 16 Local Discourse Context: Center: Jackl, Cp: Jackl, Others: Helen! • Last Eventuality: Ec (Jackl wants to surprise Helenl) past X Situation Es: Events: Ea (Jackl goes to Storel),EacEs / Eb (Jackl buys Rosesl), Eb c Es Ea, Eb, Ec Ec (Jackl wants to surprise Helenl), Ec c Es Relations: Ea <: Eb Ea enables Eb Eb c Ec Figure 16.19 The discourse state after being extended with sentence 17c Manipulating Discourse States There are two ways in which a new sentence may affect the discourse states. It may extend an existing state (continuing an existing segment) or it may create a new state (starting a new segment). When a sentence S describing an eventuality E extends a state, it does the following: 1 • It updates the local discourse state with the structure derived i from S. j • It updates the history list. 
j • It adds E to the tense tree, and derives the appropriate orients | relations and other temporal constraints. J • It sets the last eventuality of the state to E. -™*f • It invokes the inference component to derive the appropriate " ' '1 consequences to extend the contents of the situation described | so far. J Consider an example. Figure 16.19 shows the new discourse state after the stan. ; in Figure 16.18 has been extended with sentence 17c. ; There are three different circumstances in which a new discourse state is j created, depending on how it relates to the discourse state on the top ot the 1 discourse stack. Specifically, these are . .... j digression—the new state is unrelated to the old state. » temporal shift—the new state is temporally related to the old state. J expansion—the new state is expanding the last eventuality in the old state. - --------^ - - - -- -;y ' /-;/ - ■* ■ In a digression, the new slate is based on the new sentence and is unicl.iU<.' _ '. to the prior discourse. Thus the new slate is built solely from the analysis of 111' new sentence. — -rj~. Local Discourse Context: • Center: Jackl, Cp: Jackl, past S Others: Birthday 1, Helenl Last Eventuality: Ed (Jackl forgets Birthdayl) i 1 Situation Es2: Es2 < Es perf Events: Ed (Jackl forgets Birthdayl), Ed c Es2 * ► Ed Figure 16.20 The new discourse state created for sentence 17d Local Discourse Context: Center: Jackl, Cp: Jackl, Others: Rosesl past / Last Eventuality: Ec (Jackl gets the roses on sale) Situation Eb: / Events: Ec (Jack! gets the roses on sale), Ec c Eb • Ec Figure 16.21 The new discourse state created by sentence 18c In a temporal shift the new state includes a temporal constraint relating the new situation to the previous situation. Consider an extension to discourse 17: 17d. He had forgotten her birthday the day before. 17e. He had been busy at work and hadn't noticed the date. Sentence 17d introduces a new subsegment describing a situation in the past of the events described so far. The new state pushed on the stack above the state shown in Figure 16.19 is shown in Figure 16.20. Note the temporal constraint that the new situation Es2 is before Es, the situation of the state below it on the stack. With this context, 17e will be temporally located not only because it is oriented by Ed, but also because it is in the past of Es. The third case, an expansion, introduces a new segment that is expanding detail about the last eventuality. This case is handled by using the last eventuality as the situation to be constructed in the new segment. Consider a variant of discourse 17: 18a. Jack went to the store. 18b. He bought some roses. 18c. He got ones that were on sale 18d. and paid cash for them. Sentence 18c starts a new segment that elaborates on the purchasing action in 18b. The discourse state at this point was shown in Figure 16.18. The new state created for 18c is shown in Figure 16.21. Note thai the situation being described in this segment is Eb, rather than a new situation. This entails the desired implication that Ec is during Eb. 528 CHAPTER 16 Determining Stack Updates So far this section has considered the different operations that can be done u, update the attentional stack, but it has not considered how to decide v. Ii.n upd.iii^ to make in a certain situation. This problem can be broken into two main snh. problems characterized by the two questions • Can the new sentence extend the top state? • Can the new sentence create a new segment to be pushed oni the stack above the current top state? 
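Before turning to how these questions are answered, the bookkeeping just described can be summarized in a rough Python sketch that bundles the components of a local discourse state and applies the extension steps to the state of Figure 16.18. The class and field names are invented for illustration, and the tense tree is flattened to a simple list; a faithful implementation would need a real tree and a real inference component.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LocalContext:
    center: str
    cp: str                                   # preferred next center
    others: List[str] = field(default_factory=list)

@dataclass
class DiscourseState:
    local_context: LocalContext               # derived from the last sentence in the segment
    history: List[str] = field(default_factory=list)
    tense_tree: List[str] = field(default_factory=list)    # eventualities, in the order added
    last_eventuality: str = ""
    situation: Dict[str, List[str]] = field(default_factory=lambda: {"events": [], "relations": []})

    def extend(self, sentence, eventuality, new_context, new_relations):
        """Extend this state with a sentence S describing eventuality E."""
        self.history.append(sentence)                 # update the history list
        self.local_context = new_context              # update the local discourse context
        self.tense_tree.append(eventuality)           # add E to the tense tree
        self.last_eventuality = eventuality           # E is now the last eventuality
        self.situation["events"].append(eventuality)  # the inference component relates E
        self.situation["relations"].extend(new_relations)   # to the situation built so far

# The state of Figure 16.18, extended with 17c ("He wanted to surprise Helen"):
seg1 = DiscourseState(LocalContext("Jack1", "Jack1", ["Roses1"]),
                      history=["17a", "17b"], tense_tree=["Ea", "Eb"],
                      last_eventuality="Eb",
                      situation={"events": ["Ea", "Eb"], "relations": ["Ea enables Eb"]})
seg1.extend("17c", "Ec", LocalContext("Jack1", "Jack1", ["Helen1"]), ["Eb during Ec"])
print(seg1.last_eventuality, seg1.situation["relations"])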
You could find a complete set of possible interpretations by asking these two questions for each discourse state on the stack. In the previous sections several constraints were developed based on discourse segment purposes, tense analysis, and cue phrases, all making demands on the inference component. These constraints suggest some methods for answering the questions. While they are too simple in some ways, they give a feel for the mode of processing.

Constraints on Extending a Discourse State
Reference Constraint—a sentence S can extend a state only if its anaphoric components can be resolved to discourse entities in the state.
Tense Constraint—a sentence S can extend a state only if its tense is identical to the tense defined for the state, or is the same tense except that the perfect aspect has been dropped, signaling a shift in temporal perspective.
Inference Constraint—a sentence can extend a state only if the eventuality described by the sentence can be part of the situation that the state describes.

The constraints on pushing a new state above an existing state (called the originating state) will be different depending on whether the push represents a digression, temporal shift, or expansion. But in all cases the originating state typically provides the local discourse context for resolving the anaphoric terms in the new sentence. Beyond that, each type of push has its own constraints, such as the following:

Constraints on Pushes
Reference Constraint on all Pushes—a sentence S may push a new state only if its anaphoric components can be resolved to discourse entities in the originating state.
Constraint on Temporal Shifting—either the tense of S extends the tense tree in the originating state, or an explicit temporal phrase is used to locate the new state.
Constraint of Expansion—a new sentence S must describe part of the situation marked as the last eventuality in the originating state.
Constraint on Digressions—digressions must be explicitly marked by a cue phrase or other linguistic device.

Because of ambiguity, there often will be many different interpretations of how a sentence might update the attentional stack. For these cases we will need some preferences on interpretations, such as the following:
• A continuation is preferred over a push of a subsegment.
• The interpretation that involves the least number of pops of the stack is preferred.

Structural Cue Phrases
The final consideration is the effect of structural cue phrases on the model. Each cue phrase can be classified in terms of three factors: whether it indicates the termination of a segment (a POP), the start of a new segment (a PUSH), or the continuation of a segment (a CONTINUE). An individual cue phrase may indicate one or more of these functions. The word anyway, for instance, signals both a POP and a CONTINUE. Of course, these phrases may have effects on inferential processing as well, but this is the only information needed for our present purposes.

Input Parameters: STACK—the discourse stack; S—the new sentence; POP, PUSH, CONTINUE—the flags for cue phrases
Algorithm:
1. If POP is set to true, then pop STACK once and continue.
2. If PUSH is not set, then search STACK for a state that can be extended by S. The first one found is the answer.
3. If step 2 failed to produce an answer, and CONTINUE is not set to true, then search STACK for a state from which S can push a new state.
Figure 16.22 Algorithm for incorporating the next sentence into the discourse stack
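Before walking through the algorithm, here is a minimal Python sketch of how the cue-phrase flags and the constraint checks might combine, in the spirit of Figure 16.22. The cue-phrase table and the can_extend and can_push predicates (which would have to embody the reference, tense, and inference constraints above) are placeholder assumptions, not part of the specification.

# A hypothetical cue-phrase table mapping phrases to the structural flags they set.
CUE_FLAGS = {
    "anyway":       {"POP", "CONTINUE"},
    "by the way":   {"PUSH"},
    "incidentally": {"PUSH"},
    "ok":           {"POP"},
}

def cue_flags(sentence):
    """Collect the POP/PUSH/CONTINUE flags from cue phrases that open the sentence."""
    flags = set()
    for phrase, fs in CUE_FLAGS.items():
        if sentence.lower().startswith(phrase):
            flags |= fs
    return flags

def incorporate(stack, sentence, can_extend, can_push):
    """Decide how the new sentence updates the stack (cf. Figure 16.22).

    stack is a list of discourse states with the top of the stack last; searching
    from the top down realizes the preference for continuations over pops and pushes."""
    flags = cue_flags(sentence)
    if "POP" in flags and stack:                       # step 1: explicit pop
        stack.pop()
    if "PUSH" not in flags:                            # step 2: try to extend a state
        for i in range(len(stack) - 1, -1, -1):
            if can_extend(stack[i], sentence):
                del stack[i + 1:]                      # pop everything above it
                return ("extend", stack[i])
    if "CONTINUE" not in flags:                        # step 3: try to push a new state
        for i in range(len(stack) - 1, -1, -1):
            if can_push(stack[i], sentence):
                del stack[i + 1:]
                new_state = {"from": stack[i], "sentence": sentence}
                stack.append(new_state)
                return ("push", new_state)
    return ("fail", None)

# For example, a digression cue forces a push even when no state can be extended:
stack = [{"seg": "SEG1"}]
print(incorporate(stack, "By the way, Jack lost his job last month",
                  can_extend=lambda state, s: False,
                  can_push=lambda state, s: True))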
Figure 16.22 shows an algorithm that uses cue phrases. It takes as input the discourse stack, the analysis of Ihe new sentence, and three variables—POP. PUSH, and CONTINUE—that are set to true if a corresponding cue phrase is present in the sentence. The first step is a check for an explicit POP; if it is found, the top state is popped. Then the algorithm checks repeatedly for possible extensions of states on the stack and then for possible pushes of new states onto the stack. 530 CHAPTER 16 BOX 16.3 Reference to Situations Created by Discourse Segments Further evidence for segmentational structure can be found in examining the events or situations that can be referred to anaphorically in a discourse. Consider the following discourse: a. When Jack entered the room, everyone threw balloons at him. b. In retaliation, he picked up the ladle and started throwing punch at everyone. c. Just then, the chairman walked into the room, d Jack hit him with a ladleful, right in the face, e. Everyone talked about it for years afterwards. In e, the pronoun it may refer to the situation described by a through d. These utterances form a segment that describes a complex situation, which is referred to anaphorically in e. Which of the other events described in the discourse could also be anaphoric antecedents at position e? It appears that only the event described in d is also available for reference, as in the alternate continuation e'. It was a foolish thing to do. It is not possible to construct a sentence at position e with a pronoun that refers to the event of the chairman walking in, however. Webber (1991) argues that the events/situations available for anaphoric reference at any given time consist of the last mentioned eventuality and the situations constructed in each of the discourse states lower on the discourse stack. 16.7 An Example Consider how the model described in the last section could be used to account for the discourse in Figure 16.2, repeated here as discourse 19: 19a. Jack and Sue went to buy a new lawn mower 19b. since their old one was stolen. 19c. Sue had seen the men who took it and 19d. she had chased them down the street, 19e. but they'd driven away in a truck. 19f. After looking in the store, 19g. they realized they couldn't afford anew one. 19h. By the way, Jack lost his job last month 19i. so he's been short of cash recently. 19j. He has been looking for a new one, 19k. but so far hasn't had any luck. : ™ 191. Anyway, they finally found a used one at a garage sale. Clause 19a creates a new discourse state SEGl(19a), as expected, and l"b extends this state. The connective since should help the inference component,, identify that the eventuality in 19b is the motivation tot: the eventuality in 19a. Discourse Structure 531 SEG2(21c): past Local Discourse Context: Center: OldLawnMl, Cp: Suel, Others: Menl • Last Eventuality: Ec (Suel sees theft) perf Situation Es2: Es2 < Esl Events: Ec (Sue sees theft), Ec c Es2 * » Ec SEGl(21a, 21b): Local Discourse Context: • Center: {Jackl, Sue 1), Cp: OldLawnMl past / Others: NewLawnMl, Jackl, Suel Last Eventuality: Ea (Jackl and Suel go shopping) m Ea,Eb Situation Esl: Events: Ea (Jackl and Suel go shopping), Ea c Esl Eb (OldLawnMl stolen), Eb c Esl Relations: Eb < Ea, Eb motivates Ea Figure 16.23 The discourse stack after clause 19c Clause 19c, in the past perfect tense, signals a temporal shift, and a new discourse state SEG2(19c) is pushed on the stack above SEGl(19a, 19b). Figure 16.23 shows the discourse stack after 19c. 
The next clause, 19d, is an acceptable extension of SEG2( 19c), as is 19e. Clause 19f is untensed and subordinate to 19g, so its interpretation is determined by 19g. Clause 19g is in the simple past, so either it is resuming SEGl(19b) or there has been a perspective shift and it is continuing SEG2(19c, 19d, 19e). The inference component has to decide whether it is more likely that Ee (driving in truck) orients Eg (can't afford lawn mower), or Eb (old one stolen) orients Eg. It is very unlikely that the men who stole the lawn mower would be considering buying new lawn mowers, so Ee orienting Eg is implausible. That Jack and Sue are considering buying a lawn mower is much more likely—in fact, expected—given the context in state SEGl(19a, 19b). Thus 19g is interpreted as an extension of SEGl(19a, 19b) after a pop. Note that the interpretation of the pronoun they is affected by this decision. If 19g had extended SEG2(19c, 19d, 19e), it would have referred to the men rather than Jack and Sue. This new discourse stack is shown in Figure 16.24. Now that SEG2(19c, 19d, 19c) is popped from the state, what has happened to all the information it encoded? It clearly shouldn't be completely forgotten, since it forms part of the understanding of the discourse. In fact, later in the discourse, the events it contains could be referred to again. So what is the significance of it being popped off the stack? The answer is that this discourse state cannot be resumed later in the discourse as SEG1 was resumed; in other words, you couldn't just start talking as though the context it defined were still 532 CHAPTER 16 Discourse Structure 533 SEGl(21a, 21b, 21f, 21g): Local Discourse Context: Center: {Jackl, Suel], Cp: {Jackl, Suel) Others: NewLawnMl, Jackl, Suel • Last Eventuality: Eg (J&S can't afford new one) ' past / Situation Esl: / Events: Ea (J&S go shopping), Ea c Esl * Ea, Eb, Kf, Eg Eb (OldLawnMl stolen), Eb c Esl Ef (J&S look in store), Ef cz Esl Eg (J&S realize can't afford new one) Relations: Eb < Ea, Eb motivates Ea EfendsEa, Ef Beliefs Planning \Commitment Intentions Acting Desires Figure 17.1 A BDI model of an intelligent agent because they involve more than one agent, that is, they are communicative actions. In addition, they are defined in terms of the agent's cognitive state rather than in terms of physical properties of the world. 17.2 Language as a Multi-Agent Activity Language and all other forms of communication necessarily involve multiple agents. You cannot communicate by speaking in a room with no one else present. Other forms of communication, such as writing, can extend the communication act out over time. You may write something now to be read in 10 year's time. But in all cases the purpose of communication is for one person to affect the cognitive state of the other. In addition, communication cannot occur if one of the agents does not recognize the other's attempt to communicate. For example, say that Jack and Sue are involved in industrial espionage and have worked out a scheme where Jack leaves his window open when it is safe for Sue to visit him. With this system established, Jack can communicate with Sue using the window. He opens the window, and when Sue sees it she understands that it is safe to visit. Now say that one day Jack opens the window because it is hot, and Sue comes by and decides it is safe to visit. This is a mistake, as the open window does not indicate that it is safe, because Jack didn't open the window with this intention. 
Conversely, say that even though it is hot, Jack does open the window to signal to Sue. But when Sue comes by, she decides that it is so hot that Jack opened the window to cool off. In this case Jack's attempt to communicate failed because Sue didn't recognize his intention in opening the window! So communication can occur only when one agent intends to communicate and the other agent recognizes that intention. The other crucial requirement for communication is that there be an agreed-upon set of conventions with agreed-upon meanings. In the previous example, only two conventional symbols are required—the window being open (meaning it is safe) and the window being shut 544 CHAPTER 17 (meaning it is not safe). Jack and Sue have defined their own private language that functions perfectly well, as long as Jack is careful on hot days. These same requirements underlie linguistic communication. But they are so ingrained in language that we often don't notice them. In fact, just seeing a sequence of words spelling out a sentence is enough for us to decide that it is ari: attempt to communicate. This is because, unlike in the window example, the physical acts performed to speak and write are seldom used for purposes other than communication. When someone starts uttering some words, we assume they are trying to communicate, because the act of uttering words is rarely used for any other putpose. Note that the window example made its points by exploiting this ambiguity. The act of opening the window could be used for two different purposes—to communicate with Sue and to get some fresh air. This illustrates an important distinction between the physical act performed and the communicative act. Austin (1962) made this point for language and observed that there are three acts performed whenever something is said: the locutionary act—the act of uttering a sequence of words, the illocutionary act—the act that the speaker performs in saying the words. the perlocutionary act—the act that actually occurs as a result of the utterance. The window language example illustrates these different acts. Jack opening the window is the locutionary act, and as we saw it can be performed with or without communicative intent. When Jack opens the window with the intention to signal that all is clear, he is also performing an illocutionary act, that of signaling that all is clear. If Sue sees the open window and recognizes Jack's intention, and thus knows that all is clear, then a perlocutionary act, convincing Sue that all is clear, has also occurred. Some verbs may describe illocutionary acts while others describe perlocutionary acts. Illocutionary acts can be used in sentences that explicitly name the act being performed. These uses are called explicit performatives. Thus I may promise to mow the lawn by saying / hereby promise to mow the lawn. Likewise, inform describes an illocutionary act, as in I hereby inform you that' your bank account is overdrawn. The verb convince, on the other hand, describes a perlocutionary act. It is something you might intend to occur when you inform someone of something, but its occurrence is beyond your control. As a result, you cannot say */ hereby convince you that your bank account is overdrawn. Note also that you may achieve perlocutionary acts without the other agent recognizing any intention to communicate. For instance, Sue might convince Jack that his bank account is overdrawn by leaving his bank statement out on a table where he will see it when she is not around. 
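The two requirements just identified—an intention to communicate and the recognition of that intention—can be captured in a tiny sketch of the window convention. The names below are invented for the example and are not part of any formal analysis of speech acts.

from dataclasses import dataclass

@dataclass
class WindowAct:
    opened: bool                 # the physical act itself
    intended_as_signal: bool     # did Jack open the window in order to communicate?

def outcome(act: WindowAct, sue_recognizes_intention: bool) -> str:
    """Communication succeeds only when the act is intended as a signal AND that
    intention is recognized by the other agent."""
    if not act.opened:
        return "no act, so nothing to interpret"
    if act.intended_as_signal and sue_recognizes_intention:
        return "communication succeeds: Sue knows it is safe to visit"
    if act.intended_as_signal:
        return "communication fails: Jack signalled, but Sue thinks he is just cooling off"
    if sue_recognizes_intention:
        return "mistake: Sue reads a signal that Jack never sent"
    return "no communication intended or recognized"

print(outcome(WindowAct(True, True), True))     # both conditions hold
print(outcome(WindowAct(True, False), True))    # the hot-day mistake
print(outcome(WindowAct(True, True), False))    # unrecognized intention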
A difficult problem in language understanding is identifying the illocutionary act intended by the speaker. The same locutionary act—say. uttering the sentence Defining a Conversational Agent 545 Do you know the time? might generate the illocutionary act of asking a yes/no question, or the act of asking for the time, or the act of offering to tell the hearer the time, depending on the speaker's intention. Identifying the appropriate intention underlying a locutionary act is the key problem in speech act theory. Of course, there are ways in language to signal intentions explicitly. Using explicit performatives is one way to do that, although it results in very formal, stilted language. Another is the use of the word please. If we add please to this sentence, it would not be possible to interpret it as a yes/no question or an offer. Unfortunately, such explicit signals are relatively rare in English, and the recognition of intention from context is a crucial problem in the process of identifying the correct interpretation. The final consequence of language being a multi-agent activity is that it requires some means for the agents to coordinate their communicative acts and to monitor whether they are being understood. In dialogue, agents use various mechanisms to signal that they understand each other. This includes explicit statements such as I understand, simple acknowledgments such as OK, and other devices such as repeating back certain parts of the previous utterance. A common example of this occurs when people communicate telephone numbers. Typically, the first agent says the number, and then the other agent says it back, and the first confirms it. These are all methods for two agents to establish that titey agree on what was communicated. Confirming this mutual agreement is a crucial aspect of any dialogue. 17.3 Representing Cognitive States: Beliefs There are many ways to represent the beliefs of the agent, varying in how introspective and aware you want the agent to be. For instance, we have been using a knowledge base (KB) of facts throughout the last few chapters. You could simply say that anything in the KB is believed by the agent and leave it at that. But this will produce a limited agent. It can't, for instance, represent information about the beliefs of other agents, so it is unlikely that it can be motivated to communicate. In addition, it has no self awareness—it can't distinguish between what it believes and what it observes to be true in the world. Thus it would be hard to give a good account of using perception to learn more about the world. The simplest extension is to divide the KB up into different areas, called belief spaces. A space is simply a set of propositions and thus specifies a KB in its own right. Each one could be used to represent some agent's beliefs. For a two-party conversation, we might need two spaces: one capturing one agent's beliefs and the other representing the other agent's beliefs. But this isn't a very intuitive model of the beliefs of a single agent, as it appears to give the agent direct access to the other agent's beliefs. Rather, it would be better if all beliefs were relative to the agent we are modeling. To capture this, we will allow a predicate BEL, which relates an agent and a space and is true if the spaee 546 CHAPTER 17 Defining a Conversational Agent 547 Bel(Jackl, Dog(Fidol)) Bel(Jackl, V x: Dog(x) . Bark(x)) Bel(Jackl, BEL(Suel, Dog(Fidol))) Bel(Jackl, BEL(Suel, V x: Dog(x). Bark(x))) Bel(Jackl, BEL(Suel, V x : Dog(x). 
Bark(x) ^Fierce(x) Statements in Modal Logic BEL(Jackl, BSD BS1: Jack's beliefs Dog(FidoI) Vx : Dog(x). Bark(x) BEIfSuel, BS2) BS2: Jack's beliefs about Sue's beliefs Dog(Fidol) Vx: Dog(x). Bark(x) Vx : Dog(x) . Bark(x) zj Fierce(x) Statements Using Belief Spaces Figure 17.2 Representing beliefs in belief spaces characterizes beliefs of the agent. Since propositions using this predicate can occur in other spaces, it allows us to define a hierarchy of nested belief spaces. A simple example is shown in Figure 17.2. The two sides of the figure show the same set of propositions, first using a modal operator Bel for belief as described in Section 8.3, and then using the belief space approach. This KB asserts that Jack's beliefs are characterized by space BS1, which contains the propositions that Fido is a dog, that dogs bark, and that Sue's beliefs arc described in space BS2. Space BS2 describes what Jack believes that Sue believes: that Fido is a dog, that dogs bark, and that all dogs that bark are fierce. In such a model, inference is always done with respect lo a space. For example, with respect to space BS2, we could infer that Fido is fierce, concluding that Jack believes that Sue believes that Fido is fierce. A similar conclusion could not be drawn in space BS1, so Jack doesn't believe that Fido is fierce. This brings out an important distinction about what it means to believe something. An agent's explicit beliefs are the propositions that are listed in the appropriate space. An agent's implicit beliefs are those propositions that can be derived from the explicit ones by some inference process. When we say that someone believes something, say that Jack believes the world is Hat. we usually mean something most similar to the notion of explicit belief, possibly with some very limited forms of inference allowed. But consider the case in which Jack has a set of beliefs that logically entail that the world is Hat. although Jack has never realized this connection by making these inferences. In this case, although he has an implicit belief that the world is flat, we probably wouldn't usually say that Jack believes the world is flat. Maintaining the distinction between what you believe and what you believe others believe is crucial for understanding communication. This can be seen most clearly when the two conversants have different beliefs. Suppose Jack believes that there is a combination lock on his locker but believes that Sue thinks the locker has a key lock. When she asks for his key, he should be able to recognize BOX 17.1 Sentential Models of Belief There are two general approaches to defining a semantics for the belief operator. The first treats Bel as a modal operator and gives a possible worlds semantics as described in Section 8.8. The only difference between Bel and Know is that the accessibility relation for Bel is not reflexive. As a consequence, while Know(F) z> P is a valid axiom for know, it is not a reasonable axiom for Bel. Just because an agent believes something doesn't mean that it must be true. The other approach is closer to the intuitions of the belief space representation. The Bel operator relates an agent to a proposition, which requires us to extend the logical framework to allow propositions as objects. This is often done by introducing a quotation operator. While Happy(Jackl) is a proposition, the expression \Happy (Jackl)] is a term that denotes the proposition. 
With this, you would now write the formula for Jack believes that he is happy as Bel(Jackl, \Happy(Jackl)]) Once this step has been taken, you then need to rebuild any axioms that you desire for reasoning about an agent's beliefs, as the usual axioms do not apply to quoted expressions. Examples of such representations can be found in Haas (1986) and Konolige (1986). that Sue's plan is to open the locker, even though opening his locker with a key is an incoherent plan from his point of view. As another example, belief models are crucial for determining the intended referent of a noun phrase. For example, suppose Sue knows that a certain book— say, The Revenge of Mrs. Smith—is in Jack's office, but she believes that Jack believes it is on a table in the lounge. Then if he asks her, Have you read the book on the table in the lounge?, she should recognize that he means The Revenge of Mrs. Smith, even though the description Jack used to refer to it is inaccurate. Thus one agent's beliefs about another's beliefs play an important role in language understanding. The natural question to ask is how deep a nesting of beliefs is required to understand all language. It is easy to see a need for a third level of nesting. Suppose Jack believes that Sue thinks she just successfully lied to him when she said it was raining outside. In this case Jack believes the following: Jack believes: It is not raining. Jack believes that Sue believes: It is not raining. Jack believes that Sue believes that Jack believes: It is raining. Both Jack and Sue believe it is not raining, but Jack also believes that Sue believes that he thinks it is raining, since he thinks that she thinks her lie was successful. 548 CHAPTER 17 Similar examples can be constructed that require ever deeper levels nt nested beliefs. A way out of this problem is to introduce the notion of shared knowledge, which is the knowledge that both agents know and know that the " other knows. Shared knowledge arises from the common background and situation that the agents find themselves in and includes general knowledge about the world (such as how common actions are done, the standard type hicuuln _ classifications of objects, general facts about the society we live in, and so on), • Agents that know each other or share a common profession will also have2 considerable shared knowledge as a result of their previous interactions- andthelrn: education in that field. If agents construct their plans using only shared knowledge, they can be reasonably sure that the plan can be successfully rccog- _ nized. Thus shared knowledge plays a crucial role in successful communication. While individual beliefs may play a central role in the content of a conversation,--^ most of the knowledge brought to bear to interpret the other's actions will be r shared knowledge. Knowledge About Another's Beliefs Perhaps the most tricky problem in building belief models is that much knowledge is of the form that you know that some other agent knows something, yet you do not know what they know. For instance, Jack can believe that Sue knows her mother's name even though Jack does not know the name. You cannot represent this situation simply by adding a fact into the appropriate belief space, because there is no formula of the form NameOfiMotheriSuel), xxx) that can be""' in the appropriate belief space since Jack does not know Sue's mothor"s name. 
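A nested belief space of the kind shown in Figure 17.2 can be mocked up in a few lines of Python. This sketch only shows the bookkeeping (the class and method names are invented); it records explicit beliefs only, so deriving implicit beliefs would require running an inference procedure within a space.

class BeliefSpace:
    """A set of propositions plus nested spaces for beliefs attributed to other agents."""
    def __init__(self, name):
        self.name = name
        self.facts = set()
        self.nested = {}                     # agent name -> BeliefSpace

    def add(self, fact):
        self.facts.add(fact)

    def about(self, agent):
        # the space describing what the owner of this space believes that agent believes
        return self.nested.setdefault(agent, BeliefSpace(self.name + "/" + agent))

    def believes(self, fact, path=()):
        """believes(p, ("Sue1",)) asks whether this space records that Sue1 believes p."""
        space = self
        for agent in path:
            space = space.about(agent)
        return fact in space.facts

bs1 = BeliefSpace("BS1")                     # Jack's beliefs
bs1.add("Dog(Fido1)")
bs1.about("Sue1").add("Dog(Fido1)")          # BS2: Jack's beliefs about Sue's beliefs
bs1.about("Sue1").add("AllBarkingDogsAreFierce")

print(bs1.believes("Dog(Fido1)"))                          # True: one of Jack's beliefs
print(bs1.believes("AllBarkingDogsAreFierce", ("Sue1",)))  # True: Jack thinks Sue believes it
print(bs1.believes("AllBarkingDogsAreFierce"))             # False: Jack himself does not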
If we use a Skolem constant—say, Sk33—and add NameOfiMotheriSuel),--&/, i*, 10 Sue's belief space, it simply asserts that Sue believes that her mother has a name, not the right fact at all. In modal logic this is equivalent to the formula 1. BeKSue, 3! x . NameOfiMotheriSuel), x)) The common solution to this problem in modal logic approaches is to allow quantifiers to outscope the belief operator. Thus formula 1 means that Sue believes that her mother has a name, whereas 2. 3! x . Bel(Suel, NameOfiMotheriSuel), xj) -: means that she actually knows what the name is. Unfortunately, most knowledge representations handle existential quantification by skolemization, and quantifier ; scoping with modal operators is not preserved by a simple skolemization scheme We can capture this scoping distinction, however, by indexing every Skolem and variable by the belief space in which it is created. Thus adding a fact corresponding to formula 1 to the KB would create a Skolem dependent on the belli I space BS2 (that is. what Jack believes Sue believes), and the resulting fact 3. NameOfiMotheriSuel), SkiBS2) ir KB jjack ^yVariable ?XBS1 ?*BS2 Constant BS1 Mary KB succeed succeed jsue sklBSl succeed succeed BS2 sk2BS2 fail succeed Figure 17.3 Matching variables and constants indexed by belief spaces is added to space BS2. Adding a fact corresponding to formula 2 would create a Skolem dependent on BS1 (that is, what Jack believes), and would add 4. NameOfiMotheriSuel), Sk2BS]) to the space BS1. This method of retaining the scoping of existentials must have a corresponding proof method for testing existential statements and be able to make a distinction between a query such as Does Sue believe that her mother has a name? and one such as Does Sue know her mother's name? To do this, the inference system must be able to restrict variables so that they only match objects defined in appropriate belief spaces. Say formula 1 is asserted, resulting in formula 3 being added to space BS2. Now suppose formula 2 is queried. Using the standard methods, formula 2 would translate into a query of the form 5. BeliSuel, NameOfiMotheriSuel), ?x)) which would result in querying the formula NameOfiMotheriSuel), ?x) in belief space BS2. But this would unify with formula 3 (corresponding to assertion 1). The problem arises because the variable ?x should only match objects defined in space BS1, and Ski was defined in BS2. If variables are indexed by spaces as well, the problem can be avoided. In particular, unification would be extended so that a variable defined in space X can only match objects that arc defined in spaces accessible to X. The accessibility relation would be defined by the nesting hierarchy of belief spaces. Thus a constant defined in BS1 would be accessible in BS2. but not vice versa. Figure .17.3 shows the hierarchy of belief spaces and specifies a chart showing what variables can match what constants. For convenience, we can define a new operator KnowRef that states that an agent knows a unique object satisfying some description, represented as a lambda expression. In particular, 6. KnowRef {A, lxPx)=3\ y. BeliA, HXxPx)y)) =3\ y . BeliA, Py) Given this, formula 2 is equivalent to KnowRefiSuel, Xx NameOfiMotheriSuel), x)) 550 CHAPTER 17 Defining a Conversational Agent 551 Another important form of knowledge involves knowing that someone else knows whether a certain fact is true (yet the original agent doesn't know whether it's true or not). 
For example, Jack may believe that Sue knows whether or not she owns Fido. This is represented as a disjunction of beliefs in modal logic that is, Jack believes that either Sue believes she owns Fido, or Sue believes she doesn't own Fido. Thus there is a distinction similar to quantifier scoping previously discussed. If a disjunction is within the belief operator 7. Bel(Suel, Own(Suel, Fido) v->Own(Suel, Fido)) then Sue believes that either she owns Fido or she doesn't. This is almost certainly believed by Sue or by any other rational agent. The case with the disjunction outside the belief operator 8. Bel(Suel, Own(Suel, Fido)) vBeliSuel, -^Own(Suel, Fido)) represents that Sue knows whether she owns Fido or not. This is more difficult to express in terms of belief spaces. Here we introduce a new operator, Knowlf,-which captures the meaning expressed in formula 8: 9. KnowIfiA, P) = Bel(A, P) v Bel(A, -tf>) . Reducing this to an expression using belief spaces would require introducing additional mechanisms that allow you to explicitly assert that a particular formula is in a belief space. Then know would be defined in terms of a disjunction of this relation. We will use the Knowlf operator direcdy in the representation, however, and not decompose into a more primitive form. One final aspect is the most difficult to deal with but ultimately is far more important than the two forms of knowing discussed thus far. In fact, the two preceding forms could be viewed as limited special cases of knowledge about what others know. In particular, what if your stereo is not working and you want to get help fixing it? You have knowledge that someone, such as Joe of Joe's Stereo Repair, can help you. Joe knows about fixing stereos and can tell you what you need to know. But you don't know enough to know what propositions it is that Joe must know about in order to fix stereos. You may not even know the vocabulary to express those propositions. Yet you do have some beliefs about Joe's beliefs and knowledge that enable you to reason that Joe is the person to ask about your stereo. Only very specialized and limited techniques have been developed so far to address this question. Typically, this knowledge is embedded in heuristics that are hand-tailored to the expected user of the system. For example, for each action class, a system might explicitly represent what users know enough to be able to execute the action. Then the system could introduce action classes that are only partially described. For example, a (possibly partial) set of preconditions and_ effects could be defined for an action class FixStereo, and yet the system may have no decomposition for such actions. It would then be asserted that Joe knows how to execute actions of the type FixStereo. Thus, given a goal Working SIX where SI is a stereo, the system might find that FixStereo is relevant because its effects match the goal, but it would not be able to decompose the action. It could then look up which agent it believes knows how to perform the action. You might call this relation KnowingHow. 11A Representing Cognitive States: Desires, Intentions, and Plans There are many terms used in the literature that seem related to desires and intentions, including plans, goals, wants, and purposes. This section attempts to provide definitions for these terms that capture the range of concepts needed to model a conversational agent. Of course, not all researchers will use these terms in the way defined here, but these are some common interpretations. 
First consider the distinction between desires and intentions. Desires reflect what states of the world the agent finds pleasant or unpleasant. Typically, an agent has many desires, often conflicting with each other. For instance, you might desire to take the day off and go to the beach. On the other hand, you might also desire to do well on your final exam, and to do this, you should stay home and study. These two desires conflict, and your behavior will result from some compromise between the two. In contrast, your intentions generally do not conflict. You can't simultaneously intend to go to the beach today and intend to stay home and study. Intentions are strongly related to behavior, and you can only do one of these things at a time. So while both desires and intentions affect your behavior, they are quite different. Desires seem to arise subconsciously and allow you to evaluate the attractiveness of certain states. Intentions, on the other hand, arise from rational deliberation about what course of action you should follow. Looking again at Figure 17.1, you see that desires and beliefs are considered as input to some deliberation process that eventually commits to a course of actions, captured by your intentions. There are two notions of intention. The first is often called intention in action and refers to the property of acting intentionally. For instance, much importance is given to the distinction between a situation where you run into someone on your bicycle unintentionally and a situation where you run into someone intentionally. For the latter you could be charged with assault, but the former is an unfortunate accident. The second notion of intention is called future-directed intention and reflects decisions you have made about your future action. For example, if after some deliberation you decide to forget about the exam and go to the beach, then you have adopted a future-directed intention to go to the beach. Of course, having a future-directed intention often leads to intentional action. Because you intend to go the beach, you may then ride your bicycle there intentionally (avoiding pedestrians, we hope). But future-directed intentions are not necessarily realized. On your way out to the bike rack, you might meet friends on their way to study and feel so guilty that you change your mind. In this 552 CHAPTER 17 Defining a Conversational Agent 553 case you still had the intention to go to the beach, but you abandoned it and now; have the intention to study. For our purposes, the notion of future-directed intention is the most critical concept. We introduce an operator Intend that relates an agent to an action that the agent intends to perform. There are several requirements on having intentions that capture part of what it means to be a rational agent: Rationality Constraint 1—An agent can't intend to do two actions that it believes are mutually exclusive. Rationality Constraint 2—An agent can't intend an action it believes it can't perform. Expressing these constraints in a formal logic is complex. Rather, we will depend on the processes of planning and plan adoption to procedurally guarantee that our agent doesn't adopt intentions unless it believes it can perform them. Clearly, plans, goals, and planning are intimately related to intentions. But the relationship is more complicated than it might seem. First, note that if you say Jack has a plan for robbing the bank. you may mean one of two very different things: • Jack intends to rob the bank. 
• Jack has worked out a scheme by which someone could rob the bank. In the first reading, having the plan seems to be a way of describing one of Jack's future-directed intentions, whereas in the second, having the plan seems to say that Jack knows, in principle, how to rob the bank. In this latter sense, a plan specifies a set of actions that could achieve a specific effect, like a recipe. A recipe gives a set of actions (steps in cooking) that will cause some effect (a meal). Having a recipe does not mean that you intend to make the meal. The plans-as-recipes sense seems to correspond to the notion of plans in the AI literature, as described in Chapter 15. When a planning system constructs a plan, there is no commitment by any agent to execute the plan. Rather, the task is simply to find a "recipe" that would achieve the goal statement. Note that." corresponding to the two notions of "having a plan," there are two notions of the term goal. In one sense a goal seems much like an intention. In the other a goal is simply the name for the resulting state of a plan-as-recipe. Given this, pick some readings for the different terms. The term plan will always be used in the plan-as-recipe sense. In other words, a plan is a set of actions with a specified effect, called the plan's goal. A goal can be either a proposition that will be attained by the actions in the plan or an action that will be performed by the actions in the plan. Thus the term plan recognition refers to the . task of constructing a plan-as-recipe that could account for a set of observed actions. 3- Bel'Jackl, Intends (Jack!, RideBike(Jackl)) & Generates (RideBike(JackI \ Intends(Jackl, GoToBeachiJackl)) & GoToBeach (Jack!))) IntendsiJackl, Generates(RideBike(Jackl), * GoToBeach(Jackl))) Knowing a plan to go to the beach Intending to go to the beach Figure 17.4 The two readings of Jack has a plan to go the beach. What about the other sense of having a plan related to intention? To avoid confusion, the term plan will never be used in this sense. Rather, we will talk about the intentional structure of an agent. The intentional structure of an agent is part of its mental state, like its beliefs and desires. The intentional structure arises when the agent commits to performing a plan. Not surprisingly, there will be a strong correspondence between the structure of the plan and the resulting intentional structure. Consider the simplest case. Say that Jack knows a very simple plan to go to the beach, namely riding his bike there. This plan is simply a set of beliefs that assert that if Jack rides his bike in the right direction, he will get to the beach. Later, Jack has a desire to go to the beach. Knowing the plan to get there by bike, he might adopt this plan. He now has the intention to go to the beach by riding his bike. Jack's beliefs about the plan and the intentional structure from adopting the plan are shown in Figure 17.4. Note that the intentions must include both of the actions and the generation relationship between them. If you omitted the action of bike riding, then Jack would intend to go to the beach but not have a way specified to get there. If you omitted the action of going to the beach, then Jack would intend to ride his bike but not necessarily end up at the beach. If you omitted the generation relation. Jack might accomplish each action separately, say by first riding his bike and then walking to the beach. 
The distinction between knowing a plan and having a set of intentions is sometimes lost in the literature because one natural way to represent the two situations in a knowledge representation uses a very similar representation for each. Figure 17.5 contrasts the representations of Jack knowing about, the plan-as-recipe and Jack having an intention. It uses the graph representation developed in Chapter 15 and places the graph in a space that represents the plan Jack believes will work. A natural way to represent the intentional structure, on the other hand, is to also use a new type of space that acts as an argument to the Intend operator. The confusion arises because, except for the difference in the interpretation of the spaces, the same plan graph is used in both cases. 554 CHAPTER 17 mlm- BSl: Jack's beliefs ValidPlan(PSl) PS1: A plan to accomplish GoToBeach(Jackl) RideBike (Jackl) generates GoToBeach (Jackl) BS1: Jack's beliefs Intend (Jackl, IS1) IS1: Jack's intentional state RideBike(Jackl) generates GoToBeach( Jackl) Jack knows a plan to go to the beach Jack intends to go to the beach Figure 17.5 Knowing of a plan versus intending to perform a plan using spaces To summarize, here's a list of all the concepts discussed in this section and their correlates in our BDI model. belief—the operator Bel and belief spaces. future-directed intention—the operator Intend and intentional state spaces. intention-in-action—performing an action because it is in the intentional space. plan-as-recipe—a plan, typically represented as a plan graph in a plan space. having a plan in the intentional sense—same as future-directed intention. goal (of an agent)—a future-directed intention, goal (of a plan)—an end effect in a plan graph, desires—unanalyzed; realized in the decision-making process that creates goals and evaluates plans, wants—another word for desires. 17.5 Speech Acts and Communicative Acts As previously stated, speech acts are actions that are performed by speaking, and we hope to be able to represent them using the model of action developed in Chapter 15. In order to do this, however, several complications must be 33 - $ a* Sat Defining a Conversational Agent 555 BOX 17.2 Intention and Action Generation In many theories of intention, you can only intend actions. This might call into question whether the intentions expressed in Figures 17.4 and 17.5 are well formed. In particular, they seem to imply intending a generation relation between actions. If this bothefs you, then the intentions can be recast as complex actions. In this case the complex action would be going to the beach by riding a bike. To do this, a suitable action constructor would have to be defined to create such actions. Pollack (1990) introduces an operator By, such that given any two actions a and p, Byiafi) is the complex action of doing a by doing p. Given this, the third intention in Figure 17.4 would be Intend(Jackl, By(GoToBeach(Jackl), RideBike (Jackl))) This way, all intentions are stated in terms of actions. addressed because of the multi-agent nature of communicative acts, and because their effects will be defined in terms of changes of the agents' cognitive states. For planning purposes, it is the intended perlocutionary acts that are the most important, because these capture the intended effects that agents have when they speak. For example, an agent would construct a plan in terms of convincing the other agent that some fact is true or that the other agent should perform some act. 
To accomplish such an action, the agent would perform an illocutionary act (such as asserting or requesting) and hope for the other agent to do the right thing so that the perlocutionary act is accomplished. And, of course, to accomplish the illocutionary act, the agent would perform a locutionary act that would involve uttering an appropriate sentence. The model developed here involves communicative acts, which correspond to perlocutionary acts that are generated by illocutionary acts. For example, the communicative act ConvinceBylnform involves getting another agent to believe some proposition by saying it is true. It is instructive to consider what is involved in the successful performance of such a communicative act. Consider What must be true for Sam to successfully ConvinceBylnform Helen that it is raining: 1. Sam says the words It's raining with the intention to inform Helen of this fact. 2. Helen hears the sentence It's raining and recognizes Sam's intention to inform her of this fact. Because of this, Helen decides that Sam intends to get her to believe that it is raining. 3. Helen considers the evidence for the fact that it is raining, including the fact that Sam said that it is raining. 4. Helen comes to believe that it is raining. All of these steps are essential. To convince yourself of this, consider removing any of the steps. If you remove step 1, then Sam didn't say anything so certainly 556 CHAPTER 17 Defining a Conversational Agent 557 The Action Class ConvinceBylnform(e): Roles: Speaker, Hearer, Prop Constraints: Agent(Speaker),Agent(Hearer), Proposition Prop), Bel (Speaker, Prop) > a Preconditions: At (Speaker, hoe (Hearer)) I Effects: Bel(Hearer, Prop) The Action Class MotivateByRequest(e): Roles: Speaker, Hearer, Act Constraints: Agent (Speaker), Agent(Hearer), Action(Act) } Preconditions: At (Speaker, hoc (Hearer^) Effects: Intend(Hearer, Act) J Figure 17.6 The definitions of two communicative acts didn't inform Helen that it was raining. You have removed the locutionary act. If___J you remove just the intention condition, then you get the implausible scen.n in ol Sam just randomly uttering sounds (say, Sam is under hypnosis). Even if Helen is i fooled-and performs all the rest of the steps, Sam could later deny thai he J informed her that it was raining. He could claim that he didn't know what he was saying so she couldn't attach any significance to it. If you remove step 2, then Helen didn't hear the sentence. If you remove ^ just the condition that Helen recognize Sam's intention, then Helen heard Sam speak but has no idea why he spoke to her; that is, she has failed to recognize the illocutionary act. Even if she did the rest of steps that are possible in this scenario ; and comes to believe that it is raining, she would deny that Sam informed her of ~ j this fact. If you remove step 3, then Helen, while recognizing that Sam tried to inform her of something, completely ignores the information he told her. This would be impolite to say the least, but assuming it is the case, even if she then -1 comes to believe that it is raining, this belief will have nothing to do with what - -J_ Sam said. y^^ n If you remove step 4, then Sam told Helen that it is raining, but failed to - * convince her. In this case the illocutionary act Inform succeeded, but the commu- i nicative act ConvinceBylnform failed. 
- „ While you can develop action representations for all three levels of the J speech act, it is the communicative actions that provide the principal connection ii to the agent's general reasoning. Figure 17.6 shows the definitions of two basic communicative acts needed in any model of communication, corresponding to * simple informing and requesting. These definitions build in the assumption that agents are sincere. For instance, a constraint oh ConvinceBylnform is that the , „, speaker believes the proposition asserted. This condition, of course, would need - -f-to be dropped to handle acts like lying, but such complicated acts are beyond what is needed for most conversational agents we might be able to build. The decomposition, which would include the illocutionary act, is not shown for the ". | BOX 17.3 Classifying Speech Acts An interesting question is how many different kinds of speech acts are there? Searle (1979) argues that there are only five classes of illocutionary acts, each class capturing a different general point or purpose. Each act in the class is then distinguished by additional conditions. The classes are as follows: representative class—the speaker commits to the truth of what is expressed, including the acts described by the verbs inform, deny, affirm, and confirm. directive class—the speaker attempts to influence the intentions and behavior of another agent, including the acts described by the verbs request, command, invite, ask, and beg. commissive class—the speaker commits to some future action, including the acts described by the verbs promise and commit. expressive class—the speaker expresses a psychological state or reaction, including acts described by the verbs apologize, congratulate, thank, and welcome. declarative class—the speaker performs some conventional or ritual action by the speech act, including acts described by the verbs christen, fire, resign, and appoint. moment. Issues related to interpreting illocutionary acts will be delayed until Section 17.9. The next section considers a model that allows an agent to plan communicative acts, just like any other acts, to accomplish goals in its world. This sets the stage for modeling the understanding of other agents' communicative acts. 17.6 Planning Communicative Acts Consider a situation in which Jack tries to buy a ticket for a train. Assume that he doesn't know the exact price of the ticket. Also assume that he knows a plan for buying, which involves the following two substeps (from the definition of Buy in Figure 15.4): 10. Give (Jackl, Clerkl, Price(Ticketl)) 11. Give (Clerkl, Jackl, Ticket 1) Jack cannot simply adopt this plan directly as a set of intentions. For one thing, 11 involves action by another agent, which Jack can't directly intend. This step is a crucial part of Jack's plan, so it can't simply be ignored. In fact, there is something Jack must do as part of 11, because giving is a multi-agent action. It takes two agents to give: one to provide and the other to receive. Thus we might replace 11 by the action that is Jack's part in the giving action. In fact, there is a verb for this in English, namely receive, but to make the analysis more general we will not use it. Rather, we define an action operator MyPartOf, which takes an agent and a multi-agent action A and produces an 558 CHAPTER 17 Defining a Conversational Agent 559 action that is the agent's role or part in A. 
Thus, the action MyPartOf(Jack1, Give(Clerk1, Jack1, Ticket1)) might involve Jack putting out his hand, waiting for the ticket, and then grasping the ticket. The action MyPartOf(Clerk1, Give(Clerk1, Jack1, Ticket1)) would involve putting the ticket into Jack's hand and letting go of it. Note that this function can apply to single-agent actions as well. If the agent specified is the agent of the action, then it is the identity function; for example, MyPartOf(Jack1, Grasp(Jack1, Ticket1)) = Grasp(Jack1, Ticket1). On the other hand, if the agent is different, then the function denotes some action of not interfering while the other agent does it. The intentions adopted from this plan might be to do the following actions:

12. MyPartOf(Jack1, Give(Jack1, Clerk1, Price(Ticket1)))
13. MyPartOf(Jack1, Give(Clerk1, Jack1, Ticket1))

These actions are at least things that Jack can intend to do, but it probably is not a very useful set of intentions. In particular, it will do no good for Jack to do his part if the clerk doesn't do his or her part. So far, nothing in Jack's intentions addresses the issue of getting the clerk to cooperate in his plan! For this, Jack will have to communicate with the clerk and get the clerk to have the right intentions.

Before developing this, however, consider one more problem. Jack may not even be able to execute the first action because he doesn't know the price of the ticket. This reveals a general constraint on action that has been ignored until now. To execute any action, an agent must know the values of the parameters. One way of viewing this constraint is that there is a set of implicit preconditions on every action requiring that the agent know the values of the parameters. These are sometimes called the knowledge preconditions. In this example, the implicit precondition would be

14. KnowRef(Jack1, λx(Price(Ticket1) = x))

This then becomes a new goal for Jack. Intuitively, he could achieve this goal in several ways. He could look up the price on a fare schedule, or he could ask someone who knows the price. This latter method involves language and so will be used for the example. Suppose Jack has reason to believe that a ConvinceByInform act by the clerk will achieve his goal. To get the clerk to do this action, Jack might plan to ask the clerk to perform the Inform act. But what is the act that he is asking the clerk to perform? Because Jack doesn't know the price of the ticket, he can't represent the action that the clerk must perform. This is the same problem as when we needed to represent the fact that someone knows something, but we didn't know what it was. The solution is to introduce two additional Inform acts used for planning that use the KnowRef and KnowIf operators introduced earlier, as shown in Figure 17.7.

The Action Class ConvinceByInformRef(e):
  Roles: Speaker, Hearer, Predx
  Constraints: Agent(Speaker), Agent(Hearer), UnaryPred(Predx), KnowRef(Speaker, Predx)
  Preconditions: At(Speaker, Loc(Hearer))
  Effects: KnowRef(Hearer, Predx)

The Action Class ConvinceByInformIf(e):
  Roles: Speaker, Hearer, Prop
  Constraints: Agent(Speaker), Agent(Hearer), Proposition(Prop), KnowIf(Speaker, Prop)
  Preconditions: At(Speaker, Loc(Hearer))
  Effects: KnowIf(Hearer, Prop)

Figure 17.7 Variants of ConvinceByInform for acquiring information
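To make the role of the knowledge precondition concrete, here is a rough sketch of how a planner might handle step 14: when the agent lacks KnowRef for a parameter, it plans a MotivateByRequest of the ConvinceByInformRef act of Figure 17.7, followed by its own part of listening to the answer. The belief interface, the string stand-in for the lambda predicate, and the helper name are invented for illustration.

    def plan_knowledge_precondition(agent, other, pred, beliefs):
        """Return extra plan steps needed before an action whose parameter is
        described by the unary predicate `pred` (a stand-in for λx(Price(Ticket1) = x))."""
        if ("KnowRef", agent, pred) in beliefs:
            return []                       # parameter already known; nothing to plan
        inform = ("ConvinceByInformRef", other, agent, pred)
        request = ("MotivateByRequest", agent, other, inform)
        listen = ("MyPartOf", agent, inform)
        return [request, listen]            # ask, then attend to the answer

    beliefs = set()                          # Jack does not yet know the price
    for step in plan_knowledge_precondition("Jack1", "Clerk1", "price-of-Ticket1", beliefs):
        print(step)
    # ('MotivateByRequest', 'Jack1', 'Clerk1', ('ConvinceByInformRef', 'Clerk1', 'Jack1', 'price-of-Ticket1'))
    # ('MyPartOf', 'Jack1', ('ConvinceByInformRef', 'Clerk1', 'Jack1', 'price-of-Ticket1'))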
To summarize the situation so far, there are two problems with Jack's plan: There is no reason to expect that the clerk will perform his or her part of the plan as required, and Jack doesn't have enough information to perform his own part of the plan. Both of these problems can be overcome if Jack can affect the clerk's intentional structure so that he or she tells him the price of the ticket and participates in the buying action. Specifically, he needs the clerk to do the following:

15. MyPartOf(Clerk1, ConvinceByInformRef(Clerk1, Jack1, λx(Price(Ticket1) = x)))
16. MyPartOf(Clerk1, Give(Jack1, Clerk1, Price(Ticket1)))
17. MyPartOf(Clerk1, Give(Clerk1, Jack1, Ticket1))

Jack cannot make these actions happen by willing them, but he can act in a way to attempt to get the clerk to acquire the intentions to perform these acts. Specifically, he may perform MotivateByRequest acts for each of these acts. For the first, he may plan to perform the act

18. MotivateByRequest(Jack1, Clerk1, ConvinceByInformRef(Clerk1, Jack1, λx(Price(Ticket1) = x)))

which, if successful, will achieve the effect

19. Intend(Clerk1, ConvinceByInformRef(Clerk1, Jack1, λx(Price(Ticket1) = x)))

Note that these acts really should all have a MyPartOf operator around them to indicate who is doing what, but we will drop this from now on when it is clear what is intended. The request act, 18, might get realized by a locutionary act such as the question What is the price of a ticket to Toronto? Similarly, Jack might attempt to motivate the clerk to do actions 16 and 17 by issuing two more requests. But this would produce very stilted dialogue, such as the following:

What is the price of a ticket to Toronto?
Please take this money I am offering you.
Please give me a ticket.

If this were what was required every time two agents interacted, very little would ever get accomplished. But how does Jack know he can get away with saying much less? The answer lies in the belief that the clerk is an intelligent agent as well and will use general knowledge about the world to interpret actions. Furthermore, from the actions the clerk observes, he or she will be able to recognize what Jack is trying to do. This will be explored in more detail in the next section. But let's consider some possible lines of inference that Jack might depend on the clerk to explore in this situation.

Assuming they have shared knowledge about common actions like buying things, Jack could depend on the clerk using this information to interpret his actions. For instance, assume after finding out the price he says OK, please give me a ticket. This is a request for action 17. Jack would depend on the clerk using their shared knowledge about buying things so that the clerk also expects 16 to happen and forms the appropriate intentions. Alternatively, Jack might just say OK and hand over the money. In this case, he is depending on the clerk observing that he is doing his part of the buying plan, and he expects that the clerk will be motivated to perform his or her part. This type of example takes us away from language, however, so it will not be pursued further. As a final example, Jack might simply identify the more abstract goal of buying a ticket, depending on the clerk to know the details. He might say OK, I'd like to buy one. This sentence identifies Jack's intention to perform the joint action and depends on the clerk adopting the complementary intentions so that it is successful.
In summary, the final set of actions that Jack might intend after all this reasoning might be as follows:

20. MotivateByRequest(Jack1, Clerk1, ConvinceByInformRef(Clerk1, Jack1, λx(Price(Ticket1) = x)))
21. MyPartOf(Jack1, ConvinceByInformRef(Clerk1, Jack1, λx(Price(Ticket1) = x)))
22. MotivateByRequest(Jack1, Clerk1, Buy(Jack1, Clerk1, Ticket1))
23. Give(Jack1, Clerk1, Price(Ticket1))
24. MyPartOf(Jack1, Give(Clerk1, Jack1, Ticket1))

Intention 20 involves asking for the price of the ticket, 21 involves listening to the answer, 22 involves the request to buy a ticket, and 23 and 24 are the steps involved in buying the ticket. This set of intentions, together with those adopted by the clerk, might result in the following dialogue:

25a. Jack: Hi. How much is a ticket to Toronto?
25b. Clerk: Twenty five dollars.
25c. Jack: OK. I'd like one.
25d. Clerk: OK. Here.

The plan that might have motivated these intentions is shown in Figure 17.8, where Jack1 and Clerk1 have been shortened to J1 and C1 to save space. The details of the decomposition of the Buy action are also not shown.

MotivateByRequest(J1, C1, ConvinceByInformRef(C1, J1, λx(Price(Tic1) = x)))
  has-effect: Intend(C1, ConvinceByInformRef(C1, J1, λx(Price(Tic1) = x)))
    motivates: ConvinceByInformRef(C1, J1, λx(Price(Tic1) = x))
      has-effect: KnowRef(J1, λx(Price(Tic1) = x))   [enables Buy(J1, C1, Tic1)]

MotivateByRequest(J1, C1, Buy(J1, C1, Tic1))
  has-effect: Intend(C1, Buy(J1, C1, Tic1))
    motivates: Buy(J1, C1, Tic1)
      has-effect: Owns(J1, Tic1)

Figure 17.8 Jack's final plan for buying the ticket

17.7 Communicative Acts and the Recognition of Intention

The example in the last section showed that without an ability to recognize another agent's intentions, multi-agent action would be extremely difficult, if not impossible. In particular, it would be almost impossible to understand and act appropriately in a dialogue. This section considers the recognition-of-intention problem and how it relates to speech act interpretation.

Chapter 15 discussed the use of plan recognition to find the causal and intentional connections between eventualities described in sentences. It would seem that the same techniques could be applied here to recognize the intentions of the speaker. For the most part, this will be true, but there are some complications. In this section we will refer to the agent who performs the action as the acting agent, and the agent who observes the action as the observing agent. The issue is how the observing agent can recognize the intentions of the acting agent.

First, just because you can identify a plan that includes all the observed actions doesn't mean that this plan describes the agent's intentions. There may be many possible plans that account for the observed actions, and the acting agent is only performing one of these. So finding a plan using some plan recognition algorithm only suggests a possible set of intentions that the acting agent might have. In fact, determining whether this is the correct plan may be theoretically impossible, but we seem to survive by attributing intentions to agents based on the most likely plan we can recognize. Also, as mentioned in Section 17.3, the plan recognized must be plausible, given the acting agent's beliefs, and might seem wrong or inefficient from the observing agent's point of view.
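Before working through the locker example below, it may help to see the shape of such a plan recognition process. The sketch that follows chains forward from an observed act through effect, motivates, and enables links drawn from shared knowledge until it reaches an expected intention. The rule encoding, the single-successor search, and the key-request example data are simplifications invented for illustration.

    # Each rule maps a state or act to what it leads to, with the link type.
    SHARED_KNOWLEDGE = [
        ("effect",    "Request(key)",             "Intend(Jack, Give(key))"),
        ("motivates", "Intend(Jack, Give(key))",  "Give(Jack, Sue, key)"),
        ("effect",    "Give(Jack, Sue, key)",     "Have(Sue, key)"),
        ("enables",   "Have(Sue, key)",           "Open(Sue, Locker1)"),
    ]

    def recognize(observed_act, expectations, rules, max_depth=10):
        """Chain forward from the observed act; return the chain if it reaches an expectation."""
        chain, current = [observed_act], observed_act
        for _ in range(max_depth):
            nxt = next(((link, dst) for link, src, dst in rules if src == current), None)
            if nxt is None:
                return None
            link, current = nxt
            chain.append(f"--{link}--> {current}")
            if current in expectations:
                return chain
        return None

    plan = recognize("Request(key)", {"Open(Sue, Locker1)"}, SHARED_KNOWLEDGE)
    print("\n".join(plan) if plan else "no plausible plan found")

In a realistic system there would be many candidate chains, and only the plausible ones, judged against the acting agent's presumed beliefs, would be kept.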
In practice, this means that plan recognition can only be based on shared knowledge, although a few particular facts about the other agent's beliefs can be used if available. Consider the locker example from Section 17.3 again. Jack knows he has a combination lock on his locker, while Sue believes he has a key lock. They share knowledge about the need to unlock lockers to open them, knowledge about the effects of different communicative acts, and other knowledge of the situation. They must both know, for instance, that it is reasonable for Sue to want to get into Jack's locker. Given this, what can Jack recognize from the question

Sue: Can I have the key to your locker?

Skipping over the details as to how Jack identifies this locutionary act as a request for the key, he can infer a plausible plan including this action as follows. First, the utterance is identified as part of the action

MotivateByRequest(Sue1, Jack1, Give(Jack1, Sue1, KeyTo(Locker1)))

Using their shared knowledge about communicative acts, Jack knows that Sue believes that an effect of this act is

Intend(Jack1, Give(Jack1, Sue1, KeyTo(Locker1)))

which in turn, using shared knowledge about intentional behavior, would motivate the action

Give(Jack1, Sue1, KeyTo(Locker1))

which, given their shared knowledge about giving actions, would have the effect

Have(Sue1, KeyTo(Locker1))

Now the crucial step. Jack believes that Sue thinks he has a key lock on his locker. Thus he knows that Sue believes that having the key would enable the action

Open(Sue1, Locker1)

Assuming Jack finds this action to be a reasonable intention in this domain, he would have constructed a plausible explanation of why Sue asked him for the key. Of course, the plan constructed will not work, but this doesn't matter for the purposes of recognizing the plan. What matters is that Sue thinks it will work. Assuming this is the only plan that seems plausible, he would assume that it defines Sue's intentional structure. The plan recognized is shown in Figure 17.9.

Bel(Jack1, Bel(Sue1, ValidPlan(SP1)))

SP1: A plan to achieve Open(Sue1, Locker1)
  MotivateByRequest(Sue1, Jack1, Give(Jack1, Sue1, Key1))
    effect: Intend(Jack1, Give(Jack1, Sue1, Key1))
  Give(Jack1, Sue1, Key1)
    effect: Have(Sue1, Key1)
      enables: Open(Sue1, Locker1)

Figure 17.9 A plausible plan recognized from Can I have the key to your locker?

Given this plan, the primary intentions that Jack would attribute to Sue would include

Intend(Sue1, MotivateByRequest(Sue1, Jack1, Give(Jack1, Sue1, KeyTo(Locker1))))
Intend(Sue1, Achieve(Have(Sue1, KeyTo(Locker1))))
Intend(Sue1, Open(Sue1, Locker1))

In practice, a computational system that used an algorithm such as this would have to be given a list of possible intentions that would serve as its expectations. In the example, Intend(Sue1, Open(Sue1, Locker1)) would be one of the expectations. Given an input, the system would then use plan recognition to suggest some plausible plans, and hope that only one of them "matches" an expected intention.

Of course, you do not need to recognize the ultimate intention behind every utterance. As long as there are plausible intentions that could motivate the utterance, identifying the exact one is often not necessary. For instance, let us assume that there is no expectation that Sue might intend to open Jack's locker. This, after all, is a rather specific intention that was convenient to make this example simple. But let's say it is not an expected intention.
Rather, the expected intentions are more global things like studying for classes and going to the beach. Now, let's assume it is shared knowledge that Jack's locker contains both the textbooks for a course Jack and Sue are taking and also some beach towels. Now, 564 CHAPTER 17 when Sue asks for the key to the locker, Jack cannot identify Sue's intention-she might be intending to study or to go to the beach. The plan shown in Figure 17.9, however, is a subpart of either one. The plans diverge at the next action: whether Sue takes out the books or the towels. Now Jack might have some interest in exactly what Sue's intention is, but he does not need to identify her intention in order to understand her request. He has identified sufficient motivation for the request based on what's in Figure 17.9 and the knowledge that opening the locker is a reasonable thing to do for several expected intentions Thus, the analysis may be fine as it stands. In addition, the depth to which you need to recognize intentions in order to feel you have understood depends on your own goals in the interaction. In some~ conversations—say, when negotiating a peace plan—it is very important to know exactly what other people intend when they are speaking. In other conversations—say, when someone stops you on the street and asks for the time—you5 have almost no interest in the other person's intentions beyond the obvious one of ~ wanting to know the time. In a computational system operating in a specific domain, the required level of detail will need to be specified based on the range ;: of tasks the system is supposed to accomplish. Once the Intentions are recognized, the next thing to explain is what the observing agent does with the information about the acting agent's intentions;-For example, what will Jack do now that Sue has asked him for the key? This question leads us again to agents' planning and reasoning processes, which are-discussed in the next section. 17.8 The Source of Intentions in Dialogue If you look back at Figure 17.1. you will see that the intentions arise from a — combination of the agent's beliefs and desires. It believes it is in a certain slate, or will be in a certain state in the future, and this state is evaluated with respect to- -the agent's desires. New intentions arise from the decision to do something to change the way things will be if the agent continues on its present course. WeTZ will represent desires as a function on states that evaluates the attractiveness of different states. With this formalization, the decision-making process can be .... ... viewed as a continual striving to attain, or remain in, the more, attractive states.- A particular conversational agent will have its desires specified as part of" 1 the domain. For instance, a simple database question-answering agent might have only one desire, to answer all questions put to it. Thus it prefers states where™ there are no unanswered questions, and when it finds itself in a state where a question is unanswered, it would act to put itself in a more attractive state by answering the question. But even this simple agent must be more complex if you consider it carefully. If all the agent cares about is not having unanswered questions, then why- • not simply not process any input at all? Then it will never know it is in a state -with art unanswered question because there will be no questions. 
So there needs" Defining a Conversational Agent 565 to be a constraint on the agent's behavior that we might call the attentiveness constraint. When a question is asked, it will attempt to understand it. Now this behavior might not arise from a desire—it might be built into the system as an automatic process that occurs whenever a question arises. This is a quite plausible approach, even for modeling humans. If someone speaks to you, it is almost impossible not to interpret what that person is saying. You may decide to do nothing about whal was said or to ignore the speaker, but it is very hard to not interpret the sentences that you hear. So, moving back to our conversational agent, we will assume that the language-understanding process occurs automatically on any input. Given that it cannot ignore its input, what prevents the system from simply making up answers? If the system is to useful, then there must be some reason why it should give accurate answers. Using the BDI model, this means that the agent should believe what it says. This is often called the sincerity condition, and it plays an important role in many approaches to speech act theory. Thus we would want to make sure that the desirability function evaluates states such that, all other things being equal, states resulting from sincere responses are preferred over states resulting from insincere ones. We will say that a desirability function that has this property has a sincerity preference. Note that sincerity was built into the definition of the ConvinceBylnform act previously defined, so thai if the agent never performs an action unless its constraints are satisfied, it will have to be sincere. With the addition of the attentiveness constraint and a desire to be sincere, there are enough constraints to drive the behavior of a simple question-answering agent. But this agent will still not be considered very helpful. Consider the example of Jack's locker again. Assume the question-answering agent is playing the role of Jack. Say that Sue asks Jack whether he has the key to the locker. Jack inspects his beliefs and, finding that he has no key to (he locker (since one doesn't exist), can sincerely answer no! If you knew the background and overheard this conversation, you would consider Jack's answer misleading and uncooperative. The problem is that by answering no, Jack has implicitly indicated that Ihc question was well-formed, thereby confirming a presupposition in the question that the locker opens with a key. As a result, what Sue believes is shared knowledge is now diverging from what Jack believes to be the case. Although an agent might have some purpose that makes it want to allow such a misconception, typically conversants attempt to remedy misconceptions immediately in the dialogue. Thus we add a shared knowledge preference to the desirability function. An agent prefers states that minimize the difference between what each agent believes is shared knowledge. If the other agent says something that implies something that our agent believes is false, then our agent will prefer states that involve resolving the contradiction to states that do not. Thus, an answer such as No, the locker has a combination lock would be preferred over a simple No. Both responses are sincere, but only the first also obeys the shared knowledge preference. 
566 CHAPTER 17 Defining a Conversational Agent 567 \ '9; BOX 17.4 Conversational Maxims One of the most influential proponents of the view that language can be explained by examining general rational behavior and multi-agent cooperation is Grice (1975). He proposes that implicatures in conversation can be analyzed by assuming a set of conversational maxims that all agents will follow unless there is some reason not to. As a result, these maxims sanction assumptions that are crucial for understanding discourse. Grice's four main maxims are as follows: Maxim of Quantity—Make your contribution as informative as is required, but not overly informative. This maxim would prevent withholding key information, such as a violated presupposition, on the one hand, and excessive verbosity, on the other. Maxim of Quality—Do not say things for which you lack evidence. This maxim is reflected in our sincerity preference discussed in Section 17.8. Maxim of Relation—What you say should be relevant to the current topic. This maxim sanctions the coherence assumptions discussed in Chapter 15. : Maxim of Manner—Avoid obscurity of expression and ambiguity. The maxims are not fixed rules, but rather are preferences and default assumptions. A speaker may choose to violate a maxim for any number of reasons: to mislead the hearer, for example, or to communicate something by explicitly violating one, For instance, you might reply to one question by answering a different question, thereby indicating that you wish to avoid answering the original question. i —w The sincerity and shared knowledge preferences apply in almost any conversational situation and so can be taken as quite fundamental properties of dialogue itself. Additional preferences arise in more limited classes of situations. One that is particularly relevant for most applications is the helpfulness preference. This is a preference for responses that allow the other agent to. achieve their intentions more efficiently. Consider a conversational agent whose task is to answer queries about train schedules. It knows that in order for agents to board trains, they have to get to the appropriate locations at the appropriate: times. Knowing this, if a person asks When does the train to Windsor leave.', it might answer by giving both the departure time and its location, saying something like Four o'clock from gate seven. This answer is preferred over a simple-Four o 'clock because it eliminates the need for another act by the person to find out where the train leaves from. Of course, if the conversational agent believes that the person already knows where the train leaves from, then this second answer is not more helpful, it's just more verbose. To satisfy this preference, our agent should consider as much of the other agent's plan as can be reliably recognized and identify what information is needed. Of course, there must be a lirnit to how much is said as a response. Suppose in a certain situation the agent recognizes the other agent's plan in detail, and finds a wealth of information that the agent may require during the course of the plan. Clearly, providing too much information can be as bad as not providing7 enough. This could be encoded as a conciseness preference: If all other things are equal, prefer the response that is more concise. (Don't you wish more people used this constraint?) Of course, it might be that the helpfulness preference and the conciseness preference do not agree on the most preferred state. 
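To make these preferences concrete, here is a rough sketch of a desirability function that scores candidate responses by the sincerity, shared knowledge, helpfulness, and conciseness preferences just discussed. Nothing in it comes from the text itself: the response fields, weights, and scoring scheme are invented, and the numbers merely illustrate that the helpfulness and conciseness terms can pull in opposite directions.

    def desirability(response):
        score = 0.0
        score += 10.0 if response["sincere"] else -100.0              # sincerity preference
        score += 5.0 if response["corrects_misconception"] else 0.0   # shared-knowledge preference
        score += 2.0 * response["anticipated_needs_met"]              # helpfulness preference
        score -= 0.5 * response["length_in_words"]                    # conciseness preference
        return score

    candidates = [
        {"text": "No.",
         "sincere": True, "corrects_misconception": False,
         "anticipated_needs_met": 0, "length_in_words": 1},
        {"text": "No, the locker has a combination lock.",
         "sincere": True, "corrects_misconception": True,
         "anticipated_needs_met": 1, "length_in_words": 7},
    ]

    best = max(candidates, key=desirability)
    print(best["text"])   # prefers the answer that also repairs the shared-knowledge divergence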
In this case some mechanism must be developed to adjudicate between conflicting choices. 17.9 Recognizing Illocutionary Acts As discussed earlier, linguistic communication doesn't occur unless the relevant intentions are recognized. Unfortunately, there is no simple mapping between a set of words and a particular intention. For example, you already saw that the sentence Do you know the time? can be a question, a request, or an offer, depending on what the speaker intended. Thus it seems that the connection between the actual language and what is communicated is quite remote. But, if this is the case, how can one agent recognize what another agent intends based on what is said? This section explores this issue. The most common approach to this problem is based on the literal meaning hypothesis. This approach assumes that all sentences have a literal meaning that is based purely on the conventions of language. For example, the sentence Do you know the time? has a literal meaning in which it is a yes/no question asking about whether the hearer knows the time. This will be called the surface speech act and is identified from the syntactic mood of the sentence (that is, it is an interrogative sentence). Given the literal meaning, the intended speech act is then derived by an inference process, in our case using plan recognition techniques. There are two ways to view the surface speech act. In the first, the surface speech act is an illocutionary act but may be faulty in some way. For example, to sincerely ask a question, the speaker must want to know the answer. Consider a situation in which Helen knows that Jack is unaware of the time, and she says Do you know the time? as an offer. The surface speech act is a yes/no question about whether Jack knows the time and is faulty because Helen does not want to know the answer. The realization that the surface speech act is faulty initiates the search for an alternative intended reading. One problem with this approach is that the literal act need not always be faulty for the utterance to have an indirect meaning. For instance, Helen might not know whether Jack knows the time or not. In this case, Helen might say Do you know the time? and intend both the yes/no-question interpretation and the offer interpretation. In the other approach the surface speech act is not an illocutionary act but an act at another level of analysis, and inference must be used to identify the intended speech act. In this case the surface speech act is part of the semantic analysis of the sentence. For instance, Do you know the time? might be an Interrog act with propositional content Know(Jackl, the time). Then, each of the illocutionary acts would specify which surface forms can be used to realize it. 
Figure 17.10 shows the definitions of the illocutionary act RequestRef with multiple decompositions, each corresponding to a particular surface speech act form. Also shown is the communicative act MotivateInformRef, which is a specialized version of the MotivateByRequest act. The three decompositions identify three different surface forms that are commonly used to convey a RequestRef act. These forms correspond to the sentences What is the time?, Do you know the time?, and Tell me the time, respectively.

The Action Class MotivateInformRef(e):
  Roles: Speaker, Hearer, Predx
  Constraints: Agent(Speaker), Agent(Hearer), UnaryPred(Predx),
               ¬KnowRef(Speaker, Predx), KnowRef(Hearer, Predx)
  Preconditions: At(Speaker, Loc(Hearer))
  Effects: Intend(Hearer, InformRef(Hearer, Speaker, Predx))
  Decomposition: RequestRef(Speaker, Hearer, Predx)
                 DecideTo(Hearer, InformRef(Hearer, Speaker, Predx))

The Illocutionary Act RequestRef(e):
  Decomp 1: WhQuestion(Speaker, Hearer, Predx)
  Decomp 2: Interrog(Speaker, Hearer, KnowRef(Hearer, Predx))
  Decomp 3: Imper(Speaker, Hearer, InformRef(Hearer, Speaker, Predx))

Figure 17.10 Speech act definitions for wh-requests

If we make the reasonable assumption that these action definitions are shared knowledge between the conversing agents, then it is simple to see how an agent may plan an appropriate form of utterance to realize a MotivateInformRef action. Similarly, a simple decomposition chaining technique could be used to recognize a particular surface form as a particular communicative act. Of course, each surface form may appear in many different illocutionary acts, so there is still an ambiguity problem. For instance, Figure 17.11 shows definitions of the acts for yes/no questions.

The Action Class MotivateInformIf(e):
  Roles: Speaker, Hearer, Prop
  Constraints: Agent(Speaker), Agent(Hearer), Proposition(Prop),
               ¬KnowIf(Speaker, Prop), KnowIf(Hearer, Prop)
  Preconditions: At(Speaker, Loc(Hearer))
  Effects: Intend(Hearer, InformIf(Hearer, Speaker, Prop))
  Decomposition: RequestIf(Speaker, Hearer, Prop)
                 DecideTo(Hearer, InformIf(Hearer, Speaker, Prop))

The Illocutionary Act RequestIf(e):
  Decomp 1: Interrog(Speaker, Hearer, Prop)
  Decomp 2: Interrog(Speaker, Hearer, KnowIf(Hearer, Prop))
  Decomp 3: Imper(Speaker, Hearer, InformIf(Hearer, Speaker, Prop))

Figure 17.11 Speech act definitions for yes/no questions

The question Do you know the time? now matches decompositions in both RequestRef and RequestIf. Specifically, the literal meaning produced by the semantic interpreter would be

Interrog(Helen1, Jack1, KnowRef(Jack1, λt. NowTime(t)))

This will match decomposition 2 of RequestRef and decomposition 1 of RequestIf, producing two possible interpretations:

RequestRef(Helen1, Jack1, λt. NowTime(t))
RequestIf(Helen1, Jack1, KnowRef(Jack1, λt. NowTime(t)))

To choose between these, the system can check whether each makes sense in the context. To adopt a particular reading means that the constraints defined for the reading must be presupposed by the utterance. Thus if a constraint in one interpretation is not satisfiable, that reading is unlikely. If we assume that it is shared knowledge that Helen already knows the time, that is, KnowRef(Helen1, λt. NowTime(t)), then the RequestRef reading would be eliminated and the utterance would be interpreted as a yes/no question (or some other act, such as an offer).
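The sketch below illustrates this style of interpretation: a surface act is matched against the decompositions of candidate illocutionary acts, and readings whose constraints clash with shared knowledge are filtered out, in the spirit of Figures 17.10 and 17.11. The act encodings, the string-based constraint test, and the matching scheme are illustrative assumptions, not the book's algorithm.

    ILLOCUTIONARY_ACTS = {
        "RequestRef": {
            "decomps": [("Interrog", "KnowRef(Hearer, Pred)")],   # decomposition 2
            "constraints": ["not KnowRef(Speaker, Pred)"],        # speaker lacks the value
        },
        "RequestIf": {
            "decomps": [("Interrog", "Prop")],                    # decomposition 1
            "constraints": ["not KnowIf(Speaker, Prop)"],
        },
    }

    def interpret(surface_act, shared_knowledge):
        """Return the illocutionary readings whose decomposition matches the surface act
        and whose constraints are not contradicted by shared knowledge."""
        readings = []
        for act, defn in ILLOCUTIONARY_ACTS.items():
            if any(d[0] == surface_act for d in defn["decomps"]):
                violated = any(c.removeprefix("not ") in shared_knowledge
                               for c in defn["constraints"] if c.startswith("not "))
                if not violated:
                    readings.append(act)
        return readings

    # Shared knowledge that the speaker already knows the value rules out RequestRef.
    print(interpret("Interrog", {"KnowRef(Speaker, Pred)"}))   # ['RequestIf']
    print(interpret("Interrog", set()))                        # ['RequestRef', 'RequestIf']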
If, on the other hand, it is shared knowledge that Helen does not know the time but Jack does, then only the RequestRef interpretation is plausible. If none of this information is shared knowledge, then both interpretations remain plausible. The interpretations that are left after filtering based on the constraints can then be used to initiate further plan recognition to attempt to determine which interpretations best fit into the current situation. Note that even if the literal reading is eliminated and an indirect reading is taken, the original form of the utterance affects what responses are allowed. This suggests that at least the form of the literal meaning must be retained somehow in the final analysis. For instance, consider the following as responses to the utterances Do you know the time? and Tell me the time, both intended as requests for the time: 26a. Yes, it's three o'clock. 26b. No, I'm afraid not. 26c. Three o'clock. 26d. OK. It's three o"clock. 26e. No. I can't. Responses 26a, 26b, and 26c are fine for Do you know the time?, but 26d and 26e seem quite unnatural. On the other hand, in response to the direct request Tell me the time, responses 26a and 26b seem very unnatural, but the other three are fine.. Not all indirect speech acts occur in conventional forms. For instance, the sentence It's cold in here could be uttered as a request that the hearer shut the window, but there is nothing in the structure of the sentence that might suggest such an interpretation. Rather, this sentence can only be interpreted by using the 570 CHAPTER 17 Defining a Conversational Agent 571 BOX 17.5 Recognizing Intentions and Defining Illocutionary Acts Not all communication results from recognition of intention, and not all intentions that are recognized contribute to the definition of what speech act was performed. This makes the definition of illocutionary acts extremely difficult. Here is a series of examples that bring up the problems. Assume that Jack lisps slightly, and thus when he introduces himself to Sue, she might come to believe that he lisps. This fact is communicated to Sue but has nothing to do with what he is trying to say, that is, his illocutionary act. Since Jack didn't intend to convey that he lisps but merely intended to tell Sue his name, you would not say that Jack informed Sue that he lisps. To perform an illocutionary act, the speaker must intend to perform the act. But this is not enough: Say that Jack doesn't really lisp at all but imitates a lisp when he introduces himself to Sue. In this case Jack does intend Sue to believe that he lisps, but you still wouldn't say that Jack informed Sue that he lisps, because Sue doesn't recognize his intention to get her to believe that he lisps. But even if Sue did recognize Jack's intention, it still isn't an Inform act. Say that a friend told Sue that Jack often plays tricks like this on people he meets. Then the situation is as before, but now Sue recognizes that Jack has an intention to make her believe that he lisps. Despite this. Jack still didn't inform Sue that he lisps. What's missing is that Jack didn't intend Sue to recognize his intention to get her to believe that he lisps. So, to perform an illocutionary act, the speaker must intend for the intention to be recognized. You might think this solves the problem, but there are further counterexamples in which all these conditions are satisfied yet which wouldn't be considered Inform acts. 
What seems to be required is shared knowledge between the agents of what intentions the speaker intended to be recognized. A computational model for recognizing speech acts using such criteria can be found in Perrault and Allen (1980) and Allen (1983).

general knowledge that being cold may be undesirable, and the specific knowledge that the window is currently open and causes it to be cold. Such an interpretation can only be recognized by some form of reasoning about an agent's beliefs and intentions. Such a system would need to recognize that the utterance was said with the intention of getting the hearer to close the window (and the intention that this intention be recognized). Because of this intention, the utterance is interpreted as a request. See Box 17.5 for more details on this problem.

17.10 Discourse-Level Planning

The conversational agent model described so far attempts to account for the intentions underlying single utterances but has not addressed any larger-scale coherence of discourse. This section explores plan-based models at the discourse level, in which the relationships between utterances can be used to plan utterances and recognize the structure of utterances the other agent makes.

The Communicative Action DefineClass(e):
  Roles: Speaker, Hearer, Class
  Decomposition: IdentifySuperClass(Speaker, Hearer, Class)
                 IdentifyProperties(Speaker, Hearer, Class)
                 GiveExample(Speaker, Hearer, Class)

The Communicative Action IdentifySuperClass(Speaker, Hearer, Class):
  Roles: Speaker, Hearer, Class, SuperClass
  Constraints: Agent(Speaker), Agent(Hearer), SubType(Class, SuperClass)
  Decomposition: ConvinceByInform(Speaker, Hearer, SubType(Class, SuperClass))

Figure 17.12 Some discourse-level actions used to define a class of objects

Many theories of discourse involve the use of a fixed set of relationships between utterances. Different researchers call these relations different names, such as discourse relations, rhetorical relations, conversational games, and discourse moves, but they all view these relations as defining the structure of well-formed discourse. As a result, identifying these relationships is essential to understanding utterances in discourse. There is some debate about whether such explicit relations are necessary. For instance, consider whether an explicit question-answer relation is necessary in a theory of discourse. Given the model developed in this chapter, if a MotivateInformIf action is performed successfully, the receiving agent will then have an intention to perform a ConvinceByInformIf action. Thus an answer will typically follow a question because of the preferences that define each agent's behavior. The observation that discourse is often organized as question-answer pairs can be explained without requiring an explicit question-answer relationship to be recognized. Hence the debate. One group argues for the existence of a set of discourse relations that structure discourse, while the other argues that such relations are epiphenomenal and arise as a consequence of a more basic model of multi-agent behavior. No matter who ends up winning this argument, it is true that for any given application, having some set of discourse relations appropriate for the domain is of great benefit. This is especially true on the language-generation side, where the system may need to convey large amounts of information using multiple utterances.
In such cases, the organization of the information into utterance-size units is often viewed as a planning process using a set of discourse relations. For example, consider an application where the system provides an interface to an encyclopedia database. One of the tasks the system might have to perform is to define different classes of objects. Typically, a definition might take several utterances. By defining a set of discourse-level acts that involve different ways to define classes, this can be made into a planning problem. For instance, an action to define a class might be defined as shown in Figure 17.12. Each of the substeps of DefineClass would be defined in the system in terms of other actions or speech acts, just as IdentifySuperClass is in the figure. The system would then use these definitions to help plan a series of utterances that define the object. Given an appropriate database, the system might plan the following speech acts given the query What is a dog?:

ConvinceByInform(Speaker, Hearer, SubType(Dog, Animal))
ConvinceByInform(Speaker, Hearer, Property(Dog, CommonPet))
ConvinceByInform(Speaker, Hearer, ExampleOf(Fido1, Dog))

which might produce the output Dogs are animals. They are common pets. For example, Fido is a dog. To be useful, even in limited applications, there would need to be many different strategies for defining classes and constraints on each class that would help determine which strategy is best to use for any particular task. For instance, some classes of objects might better be defined in terms of contrast and comparison with another class that the user already knows about.

In other applications, large discourse-level scripts might be used to determine the structure of dialogue. For instance, consider an automated ordering system that engages in a dialogue with a user to order items from a catalog. In this case there are several stages of the dialogue, involving obtaining the address and credit card number of the user, obtaining the order, and then giving the user information on delivery. Such a system might be driven by a top-level discourse script, as shown in Figure 17.13.

The discourse script TakeOrder(e):
  Roles: System, User, Name, Address, CreditCardInfo, Items
  Decomposition: Greetings(System, User)
                 GetAddress(System, User, Address)
                 GetCreditInfo(System, User, CreditCardInfo)
                 GetOrder(System, User, Items)
                 VerifyDeliveryStatus(System, User, Items)
                 Closings(System, User)

The discourse script GetCreditInfo(e):
  Roles: System, User, CreditCardInfo
  Decomposition: MotivateInformRef(System, User, λx(CreditCardInfo = x))
                 ConvinceByInformRef(User, System, λx(CreditCardInfo = x))
                 VerifyNumber(System, User, CreditCardInfo)

Figure 17.13 Parts of the discourse-level scripts for an automated ordering application

Of course, there may be many different actions defined to handle a wide range of circumstances. For instance, the VerifyNumber action would involve at least two scripts. One, when the number is valid, would simply involve the system acknowledging the information. The other, when the number is not valid, would involve further dialogue to reconfirm the number and possibly to obtain an alternative number. The use of the scripts allows the user to rapidly define and modify the behavior of the agent. The penalty paid is in generality. The system can only continue to behave appropriately as long as the user is in accordance with the script.
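A toy sketch of driving a dialogue from discourse-level scripts like those in Figure 17.13 is shown below: a script is expanded recursively until primitive acts remain, which are then realized in order as dialogue turns. The script table and the primitive-act convention are illustrative assumptions; note that such an agent simply walks the decomposition in order, which is exactly why deviations from the script are not understood.

    SCRIPTS = {
        "TakeOrder": ["Greetings", "GetAddress", "GetCreditInfo",
                      "GetOrder", "VerifyDeliveryStatus", "Closings"],
        "GetCreditInfo": ["MotivateInformRef(credit-card)",
                          "ConvinceByInformRef(credit-card)",
                          "VerifyNumber(credit-card)"],
    }

    def expand(act):
        """Expand a discourse act into a flat sequence of primitive (unlisted) acts."""
        if act not in SCRIPTS:
            return [act]          # primitive: realized directly as an utterance or turn
        steps = []
        for sub in SCRIPTS[act]:
            steps.extend(expand(sub))
        return steps

    for step in expand("TakeOrder"):
        print(step)
    # Greetings, GetAddress, MotivateInformRef(credit-card), ..., Closings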
For instance, if the user tries to order the items first and give the card afterwards, the system will not be able to understand the interaction (unless, of course, there is another script to cover this case). In other words, the system must control the development of the dialogue, and there is no possibility for mixed-initiative interaction where the user can control the flow of topics. But in some applications,Tike catalog ordering, such limitations are reasonable. In more general dialogue applications, discourse-level plans become important for another reason. If the user is allowed to control the flow of the dialogue, then the system must be able to recognize what the user is doing. For instance, it must be able to recognize when the user is changing topics as opposed to continuing to talk about the current topic. And it must be able to identify when the user requires additional detail on a particular aspect of the previous dialogue. Such a system must be able to perform plan recognition at the discourse level to identify such behavior, and to relate this level of analysis to the recognition that is performed at the domain level. Summary To model a participant in a dialogue, you need to define a conversational agent. To do this requires defining the notions of belief, desire, and intention and relating these notions to planning models. Communication is a multi-agent activity and requires each agent to recognize the intentions of the other, especially those intentions relating to the communication process itself. A speech act can be defined as a form of communicative act. To identify the intended speech act, a system must take into account the structure of the utterance as well as the immediate context, especially what is shared knowledge about what the speaker and hearer know and want. Speech act models can be generalized to define a level of discourse plan that is useful for conUolling dialogue systems in a wide range of applications. Related Work and Further Readings BDI models of agents have been studied primarily in die context of planning and acting systems (for example, Bratman, Israel, and Pollack (1988)). This work arose from a desire to clarify the nature of intentions, which had previously been ignored in planning work. Bratman (1987) argues that intentions are a primitive construct not reducible to beliefs and desires and that future-directed intentions are of primary importance in explaining rational behavior. There is a considerable literature formalizing operators such as Believe and Intend. Most formal models of belief in the computational literature are derived from Hintikka (1969), who defined belief within a possible world semantics 574 CHAPTER 17 (Kripke, 1963). A good example of a computational model of knowledge is that developed by Moore (1977). Two good examples of work formalizing intention and belief are Cohen and Levesque (1990a) and Rao and Georgeff (1991). The database belief model described uses the techniques suggested by Moore (1973) and developed in detail by Cohen (1978). The resulting system was further refined and used in a system for recognizing speech acts by Allen and Perrault (1980). Konolige (1986) presents a formalization that generalizes the development here. The distinction between explicit beliefs and implicit beliefs is made by Levesque (1984). 
He develops a logic of explicit beliefs (the propositions you actually believe) based on a weakened set of inference rules, where the implicit beliefs (the propositions you could consistently believe) would be the deductive closure of the beliefs given their normal first-order semantics. Austin (1962) introduced the concept of speech acts. He considered only explicit performatives for the most part. Searle (1975) used a theory of communication developed by Grice (1957) to develop a speech act theory for everyday acts such as informs, requests, and promises. Grice argued that the essential requirement for communication to occur is the speaker's intention to communicate and the hearer's recognition of that intention. Grice (1975) also introduced a set of guiding principles that underlie all communication (see Box 17.4). The necessity for mutual beliefs or shared knowledge is discussed in detail by Schiffer (1972) and Strawson (1964) in philosophy and is used for problems of reference by Perrault and Cohen (1981) and Clark and Marshall (1981). A plan-based model of speech acts was suggested by Bruce (1975) and developed in a series of papers, starting with one by Cohen and Perrault (1979). This paper lays out the general principles of the approach and shows how speech acts can be planned in order to achieve goals using standard planning techniques. Cohen and Perrault did not, however, consider details of generating actual text from speech acts. Appelt (1985) extended this work to generate actual text. He developed techniques for eliminating unnecessary sentences by combining the information in multiple inform acts into a single sentence, and modeled the generation of referring phrases as a planning process. Pollack (1990) emphasizes the difference between recognizing plans and recognizing intentions. She defines a model of intention recognition based on the inference of mental states, although the model of action is limited to the generation relation. Grosz and Sidner (1990) extend this work and use a notion of shared plans to account for discourse intentions. Perrault and Allen (1980) developed a computational model of indirect speech acts using plan recognition from a literal meaning, drawing on the work; by Searle (1975). They also showed how the same model could be used for planning helpful responses. Cohen and Levesque (1990b) develop a model of rational behavior in which indirect speech acts are not explicitly represented, but their properties fall out of the general model. Perrault (1990) describes a model of speech act interpretation that uses default logic. All practical systems, however, use explicit representations of indirect acts to enable effective recognition Defining a Conversational Agent 575 4 techniques (for example, Sidner (1985), Litman and Allen (1987; 1990), Carberry (1990), and Traum and Hinkelman (1992)). These systems encode common forms of indirect acts as part of the decomposition of discourse acts and recognize indirect interpretations using decomposition chaining. Litman and Allen (1987; 1990) also introduce a multi-level model of plans that uses discourse-level and domain-level plans simultaneously to interpret utterances. Exercises for Chapter 17 1. (easy) Consider a situation where Jack is driving his car in one direction while Sue is driving her car in the other. Jack signals a left turn by sticking his hand out the window, and Sue understands the action. Analyze this communicative situation as done in Section 17.2. 
What is the locutionary act, the illocutionary act, and the perlocutionary act? What are the communicative conventions that allow this act to succeed? 2. (easy) For each of the following sentences, state whether it describes a desire, intention, plan-as-recipe, or is ambiguous. Justify your answer. Can you come up with linguistic tests that distinguish these three readings? a Sue wants to pass with honors this year, but she also wants to leave town all through March to ski. b. Sue knows how to get good grades on her exams. c. Sue intends to study all night for her exam, d Jack plans to go to the concert tonight. e Jack has a plan to get into the concert tonight, f. Jack knows a plan to get into the concert tonight. 3. (easy) Many multi-agent action verbs allow a reading in which a sentence that normally would describe die entire action can be used to describe one agent's part. Thus you can say Jack gave Sue the money, but she wouldn't take it. which might be paraphrased as Jack tried to give Sue the money, but she wouldn't take it. In contrast, sentences describing single-agent actions are hard to interpret this way: *Jack ate the pizza, but it was too hot. The only readings of this would involve Jack eating the pizza (and possibly burning his mouth). It is difficult to read it as Jack tried to eat the pizza, but it was too hot. Use this argument to show that communicative acts described by verbs such as inform and warn are multi-agent actions. What does this test say about perlocutionary verbs such as convince and scare ? 576 CHAPTER 17 (medium) Draw out the representation of the following formulas using belief spaces: a Bel(Jackl, Happy (Suel)) b. BeKJackl, Owns(Jackl, Fidol)) c. BeKJackl, Bel(Suel, -Happy(x))) h. 3 a-. Bel(Jackl, BeKSuel, Age(Suel,x))) i. BeKJackl, Bel(Suel, 3 x.Age(Suel,x))) j. Bel(Jackl, 3 x. Bel(Suel, Owns(Jackl, x))) (medium) Consider the speech act Promise. Can this communicative act be defined in terms of the modalities presented in this chapter, namely beliefs, desires, and intentions? If not, what additional modalities are needed? Give a definition of this action in the style used in Section 17.5. Why would a conversational agent need to plan promises in order to achieve its goals? Describe a situation in which a plan using a promise would be useful. (medium) Does the definition of the ConvinceBylnform&cX allow for a case in which an agent plans to he? In particular, consider whether the constraint that the speaker believes the proposition should be removed from the definition of the communicative act. What would the hearer believe about the speaker's beliefs if the condition is kept? What would the hearer believe about the speaker's beliefs if it were removed? Which is a better analysis of the act of lying? Using the version you prefer, draw out a plan in which Jack lies about his age. In addition to the plan, show all of Jack's beliefs that are relevant, and show the beliefs that Sue would have if the lie were successful. (medium) Complete the definitions of the actions needed to refine the TakeOrder action in Figure 17.13 down to the level of speech acts. Make up a dialogue that might occur over the telephone for ordering an item, and show how the dialogue follows your dialogue script. Do you need to extend the language for defining action decompositions in order to handle a wide range of dialogues? (hard) Write a program for adding and testing simple propositional beliefs (allowing cxistentials), based on the algorithm descriptions in Section 17.3. 
Trace its operation in adding and testing the formulas corresponding to the facts in Exercise 4. i S — "f- APPENDIX An Introduction to Lo and Model-Theoretic Semantics A.l Logic and Natural Language A.2 Model-Theoretic Semantics A.3 A Semantics for FOPC: Set-Theoretic Models An Introduction to Logic and Model-Theoretic Semantics 579 Logic has a long history, beginning before Aristotle. Modern logic, however, takes its form from the work of Frege in the late 19th century. The set-theoretic semantics for logic was developed by Tarski (1944). It is important to keep in mind here that there are many possible different notations for a logic, but the principles remain the same. In particular, many of the notations used in AI are expressively equivalent to the first-order predicate calculus (FOPC), or a subset of it. Section A.l introduces the first-order predicate calculus, and Sections A.2 and A.3 develop a semantics for FOPC that defines the notion of truth. A.1 Logic and Natural Language Logic was developed as a formal notation to capture the essential properties of natural language and reasoning. As a consequence, many of the structural properties have a parallel in natural language. For instance, an important distinction is made between expressions that identify objects and expressions that assert properties of objects and identify relationships between objects. Noun phrases perform the first task in language, and sentences and clauses perform the second. In FOPC the same distinction is made between terms, which denote individuals or objects in the world, and propositions, which make claims about objects in the world. There are two major classes of terms: constants and functions. The constants, which correspond most closely to proper names in language, will be written in italics, usually with one or more digits following, e.g., John], John!, CI, and so on. Constants do not have any inherent meaning by themselves. Rather, their properties arise solely from the formulas in which they are used. There is an important difference between names in natural language and constants in FOPC: While natural language is ambiguous, each symbol in FOPC has a single meaning. The name John can be used in natural language to refer to many different people. This cannot happen in logic, which would require a unique constant to represent each different person. Functions in FOPC correspond to noun phrases that refer to an object in terms of some property or relationship to other objects, such as John's father ox the king of Prussia. They are written in FOPC as an expression consisting of a function name, which is a new sort of constant, followed by a list of arguments (other terms) enclosed in parentheses. For example, the function corresponding to John's father mightbefather(Johnl). Simple propositions in FOPC consist of a predicate name, which is another sort of constant, and a list of arguments, which are terms, enclosed in parentheses. These correspond to simple sentences in natural language, where the predicate names correspond to the verbs. For example, the sentence John likes his father might correspond to the formula Likes(Johnl, father(Johnl)). More complex propositions arc built out of simple ones using a set of logical operators, which have correlates in natural language. The simplest logical operator is the negation operator, written as -i, which corresponds to one common use of 580 APPENDIX A An Introduction to Logic and Model-Theoretic Semantics 581 negation in natural language. 
The sentence John doesn't like his father would correspond to the following formula:

¬Likes(John1, father(John1))

Quantified sentences are handled using logical variables and quantifiers. For example, the sentence Every man likes John would correspond to the formula ∀y. Man(y) ⊃ Likes(y, John1) (that is, for all objects y, if y is a man then y likes John).

Since logical operators can be nested within each other, a potential ambiguity arises in formulas depending on what operator is within the scope of what other operator. For example, the formula

¬P & Q

could be read as the conjunction of two formulas (¬P and Q), or as the negation of the formula P & Q. To eliminate this possible ambiguity, a convention is used where the negation operator always takes the smallest possible scope. If a wider scope is desired, then parentheses can be introduced. Thus the preceding formula takes the first interpretation described, whereas the second would be written as

¬(P & Q)

One other convenient notation was introduced earlier with the logical quantifiers. A dot (.) in a formula is used to indicate that the scope of the preceding operator extends to the end of the entire formula. This is a convenient way to eliminate deep nesting of parentheses. Consider the following formula, which corresponds to a reading of the sentence Every boy likes some girl:

∀b. BOY(b) ⊃ ∃g. GIRL(g) & Likes(b, g)

Without the dot convention this would be written as follows:

∀b (BOY(b) ⊃ ∃g (GIRL(g) & Likes(b, g)))

The notation is summarized in Figure A.1.

Phrase                                        FOPC Equivalent
John                                          John1
John's father, the father of John             father(John1)
John likes his father                         Likes(John1, father(John1))
John doesn't like his father                  ¬Likes(John1, father(John1))
Either Sue is happy, or she is rich (or both) Happy(Sue1) ∨ Rich(Sue1)
Either Sue is rich, or she is poor            Rich(Sue1) ⊕ Poor(Sue1)
Sue is happy and rich                         Happy(Sue1) & Rich(Sue1)
If Sue is rich, then she is happy             Rich(Sue1) ⊃ Happy(Sue1)
John is rich if and only if Sue is            Rich(John1) ≡ Rich(Sue1)
There is a fish in the pond                   ∃f. Fish(f) & In(f, Pond44)
All fish are in the pond                      ∀f. Fish(f) ⊃ In(f, Pond44)

Figure A.1 The FOPC correlates of some natural language phrases

Inference: Logic as a Model of Thought

A primary motivation for developing FOPC is that it allows a precise formulation of the notions of inference and valid argument. For instance, you will recognize the following as an acceptable line of reasoning: All dogs love bones and Fido is a dog. Thus Fido must love bones. On the other hand, you will probably find the following argument faulty: Some dogs have fleas and Fido is a dog. Thus Fido must have fleas. Acceptable lines of reasoning can be stated in an abstract form called an inference rule. For example, the first argument just cited could be stated in FOPC as the following rule, where D is a predicate true only of dogs, and LB is a predicate true only of objects that like bones:

∀x. D(x) ⊃ LB(x)
D(FIDO1)
----------------
LB(FIDO1)
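The two steps implicit in this rule can be mechanized directly. The following small sketch applies universal elimination (instantiating the quantified formula with a constant) and modus ponens to derive the conclusion about Fido; the tuple encoding of formulas and the function names are illustrative choices, not part of the text.

    def universal_elim(forall_formula, constant):
        """("forall", var, body) + constant -> body with var replaced by the constant."""
        _, var, body = forall_formula
        def subst(f):
            if isinstance(f, tuple):
                return tuple(subst(a) for a in f)
            return constant if f == var else f
        return subst(body)

    def modus_ponens(implication, premise):
        """("implies", p, q) + p -> q (None if the antecedent doesn't match)."""
        _, p, q = implication
        return q if p == premise else None

    rule = ("forall", "x", ("implies", ("D", "x"), ("LB", "x")))   # all dogs like bones
    fact = ("D", "FIDO1")                                          # Fido is a dog

    instantiated = universal_elim(rule, "FIDO1")   # ('implies', ('D','FIDO1'), ('LB','FIDO1'))
    conclusion = modus_ponens(instantiated, fact)
    print(conclusion)                              # ('LB', 'FIDO1') -- Fido likes bones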
In practice, inference rules are not specific to a particular predicate name; they are expressed using schemas in which parameters of the form p, q, r, s stand for arbitrary propositions, a, b, c stand for arbitrary terms, and x, y, z stand for arbitrary logical variables. If an index is added, as in px, then the parameter stands for an arbitrary proposition involving a term x. Two inference rules of FOPC that account for the earlier argument are shown in Figure A.2.

Multiple inference rules may be combined to derive conclusions by constructing a proof. A proof consists of a set of premises and a sequence of formulas, each derived from the previous formulas and the premises using some inference rule. For example, the simple argument about Fido loving bones can now be formally justified by the proof in Figure A.3.

In general, two inference rules are defined for every logical operator in the logic: the elimination rule and the introduction rule. Elimination rules take a formula that contains the operator as a premise and have a conclusion that does not contain the operator, while introduction rules take one or more premises that do not contain the operator and have a conclusion that contains the operator. While many of the rules can be expressed in the notation developed so far, some require an extension. These typically involve making some assumption and then showing that some other formula can then be proved or that a contradiction can be proved. A contradiction arises in a proof if there is a formula P such that the proof establishes both P and ¬P. For example, the rule of neg-introduction, stated intuitively, is as follows: To prove ¬p, assume p is true and show that this results in a contradiction.

Suppose you are given that Jack does not love bones, and you want to prove that Jack is not a dog, given that all dogs love bones. You assume that Jack is a dog and show that this allows you to conclude that Jack loves bones, which contradicts your original premises. This proof is shown in Figure A.4, where the subproof based on the assumption is indented to separate it from the main proof. In step 2.1, the assumption is made, and a contradiction is derived in steps 2.3 and 2.4. Step 2.4 is simply copied from step 1 using a simple inference rule called reiteration, which allows a formula previously derived to be copied later in the proof. While not formally necessary, it allows for more intuitively clear proofs. In this case it allows us to point out the contradiction by placing steps 2.3 and 2.4 together.

Step | Formula | Justification
1. | ¬LB(JACK1) | premise
2. | ∀x. D(x) ⊃ LB(x) | premise
2.1. |   D(JACK1) | assumption for subproof
2.2. |   D(JACK1) ⊃ LB(JACK1) | univ-elim from step 2
2.3. |   LB(JACK1) | modus ponens from steps 2.1, 2.2
2.4. |   ¬LB(JACK1) | reiteration of step 1, revealing a contradiction
3. | ¬D(JACK1) | neg-introduction, steps 2.1, 2.3, 2.4

Figure A.4 A proof deriving a negation via a contradiction

A similar inference rule based on making assumptions is needed to formulate implication introduction, which is stated informally as follows: Assume p, and then, if you can derive q in the subproof, you can conclude p ⊃ q. Another is needed for the rule of universal introduction. Some additional inference rules are shown in Figure A.5.

Given a set of inference rules, there will be formulas that can be proven starting with no premises. These are called tautologies or, more simply, theorems, and they reflect inherent properties of the logic itself.
One fundamental theorem, given the system of inference rules, is the law of the excluded middle, which states that every proposition is either true or false (it can't be "undetermined"). This law states that any formula of the form

P ∨ ¬P

must be true. It is very important to realize that this law does not say that a particular reasoning system can prove P or prove ¬P; all it says is that P is either true or false in actual fact.

Two important theorems are called De Morgan's Laws:

1. ¬(p & q) ≡ ¬p ∨ ¬q
2. ¬(p ∨ q) ≡ ¬p & ¬q

And-Introduction Rule:
p
q
----
p & q

And-Elimination Rules:
p & q        p & q
----         ----
p            q

Or-Introduction Rule:
p
----
p ∨ q

∃-Introduction Rule:
pa
----
∃x. px

Neg-Elimination Rule:
¬¬p
----
p

Neg-Introduction Rule:
assume p
derive q and ¬q
----
¬p

Figure A.5 Some additional inference rules

Using theorem 1, you can see that the law of the excluded middle, p ∨ ¬p, is equivalent to ¬(¬p & p); that is, no proposition p can simultaneously be true and false.

The syntactic description of legal formulas and the set of inference rules defines a rich framework in which many things can be proven. But this is not the most important point of logic for the present purposes. For these applications, the most important aspect of logic is that you can independently specify a notion of truth and assign a "meaning" to arbitrary formulas. This is the role of logical semantics, and it is the topic of the rest of this appendix.

A.2 Model-Theoretic Semantics

The aim of a formal semantics for logic is to be able to answer the following sorts of questions. Is there a definition of truth independent of what is provable from a set of axioms? How can you tell if a system contains enough inference rules? How can you tell if the set of inference rules proposed is reasonable, that is, that they can't be used to "prove" things that are false? How can you tell if a set of axioms is coherent and could describe an actual situation? These questions can be answered by formalizing the notion of what an actual situation could be and showing how formulas in the logic map to assertions about that situation. Such situations will be called models of the logic, and the approach to semantics described here is called model theory.

First consider a subset of FOPC consisting solely of propositions and the logical connectives. In other words, there are no terms, no variables, and no quantifiers. This subset is usually called the propositional calculus. A model for the propositional calculus will map every propositional formula to the values T (for true) and F (for false). Such a model is constructed by specifying the mapping for all atomic propositions and then deriving the values for compound formulas using definitions of the logical operators. In other words, the logical operators are mapped to functions from truth values to a single truth value. As such, this method embodies the assumption that the meaning of the logical operators can be defined solely in terms of the truth values of their arguments and is independent of the actual propositions themselves. For example, suppose the propositions P and Q are true (both are mapped to T). Then if P & R is true for some proposition R, Q & R must also be true. The operator & is not affected by whether P or Q is its first argument, since both have identical truth values.

To define a model, you define a function V, called the valuation function, that maps formulas to the values T or F.
For instance, given a logic containing one proposition P, there are only two possible distinct valuation functions, V1 and V2, such that V1(P) = T and V2(P) = F. Given a logic with two propositions, P and Q, there are four possible valuation functions, and in general, given n propositions, there are 2^n possible valuation functions. Each valuation function defines a possible model of the logic. Every valuation function has the following properties that recursively define the meaning of the logical operators:

V.1. Vm(p & q) = T if Vm(p) = T and Vm(q) = T, and is F otherwise.
V.2. Vm(¬p) = T if Vm(p) = F, and is F otherwise.
V.3. Vm(p ∨ q) = T if Vm(p) = T or if Vm(q) = T, and is F otherwise.
V.4. Vm(p ⊃ q) = T if Vm(p) = F or if Vm(q) = T, and is F otherwise.
V.5. Vm(p ≡ q) = T if Vm(p) = Vm(q), and is F otherwise.

The definitions V.1 through V.3 should agree readily with your initial intuitions from when the operators were first defined using their natural language analogs. For instance, a conjunction p & q is true only if both p and q are true, which is exactly what rule V.1 says. The definition for implication in V.4, however, requires some explanation. Consider the sentence If John was in the room, then he saw the murder. If in fact John was not in the room, what is the truth value of this sentence? Stated more generally, what is the truth value of the formula p ⊃ q when p is false? While there are several options available, each one defining a possible operator, the interpretation in common use, and the one assumed in constructing the inference rules for implication, is that p ⊃ q is true in such cases. This is captured in rule V.4.

Given a model (that is, a valuation function), you can determine the truth values for any formulas in the propositional calculus with respect to that model. Consider the case of a logic with two propositions, P and Q. Exactly four models are possible, each capturing one of the possible ways to assign T or F to each of the propositions. These models are summarized in tabular form in Figure A.6, in which each row of truth values represents a single model. The truth values for simple formulas built from a single application of a logical connective are shown as well. All of these values are derived using rules V.1 through V.5.

Model | P | Q | ¬P | P & Q | P ∨ Q | P ⊃ Q | P ≡ Q | ¬P ∨ Q
1. | T | T | F | T | T | T | T | T
2. | T | F | F | F | T | F | F | F
3. | F | T | T | F | T | T | F | T
4. | F | F | T | F | F | T | T | T

Figure A.6 The possible models for P and Q

Inspecting the table, you can see a new way to demonstrate logical equivalence between formulas. Rather than constructing a proof that the formula P ⊃ Q is equivalent to the formula ¬P ∨ Q, you can show that they have the same truth value in all possible models. Consider determining the truth value for these two formulas with respect to model 2 in Figure A.6. Using rule V.4, you see that the value for P ⊃ Q is F. Using rule V.2, you see that the value of ¬P is F, and via V.3 you see that the value of ¬P ∨ Q is F. Thus the two formulas have the same truth value in model 2. Similarly, they have the same truth values in all the other models as well. No matter what the valuation function, the two formulas have the same truth value. Thus the formulas are logically equivalent.

So you now have two independent definitions of the logical operators: one using inference rules and the other using valuation functions. Can you show that both methods agree on all formulas? If so, you have gone a long way toward answering the questions posed at the beginning of this section.
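Rules V.1 through V.5 translate directly into a recursive procedure. The sketch below is an illustration only, not code from the text: the representation of models as association lists and all function names are assumptions. It checks the equivalence of P ⊃ Q and ¬P ∨ Q by evaluating both formulas in each of the four models of Figure A.6.

;; A model is an association list such as ((p . t) (q . nil)); formulas are
;; lists built from the operators NOT, AND, OR, IMPLIES, and IFF.  Lisp's T
;; and NIL stand in for the truth values T and F.

(defun valuation (formula model)
  (if (symbolp formula)
      (cdr (assoc formula model))
      (let ((op (first formula))
            (args (mapcar (lambda (f) (valuation f model)) (rest formula))))
        (case op
          (not     (not (first args)))                      ; rule V.2
          (and     (and (first args) (second args)))        ; rule V.1
          (or      (or (first args) (second args)))         ; rule V.3
          (implies (or (not (first args)) (second args)))   ; rule V.4
          (iff     (eq (first args) (second args)))))))     ; rule V.5

(defun equivalent-p (f1 f2)
  "True if F1 and F2 have the same truth value in every model over P and Q."
  (every (lambda (model)
           (eq (valuation f1 model) (valuation f2 model)))
         '(((p . t) (q . t)) ((p . t) (q . nil))
           ((p . nil) (q . t)) ((p . nil) (q . nil)))))

;; (equivalent-p '(implies p q) '(or (not p) q))  => T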
The Relationship Between Logics and Semantics

Start by examining whether the inference rules are sound; that is, whether they can never be used to prove something unless it is true in all possible models. For example, consider the rule of modus ponens (implication elimination): P ⊃ Q and P are true only in model 1 in Figure A.6, in which Q is assigned T as well. The only model in which the premises are true also makes the conclusion true. Thus the rule is sound, at least in logics with only two propositions. It turns out, however, that additional propositions will not affect this argument one way or another, so the rule is sound given any number of propositions.

A model satisfies a set of formulas if it assigns T to every formula in the set, and a set of formulas is consistent if at least one model satisfies it. For example, the set {P ⊃ (Q & R), P, ¬Q} is inconsistent because no model exists that assigns each of them T simultaneously. The complete truth table for this example is shown in Figure A.7. As you can see, no model simultaneously assigns T to all three formulas.

Figure A.7 The truth table for the three formulas over all models of P, Q, and R

We can now define the notion of a tautology: A formula f is a tautology if every model assigns f to T. As a simple example, consider the formula P ∨ ¬P. Since it involves only one proposition, you need only consider two possible models: one in which P is assigned T and one in which P is assigned F. In either model the formula P ∨ ¬P is assigned T: if P is assigned T, then by rule V.3, P ∨ ¬P is assigned T; if, on the other hand, P is assigned F, then ¬P is assigned T and hence by rule V.3, P ∨ ¬P is assigned T. Thus P ∨ ¬P is a tautology.

There are two useful relationships that can be defined between formulas. The notion of entailment is defined as follows:

s entails p (written as s ⊨ p) if and only if p is assigned T in all models that satisfy s

The notion of provability is defined as follows:

p can be proven from s (written as s ⊢ p) if and only if there is a proof of p with s as its only premise

The final question to consider is whether you can show that a given set of inference rules is sufficient to allow all true formulas to be proven. In other words, if, by using the model theory, you can show that some formula p is true in all models, then you want to show that p is provable using the inference rules defined for the logic. Using the previously defined notation of entailment, this means that whenever s entails p (that is, s ⊨ p), then p can be proven from s (that is, s ⊢ p). Such a proof system is said to be complete. Completeness can be shown for a wide variety of different sets of inference rules. Of most importance to computational approaches, the resolution method specifies a single inference rule that can be proven to be complete. That is, every true formula can be proven using repeated applications of the resolution rule.

A.3 A Semantics for FOPC: Set-Theoretic Models

A semantics for the first-order predicate calculus requires an extension of the notion of a model and is most conveniently expressed in terms of set theory. The complications arise because the meanings of other expressions besides propositions must be defined. The terms in FOPC do not represent truth values but rather physical objects, events, times, locations, and so on. All these objects are considered to be members of a set of objects called the domain. In general, every model can have a different domain. To simplify the presentation here, however, we will assume that all models have the same domain, written as Σ. In the propositional calculus, the valuation function mapped propositional symbols to T or F. In FOPC,
propositions themselves have a structure, and the valuation function must define the interpretation of constants, functions, and predicate names. In particular, the valuation function maps each term into some element in the domain. Consider a domain Σ consisting of the three elements α, σ, and δ, and a model in which the valuation function V1 is defined as follows for three constants A1, B1, and C1:

V1(A1) = α; V1(B1) = σ; and V1(C1) = δ

The valuation function also maps predicate names to different types of sets depending on the number of arguments the predicate takes. For the moment, consider only unary predicates, such as RED, which map to a subset of Σ (that is, those objects specified to be red). For instance, let us assume that the model defines the predicate RED as the set {α, σ}:

V1(RED) = {α, σ}

The truth value of simple propositions built from a unary predicate name and an argument can now be defined by the following rule (where P is any unary predicate name and a is any term):

Vm(P(a)) = T if Vm(a) is a member of Vm(P), and F otherwise

Given the definition of V1 and the rule defining the truth value of propositions constructed from unary predicate names, it is easy to see that

V1(RED(A1)) = T; V1(RED(B1)) = T; and V1(RED(C1)) = F

Just as in the propositional case, you could construct truth tables for all possible models. Figure A.8 shows some of the possible models for the logic with the three constants and the one predicate name RED, assuming the domain Σ. As you can see, the truth table method is unwieldy for the predicate calculus because there are so many possible models. The full table would have 216 entries.

Figure A.8 Some possible models assuming the domain Σ = {α, δ, σ}. Each model assigns a domain element to each of A1, B1, and C1 and a subset of Σ to RED; the truth values of RED(A1), RED(B1), and RED(C1) follow from those assignments.

A valuation function maps a predicate name P that takes n arguments to a set of lists of elements of length n. For example, the binary predicate Loves might be defined in V1 as follows (where α loves δ and σ loves α):

V1(Loves) = {(α δ), (σ α)}

The rule defining the truth value of simple propositions built out of n-ary predicates is defined as follows (where P is any n-argument predicate name, and a1, ..., an are any terms):

V(P(a1, ..., an)) = T if (V(a1), ..., V(an)) is a member of V(P), and F otherwise

We can now determine the truth of formulas with respect to model 1, for example:

V1(Loves(A1, B1)) = F; V1(Loves(B1, A1)) = T; V1(Loves(A1, C1)) = T

Once the different valuation function for propositions has been defined, the rules for defining the logical operators are identical to those defined for the propositional calculus. Thus we can calculate the assignment that V1 gives to formulas such as ¬(Red(A1) ∨ Loves(A1, B1)), as shown in Figure A.9.

Expression e | V1(e)
A1 | α
B1 | σ
Red | {α, σ}
Loves | {(α δ), (σ α)}
Red(A1) | T
Loves(A1, B1) | F
Red(A1) ∨ Loves(A1, B1) | T
¬(Red(A1) ∨ Loves(A1, B1)) | F

Figure A.9 Computing the valuation of ¬(Red(A1) ∨ Loves(A1, B1)) for model 1
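The same set-theoretic definitions can be prototyped with the list structures of Appendix B. The following sketch is illustrative rather than the text's own code; the model representation and function names are assumptions, and every extension is stored as a set of tuples (unary predicates as one-element tuples) so a single membership test covers predicates of any arity.

;; Model 1 of Figure A.9: A1 denotes alpha, B1 denotes sigma, C1 denotes
;; delta, and the extensions of RED and LOVES are sets of tuples.
(defparameter *model1*
  '((a1 . alpha) (b1 . sigma) (c1 . delta)
    (red . ((alpha) (sigma)))
    (loves . ((alpha delta) (sigma alpha)))))

(defun denotation (name model)
  (cdr (assoc name model)))

(defun holds-p (proposition model)
  "True if an atomic PROPOSITION such as (LOVES B1 A1) is true in MODEL."
  (let ((extension (denotation (first proposition) model))
        (tuple (mapcar (lambda (term) (denotation term model))
                       (rest proposition))))
    (if (member tuple extension :test #'equal) t nil)))

;; (holds-p '(red a1) *model1*)       => T
;; (holds-p '(loves a1 b1) *model1*)  => NIL
;; (holds-p '(loves b1 a1) *model1*)  => T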
The Semantics for Quantifiers

To define the semantics of formulas with quantified variables, we need the ability to make substitutions for variables in formulas. This can be done by augmenting the notion of a valuation function with what is called a variable substitution. Let us define V1{x/α} to be an extended valuation function that is identical to V1 except that it maps the variable x to the domain element α. For example, given the previously defined valuation function V1,

V1{x/α}(Red) = V1(Red) = {α, σ}

Thus V1{x/α}(Red(x)) = T, since V1{x/α}(x) is a member of V1{x/α}(Red). With the notion of variable substitutions, the semantics of formulas involving quantifiers is defined by the following rules on valuation functions:

Vm(∀x. P) = T if for every element a in Σ, Vm{x/a}(P) = T
Vm(∃x. P) = T if there is at least one a in Σ such that Vm{x/a}(P) = T

Intuitively, this says that a formula such as ∀x. P is true if and only if the formula P is true no matter what value the variable x takes, and a formula such as ∃x. P is true if and only if there is at least one value of x that makes P true.

All of the preceding definitions of soundness, consistency, completeness, and so on carry over to the semantics for FOPC without change. For instance, you can show that a set of formulas is consistent by constructing a model that makes each one true. For example, consider the following set of formulas:

{∀x. Q(x) ⊃ R(x), Q(A1), Q(B1), ¬R(C1)}

A model can be constructed for these formulas as follows: Let Σ be the set {α, δ, φ}, where V(A1) = α, V(B1) = δ, and V(C1) = φ. Let V(Q) = {α, δ} and V(R) = {α, δ}. Now you see that each of the preceding formulas evaluates to T given this model.

Related Work and Further Readings

A good text on logic and its relevance to linguistics is McCawley (1993). An excellent comprehensive source on logic and the use of other formal models in linguistics is Partee et al. (1990). Many introductory AI texts also discuss logic. Good examples are Luger and Stubblefield (1993), Winston (1992), Rich and Knight (1992), and Genesereth and Nilsson (1987). There are many good introductory texts in logic, although many emphasize proof theory and do not deal with semantics. Barwise and Etchemendy (1987) give an excellent introduction, complete with an interactive program that allows you to explore issues in logic.

Exercises for Appendix A

1. (easy) Assume Sam is Bill's father is represented by FatherOf(Bill, Sam), and Harry is one of Bill's ancestors is represented by Ancestor(Bill, Harry). Define the other predicates necessary, and then write a formula to represent the meaning of the sentence Every ancestor of Bill is either his father, his mother, or one of their ancestors.

2. (easy) Give two English sentences that seem difficult to express in predicate calculus and discuss why you think this is so.

3. (easy) Define the valuation function that defines the semantics of the exclusive-or operator as defined in this appendix, and then show that the following rules of inference are sound:

p ⊕ q        p ⊕ q
p            ¬p
----         ----
¬q           q

4. (easy) For each of the following sets of formulas, determine whether or not the formulas are consistent by constructing a truth table.

S1 = {P, ¬P}
S2 = {P ⊃ Q}
S4 = {P ∨ Q, ¬P & ¬Q}

5. (medium) Express the following facts in FOPC and give a proof of the conclusion. If necessary, add common-sense axioms so that the conclusion follows from the given assumptions.

a. Assumptions: Tomorrow it will either be warm and sunny or cold and rainy. Tomorrow it will be sunny.
   Conclusion: Tomorrow it will be warm.

b. Assumptions: Tweety is a bird. Tweety eats all good food. Sunflower seeds are good food.
   Conclusion: There is a bird that eats sunflower seeds.
APPENDIX B
Symbolic Computation

B.1 Symbolic Data Structures
B.2 Matching
B.3 Search Algorithms
B.4 Logic Programming
B.5 The Unification Algorithm

This appendix introduces some of the basic techniques of symbolic computation. The text assumes familiarity with some programming language but not necessarily one that deals with symbolic data. If you are familiar with the languages LISP or PROLOG, Section B.1 may be safely skipped. If, in addition, you have some background in AI and know the unification algorithm, you may be able to skip the entire appendix. Section B.2 deals with matching and unification. Section B.3 describes basic ideas of search techniques, and Section B.4 combines the techniques of matching and search in discussing PROLOG. Finally, Section B.5 presents the unification algorithm in detail.

B.1 Symbolic Data Structures

This text uses many different data structures to represent the results of different forms of analyses. All of these structures are built from a simple data structure called a list, on which the programming language LISP is based. A simple list is built of a sequence of expressions called atoms, which are simply sequences of characters. For example, (A Happy TOAD) is a list consisting of three atoms: A, Happy, and TOAD. The only property that atoms have is that they can be distinguished by their names. Thus we know that the atom Happy and the atom TOAD are distinct atoms because they are constructed out of different characters. Atoms do not have to be English words; the following is also a valid list of five atoms:

(R3 forty XX3Y7 AAA1 ?X3)

A list in general may contain not only atoms but also other lists. The following is a valid list consisting of four elements: the atom NP, the list (DET the), the list (ADJ happy), and the list (HEAD toad):

(NP (DET the) (ADJ happy) (HEAD toad))

You will see many structures like this later. This one could be a syntactic representation of the English noun phrase the happy toad. There is no limit to how many times a list may be embedded in another list. Thus

(((A TWO) (B THREE)) (((D)) E))

is a valid list. Taking it apart, you can see that it is a list consisting of two elements: the list ((A TWO) (B THREE)) and the list (((D)) E). The second element is itself a list consisting of two elements: the list ((D)) and the atom E. The element ((D)) is also a list, consisting of one element, the list (D), which consists of one element, the atom D. The list containing one atom, such as (D), is very different from the atom itself, that is, D. These might have completely different interpretations by a program. The list containing no elements, written as (), is a valid list and may itself be an element of another list. It is called the null or empty list, and is often written as NIL or nil.

There are two basic operations used for taking a list apart: First and Rest. Figure B.1 shows the results of applying these functions to various lists.

First((A B C)) equals A
First(((A B) C)) equals (A B)
First(((A B C))) equals (A B C)
Rest((A B C)) equals (B C)
Rest(((A B) C)) equals (C)
Rest(((A B C))) equals (), the empty list

Figure B.1 Examples of list operations

First(<list>): takes any list and returns its first element. It is undefined on the empty list and on atoms.
Rest(<list>): takes a list and produces the list consisting of all elements but the first one. It is also not defined on atoms or on the empty list.

There are two basic functions for constructing lists: Insert and Append. Figure B.2 shows these functions used in various situations.

Insert(<element>, <list>): returns a new list with <element> as the first element, followed by the elements in <list>.

Append(<list1>, <list2>): returns a new list consisting of the elements of <list2> followed by the elements of <list1>.

Insert(A, (B C)) equals (A B C)
Insert((A B), (C)) equals ((A B) C)
Insert((A B C), ()) equals ((A B C))
Append((A), (B C)) equals (B C A)
Append(((A B)), (C)) equals (C (A B))
Append((A B C), ()) equals (A B C)

Figure B.2 Examples of list constructors

Finally, some tests are useful for examining lists. These tests return a value of TRUE or FALSE depending on the structure of their arguments, and are usable in if-then-else type statements:

Null(<list>): TRUE only if <list> is the empty list.

Member(<element>, <list>): TRUE only if <element> is an element of <list>.

Multiple occurrences of an atom in a list are allowed, and the Member predicate will return TRUE in such situations. Member succeeds only if the atom is an element of the list, not if the atom occurs within an embedded sublist. Figure B.3 shows examples.

Member(A, (A B C)) is TRUE
Member(A, ((A) B C)) is FALSE
Member((A), (A B C)) is FALSE
Member((A), ((A) B C)) is TRUE
Member(B, (A B C)) is TRUE
Member(B, (A B C B)) is TRUE
Member(B, ((A B) C)) is FALSE

Figure B.3 Examples of the Member predicate

While this book can be understood without a knowledge of the LISP language, the LISP equivalents to these functions and predicates are shown in Figure B.4 for those who are interested. Further details on LISP can be found in textbooks such as Wilensky (1986) and Winston and Horn (1989).

Operation | LISP Equivalent
First((A B C)) | (CAR '(A B C))
Rest((A B C)) | (CDR '(A B C))
Insert(A, (B C)) | (CONS 'A '(B C))
Append((A), (B C)) | (APPEND '(B C) '(A))
Null((A)) | (NULL '(A))
Member(A, (A B C)) | (MEMBER 'A '(A B C))

Figure B.4 LISP equivalents

Stacks and Queues

In the algorithms outlined in this text you will often use data structures named stacks and queues. These can be built simply from list structures but are important enough to consider separately. A stack is a list in which all adding and removing of elements takes place at one end, called the top of the stack. This is analogous to a stack of trays in a cafeteria: when a tray is returned, it is put on the top; when a tray is taken, it is removed from the top. So at any time, the tray just removed is always the last one that was added. For this reason, stacks are also often called LIFO (last in/first out) lists. Stacks will be used in algorithms to keep track of items that need considering by the program. The general organization of such a program looks as follows:

1. Initialize the stack with one or more items.
2. Repeat the following until the answer is found:
   2.1. Remove the top item of the stack.
   2.2. Consider the item.
   2.3. Add any new items to be considered onto the stack.

In general, the operation of adding an item to the stack is called pushing the item onto the stack, while removing an item is called popping the stack. A more concrete example of a stack organization is the "in" tray in an office. The person in the office might operate as follows:

1. Letters are added to the tray overnight.
2. Repeat until no letters:
   2.1. Remove the top letter.
   2.2. Read it and reply.
   2.3. If new letters arrive in the meantime, add them to the tray.

Of course, if more letters are arriving than the person can read, some letters that arrived overnight might never get read. Thus sometimes a different data structure is needed. A queue is a list in which all adding is done at one end, and all removing is done at the other end.
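As a concrete illustration, both disciplines can be built from ordinary lists; the only difference is which end of the list receives new items. The following is a sketch with assumed function names, not code from the text.

;; A stack and a queue are both plain lists; they differ only in where new
;; items are added.  Items are always removed from the front of the list.

(defun add-to-stack (item stack)
  "LIFO: the new item goes on the front (the top of the stack)."
  (cons item stack))

(defun add-to-queue (item queue)
  "FIFO: the new item goes on the end of the queue."
  (append queue (list item)))

(defun remove-next (items)
  "Return the next item to consider and the remaining items."
  (values (first items) (rest items)))

;; (add-to-stack 'letter3 '(letter2 letter1))  => (LETTER3 LETTER2 LETTER1)
;; (add-to-queue 'letter3 '(letter1 letter2))  => (LETTER1 LETTER2 LETTER3)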
The queue is analogous to a line in a bank: people enter at the end of the line, while the people at the front get served. This scheme is often called a FIFO (first in/first out) queue, since the first person to arrive is the first one served. If the harried office worker above used a queue instead of a stack to organize the letters, he or she would eventually get to all the letters that arrived overnight, even though new letters arrived at an ever-increasing rate. Similar considerations will arise later in algorithms when you have to decide whether to use a stack or a queue structure.

B.2 Matching

Most of the techniques discussed throughout will involve some notion of matching lists together. The simplest match between two lists is whether they are identical. To obtain more general forms of matching, you need to allow variables in lists, which can match any element. Variables will be indicated by atoms with a ? prefix. A variable may match any atom, sublist, or another variable. Thus a list with a variable, such as (A ?x C), could match (A B C), (A (B C) C), or (A ?z C), but could not match (A B D C), since ?x can take the place only of a single element. It is often useful to know what a variable matched against. We say that a variable is bound to the value that it matched. Thus, in the preceding examples, ?x would have been bound to B, (B C), and ?z, respectively, on the three successful matches. These results are summarized in Figure B.5.

List 1 | List 2 | Match Result | Bindings
(A ?x C) | (A B C) | success | ?x ← B
(A ?x C) | (A (B C) C) | success | ?x ← (B C)
(A ?x C) | (A ?z C) | success | ?x ← ?z
(A ?x C) | (A B D C) | failure |

Figure B.5 Some simple examples of matching with variables

Given this background, you can now define the concept of matching more precisely as follows: Two lists are said to unify if there is a set of bindings for the variables in the lists such that, if you replace the variables with their bindings, the two lists are identical.

Given that two lists unify, there may be many possible bindings that make them identical. For example, in unifying (A ?x C) with (A ?z C), there are many possible bindings for ?x and ?z, as shown in Figure B.6. When this arises, there is always one set of bindings that produces a new list that could unify with all the other solutions. This one is called the most general unifier. In Figure B.6, result 1 is the most general unifier because it could unify with results 2 and 3. Neither 2 nor 3 can be most general because they do not match with each other. If two lists unify, there is always a most general unifier, and it can be found reasonably efficiently using an algorithm called the unification algorithm.

Bindings | Result
1. ?x ← ?z | (A ?z C)
2. ?x ← B, ?z ← B | (A B C)
3. ?x ← C, ?z ← C | (A C C)

Figure B.6 Three possible unifications of (A ?x C) and (A ?z C)

Consider some more complex cases of unification. These arise when the same variable is used more than once in a list. In such cases some possible matches that look like they might succeed don't, because there is no single value for the variable that will make the two lists identical. The list (A ?x ?x), for example, does not unify with (A B C) because ?x would have to match both B and C. If ?x were bound to B, the first list would become (A B B), which is not identical to (A B C). Similarly, if ?x were bound to C, the first list would become (A C C), which is not identical to (A B C).
More complex examples can occur with sublists. For example, (A ?x C) will match (A (f ?y) C), since you can bind ?x to (f ?y) and the lists are identical. Furthermore, (A ?x C) would match (A (f ?y) ?y), since if you let ?x ← (f C) and ?y ← C, both lists become (A (f C) C).

Informally, you can match two lists by going through each list element by element. Every time a variable must be bound, rewrite each formula with every occurrence of the variable replaced by its new binding, and continue matching. If you are careful not to bind a variable unnecessarily, you will produce a most general unifier. Figures B.7 and B.8 contain a trace of this process in two situations: one a success and one a failure. The unification algorithm is presented in detail in Section B.5.

To match (A ?x C ?x) with (A (f ?y) ?y ?z):
1. First elements are: A and A
2. Second elements are: ?x and (f ?y)
   Bind ?x to (f ?y), rewrite the two formulas to (A (f ?y) C (f ?y)) and (A (f ?y) ?y ?z)
3. Continue with third elements: C and ?y
   Bind ?y to C, rewrite the two formulas to (A (f C) C (f C)) and (A (f C) C ?z)
4. Continue with fourth elements: (f C) and ?z
   Bind ?z to (f C), rewrite the two formulas to (A (f C) C (f C)) and (A (f C) C (f C))
The lists are identical, so they unify.

Figure B.7 An informal trace of a unification procedure

To match (A ?x C ?x) with (A (f ?y) ?y ?y):
1. First elements are: A and A
2. Second elements are: ?x and (f ?y)
   Bind ?x to (f ?y), rewrite the formulas to (A (f ?y) C (f ?y)) and (A (f ?y) ?y ?y)
3. Third elements are: C and ?y
   Bind ?y to C, rewrite the formulas to (A (f C) C (f C)) and (A (f C) C C)
4. Fourth elements are: (f C) and C
   Failure, since these cannot be made identical.

Figure B.8 An informal trace of a failure to unify
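The repeated bind-and-rewrite step in these traces is easy to express with list operations. The following sketch is purely illustrative (the function name and the decision to pass both lists at once are assumptions); it reproduces step 2 of Figure B.7.

;; One step of the informal procedure of Figures B.7 and B.8: when a variable
;; is matched against a value, rewrite both lists with the variable replaced
;; by that value before continuing.  The full algorithm appears in Section B.5.

(defun bind-and-rewrite (var value list1 list2)
  "Rewrite LIST1 and LIST2 with every occurrence of VAR replaced by VALUE."
  (labels ((rewrite (form)
             (cond ((equal form var) value)
                   ((consp form) (cons (rewrite (car form))
                                       (rewrite (cdr form))))
                   (t form))))
    (list (rewrite list1) (rewrite list2))))

;; Step 2 of Figure B.7, binding ?x to (f ?y):
;; (bind-and-rewrite '?x '(f ?y) '(a ?x c ?x) '(a (f ?y) ?y ?z))
;;   => ((A (F ?Y) C (F ?Y)) (A (F ?Y) ?Y ?Z))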
B.3 Search Algorithms

Virtually all AI programs use some form of search to find the desired solution. In fact, one area of research in AI expressly studies different search algorithms. This section describes some of the basic ideas underlying search algorithms. Viewed abstractly, all search algorithms operate by exploring a space of different possible states to find some state that satisfies the goal of the search. For instance, in a program that plays chess, a state would represent a board position in the game. The goal would be to find a state where the program has checkmated the opponent. To search for a winning state, the program applies a transition function that takes one state and produces another. In a chess-playing program the transition function would be defined by the moves of chess. Given one state, the program computes a new state that results from a particular move. Typically, a program will consider the set of states that would result from every possible move. These states are called the successor states.

Chess is a very complex game and produces a correspondingly complex search space, so let's consider a much simpler example. Say you want to find the combination to a friend's locker. You could randomly try combinations and hope to get lucky, but if you want to systematically consider all possibilities, you must decide on a particular search strategy. We can formalize the problem as a state space search as follows. A state is a list representing the numbers you have dialed so far. A move is accomplished by dialing another number, producing a new state that is represented by appending the new number onto the original state. Assume that we don't know how many numbers need to be dialed. To make the example manageable, also assume that the dial has only two numbers, 1 and 2. Figure B.9 shows the space of states involving up to three numbers. From the initial state (), you can dial either a 1 or a 2. Thus there are two successor states, (1) and (2), shown in the figure. From each of these, you can try a second number. Thus each has two successor states, and so on.

Figure B.9 The search space for a simple combination lock

Many different search strategies can be explored within a single general scheme that maintains a list of the states that still remain to be explored. The general algorithm is as follows. Let TODO be a list of the states that you need to consider.

1. Initialize TODO to a list containing only the initial state ().
2. Repeat until the goal state is found:
   2.1. Remove a state from the TODO list.
   2.2. Check if it is the goal state (does it open the locker?).
   2.3. If not, generate all successors and add them to the TODO list.

Depending on the strategy used for selecting the next state to be processed, different search strategies are produced. There are two basic strategies. The first, depth-first search, uses a stack for the TODO list. Thus it always selects the last state to be added to the stack for further processing. Consider this strategy in operation on the combination lock problem, where the actual combination is (2 1 1). Initially, TODO contains only the initial state (). Two successors are generated and added to the list, namely (1) and (2). After adding them, TODO has the value ((1) (2)). State (1) is selected next. It is found not to be the goal state (that is, it does not open the locker), so two successor states, (1 1) and (1 2), are generated and added to the TODO list, which is now ((1 1) (1 2) (2)). The state (1 1) is selected next, and after failing to open the locker, two successor states, (1 1 1) and (1 1 2), are generated and added to the TODO list, which is now ((1 1 1) (1 1 2) (1 2) (2)). Given that (1 1 1) is not the combination, the successor states (1 1 1 1) and (1 1 1 2) are added to the TODO list. By now you may have noticed that unless the algorithm places an upper limit on how many numbers need to be dialed, this strategy will never find the solution.

The other basic strategy, called breadth-first search, treats TODO as a queue rather than a stack. Thus successor states are added to the end of the TODO list. Starting again from the state (), the TODO list takes the following values as the algorithm operates:

((1) (2))
((2) (1 1) (1 2))
((1 1) (1 2) (2 1) (2 2))
((1 2) (2 1) (2 2) (1 1 1) (1 1 2))
((2 1) (2 2) (1 1 1) (1 1 2) (1 2 1) (1 2 2))
and so on

You can see that the two strategies explore the space in quite different orders. If you follow the search strategies by tracing the states considered on the tree representing the search space in Figure B.9, you will see that the depth-first strategy quickly moves down the tree, exploring states at ever deeper levels. The breadth-first strategy, on the other hand, systematically explores each possibility at one level of the tree before considering the states at the next level. With a finite search space (for example, we might specify that the combination has at most four digits, so a state such as (1 1 1 1) has no successors), the depth-first strategy will often find a solution faster. With infinite search spaces, however, as in the unconstrained combination lock problem, the depth-first strategy is not guaranteed to find a solution, as it might search down an infinite chain of successor states that does not contain a solution. The breadth-first strategy is guaranteed to find a solution if one exists.
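The contrast between the two strategies can be made concrete with a small sketch of the general TODO-list algorithm. The code below is an illustration under assumed names; the two-digit dial and the combination (2 1 1) come from the running example, and an arbitrary limit of 100 states is imposed so the demonstration terminates on this infinite space.

;; Generic TODO-list search for the combination-lock example.  A state is a
;; list of the digits dialed so far; its successors extend it with 1 or 2.

(defun successors (state)
  (list (append state '(1)) (append state '(2))))

(defun search-lock (goal strategy &optional (limit 100))
  "STRATEGY is :DEPTH-FIRST (stack) or :BREADTH-FIRST (queue)."
  (let ((todo (list '())))
    (dotimes (i limit)                     ; bound the search for the demo
      (when (null todo) (return nil))
      (let ((state (pop todo)))            ; always remove from the front
        (when (equal state goal) (return state))
        (setf todo
              (if (eq strategy :depth-first)
                  (append (successors state) todo)       ; push onto the stack
                  (append todo (successors state)))))))) ; add to end of queue

;; (search-lock '(2 1 1) :breadth-first)  => (2 1 1)
;; (search-lock '(2 1 1) :depth-first)    => NIL within the 100-state limit,
;;   since depth-first keeps dialing 1s forever on this infinite space.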
Many other search strategies are possible. A strategy called iterative deepening acts like a depth-first strategy except that it stops when it reaches a certain depth in the tree. If no solution is found at that depth, it increases the depth and repeats. While this involves duplication of work, it turns out to be better overall than either the pure depth-first or breadth-first strategies. Alternatively, you might use a heuristic search, in which the next state is chosen based on some heuristic information. For instance, if you believe that the combination most likely involves a sequence that does not repeat digits consecutively, you might always choose a state without a repetition if possible, and only consider the others if there are no other possibilities. You might add states involving repetitions to the end of the TODO list, and the others to the front:

((1) (2))
((1 2) (2) (1 1))
((1 2 1) (2) (1 1) (1 2 2))
and so on

A general formulation for heuristic search is often called best-first search. This strategy uses a function that returns a heuristic value for each state. At any iteration of the search, you select the state that has the highest value.

B.4 Logic Programming

Many natural language applications are implemented in the programming language PROLOG, or in extensions built on top of PROLOG. PROLOG turns out to be a useful tool, as it builds in the notions of unification and search. Thus it provides the basic building blocks necessary for many AI applications. PROLOG is built on the notion of a Horn clause, which can be defined as a logical implication in which the left side of the implication is a single simple proposition. Thus the expression

friendly(fido1) :- dog(fido1), well-fed(fido1)

in PROLOG corresponds to the assertion that Fido is friendly if he is a dog and is well fed. This would be expressed in FOPC as

(Dog(Fido1) & Well-Fed(Fido1)) ⊃ Friendly(Fido1)

Of course, more useful rules would use variables to apply to a range of objects. The general statement that dogs that are well fed are friendly would be expressed as follows, where variables are indicated by symbols beginning with an uppercase letter:

1. friendly(X) :- dog(X), well-fed(X)

This is equivalent to the FOPC assertion ∀x. (Dog(x) & Well-Fed(x)) ⊃ Friendly(x). In PROLOG it is interpreted procedurally: for any object x, to prove that x is friendly, prove that x is a dog and that x is well fed. Horn clauses without a right side are called facts. To assert that Fifi is a dog that is well fed, you would add the facts

2. dog(fifi) :-
3. well-fed(fifi) :-
Using clauses 1, 2, and 3, PROLOG can now prove that Fifi is friendly as follows:

Goal: friendly(fifi)

Using the unification algorithm, clause 1 is specialized to apply to the goal (that is, the variable X is bound to fifi):

4. friendly(fifi) :- dog(fifi), well-fed(fifi)

To prove the goal in light of this rule, it must prove the subgoals

5. Goal: dog(fifi)
6. Goal: well-fed(fifi)

Since these subgoals are asserted in clauses 2 and 3, the proof of each is trivial, and the original goal is proved. In general, many different rules might apply to a goal. They will be tried in turn until one succeeds. As another example, consider the following axioms:

All fish live in the sea.     7. live-in-sea(X) :- fish(X)
All cod are fish.             8. fish(X) :- cod(X)
All mackerel are fish.        9. fish(X) :- mackerel(X)
Whales live in the sea.       10. live-in-sea(X) :- whale(X)
Homer is a mackerel.          11. mackerel(homer) :-
Willie is a whale.            12. whale(willie) :-

Given these axioms, a system can prove that Willie lives in the sea using what is called a backtracking search. It uses a depth-first search strategy to systematically search through every possible sequence of applying clauses to see if the goal can be established. Figure B.10 shows a trace of a typical PROLOG search given the preceding clauses and the goal live-in-sea(willie).

G1: live-in-sea(willie)
  Trying clause 7, which after unification with G1 is live-in-sea(willie) :- fish(willie)
  This creates a new subgoal G2.
  G2: fish(willie)
    Rule 8 applies, giving fish(willie) :- cod(willie), so there is a new subgoal
    G3: cod(willie)
      x No rule applies; try other ways to prove G2
    Rule 9 applies, giving fish(willie) :- mackerel(willie), so there is a new subgoal
    G4: mackerel(willie)
      x No rule applies; try other ways to prove G2
    x No other rules apply to G2; try other ways to prove G1
  Rule 10 applies, giving live-in-sea(willie) :- whale(willie), so there is a new subgoal
  G5: whale(willie)
    Fact 12 unifies with G5
    ✓ Goal G5 is proved.
✓ Goal G1 is proved.

Figure B.10 A proof of live-in-sea(willie)

The best way to get a feeling for PROLOG is to play with the system and use the tracing facility to explore exactly how it operates. PROLOG systems are readily available for most types of workstations and personal computers.

B.5 The Unification Algorithm

This section presents the unification algorithm in detail. Remember that two arbitrary lists containing constants and variables unify if there is a set of bindings for the variables that make the lists identical. The values of variables can be stored in a data structure called the symbol table, or ST. Suppose you have an ST as follows:

variable | value
?x | A
?y | ?z
?z | B

This would mean that ?x has the value A, ?y has the value ?z, and ?z has the value B (thus ?y has the value B). Two functions need to be defined:

ADD-TO-ST(varname, value): adds a new entry onto the ST.
GET-VALUE(varname): returns the value for a variable.

GET-VALUE should check the symbol table repeatedly to find the most specific value for a variable. For example, GET-VALUE(?y) with the preceding ST should return B. If the variable has no entry on the ST, the variable name is returned. Thus, given the preceding table, GET-VALUE(?t) should return ?t.
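A minimal sketch of such a symbol table, kept as an association list, might look as follows. This is an illustration with assumed names rather than the book's implementation; GET-VALUE chases chains of entries so that a variable bound to another bound variable picks up that variable's value.

;; The symbol table is an association list of (variable . value) pairs,
;; initialized here to the example table above.

(defvar *st* '((?x . a) (?y . ?z) (?z . b)))

(defun add-to-st (varname value)
  (push (cons varname value) *st*))

(defun get-value (varname)
  "Follow chains of variable bindings to the most specific value."
  (let ((entry (assoc varname *st*)))
    (cond ((null entry) varname)          ; no entry: return the variable itself
          ((assoc (cdr entry) *st*)       ; bound to another bound variable
           (get-value (cdr entry)))
          (t (cdr entry)))))

;; (get-value '?y)  => B    (?y is bound to ?z, which is bound to B)
;; (get-value '?t)  => ?T   (no entry, so the variable name is returned)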
With these tools, the matching algorithm between two formulas T1 and T2 is defined in Figure B.11. Rather than rewriting the formulas each time a variable binding is found, the information is stored in the symbol table. If MATCH succeeds, it returns SUCCESS and the symbol table contains the variable values.

To MATCH T1 and T2:
1. If T1 is a variable, then assign T1 to GET-VALUE(T1)
2. If T2 is a variable, then assign T2 to GET-VALUE(T2)
3. If T1 = T2 then return SUCCESS
   else if T1 is a variable then ADD-TO-ST(T1, T2) and return SUCCESS
   else if T2 is a variable then ADD-TO-ST(T2, T1) and return SUCCESS
   else if T1 and T2 are both lists then
      if MATCH(First(T1), First(T2)) succeeds
      then return result from MATCH(Rest(T1), Rest(T2))
      else return FAIL
   else return FAIL

Figure B.11 A preliminary matching algorithm

Steps 1 and 2 in Figure B.11 find the values of variables if they are already known. Step 3 does the actual matching. If one of the formulas is still a variable, that variable can be bound to the value of the other formula. Finally, if T1 and T2 are both list structures, it checks each element to make sure each list matches. As an example, consider the trace of the algorithm in Figure B.12, matching the formula (A ?y ?z) with the formula (?x (B ?x) ?x), starting with an empty ST. The final result is SUCCESS, and the ST is set to the following:

variable | value
?x | A
?y | (B ?x)
?z | A

If you instantiate the variables in one of the formulas with the values in the ST, you will get the answer, which is (A (B A) A). Note that ?y is bound to (B ?x), but ?x is bound to A, so ?y is actually bound to the value (B A).

One minor problem is left to resolve. If you were to try to match the formula (P ?x ?x) with (P (f ?y) ?y) with the present algorithm, it would succeed with the following ST:

variable | value
?x | (f ?y)
?y | (f ?y)

MATCHING (A ?y ?z) with (?x (B ?x) ?x)
  Both are lists, so
  MATCHING A with ?x
    returns SUCCESS, and the ST now has the value A for ?x
  MATCHING (?y ?z) with ((B ?x) ?x)
    Both are lists, so
    MATCHING ?y with (B ?x)
      returns SUCCESS, and ST now has value (B ?x) for ?y
    MATCHING (?z) with (?x)
      Both are lists, so
      MATCHING ?z with ?x
        Step 2: ?x has value A