Database systems the complete book 2nd edition pdf solutions
It also exemplifies mastery of the technique of combining and balancing theory with practice, to give students their best chance at success. In database data models, there is usually a limited set of operations that can be performed. Ullman, Jennifer Widom. Category: Programming. Language: English. Description: Database Systems: The Complete Book is ideal for Database Systems and Database Design and Application courses offered at the junior, senior and graduate levels in Computer Science departments.
Recall that a data model is not just structure; it needs a way to query the data and to modify the data. The first half of the book provides in-depth coverage of databases from the point of view of the database designer, user, and application programmer. Each pair consists of a value of the key type T and a value of the range type S. A single byte can store integers between 0 and 255, so it is possible to represent a varying-length character string of up to 255 bytes by a single byte for the count of characters plus the bytes to store the string itself. There are many other features to the CREATE TABLE statement, including many forms of constraints that can be declared, and the declaration of indexes (data structures that speed up many operations on the table), but we shall leave those for the appropriate time. For tuples to pair successfully, they must agree in both the B and C components. This relation is shown in Fig. Thus, whenever we introduce a relation schema with a list of attributes, as above, we shall take this ordering to be the standard order whenever we display the relation or any of its rows. The second half of the book provides in-depth coverage of databases from the point of view of the DBMS implementor. We might wish to use the character?
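The one-byte count described above for varying-length character strings can be sketched in a few lines. This is only an illustration of the idea, not the storage format of any particular DBMS; the function names `encode_varchar` and `decode_varchar` and the choice of UTF-8 are assumptions made for the example.

```python
def encode_varchar(text: str) -> bytes:
    """Store a string as one count byte followed by the string's bytes."""
    data = text.encode("utf-8")  # assumed encoding, chosen for the example
    if len(data) > 255:
        # A single byte holds 0..255, so longer strings cannot use this layout.
        raise ValueError("string needs more than 255 bytes")
    return bytes([len(data)]) + data


def decode_varchar(buf: bytes) -> str:
    """Read the count byte, then exactly that many bytes of string data."""
    count = buf[0]
    return buf[1:1 + count].decode("utf-8")


encoded = encode_varchar("Star Wars")
decoded = decode_varchar(encoded)
```

The count byte costs one byte per value, but it lets a reader skip over a string without scanning for a terminator.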
A basic understanding of algebraic expressions and laws, logic, basic data structure, OOP concepts, and programming environments is implied. We show the schema for the relation with the relation name followed by a parenthesized list of its attributes.
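As a small sketch of the schema notation just described, the snippet below prints a relation name followed by a parenthesized attribute list and then declares the same relation with CREATE TABLE. It uses Python's built-in `sqlite3` module as a stand-in DBMS; the `Movies` attributes follow the movie example used in the book, while the column types and helper names are assumptions made for illustration.

```python
import sqlite3

# Attribute names with assumed SQL types (the types are not from the text).
MOVIES = [("title", "TEXT"), ("year", "INTEGER"),
          ("length", "INTEGER"), ("genre", "TEXT")]


def schema_text(name, attributes):
    """Relation name followed by a parenthesized list of its attributes."""
    return f"{name}({', '.join(attr for attr, _ in attributes)})"


def create_table_sql(name, attributes):
    """Build a CREATE TABLE statement that keeps the same attribute order."""
    cols = ", ".join(f"{attr} {sqltype}" for attr, sqltype in attributes)
    return f"CREATE TABLE {name} ({cols})"


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql("Movies", MOVIES))
# PRAGMA table_info reports one row per column; field 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Movies)")]
```

Here `schema_text("Movies", MOVIES)` yields `Movies(title, year, length, genre)`, and the declared table reports its columns in that same standard order.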
Database Systems: The Complete Book, 2nd Edition - The resulting query plan, or sequence of actions the DBMS will perform to answer the query, is passed to the execution engine.
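To make the notion of a query plan concrete, the sketch below asks SQLite (through Python's `sqlite3` module) to describe the plan it would pass to its own execution engine. SQLite is used here only as a convenient stand-in for a DBMS; the table, the index name `idx_movies_year`, and the helper `plan_for` are assumptions for the example, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Movies (title TEXT, year INTEGER, length INTEGER, genre TEXT)"
)
# An index gives the optimizer a faster alternative to scanning every row.
conn.execute("CREATE INDEX idx_movies_year ON Movies(year)")


def plan_for(connection, sql):
    """Return the human-readable steps of the plan chosen for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql)
    return [row[-1] for row in rows]  # the last column holds the description


plan = plan_for(conn, "SELECT title FROM Movies WHERE year = 1977")
for step in plan:
    print(step)
```

With the index in place, the reported step names `idx_movies_year`, showing that the compiler chose an index lookup over a full scan of the table.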
DATABASE SYSTEMS: The Complete Book, Second Edition. Hector Garcia-Molina Jeffrey D.
Editorial Director, Computer Science and Engineering: Marcia J. Horton. Executive Editor: Tracy Dunkelberger. Editorial Assistant: Melinda Haggerty. Director of Marketing: Margaret Waples. Marketing Manager: Christopher Kelly. Senior Managing Editor: Scott Disanno. Production Editor: Irwin Zucker. Art Director: Jayne Conte. Cover Designer: Margaret Kenselaar. Cover Art: Tamara L Newman. Manufacturing Buyer: Lisa McDowell. Manufacturing Manager: Alan Fischer.
© 2009, 2002 by Pearson Education Inc., Pearson Prentice Hall, Upper Saddle River, NJ 07458. All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher. The author and publisher of this book have used their best efforts in preparing this book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The author and publisher make no warranty of any kind, expressed or implied, with regard to these programs or the documentation contained in this book. The author and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs. Printed in the United States of America. Pearson Education North Asia Ltd. Pearson Education-Japan, Tokyo. Pearson Education Malaysia, Pte.
The introductory course, CS145, uses the first twelve chapters, and is designed for all students: those who want to use database systems as well as those who want to get involved in database implementation. The second course, CS245 on database implementation, covers most of the rest of the book. However, some material is covered in more detail in special topics courses.
These include CS346 (implementation project), which concentrates on query optimization as in Chapters 15 and 16. Also, CS345A, on data mining and Web mining, covers the material in the last two chapters.
Chapter 4 is devoted to high-level modeling. We also have moved to Chapter 4 a shorter version of the material on ODL, treating it as a design language for relational database schemas. The material on functional and multivalued dependencies has been modified and remains in Chapter 3. We have changed our viewpoint, so that a functional dependency is assumed to have a set of attributes on the right. We have augmented our discussion of third normal form to include the 3NF synthesis algorithm and to make clear what the tradeoff between 3NF and BCNF is.
Chapter 5 contains the coverage of relational algebra from the previous edition, and is joined by part of the treatment of Datalog from the old Chapter 10. The material on views and indexes has been moved to its own chapter, number 8, and this material has been augmented with a discussion of important new topics, including materialized views and automatic selection of indexes.
The new Chapter 9 is based on the old Chapter 8 (embedded SQL). It is introduced by a new section on 3-tier architecture. It also includes an expanded discussion of JDBC and new coverage of PHP.
Chapter 10 collects a number of advanced SQL topics. The discussion of authorization from the old Chapter 8 has been moved here, as has the discussion of recursive SQL from the old Chapter 10. Data cubes, from the old Chapter 20, are now covered here. The rest of the chapter is devoted to the nested-relation model from the old Chapter 4 and object-relational features of SQL from the old Chapter 9.
Then, Chapters 11 and 12 cover XML and systems based on XML. Except for material at the end of the old Chapter 4, which has been moved to Chapter 11, this material is all new.
Chapter 12 is devoted to programming, and it includes sections on XPath, XQuery, and XSLT.
Chapter 13 begins the study of database implementation. It covers disk storage and the file structures that are built on disks. This chapter is a condensation of material that, in the first edition, occupied Chapters 11 and 12.
Chapter 14 covers index structures, including B-trees, hashing, and structures for multidimensional indexes. This material also condenses two chapters, 13 and 14, from the first edition.
Chapters 15 and 16 cover query execution and query optimization, respectively. They are similar to the old chapters of the same numbers.
Chapter 17 covers logging, and Chapter 18 covers concurrency control; these chapters are also similar to the old chapters with the same numbers. Chapter 19 contains additional topics on concurrency: recovery, deadlocks, and long transactions. This material is a subset of the old Chapter 19.
Chapter 20 is on parallel and distributed databases. In addition to material on parallel query execution from the old Chapter 15 and material on distributed locking and commitment from the old Chapter 19, there are several new sections on distributed query execution: the map-reduce framework for parallel computation, peer-to-peer databases and their implementation of distributed hash tables.
Chapter 21 covers information integration. In addition to material on this subject from the old Chapter 20, we have added a section on local-as-view mediators and a section on entity resolution (finding records from several databases that refer to the same entity).
Chapter 22 is on data mining. Although there was some material on the subject in the old Chapter 20, almost all of this chapter is new. It covers association rules and frequent itemset mining, including both the famous A-Priori Algorithm and certain efficiency improvements.
Chapter 22 includes the key techniques of shingling, minhashing, and locality-sensitive hashing for finding similar items in massive databases. The chapter concludes with a study of clustering, especially for massive datasets.
Chapter 23, all new, addresses two important ways in which the Internet has impacted database technology. First is search engines, where we discuss algorithms for crawling the Web, the well-known PageRank algorithm for evaluating the importance of Web pages, and its extensions. This chapter also covers data-stream-management systems. We discuss the stream data model and SQL language extensions, and conclude with several interesting algorithms for executing queries on streams.
The formal prerequisites for the course are Sophomore-level treatments of: 1. Data structures, algorithms, and discrete math, and 2. Software systems, software engineering, and programming languages. Of this material, it is important that students have at least a rudimentary understanding of such topics as: algebraic expressions and laws, logic, basic data structures, object-oriented programming concepts, and programming environments. However, we believe that adequate background is acquired by the Junior year of a typical computer science program.
Exercises
The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point. We shall also make available there the sections from the first edition that have been removed from the second. In addition, there is an accompanying set of on-line homeworks and programming labs using a technology developed by Gradiance Corp. See the section following the Preface for details about the GOAL system. Instructors who want to use the system in their classes should contact their Prentice-Hall representative or request instructor authorization through the above Web site.
Acknowledgements
We would like to thank Donald Kossmann for helpful discussions, especially concerning XML and its associated programming systems. Also, Bobbie Cochrane assisted us in understanding trigger semantics for an earlier edition. It is our pleasure to acknowledge them all here. Marc Abromowitz, Joseph H. Adamski, Brad Adelberg, Gleb Ashimov, Donald Aingworth, Teresa Almeida, Brian Babcock, Bruce Baker, Yunfan Bao, Jonathan Becker, Margaret Benitez, Eberhard Bertsch, Larry Bonham, Phillip Bonnet, David Brokaw, Ed Burns, Alex Butler, Karen Butler, Mike Carey, Christopher Chan, Sudarshan Chawathe. Also Per Christensen, Ed Chang, Surajit Chaudhuri, Ken Chen, Rada Chirkova, Nitin Chopra, Lewis Church, Jr. Also John Fry, Chiping Fu, Tracy Fujieda, Prasanna Ganesan, Suzanne Garcia, Mark Gjol, Manish Godara, Seth Goldberg, Jeff Goldblat, Meredith Goldsmith, Luis Gravano, Gerard Guillemette, Himanshu Gupta, Petri Gynther, Zoltan Gyongyi, Jon Heggland, Rafael Hernandez, Masanori Higashihara, Antti Hjelt, Ben Holtzman, Steve Huntsberry. Also Sajid Hussain, Leonard Jacobson, Thulasiraman Jeyaraman, Dwight Joe, Brian Jorgensen, Mathew P. Johnson, Sameh Kamel, Jawed Karim, Seth Katz, Pedram Keyani, Victor Kimeli, Ed Knorr, Yeong-Ping Koh, David Koller, Gyorgy Kovacs, Phillip Koza, Brian Kulman, Bill Labiosa, Sang Ho Lee, Younghan Lee, Miguel Licona. Also Olivier Lobry, Chao-Jun Lu, Waynn Lue, John Manz, Arun Marathe, Philip Minami, Le-Wei Mo, Fabian Modoux, Peter Mork, Mark Mortensen, Ramprakash Narayanaswami, Hankyung Na, Mor Naaman, Mayur Naik, Marie Nilsson, Torbjorn Norbye, Chang-Min Oh, Mehul Patel, Soren Peen, Jian Pei. Also Xiaobo Peng, Bert Porter, Limbek Reka, Prahash Ramanan, Nisheeth Ranjan, Suzanne Rivoire, Ken Ross, Tim Roughgarten, Mema Roussopoulos, Richard Scherl, Loren Shevitz, Shrikrishna Shrin, June Yoshiko Sison, Man Cho A. So, Elizabeth Stinson, Qi Su, Ed Swierk, Catherine Tornabene, Anders Uhl, Jonathan Ullman, Mayank Upadhyay.
Also Anatoly Varakin, Vassilis Vassalos, Krishna Venuturimilli, Vikram Vijayaraghavan, Terje Viken, Qiang Wang, Steven Whang, Mike Wiacek, Kristian Widjaja, Janet Wu, Sundar Yamunachari, Takeshi Yokukawa, Bing Yu, Min-Sig Yun, Torben Zahle, Sandy Zhang. The remaining errors are ours, of course.
GOAL is designed to minimize student frustration while providing an interactive teaching experience outside the classroom. GOAL delivers immediate assessment and feedback via two kinds of assignments: multiple choice homework exercises and interactive lab projects. The homework consists of a set of multiple choice questions designed to test student knowledge of a solved problem. When answers are graded as incorrect, students are given a hint and directed back to a specific section in the course textbook for helpful information. By testing the code and providing immediate feedback, GOAL lets you know exactly which concepts the students have grasped and which ones need to be revisited. In addition, the GOAL package specific to this book includes programming exercises in SQL and XQuery. Submitted queries are tested for correctness, and incorrect results lead to examples of where the query goes wrong. Students can try as many times as they like, but writing queries that respond correctly to the examples is not sufficient to get credit for the problem. Instructors should contact their local Pearson Sales Representative for sales and ordering information for the GOAL Student Access Code and textbook value package.
About the Authors
HECTOR GARCIA-MOLINA is the L. Lerner Professor of Computer Science and Electrical Engineering at Stanford University. His research interests include digital libraries, information integration, and database application on the Internet. He currently serves on the Board of Directors of Oracle Corp.
ULLMAN is the Stanford W. Ascherman Professor of Computer Science emeritus at Stanford University.
He is the author or co-author of 16 books, including Elements of ML Programming (Prentice Hall, 1998). His research interests include data mining, information integration, and electronic education. He is a member of the National Academy of Engineering, and recipient of a Guggenheim Fellowship, the Karl V. Karlstrom Outstanding Educator Award, the SIGMOD Contributions and Edgar F. Codd Innovations Awards, and the Knuth Prize.
JENNIFER WIDOM is Professor of Computer Science and Electrical Engineering at Stanford University. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering. She received the ACM SIGMOD Edgar F. Codd Innovations Award in 2007 and was a Guggenheim Fellow in 2000, and she has served on a variety of program committees, advisory boards, and editorial boards.
Overview of a Database Management System
Outline of Database-System Studies
References for Chapter 1
Whenever you visit a major Web site, such as Google or Yahoo!, there is a database behind the scenes. Corporations maintain all their important records in databases. Databases are likewise found at the core of many scientific investigations. They represent the data gathered by astronomers, by investigators of the human genome, and by biochemists exploring properties of proteins, among many other scientific activities. Database management systems (DBMSs) are among the most complex types of software available. In this book, we shall learn how to design databases, how to write programs in the various languages associated with a DBMS, and how to implement the DBMS itself.
In essence a database is nothing more than a collection of information that exists over a long period of time, often many years. In common parlance, the term database refers to a collection of data that is managed by a DBMS. The DBMS is expected to: 1.
Allow users to create new databases and specify their schemas (logical structure of the data), using a specialized data-definition language. 2. Give users the ability to query and modify the data, using a query language or data-manipulation language. 3. Support the storage of very large amounts of data (many terabytes or more) over a long period of time, allowing efficient access to the data for queries and database modifications. 4. Enable durability, the recovery of the database in the face of failures, errors of many kinds, or intentional misuse. 5. Control access to data from many users at once, without allowing unexpected interactions among users (called isolation) and without actions on the data being performed partially but not completely (called atomicity).
These systems evolved from file systems, which provide some of item 3 above; file systems store data over a long period of time, and they allow the storage of large amounts of data. Further, file systems do not directly support item 2, a query language for the data in files. Their support for item 1, a schema for the data, is limited to the creation of directory structures for files. Item 4 is not always supported by file systems; you can lose data that has not been backed up. Finally, file systems do not satisfy item 5. While they allow concurrent access to files by several users or processes, a file system generally will not prevent situations such as two users modifying the same file at about the same time, so the changes made by one user fail to appear in the file.
Examples of these applications are: 1. Banking systems: maintaining accounts and making sure that system failures do not cause money to disappear. 2. Airline reservation systems: these, like banking systems, require assurance that data will not be lost, and they must accept very large volumes of small actions by customers. 3. Corporate record keeping: employment and tax records, inventories, sales records, and a great variety of other types of information, much of it critical.
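The banking example, where a failure must not cause money to disappear, is the classic case for atomicity. The sketch below uses Python's `sqlite3` module as a stand-in DBMS; the `Accounts` table, its CHECK constraint, and the helper names are assumptions made for illustration. The transfer deliberately credits the destination first, so a failed debit would leave money duplicated unless the whole transaction is rolled back.

```python
import sqlite3


def make_bank():
    """Create a tiny in-memory bank with two accounts (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Accounts ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates take effect, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit was undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM Accounts ORDER BY id"))


bank = make_bank()
ok = transfer(bank, 1, 2, 30)       # succeeds
failed = transfer(bank, 1, 2, 500)  # overdraws account 1, so it is rolled back
```

After both calls the total across the two accounts is unchanged, which is exactly the guarantee a file system alone does not give.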
These database systems used several different data models. For example, the CODASYL query language had statements that allowed the user to jump from data element to data element, through a graph of pointers among these elements. There was considerable effort needed to write such programs, even for very simple queries.
Codd proposed that database systems should present the user with a view of data organized as tables called relations. Behind the scenes, there might be a complex data structure that allowed rapid response to a variety of queries. But, unlike the programmers for earlier database systems, the programmer of a relational system would not be concerned with the storage structure. Queries could be expressed in a very high-level language, which greatly increased the efficiency of database programmers. We shall cover the relational model of database systems throughout most of this book. By 1990, relational database systems were the norm. Yet the database field continues to evolve, and new issues and approaches to the management of data surface regularly. Object-oriented features have infiltrated the relational model. Some of the largest databases are organized rather differently from those using relational methodology. In the balance of this section, we shall consider some of the modern trends in database systems.
The size was necessary, because to store a gigabyte of data required a large computer system. Today, hundreds of gigabytes fit on a single disk, and it is quite feasible to run a DBMS on a personal computer. Thus, database systems based on the relational model have become available for even very small machines, and they are beginning to appear as a common tool for computer applications, much as spreadsheets and word processors did before them. Another important trend is the use of documents, often tagged using XML (Extensible Markup Language).
Large collections of small documents can serve as a database, and the methods of querying and manipulating them are different from those used in relational systems.
Footnote: CODASYL Data Base Task Group April 1971 Report, ACM, New York.
Corporate databases routinely store terabytes (10^12 bytes). Yet there are many databases that store petabytes (10^15 bytes) of data and serve it all to users. Some important examples: 1. Google holds petabytes of data gleaned from its crawl of the Web. This data is not held in a traditional DBMS, but in specialized structures optimized for search-engine queries. 2. Satellites send down petabytes of information for storage in specialized systems. 3. A picture is actually worth way more than a thousand words. You can store 1000 words in five or six thousand bytes. Storing a picture typically takes much more space. Repositories such as Flickr store millions of pictures and support search of those pictures. 4. And if still pictures consume space, movies consume much more. An hour of video requires at least a gigabyte. Sites such as YouTube hold hundreds of thousands, or millions, of movies and make them available easily. 5. Peer-to-peer file-sharing systems use large networks of conventional computers to store and distribute data of various kinds. Although each node in the network may only store a few hundred gigabytes, together the database they embody is enormous.
For example, a large company has many divisions. Each division may have built its own database of products or employee records independently of other divisions. Perhaps some of these divisions used to be independent companies, which naturally had their own way of doing things. They may use different terms to mean the same thing or the same term to mean different things. To make matters worse, the existence of legacy applications using each of these databases makes it almost impossible to scrap them, ever.
As a result, it has become necessary with increasing frequency to build structures on top of existing databases, with the goal of integrating the information distributed among them. One popular approach is the creation of data warehouses, where information from many legacy databases is copied periodically, with the appropriate translation, to a central database.
Single boxes represent system components, while double boxes represent in-memory data structures. The solid lines indicate control and data flow, while dashed lines indicate data flow only. Since the diagram is complicated, we shall consider the details in several stages. First, at the top, we suggest that there are two distinct sources of commands to the DBMS: 1. Conventional users and application programs that ask for data or modify data. 2. A database administrator: a person or persons responsible for the structure or schema of the database.
The DBA might also decide that the only allowable grades are A, B, C, D, and F. This structure and constraint information is all part of the schema of the database. It is shown in Fig. A user or an application program initiates some action, using the data-manipulation language (DML). This command does not affect the schema of the database, but may affect the content of the database. DML statements are handled by two separate subsystems, as follows.
Answering the Query
The query is parsed and optimized by a query compiler. The resulting query plan, or sequence of actions the DBMS will perform to answer the query, is passed to the execution engine. The execution engine issues a sequence of requests for small pieces of data, typically records or tuples of a relation, to a resource manager that knows about data files (holding relations), the format and size of records in those files, and index files, which help find elements of data files quickly.
The requests for data are passed to the buffer manager. The buffer manager communicates with a storage manager to get data from disk. The storage manager might involve operating-system commands, but more typically, the DBMS issues commands directly to the disk controller.
Transaction Processing
Queries and other DML actions are grouped into transactions, which are units that must be executed atomically and in isolation from one another. Any query or modification action can be a transaction by itself. In addition, the execution of transactions must be durable, meaning that the effect of any completed transaction must be preserved even if the system fails in some way right after completion of the transaction. We divide the transaction processor into two major parts: 1. A concurrency-control manager, or scheduler, responsible for assuring atomicity and isolation of transactions, and 2. A logging and recovery manager, responsible for the durability of transactions.
However, to perform any useful operation on data, that data must be in main memory. It is the job of the storage manager to control the placement of data on disk and its movement between disk and main memory. In a simple database system, the storage manager might be nothing more than the file system of the underlying operating system. The storage manager keeps track of the location of files on the disk and obtains the block or blocks containing a file on request from the buffer manager. The buffer manager is responsible for partitioning the available main memory into buffers, which are page-sized regions into which disk blocks can be transferred. Thus, all DBMS components that need information from the disk will interact with the buffers and the buffer manager, either directly or through the execution engine. The kinds of information that various components may need include: 1. Data: the contents of the database itself.
2. Metadata: the database schema that describes the structure of, and constraints on, the database. 3. Log Records: information about recent changes to the database; these support durability of the database. 4. Statistics: information gathered and stored by the DBMS about data properties such as the sizes of, and values in, various relations or other components of the database. 5. Indexes: data structures that support efficient access to the data.
In addition, a DBMS offers the guarantee of durability: that the work of a completed transaction will never be lost. The transaction manager therefore accepts transaction commands from an application, which tell the transaction manager when transactions begin and end, as well as information about the expectations of the application (some may not wish to require atomicity, for example). The transaction processor performs the following tasks: 1. Logging: In order to assure durability, every change in the database is logged separately on disk. The log manager initially writes the log in buffers and negotiates with the buffer manager to make sure that buffers are written to disk (where data can survive a crash) at appropriate times. 2. Concurrency control: Transactions must appear to execute in isolation. Transactions are expected to preserve the consistency of the database. Thus, the scheduler (concurrency-control manager) must assure that the individual actions of multiple transactions are executed in such an order that the net effect is the same as if the transactions had in fact executed in their entirety, one-at-a-time. A typical scheduler does its work by maintaining locks on certain pieces of the database. These locks prevent two transactions from accessing the same piece of data in ways that interact badly. Locks are generally stored in a main-memory lock table, as suggested by Fig.
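A main-memory lock table of the kind mentioned above can be sketched as a dictionary from database items to the transactions holding locks on them. The text does not say which lock modes are used, so the shared ("S") and exclusive ("X") modes, the class name `LockTable`, and its methods are all assumptions for illustration; a real scheduler would also queue waiting transactions rather than simply refusing.

```python
from collections import defaultdict


class LockTable:
    """Toy lock table: item -> {transaction: mode}, with modes 'S' and 'X'."""

    def __init__(self):
        self._holders = defaultdict(dict)

    def acquire(self, txn, item, mode):
        """Return True if the lock is granted, False if txn would have to wait."""
        if mode not in ("S", "X"):
            raise ValueError("mode must be 'S' or 'X'")
        holders = self._holders[item]
        others = [m for t, m in holders.items() if t != txn]
        if mode == "S":
            # Readers may share an item as long as nobody else is writing it.
            compatible = all(m == "S" for m in others)
        else:
            # A writer needs the item to itself.
            compatible = not others
        if not compatible:
            return False
        if holders.get(txn) != "X":  # never downgrade an exclusive lock
            holders[txn] = mode
        return True

    def release_all(self, txn):
        """Drop every lock held by txn, e.g. when it commits or aborts."""
        for holders in self._holders.values():
            holders.pop(txn, None)


table = LockTable()
first_read = table.acquire("T1", "A", "S")
second_read = table.acquire("T2", "A", "S")
blocked_write = table.acquire("T2", "A", "X")
table.release_all("T1")
upgraded_write = table.acquire("T2", "A", "X")
blocked_read = table.acquire("T1", "A", "S")
```

Two readers coexist, but a write is refused until the other reader releases its lock, which is the kind of bad interaction the locks are meant to prevent.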
The scheduler affects the execution of queries and other database operations by forbidding the execution engine from accessing locked parts of the database. 3. Deadlock resolution: As transactions compete for resources through the locks that the scheduler grants, they can get into a situation where none can proceed because each needs something another transaction has.
1. The query compiler, which translates the query into an internal form called a query plan. The latter is a sequence of operations to be performed on the data. The query compiler consists of three major units: (a) A query parser, which builds a tree structure from the textual form of the query. The query compiler uses metadata and statistics about the data to decide which sequence of operations is likely to be the fastest. For example, the existence of an index, which is a specialized data structure that facilitates access to data, given values for one or more components of that data, can make one plan much faster than another. 2. The execution engine, which has the responsibility for executing each of the steps in the chosen query plan. The execution engine interacts with most of the other components of the DBMS, either directly or through the buffers. It must get the data from the database into buffers in order to manipulate that data. It needs to interact with the scheduler to avoid accessing data that is locked, and with the log manager to make sure that all database changes are properly logged.
This section is an outline of what to expect in each of these units.
Part I: Relational Database Modeling
The relational model is essential for a study of database systems. After examining the basic concepts, we delve into the theory of relational databases. That study includes functional dependencies, a formal way of stating that one kind of data is uniquely determined by another.
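The idea that one kind of data is uniquely determined by another can be checked mechanically on a given set of rows. The sketch below is a minimal illustration, not an algorithm from the book; the function name `satisfies_fd` and the sample rows are assumptions made for the example.

```python
def satisfies_fd(rows, lhs, rhs):
    """Check whether the attributes in lhs uniquely determine those in rhs.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The first value seen for a key is remembered; any later row with
        # the same key but a different value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Sample rows, for illustration only.
movies = [
    {"title": "Star Wars", "year": 1977, "length": 124, "genre": "sciFi"},
    {"title": "Wayne's World", "year": 1992, "length": 95, "genre": "comedy"},
]
holds = satisfies_fd(movies, ["title", "year"], ["length", "genre"])

# A second row with the same title and year but a different length breaks it.
conflicting = movies + [
    {"title": "Star Wars", "year": 1977, "length": 130, "genre": "sciFi"}
]
violated = not satisfies_fd(conflicting, ["title", "year"], ["length"])
```

Note that passing such a check on one instance only shows the data is consistent with the dependency; whether the dependency holds in general is a statement about the schema.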
It also includes normalization, the process whereby functional dependencies and other formal dependencies are used to improve the design of a relational database. We also consider high-level design notations. Their purpose is to allow informal exploration of design issues before we implement the design using a relational DBMS.
Part II: Relational Database Programming
We then take up the matter of how relational databases are queried and modified. After an introduction to abstract programming languages based on algebra and logic (Relational Algebra and Datalog, respectively), we turn our attention to the standard language for relational databases: SQL. We study both the basics and important special topics, including constraint specifications and triggers (active database elements), indexes and other structures to enhance performance, forming SQL into transactions, and security and privacy of data in SQL. We also discuss how SQL is used in complete systems. It is typical to combine SQL with a conventional or host language and to pass data between the database and the conventional program via SQL calls. We discuss a number of ways to make this connection, including embedded SQL, Persistent Stored Modules (PSM), Call-Level Interface (CLI), Java Database Connectivity (JDBC), and PHP.
Part III: Semistructured Data Modeling and Programming
The pervasiveness of the Web has put a premium on the management of hierarchically structured data, because the standards for the Web are based on nested, tagged elements (semistructured data). We introduce XML and its schema-defining notations: Document Type Definitions (DTD) and XML Schema. We also examine three query languages for XML: XPath, XQuery, and Extensible Stylesheet Language Transformations (XSLT).
Part IV: Database System Implementation
We begin with a study of storage management: how disk-based storage can be organized to allow efficient access to data. We explain the commonly used B-tree, a balanced tree of disk blocks, and other specialized schemes for managing multidimensional data. We then turn our attention to query processing. There are two parts to this study. First, we need to learn query execution: the algorithms used to implement the operations from which queries are built. Since data is typically on disk, the algorithms are somewhat different from what one would expect when studying the same problems under the assumption that data is in main memory. The second step is query compiling. Here, we study how to select an efficient query plan from among all the possible ways in which a given query can be executed.
Then, we study transaction processing. There are several threads to follow. One concerns logging: maintaining reliable records of what the DBMS is doing, in order to allow recovery in the event of a crash. Another thread is scheduling: controlling the order of events in transactions to assure the ACID properties. We also consider how to deal with deadlocks, and the modifications to our algorithms that are needed when a transaction is distributed over many independent sites.
We consider how search engines work, and the specialized data structures that make their operation possible. We look at information integration, and methodologies for making databases share their data seamlessly. Data mining is a study that includes a number of interesting and important algorithms for processing large amounts of data in complex ways. Data-stream systems deal with data that arrives at the system continuously, and whose queries are answered continuously and in a timely fashion. Peer-to-peer systems present many challenges for management of distributed data held by independent hosts.
Thus, in this book, we shall not try to be exhaustive in our citations, but rather shall mention only the papers of historical importance and major secondary sources or useful surveys. Each was an early relational system and helped establish this type of system as the dominant database technology. It also has references to earlier reports of this type.
Chapter 2. The Relational Model of Data
We give the basic terminology for relations and show how the model can be used to represent typical forms of data. We then introduce a portion of the language SQL: that part used to declare relations and their structure. The chapter closes with an introduction to relational algebra. We see how this notation serves both as a query language (the aspect of a data model that enables us to ask questions about the data) and as a constraint language (the aspect of a data model that lets us restrict the data in the database in various ways).
In this brief summary of the concept, we define some basic terminology and mention the most important data models. A data model is a notation for describing data or information. The description generally consists of three parts:
1. Structure of the data. The data structures used to implement data in the computer are sometimes referred to, in discussions of database systems, as a physical data model, although in fact they are far removed from the gates and electrons that truly serve as the physical implementation of the data. In the database world, data models are at a somewhat higher level than data structures, and are sometimes referred to as a conceptual model to emphasize the difference in level. We shall see examples shortly.
2. Operations on the data.
In programming languages, operations on the data are generally anything that can be programmed. In database data models, there is usually a limited set of operations that can be performed. We are generally allowed to perform a limited set of queries (operations that retrieve information) and modifications (operations that change the database). This limitation is not a weakness, but a strength. By limiting operations, it is possible for programmers to describe database operations at a very high level, yet have the database management system implement the operations efficiently. In comparison, it is generally impossible to optimize programs in conventional languages like C, to the extent that an inefficient algorithm e.
3. Constraints on the data. Database data models usually have a way to describe limitations on what the data can be. These constraints can range from the simple e.
The relational model, including object-relational extensions.
The semistructured-data model, including XML and related standards.
The first, which is present in all commercial database management systems, is the subject of this chapter. We turn to this data model starting in Chapter 11. We shall discuss this model beginning in Section 2.
This relation, or table, describes movies: their title, the year in which they were made, their length in minutes, and the genre of the movie. We show three particular movies, but you should imagine that there are many more rows to this table, one row for each movie ever made, perhaps.
The structure portion of the relational model might appear to resemble an array of structs in C, where the column headers are the field names, and each 2. However, it must be emphasized that this physical implementation is only one possible way the table could be implemented in physical data structures. In fact, it is not the normal way to represent relations, and a large portion of the study of database systems addresses the right ways to implement such tables.
Much of the distinction comes from the scale of relations: they are not normally implemented as main-memory structures, and their proper physical implementation must take into account the need to access relations of very large size that are resident on disk.
These operations are table-oriented. As an example, we can ask for all those rows of a relation that have a certain value in a certain column. For example, we can ask of the table in Fig.
However, as a brief sample of what kinds of constraints are generally used, we could decide that there is a fixed list of genres for movies, and that the last column of every row must have a value that is on this list. Or we might decide (incorrectly, it turns out) that there could never be two movies with the same title, and constrain the table so that no two rows could have the same string in the first component.
The principal manifestation of this viewpoint today is XML, a way to represent data by hierarchically nested tagged elements. The tags, similar to those used in HTML, define the role played by different pieces of data, much as the column headers do in the relational model. For example, the same data as in Fig.
The operations on semistructured data usually involve following paths in the implied tree from an element to one or more of its nested subelements, then to subelements nested within those, and so on. For example, starting at the outer element, the entire document in Fig.
Figure 2. [XML movie data: 1939, 231, drama; 1977, 124, sciFi; 1992, 95, comedy]
Constraints on the structure of data in this model often involve the data type of values associated with a tag. For instance, are the values associated with the tag integers or can they be arbitrary character strings? Other constraints determine which tags can appear nested within which other tags. For example, must each element have a element nested within it? What other tags, besides those shown in Fig. Can there be more than one genre for a movie?
These and other matters will be taken up in Section 11.
A modern trend is to add object-oriented features to the relational model. There are two effects of object-orientation on relations:
1. Values can have structure, rather than being elementary types such as integers or strings, as they were in Fig.
2. Relations can have associated methods.
In a sense, these extensions, called the object-relational model, are analogous to the way structs in C were extended to objects in C++. We shall introduce the object-relational model in Section 10.
There are even database models of the purely object-oriented kind. In these, the relation is no longer the principal data-structuring concept, but becomes only one option among many structures. We discuss an object-oriented database model in Section 4.
The hierarchical model was, like semistructured data, a tree-oriented model. Its drawback was that unlike more modern models, it really operated at the physical level, which made it impossible for programmers to write code at a conveniently high level. Another such model was the network model, which was a graph-oriented, physical-level model. However, the generality of graphs was built directly into the network model, rather than favoring trees as these other models do.
This difference becomes even more apparent when we discuss, as we shall, how full graph structures are embedded into tree-like, semistructured models. A brief argument follows. Because databases are large, efficiency of access to data and efficiency of modifications to that data are of great importance. Also very important is ease of use: the productivity of programmers who use the data. Surprisingly, both goals can be achieved with a model, particularly the relational model, that:
1. Provides a simple, limited approach to structuring data, yet is reasonably versatile, so anything can be modeled.
2. Provides a limited, yet useful, collection of operations on data.
Together, these limitations turn into features. They allow us to implement languages, such as SQL, that enable programmers to express their wishes at a very high level. A few lines of SQL can do the work of thousands of lines of C, or hundreds of lines of the code that had to be written to access data under earlier models such as network or hierarchical. Yet the short SQL programs, because they use a strongly limited set of operations, can be optimized to run as fast as, or faster than, the code written in alternative languages.
The rows each represent a movie, and the columns each represent a property of movies. In this section, we shall introduce the most important terminology regarding relations, and illustrate it with the Movies relation.
Attributes appear at the tops of the columns. Usually, an attribute describes the meaning of entries in the column below. For instance, the column with attribute length holds the length, in minutes, of each movie.
We show the schema for the relation with the relation name followed by a parenthesized list of its attributes. Thus, the schema for relation Movies of Fig. Thus, whenever we introduce a relation schema with a list of attributes, as above, we shall take this ordering to be the standard order whenever we display the relation or any of its rows.
In the relational model, a database consists of one or more relations. The set of schemas for the relations of a database is called a relational database schema, or just a database schema.
A tuple has one component for each attribute of the relation. For instance, the first of the three tuples in Fig. When we wish to write a tuple
Conventions for Relations and Attributes
We shall generally follow the convention that relation names begin with a capital letter, and attribute names begin with a lower-case letter.
However, later in this book we shall talk of relations in the abstract, where the names of attributes do not matter. In that case, we shall use single capital letters for both relations and attributes, e.
For example, (Gone With the Wind, 1939, 231, drama) is the first tuple of Fig. Notice that when a tuple appears in isolation, the attributes do not appear, so some indication of the relation to which the tuple belongs must be given. We shall always use the order in which the attributes were listed in the relation schema.
It is not permitted for a value to be a record structure, set, list, array, or any other type that reasonably can have its values broken into smaller components. It is further assumed that associated with each attribute of a relation is a domain, that is, a particular elementary type. The components of any tuple of the relation must have, in each component, a value that belongs to the domain of the corresponding column. For example, tuples of the Movies relation of Fig.
It is possible to include the domain, or data type, for each attribute in a relation schema. We shall do so by appending a colon and a type after attributes. For example, we could represent the schema for the Movies relation as:
Movies(title:string, year:integer, length:integer, genre:string)
A relation is a set of tuples, not a list of tuples. Thus the order in which the tuples of a relation are presented is immaterial. For example, we can list the three tuples of Fig.
Moreover, we can reorder the attributes of the relation as we choose, without changing the relation. However, when we reorder the relation schema, we must be careful to remember that the attributes are column headers. Thus, when we change the order of the attributes, we also change the order of their columns. When the columns move, the components of tuples change their order as well. The result is that each tuple has its components permuted in the same way as the attributes are permuted.
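The two equivalences just described (tuple order is immaterial, and attributes may be permuted as long as every tuple is permuted the same way) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a relation is modeled as a schema tuple plus a Python set of tuples, the helper names reorder and as_mappings are invented here, and the titles of the second and third movies are assumed, since only the first tuple is spelled out in the text.

```python
# A relation modeled as a schema (attribute order) plus a set of tuples.
movies_schema = ("title", "year", "length", "genre")
movies = {
    ("Gone With the Wind", 1939, 231, "drama"),
    ("Star Wars", 1977, 124, "sciFi"),        # title is an assumption
    ("Wayne's World", 1992, 95, "comedy"),    # title is an assumption
}


def reorder(schema, rows, new_schema):
    """Permute every tuple the same way the attributes are permuted."""
    idx = [schema.index(a) for a in new_schema]
    return tuple(new_schema), {tuple(r[i] for i in idx) for r in rows}


def as_mappings(schema, rows):
    """Order-independent view: each tuple as a set of (attribute, value) pairs."""
    return {frozenset(zip(schema, r)) for r in rows}


# Listing the tuples in a different order yields the same set.
listed_backwards = set(reversed(sorted(movies)))

# Reordering the attributes permutes the components of every tuple.
new_schema, new_rows = reorder(
    movies_schema, movies, ("year", "genre", "title", "length")
)
same_relation = as_mappings(movies_schema, movies) == as_mappings(new_schema, new_rows)
```

Comparing the two versions through as_mappings rather than directly is a deliberate choice: the raw tuples differ after the permutation, but the attribute-to-value pairing, which is what the relation actually records, does not.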
We expect to insert tuples for new movies, as these appear. We also expect changes to existing tuples if we get revised or corrected information about a movie, and perhaps deletion of tuples for movies that are expelled from the database for some reason.
It is less common for the schema of a relation to change. However, there are situations where we might want to add or delete attributes. Schema changes, while possible in commercial database systems, can be very expensive, because each of perhaps millions of tuples needs to be rewritten to add or delete components. Also, if we add an attribute, it may be difficult or even impossible to generate appropriate values for the new component in the existing tuples.
We shall call a set of tuples for a given relation an instance of that relation. For example, the three tuples shown in Fig. Presumably, the relation Movies has changed over time and will continue to change over time.
We shall defer much of the discussion of constraints until Chapter 7. However, one kind of constraint is so fundamental that we shall introduce it here: key constraints. A set of attributes forms a key for a relation if we do not allow two tuples in a relation instance to have the same values in all the attributes of the key.
For example, there are three movies named King Kong, each made in a different year. It should also be obvious that year by itself is not a key, since there are usually many movies made in the same year.
For instance, the Movies relation could have its schema written as:
Movies(title, year, length, genre)
Remember that the statement that a set of attributes forms a key for a relation is a statement about all possible instances of the relation, not a statement about a single instance. For example, looking only at the tiny relation of Fig. However, we can easily imagine that if the relation instance contained more movies, there would be many dramas, many comedies, and so on.
Thus, there would be distinct tuples that agreed on the genre component. As a consequence, it would be incorrect to assert that genre is a key for the relation Movies.
While we might be sure that title and year can serve as a key for Movies, many real-world databases use artificial keys, doubting that it is safe to make any assumption about the values of attributes outside their control. Thus, the employee-ID attribute can serve as a key for a relation about employees.
In US corporations, it is normal for every employee to have a Social-Security number. If the database has an attribute that is the Social-Security number, then this attribute can also serve as a key for employees.
The idea of creating an attribute whose purpose is to serve as a key is quite widespread. students in a university. You undoubtedly can find more examples of attributes created for the primary purpose of serving as keys.
The topic is movies, and it builds on the relation Movies that has appeared so far in examples. The database schema is shown in Fig. Here are the things we need to know to understand the intention of this schema.
Movies
This relation is an extension of the example relation we have been discussing so far. Remember that its key is title and year together. We have added two new attributes: studioName tells us the studio that owns the movie, and producerC is an integer that represents the producer of the movie in a way that we shall discuss when we talk about the relation MovieExec below.
MovieStar
This relation tells us something about stars. The key is name, the name of the movie star. It is not usual to assume names of persons are unique and therefore suitable as a key. However, movie stars are different; one would never take a name that some other movie star had used. Thus, we shall use the convenient fiction that movie-star names are unique.
A more conventional approach would be to invent a serial number of some sort, like Social-Security numbers, so that we could assign each individual a unique number and use that attribute as the key. We take that approach for movie executives, as we shall see. Another interesting point about the MovieStar relation is that we see two new data types. The gender can be a single character, M or F.
StarsIn
This relation connects movies to the stars of that movie, and likewise connects a star to the movies in which they appeared. Notice that movies are represented by the key for Movies (the title and year), although we have chosen different attribute names to emphasize that attributes movieTitle and movieYear represent the movie. Likewise, stars are represented by the key for MovieStar, with the attribute called starName. Finally, notice that all three attributes are necessary to form a key. It is perfectly reasonable to suppose that relation StarsIn could have two distinct tuples that agree in any two of the three attributes. For instance, a star might appear in two movies in one year, giving rise to two tuples that agreed in movieYear and starName, but disagreed in movieTitle.
MovieExec
This relation tells us about movie executives. It contains their name, address, and networth as data about the executive. These are integers; a different one is assigned to each executive.
acctNo   type       balance
12345    savings    12000
23456    checking   1000
34567    savings    25
The relation Accounts
firstName   lastName   idNo      account
Robbie      Banks      901-222   12345
Lena        Hand       805-333   12345
Lena        Hand       805-333   23456
The relation Customers
Figure 2.
We rely on no two studios having the same name, and therefore use name as the key. The other attributes are the address of the studio and the certificate number for the president of the studio. We assume that the studio president is surely a movie executive and therefore appears in MovieExec.
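The claim that StarsIn needs all three attributes for its key can be checked mechanically on a sample instance. The sketch below is only an illustration: the function name holds_as_key and the instance data (placeholder titles and star names) are hypothetical. Note also that a check like this can only refute a proposed key; because a key is a statement about all possible instances, no single instance can prove one.

```python
def holds_as_key(schema, rows, key):
    """True if no two distinct tuples in this instance agree on every key attribute."""
    idx = [schema.index(a) for a in key]
    seen = set()
    for row in rows:
        key_value = tuple(row[i] for i in idx)
        if key_value in seen:
            return False  # two tuples share the same key value
        seen.add(key_value)
    return True


stars_in_schema = ("movieTitle", "movieYear", "starName")
# Hypothetical instance: placeholder titles and star names.
stars_in = {
    ("Movie A", 2000, "Star X"),
    ("Movie B", 2000, "Star X"),  # same star, same year, different movie
    ("Movie A", 2000, "Star Y"),  # same movie, different star
    ("Movie A", 2001, "Star X"),  # same title and star, different year
}

# Every pair of attributes is refuted as a key by this instance.
pair_results = {
    pair: holds_as_key(stars_in_schema, stars_in, pair)
    for pair in [
        ("movieYear", "starName"),
        ("movieTitle", "movieYear"),
        ("movieTitle", "starName"),
    ]
}

# All three attributes together are consistent with being a key.
all_three = holds_as_key(stars_in_schema, stars_in, stars_in_schema)
```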
Indicate the following:
a) The attributes of each relation.
Exercise 2. Give some additional examples.
Exercise 2.
Defining a Relation Schema in SQL
There is a current standard for SQL, called SQL-99. Most commercial database management systems implement something similar, but not identical to, the standard. There are two aspects to SQL:
1. The Data-Definition sublanguage for declaring database schemas and
2. The Data-Manipulation sublanguage for querying (asking questions about) databases and for modifying the database.
The distinction between these two sublanguages is found in most languages; e. These correspond to data-definition and data-manipulation, respectively. In this section we shall begin a discussion of the data-definition portion of SQL. There is more on the subject in Chapter 7, especially the matter of constraints on data. The data-manipulation portion is covered extensively in Chapter 6.
1. Stored relations, which are called tables. These are the kind of relation we deal with ordinarily: a relation that exists in the database and that can be modified by changing its tuples, as well as queried.
2. Views, which are relations defined by a computation. These relations are not stored, but are constructed, in whole or in part, when needed. They are the subject of Section 8.
3. Temporary tables, which are constructed by the SQL language processor when it performs its job of executing queries and data modifications. These relations are then thrown away and not stored.
In this section, we shall learn how to declare tables. We do not treat the declaration and definition of views here, and temporary tables are never declared. The SQL CREATE TABLE statement declares the schema for a stored relation. It gives a name for the table, its attributes, and their data types. It also allows us to declare a key, or even several keys, for a relation.
There are many other features to the CREATE TABLE statement, including many forms of constraints that can be declared, and the declaration of indexes (data structures that speed up many operations on the table), but we shall leave those for the appropriate time.
All attributes must have a data type.
Character strings of fixed or varying length. The type CHAR(n) denotes a fixed-length string of up to n characters. VARCHAR(n) also denotes a string of up to n characters. The difference is implementation-dependent; typically CHAR implies that short strings are padded to make n characters, while VARCHAR implies that an endmarker or string-length is used. SQL permits reasonable coercions between values of character-string types. Normally, a string is padded by trailing blanks if it becomes the value of a component that is a fixed-length string of greater length.
Bit strings of fixed or varying length. These strings are analogous to fixed and varying-length character strings, but their values are strings of bits rather than characters. The type BIT(n) denotes bit strings of length n, while BIT VARYING(n) denotes bit strings of length up to n.
The type BOOLEAN denotes an attribute whose value is logical. The possible values of such an attribute are TRUE, FALSE, and (although it would surprise George Boole) UNKNOWN.
The type INT or INTEGER (these names are synonyms) denotes typical integer values. The type SHORTINT also denotes integers, but the number of bits permitted may be less, depending on the implementation (as with the types int and short int in C).
Dates and Times in SQL
Different SQL implementations may provide many different representations for dates and times, but the following is the SQL standard representation. A date value is the keyword DATE followed by a quoted string of a special form. The first four characters are digits representing the year. Then come a hyphen and two digits representing the month.
Finally there is another hyphen and two digits representing the day. Note that single-digit months and days are padded with a leading 0.
A time value is the keyword TIME and a quoted string. This string has two digits for the hour, on the military (24-hour) clock. Then come a colon, two digits for the minute, another colon, and two digits for the second. If fractions of a second are desired, we may continue with a decimal point and as many significant digits as we like.
Floating-point numbers can be represented in a variety of ways. We may use the type FLOAT or REAL (these are synonyms) for typical floating-point numbers. A higher precision can be obtained with the type DOUBLE PRECISION; again the distinction between these types is as in C. SQL also has types that are real numbers with a fixed decimal point. For example, DECIMAL(n,d) allows values that consist of n decimal digits, with the decimal point assumed to be d positions from the right. NUMERIC is almost a synonym for DECIMAL, although there are possible implementation-dependent differences.
These values are essentially character strings of a special form.
The title is declared as a string of up to 100 characters.
CREATE TABLE Movies (
    title CHAR(100),
    year INT,
    length INT,
    genre CHAR(10),
    studioName CHAR(30),
    producerC INT
);
Figure 2.
We have assumed that 10 characters are enough to represent a genre of movie; again, that is an arbitrary choice, one we could regret if we had a genre with a long name. Likewise, we have chosen 30 characters as sufficient for the studio name. The certificate number for the producer of the movie is another integer.
It illustrates some new options for data types. The name of this table is MovieStar, and it has four attributes. The first two attributes, name and address, have each been declared to be character strings.
However, with the name, we have made the decision to use a fixed-length string of 30 characters, padding a name out with blanks at the end if necessary and truncating a name to 30 characters if it is longer. In contrast, we have declared addresses to be variable-length character strings of up to 255 characters.
CREATE TABLE MovieStar (
    name CHAR(30),
    address VARCHAR(255),
    gender CHAR(1),
    birthdate DATE
);
Figure 2.
A single byte can store integers between 0 and 255, so it is possible to represent a varying-length character string of up to 255 bytes by a single byte for the count of characters plus the bytes to store the string itself. Commercial systems generally support longer varying-length strings, however.
The gender attribute has values that are a single letter, M or F. Thus, we can safely use a single character as the type of this attribute. Finally, the birthdate attribute naturally deserves the data type DATE.
But what if we need to change the schema of the table after it has been in use for a long time and has many tuples in its current instance? We can remove the entire table, including all of its current tuples, or we could change the schema by adding or deleting attributes.
We can delete a relation R by the SQL statement:
DROP TABLE R;
Relation R is no longer part of the database schema, and we can no longer access any of its tuples.
More frequently than we would drop a relation that is part of a long-lived database, we may need to modify the schema of an existing relation. These modifications are done by a statement that begins with the keywords ALTER TABLE and the name of the relation. We then have several options, the most important of which are
1. ADD followed by an attribute name and its data type.
2. DROP followed by an attribute name.
In the actual relation, tuples would all have components for phone, but we know of no phone numbers to put there.
Thus, the value of each of these components is set to the special null value, NULL. As another example, the ALTER TABLE statement:
ALTER TABLE MovieStar DROP birthdate;
deletes the birthdate attribute. As a result, the schema for MovieStar no longer has that attribute, and all tuples of the current MovieStar instance have the component for birthdate deleted.
For instance, we mentioned in Example 2. However, there are times when we would prefer to use another choice of default value, the value that appears in a column if no other value is known.
In general, any place we declare an attribute and its data type, we may add the keyword DEFAULT and an appropriate value. That value is either NULL or a constant. Certain other values that are provided by the system, such as the current time, may also be options.
We might wish to use the character ? We could replace the declarations of gender and birthdate in Fig.
1. We may declare one attribute to be a key when that attribute is listed in the relation schema.
2. We may add to the list of items declared in the schema (which so far have only been attributes) an additional declaration that says a particular attribute or set of attributes forms the key.
If the key consists of more than one attribute, we have to use method 2. If the key is a single attribute, either method may be used.
There are two declarations that may be used to indicate keyness:
a) PRIMARY KEY, or
b) UNIQUE.
The effect of declaring a set of attributes S to be a key for a relation is that two tuples of the relation cannot agree on all of the attributes in S. Any attempt to insert or update a tuple that violates this rule causes the DBMS to reject the action that caused the violation. In addition, if PRIMARY KEY is used, then attributes in S are not allowed to have NULL as a value for their components. Again, any attempt to violate this rule is rejected by the system. NULL is permitted if the set S is declared UNIQUE, however. A DBMS may make other distinctions between the two terms, if it wishes.
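The declarations discussed above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not a statement about how any particular commercial DBMS behaves. Several things here are assumptions: SQLite stands in for "the DBMS", the statement is written ADD COLUMN rather than plain ADD, the agent attribute and the StarNames table are invented for illustration, and the King Kong years (1933, 2005) come from general knowledge rather than from the text. One known deviation is deliberately left untested: SQLite's documentation notes that, for historical reasons, it may accept NULL in a PRIMARY KEY column of an ordinary table, unlike the rule stated above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()


def rejected(sql):
    """Run a statement; report whether the DBMS rejected it as a key violation."""
    try:
        cur.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Method 1: a single-attribute key declared on the attribute itself.
cur.execute("""
    CREATE TABLE MovieStar (
        name      CHAR(30) PRIMARY KEY,
        address   VARCHAR(255),
        gender    CHAR(1) DEFAULT '?',
        birthdate DATE
    )
""")
cur.execute("INSERT INTO MovieStar (name, address) VALUES ('Star X', '1 Example St.')")

# Adding an attribute: existing tuples get NULL unless a default is declared.
cur.execute("ALTER TABLE MovieStar ADD COLUMN phone CHAR(16)")
cur.execute("ALTER TABLE MovieStar ADD COLUMN agent VARCHAR(50) DEFAULT 'unlisted'")
star_row = cur.execute("SELECT gender, phone, agent FROM MovieStar").fetchone()

# Method 2: a key of more than one attribute needs a separate declaration.
cur.execute("""
    CREATE TABLE Movies (
        title  CHAR(100),
        year   INT,
        length INT,
        genre  CHAR(10),
        PRIMARY KEY (title, year)
    )
""")
cur.execute("INSERT INTO Movies (title, year) VALUES ('King Kong', 1933)")
same_title_ok = not rejected("INSERT INTO Movies (title, year) VALUES ('King Kong', 2005)")
duplicate_rejected = rejected("INSERT INTO Movies (title, year) VALUES ('King Kong', 1933)")

# UNIQUE: several NULLs are allowed, but no other duplicate values.
cur.execute("CREATE TABLE StarNames (name CHAR(30) UNIQUE)")
cur.execute("INSERT INTO StarNames VALUES (NULL)")
cur.execute("INSERT INTO StarNames VALUES (NULL)")
cur.execute("INSERT INTO StarNames VALUES ('Star X')")
unique_duplicate_rejected = rejected("INSERT INTO StarNames VALUES ('Star X')")
null_count = cur.execute(
    "SELECT COUNT(*) FROM StarNames WHERE name IS NULL"
).fetchone()[0]
```

Dropping an attribute is left out of the sketch because support for it varies between SQLite versions.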
Since no star would use the name of another star, we shall assume that name by itself forms a key for this relation. Thus, we can add this fact to the line declaring name. We could also substitute UNIQUE for PRIMARY KEY in this declaration. If we did so, then two or more tuples could have NULL as the value of name, but there could be no other duplicate values for this attribute.
CREATE TABLE MovieStar (
    name CHAR(30) PRIMARY KEY,
    address VARCHAR(255),
    gender CHAR(1),
    birthdate DATE
);
Figure 2.
The resulting schema declaration would look like Fig. Again, UNIQUE could replace PRIMARY KEY.
Example 2. However, in a situation where the key has more than one attribute, we must use the style of Fig. For instance, the relation Movies, whose key is the pair of attributes title and year, must be declared as in Fig. However, as usual, UNIQUE is an option to replace PRIMARY KEY.
The database schema consists of four relations, whose schemas are:
Product(maker, model, type)
PC(model, speed, ram, hd, price)
Laptop(model, speed, ram, hd, screen, price)
Printer(model, color, type, price)
The Product relation gives the manufacturer, model number and type (PC, laptop, or printer) of various products. We assume for convenience that model numbers are unique over all manufacturers and product types; that assumption is not realistic, and a real database would include a code for the manufacturer as part of the model number. The PC relation gives for each model number that is a PC the speed (of the processor, in gigahertz), the amount of RAM (in megabytes), the size of the hard disk (in gigabytes), and the price. The Laptop relation is similar, except that the screen size (in inches) is also included. The Printer relation records for each printer model whether the printer produces color output (true, if so), the process type (laser or ink-jet, typically), and the price.
Write the following declarations:
a) A suitable schema for relation Product.
b) A suitable schema for relation PC.
Exercise 2. Relation Ships records the name of the ship, the name of its class, and the year in which the ship was launched. Relation Battles gives the name and date of battles involving these ships, and relation Outcomes gives the result (sunk, damaged, or ok) for each ship in each battle.
Write the following declarations:
a) A suitable schema for relation Classes.
An Algebraic Query Language
In this section, we introduce the data-manipulation aspect of the relational model. Recall that a data model is not just structure; it needs a way to query the data and to modify the data. To begin our study of operations on relations, we shall learn about a special algebra, called relational algebra, that consists of some simple but powerful ways to construct new relations from given relations. When the given relations are stored data, then the constructed relations can be answers to queries about this data.
Further, when a DBMS processes queries, the first thing that happens to a SQL query is that it gets translated into relational algebra or a very similar internal representation. Thus, there are several good reasons to start out learning this algebra.
Before introducing the operations of relational algebra, one should ask why, or whether, we need a new kind of programming language for databases. After all, we can represent a tuple of a relation by a struct (in C) or an object (in Java), and we can represent relations by arrays of these elements.
The surprising answer is that relational algebra is useful because it is less powerful than C or Java. That is, there are computations one can perform in any conventional language that one cannot perform in relational algebra. An example is: determine whether the number of tuples in a relation is even or odd.
By limiting what we can say or do in our query language, we get two huge rewards (ease of programming and the ability of the compiler to produce highly optimized code) that we discussed in Section 2.
An algebra, in general, consists of operators and atomic operands. For instance, in the algebra of arithmetic, the atomic operands are variables like x and constants like 15. The operators are the usual arithmetic ones: addition, subtraction, multiplication, and division. Usually, parentheses are needed to group operators and their operands.
Relational algebra is another example of an algebra. Its atomic operands are:
1. Variables that stand for relations.
2. Constants, which are finite relations.
We shall next see the operators of relational algebra.
We generally shall refer to expressions of relational algebra as queries.
An element appears only once in the union even if it is present in both R and S. Note that $R - S$ is different from $S - R$; the latter is the set of elements that are in S but not in R.
When we apply these operations to relations, we need to put some conditions on R and S:
1. R and S must have schemas with identical sets of attributes, and the types (domains) for each attribute must be the same in R and S.
2. Before we compute the set-theoretic union, intersection, or difference of sets of tuples, the columns of R and S must be ordered so that the order of attributes is the same for both relations.
Sometimes we would like to take the union, intersection, or difference of relations that have the same number of attributes, with corresponding domains, but that use different names for their attributes. If so, we may use the renaming operator to be discussed in Section 2.
Current instances of R and S are shown in Fig. Then the union $R \cup S$ is
name: Carrie Fisher, Mark Hamill, Harrison Ford
address: 123 Maple St.
The difference $R - S$ is
The Fisher tuple appears in R but also appears in S, and so it is not in R − S.
The value of the expression π_{A1,A2,...,An}(R) is a relation that has only the columns for attributes A1, A2, ..., An of R. The schema for the resulting value is the set of attributes {A1, A2, ..., An}. An instance of this relation is shown in Fig. Projecting onto the attribute genre gives the single-column relation with the tuples sciFi and comedy. Notice that there are only two tuples in the resulting relation, since the last two tuples of Fig.
A Note About Data Quality: While we have endeavored to make example data as accurate as possible, we have used bogus values for addresses and other personal information about movie stars, in order to protect the privacy of members of the acting profession, many of whom are shy individuals who shun publicity.
For a selection σ_C(R), the tuples in the resulting relation are those tuples of R that satisfy some condition C that involves the attributes of R. C is a conditional expression of the type with which we are familiar from conventional programming languages; for example, conditional expressions follow the keyword if in programming languages such as C or Java. The only difference is that the operands in condition C are either constants or attributes of R. We apply C to each tuple t of R by substituting, for each attribute A appearing in condition C, the component of t for attribute A. If, after substituting for each attribute of C, the condition C is true, then t is one of the tuples that appear in the result of σ_C(R); otherwise t is not in the result.
The latter condition is true, so we accept the first tuple. The same argument explains why the second tuple of Fig. The third tuple has a length component 95. Hence the last tuple of Fig. We can get these tuples with a more complicated condition, involving the AND of two subconditions.
The Cartesian product (or just product) of two sets R and S is the set of pairs formed by choosing the first element from R and the second from S. This product is denoted R × S. When R and S are relations, the product is essentially the same.
However, since the members of R and S are tuples, usually consisting of more than one component, the result of pairing a tuple from R with a tuple from S is a longer tuple, with one component for each of the components of the constituent tuples. By convention, the components from R (the left operand) precede the components from S in the attribute order for the result.
The relation schema for the resulting relation is the union of the schemas for R and S. However, if R and S should happen to have some attributes in common, then we need to invent new names for at least one of each pair of identical attributes. To disambiguate an attribute A that is in the schemas of both R and S, we use R.A for the attribute from R and S.A for the attribute from S.
Example 2. Let relations R and S have the schemas and tuples shown in Fig. Then the product R × S consists of the six tuples shown in Fig. Note how we have paired each of the two tuples of R with each of the three tuples of S. Since B is an attribute of both schemas, we have used R.B and S.B in the schema for R × S. The other attributes are unambiguous, and their names appear in the resulting schema unchanged.
The simplest sort of match is the natural join of two relations R and S, denoted R ⋈ S, in which we pair only those tuples from R and S that agree in whatever attributes are common to the schemas of R and S. If the tuples r and s are successfully paired in the join R ⋈ S, then the result of the pairing is a tuple, called the joined tuple, with one component for each of the attributes in the union of the schemas of R and S. The joined tuple agrees with r on each attribute in the schema of R and with s on each attribute in the schema of S. Since r and s are successfully paired, the joined tuple is able to agree with both these tuples on the attributes they have in common. The construction of the joined tuple is suggested by Fig. However, the order of the attributes need not be that convenient; the attributes of R and S can appear in any order.
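The pairing rule for the natural join can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's notation: the function name natural_join, the representation of a relation as a schema tuple plus a list of tuples, and the sample instances of R(A, B) and S(B, C, D) are all hypothetical choices made for this sketch.

```python
def natural_join(r_schema, r_tuples, s_schema, s_tuples):
    """Pair tuples of R and S that agree on every attribute the schemas share.

    Hypothetical helper for illustration: a relation is a schema (tuple of
    attribute names) plus a list of tuples whose components follow that schema.
    """
    # Attributes common to both schemas, and those that appear only in S.
    common = [a for a in r_schema if a in s_schema]
    s_only = [a for a in s_schema if a not in r_schema]
    out_schema = list(r_schema) + s_only

    result = []
    for r in r_tuples:
        r_row = dict(zip(r_schema, r))
        for s in s_tuples:
            s_row = dict(zip(s_schema, s))
            # r and s pair successfully only if they agree on all common attributes.
            if all(r_row[a] == s_row[a] for a in common):
                # Joined tuple: all of r, then the components of s not already in r.
                joined = tuple(r) + tuple(s_row[a] for a in s_only)
                if joined not in result:  # a relation is a set: no duplicates
                    result.append(joined)
    return out_schema, result


# Assumed sample instances, chosen only for this sketch.
R_SCHEMA = ("A", "B")
R = [(1, 2), (3, 4)]
S_SCHEMA = ("B", "C", "D")
S = [(2, 5, 6), (4, 7, 8), (9, 10, 11)]

schema, rows = natural_join(R_SCHEMA, R, S_SCHEMA, S)
print(schema)  # ['A', 'B', 'C', 'D']
print(rows)    # [(1, 2, 5, 6), (3, 4, 7, 8)]
```

In this assumed instance the third tuple of S agrees with no tuple of R on B, so it contributes nothing to the output. The nested loop is the simplest way to express the definition; it says nothing about how a real DBMS would evaluate a join.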
Thus, to pair successfully, tuples need only to agree in their B components. If so, the resulting tuple has components for attributes A (from R), B (from either R or S), C (from S), and D (from S). Pairing the first tuple of R with the first tuple of S yields the first tuple of the result: (1, 2, 5, 6). The second tuple of R pairs successfully only with the second tuple of S, and the pairing yields (3, 4, 7, 8). Note that the third tuple of S does not pair with any tuple of R and thus has no effect on the result of R ⋈ S. A tuple that fails to pair with any tuple of the other relation in a join is said to be a dangling tuple.
In that example, no tuple paired successfully with more than one tuple, and there was only one attribute in common to the two relation schemas. We also show an instance, with relations U and V, in which one tuple joins with several tuples. For tuples to pair successfully, they must agree in both the B and C components. Thus, the first tuple of U joins with the first two tuples of V, while the second and third tuples of U join with the third tuple of V. The result of these four pairings is shown in Fig.
While equating shared attributes is the most common basis on which relations are joined, it is sometimes desirable to pair tuples from two relations on some other basis. For that purpose, we have a related notation called the theta-join. The notation for a theta-join of relations R and S based on condition C is R ⋈_C S. The result of this operation is constructed as follows:
1. Take the product of R and S.
2. Select from the product only those tuples that satisfy the condition C.
Equivalent Expressions and Query Optimization
All database systems have a query-answering system, and many of them are based on a language that is similar in expressive power to relational algebra.
Thus, the query asked by a user may have many equivalent expressions (expressions that produce the same answer whenever they are given the same relations as operands), and some of these may be much more quickly evaluated.
We shall use the operator ρ_{S(A1,A2,...,An)}(R) to rename a relation R. The resulting relation has exactly the same tuples as R, but the name of the relation is S. Moreover, the attributes of the result relation S are named A1, A2, ..., An. If we only want to change the name of the relation to S and leave the attributes as they are in R, we can just say ρ_S(R).
Example 2. Suppose, however, that we do not wish to call the two versions of B by names R.B and S.B; rather we want to continue to use the name B for the attribute that comes from R, and we want to use X as the name of the attribute B coming from S. We can rename the attributes of S so the first is called X. The result of the expression ρ_{S(X,C,D)}(S) is a relation named S that looks just like the relation S from Fig.
When we take the product of R with this new relation, there is no conflict of names among the attributes, so no further renaming is done. That is, the result of the expression R × ρ_{S(X,C,D)}(S) is the relation R × S from Fig. This relation is shown in Fig. As an alternative, we could take the product without renaming, as we did in Example 2. The expression ρ_{RS(A,B,X,C,D)}(R × S) yields the same relation as in Fig. But this relation has a name, RS, while the result relation in Fig.
Some operations can be expressed in terms of others. For instance, intersection can be expressed through difference: R ∩ S = R − (R − S). We first form T = R − S, the tuples of R that are not in S. We then subtract T from R, leaving only those tuples of R that are also in S.
The two forms of join are also expressible in terms of other operations. The theta-join is a product followed by a selection: R ⋈_C S = σ_C(R × S). For the natural join of R and S, we start with the product R × S. We then apply the selection operator with a condition C of the form R.A1 = S.A1 AND R.A2 = S.A2 AND ... AND R.An = S.An, where A1, A2, ..., An are all the attributes appearing in the schemas of both R and S. Finally, we must project out one copy of each of the equated attributes. Let L be the list of attributes in the schema of R followed by those attributes in the schema of S that are not also in the schema of R. Then R ⋈ S = π_L(σ_C(R × S)).
To compute the natural join of U and V in this way, we take the product U × V.
Then we select for equality between each pair of attributes with the same name (B and C in this example). For another example, the theta-join of Example 2.
It is not a complete declaration; we shall add more to it later. Line 1 declares Movie to be a class. Following line 1 are the declarations of four attributes that all Movie objects will have. Lines 2, 3, and 4 declare three attributes: title, year, and length. The first of these is of character-string type, and the other two are integers. Line 5 declares attribute genre to be of enumerated type. The name of the enumeration (list of symbolic constants) is Genres, and the four values the attribute genre is allowed to take are drama, comedy, sciFi, and teen. An enumeration must have a name, which can be used to refer to the same type anywhere.
OBJECT DEFINITION LANGUAGE
Why Name Enumerations and Structures?
The enumeration-name Genres in Fig. However, by giving this set of symbolic constants a name, we can refer to it elsewhere, including in the declaration of other classes. In some other class, the scoped name Movie::Genres can be used to refer to the definition of the enumerated type of this name within the class Movie.
Here is an example with a complex type. Line 3 specifies another attribute, address. This attribute has a type that is a record structure. The name of this structure is Addr, and the type consists of two fields: street and city. Both fields are strings. In general, one can define record structure types in ODL by the keyword Struct and curly braces around the list of field names and their types. Like enumerations, structure types must have a name, which can be used elsewhere to refer to the same structure type.
The type of a relationship describes what a single object of the class is connected to by the relationship. Typically, this type is either another class (if the relationship is many-one) or a collection type (if the relationship is one-many or many-many).
We shall show complex types by example, until the full type system is described in Section 4.
More precisely, we want each Movie object to connect to the set of Star objects that are its stars. The best way to represent this connection between the Movie and Star classes is with a relationship. We may represent this relationship by a line:
relationship Set<Star> stars;
in the declaration of class Movie. It says that in each object of class Movie there is a set of references to Star objects. The set of references is called stars.
To get this information into Star objects, we can add the line
relationship Set<Movie> starredIn;
to the declaration of class Star in Example 4. However, this line and a similar declaration for Movie omit a very important aspect of the relationship between movies and stars: if a star S is in the set stars for movie M, then M should be in the set starredIn for S. We indicate this connection between the relationships stars and starredIn by placing in each of their declarations the keyword inverse and the name of the other relationship. If the other relationship is in some other class, as it usually is, then we refer to that relationship by its scoped name: the name of its class, followed by a double colon (::) and the name of the relationship.
Line 6 shows the declaration of relationship stars of movies, and says that its inverse is Star::starredIn. Since relationship starredIn is defined in another class, its scoped name must be used. Similarly, relationship starredIn is declared in line 11. Its inverse is declared by that line to be stars of class Movie, as it must be, because inverses always are linked in pairs.
The types of a pair of inverse relationships between classes C and D follow these rules:
1. If the relationship is many-many between C and D, then the type of the relationship in C is Set<D>, and in D it is Set<C>.
2. If the relationship is many-one from C to D, then the type of the relationship in C is just D, while the type of the relationship in D is Set<C>.
3. If the relationship is many-one from D to C, then the roles of C and D are reversed in (2) above.
4. If the relationship is one-one, then the type of the relationship in C is just D, and in D it is just C.
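The effect of declaring two relationships to be inverses can be imitated in ordinary code. The following Python sketch is an illustration under stated assumptions, not ODL: the classes, the helper names add_star and inverse_holds, and the movie titles are hypothetical. It models a many-many pair in which each side holds a set of references to the other, and it checks that a star is in a movie's stars exactly when the movie is in that star's starred_in.

```python
class Movie:
    def __init__(self, title):
        self.title = title
        self.stars = set()        # plays the role of: relationship Set<Star> stars


class Star:
    def __init__(self, name):
        self.name = name
        self.starred_in = set()   # plays the role of: relationship Set<Movie> starredIn


def add_star(movie, star):
    """Update both sides together so the two relationships stay inverses."""
    movie.stars.add(star)
    star.starred_in.add(movie)


def inverse_holds(movies, stars):
    """True if s is in m.stars exactly when m is in s.starred_in."""
    forward = all(m in s.starred_in for m in movies for s in m.stars)
    backward = all(s in m.stars for s in stars for m in s.starred_in)
    return forward and backward


# Hypothetical sample objects.
movie_a = Movie("Movie A")
movie_b = Movie("Movie B")
fisher = Star("Carrie Fisher")
hamill = Star("Mark Hamill")

add_star(movie_a, fisher)
add_star(movie_a, hamill)
add_star(movie_b, fisher)
print(inverse_holds([movie_a, movie_b], [fisher, hamill]))  # True

# Updating only one side breaks the inverse property.
movie_b.stars.add(hamill)
print(inverse_holds([movie_a, movie_b], [fisher, hamill]))  # False
```

A many-one relationship would replace the set on the "one" side with a single reference, mirroring rule (2) above.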
Of course, since a D object could be associated with any set of C objects, it is also permissible for that set to be empty for some D objects.
HIGH-LEVEL DATABASE MODELS
Example 4. The first two of these have already been introduced in Examples 4. We also discussed the relationship pair stars and starredIn. Since each of their types uses Set, we see that this pair represents a many-many relationship between Star and Movie. Studio objects have attributes name and address; these appear in lines 13 and 14. We have used the same type for addresses of studios as we defined in class Star for addresses of stars. In line 7 we see a relationship ownedBy from movies to studios, and the inverse of this relationship is owns on line 15. Since the type of ownedBy is Studio, while the type of owns is Set<Movie>, we see that this pair of inverse relationships is many-one from Movie to Studio.
A type system is built from a basis of types that are defined by themselves and certain recursive rules whereby complex types are built from simpler types. In ODL, the basis consists of:
1. Primitive types: integer, float, character, character string, boolean, and enumerations. The latter are lists of symbolic names, such as drama in line 5 of Fig.
2. Class names, such as Movie or Star, which represent types that are actually structures, with components for each of the attributes and relationships of that class.
These types are combined into structured types using the following type constructors:
1. Set: If T is any type, then Set<T> denotes the type whose values are finite sets of elements of type T. Examples using the set type-constructor occur in lines 6, 11, and 15 of Fig.
2. Bag: If T is any type, then Bag<T> denotes the type whose values are finite bags or multisets of elements of type T.
3. List: If T is any type, then List<T> denotes the type whose values are finite lists of zero or more elements of type T.
4. Array: If T is a type and i is an integer, then Array<T,i> denotes the type whose elements are arrays of i elements of type T. For example, Array<char,10> denotes character strings of length 10.
5. Dictionary: If T and S are types, then Dictionary<T,S> denotes a type whose values are finite sets of pairs. Each pair consists of a value of the key type T and a value of the range type S. The dictionary may not contain two pairs with the same key value.
Sets, Bags, and Lists
To understand the distinction between sets, bags, and lists, remember that a set has unordered elements, and only one occurrence of each element. A bag allows more than one occurrence of an element, but the elements and their occurrences are unordered. A list allows more than one occurrence of an element, but the occurrences are ordered. Thus, {1,2,1} and {2,1,1} are the same bag, but (1,2,1) and (2,1,1) are not the same list.
6. Structure: If T1, T2, ..., Tn are types and F1, F2, ..., Fn are names of fields, then Struct N {T1 F1, T2 F2, ..., Tn Fn} denotes the type named N whose elements are structures with n fields. For example, line 10 of Fig. Both fields are of type string and have names street and city, respectively.
The first five types (set, bag, list, array, and dictionary) are called collection types. There are different rules about which types may be associated with attributes and which with relationships.
Struct N {string field1, integer field2}.
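The distinction among sets, bags, and lists drawn above can be checked with Python's built-in types. This is a rough analogy under stated assumptions, not part of ODL: here a Python list stands in for List, collections.Counter for Bag, set for Set, and dict for Dictionary, and the variable names are chosen only for this sketch.

```python
from collections import Counter

a = [1, 2, 1]
b = [2, 1, 1]
c = [1, 2]

# As lists: order and number of occurrences both matter.
same_list = (a == b)

# As bags: occurrences matter, order does not.
same_bag = (Counter(a) == Counter(b))
bag_differs_from_c = (Counter(a) != Counter(c))

# As sets: neither order nor number of occurrences matters.
same_set = (set(a) == set(b) == set(c))

print(same_list)           # False
print(same_bag)            # True
print(bag_differs_from_c)  # True
print(same_set)            # True

# A dictionary may not hold two pairs with the same key:
# assigning to an existing key replaces the old pair.
d = {"street": "first value"}
d["street"] = "second value"
print(len(d))              # 1
```

The Counter comparison works because a bag is fully described by how many times each element occurs.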