Merrett's Databases at McGill

Merrett's Introduction to Databases and to the Aldat Project at McGill

History, 1960's

In the 1960's, programming languages such as FORTRAN, ALGOL, COBOL and PL/I had introduced ways of organizing data in primary memory (RAM) into familiar data structures: numbers (reals and integers), logical or Boolean values, strings of characters, structures or records. The first two languages were designed for engineering and scientific calculations, and their data structures are largely sufficient for the computations required.

The latter two languages were used for commercial data processing, which requires manipulation of data loaded from secondary storage (tape and disk). It became apparent that organizing principles were also needed for such data, and by the end of the 1960's extensions were made to COBOL and PL/I to store, find and load data on secondary storage. One was IMS, IBM's Information Management System, a hierarchical organization of data. Another was an inter-industry design, proposed by the Data Base Task Group (DBTG) of the Conference on Data Systems Languages (CODASYL) and called the DBTG or network model.

It can be argued that these proposals, and other models and investigations during the 1960's, were not grounded in sufficiently precise concepts to satisfy the need for new principles of data organization. At any rate, Codd's publication of the relational data model in 1970 precipitated a flurry of research activity which still continues, and his paper can be taken to mark the start of the field of databases.

A relation is a mathematical concept, a set of "tuples" or records, and Codd defined an algebra of operations on relations. These operations are all-important, because a data structure, or "model" of data, is of no use if we cannot operate on the data to retrieve it or modify it.

Of course, the IMS, DBTG and other data models introduced operations, but because of the links of these approaches to existing programming languages geared for data in RAM, their operations were at the level of individual records instead of at a higher level. The base unit of storage on tape or disk tends to be the file. A proper abstraction should formalize the notion of "file". Relations do this. A proper set of operations should apply to files, not to records, which are just pieces or elements of files. The relational algebra does this.

In the late 1960's, I was a programmer-analyst with IBM (U.K.). I built information systems for internal use at a manufacturing plant in Scotland. We used PL/I and no database system. New information systems were integrated into the overall information complex of the factory. One person knew the overall picture, and his knowledge occasionally came out in a systems flowchart. This was a chart about the size of a large blackboard, in which little boxes about 1 inch square represented programs and little circles and cylinders about the same size represented tapes and disks respectively. It was very complicated.

I came to McGill with the intention of unravelling such messes by an approach in which files (the little circles and cylinders) were the base units and all the programs (the little boxes) were reduced to a few operators. This would lead to a programming language in which files are treated in the way that FORTRAN treats numbers, and a small set of operators, such as the arithmetic operators of FORTRAN, suffice for all necessary information calculations on the files.

History, 1970's

The 1970's saw intensive work on query languages and on theory, and a number of early implementations of the relational model. The query language work led to powerful and flexible notations for ad-hoc queries, using first-order predicate calculus as a standard. The theory work introduced normalization, which reduces redundancy in the data representation, and dependency theory, on which normalization is based. The implementations were PRTV (IBM U.K.), Ingres (Berkeley) and SQL (IBM San Jose), among others.

Competing data models were also introduced, notably Chen's Entity-Relationship model, Smith & Smith's notions of generalization and aggregation (the is-a and part-of hierarchies), and Codd's RM/T. These models derive from relations and add "semantic" considerations to "capture more meaning", as Codd puts it.

At McGill, we established generalizations to the relational algebra and created the domain algebra, in order to provide a better basis for a programming language. These extensions also led to a better query language. We studied information system requirements and concluded that the extended algebra is sufficient for all the applications we had envisaged. We called this work the Aldat Project , standing for the algebraic approach to data.

There was also a lack, at this time. of data structures for secondary storage which captured the symmetry of relations and their independence of notions of record order and favoured fields for retrieval. This was an inconvenience, so we developed the notions of multi-dimensional paging ("multipaging"; this remains in a number of ways an improvement over its successor, the grid-file) and of Z-order and kd-tries.

This work on information systems is covered in my book, Relational Information Systems (now out of print).

History, 1980's

In the 1980's, the database research community turned to other forms of data, investigated interactions with logic and artificial intelligence, and came under the sway of "object-oriented" programming. This coincided with our own investigations in the Aldat Project, having established the suitability of the relational model for the structured data of commerce and administration. We found relational representations for text, pictures and the rule bases of expert systems, and demonstrated the effectiveness of our operators in building the related applications.

Unlike many database researchers, we have always supposed in the Aldat Project that we are dealing with data which resides on secondary storage and may be too big to fit into RAM. As examples of this, consider the 570 megabytes of the Oxford English Dictionary, and the petabytes (a petabyte is a million gigabytes) generated by the NASA Earth Observing System. There is a conflict between two claims. The first, made by many, is that the relational model is inadequate to cope with the sophisticated data of modern databases, such as CAD/CAM and expert systems. The second, made by us, is that relations give the only way of organizing large data files, which must reside on secondary storage, simply and systematically.

We have written GIS (Geographical Information Systems) and other spatial data algorithms in Aldat. Temporal data is similarly easily supported. And we have implemented an expert system shell in 200 lines of Aldat code which subsumes the functionality of two commercial shells.

Again, a lack of good file structures for this less structured data led us to introduce structures and algorithms based on "tries" for spatial data and to improve existing methods for text data. The spatial data tries give us variable-resolution zooming with only a single copy of the data and yet reading from disk only the amount of data relevant to the resolution specified. Tries offer tremendous data compression, up to a factor of h/2, where h is the number of bits in the search key.
The fact that tries automatically compress data without loss of information is fundamental to the potential they hold out as a data structure for general implementation.

History, 1990's

Although the "semantic" components introduced into the languages of many of the new data models are often limited and ad-hoc, and although the "object-oriented" approach is not always well defined, many of the new ideas are significant. Our research in the Aldat Project incorporated such ideas as data abstraction and hiding, complex data types and nesting, state and instantiation, and generalization and inheritance.

As we did when investigating and extending the relational model and relational algebra, we started with a formal basis. Then, it was set theory and logic. For the new work, the basis was programming languages. Our perspective can be suggested by a comparison with differential equations. Differential equations provide a uniform formalism for solving an immense variety of problems. To solve different problems, you use the formalism appropriately. (There are no "semantic" intrusions to focus on heat conduction in a metal plate, or on satellite orbits, or on the Schroedinger equation.) The formalism has an extensively investigated mathematical basis, which abstracts from the potential applications. In the same way, we based an application-independent programming formalism on general and well-tried concepts.

Our perspective can also be suggested by a comparison with programming languages for RAM. Java, for instance, has no dialect for spatial data or for expert systems or for semistructured data. For secondary storage, there are, by contrast, special systems for geographical information (e.g., ArcInfo), for semistructured data (e.g., XML), and many expert system shells, on top of conventional databases, conceived for administrative and organizational data. Aldat is a serious attempt at unifying everything into a secondary-storage programming language.

In this research program we investigated seven fundamental aspects of programming languages to adapt them to relational operations on large data units on secondary storage, generalizing where necessary, and never specializing. These are:

scoping (leading to multidatabases, (Internet) distibuted databases),
iterative abstraction (loops and basic parallelism),
parametric abstraction (functions and abstract data types; "computations").
... with state and instantiation (object orientation),
nesting (complex data and object orientation),
typing (including metadata),
event programming (active databases).

We have implementations, partial at least, for most of these.

History, 2000's

A challenge to conventional database technology is presented by "semi-structured" data, as found in bibliographic and bioinformatic databases among others. It is widely assumed that new techniques must replace relations. (This was also widely assumed when object-oriented databases appeared and claimed that relations could not cope with complex data structures.) The semistructured, parsable language, XML, was developed, and query languages for it.

By taking relational ideas to their logical limit, semistructured data may be completely assimilated under relations and the relational operations, without needing new syntax and certainly without needing a whole new approach or new language. We extended nesting of relations to recursive nesting, and included recursion in the domain algebra to process recursively nested relations. As a convenient special case we allowed syntactic sugar in the form of path expressions which subsume all the functionality of XML and its query languages. The advantage of using relations here instead of a specialized new language is that the new repertoire of applications is integrated with everything else that relations can do.

Since HTML is a close predecessor of XML, Aldat's semistructure capabilities also process data on the Web. We also incorporated other aspects of databases distributed across the Internet by using path expressions within an HTTP-like protocol.

A second challenge to classical databases comes from sequence data, including timeseries and "data streams" (which are timeseries of indefinite length so that even secondary storage can hold at most a window of current and recent values). Although relations are sets, they can easily represent sequences if a sequencing attribute is provided. The functional mapping operators of the domain algebra permit such an attribute to order the tuples for series aggregation, but we must combine these with the reduction operators to work with windows.

We added a windowing operator to the domain algebra as a hybrid between these two forms of aggregation (reduction and functional mapping). Trials with timeseries and data-streaming applications shows so far that this operator adds sufficient capability for all work with sequences.

Theses and Projects in Databases at McGill

The most important contributions made by students to the Aldat Project have been through M.Sc. theses and projects, and there are opportunities for advanced contributions by Ph.D. students.

At the Master's level, work has been pursued in exploration and implementation. Since 1979 some 57 theses and 68 projects have been completed in the Aldat Project. Of these, eleven theses and nine projects have contributed directly to the present implementation, the system jRelix. The Aldat Project supports a Software Coordinator who assists with implementation work and ensures that student contributions are fully integrated into the system. All implementation work in the Project is currently installed on the workstations in our laboratory.

The remaining Master's work in the Aldat Project, 46 theses and 59 projects, have explored a variety of applications and environments for our work: financial administration, forms systems, text editors and processing and representation, picture processing, spatial data, temporal data, spreadsheets, integrity constraints, metadata, implementations for parallel machines and networks of computers, object orientation, geographical information systems and data on the World Wide Web.

Doctoral work in the Aldat Project has focussed on new data structures and algorithms, and on advanced database techniques. For instance, multipaging, kd-tries and z-order originated at McGill. Some useful generalizations of relational recursion, and new implementations of tries for text and spatial data came out of Ph.D. theses here. Potential topics are: trie implementations of the relational and domain algebras in Aldat; techniques for ultra-rapid general searching of text and other sequences; multimedia data representations and retrieval algorithms.

Courses

COMP 199, Excursions in Computing Science

COMP 420, Files and Databases

COMP 612, Database Systems

COMP 617, Information Systems

Recreation

Georgeville, Nature, Ski