Secondary Storage Algorithms and Data Structures

**Instructor:** T. H.
Merrett

"Secondary Storage Algorithms and Data Structures", COMP 420, introduces data structures and algorithms for data on secondary storage, notably queries and processing. The course looks at sequential files, tree-structured files, and direct-access files and at algorithms for applications requiring various levels of volatility, symmetry, and activity.

We also examine file access across the Internet, using sockets and TCP/IP.

File and socket programming is in Java. (The class package documentation is especially useful.)

Based on the categories covered, we discuss design techniques for specific applications (which have in the past included student records and employment databases, natural language retrieval, astronomical queries, musicology queries, air traffic control, restaurant guide, city property registry, world geographical database of towns, personnel database, hospital ward monitor, and a medical insurance database).

The course also briefly introduces databases, using McGill's advanced relational database system, relix.

*
Fundamental to this course is that secondary storage is memory organized
differently from main memory (RAM), so that data structures, algorithms,
complexity considerations and languages must all be rebuilt differently
from the familiar computer science for RAM.
*

COMP 420 Secondary Storage Algorithms and Data Structures COURSE SUMMARY Week^{}Topic Asst. 1 Overview of Aldat [PDF slides 1797K] Java classes and sequential I/O 1 Java and UNIX Files, records, blocks, RAM and secondary storage [Slides ps 291K pdf 214K] [Exercises ps 133K pdf 61K] [Text ps 303K pdf 203K] 2 Sequential Files: [Slides ps 109K pdf 68K] [Exercises ps 124K pdf 52K] [Text ps 150K pdf 133K] searching 2 merging 3 sorting Random access in Java 4 3 Logarithmic Files: [Slides ps 314K pdf 245K] [Exercises ps 206K pdf 102K] [Text ps 530K pdf 302K] B-trees: searching, inserting, splitting, overflows 5 Volatility, activity Tries: text; variable-resolution [Notes ps 340K pdf 202K] (Supplementary notes: Building and Searching Full Tries Slides on the pointerless representation for tries) Symmetry: quad trie, kd-tree, kd-trie, z-order 6 5 Direct Files: [Slides ps 499K pdf 674K] [Exercises ps 122K pdf 125K] [Text ps 444K pdf 302K] Hashing: division/remainder, multiplicative collision resolution - linear probing, 7 separate chaining (Supplementary notes: Hash Delete with Linear Probing) Volatility: virtual hashing, linear hashing Activity: tidying, distributions Symmetry: multipaging - static and dynamic (Supplementary notes: Multipage Search and Insert) 7 Cost Analysis Usage distributions: sequential/direct breakeven Probe and load factors: B-trees and hashing Hashing overflows - theory and measurement 8 (Supplementary notes: Cost Analysis for Trees) File Design [Exercises ps 179K pdf 74K] Using Access Categories, Activity, Volatility, Symmetry and Cost Analysis 9 (Supplementary notes: Design Problem and Answers) 8 Relational Databases: [Slides ps 191K pdf 261K] [Exercises ps 42K pdf 48K] Relations Relational Algebra relix 10 9 Domain Algebra 11 Misc: File support for concurrency [Slides ps 136K pdf 101K]; High dimensionality [Slides ps 139K pdf 101K] Trie Join [PDF slides 70K]

The links from this web page constitute the text for the course. They are reasonably complete, but some omissions and errors have been left deliberately, and
will be addressed in the lectures.

Note on plagiarism