Copyright ©  T. H. Merrett

308-420A           Secondary Storage Algorithms and Data Structures       Week 6

				  Assignment 6

The telephone book is represented as two files, a full trie indexing the phone
numbers (/course/cs420/data/triePhone), and a data file containing the other
fields (/course/cs420/data/dataNameAddr).

Write a Java program to search the trie and retrieve the corresponding data
for the phone numbers 2883502, 2883508, 3364566, 8445663 and 8453932. (You
may treat each of these as a separate request.)

Each file is a  RandomAccessFile. The data file consists of blocks of records
containing the fields  Name  (40 bytes) and  Address  (33 bytes). The record
and blocksize information (73 bytes and 100 records/block) are stored in the
index trie. Your program should use this stored information, and read a full
block at a time into a buffer, extracting individual records from the buffer.

The trie indexing the phone numbers has two main components, with numeric
parameters before and after each. The first component is the set of pages
containing the bitpairs that represent the nodes of the trie. The second
component is the indexing information for these pages: the top counters,
the bottom counters, the numbers of nodes, and the addresses on disk. Here
is the layout.

	byte*	type	meaning
	   0	int	number of pages in the trie (nTrie=42)
	   4	int	blocksize of trie pages in bytes (BTrie=64)
	   8	short	height of trie in page levels (h=4)
	  10	short	number of node levels per page level (t=8)
	  12	bp[]	nTrie blocks of BTrie bytes each: the trie
			("array of bitpairs, bp[]")
	2700	int	number of entries in page index (indexSize=46)
	2704	int[]	Tcount[indexSize]
	2888	int[]	Bcount[indexSize]
	3072	int[]	nodeCt[indexSize]
	3256	int[]	Addr[indexSize]
	3440	short	blocksize of data pages in records (bData=100)
	3442	short	size of data records in bytes (RData=73)

*Do not hard-code these addresses into your program, but calculate them from
 the stored parameters, and the number 12, which is the space needed by the
 first four numbers. (To confirm your understanding of the above layout, check
 that the stored parameters give the byte addresses listed, and answer the
 bonus questions at the end. Note that there is an extra Tcount and Bcount for
 the end of each level of trie pages; the corresponding values of nodeCt and
 Addr are -1, which can be used to detect the end of each level.)

Read the four arrays, Tcount, Bcount, nodeCt, and Addr, into main memory at the
outset and keep them there (736 bytes). Read the pages of trie nodes into main
memory only as needed (64 bytes each), but you may keep one page for each of the
four levels in main memory (256 bytes).

You will need to support the search logic you have been given for tries with a
mechanism to handle arrays of bitpairs, to compare bitpairs with bits in the
query, and to convert numeric query keys to bitstrings. Bits are stored eight
to the byte, and bitpairs four to the byte: use Java's byte masking operations;
but also take the time to build classes for bits, bitpairs, sequences of both,
and a cursor mechanism to keep track of where you are. Make the input and the
trie page data as close to arrays of bits or bitpairs, respectively, as you
can. It will require some time to do this, so start this assignment early.

The trie stores four bitpairs in each byte with the first pair in the least
significant two bits and the fourth pair in the most significant. Thus the mask
is updated as follows to scan the bitpairs in the correct order:
	mask= mask==192 ? 3 : mask*4.

For a bonus mark, answer the questions: what is the compression achieved by
storing the 885 phone numbers in triePhone instead of as 7-byte character
strings? 4-byte integers? 3-byte integers? What is the compression of the
whole file (including waste due to blocking the data)? Note that there are
9294 trie nodes, and hence bitpairs: if there were no wastage in either
trie pages or data pages, what would the compression be over the original
ascii file? What is the load factor of the 42 trie pages?


Optional: change your program to use internet sockets via HostClient, as you
did with Assignment 1. Compare the running times. Try this from home.


You may find the following four classes handy: bit, bitSeq, bitPair, bitPairSeq.
They are in ~cs612/java, along with the other java classes for the course.



Shared marks for shared work: assignments should be your own work; marks
for joint assignments will be divided by the number of collaborators.
But please feel free to work in groups to learn or for extra work that
is not for marks.