Copyright © 1998 T. H. Merrett

308-420A            Secondary Storage Algorithms and Data Structures      Week 4

			Building and Searching Full Tries

I Building a full trie

 b : number of bits in key
 t : number of levels of trie nodes in each layer of pages
 h : height of trie in pages
 b=ht
 2^t: max number of trie nodes in a page
	   (2 bits per node, so max bitstring in page has 2^(t-2) bytes)
 page nodes are bitpairs
 index to page will contain:
 T : topcount - number of trie edges entering this level prior to this page
 B : botcount - number of trie edges leaving this level prior to this page
 N : node-count - number of nodes (non-00 bitpairs) stored in this page
 A : disk address (page number)
 also needed while constructing:
 Tc: cumulative T - number of trie edges entering this level thru this page
 Bc: cumulative B - number of trie edges leaving this level thru this page
 (will become T, B, resp. for next page this level)

the keys must be input in ascending lexicographic order

e.g., b=8, h=2, t=4, 2^t=16 nodes (4 bytes) per page

key 00000011
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 1, 4)
10 
10 
10 
10 
showPage(1): (T, B, Tc, Bc, N)= (0, 0, 1, 1, 4)
10 
10 
01 
01 

key 00101100
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 2, 5)
10 
10 
11 
10 10 
showPage(1): (T, B, Tc, Bc, N)= (0, 0, 2, 2, 8)
10 01 
10 01 
01 10 
01 10 

key 10000000
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 3, 8)
11 
10 10 
11 10 
10 10 10 
showPage(1): (T, B, Tc, Bc, N)= (0, 0, 3, 3, 12)
10 01 10 
10 01 10 
01 10 10 
01 10 10 

key 10000101
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 3, 8)
11 
10 10 
11 10 
10 10 10 
showPage(1): (T, B, Tc, Bc, N)= (0, 0, 3, 4, 14)
10 01 10 
10 01 11 
01 10 10 10 
01 10 10 01 

key 10001000
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 3, 8)
11 
10 10 
11 10 
10 10 10 
writePage(1): (T, C, N, A)= (0, 0, 8, 0)
10 01
10 01
01 10
01 10
showPage(1): (T, B, Tc, Bc, N)= (2, 2, 3, 5, 9)
11 
11 10 
10 10 10 
10 01 10 

key 10100000
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 4, 9)
11 
10 10 
11 11 
10 10 10 10 
showPage(1): (T, B, Tc, Bc, N)= (2, 2, 4, 6, 13)
11 10 
11 10 10 
10 10 10 10 
10 01 10 10 

key 10101100
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 4, 9)
11 
10 10 
11 11 
10 10 10 10 
showPage(1): (T, B, Tc, Bc, N)= (2, 2, 4, 7, 16)
11 11 
11 10 10 01 
10 10 10 10 10 
10 01 10 10 10 

key 11000000
showPage(0): (T, B, Tc, Bc, N)= (0, 0, 1, 5, 11)
11 
10 11 
11 11 10 
10 10 10 10 10 
writePage(1): (T, B, N, A)= (2, 2, 16, 1)
11 11
11 10 10 01
10 10 10 10 10
10 01 10 10 10
showPage(1): (T, B, Tc, Bc, N)= (4, 7, 5, 8, 4)
10 
10 
10 
10 

no more keys
writePage(0): (T, B, N, A)= (0, 0, 11, 2)
11
10 11
11 11 10
10 10 10 10 10
writePage(1): (T, B, N, A)= (4, 7, 4, 3)
10 
10 
10 
10 

So the final trie pages are
10 01	11 11		11		10 
10 01	11 10 10 01	10 11		10 
01 10	10 10 10 10 10	11 11 10	10 
01 10	10 01 10 10 10	10 10 10 10 10	10 

And the final page indexes are
T  0  1  0  2  4  5
B  0  5  0  2  7  8
N 11 -1  8 16  4 -1
A  2 -1  0  1  3 -1

(One page (2) for root level, three (0, 1, 3) for second level; -1s signal end
 of level, and the T and B for these columns are the cumulative values for that
 level: total entries= cumulative B for last level (8); total nodes=11+8+16+4
 - compare 8 bits * 8 keys: 39*2 bits vs. 64: already 2 bits per node is much
 less than double the original number of bits, and a not much larger set of
 keys will actually be compressed.)


To visualize, rearrange pages by level, and interleave with T, B, N
T  0	11		T 1
B  0	10 11		B 5
N 11	11 11 10
	10 10 10 10 10

T  0	10 01	T  2	11 11		T  4	10	T  5
B  0	10 01	B  2	11 10 10 01	B  7	10	B  8
N  8	01 10	N 16	10 10 10 10 10	N  4	10
	01 10		10 01 10 10 10		10 

(we don't need the page addresses to search by hand)


II Searching the full trie

Searching this trie for 10001000: we use "x" to show which "1" is hit	130 line

T  0	1x		T 1
B  0	10 x1		B 5
N 11	11 x1 10
	10 10 x0 10 10

T  0	10 01	T  2	1x 11		T  4	10	T  5
B  0	10 01	B  2	11 x0 10 01	B  7	10	B  8
N  8	01 10	N 16	10 10 x0 10 10	N  4	10
	01 10		10 01 x0 10 10		10 

Gets to bottom of trie, so success.


Now, how does the computer do it?

Let's look at the search process on a single page. The bitpairs are just found
in a sequence with no way of telling which is the end of a level, e.g., the
root page
	11 10 11 11 11 10 10 10 10 10 10 
So we maintain a counter, skip, which tells us how many pairs are on each level,
and we can use this to increment a "cursor", last, which gives the last node on
the present level. Here they are for the root page
	skip= 1; last= 0;
 - Each time we see "11", we increment skip.
 - When we come to last, we set the next value for last by adding skip to it.
Try it for the root page (the bitpairs are counted from position pos=0)
	pos	bitpair	skip	last
			  1	  0
	 0	   11	  2	  2
	 1	   10
	 2	   11	  3	  5
	 3	   11	  4
	 4	   11	  5
	 5	   10		 10
	 6	   10
	 7	   10
	 8	   10
	 9	   10
	10	   10		 15
So the last bitpair in each level is at positions 0, 2, 5, 10, respectively.
(We don't need the 15 because we are at the end of the page.)

But we are not just reading through the trie: we are searching. In each level,
we need a bitpair to compare with the current input bit. Call the position of
this bitpair  next. We need to count the "1" bits in each bitpair up to the
bitpair at  next, in order to find  next  for the next level. We use a counter,
srch, to do this.
 - Initially, for the root page,	next= 0; srch= 0;
 - Before  next, increment  srch  for every "1" bit;
 - at  next, increment  srch  for the "1" bits including the one we marked "x",
   above, but not past it.
 - At  last, reset  next= last + srch, and reset  srch= 0.
Meanwhile, maintain  skip  and  last  as above.
Here it is for the first four bits of our example search key, 10001..
	pos	bitpair	input	srch	next	skip	last
			  1	  0	  0	  1	  0
	 0	   11	  0	  2,0	  2	  2	  2
	 1	   10		  1
	 2	   11	  0	  2,0	  4	  3	  5
	 3	   11		  2		  4
	 4	   11	  0	  3,0	  8	  5
	 5	   10				 10
	 6	   10		  1
	 7	   10		  2
	 8	   10	  1	  3	 13
	 9	   10
	10	   10		 15
So the bitpair to be checked in each level against the input bit is at positions
0, 2, 4, 8, respectively. (If the next level of nodes were also stored on this
page, we can see that the next bitpair to be checked would be at position 13,
and the end of the level would be at position 15.) Note that input bit 0 matches
bitpairs 10 and 11, and input bit 1 matches bitpairs 01 and 11.


Now we must consider changing pages. We might have to sequence through all the
bitpairs of the trie, but T and B save us from looking at any but the relevant
pages on each level. We need to use the (T, B, N, A) index to do two things:
1) find the address, A, of the next page, and 2) recalculate  next  and  last
so they work correctly for the page instead of for the whole trie.

Continuing the example, we can see how to do each of these.
1)  srch, counting the "1" bits, also counts the descendent subtries from the
current level. In this case, it has value 3, which means that we need to go to
the third subtrie from the beginning of the next level. T for that level tells
us which page this is on: we scan the T index until an entry >= 3, then pick
the page before. (In general, when descending from a page other than a leftmost
page of its level, such as the root page, we scan until >=B'+3, where B' is the
B value for the page we are just leaving.)

2) The value we need for  next  is
	B' + srch - T - 1,
where T is the T-value for the descendent page we have now selected, and  srch
gets reset to 0. In the example, we have  next= 0 + 3 - 2 - 1= 0; srch= 0.

For  last, we reset  skip  to T" - T, where T" is the T-value for the page
following the selected page on the same level (or the end-of-level value if
the selected page is the last on the level). Then last= skip - 1. For the
example,  skip= 4 - 2= 2; last= 1.


The final consideration is, if we have found the key in the trie, how do we
find the corresponding record in the data? The data was loaded in the same
order as the keys, and so the remainder of each record occupies a consecutive,
fixed-length position in the data file. A calculation almost the same as the
one that found  next  for the next level of trie pages will give the position
of the data record. In the example, our succesful search concluded at the
third bit of the last level of nodes on the second page of the last level of
pages. Since B for this page is 2, the record we want is the fifth, which is
the one at position 4 of the data file, if we count from 0. The position is
thus
	B + srch - 1


The overall logic for the search algorithm should be three nested levels of
loops, the outer loop over all page levels, the middle loop over all node
levels within the page, and the inner loop a combination of four steps:
loop until  next; code for  next; loop until  last; code for  last.

  for (int pgLev=0; pgLev<h; pgLev++)
    if (pgLev > 0) { find and read next page; reset skip, last, next, srch }
    for (int ndLev= 0; ndLev<t; ndLev++)
      while (...isBefore(next) ) { increment skip, srch }
      fail if no match, or succeed if last input bit, or
	get next input bit and increment skip, srch
      while (...isBefore(last) ) { increment skip }
      if (ndLev < t - 1) { reset next, srch, last }