Copyright © 1998 T. H. Merrett

308-420A       Secondary Storage Algorithms and Data Structures           Week 3

                              Cost Analysis for Trees

- Phi-ary trees. (In these notes, we write "p" instead of "phi".) These are
  complete trees with fanout  p  at each node. They can be characterized as
  follows, where  h,  the height, is the number of levels in the tree and is
  also the number of accesses required for a search for a single record at a
  leaf, if each node is a block on secondary storage.

               Level          Number of nodes           Cumulative
                               at this level          number of nodes
                 0                  1                        1
                 1                  p                      1 + p
                 2                 p^2                   1 + p + p^2
                 :                  :                        :
                h-1               p^(h-1)             Sum (k=0..h-1) p^k

  The total number of nodes, Sum (k=0..h-1) p^k, is n = (p^h - 1)/(p - 1).
  It follows that h = log_p (n(p - 1) + 1). (Eq. 1.3-34 on p.107 is wrong.)

  Since the number of accesses to get to level  k  is  k+1  (k=0..h-1),
  the average or expected number of accesses required to find a single
  record at any level is (Sum (k=0..h-1) (k+1)p^k)/n. Since (k+1)p^k is
  the derivative of p^(k+1) with respect to  p,  and since the operation
  of differentiating can be taken outside the summation, the numerator is
  just the derivative of p(Sum (k=0..h-1) p^k), or

                          hp^(h+1) - (h+1)p^h + 1
                    num = -----------------------
                                (p - 1)^2

  We can check two special cases. If p=2, for a binary tree, the expected
  search depth is
                          (h - 1)2^h + 1
                 num/n =  --------------  ~  h - 1.
                             2^h - 1

  This says that, on the average, a search in a binary tree will take us
  down to the level just above the leaf nodes. This makes sense, because in
  any binary tree, the number of leaves is one more than the number of internal
  nodes. Half the nodes are leaves, so we expect the average search to get to
  just before the leaf level.

  The other special case is very large  p.  In this limit,

                num/n -> --------  =  h.

  This says that for large  p,  the average search goes even deeper than for
  p=2, ultimately reaching the leaves. In fact,  p  does not have to be very
  large for this to be a close approximation. This makes sense, too, because
  the bushier the tree, the more the leaves dominate the node count. It also
  implies that we damage performance only negligibly if we move from a
  homogenous B-tree, with data in internal nodes, to an inhomogenous B-tree,
  with data all at the leaves. (And, of course, we gain by vastly increasing
  the fanout and hence decreasing the depth.)

- B-trees. We use techniques similar to the above to find the upper and lower
  bounds on the height,  h,  of a homogenous B-tree, as functions of  N,  the
  number of records stored. The order of the B-tree is  f,  so that the
  fanout,  s,  at any node satisfies  f/2 <= s <= f.

  There are two extreme cases. The best case gives the lower bound on  h.  In
  this case, the B-tree is as full as possible and each node contains f-1 keys
  and has  f  descendents. The worst case, with the B-tree as empty as allowed
  by the rules, gives the upper bound. Below are the contents and fanout of
  each node, where f2 = ceil(f/2).

                       Best  Case                  Worst  Case
       Level     No. Nodes    No. Records    No. Nodes   No. Records
         0           1           f-1             1           1
         1           f           f-1             2         f2-1
         2          f^2          f-1           2*f2        f2-1
         :           :            :              :           :
        h-1       f^(h-1)        f-1        2*f2^(h-2)     f2-1

  Thus, for these two cases, the total number of records is

    (best case)   N = (f-1)(1 + f + f^2 + .. + f^(h-1)) = f^h - 1
    (worst case)  N = 1 + 2(f2-1)(1 + f2 + .. + f2^(h-2)) = 2*f2^(h-1) - 1

  and the range of values for  h  is

    log_f (N+1) <= h <= 1 + log_f2 ((N+1)/2).  (x_y means x subscripted y)

  This tells us the upper and lower bounds for the maximum search cost in
  a homogenous B-tree. The minimum search cost is  1  access. What is the
  expected search cost?

  For an inhomogenous B-tree, the total number of records,  N,  is given
  by the number of records in the leaf nodes. (And  f  and  f2  are much
  larger than for the homogenous case.) What are the maximum and minimum
  search costs?

- The above discussion gives the cost of searching B-trees. What about the
  cost of building them? The maximum number of splits for any insertion is
  h,  the height of the tree, but usually an insertion requires no splits
  or occasionally one. The average number of splits, after building a tree
  of  N  records in  n  nodes, is (n-1)/N ~ 1/b*alpha, where  b  is the
  block size and  alpha  the load factor.