```Copyright © 1998 T. H. Merrett

308-420A       Secondary Storage Algorithms and Data Structures           Week 3

Cost Analysis for Trees

- Phi-ary trees. (In these notes, we write "p" instead of "phi".) These are
complete trees with fanout  p  at each node. They can be characterized as
follows, where  h,  the height, is the number of levels in the tree and is
also the number of accesses required for a search for a single record at a
leaf, if each node is a block on secondary storage.

Level          Number of nodes           Cumulative
at this level          number of nodes
0                  1                        1
1                  p                      1 + p
2                 p^2                   1 + p + p^2
:                  :                        :
h-1               p^(h-1)             Sum (k=0..h-1) p^k

The total number of nodes, Sum (k=0..h-1) p^k, is n = (p^h - 1)/(p - 1).
It follows that h = log_p (n(p - 1) + 1). (Eq. 1.3-34 on p.107 is wrong.)

Since the number of accesses to get to level  k  is  k+1  (k=0..h-1),
the average or expected number of accesses required to find a single
record at any level is (Sum (k=0..h-1) (k+1)p^k)/n. Since (k+1)p^k is
the derivative of p^(k+1) with respect to  p,  and since the operation
of differentiating can be taken outside the summation, the numerator is
just the derivative of p(Sum (k=0..h-1) p^k), or

hp^(h+1) - (h+1)p^h + 1
num = -----------------------
(p - 1)^2

We can check two special cases. If p=2, for a binary tree, the expected
search depth is
(h - 1)2^h + 1
num/n =  --------------  ~  h - 1.
2^h - 1

This says that, on the average, a search in a binary tree will take us
down to the level just above the leaf nodes. This makes sense, because in
any binary tree, the number of leaves is one more than the number of internal
nodes. Half the nodes are leaves, so we expect the average search to get to
just before the leaf level.

The other special case is very large  p.  In this limit,

hp^(h-1)
num/n -> --------  =  h.
p^(h-1)

This says that for large  p,  the average search goes even deeper than for
p=2, ultimately reaching the leaves. In fact,  p  does not have to be very
large for this to be a close approximation. This makes sense, too, because
the bushier the tree, the more the leaves dominate the node count. It also
implies that we damage performance only negligibly if we move from a
homogenous B-tree, with data in internal nodes, to an inhomogenous B-tree,
with data all at the leaves. (And, of course, we gain by vastly increasing
the fanout and hence decreasing the depth.)

- B-trees. We use techniques similar to the above to find the upper and lower
bounds on the height,  h,  of a homogenous B-tree, as functions of  N,  the
number of records stored. The order of the B-tree is  f,  so that the
fanout,  s,  at any node satisfies  f/2 <= s <= f.

There are two extreme cases. The best case gives the lower bound on  h.  In
this case, the B-tree is as full as possible and each node contains f-1 keys
and has  f  descendents. The worst case, with the B-tree as empty as allowed
by the rules, gives the upper bound. Below are the contents and fanout of
each node, where f2 = ceil(f/2).

Best  Case                  Worst  Case
Level     No. Nodes    No. Records    No. Nodes   No. Records
0           1           f-1             1           1
1           f           f-1             2         f2-1
2          f^2          f-1           2*f2        f2-1
:           :            :              :           :
h-1       f^(h-1)        f-1        2*f2^(h-2)     f2-1

Thus, for these two cases, the total number of records is

(best case)   N = (f-1)(1 + f + f^2 + .. + f^(h-1)) = f^h - 1
(worst case)  N = 1 + 2(f2-1)(1 + f2 + .. + f2^(h-2)) = 2*f2^(h-1) - 1

and the range of values for  h  is

log_f (N+1) <= h <= 1 + log_f2 ((N+1)/2).  (x_y means x subscripted y)

This tells us the upper and lower bounds for the maximum search cost in
a homogenous B-tree. The minimum search cost is  1  access. What is the
expected search cost?

For an inhomogenous B-tree, the total number of records,  N,  is given
by the number of records in the leaf nodes. (And  f  and  f2  are much
larger than for the homogenous case.) What are the maximum and minimum
search costs?

- The above discussion gives the cost of searching B-trees. What about the
cost of building them? The maximum number of splits for any insertion is
h,  the height of the tree, but usually an insertion requires no splits
or occasionally one. The average number of splits, after building a tree
of  N  records in  n  nodes, is (n-1)/N ~ 1/b*alpha, where  b  is the
block size and  alpha  the load factor.

```