Copyright © 1996 T. H. Merrett

308-420	         Secondary Storage Algorithms and Data Structures         Week 9

Design Problem


In an air traffic control system, aircraft file flight plans, which specify a sequence of rectangular cells they will fly through, chosen from a rectilinear grid of cells, all of the same dimensions, covering the country. They also file departure and arrival times and airports. This makes it possible to predict roughly when each aircraft will be in each cell. Aircraft are identified by flight numbers.

Thus, flight plans involve fields FlightNo, DepPort, DepTime, ArrPort, ArrTime, and a sequence of Cells.

While an aircraft is in flight, it is monitored, and every ten seconds its Longitude, Latitude, and Bearing are recorded. These must be checked against the predictions of the flight plan, and a warning issued if the aircraft is in the wrong cell at the wrong time.

If this happens, the aircraft must be checked against any other aircraft near it for position and bearing. If two aircraft will come within a preset distance from each other on their present bearings, the time when this will happen must be calculated. This process must be complete before the next position and bearing data are recorded.

(We assume that all aircraft fly at the same altitude.)

Using the categories on which the course is based, and making appropriate calculations approximately, design file(s) to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 2 $\mu$sec. (microseconds) transfer time per byte and 20 msec. (milliseconds) average seek time.

308-420A			Design Problem 			 Week 9
SIZES (9 marks)
Suppose 10,000 aircraft in flight at any time; maybe 20,000 planes total;
Cell size 10km*10km (0.8min to cross), avg. flight crosses 100 cells;
Keep no history of position or bearing.
FPlan(FlightNo, DepPort, DepTime, ArrPort, ArrTime)
6        3        4         3       4       = 20 * 20,000 = 400K
CellSeq(FlightNo, CellX, CellY, InTime, OutTime)
6       4      4       4        4         = 22 *100*20,000 = 40M
PosBear(FlightNo, Long, Lat, Bear)   could be part of FPlan, but it's not plan
6       4    4     4  = 18 (round to 20) * 10,000 = 200K
Every 10 sec., record PosBear, check against CellSeq (InTime), possibly check
against PosBear (Long, Lat).
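A quick check of the size arithmetic (a sketch; field widths as assumed above; note the notes round the 18-byte PosBear record to 20 and 44M to 40M):

```python
# Sketch: recompute the three file sizes from the field widths above.
fplan_rec = 6 + 3 + 4 + 3 + 4     # FlightNo, DepPort, DepTime, ArrPort, ArrTime
cellseq_rec = 6 + 4 + 4 + 4 + 4   # FlightNo, CellX, CellY, InTime, OutTime
posbear_rec = 6 + 4 + 4 + 4       # FlightNo, Long, Lat, Bear (18 B, rounded to 20)

fplan = fplan_rec * 20_000             # 400 KB for all plans
cellseq = cellseq_rec * 100 * 20_000   # 44 MB (the notes round to 40M)
posbear = posbear_rec * 10_000         # 180 KB for the 10,000 aircraft in flight
print(fplan, cellseq, posbear)
```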

AVS (18 marks)
FPlan
lo V:  replaced daily; created all at once (plans don't change, except for
weather, which changes _all_ plans, or major equipment failure)
hi A: used only to calculate times for CellSeq; basically SEQUENTIAL, so
? S: doesn't matter
CellSeq
lo V: _replaced_ daily, created all at once
hi A: look up _all_ flights every 10 sec., but only 1 of 100 cells: 1% is
big for R=20 (break-even activity = R/(R+rho) = 20/10020 = 0.2%)
lo S: only InTime
PosBear
hi S: look up Long, Lat
lo V: rewrite only, every 10 sec: this does not affect structure. Even if
we add/delete on takeoff and landing, do this at rewrite time.
lo A: exceptional lookups (must be fast): hi if 20 flts in 10,000 are close
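The break-even activity used above comes from the course's rule of thumb R/(R + rho), with rho the seek time expressed in byte transfers; a sketch with the stated disk parameters:

```python
# Sketch: break-even activity R/(R + rho) for the given disk.
seek = 20e-3           # 20 ms average seek
transfer = 2e-6        # 2 microseconds per byte
rho = seek / transfer  # one random access "costs" 10,000 bytes of transfer
R = 20                 # approximate CellSeq record size in bytes
breakeven = R / (R + rho)
print(f"{breakeven:.2%}")  # ~0.2%; the 1% activity above is well beyond this
```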

CHOICE (5 marks)
FPlan: SEQ (small, but not necessarily RAM)
CellSeq: seq. is too slow (40 sec.). Need another ordered method, but NOT
direct because of worst case. B-Tree clustered on InTime.
PosBear: B-tree again, because of worst case; Z-order on Long, Lat.

TIMING (8 marks)
(Can we look up 10,000 plans in CellSeq in 10 sec.?)
N = 10^6 records, f=b~200 (4K blocks);
worst case $1 + \lceil \log_{100} 10^6 \rceil = 4$ levels
so 4 * (20ms + 4K * 2 $\mu$s) = 4 * 28ms = 112ms per access
Clustered by time, so 10,000 flights in 50 blocks
so 50 * 112ms = 6 sec: we made it! (Could go up to 16,666 flts.)
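The timing argument can be checked step by step (a sketch; 4K blocks and worst-case half-full B-tree nodes, as above):

```python
import math

# Sketch: can we look up all 10,000 in-flight plans in CellSeq in 10 s?
N = 10**6                 # CellSeq records for in-flight aircraft
fanout = 100              # worst case: nodes half full (f = b ~ 200)
height = 1 + math.ceil(math.log10(N) / math.log10(fanout))  # 1 + 3 = 4 levels
access = 20e-3 + 4096 * 2e-6   # seek + transfer of one 4K block = 28 ms
per_probe = height * access    # 112 ms to fetch one leaf

blocks = round(10_000 * 22 / 4096)  # flights clustered on InTime: ~54 blocks
total = blocks * per_probe          # ~6 s, within the 10 s budget
print(height, per_probe, total)
```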


Copyright © 1998 T. H. Merrett

308-420             Secondary Storage Algorithms and Data Structures      Week 9

Design Problem


A world-wide database for travellers records information on 500,000 restaurants, 100 restaurant chains, and 1000 restaurant districts.

Some, but certainly not all, of the restaurants belong to chains. Some, but certainly not all, of the restaurants are in restaurant districts.

The fields stored should include RestName, Address, TC (whether the restaurant is in town or in the country), Cuisine (e.g., Afghanistani, Greek, Hamburger), PriceRange (low/medium/high), ChainName and HeadQuarters of the chain, District, City of the district, and South, West, North and East streets bounding the district.

The database should be able to answer queries such as ``Find all restaurants in Montréal [Address] serving medium-priced Lebanese food'', or ``Find the district with the most low-priced restaurants'', or ``Find all the districts which have a McDonald's''. Other queries should also be possible: the files should be designed for flexibility.

Furthermore, the agency maintaining the database should be able to add new restaurants, districts and chains.

Using the categories on which the course is based, and making appropriate calculations approximately, design files to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 2 microsec. (microseconds) transfer time per byte and 20 msec. (milliseconds) average seek time.

308-420A			Design Problem 			 Week 9

file sizes (9 marks)
Rest(RestName, Address, TC, Cuisine, PriceRange, ChainName, District)
20	  40	 1	20	  2		20	20  :120
*500K = 60Mb	6K pages => 10^4 pages
Chain(ChainName,HQ)
20	20: 40	*100 = 4Kb RAM!
Dist(District, City, South, West, North, East)
20	20	20   20	    20	  20 :120	*1000 = 120Kb	20 pages

AVS (18 marks)
A: e.g., get list of restaurants for travellers, is presumably low activity:
say, 50 restaurants of 500K  << 1%
But queries such as "find districts with low-priced restaurants" could be
high activity. It says, be flexible.
V: "add new restaurants, chains, districts": take it as high volatility
S: "flexible": large variety of queries, so high.

Choice (5 marks)
Seq   B-tree   B-tree+Z   KDB   trie   kd-trie   hash   lin.hash   tidy   SMP   DMP
A!    S!                        S!     V?        S!     S!         S!     V!
try DMP for both Rest, Dist

Costs (8 marks)

MP: don't use RestName (it's a key, or at least an unlikely source of high activity)
don't use TC, PriceRange (not enough selectivity)
left with Address (2000), Cuisine (1000), District (1000), Chain (100)

I 2-D DMP: Address, Cuisine	10^4 = 2x * 1x	x = 100/root(2)
f1= 140, f2= 70	n= 9800, B= 6123
or f1= 150, f2= 75	n= 11250, B= 5334

Q1 one access: 20ms + 6K * 2 mu-s = 32 ms
Q2 seq: 20 ms + 6*10^7 * 2 mu-s = 120 sec
Q3 seq: ditto

II 3-D DMP: Address, Cuisine, District	10^4 = 2x * 1x * 1x	x = 10 root3(5)
f1= 34, f2= 17, f3= 17	n= 9826, B= 6107

Q1 17 accesses: 17*32ms = 544 ms
Q2 seq: 120 sec
Q3 seq: 120 sec

III 4-D DMP: Address, Cuisine, District, Chain	10^4 = 20x * 10x *10x * x
x = root4(5)
f1= 20, f2= 15, f3= 15, f4=2	n= 9000, B= 6667

Q1 15*2=30 accesses: 30*32ms = 960 ms
Q2 seq: 120 sec
Q3 10^4/2 accesses: 10^4*32ms/2 = 160 sec slower than seq (activity 50%)
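The three layouts can be compared with a short cost model (a sketch; 6K pages and the partition counts derived above):

```python
# Sketch: page-read cost model for the three DMP layouts.
seek, transfer = 20e-3, 2e-6        # 20 ms seek, 2 us per byte

def q_cost(pages, page_bytes=6000):
    # cost of reading `pages` pages, one seek each
    return pages * (seek + page_bytes * transfer)

# I  (2-D, Address x Cuisine): f1 = 2*f2, f1*f2 = 10^4 pages
f2 = round((10**4 / 2) ** 0.5)      # ~71; the notes round to 70 or 75
f1 = 2 * f2
# Q1 (Montreal + Lebanese) fixes both axes: 1 page.
# II (3-D, + District, f3 = 17): Q1 leaves District free: 17 pages.
# III (4-D, + Chain, f3 = 15, f4 = 2): Q1 leaves both free: 30 pages.
print(q_cost(1), q_cost(17), q_cost(30))   # ~0.032 s, ~0.54 s, ~0.96 s
```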

So of these three, 2-D DMP seems best, for the three given queries. Of course,
different assumptions about Vi may change this.


Copyright © 1999 T. H. Merrett

308-420           Secondary Storage Algorithms and Data Structures        Week 9

Design Problem


The hyperglobe is a proposal for organizing knowledge according to two geographical coordinates (longitude and latitude) and one temporal coordinate (date), indicating the origin of the topic. The ``knowledge'' is represented as large chunks of unstructured text or other unformatted data.

For instance, at the location of Hastings, England, at 1066 AD, there might be a short book (100 Kbytes) on the Norman invasion; at the location of Montreal for 1987 AD there might be a score (10 Kbytes) and an audio record (40 Mbytes) of the premiere of Patriquin's Earthpiece; at the location of Moscow for 1991 AD, there might be a short CNN television report on the Soviet coup (1 Gbyte).

To resolve longitude and latitude to about 0.1 seconds of arc (about 3 metres at the equator) requires 3-byte fields. To resolve dates to quarter-days also requires a 3-byte field.

There may not be room on secondary storage for the chunks of unstructured data, and they may even be stored in some other format, such as on optical disk, or as documents in some remote online library. So they will be represented in the main file by an identifier of, let us say, 4 bytes.

The documents contain cross references to each other by this identifier, by which we must be able to locate them rapidly in the system. The geographical and date coordinates should be rapidly retrievable given the identifier. A user should be able to examine documents for all dates at a given location or for all locations at a given date.

Secondary storage should hold a description of the name and of the location of any data that is stored elsewhere.

New documents are frequently added, but once there do not often change.

Suppose there are 10,000 different geographical references, 10,000 different dates, and 10,000,000 different documents.

Using the categories on which the course is based, and making appropriate calculations approximately, design file(s) to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 20 nsec. (nanoseconds) transfer time per byte and 5 msec. (milliseconds) average seek time.

308-420A			Design Problem 			 Week 9
FILES & SIZES (9 marks)
Coords(Long, Lat, Date, DocID)
3    3     3     4           13 * 10M = 130MB
Remote(DocID, Title, Location)
4      40      56            100 * 1M = 100MB
Text(DocID, Doc)              suppose 1st byte of ID indicates text, audio, ..
4    100K(avg)                 100K* 8M = 800GB
Video(DocID, Doc)
4    1G(avg)                  1G* 1000 = 1TB
etc.
(these should add to 10M docs)
AVS (18 marks)
                 A                 V            S
            lo   hi   100%     lo   hi      lo   hi
Coords       Y   (Y)                             Y
Remote       Y   (Y)                Y        Y
Text         Y   (Y)                Y        Y
Video        Y   (Y)                Y        Y
etc.         Y   (Y)                Y        Y

Activity:
Coords  lo because need to ref individual doc by DocID
hi? breakeven 13/250013=1/20000
activity given 1 loc (or date) is 1/1000: lo,
but these are averages, and some cases could be > beakeven
others  we must assume activity is proportional to that for Coords
Volatility:
all files  grow only: vol. depends on effect on organizing fields
Symmetry:
Coords  hi(2-D): Date, {Long, Lat}   NB. DocID is a key, may be a 3rd dim.
others  lo: DocID
Volatility (again): suppose DocID is assigned on filing, last DocID + 1
Coords  may affect file structure  e.g., SMP won't do
others  need not affect file structure apart from appending at end
since Seq is ruled out by A, V is hi

CHOICE (5 marks)
Coords
      Seq  B  B+Z  trie  kdtrie  hash  kdhash  linhash  tidy  SMP  DMP
A      X
V                                 X                      X    X
S           X       X             X              X       X
So: B+Z, kdtrie (logarithmic), DMP (direct). Choose DMP: 2-D, with DocID given
by secondary index. This can be a 1:1 tidy "function" because DocIDs are
consecutive integers: Key(Link) = 4 bytes * 10M = 40MB
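The tidy ``function'' can be sketched as a positional file: because DocIDs are consecutive integers, the link for DocID i sits at byte offset 4*i, so no search is needed (toy values below; the real Key file would hold 10M 4-byte links):

```python
import struct

# Sketch: a tidy 1:1 index; link for DocID i is at byte offset 4*i.
links = [7, 42, 3, 99]   # toy 4-byte link values for DocIDs 0..3
key_file = b"".join(struct.pack(">I", x) for x in links)

def link_for(docid: int) -> int:
    off = 4 * docid      # direct (tidy) addressing, no search
    return struct.unpack(">I", key_file[off:off + 4])[0]

print(link_for(2))   # -> 3
```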
others
      Seq  B  B+Z  trie  kdtrie  hash  kdhash  linhash  tidy  SMP  DMP
A      X
V                                 X                      X    X
S              X          X             X                     X    X
So: B-tree or trie (logarithmic), linear hashing (direct). Choose linhash.
But note that Text, Video, etc., have _variable-length_ records, so will
need indexing files, like Key.

TIMING (8 marks)
1. Finding 1 doc by DocID
access Key: 5 ms + B*20ns = 5 ms
access Coords: ditto = 5 ms
access Remote: ditto = 5 ms
access Text or Video or etc: 5 ms (index) + 5 ms + 20ns*docsize
total 15 ms (Remote) or 20+2 ms (Text) or 20 sec (Video)
2. Finding all docs for place (or date: same numbers)
access Coords: suppose B = 1443, so n= 90,000= 300 * 300 (square by spec)
1 row is 1/300 of this: 300 pages, so 300*(5ms + 1443*20ns) = 1.5 sec
(NB seq access is 130M * 20ns = 2.6 sec)
access Remote or Text or Video or etc (resp. n= 10^5,10^9,10^9; r=10^4):
n(1 - (1 - 1/n)^r) = r - r^2/2n + r^3/6n^2 - .. = resp. n*0.08, r, r
= (resp. 8000, 10000, 10000 accesses) * (5ms + 1K*20ns)
NB this applies only to Remote, because Text, Video, etc. have _records_
much bigger than 1K, and we should analyze the indexing files.
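The expected-blocks formula n(1 - (1 - 1/n)^r), for r random records spread over n blocks, is easy to check numerically (a sketch):

```python
# Sketch: expected number of blocks touched when r random records
# lie in a file of n blocks: n * (1 - (1 - 1/n)^r).
def expected_blocks(n: float, r: int) -> float:
    return n * (1 - (1 - 1 / n) ** r)

# Remote: n = 10^5, r = 10^4 gives ~9,500 accesses (the notes' rounded
# n*0.08 = 8,000 is the same order); for n = 10^9 it degenerates to ~r,
# one access per record.
print(expected_blocks(1e5, 10_000), expected_blocks(1e9, 10_000))
```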




308-420             Secondary Storage Algorithms and Data Structures      Week 9

Design Problem


There are potentially about 18 million postal codes in Canada, each defining a region on the map which we can suppose to be a polygon of about six sides. Thus, there are about 53 million edges in the map of all postal codes (almost every edge is shared by the two regions it separates).

This question is concerned with a secondary storage representation of this postal code map, based on the edges. Each region is a cycle of edges, and each vertex (a corner where edges meet) is also a cycle, of the edges meeting there. So the edges, vertices, and regions are described by a structure in which each edge is stored four times, about 211 million records in all. Each edge appears in two vertex cycles (the vertex at the start of the edge and the vertex at the end) and in two region cycles (the region to the left of the edge and the region to the right). Each edge has an identifier and a direction (so ``start'', ``end'', ``left'', and ``right'' are established).

[The above two paragraphs motivate the following. You need not have read them completely.]

A cycle is stored as a set of edge pairs: an edge, and the edge following it in the cycle. The file to represent the cycles will have four fields, edgeID, edgeDirection, successorID, and successorDirection. It will have 211 million records.

In addition, you need to store information for each edge naming its two vertexes and its two regions, and coordinates for each vertex.

There are two kinds of operations that should be supported by your file design. The first is drawing the map of all the postal code regions, or any part of the map. The second is changing the map by means of {\em splice} operations: splice(edgeID1, edgeDirection1, edgeID2, edgeDirection2) finds each edge in the file and replaces it by the other edge. With enough splice operations, any rearrangement of the map is possible.

Using the categories on which the course is based, and making appropriate calculations approximately, design file(s) to support the above application. Calculate file sizes, and estimate the times to do the operations. Assume a disk with 20 nsec. (nanoseconds) transfer time per byte and 5 msec. (milliseconds) average seek time.

308-420A			Design Problem 			 Week 9
1) let's make it very simple: two files, one for drawing, one for splice
(it wasn't specified what splice changes, so suppose it changes only cycles:
this is not actually true - splice also creates or destroys faces)
we'll also start by supposing applic 1 is to draw only the whole file.
FILES & SIZES (9 marks)
cycles(edgeID,edgeDir,succID,succDir)    (4 bytes for ID since 2^26=64*10^6)
4      1       4       1       = 10 * 211*10^6 = 2.1*10^9
edges(edgeID,startX,startY,endX,endY,left,right) can also have start,end
4      4       4    4    4     6    6                    4   4
= ((5 or 7) * 4 + 2*6) *53*10^6 = 2*10^9
AVS (18 marks)
applic 1: draw whole file: use only edges - 1 pass to draw edges in any order
100% activity, does not change, no field singled out
applic 2: splice: 1 splice costs 4 accesses to cycles on each of edgeID,succID
and we assumed it does not affect edges
activity 8/53M, 2D symmetry (edge/dir,succ/dir), splice changes key fields
access-transfer ratio, rho = 5ms/20ns = 250,000
breakeven activities, R/(R+rho) = (10, 40)/250,000 = (1/25K, 1/6K) for (c, e)
        A    V    S
e1      h    l    l, in fact 0D
c2      l    h    h (2D: edgeID, succID)
CHOICE (5 marks)
      Seq  B  B+Z  trie  kdtrie  hash  kdhash  linhash  tidy  SMP  DMP
cA     l
 V     h                          h     h                h     h
 S     h   h        h             h     h       h        h
 ?             ?           ?                                        ?
try DMP on edge and succ (10^53*10^53 even though each is a 1/4th key)
eA                                h     h       h
 V
 S             -           -            -                      -    -
 ?     ?   ?        ?                                    ?
try seq
TIMING (8 marks)
e1 draw all: 2*10^9 bytes in 1 pass: 5ms + 2*10^9 * 20*10^-9 = 40 sec
c1 splice: a) find the 4 matching edges under edgeID-edgeDir:
2*10^9 bytes / B=2K => n = 10^6 => fe = fs = 1000 (Ve = Vs = 10*53)
so 1000 * (250,000 + 2K) * 20*10^-9 = 5 sec
b) find the 4 matching edges under succID-succDir: also 5 sec
c) do same for the edges they were swapped for: 10 sec. more
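The splice cost can be re-derived from the DMP parameters (a sketch; n = 10^6 pages of B = 2K, fe = fs = 1000 as above):

```python
# Sketch: one splice against the 2-D DMP cycles file.
rho = 250_000        # seek expressed in bytes: 5 ms / 20 ns per byte
B = 2048             # page size in bytes
transfer = 20e-9     # 20 ns per byte

def scan(pages: int) -> float:
    # cost of reading `pages` pages, one seek each
    return pages * (rho + B) * transfer

row = scan(1000)           # all pages in one edgeID (or succID) slice: ~5 s
splice_total = 4 * row     # steps a) + b), then c) again for the swapped edges
print(row, splice_total)   # ~5 s and ~20 s per splice
```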

2) improvement to allow drawing of partial map:
AVS (18 marks)
e changes to allow us to specify (i) any set of regions (ii) a range of coords
so can be as low A as 6/53M < 1/6K, so both lo and hi activity;
still low V; symmetry -
either a region, say left, or a coord pair, say startX,Y
(this assumes edges are short, so left, right are not far apart,
ditto start, end)
        A     V    S
e1      lh    l    h, 2D (edge, startXY)
CHOICE (5 marks)
      Seq  B  B+Z  trie  kdtrie  hash  kdhash  linhash  tidy  SMP  DMP
eA     l                          h     h       h
 V                                              -                   -
 S     h   h        h             h             h        h
 ?             ?           ?                                   ?
try SMP on left, (startX,startY)      NB z-order start for small range queries
TIMING (8 marks)
Vl = 18M, Vs = 36M, if coord resolution is high enough
2*10^9 bytes / B=1K => n = 2*10^6 => fl = fs/2 = 1000
e1 draw all: same - just read the SMP file in 1 pass, no extra overhead
draw selected X-Y range (high-A): select on-the-fly in sequential scan
draw selected regions (high-A): merge on left (or right) with region set
draw selected regions: 2000 * (250,000 + 2K) * 20*10^-9 = 10 sec
so 10 sec for every 1/1000 of the file
draw selected coord pairs: 5 sec for every 1/2000 of the file
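The partial-draw costs follow the same model (a sketch; n = 2*10^6 pages of B = 1K, fl = 1000, fs = 2000 as above):

```python
# Sketch: SMP on (left, startXY); cost of selecting one slice per axis.
rho = 250_000           # 5 ms seek / 20 ns per byte
B = 1024                # page size in bytes
transfer = 20e-9
n = 2 * 10**6           # pages in the edges file
fl, fs = 1000, 2000     # partitions on left and on startXY

def cost(pages: int) -> float:
    return pages * (rho + B) * transfer

by_region = cost(n // fl)  # one left-slice: 2000 pages, ~10 s (1/1000 of file)
by_coords = cost(n // fs)  # one startXY-slice: 1000 pages, ~5 s (1/2000 of file)
print(by_region, by_coords)
```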