Bettina Kemme: Database Replication for Clusters of Workstations
Dissertation (PhD), Diss. ETH No. 13864, ETH Zürich, Departement of Computer Science, 2000.
Click to get the
Letter
Format PostScript
,
Letter Format Gzipped PostScript ,
Letter
Format PDF
,
DINA4
Format PostScript
,
DINA4 Format Gzipped PostScript ,
DINA4
Format PDF
Abstract:
This thesis is centered around the topic database replication. The
work has been motivated by the advances in the development of cluster
databases and their specific demands in terms of high throughput, low
response times, flexible load-balancing, data consistency and
fault-tolerance. Eager, update-everywhere replication seems a
promising mechanism to achieve these goals. By replicating data across
the cluster, transaction have fast access to local copies and the
available system can take over the work of failed sites. Using eager
replica control the updates on the different copies are coordinated
before the transaction commits providing data consistency and
fault-tolerance in a straightforward way. Update-everywhere is a
prerequisite for load balancing since only then a transaction can be
submitted at any site without any restrictions. However, existing
eager, update-everywhere solutions have severe performance limitations
and are barely used in practice.
What is needed is a solution that bridges the gap between theory and
and practice, and implements efficient, eager, update-everywhere
replication. The thesis provides several contributions to this goal.
We start with a detailed analysis of existing solutions depicting
those assumptions and techniques that make current solutions
inefficient and impracticable. In an attempt to eliminate these
limitations the thesis presents a couple of basic techniques that we
believe are necessary to provide efficiency: keeping the number of
messages small, using the powerful multicast primitives of group
communication systems to support replica control and fault-tolerance,
simplifying the coordination between the database systems, avoiding
deadlocks and a 2-phase commit protocol, and providing different
levels of transaction isolation and fault-tolerance in order to adapt
to various hardware and workload configurations.
Based on these techniques the thesis develops a replication tool in
three steps. First, we develop a theoretical framework including a
family of replica control protocols. These protocols delay
communication until the end of the transactions where all updates are
multicast in a single message to all sites. By using a total order
multicast as provided by group communication systems all messages
arrive in the same order at all sites. By guaranteeing that all sites
serialize conflicting transactions according to this order
transactions are globally ordered without further coordination among
the sites. In order to reduce resource consumption remote sites do not
need to reexecute the operations but only apply the physical changes.
The proposed replica control protocols provide different levels of
isolation allowing for various degrees of concurrency. Furthermore,
they offer two levels of fault-tolerance which vary in the degree of
data consistency on failed nodes. All protocols provide data
consistency on available sites and correctness criteria that are well
understood and widely used in practice.
In a second step we evaluate our approach in two ways. First, we
have built a simulation system that provides a performance comparison
of all proposed protocols and a comparison with traditional
solutions. This study shows that our approach provides
superior performance compared to traditional solutions, is more stable
and applicable for a much wider range of configurations and
workloads. By providing a whole family of protocols the approach is
able to adjust to the environment since the protocol can be chosen that
is best suited to alleviate the specifics of a given configuration. In
particular, the protocols are able to address high network costs and
high conflict rates. Second, we have proven the feasibility of the
approach in a real cluster environment by integrating it into the
database system PostgreSQL. Most of the functionality has been added
in separate modules and only few changes to the existing PostgreSQL
system were necessary. With this, we are confident that the approach
can be implemented in a similar way in other systems. The evaluation of
the system verified the results of the simulation study and proved
that eager update-everywhere can achieve high throughputs and low
response times, provide good scalability and allows for flexible
load-balancing.
The third step of the thesis evaluates further important issues. It
presents a solution to recovery that smoothly fits into the proposed
framework. It offers online joining and leaving of nodes while
maintaining data consistency on all sites. Furthermore, we discuss the
issues related to partial replication. Our solution is an attempt to
keep the replcation overhead proportional to the number of replica
while still providing a simple and practicable replica control
mechanism. We show that our approach supports non-replicated data as
well as remote data access, and flexible and fast subscription to
replica. The solutions to both recovery and partial replication have
been proven practicable by implementing them into PostgreSQL.
The work has been embedded into the DRAGON project. Within this
project, further work has focused on the relationship between group
communication system and database system. This work proposes tighter
interleaving between these two components in order to provide exactly
those multicast semantics that are needed by the database system. In a
first attempt we propose a scheme that is able to overlap transaction
processing with the communication delay to provide even smaller
response times.
As a summary the thesis proposes an eager update-everywhere
replication tool that provides most of the functionality needed in
cluster databases. The approach has been developed by building a solid
theoretical framework and by proving the feasibility in terms of
performance (simulation) and practicability (real implementation).