Bettina Kemme: Database Replication for Clusters of Workstations

Dissertation (PhD), Diss. ETH No. 13864, ETH Zürich, Departement of Computer Science, 2000.

Click to get the
Letter Format PostScript , Letter Format Gzipped PostScript , Letter Format PDF ,
DINA4 Format PostScript , DINA4 Format Gzipped PostScript , DINA4 Format PDF

This thesis is centered around the topic database replication. The work has been motivated by the advances in the development of cluster databases and their specific demands in terms of high throughput, low response times, flexible load-balancing, data consistency and fault-tolerance. Eager, update-everywhere replication seems a promising mechanism to achieve these goals. By replicating data across the cluster, transaction have fast access to local copies and the available system can take over the work of failed sites. Using eager replica control the updates on the different copies are coordinated before the transaction commits providing data consistency and fault-tolerance in a straightforward way. Update-everywhere is a prerequisite for load balancing since only then a transaction can be submitted at any site without any restrictions. However, existing eager, update-everywhere solutions have severe performance limitations and are barely used in practice.
What is needed is a solution that bridges the gap between theory and and practice, and implements efficient, eager, update-everywhere replication. The thesis provides several contributions to this goal.
We start with a detailed analysis of existing solutions depicting those assumptions and techniques that make current solutions inefficient and impracticable. In an attempt to eliminate these limitations the thesis presents a couple of basic techniques that we believe are necessary to provide efficiency: keeping the number of messages small, using the powerful multicast primitives of group communication systems to support replica control and fault-tolerance, simplifying the coordination between the database systems, avoiding deadlocks and a 2-phase commit protocol, and providing different levels of transaction isolation and fault-tolerance in order to adapt to various hardware and workload configurations.
Based on these techniques the thesis develops a replication tool in three steps. First, we develop a theoretical framework including a family of replica control protocols. These protocols delay communication until the end of the transactions where all updates are multicast in a single message to all sites. By using a total order multicast as provided by group communication systems all messages arrive in the same order at all sites. By guaranteeing that all sites serialize conflicting transactions according to this order transactions are globally ordered without further coordination among the sites. In order to reduce resource consumption remote sites do not need to reexecute the operations but only apply the physical changes. The proposed replica control protocols provide different levels of isolation allowing for various degrees of concurrency. Furthermore, they offer two levels of fault-tolerance which vary in the degree of data consistency on failed nodes. All protocols provide data consistency on available sites and correctness criteria that are well understood and widely used in practice.
In a second step we evaluate our approach in two ways. First, we have built a simulation system that provides a performance comparison of all proposed protocols and a comparison with traditional solutions. This study shows that our approach provides superior performance compared to traditional solutions, is more stable and applicable for a much wider range of configurations and workloads. By providing a whole family of protocols the approach is able to adjust to the environment since the protocol can be chosen that is best suited to alleviate the specifics of a given configuration. In particular, the protocols are able to address high network costs and high conflict rates. Second, we have proven the feasibility of the approach in a real cluster environment by integrating it into the database system PostgreSQL. Most of the functionality has been added in separate modules and only few changes to the existing PostgreSQL system were necessary. With this, we are confident that the approach can be implemented in a similar way in other systems. The evaluation of the system verified the results of the simulation study and proved that eager update-everywhere can achieve high throughputs and low response times, provide good scalability and allows for flexible load-balancing.
The third step of the thesis evaluates further important issues. It presents a solution to recovery that smoothly fits into the proposed framework. It offers online joining and leaving of nodes while maintaining data consistency on all sites. Furthermore, we discuss the issues related to partial replication. Our solution is an attempt to keep the replcation overhead proportional to the number of replica while still providing a simple and practicable replica control mechanism. We show that our approach supports non-replicated data as well as remote data access, and flexible and fast subscription to replica. The solutions to both recovery and partial replication have been proven practicable by implementing them into PostgreSQL.
The work has been embedded into the DRAGON project. Within this project, further work has focused on the relationship between group communication system and database system. This work proposes tighter interleaving between these two components in order to provide exactly those multicast semantics that are needed by the database system. In a first attempt we propose a scheme that is able to overlap transaction processing with the communication delay to provide even smaller response times.
As a summary the thesis proposes an eager update-everywhere replication tool that provides most of the functionality needed in cluster databases. The approach has been developed by building a solid theoretical framework and by proving the feasibility in terms of performance (simulation) and practicability (real implementation).