Topic Table
400-Project:
Thesis / Projects
A graphical representation tool for
scientific experimental workflows
Bioinformatics research and other scientific research is driven by experiments.
With the introduction of new technology, more and more experiments
are automated, leading to increasingly complex experimental workflows. The
outcomes of experiments and the relationships between individual experiments
are often stored in relational database systems. A workflow starts with one
or more experiments. The output of these experiments is used to perform the
next steps in the experimental suite. An experimental workflow can be described
as a graph such that each node in the graph is an experiment, and an edge
from experiment A to experiment B indicates that A was executed directly
before B and its output served as input for B.
The task of this 400 project is to build a graphical user interface representing
existing experimental workflows.
The system retrieves information about past and current workflows from an
existing experimental database.
It then presents the information to the user in the form of nested graphs.
The user can view an entire workflow, or a sub-workflow starting or ending
with a specific experiment.
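As an illustration of the graph model described above, a sub-workflow starting with a given experiment is simply the set of experiments reachable from it. The following is a minimal sketch; the experiment names are hypothetical, and in the real tool the edges would be read from the experimental database (e.g., via JDBC) rather than added by hand:

```java
import java.util.*;

// Minimal sketch of the workflow-as-graph model: nodes are experiments,
// an edge A -> B means A ran directly before B and fed its output to B.
public class WorkflowGraph {
    private final Map<String, List<String>> successors = new HashMap<>();

    public void addEdge(String from, String to) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // All experiments reachable from 'start': the sub-workflow
    // that starts with this experiment.
    public Set<String> subWorkflowFrom(String start) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (visited.add(node)) {
                for (String next : successors.getOrDefault(node, List.of())) {
                    stack.push(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        WorkflowGraph g = new WorkflowGraph();
        g.addEdge("prep", "massSpec");          // hypothetical experiments
        g.addEdge("massSpec", "peptideSearch");
        g.addEdge("prep", "gelRun");
        System.out.println(g.subWorkflowFrom("massSpec"));
        // -> [massSpec, peptideSearch]
    }
}
```

A sub-workflow *ending* with an experiment would be computed the same way over the reversed edges.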
Knowledge needed (prerequisite or to be acquired): basic knowledge of
SQL and relational database access, Java, and graphical interface development
(e.g., Swing).
The project is part of the Proteomics
Information Management project, which aims to build a fast-development
tool for bioinformatics experimental database systems.
Database Replication (several Master Theses
/ PhD Theses)
Data replication is attractive as a means to increase system throughput
and provide fault tolerance. We are currently working on replication strategies
based on group communication systems, looking both into replication within
the database kernel and middleware-based data replication.
Postgres-R is an extension to the public domain database system PostgreSQL
providing efficient and consistent data replication. It by far outperforms
other replication strategies with similar consistency guarantees. Our first
prototype version of Postgres-R is based on a rather old version of PostgreSQL.
In our current work, we are migrating Postgres-R to the newest version of
PostgreSQL. The current version of Postgres-R is based on a master/slave
approach: only one replica is allowed to execute updates, sending them to
the other sites at commit time; the secondary sites apply these changes.
Apart from this, secondary sites are allowed to execute read-only queries.
The latest work has focused on fault tolerance: the system continues to work
despite failures, and failed and new sites can (hopefully soon) join the
system.
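The master/slave scheme can be sketched in a few lines. This is a toy, single-process illustration with assumed class names; Postgres-R itself uses a group communication system to deliver writesets to remote sites:

```java
import java.util.*;

// Toy sketch of primary-copy ("master/slave") replication: only the
// primary executes updates; at commit it ships the resulting writeset
// to the secondaries, which apply it. Reads may go to any replica.
public class PrimaryCopy {
    static class Replica {
        final Map<String, String> data = new HashMap<>();
        void applyWriteset(Map<String, String> writeset) { data.putAll(writeset); }
        String read(String key) { return data.get(key); } // read-only queries allowed
    }

    static class Primary extends Replica {
        final List<Replica> secondaries = new ArrayList<>();
        final Map<String, String> pendingWriteset = new LinkedHashMap<>();

        // Updates are executed only at the primary and collected in a writeset.
        void update(String key, String value) { pendingWriteset.put(key, value); }

        // At commit time, apply locally and propagate the writeset to all sites.
        void commit() {
            applyWriteset(pendingWriteset);
            for (Replica s : secondaries) s.applyWriteset(pendingWriteset);
            pendingWriteset.clear();
        }
    }

    public static void main(String[] args) {
        Primary p = new Primary();
        Replica s1 = new Replica();
        p.secondaries.add(s1);
        p.update("account:42", "balance=100");
        p.commit();
        System.out.println(s1.read("account:42")); // secondary sees the committed update
    }
}
```

In an update-everywhere approach, by contrast, any replica may build and broadcast such a writeset, which is why the concurrency control component must then arbitrate between conflicting writesets.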
There is still a lot of exciting work left to be done. One challenge is
to transform the system to an update-everywhere approach, where every site
is allowed to perform updates. Feedback from industry on our first
prototype has shown that this is what industry wants and needs. An
update-everywhere approach requires adjusting the concurrency control
component of PostgreSQL. Since PostgreSQL uses the same concurrency control
methods as Oracle, being able to integrate the approach into PostgreSQL
will prove the concept on a large scale.
This work offers a lot of opportunities. Students have a chance to work on
and extend one of the most widely used public domain database systems. They
will look into the very details of how a complex database system works and
operates. Additionally, they have to work on a highly algorithmic level,
designing and redesigning abstract distributed algorithms, and implementing
them in a real system. Their work will build on the successful work of five
previous graduate students.
Middleware-based replication is more restricted than replication
within the database system, but it might be the only solution in heterogeneous
environments or if access to the internals of the database systems is restricted.
So far, we have developed a middleware-based replication tool that is able
to work with PostgreSQL. We would like to continue our research in several
directions. Our current prototype is a rather specialized module running
as a stand-alone system. We would like to integrate it as a special service
or component into a distributed computing environment, e.g., CORBA or even
J2EE. Furthermore, it has to be extended to work in a heterogeneous environment,
supporting not only PostgreSQL but also other database systems, like DB2.
Since replication has to be very efficient, it is unlikely that a single,
standard way for the middleware to interact with each of the different
databases will suffice. Instead, in order to optimize as much as possible,
wrapper modules have to be developed for the individual database systems,
taking advantage as much as possible of the internal options of the
underlying database system.
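The wrapper idea can be sketched as a single interface that the middleware programs against, with one implementation per database system. All names here are illustrative assumptions, not part of our prototype:

```java
import java.util.*;

public class WrapperDemo {
    // The middleware sees every database through this one interface;
    // each implementation is free to exploit system-specific features.
    interface ReplicaWrapper {
        void applyWriteset(Map<String, String> writeset); // changes shipped by the middleware
        Map<String, String> snapshot();                   // for illustration/testing only
    }

    // An in-memory stand-in. A real PostgresWrapper or Db2Wrapper
    // (hypothetical names) would translate writesets into SQL over JDBC,
    // using whatever capture/apply mechanism that system offers.
    static class InMemoryWrapper implements ReplicaWrapper {
        private final Map<String, String> rows = new HashMap<>();
        public void applyWriteset(Map<String, String> ws) { rows.putAll(ws); }
        public Map<String, String> snapshot() { return Map.copyOf(rows); }
    }

    public static void main(String[] args) {
        ReplicaWrapper w = new InMemoryWrapper();
        w.applyWriteset(Map.of("row:1", "v1"));
        System.out.println(w.snapshot().get("row:1")); // prints v1
    }
}
```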
The work offers similar opportunities as Postgres-R: working with real systems,
and transferring new ideas into real implementations.
Knowledge needed: solid knowledge of database systems, in particular their
internals (transactions, concurrency control, query execution); knowledge
of networks, communication primitives, and distributed computing; Unix
knowledge (processes, shared memory, inter-process communication, etc.).
More information about our database replication
project.
Exp-DB: a database tool for fast prototyping of
experimental database systems (Master Thesis / Master Project)
Exp-DB is a support tool helping in the design and development of experimental
databases. Exp-DB helps research groups performing bioinformatics experiments
to set up an initial web-based information system in reasonably short time
and without needing to be an expert in database design or advanced programming.
The architecture is based on well-established software-engineering principles.
A first prototype of Exp-DB already exists. Currently, we are adding an access
control component to the system.
Exp-DB needs extensions in several directions.
- It needs an advanced query module that not only allows for
flexible query execution over internal information but is also able to integrate
external sources for heterogeneous query execution.
- It needs support for inserting information into experimental tables.
Many experiments are now automated and supported by specialized software
tools. These software tools contain most of the relevant data regarding the
experiments. Currently, this information has to be entered by hand. What is
needed is a general programming interface (API) to allow the import of
experimental data. Given such a general API, specialized wrappers can be
written that translate the specific formats in which the software tools
provide data into the language understood by the interface.
- Workflow support: currently, the database system is passive in
the sense that users have to initiate actions to insert data or query inserted
data. An active system would be able to initiate new steps automatically and/or
inform users when new tasks have been completed or need to be done.
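To make the import API idea from the second point concrete, here is a minimal sketch. It assumes a hypothetical tab-separated instrument format; all names (the record type, the interface, the format) are illustrative, not a fixed design:

```java
import java.util.*;

public class ImportDemo {
    // Generic representation of one experiment's data, as the
    // database layer would understand it.
    record ExperimentRecord(String experimentId, Map<String, String> attributes) {}

    // What tool-specific wrappers implement. A real implementation would
    // insert the records into the experimental tables rather than return them.
    interface ExperimentImporter {
        List<ExperimentRecord> parse(String rawToolOutput);
    }

    // Example wrapper for a hypothetical instrument format:
    //   id<TAB>key=value;key=value
    static class TabSeparatedImporter implements ExperimentImporter {
        public List<ExperimentRecord> parse(String raw) {
            List<ExperimentRecord> out = new ArrayList<>();
            for (String line : raw.split("\n")) {
                String[] parts = line.split("\t", 2);
                Map<String, String> attrs = new LinkedHashMap<>();
                if (parts.length > 1 && !parts[1].isEmpty()) {
                    for (String pair : parts[1].split(";")) {
                        String[] kv = pair.split("=", 2);
                        attrs.put(kv[0], kv.length > 1 ? kv[1] : "");
                    }
                }
                out.add(new ExperimentRecord(parts[0], attrs));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        var records = new TabSeparatedImporter().parse("exp7\tmass=1234.5;charge=2");
        System.out.println(records.get(0).experimentId()); // prints exp7
    }
}
```

Each instrument's software would get its own `ExperimentImporter` wrapper, while the database layer only ever deals with `ExperimentRecord`s.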
Knowledge needed: a good understanding of database systems and information
system development; knowledge of the needs of bioinformatics researchers.
The programming environment is Java, JSP, and application servers.
Exp-DB is developed as part of the Proteomics Information
Management project, which aims to build a fast-development tool for
bioinformatics experimental database systems.
XML database systems (Master Thesis)
So far, a lot of research has been performed on efficient querying of XML
database systems -- XQuery being one of the best-known query languages for XML.
A wide range of standards exist (XPath, DOM, etc.), showing
the widespread use of XML in industry.
In our XML project, we are currently developing a client/server based XML
engine with a focus on updating XML data. The current engine is
able to perform a wide range of XML updates in single-user mode. The extensions
currently under development are an efficient index structure that
is especially well suited for systems with high update rates, and a
concurrency control component that takes advantage of the specific structure
of XML documents.
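For readers unfamiliar with what an XML update looks like, here is a small illustration using the standard Java DOM API (not our engine's interface): an in-place update that changes the text content of one element in a document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Plain DOM illustration of an in-place XML update: parse a document,
// change the text of one element, and read the updated value back.
public class XmlUpdateDemo {
    public static String updateStatus(String xml, String newStatus) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = b.parse(new InputSource(new StringReader(xml)));
            doc.getElementsByTagName("status").item(0).setTextContent(newStatus);
            return doc.getElementsByTagName("status").item(0).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<experiment><status>running</status></experiment>";
        System.out.println(updateStatus(xml, "done")); // prints done
    }
}
```

The engine's concurrency control component must ensure that such updates by one user do not conflict with concurrent reads and updates elsewhere in the document tree.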
In future work, the system has to be extended in several important ways.
- The system requires efficient buffer and storage management so that
users can share data and the I/O management of the system is optimized.
- The system has to be extended with a reasonable query execution engine
to provide a complete XML database management system.
Knowledge needed: a general understanding of database systems, in particular
buffer management, transactions, and query execution. Familiarity with processing
semi-structured data.
Some more information about our XML database management
system.
Peer-to-Peer Systems (Master Thesis)
Peer-to-peer computing is the sharing of computer resources and
services by direct exchange between systems. These resources and services
include the exchange of information, processing cycles, cache storage,
and disk storage for files. Peer-to-peer computing takes advantage
of existing desktop computing power and networking connectivity,
allowing economical clients to leverage their collective power to
benefit the entire enterprise. (http://www.peer-to-peerwg.org/whatis/index.html)
In true peer-to-peer systems, there is no centralized coordination, no
centralized database, and no centralized view of the system. Global behavior
is the result of local interactions. Each component in the system is highly
unreliable and might connect or disconnect at any time.
In this thesis, the student should look into data-management issues of
peer-to-peer systems. Often, data is loosely replicated and loosely coupled.
How can such data be kept up-to-date, cleaned, and organized? The basis for
this thesis is the reading course on this topic; building on that, the
student will develop, implement, and compare data maintenance strategies
based on relevant literature.
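As one example of a strategy the thesis could implement and compare, consider pairwise anti-entropy with per-item version numbers: two peers exchange their entries and each keeps the newer version of every item. The class names and the simple version scheme below are illustrative assumptions, not a fixed design:

```java
import java.util.*;

public class AntiEntropyDemo {
    static final class Versioned {
        final long version; final String value;
        Versioned(long version, String value) { this.version = version; this.value = value; }
    }

    static class Peer {
        final Map<String, Versioned> store = new HashMap<>();

        // Keep an incoming entry only if its version is newer than ours.
        void put(String key, String value, long version) {
            Versioned cur = store.get(key);
            if (cur == null || version > cur.version) store.put(key, new Versioned(version, value));
        }

        // Pairwise anti-entropy: exchange entries with another peer,
        // each side keeping the entry with the higher version number.
        void syncWith(Peer other) {
            for (Map.Entry<String, Versioned> e : other.store.entrySet())
                put(e.getKey(), e.getValue().value, e.getValue().version);
            for (Map.Entry<String, Versioned> e : store.entrySet())
                other.put(e.getKey(), e.getValue().value, e.getValue().version);
        }

        String read(String key) {
            Versioned v = store.get(key);
            return v == null ? null : v.value;
        }
    }

    public static void main(String[] args) {
        Peer a = new Peer(), b = new Peer();
        a.put("doc", "v1", 1);
        b.put("doc", "v2", 2);  // b holds a newer update
        a.syncWith(b);
        System.out.println(a.read("doc")); // both peers converge to v2
    }
}
```

Real peer-to-peer settings complicate this picture considerably: peers disconnect mid-exchange, simple counters do not detect concurrent updates, and choosing which peers to contact affects how fast updates spread, all of which are exactly the trade-offs to be studied.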
Knowledge needed: good general knowledge of data management in distributed
systems, data replication, and communication paradigms.
Web-Services (Master Thesis / Master Project)
To be announced soon.