Topic Table
400-Project:
Thesis / Projects
A graphical representation tool for
scientific experimental workflows
Bioinformatics research and other scientific research is driven by experiments.
With the introduction of new technology, more and more experiments
are automated, leading to increasingly complex experimental workflows. The
outcomes of experiments and the relationships between individual experiments
are often stored in relational database systems. A workflow starts with one
or more experiments. The output of these experiments is used to perform the
next steps in the experimental suite. An experimental workflow can be described
as a graph such that each node in the graph is an experiment, and an edge
from experiment A to experiment B indicates that A was executed directly
before B and its output served as input for B.
The task of this 400 project is to build a graphical user interface representing
existing experimental workflows.
The system retrieves information about past and current workflows from an
existing experimental database.
It then presents the information to the user in the form of nested graphs.
The user can view an entire workflow, or a sub-workflow starting or ending
with a specific experiment.
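As an illustration of the graph model described above, a sub-workflow starting with a given experiment is simply the set of experiments reachable from it. The following is a minimal sketch; the experiment names are hypothetical, and in the real tool the edges would be read from the experimental database (e.g., via JDBC) rather than added by hand:

```java
import java.util.*;

// Minimal sketch of the workflow-as-graph model: nodes are experiments,
// an edge A -> B means A ran directly before B and fed its output to B.
public class WorkflowGraph {
    private final Map<String, List<String>> successors = new HashMap<>();

    public void addEdge(String from, String to) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // All experiments reachable from 'start': the sub-workflow
    // that starts with this experiment.
    public Set<String> subWorkflowFrom(String start) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (visited.add(node)) {
                for (String next : successors.getOrDefault(node, List.of())) {
                    stack.push(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        WorkflowGraph g = new WorkflowGraph();
        g.addEdge("prep", "massSpec");          // hypothetical experiments
        g.addEdge("massSpec", "peptideSearch");
        g.addEdge("prep", "gelRun");
        System.out.println(g.subWorkflowFrom("massSpec"));
        // -> [massSpec, peptideSearch]
    }
}
```

A sub-workflow *ending* with an experiment would be computed the same way over the reversed edges.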
Knowledge needed (prerequisite or to be acquired): basic knowledge of
SQL and relational database access, Java, and graphical interface development
(e.g., Swing).
The project is part of the Proteomics
Information Management project, which aims to build a fast-development
tool for bioinformatics experimental database systems.
Database Replication (several Master Theses
/ PhD Theses)
Data replication is attractive as a means to increase system throughput
and provide fault tolerance. We are currently working on replication strategies
based on group communication systems, looking both into replication within
the database kernel and middleware-based data replication.
Postgres-R is an extension to the public domain database system PostgreSQL
providing efficient and consistent data replication. It by far outperforms
other replication strategies with similar consistency guarantees. Our first
prototype version of Postgres-R is based on a rather old version of PostgreSQL.
In our current work, we are migrating Postgres-R to the newest version of
PostgreSQL. The current version of Postgres-R is based on a master/slave
approach: only one replica is allowed to execute updates, sending them to
the other sites at commit time; the secondary sites apply these changes.
Apart from this, secondary sites are allowed to execute read-only queries.
The latest work has focused on fault tolerance: the system continues to work
despite failures, and failed and new sites can (hopefully soon) join the
system.
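The master/slave scheme can be sketched in a few lines. This is a toy, single-process illustration with assumed class names; Postgres-R itself uses a group communication system to deliver writesets to remote sites:

```java
import java.util.*;

// Toy sketch of primary-copy ("master/slave") replication: only the
// primary executes updates; at commit it ships the resulting writeset
// to the secondaries, which apply it. Reads may go to any replica.
public class PrimaryCopy {
    static class Replica {
        final Map<String, String> data = new HashMap<>();
        void applyWriteset(Map<String, String> writeset) { data.putAll(writeset); }
        String read(String key) { return data.get(key); } // read-only queries allowed
    }

    static class Primary extends Replica {
        final List<Replica> secondaries = new ArrayList<>();
        final Map<String, String> pendingWriteset = new LinkedHashMap<>();

        // Updates are executed only at the primary and collected in a writeset.
        void update(String key, String value) { pendingWriteset.put(key, value); }

        // At commit time, apply locally and propagate the writeset to all sites.
        void commit() {
            applyWriteset(pendingWriteset);
            for (Replica s : secondaries) s.applyWriteset(pendingWriteset);
            pendingWriteset.clear();
        }
    }

    public static void main(String[] args) {
        Primary p = new Primary();
        Replica s1 = new Replica();
        p.secondaries.add(s1);
        p.update("account:42", "balance=100");
        p.commit();
        System.out.println(s1.read("account:42")); // secondary sees the committed update
    }
}
```

In an update-everywhere approach, by contrast, any replica may build and broadcast such a writeset, which is why the concurrency control component must then arbitrate between conflicting writesets.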
There is still a lot of exciting work left to be done. One challenge is
to transform the system to an update-everywhere approach, where every site
is allowed to perform updates. Feedback from industry on our first
prototype has shown that this is what industry wants and needs. An
update-everywhere approach requires adjusting the concurrency control
component of PostgreSQL. Since PostgreSQL uses the same concurrency control
methods as Oracle, being able to integrate the approach into PostgreSQL
will prove the concept on a large scale.
This work offers a lot of opportunities. Students have a chance to work on
and extend one of the most widely used public domain database systems. They
will look into the very details of how a complex database system works and
operates. Additionally, they have to work on a highly algorithmic level,
designing and redesigning abstract distributed algorithms, and implementing
them in a real system. Their work will build on the successful work of five
previous graduate students.
Middleware-based replication is more restricted than replication
within the database system, but it might be the only solution in heterogeneous
environments or if access to the internals of the database systems is restricted.
So far, we have developed a middleware-based replication tool that is able
to work with PostgreSQL. We would like to continue our research in several
directions. Our current prototype is a rather specialized module running
as a stand-alone system. We would like to integrate it as a special service
or component into a distributed computing environment, e.g., CORBA or even
J2EE. Furthermore, it has to be extended to work in a heterogeneous environment,
supporting not only PostgreSQL but also other database systems, like DB2.
Since replication has to be very efficient, it is unlikely that a single,
standard way for the middleware to interact with each of the different
databases will suffice. Instead, in order to optimize as much as possible,
wrapper modules have to be developed for the individual database systems,
taking advantage as much as possible of the internal options of the
underlying database system.
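The wrapper idea can be sketched as a single interface that the middleware programs against, with one implementation per database system. All names here are illustrative assumptions, not part of our prototype:

```java
import java.util.*;

public class WrapperDemo {
    // The middleware sees every database through this one interface;
    // each implementation is free to exploit system-specific features.
    interface ReplicaWrapper {
        void applyWriteset(Map<String, String> writeset); // changes shipped by the middleware
        Map<String, String> snapshot();                   // for illustration/testing only
    }

    // An in-memory stand-in. A real PostgresWrapper or Db2Wrapper
    // (hypothetical names) would translate writesets into SQL over JDBC,
    // using whatever capture/apply mechanism that system offers.
    static class InMemoryWrapper implements ReplicaWrapper {
        private final Map<String, String> rows = new HashMap<>();
        public void applyWriteset(Map<String, String> ws) { rows.putAll(ws); }
        public Map<String, String> snapshot() { return Map.copyOf(rows); }
    }

    public static void main(String[] args) {
        ReplicaWrapper w = new InMemoryWrapper();
        w.applyWriteset(Map.of("row:1", "v1"));
        System.out.println(w.snapshot().get("row:1")); // prints v1
    }
}
```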
The work offers similar opportunities as Postgres-R: working with real systems,
and transferring new ideas into real implementations.
Knowledge needed: solid knowledge of database systems, in particular their
internals (transactions, concurrency control, query execution); knowledge
of networks, communication primitives, and distributed computing; Unix
knowledge (processes, shared memory, inter-process communication, etc.).
More information about our database replication
project.
Exp-DB: a database tool for fast prototyping of
experimental database systems (Master Thesis / Master Project)
Exp-DB is a support tool helping in the design and development of experimental
databases. Exp-DB helps research groups performing bioinformatics experiments
to set up an initial web-based information system in reasonably short time
and without needing to be an expert in database design or advanced programming.
The architecture is based on well-established software-engineering principles.
A first prototype of Exp-DB already exists. Currently, we are adding an access
control component to the system.
Exp-DB needs extensions in several directions.
- It needs an advanced query module that not only allows for
flexible query execution over internal information but is also able to integrate
external sources for heterogeneous query execution.
- It needs support for inserting information into experimental tables.
Many experiments are now automated and supported by specialized software
tools. These software tools contain most of the relevant data regarding the
experiments. Currently, this information has to be entered by hand. What is
needed is a general programming interface (API) to allow the import of
experimental data. Given such a general API, specialized wrappers can be
written that translate the specific formats in which the software tools
provide data into the language understood by the interface.
- Workflow support: currently, the database system is passive in
the sense that users have to initiate actions to insert data or query inserted
data. An active system would be able to initiate new steps automatically and/or
inform users when new tasks have been completed or need to be done.
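To make the import API idea from the second point concrete, here is a minimal sketch. It assumes a hypothetical tab-separated instrument format; all names (the record type, the interface, the format) are illustrative, not a fixed design:

```java
import java.util.*;

public class ImportDemo {
    // Generic representation of one experiment's data, as the
    // database layer would understand it.
    record ExperimentRecord(String experimentId, Map<String, String> attributes) {}

    // What tool-specific wrappers implement. A real implementation would
    // insert the records into the experimental tables rather than return them.
    interface ExperimentImporter {
        List<ExperimentRecord> parse(String rawToolOutput);
    }

    // Example wrapper for a hypothetical instrument format:
    //   id<TAB>key=value;key=value
    static class TabSeparatedImporter implements ExperimentImporter {
        public List<ExperimentRecord> parse(String raw) {
            List<ExperimentRecord> out = new ArrayList<>();
            for (String line : raw.split("\n")) {
                String[] parts = line.split("\t", 2);
                Map<String, String> attrs = new LinkedHashMap<>();
                if (parts.length > 1 && !parts[1].isEmpty()) {
                    for (String pair : parts[1].split(";")) {
                        String[] kv = pair.split("=", 2);
                        attrs.put(kv[0], kv.length > 1 ? kv[1] : "");
                    }
                }
                out.add(new ExperimentRecord(parts[0], attrs));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        var records = new TabSeparatedImporter().parse("exp7\tmass=1234.5;charge=2");
        System.out.println(records.get(0).experimentId()); // prints exp7
    }
}
```

Each instrument's software would get its own `ExperimentImporter` wrapper, while the database layer only ever deals with `ExperimentRecord`s.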
Knowledge needed: a good understanding of database systems and information
system development; knowledge of the needs of bioinformatics researchers.
The programming environment is Java, JSP, and application servers.
Exp-DB is developed as part of the Proteomics Information
Management project, which aims to build a fast-development tool for
bioinformatics experimental database systems.
XML database systems (Master Thesis)
So far, a lot of research has been performed on efficient querying of XML
database systems -- XQuery being one of the best-known query languages for XML.
A wide range of standards exist (XPath, DOM, etc.), showing
the widespread use of XML in industry.
In our XML project, we are currently developing a client/server based XML
engine with a focus on updating XML data. The current engine is
able to perform a wide range of XML updates in single-user mode. The extensions
currently under development are an efficient index structure that
is especially well suited for systems with high update rates, and a
concurrency control component that takes advantage of the specific structure
of XML documents.
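For readers unfamiliar with what an XML update looks like, here is a small illustration using the standard Java DOM API (not our engine's interface): an in-place update that changes the text content of one element in a document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Plain DOM illustration of an in-place XML update: parse a document,
// change the text of one element, and read the updated value back.
public class XmlUpdateDemo {
    public static String updateStatus(String xml, String newStatus) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = b.parse(new InputSource(new StringReader(xml)));
            doc.getElementsByTagName("status").item(0).setTextContent(newStatus);
            return doc.getElementsByTagName("status").item(0).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<experiment><status>running</status></experiment>";
        System.out.println(updateStatus(xml, "done")); // prints done
    }
}
```

The engine's concurrency control component must ensure that such updates by one user do not conflict with concurrent reads and updates elsewhere in the document tree.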
In future work, the system has to be extended in several important ways.
- The system requires efficient buffer and storage management so that
users can share data and the I/O management of the system is optimized.
- The system has to be extended with a reasonable query execution engine
to provide a complete XML database management system.
Knowledge needed: a general understanding of database systems, in particular
buffer management, transactions, and query execution. Familiarity with processing
semi-structured data.
Some more information about our XML database management
system.
Peer-to-Peer Systems (Master Thesis)
Peer-to-peer computing is the sharing of computer resources and
services by direct exchange between systems. These resources and services
include the exchange of information, processing cycles, cache storage,
and disk storage for files. Peer-to-peer computing takes advantage
of existing desktop computing power and networking connectivity,
allowing economical clients to leverage their collective power to
benefit the entire enterprise. (http://www.peer-to-peerwg.org/whatis/index.html)
In true peer-to-peer systems, there is no centralized coordination, no
centralized database, and no centralized view of the system. Global behavior
is the result of local interactions. Each component in the system is highly
unreliable and might connect or disconnect at any time.
In this thesis, the student should look into data-management issues of
peer-to-peer systems. Often, data is loosely replicated and loosely coupled.
How can such data be kept up-to-date, cleaned, and organized? The basis for
this thesis is the reading course on this topic; building on that, the
student will develop, implement, and compare data maintenance strategies
based on relevant literature.
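As one example of a strategy the thesis could implement and compare, consider pairwise anti-entropy with per-item version numbers: two peers exchange their entries and each keeps the newer version of every item. The class names and the simple version scheme below are illustrative assumptions, not a fixed design:

```java
import java.util.*;

public class AntiEntropyDemo {
    static final class Versioned {
        final long version; final String value;
        Versioned(long version, String value) { this.version = version; this.value = value; }
    }

    static class Peer {
        final Map<String, Versioned> store = new HashMap<>();

        // Keep an incoming entry only if its version is newer than ours.
        void put(String key, String value, long version) {
            Versioned cur = store.get(key);
            if (cur == null || version > cur.version) store.put(key, new Versioned(version, value));
        }

        // Pairwise anti-entropy: exchange entries with another peer,
        // each side keeping the entry with the higher version number.
        void syncWith(Peer other) {
            for (Map.Entry<String, Versioned> e : other.store.entrySet())
                put(e.getKey(), e.getValue().value, e.getValue().version);
            for (Map.Entry<String, Versioned> e : store.entrySet())
                other.put(e.getKey(), e.getValue().value, e.getValue().version);
        }

        String read(String key) {
            Versioned v = store.get(key);
            return v == null ? null : v.value;
        }
    }

    public static void main(String[] args) {
        Peer a = new Peer(), b = new Peer();
        a.put("doc", "v1", 1);
        b.put("doc", "v2", 2);  // b holds a newer update
        a.syncWith(b);
        System.out.println(a.read("doc")); // both peers converge to v2
    }
}
```

Real peer-to-peer settings complicate this picture considerably: peers disconnect mid-exchange, simple counters do not detect concurrent updates, and choosing which peers to contact affects how fast updates spread, all of which are exactly the trade-offs to be studied.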
Knowledge needed: good general knowledge of data management in distributed
systems, data replication, and communication paradigms.
Web-Services (Master Thesis / Master Project)
To be announced soon.