Data mining using relational database management systems
B. Zou, X. Ma, B. Kemme, G. Newton, D. Precup.
Abstract:
Software packages providing a whole set of data mining and machine
learning algorithms are attractive because they allow experimentation
with many kinds of algorithms in an easy setup. However, these packages
are often based on main-memory data structures, limiting the amount of
data they can handle. In this paper we use a relational database as
secondary storage in order to eliminate this limitation. Unlike
existing approaches, which often focus on optimizing a single algorithm
to work with a database backend, we propose a general approach, which
provides a database interface for several algorithms at once. We have
taken a popular machine learning software package, Weka, and added a
relational storage manager as back-tier to the system. The extension is
transparent to the algorithms implemented in Weka, since it is hidden
behind Weka’s standard main-memory data structure interface.
Furthermore, some general mining tasks are transferred into the database
system to speed up execution. We tested the extended system, referred to
as WekaDB, and our results show that it achieves a much higher
scalability than Weka, while providing the same output and maintaining
good computation time.
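
The idea of hiding a relational store behind a main-memory dataset interface can be illustrated with a minimal sketch. It is not Weka's API or the WekaDB implementation: the class and method names (`InMemoryDataset`, `DatabaseDataset`, `num_instances`, `instance`, `class_counts`), the table layout, and the use of SQLite as the database are all assumptions made for illustration. The sketch shows two things mentioned in the abstract: an algorithm that runs unchanged against either backend, and a general task (counting class labels) that is pushed into the database as a query.

```python
import sqlite3
from collections import Counter


class InMemoryDataset:
    """Hypothetical main-memory dataset: every row is held in a Python list."""

    def __init__(self, rows):
        self._rows = list(rows)

    def num_instances(self):
        return len(self._rows)

    def instance(self, index):
        return self._rows[index]

    def class_counts(self):
        # Counts labels by scanning all rows in memory.
        return dict(Counter(label for _, label in self._rows))


class DatabaseDataset:
    """Same interface, but rows live in a relational table (SQLite here)."""

    def __init__(self, conn, rows, buffer_size=2):
        self._conn = conn
        self._buffer_size = buffer_size
        self._buffer_start = 0
        self._buffer = []  # only a small window of rows is kept in memory
        conn.execute(
            "CREATE TABLE instances (id INTEGER PRIMARY KEY, x REAL, label TEXT)"
        )
        conn.executemany(
            "INSERT INTO instances (id, x, label) VALUES (?, ?, ?)",
            [(i, x, label) for i, (x, label) in enumerate(rows)],
        )
        conn.commit()

    def num_instances(self):
        return self._conn.execute("SELECT COUNT(*) FROM instances").fetchone()[0]

    def instance(self, index):
        if index < 0:
            raise IndexError(index)
        in_buffer = (
            self._buffer_start <= index < self._buffer_start + len(self._buffer)
        )
        if not in_buffer:
            # Refill the window starting at the requested row.
            cursor = self._conn.execute(
                "SELECT x, label FROM instances WHERE id >= ? ORDER BY id LIMIT ?",
                (index, self._buffer_size),
            )
            self._buffer = cursor.fetchall()
            self._buffer_start = index
        offset = index - self._buffer_start
        if offset >= len(self._buffer):
            raise IndexError(index)
        return self._buffer[offset]

    def class_counts(self):
        # The aggregation is executed by the database, not in Python.
        cursor = self._conn.execute(
            "SELECT label, COUNT(*) FROM instances GROUP BY label"
        )
        return dict(cursor.fetchall())


def majority_class(dataset):
    """Toy 'algorithm' that only uses the shared dataset interface."""
    counts = dataset.class_counts()
    return max(counts, key=counts.get)


def mean_of_first_attribute(dataset):
    """Second toy algorithm: iterates over instances one by one."""
    n = dataset.num_instances()
    return sum(dataset.instance(i)[0] for i in range(n)) / n
```

Because `majority_class` and `mean_of_first_attribute` depend only on the shared interface, they can be given either dataset object without modification; in the database-backed variant only `buffer_size` rows are held in memory at any time.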
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD),
Singapore, April 2006.