Machine Learning meets Databases
Data mining software packages typically bundle a wide range of data
mining and machine learning algorithms, which makes them attractive:
the same data can be analyzed with many kinds of algorithms in an
easy-to-use fashion. However, these packages are often based on
main-memory data structures, which limits the amount of data they can handle. We
have been exploring how to use database technology to scale to large
data sets. We took Weka, a popular open-source machine learning
software package, and added a relational storage manager as
a backend tier. The extensions are transparent to the learning
algorithms implemented in Weka, since they are hidden behind Weka’s
standard main-memory data structure interface. Thus, machine learning
researchers can continue to implement new algorithms in Weka without
needing to know how to access the database. Furthermore, some general
mining tasks are transferred to the database system to speed up
execution, and a special buffer mechanism further reduces the number of
interactions with the database backend. The resulting system, WekaDB,
can handle much larger data sets than the original Weka with reasonable
performance.
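The idea of hiding a database behind a main-memory style interface, with a buffer to cut down on round trips, can be illustrated with a minimal Java sketch. Everything in it is an assumption made for illustration and is not taken from WekaDB: the `Dataset` interface stands in for Weka's data structure interface, `FakeBackend` simulates a relational table in memory and counts round trips, and the page-based LRU buffer is only one plausible buffering policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BufferedDatasetSketch {

    /** Hypothetical stand-in for the in-memory dataset interface an algorithm sees. */
    interface Dataset {
        int numInstances();
        double[] instance(int index);
    }

    /** Hypothetical stand-in for a relational backend; it only counts round trips. */
    static class FakeBackend {
        private final List<double[]> rows = new ArrayList<>();
        int fetches = 0;

        FakeBackend(int n) {
            for (int i = 0; i < n; i++) {
                rows.add(new double[] {i, i * 2.0});
            }
        }

        int size() {
            return rows.size();
        }

        /** Simulates one round trip, e.g. a range query on a row-id column. */
        List<double[]> fetchRange(int from, int to) {
            fetches++;
            return new ArrayList<>(rows.subList(from, Math.min(to, rows.size())));
        }
    }

    /** Serves rows through the Dataset interface, fetching them page by page. */
    static class BufferedDataset implements Dataset {
        private final FakeBackend backend;
        private final int pageSize;
        private final Map<Integer, List<double[]>> pages;

        BufferedDataset(FakeBackend backend, int pageSize, final int maxPages) {
            this.backend = backend;
            this.pageSize = pageSize;
            // An access-ordered LinkedHashMap evicts the least recently used page.
            this.pages = new LinkedHashMap<Integer, List<double[]>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, List<double[]>> eldest) {
                    return size() > maxPages;
                }
            };
        }

        @Override
        public int numInstances() {
            return backend.size();
        }

        @Override
        public double[] instance(int index) {
            int pageNo = index / pageSize;
            List<double[]> page = pages.get(pageNo);
            if (page == null) {
                // Buffer miss: load the whole page in a single round trip.
                page = backend.fetchRange(pageNo * pageSize, (pageNo + 1) * pageSize);
                pages.put(pageNo, page);
            }
            return page.get(index % pageSize);
        }
    }

    /** An "algorithm" that only knows the Dataset interface. */
    static double sumFirstAttribute(Dataset data) {
        double sum = 0;
        for (int i = 0; i < data.numInstances(); i++) {
            sum += data.instance(i)[0];
        }
        return sum;
    }

    public static void main(String[] args) {
        FakeBackend backend = new FakeBackend(1000);
        Dataset data = new BufferedDataset(backend, 100, 2);
        System.out.println("sum = " + sumFirstAttribute(data));
        System.out.println("round trips = " + backend.fetches);
    }
}
```

In this sketch a sequential scan over 1000 rows with a page size of 100 costs ten simulated round trips rather than one per row, while the algorithm (`sumFirstAttribute`) only ever sees the `Dataset` interface.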
Collaborators:
- Prof. Doina Precup (McGill)
- Glen Newton (National Research Council, Ottawa)
Students:
- Yu Chen
- Xuesong Ma
- Chen Tang
- Beibei Zou
Related Paper:
Data mining using relational database management systems. B. Zou,
X. Ma, B. Kemme, G. Newton, D. Precup. Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD), Singapore, April 2006.