Machine Learning meets Databases

Data mining software packages usually bundle a broad set of data mining and machine learning algorithms, which makes them attractive: a user can analyze data with many kinds of algorithms in one easy-to-use environment. However, these packages are often built on main-memory data structures, which limits the amount of data they can handle. We have been exploring how to use database technology to scale to large data sets. We took Weka, a popular open-source machine learning software package, and added a relational storage manager as a backend tier. The extensions are transparent to the learning algorithms implemented in Weka because they are hidden behind Weka’s standard main-memory data structure interface. Machine learning researchers can therefore continue to implement new algorithms in Weka without needing to know how to access the database. In addition, some general mining tasks are pushed into the database system to speed up execution, and a special buffer mechanism further reduces the number of interactions with the database backend. The resulting system, WekaDB, can handle much larger data sets than the original Weka with reasonable performance.
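
The three ideas above (a relational backend hidden behind an in-memory style interface, a buffer that cuts down on database round trips, and pushing a general mining task into the database) can be illustrated with a small sketch. This is not the WekaDB implementation, which extends Weka's Java code; it is a minimal Python analogue using SQLite. The class `DBBackedInstances`, its `class_counts` method, the `build_demo` helper, the table layout, and the page-based LRU buffer are all hypothetical names and choices made for illustration.

```python
import sqlite3
from collections import OrderedDict


class DBBackedInstances:
    """Hypothetical list-like, read-only view of a table (not WekaDB's API)."""

    def __init__(self, conn, table, page_size=100, max_pages=4):
        self.conn = conn
        self.table = table            # assumed to be a trusted identifier
        self.page_size = page_size
        self.max_pages = max_pages
        self._pages = OrderedDict()   # page number -> rows, kept in LRU order
        self.db_reads = 0             # page fetches sent to the database
        # The data set is assumed not to change, so the size is read once.
        self._size = conn.execute(
            f"SELECT COUNT(*) FROM {table}"
        ).fetchone()[0]

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        # Algorithms index the data as if it were an in-memory list.
        if not 0 <= index < self._size:
            raise IndexError(index)
        page_no, offset = divmod(index, self.page_size)
        page = self._pages.get(page_no)
        if page is None:
            page = self._load_page(page_no)
        else:
            self._pages.move_to_end(page_no)   # mark as recently used
        return page[offset]

    def _load_page(self, page_no):
        # One query fetches a whole page of rows instead of a single row.
        rows = self.conn.execute(
            f"SELECT * FROM {self.table} ORDER BY rowid LIMIT ? OFFSET ?",
            (self.page_size, page_no * self.page_size),
        ).fetchall()
        self.db_reads += 1
        self._pages[page_no] = rows
        if len(self._pages) > self.max_pages:
            self._pages.popitem(last=False)    # evict least recently used
        return rows

    def class_counts(self, column):
        # A general mining task (class distribution) computed inside the
        # database with GROUP BY rather than by scanning rows in the client.
        query = f"SELECT {column}, COUNT(*) FROM {self.table} GROUP BY {column}"
        return dict(self.conn.execute(query).fetchall())


def build_demo(n=1000):
    """Create an in-memory SQLite table with n synthetic labelled rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instances (x REAL, label TEXT)")
    conn.executemany(
        "INSERT INTO instances VALUES (?, ?)",
        ((float(i), "pos" if i % 4 == 0 else "neg") for i in range(n)),
    )
    return DBBackedInstances(conn, "instances")


if __name__ == "__main__":
    data = build_demo()
    scanned = sum(1 for _ in data)             # sequential scan via indexing
    print(scanned, "rows read with", data.db_reads, "database fetches")
    print(data.class_counts("label"))
```

In this sketch a learning algorithm only sees `len(data)` and `data[i]`, so it does not need to know that a database sits underneath, and only `max_pages * page_size` rows are held in memory at any time. A sequential scan costs one query per page rather than one per row, which is the kind of saving a buffer between the algorithms and the backend is meant to provide.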