With the tremendous growth in data science and machine learning, it is becoming increasingly clear that traditional relational database management systems (RDBMS) are lacking appropriate support for the programming paradigms required by such applications, whose developers prefer tools and packages that perform the computation outside the database system.
The most common current approach to develop machine learning and data science applications is to use one of the many statistical languages such as R , MATLAB, Octave , etc., or packages such as pandas , NumPy , theano , etc., meant to augment a general purpose language like Python with linear algebra support. Should the data to be used reside in an RDBMS, the first step in these programs is to retrieve the data from the RDBMS and store them in user space. From there, all computation is done at the user end. Alternatively, sometimes users manually export the required data from the database into files that are then fed into such applications. Needless to say, user systems do not posses massive amount of processing power unlike servers running an RDBMS, often forcing them to work with a smaller subset of data. Users might also choose smaller data sets as transfer costs and latencies to retrieve the data from the database system can be huge. This data subsetting can be counterproductive, as it has been pointed out that having a larger data set can reduce the complexity of the algorithm that needs to be built , while also providing better accuracy. Additionally, once the data is taken out of the RDBMS, all further data selection and filtering need to be performed within the statistical package. Especially in the feature engineering phase of a learning problem, where such relational operators are needed . Therefore, the statistical community has augmented their favorite systems with some relational functionality, such as the pandas DataFrame designed to work on top of NumPy for Python and the DataFrame objects in R. However such implementations are not as sophisticated as the capabilities of a conventional RDBMS when it comes to executing relational operations.
Attempts by database community to support linear algebra operations by supporting UDFs written in host languages such as Python, R, etc. as well as related attempts in supporting linear algebra by extending SQL has failed to make inroads among data scientists. This is primarily due to the fact that these approaches are more tedious in the usability aspect and therefore hamper the productivity of the user.The goal behind AIDA is to support in-database analytics without sacrificing usability.
Developed in Python, AIDA allows data scientists to use regular Python intepreters as clients to connect to the database and perform linear algebra using familiar syntax of NumPy/pandas and relational operations using Object Relational Mapping (ORM) style API support.
This, for example, allows users to write code such as below, mixing both relational and linear algebra operators on the objects.
AIDA's client libraries can cleverly shift this computation to AIDA's server , that resides in the RDBMS.
This has two advantages.AIDA facilitates this by an RMI mechanism to move the computation to the actual objects on the server side.
AIAD's server interally uses NumPy to do its linear algebra computations whereas it pushes down relational operations as SQL using Table UDFs into the RDBMS via a Database Adapter intended for that RDBMS. Therefore by building Database Adapters for different RDBMS, we can easily port AIDA to that RDBMS. The current implementation of AIDA works on MonetDB.
Depending on the nature of operations requested, AIDA can move data transparently between the RDBMS and NumPy using its TabularData abstraction. This ensures that the users do not have to worry about the data set's internal format. AIDA takes care of this nuance and does any transformations required. AIDA's server is also clever in some optimizations , in that it does lazy evaluations of relational operations (therefore reducing the overall computations required) and avoids unnecessary format transformations between NumPy and RDBMS.
AIDA also provides support for in-database visualization of data sets through matplotlib and plotly.
You can watch a short video demonstration of AIDA below.
Although AIDA has been mainly designed for the interactive exploration phase of data analysis, it contains some mechanisms to ship the execution of entire functions to the database server. Thus, batch processing code can be fully executed at the server without continuous interaction with the client. However, the execution of code related to machine learning tasks often performs better when executed on GPUs compared to CPUs. However, database servers typically run best on CPUs. In this milestone, we extend AIDA to ship certain functions (and the necessary data) to GPUs. The decision of whether a function should be executed within the CPU or the GPU will be left to the user who will have to consider the trade-off between increase in execution speed with the potential impact of data copying. The performance of executions of standard ML functions on the CPU and the GPUs of our server system will be evaluated.
Skills Required: Python programming, Linear Algebra, Basic ML Algorithms. Practical experience working on GPUs is a plus but not mandatory.AIDA has a virtual table concept [3] that allow us to run SQL queries over Python data sets residing in the RDBMS memory. This is absolutely needed if the SQL query refers to both Python data sets and tables residing within the RDBMS. However, if the query is only referring to Python data then it might be more beneficial to not perform the expensive conversion to match the row format of the RDBMS. While for columnar database such as MonetDB, the data formats are often similar to those used by the statistical libraries, this is not the case with row-based RDBMS such as PostgreSQL. Therefore, when implementing AIDA on a row-based RDBMS, we might want to hand over the data and the SQL to a column-based query execution engine instead to minimize the data conversion that is needed. For this milestone one has to determine how the row-based RDBMS (which hosts the relational data) and the column-based RDBMS (which would be used to process SQL that only refers to Python data) could co-exist live within a single shared memory space. In the current implementation, AIDA lives within the Python interpreter embedded in the host RDBMS. The feasibility of AIDA itself having an embedded column-based RDBMS when residing inside a row-based RDBMS such as PostgreSQL needs to be investigated. In fact, MonetDB has some embedded variants (MonetDB lite, DuckDB) that could possibly serve the purpose. However, it's a different system than the MonetDB we have used so far. Thus, for this milestone, we have to first analyze whether the embedded MonetDB would be a good candidate for our purposes. If yes, an integration with AIDA, including the development of the virtual table concept within this embedded MonetDB is needed. If not, the possibility exist to build our own basic SQL execution engine that would be able to at least execute simple SQL statements on Python data with reasonable efficiency.
Skills Required: C and Python programming, SQL, Database concepts.