For Best Predictive Analytics Results: Don’t Move the Data

In the Days of Data Mining

Before the dawn of time, when predictive analytics was called “data mining” or “machine learning”, the data required to train and build predictive models was typically moved to a new data store that was friendly to the predictive analytics tool. A score or a model eventually popped out and was moved back to the production database.

These were ancient times (the 1990s and early 2000s), when the data was brought to the predictive analytics tools. Today, however, we have the opportunity to run our predictive analytics tools on the data without having to move it. It seems simple, but it is a breakthrough of sorts. Moving the predictive analytics tools to the data, rather than the data to the tools, has resulted in dramatic increases in speed, quality and usability.

Key Differentiators for Oracle Predictive Analytics

I recently spoke with some members of the senior team working on Machine Learning, Advanced and Predictive Analytics at Oracle. Their offering, Oracle Advanced Analytics, doesn’t require users to move their data to the predictive analytics tool. Instead, Oracle has extended its database to be both a data management platform and an advanced analytics/machine learning platform. Users can build predictive models on data while it is managed inside their Oracle Database, Data Warehouse, or Oracle Cloud databases and data warehouses.

There are some key differentiators that Oracle emphasizes for its predictive analytics solutions:

1. Don’t move the data. Oracle has taken the new predictive analytics mantra of ‘don’t move the data’ very seriously. As a juggernaut in the database world, it has both an advantage and an incentive to make this work efficiently and seamlessly. The advantages of not having to move the data are the elimination of load and transform times and a reduced risk of something going wrong in the translation between the source data and the predictive analytics tool.

To accomplish this, Oracle has invested in making it possible to perform machine learning natively as part of the database’s SQL functions. It exposes these SQL-based machine learning functions via a SQL API, via SQL Developer’s Oracle Data Miner “workflow” UI, and through open source R.

One advantage of the old move-the-data approach was that the data could be stored in a format ideal for predictive analytics. Oracle Advanced Analytics compensates by implementing its ML functions as fully parallelized in-database functions that inherit and share all the other Oracle Database attributes, e.g. security, auditing, ETL, pipelining, and the ability to “mine” nested tables and unstructured data types. Oracle also exposes these functions through the R language.
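As a rough illustration of the “score where the data lives” idea, here is a minimal sketch using SQLite as a stand-in. Oracle’s actual in-database scoring is invoked through SQL functions such as PREDICTION() against a trained model stored in the database; the table, the column names, and the trivial linear “model” below are invented for the example.

```python
import sqlite3

# Stand-in only: SQLite plays the role of the database here. The point is
# that the scoring expression is pushed into the SQL query, so no rows ever
# leave the data store -- analogous to Oracle's in-database
# SELECT id, PREDICTION(my_model USING *) FROM customers;

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, income REAL, age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 52000.0, 34), (2, 31000.0, 22), (3, 87000.0, 51)])

# A trivial linear "model": score = w1*income + w2*age + bias.
w_income, w_age, bias = 0.00001, 0.02, -0.5
rows = conn.execute(
    "SELECT id, ? * income + ? * age + ? AS score FROM customers",
    (w_income, w_age, bias),
).fetchall()

for cid, score in rows:
    print(cid, round(score, 3))
```

The contrast with the old workflow is that nothing is exported, transformed, or re-imported; the query ships the computation to the data.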

Instead of moving data to an R client, Oracle Advanced Analytics’ Oracle R Enterprise component overloads many standard R functions that operate on data frames, providing proxy R objects for database tables. In addition, embedded R execution enables users to execute their R scripts on the database server side, eliminating the need to move data to the client. This can save hours: you no longer need to download the data into your predictive analytics tool or client R engine, or upload models into some other execution database. It also means the data is always up to date, since you are hooked directly into the ‘single version of the truth’ that the data warehouse or data lake provides.
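The proxy-object idea can be sketched in a few lines. This is not Oracle R Enterprise’s API; the class and method names below (TableProxy, filter, mean) are invented for illustration, with SQLite standing in for the database. Operations on the proxy compose lazily and only turn into SQL, executed in the database, when a result is actually needed.

```python
import sqlite3

class TableProxy:
    """Toy proxy for a database table: operations become SQL, not data pulls."""

    def __init__(self, conn, table, where=None):
        self.conn, self.table, self.where = conn, table, where

    def filter(self, predicate):
        # Compose the predicate lazily -- no data moves yet.
        clause = f"({self.where}) AND ({predicate})" if self.where else predicate
        return TableProxy(self.conn, self.table, clause)

    def mean(self, column):
        # Only now is SQL generated and executed inside the database.
        sql = f"SELECT AVG({column}) FROM {self.table}"
        if self.where:
            sql += f" WHERE {self.where}"
        return self.conn.execute(sql).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 300.0), ("west", 50.0)])

sales = TableProxy(conn, "sales")
east_avg = sales.filter("region = 'east'").mean("amount")
print(east_avg)  # 200.0
```

The user writes what looks like ordinary client-side data-frame code, but only the one-row aggregate crosses the wire.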

Similarly, Oracle provides machine learning algorithms in its Big Data Appliance and Cloud offerings. Oracle R Advanced Analytics for Hadoop, like Oracle Advanced Analytics in the Database, provides a library of over 20 machine learning algorithms that “mine” data on the Hadoop data management platform while taking advantage of Spark.

Oracle’s Big Data SQL spans the Database and Hadoop worlds and allows users to access and leverage data in data reservoirs, filter and summarize it, and join it with data in the Database for further predictive modeling.
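A small stand-in can illustrate the idea of one SQL statement spanning two separate data stores. This uses SQLite’s ATTACH rather than Big Data SQL itself, and all table, column, and file names are invented for the example.

```python
import os
import sqlite3
import tempfile

# One store stands in for the "data reservoir" (the Hadoop side).
lake_path = os.path.join(tempfile.mkdtemp(), "lake.db")
lake = sqlite3.connect(lake_path)
lake.execute("CREATE TABLE clicks (user_id INTEGER, n INTEGER)")
lake.executemany("INSERT INTO clicks VALUES (?, ?)", [(1, 7), (2, 3)])
lake.commit()
lake.close()

# The "database" side, with the reservoir attached into the same session.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
db.execute("ATTACH DATABASE ? AS lake", (lake_path,))

# A single SQL statement joins data living in two different stores.
joined = db.execute(
    "SELECT u.name, c.n FROM users u JOIN lake.clicks c USING (user_id) "
    "ORDER BY u.user_id"
).fetchall()
print(joined)  # [('Ada', 7), ('Lin', 3)]
```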

2. Let everyone speak their preferred language. In the sci-fi classic “The Hitchhiker’s Guide to the Galaxy”, Douglas Adams invents the Babel fish which, when placed in your ear, allows you to understand anyone speaking a different language. It also means that everyone can comfortably speak their own preferred language without having to learn a new one. The world of predictive analytics has, up to this point, suffered from all of the different languages that data scientists speak. If there were a Babel fish for predictive analytics, it would save time and allow many more users to access the tools. Because of these benefits, many vendors have recently begun pursuing this strategy of “speak the language you prefer” and letting the PA tool do the translating.

Oracle likewise endeavors to let its users speak the language they prefer rather than asking them to learn some new universal language or some Oracle-specific language. It supports SQL natively and provides an intuitive “drag and drop” workflow UI (that looks a lot like SAS and SPSS) packaged within its popular SQL Developer IDE. If you prefer to speak R, the Oracle Advanced Analytics’ Oracle R Enterprise component and the Oracle R Advanced Analytics for Hadoop component enable R users to view, transform, and analyze database tables using standard R syntax through the transparency layer, which, as noted above, translates R functions into Oracle SQL or HiveQL as appropriate.

For machine learning, Oracle provides an R formula-style interface for its in-database and Spark-based machine learning algorithms, which makes these algorithms immediately accessible to R users. Oracle further enables R users to execute custom R scripts on the Oracle Database server using “embedded R execution”, or write custom mappers and reducers in R. These custom R scripts can also leverage open source R packages.

3. Make the important things run really fast. Oracle has done a nice job of embedding key predictive analytics functions directly into the Database SQL kernel so that they run with high performance. This means the software takes care of optimally mapping computations onto parallel processors.

If you want even more speed, Oracle’s Database In-Memory option lets you specify that the data reside in memory. Oracle’s tight integration with R enables overloaded R functions that operate on R proxy objects to generate the corresponding SQL for in-database execution, taking advantage of database parallelism, query optimization, indexes, and even partitioning. This means that key operations for data wrangling and preparation, as well as statistical analysis, run really fast.

With embedded R execution, R users can execute their scripts in a data-parallel or task-parallel manner, where Oracle Database manages the spawning of parallel R engines, the loading of the R function, and the loading of the data. Running R scripts on the server leverages more powerful database hardware (CPUs, RAM) as well as faster data transfer to database-managed R engines, thanks to higher inter-process communication rates. Users can specify the “degree of parallelism” they’d like to have.
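The data-parallel pattern described above can be sketched as follows. This is not Oracle’s mechanism; it is a plain Python thread-pool stand-in in which a user-supplied function runs once per data partition, with a caller-chosen degree of parallelism. The function and parameter names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk):
    # The user-supplied function, run once per data partition
    # (analogous to the user's R script in embedded R execution).
    return sum(value for _, value in chunk)

def parallel_apply(rows, func, degree_of_parallelism=2):
    # One partition per worker, then fan out and collect partial results.
    n = degree_of_parallelism
    partitions = [rows[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(func, partitions))

rows = [(i, i * 10) for i in range(8)]  # (id, value) pairs
partials = parallel_apply(rows, summarize, degree_of_parallelism=4)
print(sum(partials))  # 280
```

The key design point is that the engine, not the analyst, handles partitioning the data and scheduling the workers; the analyst only supplies the function and the degree of parallelism.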

4. Be Ready for Production Deployment. Just as John Donne proposed that “no man is an island”, no model should be one either. Models should play nicely with others; specifically, they should be easy to integrate with your production processes. For R users, putting an R-based solution into production can be challenging. Oracle has made this simple by enabling user-defined R functions to be invoked directly from SQL, leveraging the embedded R execution functionality. Structured results in a data frame can be returned as a database table. Images, plots, and graphics generated from R can be returned, one per row, in a database table whose BLOB column contains the PNG images. Both images and R data can be returned as an XML string for applications that consume XML.
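The “call your model from plain SQL” pattern has a miniature analogue in SQLite’s create_function, shown below as a hedged stand-in (this is not Oracle’s embedded R execution API; the scoring function and table are invented for the example). A function written in the host language becomes callable from any SQL statement, so existing SQL-based applications can consume model output without changing how they query.

```python
import sqlite3

def score(income, age):
    # Stand-in for a user-defined scoring function (the role R plays
    # in Oracle's SQL-invoked embedded R execution).
    return round(0.00001 * income + 0.02 * age, 3)

conn = sqlite3.connect(":memory:")
# Register the host-language function under a SQL-callable name.
conn.create_function("score", 2, score)

conn.execute("CREATE TABLE customers (id INTEGER, income REAL, age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 52000.0, 34), (2, 31000.0, 22)])

# Production apps keep issuing plain SQL; the custom logic runs in-process.
rows = conn.execute("SELECT id, score(income, age) FROM customers").fetchall()
print(rows)  # [(1, 1.2), (2, 0.75)]
```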

Four Years from Now

Oracle has a high-level and strategic view of the future that is very much colored by its legacy as a database company. This is a good thing. In general, its view is pragmatic and focused on the real-world problems that are often infrastructure related. Building solid infrastructure is not as exciting as predicting the future, but it is in many ways more important to get it 100% correct. Here are some thoughts on the future, based on these insights about productionized predictive analytics. Four years from now…

  1. The semantic layer will matter most. In the future, users will have a seamless view of all data sources. With this easy and transparent access comes some risk, as business users and citizen data scientists may not be as aware of data semantics and lineage as today’s predictive analytics professionals (as Spider-Man’s uncle warned: “with great power comes great responsibility…”). For these reasons there will also be a greater need for high-quality metadata, and for metadata governance that describes data lineage and semantics.
  2. Collaboration will result in multiplicative improvements in ROI. The future ‘big’ opportunity is not just predictive analytics “tools” but predictive analytics data management and predictive analytics platforms. When the barriers between these disparate worlds are broken down and data scientists can work as peers with business users, a new wave of “predictive enterprise applications” will be enabled.
  3. The database will handle the optimization. Predictive analytics model building and model execution will appear to take place seamlessly within one unified view of the database and the underlying database software will take care of the mapping of the data to ensure optimal performance. The predictive analytics user won’t have to move their data and the database will make sure that access is highly optimized.
  4. The cloud will continue to grow in importance.  This seems simple but let me emphasize: The cloud will continue to grow in importance…

Expert Insights

Many thanks to Charlie Berger, Mark Hornick, and Marcos Arancibia who provided expert insights on this topic. They work in the machine learning, advanced and predictive analytics Product Management group at Oracle. www.oracle.com


Stephen J. Smith

Stephen Smith is a well-respected expert in the fields of data science, predictive analytics and their application in the education, pharmaceutical, healthcare, telecom and finance...

