Enterprise-Grade Data Science
We’re Asking the Wrong Question
When we were building the Darwin data mining platform for MPP supercomputers back in the 1990s, we had one big question: “Which is the best machine learning algorithm?” We even ran a very serious bake-off between neural networks, genetic algorithms, k-nearest neighbors, clustering, logistic regression and decision trees (CART), using real-world financial and marketing data.
We found out that one algorithm was better at prediction than all the others … but by only a few percent … and just for one particular problem.
I don’t remember which algorithm it was.
The fact is that all the machine learning algorithms were pretty good and somewhat similar: within a few percentage points of each other in accuracy. Picking the best one was really not a lot better than just picking a good one. We were surprised.
It turns out that we were asking the wrong question. Asking “What is the best machine learning algorithm?” is fine, but it focuses on just 10% of the problem. The bigger challenges, the ones that had more impact when they were overcome, revolved around data preparation and model management. We should also have been asking: “How do I best prepare my data?” and “How do I manage my models and scoring after I’ve built them?”
In the real world, these were the questions that made a difference in the useful application of machine learning. Let me repeat: data preparation and model governance have more impact on return on investment for a business than picking the super best algorithm.
This model management ‘plumbing’ isn’t glamorous and it isn’t as exciting to talk about as how a neural network works or how genetic algorithms behave like Darwinian evolution. But getting the ‘plumbing’ right is critical to making predictive models useful. And actually, plumbing can get exciting when it doesn’t work (e.g. if your septic system is backing up or your predictive model hasn’t been correctly tested).
Focusing on Providing an Enterprise-Grade Solution
I don’t typically like to write a blog post incorporating a vendor’s tagline, but I really liked this line from Alpine Data: “The Enterprise-Grade Data Science Platform”. They are an example of an advanced analytics platform focused on making the plumbing of predictive analytics work in an operational way, day in and day out. Their focus is on making data science safe, consistent and repeatable for the enterprise, and they are not alone. The bigger vendors like SAS, Oracle, IBM and SAP also have this goal. Others, like Alpine, are building enterprise platforms for predictive modeling, among them FICO (Fair Isaac), RapidMiner and DataRobot. I’ll stop there as the list is long.
The goal is to make data science as ubiquitous in the enterprise as business intelligence has already become. This is a lofty goal as some may argue that even BI still has a long way to go.
Features of Enterprise-Grade DS
Data science platforms that seek to become ubiquitous within the enterprise will need some features that are uncommon, nascent or often missing from existing offerings. This could be a long list, but some of these features are:
- Data source and data structure agnostic
- Abstraction that removes the need to understand storage and processing resource allocation
- Strong and performant random sampling tools for model creation
- Auto detection of test and train group contamination
- An ETL layer that specifically helps to create useful input variables for models
- Strong curation and governance of models
- Model scoring interoperability through standards like PMML and PFA
- Alerts if an independent variable appears to be a surrogate for the dependent variable
- Ability to collaborate with other model builders
- Alerts for model aging
- Alerts when the underlying data has changed from when the model was created
- Predictive models that are automatically updated as the data changes
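To make a few of these alerts concrete, here is a minimal sketch in Python of three of them: train and test group contamination, an independent variable that looks like a surrogate for the dependent variable, and underlying data that has shifted since the model was built. This is an illustration under stated assumptions, not how any particular platform does it. The function names, the `customer_id` key, the 0.95 correlation threshold and the "three standard deviations" drift rule are all hypothetical choices made for the example.

```python
from math import sqrt


def contaminated_keys(train_rows, test_rows, key):
    """Return the key values that appear in both the train and test groups."""
    train_keys = {row[key] for row in train_rows}
    test_keys = {row[key] for row in test_rows}
    return train_keys & test_keys


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def surrogate_alerts(inputs, target, threshold=0.95):
    """Name the input variables that track the target suspiciously closely.

    The threshold is an assumed value for illustration; a near-perfect
    correlation often means the variable was recorded after the outcome.
    """
    return [
        name
        for name, values in inputs.items()
        if abs(pearson(values, target)) >= threshold
    ]


def mean_shift(baseline, current):
    """Distance between the current and baseline means, in baseline standard deviations."""
    n = len(baseline)
    mean_b = sum(baseline) / n
    std_b = sqrt(sum((v - mean_b) ** 2 for v in baseline) / n)
    mean_c = sum(current) / len(current)
    if std_b == 0:
        return 0.0 if mean_c == mean_b else float("inf")
    return abs(mean_c - mean_b) / std_b


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    train = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
    test = [{"customer_id": 3}, {"customer_id": 4}]
    print("Contaminated keys:", contaminated_keys(train, test, "customer_id"))

    churned = [0, 1, 0, 1, 1, 0]
    inputs = {
        "age": [25, 40, 31, 58, 46, 37],
        "refund_issued": [0, 1, 0, 1, 1, 0],  # mirrors the target exactly
    }
    print("Possible surrogates:", surrogate_alerts(inputs, churned))

    # Assumed rule of thumb: raise an alert when the mean moves more than
    # three baseline standard deviations.
    shift = mean_shift([10, 12, 11, 13, 9], [20, 22, 21])
    print("Data drift alert:", shift > 3)
```

None of this is sophisticated, and that is the point: these are plumbing checks. A real platform would run them automatically against every model and data source, and would use more robust statistics than a simple correlation or mean comparison.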
Four Years from Now
These enterprise-grade data science platforms are now available, and they are being refined and improved with great velocity. This is happening today for two reasons: the platforms are being pushed from behind by the many great data resources becoming available through big data, and pulled forward from the front as businesses race to deploy predictive modeling and machine learning throughout the enterprise.
This speed and focus are unprecedented in the history of data science and predictive analytics. They represent a cultural change, as the power of predictive analytics is shared with, and demanded by, all parts of the enterprise. In four years, expect enterprise-grade data science platforms to become as much a part of every business as BI is today. In the not too distant future they will be as ubiquitous as financial and ERP software.
Expert Insights: Many thanks to Steven Hillion and Josh Lewis, who provided expert insights on this topic. Steven is the Chief Product Officer and Josh is the Vice President of Product at Alpine Data Labs. www.alpinedata.com