What is a Data Science Platform?

Data science has progressed rapidly as both a discipline and a technology over the last decade. Best practices have emerged, and data science is becoming part of the operational core of leading businesses. What is needed now is the next step in product evolution: a data science platform that supports both expert and business users and provides a consistent, integrated solution for building, managing, and optimizing predictive models. Let’s look at how we got here and at some of the key functional requirements of a data science platform.

It was 1994 and I thought I was building a data science platform…

In 1994 I worked for a company called Thinking Machines, which had created what was, at the time, the world’s fastest supercomputer and the world’s largest parallel computer (65,536 processors, but that old iPhone 7 in your desk drawer is now about twice as powerful). We did a lot of research on Artificial Intelligence and were breaking ground on a new form of AI that was driven by learning from examples. We (and others) were starting to call it “Machine Learning”.

To prove that a parallel computer could do AI, we had several research projects that created tools for different machine learning algorithms: multiple hidden layer backpropagation neural networks, genetic algorithms, k-nearest neighbors (what we called memory-based reasoning), clustering, CART decision trees, and more mundane statistical tools like logistic regression.

I looked at all of these research projects at Thinking Machines and decided that if we tied them together we might have a nice product. We did just that and named it Darwin. It became the great-grandfather of the data mining software packages at Oracle when they acquired Thinking Machines in 1999.

How do you make disparate data science tools work together?

While building Darwin it became immediately clear that we were missing some important pieces. In addition to the model building tools, we also needed to provide data management tools to preprocess the data. So with a redundant file-based system (similar to Hadoop) and a rudimentary column-based, in-memory database I created a data processing tool that we unimaginatively called ‘Data Manager’ or DataMan for short. We were on our way to building a platform!

It was also clear that we needed some visualization and UI on the front end to show the results. So we co-opted another research project on data visualization and added it to the package. We built a business optimizer to optimize lift and calculate ROI. When we were done we had a rudimentary, command-line-driven data management, machine learning, visualization, and optimization data science … platform?
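For readers who have not met the terms, lift and ROI can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Darwin's actual optimizer: the function names, the sample scores, and the cost and revenue figures are all hypothetical. Lift here is taken to be the response rate among the top-scored fraction of customers divided by the response rate across everyone, and ROI is net return divided by campaign cost.

```python
from typing import Sequence


def lift_at(scores: Sequence[float], responded: Sequence[int], fraction: float) -> float:
    """Response rate in the top-scored `fraction` divided by the overall response rate."""
    if len(scores) != len(responded) or len(scores) == 0:
        raise ValueError("scores and responded must be non-empty and the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    overall_rate = sum(responded) / len(responded)
    if overall_rate == 0:
        raise ValueError("lift is undefined when nobody responded")
    # Rank customers from highest to lowest model score.
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(flag for _, flag in ranked[:top_n]) / top_n
    return top_rate / overall_rate


def campaign_roi(contacted: int, responders: int,
                 cost_per_contact: float, revenue_per_response: float) -> float:
    """Net return of a campaign divided by its cost."""
    cost = contacted * cost_per_contact
    if cost <= 0:
        raise ValueError("campaign cost must be positive")
    revenue = responders * revenue_per_response
    return (revenue - cost) / cost


# Hypothetical model scores and observed responses (1 = responded) for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(lift_at(scores, responded, fraction=0.2))
print(campaign_roi(contacted=2, responders=2,
                   cost_per_contact=1.0, revenue_per_response=5.0))
```

A lift above 1 means the model's top-ranked customers respond more often than a randomly chosen group of the same size would.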

We felt like we had a complete product but I’m not sure that it would be considered a data science platform by today’s measure.

Was Darwin a data science platform?

We were not alone in taking the rocket science coming out of research labs and turning it into a product. At that time Evangelos Simoudis and his team at Lockheed had created a general purpose data mining tool called Recon. He later founded the company Customer Analytics to focus on using machine learning for marketing. Yuchun Lee co-founded Unica and embedded machine learning into a complete product that used data science to optimize customer relationship management. There were many others. It was the mid-1990s.

Most of these products had excellent tools for data management, advanced analytics (machine learning, data mining, data science), visualization, and optimization. And they were integrated. At that time, though, many of the interfaces were command lines or fairly rudimentary GUIs. Little need was seen for, and little thought given to, model management, and there were few tools for collaboration aside from those afforded by sharing files or RDBMS views for feature creation.

Would we consider any of these to be a data science platform by today’s standards?  Probably not. They were missing some important pieces of functionality that are critical today for supporting the complete data science lifecycle.

How do the smart people define a data science platform?

If you read the work of other industry analyst firms or search the internet, you may be confused as to what a ‘platform’ is compared to other types of software solutions.

The machine learning and data science website KDnuggets defined a data science platform as:

“A cohesive software application that offers a mixture of building blocks used for creating many kinds of data science solutions…”

Gartner defines a data science and machine learning platform as:

“A cohesive software application that offers a mixture of basic building blocks essential both for creating many kinds of data science solutions and incorporating such solutions into business processes, surrounding infrastructure and products.”

On Twitter, folks discuss ‘platform’ with distinctions such as:

  • “A platform is not just an ambitious product.”
  • “A platform is antifragile.”
  • “A platform hands the users a bunch of Legos, then says, ‘We can’t wait to see what you build!’.”
  • “A product business is best when it is predictable, like a factory farm. A platform business is like a rainforest, teeming with surprises and spontaneously generating new solutions.”
  • “It was the platform (App Store) that made the iPhone successful.”
  • “Minecraft is a platform, Tetris is a product.”

From these definitions, one could conclude that a key aspect of a platform is that it is a Swiss army knife of sorts, with many uses. It should also allow emergent behavior to occur.

Pragmatically, a platform needs to be a seamlessly integrated solution that satisfies all requirements in the data science lifecycle. It needs to take the user from raw data access to an optimized business solution, and it needs to support users at all levels of expertise.

Here is a graphic that attempts to capture the key components and characteristics of a data science platform: 

How does a platform relate to a software tool, suite or application?

I’m still figuring this out myself so please chime in if you disagree. Here is a simple (certainly incomplete) taxonomy:

  • Software tool – Software that is delivered for specific uses to a well-defined set of users. It relies on other software upstream and downstream in its ecosystem to provide a complete solution.
  • Software suite – A collection of separate tools that are serviceably but not seamlessly integrated and are used for a general purpose by users of different skill levels. It has a somewhat inconsistent UI and usually grew out of combining independent tools that were not initially planned to be integrated (perhaps via acquisitions).
  • Software application – Software with a consistent UI, developed for a specific purpose, that takes a well-defined user from end to end. Example: a fraud prediction app for a bank that is used by experts to report to the US Department of the Treasury.
  • Software platform – General-purpose software with a consistent UI that takes both power users and novice users end to end through their tasks within the same environment.

Is there a need for a data science platform?

With the rise of the citizen data scientist, it is becoming more important to have data science solutions that support both elite and novice users. And as data science moves out of the lab and into mainstream business, it needs to be operationalized, with seamless and reliable execution. A data science platform can address both needs.

Proposed characteristics of a data science platform

So let’s get specific. What does a data science platform look like? I’ll argue that its important characteristics include the following:

  • Data connectivity
  • Database discovery
  • Data ingestion
  • Feature creation
  • Privacy safety
  • Model building
  • Model management
  • Parameter recommendation
  • Collaboration (data, features, models)
  • Visualization
  • Centralized administration (security, access, privacy)
  • Workflow
  • Optimization
  • Business value calculation
  • Ability to generate focused business applications
  • Scalable from small exploration to big data deployments
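To make a few of these characteristics concrete, here is a minimal Python sketch of how workflow, data ingestion, feature creation, model building, and business value calculation might be chained in one environment. Everything in it is an assumption made for illustration: the `Workflow` class, the stage functions, the four sample rows, the one-rule "model", and the dollar figures are hypothetical and do not describe any particular product.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


class Workflow:
    """Runs named stages in order, passing one shared context between them."""

    def __init__(self) -> None:
        self._stages: List[Tuple[str, Stage]] = []

    def add(self, name: str, stage: Stage) -> "Workflow":
        self._stages.append((name, stage))
        return self

    def run(self) -> Context:
        context: Context = {"log": []}
        for name, stage in self._stages:
            context = stage(context)
            context["log"].append(name)  # record which stages ran, in order
        return context


def ingest(ctx: Context) -> Context:
    # Hypothetical in-memory rows standing in for a real data connection.
    ctx["rows"] = [
        {"visits": 12, "spend": 300.0, "churned": 0},
        {"visits": 1, "spend": 20.0, "churned": 1},
        {"visits": 7, "spend": 150.0, "churned": 0},
        {"visits": 2, "spend": 35.0, "churned": 1},
    ]
    return ctx


def create_features(ctx: Context) -> Context:
    for row in ctx["rows"]:
        row["spend_per_visit"] = row["spend"] / row["visits"]
    return ctx


def build_model(ctx: Context) -> Context:
    # Toy one-rule "model": flag churn when spend per visit is below the mean.
    values = [row["spend_per_visit"] for row in ctx["rows"]]
    threshold = sum(values) / len(values)
    ctx["model"] = {"feature": "spend_per_visit", "threshold": threshold}
    ctx["predictions"] = [1 if value < threshold else 0 for value in values]
    return ctx


def business_value(ctx: Context) -> Context:
    # Assumed economics: 100 per churner correctly flagged, 10 per customer contacted.
    value_per_save, cost_per_contact = 100.0, 10.0
    flagged = sum(ctx["predictions"])
    correct = sum(
        1 for row, predicted in zip(ctx["rows"], ctx["predictions"])
        if predicted == 1 and row["churned"] == 1
    )
    ctx["business_value"] = correct * value_per_save - flagged * cost_per_contact
    return ctx


result = (
    Workflow()
    .add("ingest", ingest)
    .add("create_features", create_features)
    .add("build_model", build_model)
    .add("business_value", business_value)
    .run()
)
print(result["log"], result["predictions"], result["business_value"])
```

The point of the sketch is the shared context: each stage reads what earlier stages produced, which is the kind of integration that separate tools exchanging files do not provide on their own.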

What else should be on the list?

Four Years From Now

Are we currently delivering data science platforms? I think we are, although there are probably features and functions that still need to be added (we just don’t know what they are yet). This is a normal progression as access to the technology becomes democratized and brilliant software engineers and UX designers figure out the best ways to deliver it to their customers.

What will we see in four years? Probably a continuing cycle of breakthrough technologies that become software tools (maybe chatbots are next?), followed by maturation and standardization that result in a bifurcation into universal platforms and high-octane specialized applications.

What do you think? Have we moved into the age of data science platforms?  What key aspects of a platform am I missing from my description above?

Related Articles from Stephen J. Smith:

“Best Practices in Data Science: Ten Keys to Operational Success and Business Value” -  https://www.eckerson.com/articles/best-practices-in-data-science-ten-keys-to-operational-success-and-business-value

“Enterprise-Grade Data Science” - https://www.eckerson.com/articles/enterprise-grade-data-science

Stephen J. Smith

Stephen Smith is a well-respected expert in the fields of data science, predictive analytics and their application in the education, pharmaceutical, healthcare, telecom and finance...
