Four Signs You Need A Data Science Platform

My current research is on data science platforms, and one of the most common questions I get from end users is “How will I know when I need a data science platform?”

It’s a great question because, as an analytics leader building a data science competency at your company, you will almost certainly start off by first experimenting (the experimentation stage) and then transition to building a small team of top-notch data scientists (the artisanship stage) (see Figure 1). But at some point, you’ll need to make the transition to automating your data science program by investing in a platform so that you can scale and improve efficiencies. This is not a small decision and the timing can be important. Here are four telltale signs that it might be time to make that investment.

Figure 1. Invest in a data science platform between the artisanship and automation stages.

1. Overwhelmed by Operations  

“My expensive data scientists are spending 60% of their time on mundane operational tasks. This is madness.”

Conducting the ‘science’ of data science (aka building models) is the fun part, but too often your key team members spend their time on mundane tasks that could be automated or at least delegated. For your team to be most effective, you’ll need a strong platform to automate as many of the repetitive operational tasks as possible. Here are some of the signs that you and your team are overwhelmed by operational issues:

  • You don’t know how many models you have
  • You’re not sure how much you spent on AWS last month
  • Feature creation is taking a long time
  • There is a disconnect between data engineers and data scientists
  • Your data scientists are spending a lot of time babysitting their models after they are deployed

A good data science platform will automate model management and data processing. Figures 2 through 4 show how effort and focus can become overwhelmed by mundane tasks like data preparation and operations. With a good data science platform, your key data scientists can focus on the factors that are most important: building models and improving the business.

Figure 2. The ‘science’ of data science (model building) is just one part of the data science pipeline.

Figure 3. Unfortunately, without a strong platform, the efforts of your DS team are dominated by more mundane tasks like data preparation and operations. 

Figure 4. With a good data science platform, your data scientists can focus on the areas where they can have the most impact: building models and understanding how to improve business outcomes.


2. Lack of Collaboration  

“I can’t find that model that I built last year.” 

Many of the industry experts I have interviewed have, surprisingly (at least to me), called out collaboration as the most important ingredient in moving from the experimental or artisan stages of data science into an automated mode. We probably should not be surprised, though. Data science is a team sport, and leveraging and mentoring less experienced team members is as important as attracting and engaging the high-octane superstar data scientists. Here are some of the micro-signs that you need a data science platform to help manage and encourage better collaboration:

  • Your team went from 2 data scientists to 5 (it doesn’t take much of an increase)
  • Your team went from 0 citizen data scientists to 25
  • Jose told you that he can’t replicate Sally’s model results
  • You don’t trust the latest churn model as you’re not sure who built it and whether it used the right feature set
  • Your superstar data scientist seems isolated and isn’t interacting with teammates

A good data science platform will provide you with ample tools for your data scientists to share model provenance, origin information about the data used, and the features they created. You will be able to track the outcomes of models created by different team members and hold them accountable for results. Consequently, individual and company trust in the data science team’s output will increase.
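To make the idea of shared provenance concrete, here is a minimal sketch in Python of the kind of record a platform might keep for each model. It is an illustration under stated assumptions, not how any particular product works: the `ModelRegistry` class, its method names, and the sample churn data are all hypothetical. The point is simply that once the author, feature set, training data, and parameters are fingerprinted, questions like “who built this?” and “why can’t Jose replicate Sally’s results?” have a quick answer.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical provenance record: who built what, from which inputs."""
    name: str
    author: str
    features: tuple
    data_hash: str
    params_hash: str


class ModelRegistry:
    """Hypothetical in-memory registry; a real platform would persist this."""

    def __init__(self):
        self._records = {}

    def register(self, name, author, features, training_rows, params):
        record = ModelRecord(
            name=name,
            author=author,
            features=tuple(sorted(features)),  # feature order should not matter
            data_hash=fingerprint(training_rows),
            params_hash=fingerprint(params),
        )
        self._records[name] = record
        return record

    def count(self):
        """Answer 'how many models do we have?'"""
        return len(self._records)

    def who_built(self, name):
        return self._records[name].author

    def differences(self, name_a, name_b):
        """List which inputs differ between two registered models."""
        a, b = self._records[name_a], self._records[name_b]
        fields = ("features", "data_hash", "params_hash")
        return [f for f in fields if getattr(a, f) != getattr(b, f)]


# Made-up example data: Jose reruns Sally's churn model with a different setting.
rows = [
    {"tenure": 12, "plan": "basic", "churned": 0},
    {"tenure": 2, "plan": "pro", "churned": 1},
]
registry = ModelRegistry()
registry.register("churn_v1", "Sally", ["tenure", "plan"], rows, {"max_depth": 3})
registry.register("churn_v1_rerun", "Jose", ["plan", "tenure"], rows, {"max_depth": 4})

print(registry.count())
print(registry.who_built("churn_v1"))
print(registry.differences("churn_v1", "churn_v1_rerun"))
```

In this made-up case the two runs share the same data and feature set, so the only reported difference is the parameter hash, which points directly at why the results did not match.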

3. Low Business Impact 

“We just improved accuracy by 0.001%! ROI? Not sure… get back to you on that.”

Calling data science a ‘science’ may be a bit misleading. The point of data science, as used in your company, is not to discover or develop new academic breakthroughs. The point is to generate revenue, profits, and ROI. Oftentimes, as teams grow, they can become too focused on the science of cool machine learning algorithms and lose sight of the impact that they should be having on the business. A data science platform will embed business goals and deliverables into every aspect of your model production line.

Here are some signs that your team may be too focused on academic pursuits and getting distracted from their business goals:

  • Your data scientists are spending a lot of time learning new algorithms when tried-and-true algorithms are working well.
  • You are overwhelmed by the number of different algorithms, techniques and tools that your team is using and requesting access to.
  • Your data scientists are bickering over a 0.1% gain in accuracy but seem unaware of the business goals of the latest next-best-offer model.

A data science platform will help you manage the different algorithms and even automatically suggest the best tools for a particular business problem (sometimes by exhaustively trying each and every variant). A good platform will start off with interfaces and tools that enable business users to communicate the business goals that drive the rest of the data science process.
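As a rough illustration of trying every variant against a business goal, the Python sketch below scores a handful of candidate contact rules for a next-best-offer campaign on both accuracy and profit. Everything in it is assumed for the example: the validation data is made up, the revenue and cost figures are placeholders a business user would supply, and the candidate names are hypothetical. It is not a description of how any specific platform performs its search.

```python
# Made-up validation data: (model score, did the customer accept the offer?)
VALIDATION = [
    (0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.50, 0),
    (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0), (0.05, 0),
]

# Assumed business inputs; in practice these come from the business user.
REVENUE_PER_ACCEPTANCE = 50.0
COST_PER_CONTACT = 5.0


def make_threshold_rule(threshold):
    """Build a rule that contacts a customer when the score meets the threshold."""
    return lambda score: 1 if score >= threshold else 0


# Hypothetical candidate variants to try exhaustively.
CANDIDATES = {
    "contact_everyone": lambda score: 1,
    "score>=0.25": make_threshold_rule(0.25),
    "score>=0.55": make_threshold_rule(0.55),
    "score>=0.75": make_threshold_rule(0.75),
}


def evaluate(rule, rows):
    """Return accuracy and campaign profit for one candidate rule."""
    contacted = accepted = correct = 0
    for score, outcome in rows:
        prediction = rule(score)
        contacted += prediction
        accepted += prediction * outcome  # only contacted customers can accept
        correct += int(prediction == outcome)
    profit = accepted * REVENUE_PER_ACCEPTANCE - contacted * COST_PER_CONTACT
    return {"accuracy": correct / len(rows), "profit": profit}


# Exhaustive search: evaluate every candidate, then rank by the business metric.
results = {name: evaluate(rule, VALIDATION) for name, rule in CANDIDATES.items()}
best_by_profit = max(results, key=lambda name: results[name]["profit"])
best_accuracy = max(r["accuracy"] for r in results.values())

for name, metrics in results.items():
    print(f"{name:18s} accuracy={metrics['accuracy']:.2f} profit={metrics['profit']:.0f}")
print("Most profitable candidate:", best_by_profit)
```

With these made-up numbers, the most profitable rule is not the most accurate one, which is the situation the bickering-over-accuracy bullet above describes: ranking candidates by the business metric changes which variant wins.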

4. Not Scalable 

“My team is missing deliverables. We always used to be on time.” 

Though it is difficult to forecast exactly when you will outgrow your current best practices for data science, it is 100% predictable that you will do so at some point in the near future. This is because of the overall change in the business landscape, where more data and the advent of data science have made model-driven decision making a requirement, not an option.

The best companies act tactically, but think strategically; get that next important model into production, but always keep in mind the need to be moving towards a robust platform. Some of the micro-signs that you are outgrowing your experimental or artisan-based approach to data science include:

  • You’re managing your model delivery schedules on worksheets.
  • You hired two new data scientists this year but you produced fewer models.
  • You were delivering 5 models per year but now you’re required to produce 25.

A good data science platform will scale not only with increased data but also with the increased demands of more models and a bigger team. 

Four Years From Now

I’m convinced that we are seeing the resurgence of platforms in data science today because of the Cambrian-like explosion of open source data and machine learning software that has occurred over the last decade. As that technology moves into the mainstream, there is a shift in focus to operational efficiencies. This is the normal cycle of innovation and assimilation that occurs in many fields. The question is: What will be the next interesting thing to happen to data science after we get our act together with good platforms?

Will it be ever more powerful predictive and machine learning algorithms? Deeper learning? More random forests? I don’t think so. There will be incremental improvements in algorithms but probably not like we’ve seen over the last few years.

However, there will still be breakthrough innovation. But like all great innovation, it will come from where we least expect it. We may have hit a momentary plateau in big data processing and machine learning algorithms but there are still plenty of places for improvement. Here are some ideas of what might provide some excitement for data science and become part of our next generation of data science platforms:

  • Explainability – New techniques like LIME (Local Interpretable Model-Agnostic Explanations) are using data science to explain data science models that humans don’t fully understand.
  • Productionization – Even more automation of the mundane will occur in managing and deploying models. We’ll see data scientists freed up to do more science and become strategists rather than tacticians.
  • Automated Data Science – Increasingly people will turn machine learning techniques towards optimizing data science processes (visualize the Escher print of the hand drawing the hand…). In the future it may be possible for the data scientist to focus mostly on defining the constraints and goals of a business problem. Automated data science will leverage AI and machine learning to automatically pick the best algorithms, models, and parameters.
  • Privacy – Improved best practices will be encoded into processes and eventually stamped into products that protect individual privacy. This will be a hard problem to solve with BI and DS tools regularly looking to exploit any and all personal data.

Expert Insights. Many thanks to Mac Steele who provided expert insights on this topic. Mac is the Director of Product at Domino Data Lab.

Stephen J. Smith is the research leader for data science at the Eckerson Group. His unique perspective comes from his real-world experience in building the predictive analytics products Darwin, Discovery Server and Optas as well as writing the highly-rated business technology books “Data Warehousing, Data Mining and OLAP” and “Building Data Mining Applications for CRM” with McGraw-Hill. If you are an expert in applying data science to business he’d like to hear your ideas: [email protected]

Related Articles from Stephen J. Smith:

“What is a Data Science Platform?”

“Best Practices in Data Science: Ten Keys to Operational Success and Business Value”
