Stephen Smith: Operationalizing Data Science

Stephen Smith is a well-respected expert in the fields of data science, predictive analytics, and their application in the education, pharmaceutical, healthcare, telecom, and finance industries. He co-founded and served as CEO of G7 Research LLC and the Optas Corporation, which provided the leading CRM / Marketing Automation solution in the pharmaceutical and healthcare industries.

Smith has published journal articles in the fields of data mining, machine learning, parallel supercomputing, text understanding, and simulated evolution. He has published two books through McGraw-Hill on big data and analytics and holds several patents in the fields of educational technology, big data analytics, and machine learning. He holds a BS in Electrical Engineering from MIT and an MS in Applied Sciences from Harvard University. He is currently the research director of data science at Eckerson Group.

Key Findings:

  • Operationalizing data science is about making it repeatable, scalable, and flawless
  • Data preprocessing and communication are keys to operationalizing data science
  • Data science has equal potential to produce benefits and cause disasters
  • Data science vendors are incorporating data lineage features and data libraries into their offerings 
  • Building an agile team is essential
  • In the past, data science has failed in the same way Napoleon did in Russia

This is an excerpt from the podcast conversation between Henry H. Eckerson and Stephen Smith.

Henry: I think it’s important to define our terms first. What is data science and how does it compare to big data, BI, AI, and machine learning?

Stephen: Let’s start off in the 1990s. There was something called data mining, which included predictive analytics, some AI techniques, and statistical and machine learning techniques. My understanding of data science is it’s fundamentally capturing the incorporation of predictive analytics or data mining into the data ecosystem and at the same time trying to make a science of it, which makes good sense. They did the same thing with computer science back in the 1950s when they really didn’t have a name for people working on computers.

For me, data science is the focus on building predictive models that are going to be prescriptive. They’re going to prescribe what you should do next as a marketer or as a business person, and that’s the fundamental difference from business intelligence, but it’s really all an interconnected ecosystem. My research in operationalizing data science is about showing that to unlock the power of these predictive models, you really need to worry about all of the data, from the very first staging area where raw data comes into an organization to how that data gets deployed for making business decisions at the line of business.

Henry: So how does AI and machine learning fit with data science? Is it just the commercialization of those things?

Stephen: It really is. Again, if you want to look for the meaning of the words, you have to look at it historically. Artificial intelligence has been around at least since the 1950s, and it really was working to create human behavior on computers or machines that pass what’s called the “Turing test”, which is the ability for a computer to mimic a human being’s behavior in such a way that a human couldn’t tell the difference between a computer and a person. After AI ran out of steam and ran into the AI winter, machine learning came along and said, “Hey, rather than trying to create programs that mimic human beings, why don’t we let a computer learn those behaviors just the way a person did?”

That started the whole effort in looking at large databases for optical character recognition, which even in the late 1990s was sort of an unsolved problem. It was hard for a computer to read somebody’s handwriting, and now it’s a fairly solved problem. So, AI kind of morphed into machine learning and it was successful. Now, AI has kind of come back for self-driving cars, image recognition, and natural language generation. A lot of that is actually due to using data-driven approaches, which is more synonymous with machine learning than it is with artificial intelligence.

If you look at it from a purely academic perspective, you crack open these algorithms that are using data to learn, whether they’re AI or machine learning or statistics, and fundamentally they’re all doing the same thing. Not to get too technical, but they’re searching for an optimal model in a very high-dimensional space to differentiate between positive and negative instances. The difference is the way they go about coming up with that model. With AI, you classically do it in a slightly different way than machine learning or statistics, but there’s a tremendous amount of overlap.

Henry: You briefly mentioned some of the successes of data science. Could you talk more about those successes and give some examples?

Stephen: The classic successes of data science are oftentimes dealing with customers. So one of the classic problems is predicting when you’re going to lose a customer, or churn. Another big problem is figuring out the next best thing to sell to your customer. If your customer buys a pair of jeans, should you try and sell them a book next or a belt or a shirt or a pair of shoes? Making a good match can make all the difference in terms of whether they actually buy it or not.

Data science techniques, specifically neural networks and decision trees, have been used for fraud detection and analyzing credit scores for decades now. Those are very well understood technologies, and there are so many applications. Again, I’m using “data science” fairly broadly, but I would include the work with deep learning neural networks. The breakthroughs today have been in image understanding, where it’s fairly easy to train a computer to look for certain objects.

I remember reading something about a cucumber farmer in Japan who took a deep learning neural network and was able to automatically classify his cucumbers. That’s a classic example for an industry that is not very technology-savvy. But it’s really an indicator of how far we’ve come when you get a cucumber farmer using artificial intelligence.

Henry: What are some of the problems companies are having that prevent successes with data science?

Stephen: A lot of the time, companies are doing prototypes and they will get amazing results. They can detect fraud going on in the company or predict which customers are going to leave, but oftentimes they have trouble scaling that up. So, there are a number of problems that they run into.

I think of it as the data science sandwich, where the middle of the sandwich is what everybody worries about: the algorithms, classically called data science. That’s where you’re deciding whether you want to use something like logistic regression from classical statistics, or neural networks, which are both AI and machine learning, or random forests, etc. The bottom line is that some techniques are better than others.

The number of techniques that you can choose from is unending, and the reality is that many of the techniques will do a very good job for you. That’s not the challenge. The challenge is that maybe 50 percent of the models, after they have been carefully built by the data scientist, get delivered, but they’re delivered late and the business has changed and they’re not used. Maybe the model is delivered but it doesn’t match the business problem and it’s not used. There’s a real disconnect between what’s going on with the data science teams and understanding the business problems.

In interviewing folks for my research, one very knowledgeable person said that the biggest barrier he saw to data science being fully realized was just communication. If the data scientists could be better storytellers and they could communicate with the businesspeople better, then it would be much, much more likely that the models would be put to good use.

So communication is a slice of bread in the data science sandwich. In the bottom slice, there’s data preprocessing. You could argue that the breakthrough in neural networks was not about neural networks and machine learning; it’s really about data preprocessing.

Convolutional neural networks are actually just kind of tricks for ways of preprocessing the pixelated images that are coming into the neural networks. So, if you gave smart data scientists the choice between unbelievably rich, high-quality data and any predictive analytics algorithm they wanted but poor data, they would always pick the rich data. That’s why so much time is spent enriching the data.
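To make the preprocessing point concrete, here is a small illustrative sketch (the function name and data are invented, not from the conversation) of one of the simplest preprocessing steps: rescaling a raw feature into a common range before a model ever sees it.

```python
def min_max_scale(values):
    """Rescale raw feature values into [0, 1] -- a tiny example of the
    kind of data preprocessing that often matters more than which
    algorithm you pick."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant feature carries no signal
    return [(v - lo) / (hi - lo) for v in values]

# Raw incomes on wildly different scales become comparable features.
scaled = min_max_scale([10, 20, 30])
assert scaled == [0.0, 0.5, 1.0]
```

Even a trivial step like this is part of the "bottom slice" of the sandwich: it happens before the algorithm, and getting it wrong degrades every model downstream.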

Now, when you’re trying to operationalize data science, you’re talking about making it repeatable, scalable, and flawless. To do that, you need to make sure people communicate, and you need to recognize that data lineage and governance are very important, because a lot of the boring stuff that has to do with data management is really critical to the success of data science, as is understanding the business problems.

Henry: You just said there that operationalizing data science is to make it repeatable and safe. Can you expand on that?

Stephen: The results from data science and predictive analytics are prescriptive. What comes out of a data science group is generally going to tell business users what to do – ‘Mail to this person’; ‘Don’t mail to this person’; ‘Report this person to the treasury department for suspected money laundering’; ‘Don’t report this person’ – so it prescribes steps, but oftentimes what it’s prescribing is not intuitive. If it were intuitive, then you wouldn’t really need the data science, because the people using the model would’ve already done it.

Some of this is really about building trust in the end users. I just got off the phone today with somebody and they said one of the biggest barriers to the use of data science is that it’s scary. The math involved in predictive analytics is much more complex than what’s going on in BI or data management in general. So, you combine the complexity along with the fact that you’re supposed to do what it tells you to do, and it’s usually telling you to do something that has some real monetary impact…

I actually saw an example where a model was built, and in the recoding process the score was completely reversed: a negative sign got introduced, so the good customers became the bad customers and the bad customers became the good customers. And the marketing treatments were a complete mismatch. It was one little flipped bit, and it had a huge impact. I don’t know how much financial impact it had, but marketing campaigns cost hundreds of thousands of dollars, if not millions, so there’s a lot at stake.
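A minimal sketch of how that failure mode works (the names and scores here are hypothetical, not from the actual incident): ranking customers by a model score, where a single stray sign flip during recoding silently reverses the entire ranking.

```python
# Hypothetical model scores: higher = better customer.
scores = {"alice": 0.92, "bob": 0.15, "carol": 0.67}

def top_customers(scores, sign=+1):
    """Rank customers best-first by model score. A stray sign flip
    introduced during recoding (sign=-1) silently reverses the
    ranking -- the code still runs, the output just means the opposite."""
    return sorted(scores, key=lambda name: sign * scores[name], reverse=True)

assert top_customers(scores) == ["alice", "carol", "bob"]
# One flipped bit, and the best customers are treated as the worst:
assert top_customers(scores, sign=-1) == ["bob", "carol", "alice"]
```

A cheap operational safeguard is a sanity check between scoring and deployment, e.g. asserting that the top-ranked customer actually holds the maximum raw score.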

The goal in operationalizing data science is to ideally make it safe, even boring, but boring in a good way, predictable. For example, NASA and SpaceX ideally want their rocket launches to be boring, not exciting. They don’t want them to be unpredictable, and that’s what you want from data science as well.

Henry: So you’re saying some of the keys to operationalizing data science are not about technology. Sometimes it’s more about communication and trust and some of those soft problems?

Stephen: My research is showing that the biggest gains can actually be made from organizational and communication improvements, not necessarily technological ones. There are technological ones as well, but one problem that happens over and over again is that business people ask for a model, and data scientists go off and build it for a month. Then the business person says, ‘I don’t think I’m going to use it.’

The best way to solve that is to build an agile team with representatives from different functional groups: a business representative, data science engineer, business analyst, and data librarian. That would really solve it. And, again, a lot of it is about communication. There are numerous examples of a company doing very well with their data science initiatives, but all of their models are going through one person. That person gets another job some place else, and the whole function falls apart. So communication is essential in terms of describing the models you’re building and what the data means and what kind of data preprocessing is being done.

Henry: So, responsiveness and communication are essential to operationalizing data science. They also sound like the biggest inhibitors. Are there other things that are inhibiting the operationalization of data science?  

Stephen: Yeah, there are definitely other inhibitors. One is that data scientists, as they’re often defined, are kind of mythical creatures. They are the combination of PhD statisticians and super-smart PhD machine learning people who are very good at math, but also able to talk coherently with a business user and understand business problems. That narrows down your pool of applicants quite a bit. The definition also includes the ability to write code and manage data, and most statisticians don’t like to write code. These are all fairly sophisticated skills, so with that definition of a data scientist, it’s pretty hard to get your system up and running.

If instead you think of data science as sort of a hologram with different aspects, with different people representing different skill sets on an agile team, then that kind of removes the skill-set barrier.

But there are other challenges, too. Something as simple as randomly sampling a database is absolutely critical for successful model building, and the number of times I’ve heard of people getting that sampling wrong is quite high. That can cause very embarrassing problems: your model performs spectacularly in the lab, and then when you actually invest in it, it crashes and burns, and you find out you had been using the same data, or similar data, for both training it and testing it in the lab.

Henry: So why is this happening now? Why weren’t we talking about operationalizing data science 10 years ago?

Stephen: I think part of it is because of our successes. If you did data science 10 or 15 years ago and you wanted a neural network, you might borrow somebody’s code from academia, or you would just write your own. You could certainly use some tools as well, but with the advent of open source, there’s such an opportunity with Python libraries and different things you can do on Spark. A lot of the challenge of actually building the models has been taken care of for you, and now there’s just this opportunity to apply data science in so many places. I also think it could be that more and more companies are interacting and behaving with customers in a digital way. We have so much more opportunity for data science to be applied.

For example, if you go into a store and you buy a shovel and you walk out the door, it’s pretty difficult to say, ‘Hey, while you’re buying that shovel, would you also like to buy a pair of boots?’ Now, not only do you get that information digitally, but you have the opportunity to take action digitally. I think that’s a major change. Even though that happens in commerce, it happens more and more in older companies. For example, in agriculture, companies are getting lots and lots of information about genetic variance of seeds.

It’s kind of the victim of its own success. We’ve solved the problems in the middle of the data science sandwich, but it has exposed tiny flaws in processing the data. In the past, data would just go into the database and it might show up in a BI report. Maybe it’s a little off, and people would ignore it. Now, a predictive model might latch onto a little data, and it might be telling somebody in the business to do something that’s completely wrong. It’s sort of a magnifier of imperfections.

Henry: What are some examples of vendors and tools that operationalize data science?

Stephen: Well, the quick answer is all of them, which is very heartening. In the past, you’d get a vendor that was just building a particular algorithm and every single one of the vendors I’ve talked to have recognized that the real value of data science is when it can be operationalized.

The big players have an opportunity because they’ve got one of everything, so folks like SAP, Oracle, SAS, and IBM, but there’s also smaller companies like Dataiku and Domino. And there are big companies that are also startups like Cloudera and FICO. These companies don’t just have the algorithms, but they have things for doing data lineage and data libraries. They have incorporated ways to embed the outputs of predictive models into BI tools and make it possible to visualize it.

Some of them can even couple a prediction from a data science team with business rules. So, one example might be ‘What is the right price for the sales item?’ and a predictive model might say, ‘Take 22.37% off this dress,’ and that’s great. That might actually be the optimal thing to do, but there are real-world constraints. Maybe, when a store does a sale, they only do 20 percent, 30 percent, or 40 percent, so you might have a business rule that overrides a particular recommendation. That’s a nice synergy between common sense and human oversight along with the power of a predictive model.

There are also optimization tools that I think are very, very critical. Sometimes a business model just tells you what’s going to happen. Then you have to decide what to do about it.

Henry: How do organizations that use these tools and features benefit?

Stephen: There’s a book called Predictive Analytics, which is a spectacular book about examples of using data science to predict things, but the first four or five examples they go through in the book are all about near horrible mistakes. So, this goes back to trust and risk. There are well-known examples of doing everything correct, and your brilliant predictive model figures out how to target pregnant teenagers who may or may not have told their parents yet. So, having good checks and balances and processes to prevent things like that from happening is good. That’s the downside.

On the upside, my prediction is that you’re going to see companies that almost mysteriously start to take over market share in their particular industries. They won’t have come out with a new product, had a technology or marketing breakthrough. Somehow they just seem to be growing market share – and that’s the real power of data science. You can make everything much more efficient without having to purchase new factories or do new things. Everything you do is just going to be a little bit better, and if it’s 10 percent better, that’s huge. That’s a game changer in the marketplace.

So I think some embarrassing things can happen if you don’t have your house in order and have your data science operationalized. On the other side, you can see benefits without massive changes in your business model. You can really see a dramatic shift in your market share.

Henry: Could you talk about some of the best practices that businesses could follow and use to get the benefits rather than the disasters?

Stephen: One of the biggest ones is the idea of an agile team. On the agile team you would have probably some privacy expertise. We have the GDPR threatening to hit us over the head with privacy laws coming May 25, 2018. The thing about data science is it really touches all parts of the enterprise, all the way from the very earliest parts of data ingestion all the way up through the business plan for a new market or a new product. I think that a multidisciplinary team is one of the keys that I’ve run across so far.

Another is the centralization of the data science functionality. So there’s a running argument whether data science does require rocket scientist to run these models. The question is should we have folks all centralized in one place and have them supporting a line of business or should we embed these folks in different lines of business? I generally like to see things democratized and kind of moved out to the fringes as much as possible, but what seems to work for data science is to build it like a computer software company within your company.

If you’re building a product, you don’t send your development team to your customers to build the product. Instead, you send out product managers or product marketing managers to embed themselves in the industry with the customers to understand what the customers really need. That’s what I think is going to work for large organizations who operationalize data science. They will have a strong central location where you’re going to have data scientists, data engineers, etc. and then embedded product marketing managers or product managers working with the business units.

If you had a hundred people in your data science group, rather than having 10 of them embedded in nine different departments or lines of business, you’d have one person embedded in each line of business or department and the rest of the folks centralized.

Henry: Can you speak to what this means for data scientists? What does the operationalization of data science mean for their future?

Stephen: I think it’s going to make their future a lot more enjoyable. Today, if you’re in an organization that really hasn’t operationalized data science, you are going to have to go and find new data streams yourself. That might be kind of fun, but how much better would it be if your data sources were in a internal marketplace where you could quickly see all of them and not be restricted from accessing them? In the classic enterprise data warehouse, ‘we know what data is good for you. Here it is in the data warehouse. Don’t touch anything else.’ That doesn’t work for data science because we don’t know what data we need until we play with it.

So getting that data access directly is important. It jut saves you a lot of time so you can focus a lot more on the interesting pieces, such as building the models and understanding the business. It also means that your work is going to get used by other people. Again, the worse thing you can do to a data scientist is have them do all this work and then not use their models.

It will make data science rise in importance within the organization, which can only be a good thing for all people involved.

Henry: Obviously there are a lot of potential benefits that businesses can reap from using data science, but in one of your blogs you write there’s a disconnect between the hype over data science and what organizations actually need and want. Can you explain this and what it means for the movement to operationalize data science?

Stephen: By way of analogy, when Napoleon conquered most of Europe with superior strategy and better armies, he eventually failed with the Russian winter, not because he wasn’t brilliant enough, but because he couldn’t put shoes on the feet of his soldiers. I think data science is a little bit like that. It’s the logistics. It’s the boring pieces that are really holding it back right now. What’s interesting is all these pieces that need to be fixed for data science really need to be fixed across the organization.

Everybody knows we need better data governance, data lineage, sharing, and understanding of what the data sources are. Everybody agrees that we need better connections between data usage and business users.

Henry: What would say to an organization that is just struggling to have a successful BI program? ‘Don’t even bother with data science yet. Get these things in order first’?

Stephen: That’s a hard question.

I think that everything you do helps. So, a good BI program is going to expose the data. I think there’s an adage that if the data or metric is not used, then it’s probably got something wrong with the way it’s calculated. So the more people use the data, the more likely the data is in good shape. I think the progression that works is getting started with some outside consultants on specific projects. The most important thing you could start with is an assessment of what types of projects might benefit from predictive analytics and data science.

So go ahead and get started with these one off projects, but after you start to get them under your belt, you need to step back and say, ‘Looks like we’ve got something that could be a benefit that will help you sell it to senior management because you have these results.’ At that point you need to put a plan together.

The most successful companies think for the very long-term, so they think about how big the data science capability needs to scale. Can it get by on a terabyte or 60 terabytes of data an hour? That’s a very different investment, but you need to look at that and actually plan.

I think you need some experimentation to get your feet wet and the fastest way to do that is to pull in outside resources, use good high-level tools, and then start to think about hiring and building your own core competency internally after you get those initial successes and senior management buy-in.

You really need to put together a business plan that says, ‘You know, here’s what it’s going to be in year one. Here’s what it’s going to look like in year 10,’ and then go to execute that, execute on that plan as best you can.

Henry: I know you have a report coming out on operationalizing data science. When does that come out and what will readers learn?

Stephen: That is coming out at the end of February. I’m just wrapping up the research. What we’ve been doing is talking to all the top vendors in this space as well as their customers. The premise is basically the following, ‘data science is awesome, but its potential is unrealized’, and then, it’s two questions: what are the barriers to realizing that full potential? And then, what suggestions do you have for other folks in terms of best practices to overcome those barriers?

And the list is quite long. We’re actually getting feedback from folks within the industry to narrow that down to the top eight. It’s really a report that is meant for CIOs, CAOs, CDOs, and analytics leaders to help them to figure out how to plan for a strong data science offering within their corporation and how to operationalize that to reap the benefits of it.

We’re hoping that this will be a working document that will be immediately helpful. For instance, we have suggestions about how to get senior management buy-in. You really need to bring in people from other industries to sell your senior management and hold internal conferences to show people how it works because it is complex and a little bit scary. So it really should be a very effective 20 pages of what someone can do right now as they’re building their data science offering or, if they’ve already got something running, to start to operationalize it.

If you liked the podcast, please subscribe!

Henry H. Eckerson

Henry Eckerson covers business intelligence and analytics at Eckerson Group and has a keen interest in artificial intelligence, deep learning, predictive analytics, and cloud data warehousing. When not researching and...

More About Henry H. Eckerson