Data Science is Plutonium Powerful: Dangerous and Handle With Care
I Have a Piece of Uranium in My Bedroom
Actually it is just uranium ore, and it is stored away in my closet in one of those ‘junior rock collections’ that eight-year-old boys get when they first visit the Grand Canyon.
It is not dangerous.
Plutonium, on the other hand, is warm to the touch and will bring a glass of cold water to a boil if dropped in. Plutonium is made from uranium and uranium is refined from uranium ore.
My piece of uranium ore.
Business Intelligence is like uranium ore. It is potentially valuable but only when processed and carefully handled by humans.
Data Science is like plutonium. Vastly more powerful and vastly more dangerous than uranium ore. By itself it can generate electricity, power satellites and make very big explosions.
Data Science is Different from Business Intelligence
It may seem that business intelligence (BI) is similar to data science (DS). Data science encompasses predictive analytics, machine learning, data mining, and even parts of what is considered to be artificial intelligence. It is routinely touted as improving revenue, profit, and ROI. It is often presented as automated and able to discover knowledge. These huge impacts and the automated nature of many of its applications make DS particularly dangerous.
Business intelligence (BI), on the other hand, is used for processing and organizing business data so that it can be adeptly navigated by a competent human data analyst. It supports decision makers in making better and faster decisions. It is not typically relied upon for making decisions directly.
Under the hood both BI and DS are about processing data to find patterns that can aid in making business decisions. DS just brings to bear more sophisticated algorithms, more math, more data, more power, and often also delivers less transparency and more complexity.
We need both BI and DS but we also need to be careful to not underestimate the power and risks of data science just because it seems similar to BI.
Although plutonium is made from relatively safe uranium ore, it behaves very differently. The same is true of data science and BI. There is the possibility of explosions. Big explosions.
Four Things that Lead to Data Science ‘Explosions’
Here are four types of ‘explosions’ that regularly occur with the use of data science:
- “Oops we just used ethnicity to deny someone a home loan.” Breaches in the use of personally identifiable information (PII) happen all the time in DS, from alerting parents that their teenage daughter is pregnant to not realizing that ZIP+4 is heavily correlated with ethnic background. The costs of these privacy breaches are very high and typically land your CEO on the front page of the Wall Street Journal. With the deep rumblings of GDPR growing to a roar come May 2018, this is a big one to watch.
- “Gee that model worked perfectly on the test data.” There is a whole class of data science ‘supernova explosions’ that occur via something called “temporal leakage”. For instance, if you are predicting today’s stock price, it is perfectly reasonable to build a predictive model using yesterday’s stock price and any information you have from any day in the past. What is not ok is to use information that you have from today (that would be cheating). But a surprising number of times, models are built where future information (also known as ‘the answer’) leaks backward in time (temporally). This can happen because a field is misnamed or because of a computer science fencepost error. The net result is that the model is not nearly as good in practice as it was in the lab. This can result in large amounts of money being mis-invested and lost. You rarely hear about these types of mistakes because they are so costly and embarrassing.
- “The model was amazing at predicting the wrong thing.” One of my favorite examples from my book (“Building Data Mining Applications for CRM”) is where a mobile phone carrier builds a model to predict which customers are most at risk of not renewing their contract (churning). The model that they produced was awesome at predicting those likely to churn. The marketers then pre-emptively sent those high-risk customers special and valuable marketing offers reminding them to renew. Perversely, this dramatically increased churn because it reminded the at-risk customers that their contract was coming due. The model was awesome but poorly integrated with the rest of the business.
- “The model just isn’t working any more on this data set.” A friend of mine ran a hedge fund and he found many instances where the vendors who were providing industry-wide datasets actually had errors in their data. They, and other hedge funds, correctly built models that were based on the bad data. And in the world of data science: great model building + bad data = bad model. Luckily his hedge fund discovered the data errors and was actually able to exploit the behavior of other hedge funds who were using the same data but didn’t notice the mistakes.
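The temporal leakage described in the second item above can be made concrete with a small sketch. It is an illustration only: the price series is made up, and the function names (`make_examples`, `leaky_examples`, `naive_error`) are hypothetical helpers, not part of any library. The honest version pairs today's price with yesterday's; the leaky version has a fencepost error that pairs today's price with itself, so a trivial "model" looks perfect in the lab.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (feature, target)


def make_examples(prices: List[float], lag: int = 1) -> List[Example]:
    """Pair each day's price (the target) with the price `lag` days earlier."""
    if lag < 1:
        # lag=0 would use today's price to predict today's price.
        raise ValueError("lag must be >= 1 to avoid temporal leakage")
    return [(prices[t - lag], prices[t]) for t in range(lag, len(prices))]


def leaky_examples(prices: List[float]) -> List[Example]:
    """Buggy variant: the feature index equals the target index (fencepost error)."""
    return [(prices[t], prices[t]) for t in range(1, len(prices))]


def naive_error(examples: List[Example]) -> float:
    """Mean absolute error of a model that simply predicts the feature value."""
    return sum(abs(feature - target) for feature, target in examples) / len(examples)


# Made-up prices, for illustration only.
prices = [100.0, 102.0, 101.0, 105.0, 104.0]
honest = make_examples(prices)
leaky = leaky_examples(prices)

print("honest error:", naive_error(honest))  # 2.0
print("leaky error:", naive_error(leaky))    # 0.0, too good to be true
```

An error of exactly zero is the kind of result that should prompt a review of how the features were built rather than a celebration.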
Four Data Science Prophylactics
Before we get started on some of the fixes to the above problems with our data science plutonium, let’s make sure we’re on the same page as to the meaning of the word ‘prophylactic’:
Prophylactic (noun) – a medicine or course of action used to prevent disease.
Avoiding these DS nuclear meltdowns requires catching and remediating them at the earliest possible stage. It is similar to the difference between the Three Mile Island nuclear accident, which was caught early enough that no one was killed, and Chernobyl and Fukushima, which both experienced core meltdowns and rated a severity score of seven out of a maximum of seven.
Data science, like plutonium, must be part of a disciplined operational process in order to release its power and prevent explosions. Image: Public domain: http://www.nrc.gov/reading-rm/basic-ref/students/animated-pwr.html
Here are four recommended prophylactics. Several of them are already built into some of the best data science tools currently available.
- PII alerting systems. Your data science process and tools should be tightly linked to your master data management and data governance so that a data scientist can be automatically alerted when a field might contain PII or be unlawful to use (see various privacy regulations like HIPAA, FERPA or GDPR).
- Too good to be true alerting systems. Temporal leakage problems are very difficult to catch automatically and are best caught by strong process and a thorough review of the model and the code. One thing that can help is to have an automated alert that sounds the alarm when a model seems too good to be true. These same alarm triggers should be deeply embedded within any good data scientist’s psyche as well.
- Multidisciplinary agile team that includes business expertise. The problem of having great models that are disconnected from the realities of the business is best solved culturally, by integrating a marketer or line of business executive into the agile team responsible for the business decisions that come from the model. Create an agile team accountable for a business outcome that includes a data scientist. Don’t just create a data science team accountable for creating great models.
- Explainable models and good master data management. Sorry deep learning guys, but a model that is opaque and can’t be explained or scrutinized can’t be trusted for big expensive problems. Look for transparency of the model and look to constantly be improving your MDM.
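The first two prophylactics lend themselves to simple automated checks. The sketch below is a rough illustration, not a description of any particular tool: the `SENSITIVE_FIELDS` catalog, the field names, and the thresholds are all assumptions made up for the example. In practice the catalog would come from your MDM or data governance system, and the thresholds would be set per problem.

```python
from typing import Dict, Iterable, List

# Hypothetical governance catalog: field name -> reason it needs review.
SENSITIVE_FIELDS: Dict[str, str] = {
    "ethnicity": "protected attribute",
    "zip_plus_4": "possible proxy for ethnicity",
    "date_of_birth": "personally identifiable",
}


def pii_alerts(model_fields: Iterable[str]) -> List[str]:
    """Return one warning per model input that the catalog flags."""
    return [
        f"{name}: {SENSITIVE_FIELDS[name]}"
        for name in model_fields
        if name in SENSITIVE_FIELDS
    ]


def too_good_to_be_true(
    test_score: float,
    baseline_score: float,
    max_lift: float = 0.25,
    ceiling: float = 0.99,
) -> bool:
    """Flag a model whose score is near perfect or far above the baseline.

    The default thresholds are illustrative, not recommended values.
    """
    return test_score >= ceiling or (test_score - baseline_score) > max_lift


print(pii_alerts(["tenure_months", "zip_plus_4"]))
print(too_good_to_be_true(test_score=0.995, baseline_score=0.70))  # True
print(too_good_to_be_true(test_score=0.78, baseline_score=0.70))   # False
```

Neither check replaces a thorough review of the model and the code. They only make sure that a human looks before the model ships.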
Four Years from Now
We are lucky. We have been given an incredibly valuable and powerful resource. Plutonium allows us to fly satellites to Saturn and generate clean electricity.
Plutonium can play a critical part in achieving some of our greatest ambitions. It can be used safely when encapsulated in an appliance like a deep space probe. Image: Public domain: http://solarsystem.nasa.gov/multimedia/display.cfm?IM_ID=2071
Data science is also extremely powerful. It is capable of generating 30% more revenue or 50% more profit within existing business processes without additional investment. But it can also land your CEO on the front page of the Wall Street Journal … for all the wrong reasons.
Here are four things I predict will happen in the next four years as we increase the usage of our data science plutonium:
- There will continue to be some very expensive mistakes made in new and creative ways. In a recent conversation, Mick Hollison of Cloudera gave the personal example of his son’s flight home from college when Hurricane Irma was about to hit Florida. The airline prices went sky high when automated systems (maybe even weak versions of data science) recognized the opportunity of high demand and limited supply. Exploiting this business opportunity would normally be a good idea, but it became a public relations nightmare for the airlines when national news sources presented the behavior as price gouging in a time of crisis.
- Self-updating / self-modifying models will be recognized as dangerous. There is a lot of interest in pushing data science to the edge - to make decisions very rapidly at the point of contact with a consumer (e.g. to make a recommendation for a new product purchase before a consumer has left the store page on the website). These models function autonomously within the milliseconds required to retain the customer’s attention. This is a highly successful use of data science but some data scientists want the models themselves to be constantly changing based on real time conditions and data. Depending on the importance of the prediction this may be a very bad idea. It is ok for a model to modify itself if it is recommending the purchase of the blue flaming sword rather than the gold magic potion in a video game. It is another story if the self-modifying model begins to miss lots of suspicious money laundering activity or modifies the definition of a cardiac event in a heart monitor.
- Data science appliances will be the biggest wins. One way to manage the complexity and the dangers of data science is to encapsulate it in strong operational processes. The other way is to encapsulate it in an ‘appliance’ that has limited functionality but is very good at what it does. For instance, packaging up data science into something that just does prediction of churn in the mobile telephone space will be much less likely to have errors than something that is meant to be general purpose. Expect to see packaged DS solutions for fraud detection, next best product prediction and others that will be very aware of the industry and type of problem that they are solving. They will be much less prone to explosions.
- Some second-place companies in data-driven industries will take leadership roles based on their ability to operationalize data science. These companies will not make it known that they are leveraging data science but they will reap its benefits and keep away from the front page of newspapers for DS supernova events.
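The caution about self-updating models in the second prediction above can be expressed as a simple gate in front of any automatic model swap. This is a minimal sketch under stated assumptions: the `review_update` function, the `high_stakes` flag, and the `max_drop` tolerance are hypothetical names invented for the example, and the scores are assumed to come from a holdout set where higher is better.

```python
from dataclasses import dataclass


@dataclass
class UpdateDecision:
    deploy: bool
    reason: str


def review_update(
    high_stakes: bool,
    current_score: float,
    candidate_score: float,
    max_drop: float = 0.01,
) -> UpdateDecision:
    """Decide whether a self-updated model may replace the current one."""
    if high_stakes:
        # Money laundering detection, heart monitors: never swap automatically.
        return UpdateDecision(False, "high-stakes model: route to human review")
    if candidate_score < current_score - max_drop:
        return UpdateDecision(False, "candidate scores worse on holdout data")
    return UpdateDecision(True, "low-stakes model within tolerance")


# A video game recommender may update itself; a fraud model may not.
print(review_update(high_stakes=False, current_score=0.90, candidate_score=0.91))
print(review_update(high_stakes=True, current_score=0.90, candidate_score=0.95))
```

The design choice is that the importance of the prediction, not the quality of the new model, decides whether a human is in the loop.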
The application of data science within American companies is going to be a game changer. Eric Siegel, who wrote the excellent book “Predictive Analytics”, called it the “prediction effect”. Successful companies can add data science to existing business processes without any further investment and make their top and bottom lines grow significantly. You will recognize the companies who operationalize data science by their success. Yes, like plutonium, data science has its dangers and risks, but the benefits far outweigh the risks when it is used correctly. Data science has become a game that you can no longer afford not to play. Just make sure you’re playing safely.
Expert Insights. Many thanks to Mick Hollison who provided the example of airline prices during the Florida hurricane. Mick is the Chief Marketing Officer at Cloudera. www.cloudera.com
More Writing from Stephen J. Smith:
“The Demise of the Data Warehouse” - https://www.eckerson.com/articles/the-demise-of-the-data-warehouse
“Ok, I Was Wrong, MDM is Broken Too: Insular, Dictatorial MDM Doesn’t Work” - https://www.eckerson.com/articles/ok-i-was-wrong-mdm-is-broken-too-insular-dictatorial-mdm-doesn-t-work
“IT Activation Energy: Cloud Considerations in Retail and Insurance” - https://www.eckerson.com/articles/it-activation-energy-cloud-considerations-in-retail-and-insurance
“Cloud Data Warehousing: Producing the Infrastructureless Culture” - https://www.eckerson.com/articles/cloud-data-warehousing-producing-the-infrastructureless-culture
“17 Things a CDO Should Know: A Report from MIT’s CDO-IQ Symposium” - https://www.eckerson.com/articles/17-things-a-cdo-should-know-a-report-from-mit-s-cdo-iq-symposium
“Deep Learning, AI and Privacy” - https://www.eckerson.com/articles/deep-learning-ai-and-privacy
“Enterprise-Grade Data Science” - https://www.eckerson.com/articles/enterprise-grade-data-science
“For Best Predictive Analytics Results: Don’t Move the Data” - https://www.eckerson.com/articles/for-best-predictive-analytics-results-don-t-move-the-data