Dirty Data Lakes and Dubious Analytics

“Gold is down almost 40% since it peaked in 2011. But it’s still up almost 350% since 2000. Although since 1980, on an inflation-adjusted basis, it’s basically flat. However, since the early 1970s it’s up over 7% per year (or about 3.4% after inflation).” Ben Carlson, an institutional investment manager, provides this wonderful example of how statistical data can be abused, in this case by playing with time horizons. Ben is talking about making investment decisions. Let me replay his conclusions, slightly reworded to make a more general point.

“It’s very easy to cherry-pick historical data that fits your narrative to prove a point about anything. It doesn’t necessarily mean you’re right or wrong. It just means that the world is full of conflicting evidence because the results over most time frames are nowhere close to average. If the performance of everything was predictable over any given time horizon, there would be no risk.”
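
To make the mechanics of this cherry-picking concrete, here is a minimal sketch that computes compound annual growth rates over different start dates. The prices are hypothetical placeholders rather than real gold data; the point is only that the same series yields wildly different headline numbers depending on which window you choose.

```python
# Minimal sketch: how the chosen time horizon changes the headline return.
# The prices below are hypothetical placeholders, not real gold data.
prices = {1970: 36, 1980: 615, 2000: 280, 2011: 1900, 2014: 1200}

def cagr(start_year, end_year):
    """Compound annual growth rate between two years in the series."""
    start, end = prices[start_year], prices[end_year]
    years = end_year - start_year
    return (end / start) ** (1 / years) - 1

# Same data, four different narratives, depending only on the start date.
for start in (1970, 1980, 2000, 2011):
    print(f"Since {start}: {cagr(start, 2014):+.1%} per year")
```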

We have entered a period of history where information has become super-abundant. It would be wise, I suggest, to consider all the ways this information can be misinterpreted or abused. Through ignorance, so-called confirmation bias, intention to deceive, and a dozen other causes, we can mislead, be misled, or slip into analysis paralysis. How can we avoid these pitfalls? Before attempting my own answer, let’s take a look at an example of dangerous thinking that can be found even among big data experts.

Jean-Luc Chatelain, a Big Data Technology & Strategy Executive, recently declared “an end to data torture” courtesy of Data Lakes. Arguing that a leading driver is cost, he says Data Lakes “enable massive amount of information to be stored at a very economically viable point [versus] traditional IT storage hardware”. While factually correct, this latter statement actually says nothing about overall cost: the growth in data volumes probably exceeds the rate of decline in storage costs and, more importantly, data governance costs grow with both the volume and the disparity of the data stored.
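
To illustrate the cost point, consider a toy model in which data volume grows faster than unit storage cost falls; the growth and decline rates below are assumptions chosen purely for illustration, and governance cost, which scales with both volume and disparity, is not even counted.

```python
# Illustrative only: if data volume grows faster than unit storage cost falls,
# the total bill still rises. The rates below are assumptions, not measurements.
volume_growth = 0.40      # assumed: 40% more data stored each year
unit_cost_decline = 0.25  # assumed: storage gets 25% cheaper each year

volume, unit_cost = 1.0, 1.0   # normalised starting points
for year in range(1, 6):
    volume *= 1 + volume_growth
    unit_cost *= 1 - unit_cost_decline
    print(f"Year {year}: relative storage spend = {volume * unit_cost:.2f}")
```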

More worryingly, he goes on to say: “the truly important benefit that Data-Lakes bring to the ‘information powered enterprise’ is… ‘High quality actionable insights’”. This conflation of vast stores of often poorly defined and poorly managed data with high-quality actionable insights flies in the face of common sense. High-quality actionable insights more likely stem from high-quality, well-defined, meaningful information than from large, ill-defined data stores. Actionable insights require the very human behavior of contextualizing new information within personal or organizational experience. No amount of Lake Data can address this need. Finally, the choice of action may be based on the best estimate of whether the information offers a valid forecast of the outcome… or on the desires, intentions, and vision of the decision maker, especially if the available information is deemed a poor indicator of the likely future outcome. And Chatelain’s misdirected tirade against ETL (extract, torture and lose, as he labels it) ignores most of the rationale behind the process in order to cherry-pick some well-known implementation weaknesses.

Whether data scientist or business analyst, the first step with data—especially with disparate, dirty data—is always to structure and cleanse it; basically, to make it fit for analytic purpose. Despite the field’s very short history, it is already recognized that 80% or more of data scientists’ effort goes into this data preparation. Attempts to automate this process and to apply good governance principles are already underway from start-ups like @WaterlineData and @AlpineDataLabs, as well as long-standing companies like @Teradata and @IBMbigdata. But, as always, the choice of what to use and how to use it depends on human skill and experience. And make no mistake, most big data analytics moves very quickly from “all the data” to a subset that is defined by its usefulness and applicability to the issue in hand. Big data rapidly becomes focused data in production situations. Returning again and again to the big data source for additional “insights” is governed by the law of diminishing returns.
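
As a rough illustration of what “structure and cleanse” means in practice, here is a minimal sketch using pandas; the column names and cleansing rules are hypothetical stand-ins for the far messier work real data preparation involves.

```python
import pandas as pd

# A minimal sketch of the "structure and cleanse" step.
# Column names and cleansing rules are hypothetical, for illustration only.
raw = pd.DataFrame({
    "customer": [" Acme Corp", "acme corp", "Beta Ltd", None],
    "revenue":  ["1,200", "1200", "950", "n/a"],
})

clean = (
    raw.dropna(subset=["customer"])                       # drop unusable rows
       .assign(customer=lambda d: d["customer"].str.strip().str.lower(),
               revenue=lambda d: pd.to_numeric(
                   d["revenue"].str.replace(",", "", regex=False),
                   errors="coerce"))
       .drop_duplicates(subset=["customer", "revenue"])   # collapse duplicates
)
print(clean)
```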

It is my belief that our current fascination with collecting data about literally everything is taking us down a misleading path. Of course, in some cases, more data and, preferably, better data can offer a better foundation for insight and decision making. However, it is wrong to assume that more data always leads to more insight or better decisions. As in the past evolution of BI, we are again focusing on the tools and technology. Where we need to focus is on improving our human ability to contextualize data and extract valid meaning from it. We need to train ourselves to see the limits of data’s ability to predict the future and the privacy and economic dangers inherent in quantifying everything. We need to take responsibility for our intentions and insights, our beliefs and intuitions that underpin our decisions in business and in life.

“The data made me do it” is a deeply disturbing rationale.

This article was originally published on December 9, 2014 on itknowledgeexchange.techtarget.com.


Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988....
