Data Wrangling, Information Juggling and Contextual Meaning, Part 1
“Data wrangling is a huge—and surprisingly so—part of the job,” said Monica Rogati, vice president for data science at Jawbone, in a mid-2014 New York Times article by @SteveLohr that I came across recently. “At times, it feels like everything we do.”
With all due respect to Ms. Rogati, the only surprising thing about this is her surprise. Data wrangling, previously known as data cleansing and integration, is as old as data warehousing itself. Older, perhaps, than many data scientists now in senior roles in the industry. Even the oft-quoted “80% of data scientists’ time is spent preparing data” goes way back, having been applied both to data warehouse projects, where ETL (extract, transform and load) design and implementation was often reckoned to be 80% of the effort, and to business analysts’ query and reporting efforts based on the warehouse. The really interesting aspect of the latter estimate is that this extensive preparation effort occurs even after IT has done the groundwork for the data warehouse. The implication, perhaps, is that we are missing something, and have been missing it for some time.
A return to first principles is called for. Why does data seem so averse to being cleansed and integrated? Why is data so persistently dirty? The answers lie in the regularly abused words data and information. This is not some academic debate about semantics; defining the difference and, further, understanding the roles of context, knowledge and meaning is the foundation for fixing the 80% problem.
From informatics to philosophy, there are many different definitions of both data and information, but let’s be pragmatic. I like to start with information, as the means by which we humans communicate with one another by assigning meaning to symbols. It began with sounds, nuanced grunts perhaps, moved largely to text for many centuries, and is now veering from still images to pervasive video. As a simple definition, we could say that information is the subjectively interpreted record of personal experiences. The actual meaning of any piece of information, therefore, is to a lesser or greater extent in the eye or ear of the beholder and depends intimately on the context in which it occurs.
This variability of meaning proved a challenge to the early builders of computer applications. The logic of programming demands well-defined and repeatable meanings. Early computers were also limited in their handling of anything other than numbers. The result was the emergence of data as the foundation of computing. Although formal data modeling appeared much later, information (what business people want), was converted, from the earliest days, into a combination of data and metadata. Data contains the values—numbers or text—of particular instances of information. Metadata, stored separately, contains the definitions, or context, of what the data instances mean. We see this very clearly in relational databases: column names and descriptions are stored separately from the actual data values. Data is what computers need and process.
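The separation described above can be seen directly in any relational database. As a minimal sketch (the table and values here are invented for illustration), SQLite keeps the column definitions apart from the row values, and each can be queried independently:

```python
import sqlite3

# In-memory database: the table definition is metadata, the rows are data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.execute("INSERT INTO sales VALUES ('EMEA', 1250.0), ('APAC', 980.5)")

# Metadata: column names and declared types, stored separately from the values.
metadata = conn.execute("PRAGMA table_info(sales)").fetchall()
print([(col[1], col[2]) for col in metadata])  # [('region', 'TEXT'), ('revenue', 'REAL')]

# Data: the values of particular instances, meaningless without the above.
data = conn.execute("SELECT * FROM sales").fetchall()
print(data)  # [('EMEA', 1250.0), ('APAC', 980.5)]
```

The row `('EMEA', 1250.0)` only becomes information when re-joined with the context that says the first element is a region and the second a revenue figure.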
But it’s actually more complex if we step beyond database systems and the internally sourced data that formed the early foundations of business computing and BI. File processing systems separate the data and metadata much further: the values reside in a file, while the definitions exist only in the code of the programs, and may not be obvious unless well commented—and that’s a practice we all recognize as widespread :-).
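A toy sketch makes the point (the file layout and field names here are invented): the file holds only values, and every definition lives in the code that reads it. Lose the code, or its comments, and the data is mute.

```python
# A line from a hypothetical raw data file: values alone, no embedded definitions.
raw_line = "EMEA,1250.00,2014-06-17"

# The metadata exists only here, in the program. Nothing in the file itself
# says what each field means; the names and comments below are the only record.
fields = raw_line.split(",")
record = {
    "region": fields[0],          # sales region code
    "revenue": float(fields[1]),  # gross revenue
    "booked_on": fields[2],       # booking date, ISO 8601
}
print(record["revenue"])  # 1250.0
```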
The challenge for BI developers and practitioners has always been to move back from data to information, where the metadata that enables that transition is incomplete, missing, or even incorrect. Note that, in this case, the data is coming from internal systems where we might expect that good data governance practices have been followed, data ownership and definitions have been agreed, and content expertise exists within the organization. As previously mentioned, even under these seemingly ideal circumstances, it is said that as much as 80% of the effort of the BI practitioner goes into data preparation. So, it would appear that obvious meaning—what BI practitioners really need—correlates at best loosely with metadata and good data governance.
In the light of these considerations, when dealing with externally sourced big data where governance and expertise are lacking, and meaning open to interpretation, the surprise should be that only 80% of the effort of data scientists goes into data wrangling!
Knowing the difference between data and information, and understanding the role and value of context and meaning, are only the first steps in addressing the problems of data wrangling and interpretation. But, for now, we can provide the starting point shown in the accompanying figure. The key points to note are:
1. Information is the physical representation of human communication and thinking, either directly expressed or gathered through the machines we design and build, varying in degree of structure depending on its source, and today primarily stored and manipulated in digital form
2. Data is a subset of information, being information that has been structured specifically for computational needs and consisting mainly of well-bounded numerical and textual elements
3. Metadata is also a subset of information, being the contextual matter that has been separated out from the computational data in point 2
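The relationship between the three points can be caricatured in a few lines of code; this is a deliberately naive sketch, not the full model, and the names and values are invented:

```python
# Data: bare instance values, meaningless in isolation (point 2).
data = [("EMEA", 1250.0), ("APAC", 980.5)]

# Metadata: the separated-out context that says what each position means (point 3).
metadata = ("region", "revenue")

# Information, approximately: data re-joined with its context (point 1).
information = [dict(zip(metadata, row)) for row in data]
print(information[0])  # {'region': 'EMEA', 'revenue': 1250.0}
```

Data wrangling, in these terms, is the struggle to perform that last step when the metadata is incomplete, wrong, or missing altogether.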
This is the beginning of a more complete model that will be developed in Part 2 of this series. (It can also be found in my book “Business unIntelligence” if you cannot wait!)