Data Wrangling, Information Juggling and Contextual Meaning, Part 3


The pragmatic and theoretical considerations in part 1 and part 2 of this series provide the basis for evaluating any data wrangling or similar tool. I’ll look at more than just tools that self-identify as data wrangling products. In my view, the term data wrangling, while aimed at big data practitioners and data scientists, points to only one aspect of a much broader need. We might call it, somewhat boringly, data preparation or, perhaps more elegantly, data contextualization or information distillation. Whatever the name, the underlying need is for the creation of a valid, reusable and bidirectional bridge between business meaning and the data stored in computers. Built correctly, such a bridge is a common resource for data scientists and business analysts, data modelers and BI users, application developers and business managers. Wherever context-setting information is first created, it must be available easily and broadly throughout the organization.

The data contextualization journey begins in two distinctly different environments within a business. The first, traditional environment deals with data generated internally as part of formal business processes, leading to the design and development of business intelligence (BI) and data warehouse systems. Data warehouse automation (DWA) tools, such as WhereScape and Kalido, play here. This starting point leverages relatively well-known data sources for BI systems with well-understood (in a business sense) needs. As a result, DWA tools focus on getting the business and IT people to work together in a rapid, agile and productive process that delivers quick-win applications and an underlying, reusable store of context-setting information. What is less obvious here is that the IT data experts still have to go off and “wrangle” the required data from sometimes complex or dirty sources. DWA emphasizes the automation; in the background, however, IT undertakes the same type of work that is traditionally carried out in data modeling and extract, transform and load (ETL) tools. In terms of the modern meaning model (m3), described in part 2, the focus of DWA tools is in the physical locus and the transition from information to data, facilitated by a development process that eases the sharing of explicit and tacit business knowledge between business and IT.

The second, emerging environment begins with externally sourced data, typically brought into a Hadoop-based system, as the basis for discovery and exploration by data scientists. In this case, the data scientist is expected to have both the business knowledge of need and context and the more technical understanding of the structure and content of the data. This is the target market of data wrangling (and similar) tools, and it is being rapidly populated by a number of young and start-up companies, such as Trifacta and Alation. The focus for such tools is on the problem of how somebody faced with a large, amorphous set (or sets) of poorly documented, dirty data can rapidly make sense of it. These products typically address the problem by automating the analysis and initial cleansing of the data for familiar field structures (dates and times, Social Security numbers), common content (state abbreviations), related fields within or across sets (addresses, repeated IDs), filtering, dealing with missing values, and so on. The initial analysis is presented to the data scientist for further cleansing, which may include overriding or expanding some of the tool’s output, annotation with usage rules or policies, etc. The intelligence and friendliness of the user interface are vital considerations. Because of the ad hoc nature of this work, the ability of the tool to access other relevant analyses previously performed by other users is also vital. This represents a visit to the interpersonal locus of m3, through the use of collaborative functionality. In essence, these data wrangling tools are gathering and organizing context-setting information so that people can find, understand and use the data.
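To make the automated first pass concrete, here is a minimal sketch in Python of the kind of column profiling described above: pattern-based type guessing (dates, social security numbers, state abbreviations) plus a missing-value count, whose output a data scientist would then confirm or override. The pattern set, function names and thresholds are my own illustrative assumptions, not the workings of any particular product.

```python
import re
from collections import Counter

# Illustrative patterns a wrangling tool might use to classify raw columns.
# These are assumptions for the sketch, not taken from any vendor's tool.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "us_state": re.compile(r"^[A-Z]{2}$"),
}

# Abbreviated sample of state codes; a real tool would carry the full list.
US_STATES = {"AL", "AK", "AZ", "CA", "NY", "TX", "WA"}

def profile_column(values):
    """Guess a column's type and count missing values: a first-pass
    analysis meant to be reviewed, overridden or expanded by a human."""
    present = [v for v in values if v not in ("", None, "N/A")]
    missing = len(values) - len(present)
    votes = Counter()
    for v in present:
        for name, pat in PATTERNS.items():
            if pat.match(v):
                # A two-letter uppercase token counts as a state only
                # if it appears in the known abbreviation list.
                if name == "us_state" and v not in US_STATES:
                    continue
                votes[name] += 1
    guess, count = votes.most_common(1)[0] if votes else ("unknown", 0)
    confidence = count / len(present) if present else 0.0
    return {"type": guess, "confidence": confidence, "missing": missing}

# Example: a dirty column mixing ISO dates with gaps.
col = ["2023-01-15", "2023-02-01", "", "2023-03-09", "N/A"]
print(profile_column(col))  # {'type': 'date', 'confidence': 1.0, 'missing': 2}
```

The point of the sketch is the division of labor the article describes: the machine proposes a typed, annotated view of messy input, and the person disposes, correcting the guesses and layering on usage rules before the data is shared more widely.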

While this may seem very different from what DWA tools offer, I believe this is more a matter of market maturity than a real divergence. Data science, however sexy, is actually only the first step of analytics and BI. Interesting discoveries made by data scientists often need to be put into formal production. Analysis of incoming external data on its own is seldom sufficient; the combination of data from both internal and external sources delivers far more value than external data alone. Both factors point to a longer-term need for integration of these two fields. Furthermore, we can see that data management and governance of externally sourced big data become increasingly vital, bridging from the discovery of context-setting information to tracking data lineage and usage, as seen in Waterline Data Science, for example. In addition, it’s clear that the facilities offered by data wrangling tools would also be useful in data warehouse automation, even for (supposedly) well-structured and well-understood internal production data.

Data wrangling thus points to a new data preparation need that has emerged as the volume and variety of data being ingested from external sources increases. This is a need that will only grow more rapidly as the Internet of Things is rolled out. This growth will, of necessity, drive further innovation and automation in the discovery of context for such data. But, good data management and governance will dictate that context-setting information discovered on ingestion must be brought into the production environment along with (a subset of) the data itself. The result is likely to be a gradual convergence of functionality between today’s data wrangling and data warehouse automation toolsets: more analysis of structure and data content in DWA tools and more formal development process support in data wrangling tools.

And, if I dare suggest the unthinkable, this would enable organizational shifts that bring data scientists (back, in some cases) into the fold of BI and data warehousing groups. This, in my view, would be good for data scientists, allowing them to take production needs into account. And it would be good for BI and data warehousing groups to gain some agile and innovative new blood.

Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988....
