Data Governance Coming Out of the Dark (Part 1)
As I and others predicted a few years ago, data lakes are filling up with the detritus of data scientists’ exploratory dives. They may be in search of gold nuggets among the myriad data streaming in from the Internet, but they’re leaving behind a leaden legacy of dozens of raw data files, summaries, temporary files, reference tables, and more in an unmanaged HDFS environment. Much of the raw data is of dubious quality, riddled with content errors, missing values, and so on. With little incentive to delete these files (after all, storage is cheap and a data scientist’s time to manage it is expensive), they form an ever-thicker layer of sludge on the lake floor, impeding new exploration and turning the once pristine data lake into the dreaded data swamp. All for the lack of data governance.
It’s not just the number of files or the quality of the data therein; it’s the lack of meaningful, consistent, or comprehensive metadata about them. A small percentage of the raw data files may have reasonably complete descriptive information attached; many more have minimal metadata. As data scientists load, cleanse, and explore the raw data, they discern context and embed it in their tools and programs. Some of that context may be accessible or usable elsewhere but, in general, it exists in silos of metadata: occasionally in formal stores, more often in the processing code and scripts used to manipulate the data, and many times only in data scientists’ heads.
As storage and management costs have climbed, IT has panicked. They have been here before. Granted, proprietary storage was more expensive, but when the data volume curve starts to look like a hockey stick, its impact on a shrinking IT budget will become a problem sooner or later.
Data management tool vendors have also been here before. Since the earliest days of data warehousing, vendors of ETL (Extract, Transform and Load) tools and metadata catalogs have marketed their wares to IT shops concerned with governing and managing expanding data warehouse environments. Positioning metadata management within ETL tools was understandable, given that ETL is a major source of metadata, but also unfortunate, because it leaves large swathes of metadata, especially business-oriented metadata, on the sidelines. Most data warehouse metadata products today are still embedded in ETL suites from general data management vendors such as IBM and Oracle, and from specialized ETL and integration vendors such as Informatica and Talend. Many standalone metadata repositories have come and gone over the thirty-year history of data warehousing; only a few, such as MetaCenter from DAG and Adaptive Metadata Manager, survive.
Back in the data lake, the data scientist community has despaired of getting its analytic productivity above 20%. (As has been the case since time immemorial, data preparation always seems to take 80% of the available time!) Being closer to the business than to IT, data scientists see the problem not through the lens of data governance but through pragmatic eyes: how do we speed up data preparation? Can we share what we’ve discovered? In many ways, the answer is the same as in the data warehousing case: metadata catalogs. However, the approach differs somewhat. In the traditional approach, metadata is closely linked to the design phase that precedes database population; it is a modeling exercise. Data scientists instead start with the data itself, typically laid out as some form of table, and with self-service data preparation, because that is what they encounter first; metadata is deduced directly from the data.
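To make that contrast concrete, here is a minimal sketch, in Python with pandas, of what deducing metadata directly from the data looks like in practice: profiling a raw table and emitting a catalog-style entry. The file name and the fields in the output are illustrative assumptions, not any particular vendor’s format.

```python
# A minimal sketch of "metadata deduced from the data itself": profile a raw
# table and emit a catalog-style entry. The file name and output fields are
# illustrative, not any particular vendor's format.
import json

import pandas as pd


def profile_table(path: str) -> dict:
    """Derive basic technical metadata from a delimited file."""
    df = pd.read_csv(path)
    return {
        "source_file": path,
        "row_count": len(df),
        "columns": [
            {
                "name": col,
                "inferred_type": str(df[col].dtype),
                "null_fraction": round(float(df[col].isna().mean()), 3),
                "distinct_values": int(df[col].nunique()),
            }
            for col in df.columns
        ],
    }


if __name__ == "__main__":
    # Hypothetical raw extract sitting in the lake.
    print(json.dumps(profile_table("customer_extract.csv"), indent=2))
```

Everything in that entry is derived from the rows themselves rather than from an upstream design model, which is precisely the reversal of perspective described above.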
Vendors of data science products have therefore stepped up to the plate to create catalogs from this vantage point of existing data sets. The shift in perspective can be confusing: metadata creation and cataloging are mixed in with data preparation tools rather than offered as stand-alone solutions. The popular term “data wrangling” conveys this viewpoint. It is more than data preparation; it requires the data scientist to hammer the structure and relationships out of the data as part of preparing it for analysis. The spectrum of approaches runs from a primary focus on data preparation, in Trifacta, for example, to an increasing emphasis on governance from vendors such as Datawatch, Alation and Waterline Data.
With this new focus on self-service data preparation and metadata cataloging, one might assume we are at the start of a new age of good governance. Unfortunately, I suspect this is not the case. I see the same over-emphasis on technical metadata as in the traditional approaches. True business definitions are under-represented, and significant manual, repetitive work remains for the data wrangler.
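To illustrate where that manual work remains, consider the profile sketched earlier: it can tell us that a column is a string with a certain fraction of nulls, but not what the column means to the business. Attaching that meaning is still a hand-curated step, along the lines of this sketch, in which the glossary entries and column names are hypothetical.

```python
# Sketch of the step that automation does not remove: attaching hand-written
# business definitions to machine-inferred technical metadata.
# Glossary contents and column names are hypothetical.

business_glossary = {
    "cust_id": "Unique identifier assigned to a customer at account opening.",
    "churn_flag": "Whether the customer closed all accounts in the last 90 days.",
}


def annotate(profile: dict, glossary: dict) -> dict:
    """Merge business definitions into a profiled catalog entry, flagging gaps."""
    for column in profile["columns"]:
        column["business_definition"] = glossary.get(
            column["name"], "UNDEFINED: requires manual input"
        )
    return profile


# Continuing the earlier sketch:
# entry = annotate(profile_table("customer_extract.csv"), business_glossary)
```

Every “UNDEFINED” entry that comes back is a reminder that the business-oriented half of the catalog is still filled in by people, not inferred from the data.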
This leads me to the concept of context-setting information, the emerging role of machine learning in its creation, and the eventual achievement of full-fledged data governance: the topics of the second part of this two-part series.