Data Governance Coming Out of the Dark (Part 2)
Despite being a longtime proponent of including a formal metadata sub-project in every data warehouse development, I have seen many failures. That's not to say that no metadata was gathered. In fact, significant amounts of technical metadata describing data sources, targets, and transformations were captured during the ETL/population efforts and saved in some form of repository. But such data was IT-oriented and of limited use to business users. Efforts to define, gather, and store business-oriented metadata usually foundered: the business users weren't interested because they already understood the data being delivered and saw no benefit in helping users in other areas.
Data lake developments are even more metadata-averse. Data population is considerably less formal than in a warehouse and often proceeds in a fully ad hoc (usually called agile) manner. Speed and agility are great (and necessary), but the result is that very little formal metadata is harvested. At least in the early stages of the roll-out, the same data scientists are both the developers and the users of the data and models, so their immediate need for formal metadata is almost non-existent. And as for future needs? Well, as Scarlett O'Hara said: "Tomorrow is another day." But data governance cannot afford to wait. As lakes turn to swamps, we need it now.
In the world of data management, metadata has a bad name. The term has even been repurposed by spy agencies worldwide to mean the "envelope" information of messages and conversations that can be collected with legal impunity: "It's not personal information. It's just metadata. Suck it up!"
Since my 2013 book “Business unIntelligence”, I’ve been promoting the phrase context-setting information (CSI) as a replacement for and improvement on metadata. My reasoning is that the “stuff” we’re talking about is not data, but rather information, and its purpose is to set the context not just of information but of all related processes and people as well. It is from context that meaning emerges. We must assign meaning—our own personal meaning—to information as a precursor to deciding what to do about it and how to govern it, and we do so based on the context that surrounds it.
In the data lake, the context of information is far more complex and nuanced than in the case of a traditional data warehouse. Warehouse data comes from internal systems, designed or commissioned by our own people. It has been subject to internal data governance and management. It is, in large part, the legally binding and recognized record of our business. Data in the lake, on the other hand, comes from a wide variety of sources, of varying levels of trustworthiness, reliability, cleanliness, and so on. Can we trust the business information found on the Web? Can we rely on the proper calibration of sensors on the Internet of Things? Has information been compromised in transit from certain sources?
Answering these questions, and hundreds more, depends on context-setting information. Such CSI must come with the information and from multiple sources surrounding it. At the most basic level, we might expect field names for incoming information; in some cases, even these are missing. Beyond names, fields need definitions, descriptions, rules for use, constraints, allowed value ranges, and much more. We must add ownership (however defined), legal responsibilities, privacy and jurisdictional restrictions, and security assessments. The list is long, ranging from the deeply technical to the purely business.
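To make that list concrete, here is a minimal sketch in Python of what a CSI record for a single incoming field might look like. The attribute names and types are my own illustrative assumptions, not a standard or any product's schema; a real catalog would carry far more.

    # An illustrative CSI record for one incoming field. Every attribute
    # name here is an assumption made for the sake of the example.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class FieldCSI:
        name: str                                  # field name, if one arrives at all
        definition: Optional[str] = None           # business definition
        description: Optional[str] = None
        usage_rules: List[str] = field(default_factory=list)
        constraints: List[str] = field(default_factory=list)
        allowed_range: Optional[Tuple[float, float]] = None
        owner: Optional[str] = None                # ownership, however defined
        legal_responsibility: Optional[str] = None
        privacy_restrictions: List[str] = field(default_factory=list)
        jurisdictions: List[str] = field(default_factory=list)
        security_assessment: Optional[str] = None
        source: Optional[str] = None               # e.g. "Web", "IoT sensor"
        trust_level: Optional[str] = None          # how far we trust that source

    # Example: a sensor reading arriving with little more than a name.
    reading = FieldCSI(name="temp_c", source="IoT sensor", trust_level="uncalibrated")

The point of the sketch is its breadth: technical properties, business definitions, and legal and trust attributes all hang off the same field.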
By the time you’ve finished, the boundary between information and context-setting information is fuzzy in the extreme. One man’s CSI is another woman’s information. Where does the database end and the metadata catalog begin? In the data lake, where much of the information is file-based, the boundaries between information and CSI weave between and within files. The only viable conclusion at the conceptual and logical levels of architecture is to treat all information equally as both business information and context-setting information. Of course, at the physical level, all the usual trade-offs may lead to the use of different storage and processing technologies, depending on the primary information uses.
Even in the much more restricted world of traditional data and data warehousing, defining and managing metadata has proven a task of near-impossible dimensions. Storage and processing technology has, of course, been part of the problem, but the real issue is one of conceptualization. Just as a physical card index allows only one or two ways to find a book in a library, our mental structures struggle to conceive of the dimensionality of CSI in the data lake and its external sources.
The increasing scope and complexity of context-setting information suggests that some form of machine learning is required to properly address the needs of both IT and business. At Spark Summit 2016, eBay discussed its use of supervised machine learning, built on Spark with MLlib and GraphX, to discover metadata such as brand, style, and model from advertisement listings. Some readers might object that these items are not metadata, but the discussion above around information vs. CSI says otherwise. And eBay seems to take a similar view.
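To give a flavour of what such supervised metadata discovery involves, here is a minimal sketch using Spark's MLlib Pipeline API in Python. This is not eBay's implementation; the sample listings, column names, and the choice of a simple logistic regression classifier are all assumptions made for illustration.

    # A toy supervised pipeline that learns to predict a "brand" label
    # (a piece of CSI) from raw listing text. Illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, StringIndexer
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("csi-discovery").getOrCreate()

    # Assumed training data: listing text with a known brand label.
    listings = spark.createDataFrame(
        [("apple iphone 6s 64gb space grey", "Apple"),
         ("samsung galaxy s7 edge unlocked", "Samsung"),
         ("apple macbook pro 13 inch retina", "Apple"),
         ("samsung 55 inch curved led tv", "Samsung")],
        ["listing_text", "brand"],
    )

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="listing_text", outputCol="tokens"),  # split text into words
        HashingTF(inputCol="tokens", outputCol="features"),      # hash words to a feature vector
        StringIndexer(inputCol="brand", outputCol="label"),      # encode the target label
        LogisticRegression(maxIter=10),                          # simple stand-in classifier
    ])

    model = pipeline.fit(listings)
    model.transform(listings).select("listing_text", "prediction").show()

In practice, of course, the training set is vast, the feature engineering far richer, and graph techniques (GraphX, in eBay's case) relate discovered attributes to one another; the sketch shows only the shape of the pipeline.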
More interesting, perhaps, is eBay's recognition that the size of its data and its need for rapid discovery of CSI mean that only an automated, machine learning solution is viable. Data governance faces similar demands as data lakes grow. Products such as Alation and ClearStory are beginning to apply machine learning to automate the collection and generation of CSI in metadata catalogs. But it's still early days.
With machine learning in all its guises—deep learning, cognitive computing, artificial intelligence, and more—seemingly set for a breakout year in 2017, we may expect an increasing range of applications. Here’s hoping vendors will focus on real enterprise needs like data governance, rather than cool but useless apps for predicting when your kitchen bin will need emptying.