Metadata Is Data, So Manage It Like Data
ABSTRACT: Pervasive use of metadata to solve today’s data management problems means that metadata is itself a valuable data asset that we must proactively manage.
Managing data is a critical survival skill for any organization. Companies are investing in new data architectures and solutions—such as data fabric, data access governance, and data observability—to keep pace with expanding business appetite for data. But the key to managing data at scale is metadata. Metadata makes it possible to deliver data faster, in more forms, for more uses by enabling automation of important management steps such as documenting, classifying, and certifying data.
However, the pervasive use of metadata in these new approaches creates another data management challenge that is not yet on our radar. In the rush to implement solutions to manage data, we’re making metadata duplicative and inconsistent. In other words, we’re creating in metadata the very problems we’re using metadata to fix. It’s time to recognize that metadata is data and is therefore itself a valuable asset that we must proactively manage.
In this article, we’ll explore what metadata is and how it’s used in the modern data stack. We’ll look at some of today’s solutions that rely heavily on metadata, and why it’s critical to manage metadata as carefully as we’ve learned to manage data.
What is Metadata?
Put simply, metadata is data that describes data. There are four broad categories of metadata (see Figure 1), each of which has certain uses:
Technical metadata. If you want to know the structure of a table or find objects that have attribute names including the string “cust”, or determine whether those attributes contain numbers, letters or both, you need technical metadata. Technical metadata includes the characteristics of data that systems need to work with it, such as format, type, length, and location. It also includes technical documentation such as data models and system designs.
Business metadata. If you want to know what “margin ratio” means and how it’s calculated, or whether a delivery address is considered PII, you need business metadata. Business metadata uses business language to provide context for the data that appears in applications, reports, and dashboards. It includes elements such as terms, definitions, classifications, and retention rules for different types of data. It also includes assignment of people to roles such as data owners and stewards.
Operational metadata. If you want to know what data sets are used together most often, or how long it took to complete data pipeline tasks, you need operational metadata. Operational metadata provides information on how data is used and what happens when it’s used. It includes information from a variety of sources such as execution logs, rule engines, error logs, and audit registers, with dates and times of events. It also includes data quality levels and operational monitoring of data quality issues that exceed defined levels.
Social metadata. Collaboration enables organizations to derive more value from their data. For example, many business intelligence and data catalog applications provide ways for users to ask questions, share tips, and make recommendations about data. Social metadata captures the enrichment that comes from collaboration. It includes elements such as tags, ratings, annotations, and questions and comments.
Figure 1. Types of Metadata
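As a rough illustration of the technical metadata example above (finding attributes whose names include the string “cust”), the sketch below queries a database's own schema catalog. It assumes SQLite purely for convenience; the tables `orders` and `customers` and the helper `find_columns` are hypothetical and do not come from any particular metadata tool.

```python
import sqlite3


def find_columns(conn, needle):
    """Return (table, column, declared_type) for columns whose name contains needle."""
    matches = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        rows = conn.execute(f'PRAGMA table_info("{table}")')
        for _cid, name, col_type, *_rest in rows:
            if needle.lower() in name.lower():
                matches.append((table, name, col_type))
    return matches


# Hypothetical sample schema, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, total REAL)")
conn.execute("CREATE TABLE customers (customer_key INTEGER, cust_name TEXT)")

matches = find_columns(conn, "cust")
for table, column, col_type in matches:
    print(f"{table}.{column}: {col_type}")
```

Nothing here reads the business data itself. The answer comes entirely from the structure, name, and type information that the system keeps about its tables, which is what the article means by technical metadata.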
Metadata in the Modern Data Environment
Metadata is the foundation of the modern data environment. Innovative uses of metadata in new solutions enable organizations to tackle complex data management problems. For example, without metadata, the following three new approaches to data management challenges would not work:
Data Fabric. Data fabric is an architectural approach that uses metadata, machine learning, and automation to weave together data of any format from any location. It automates functions such as onboarding new data sources and combining data from different sources. Data fabric relies on technical and operational metadata to drive the automation that makes managing data at scale feasible. Business and social metadata help make data easy to find and use.
Data Access Governance. Data access governance is an approach that seeks to democratize data consumption while also locking data down to prevent breaches and meet ever-changing compliance requirements. Walking this razor’s edge requires granular policy definitions that determine who can access specific data elements under certain conditions. Platforms that enable management and application of data access policies across the enterprise data landscape create and consume technical, operational, and business metadata.
Data Observability. Data observability refers to real-time monitoring and optimization of data processes for quality, performance, and compliance. It’s concerned with identifying and addressing bottlenecks in data pipelines, monitoring who is using data and for what purpose, and tracking compliance with policies and regulations. Data observability uses technical and operational metadata to manage the health and security of the data estate.
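To make the data access governance description more concrete, here is a minimal sketch of a policy check driven by business metadata. It assumes each column carries a classification label and that a policy maps classifications to permitted roles. The names `ColumnMetadata`, `ALLOWED_ROLES`, and `readable_columns`, along with the roles and labels, are hypothetical and not taken from any real platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    name: str
    classification: str  # business metadata, e.g. "PII" or "public"


# Hypothetical policy: the roles allowed to read each classification.
ALLOWED_ROLES = {
    "public": {"analyst", "steward"},
    "PII": {"steward"},
}


def readable_columns(columns, role):
    """Return the names of columns the role may read.

    Columns with an unknown classification are denied by default.
    """
    return [
        column.name
        for column in columns
        if role in ALLOWED_ROLES.get(column.classification, set())
    ]


columns = [
    ColumnMetadata("order_total", "public"),
    ColumnMetadata("delivery_address", "PII"),
]

analyst_view = readable_columns(columns, "analyst")
steward_view = readable_columns(columns, "steward")
print(analyst_view)
print(steward_view)
```

The policy never names individual columns. It refers only to classifications, so reclassifying a column in the metadata changes who can read it without touching the policy, which is why such platforms depend on accurate business metadata.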
Beware Metadata Silos
Tools in these categories each maintain one or more metadata stores of their own. So do data catalogs, BI tools, data quality tools, and data integration tools. Because most organizations implement many solutions that rely on metadata, leaving metadata unmanaged risks creating metadata silos (see Figure 2) and a metadata landscape that looks just as scary as the data it’s being used to help tame.
Figure 2. Metadata Silos
Siloed metadata is just as bad as siloed data because it creates many of the same challenges. For example, if business metadata is stored in multiple places, you must ensure that it’s consistent—i.e., that it provides one version of the “metatruth.” Completeness is another challenge. None of the toolset-related stores will have a complete set of metadata. You need a composite view of metadata domains from different sources.
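The consistency and completeness challenges described above can be sketched as two small checks over glossaries held in different tools. This is an illustrative sketch only: the store names `catalog` and `bi_tool`, the sample definitions, and both function names are made up for the example.

```python
from collections import defaultdict


def find_inconsistent_terms(stores):
    """Return terms that have more than one distinct definition across stores.

    stores maps store name -> {term: definition}. The result maps each
    conflicting term -> {normalized definition: [store names]}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for store_name, glossary in stores.items():
        for term, definition in glossary.items():
            # Ignore differences in case and whitespace only.
            normalized = " ".join(definition.lower().split())
            seen[term][normalized].append(store_name)
    return {term: dict(defs) for term, defs in seen.items() if len(defs) > 1}


def missing_terms(stores):
    """Return, per store, the terms defined elsewhere but absent from that store."""
    all_terms = set()
    for glossary in stores.values():
        all_terms.update(glossary)
    return {name: sorted(all_terms - set(glossary)) for name, glossary in stores.items()}


# Made-up sample glossaries from two hypothetical tools.
stores = {
    "catalog": {
        "margin ratio": "Gross margin divided by net revenue",
        "churn": "Customers lost in a period",
    },
    "bi_tool": {
        "margin ratio": "Net income divided by net revenue",
    },
}

conflicts = find_inconsistent_terms(stores)
gaps = missing_terms(stores)
print(conflicts)
print(gaps)
```

In this sample, "margin ratio" is flagged because the two stores disagree on its definition, and the BI tool is shown to be missing "churn". A real comparison would need more careful matching than exact text equality, but the shape of the problem is the same.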
Faced with the potential chaos of uncoordinated creation and use of metadata, the natural impulse might be to bring it all together into one centralized metadata store. But one thing we’ve learned about data in the last ten years is that it needs to be everywhere. History tells us that centralized, monolithic approaches to data are neither flexible nor scalable enough for present-day uses. The same is true for metadata. So where do you begin?
A Change of Mindset
The first step in facing these challenges is a change of mindset. Metadata is data, and you must manage it like data. It makes little sense to invest in data solutions only to hobble them with duplicative and inconsistent metadata. Here are three recommendations to help make that mindset change happen:
Create an enterprise metadata model. An enterprise metadata model is just as important as an enterprise data model. Invest the time to create one. The model should be comprehensive enough to map the four types of metadata to the uses your organization has for them. A comprehensive metadata model provides the framework for coordinating distributed and disparate metadata stores without requiring centralization.
Integrate metadata. Once you have a model, plan how to integrate metadata. Determine the metadata source of record for each domain. For example, if you already have a well-maintained data catalog, that may serve as a source of record for business metadata. Determine what metadata needs to be synchronized or referenced between stores. You may already be synchronizing business semantic data between a data catalog and BI tools like Tableau.
Proactively govern your metadata. Create standards and policies for how metadata will be created, integrated, and used. Include metadata in the design review steps of your development process. Define quality standards for metadata and methods for identifying and mitigating quality issues.
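The integration and governance recommendations above could be prototyped along the following lines. This is a minimal sketch under stated assumptions: every store is modeled as a plain dictionary, `SOURCE_OF_RECORD` and `REQUIRED_FIELDS` are hypothetical registries, and synchronization simply overwrites a domain wholesale, which a real integration would refine.

```python
# Hypothetical registry: which store is the source of record for each domain.
SOURCE_OF_RECORD = {"business": "data_catalog"}

# Hypothetical quality standard: fields every entry in a domain must fill in.
REQUIRED_FIELDS = {"business": ("definition", "owner")}


def quality_issues(domain, entries):
    """Return (entry, field) pairs where a required field is missing or empty."""
    issues = []
    for key, record in entries.items():
        for field in REQUIRED_FIELDS.get(domain, ()):
            if not record.get(field):
                issues.append((key, field))
    return issues


def sync_from_source(domain, stores):
    """Overwrite a domain in every other store with the source-of-record copy.

    Returns the names of the stores that were changed.
    """
    source = SOURCE_OF_RECORD[domain]
    authoritative = stores[source][domain]
    updated = []
    for name, store in stores.items():
        if name != source and domain in store and store[domain] != authoritative:
            # Copy each record so stores do not share mutable objects.
            store[domain] = {key: dict(rec) for key, rec in authoritative.items()}
            updated.append(name)
    return updated


# Made-up sample stores for the demonstration.
stores = {
    "data_catalog": {
        "business": {
            "margin ratio": {"definition": "Sample definition", "owner": ""},
        }
    },
    "bi_tool": {
        "business": {
            "margin ratio": {"definition": "Older wording", "owner": "finance"},
        }
    },
}

issues = quality_issues("business", stores["data_catalog"]["business"])
updated = sync_from_source("business", stores)
print(issues)
print(updated)
```

Running the quality check before synchronizing matters in this sample: the source of record has no owner for "margin ratio", so a blind sync would propagate that gap to the BI tool. That is the kind of issue the governance standards are meant to catch.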
Tim Berners-Lee famously said, “Data is a precious thing and will last longer than the systems themselves.” That recognition is now backed by action and resources, yielding capabilities that were the realm of fantasy twenty years ago. Getting to this point has been a long road with many hard lessons learned about managing data. Let’s not repeat the journey with metadata and force ourselves to relearn those lessons.