Wrangling Metadata: Making It the Focus of Data Management
ABSTRACT: We must treat metadata like a fully vested member of the data landscape. This post explores how to start wrangling diverse and distributed metadata.
This is the final installment in a series of articles about metadata. When I started this journey, I didn’t have a particular destination in mind. I just knew my understanding of metadata was incomplete given the radical changes in data over the last fifteen years. It started with an observation. We’ve used metadata as a tool to manage data for years. But we haven’t managed metadata itself with the same rigor. With the ascendance of active metadata that powers crucial functions such as managing data access and data quality, metadata is now mission-critical.
We must treat metadata like a fully vested member of the enterprise data landscape. A unifying taxonomy is a good place to start bringing metadata out of the shadows and making it an object of data management rather than just a tool. This article expands on the taxonomy introduced in a previous post, examines an example of how metadata relates to intrinsic data, and explores how to start wrangling diverse and distributed metadata.
Before we dig in, let’s recap some key definitions:
The data that metadata describes is intrinsic data. It’s the data that organizations create and consume in pursuit of their mission.
Intrinsic data describes specific instances of things and events such as a customer, a product, or a sale.
Metadata describes classes of things and events. It keeps track of what we want to know about all instances of customers, products, and sales.
The unified taxonomy pictured in Figure 1 illustrates the relationship between intrinsic data and metadata. For each orange box depicting a type of intrinsic data, you can picture one or more blue metadata boxes that describe it. This taxonomy also shows that the imperatives of data management apply to all data.
Figure 1. A Unified Data Taxonomy for Data Assets Under Management
Now let’s look at the relationship between intrinsic data and metadata in more detail. We’ll use the example of how the XYZ Insurance Company manages its customer data.
XYZ provides personal auto and home insurance to over 2 million customers. The intrinsic data about XYZ’s customers tells us specific things about each of those individuals. The metadata about the customer domain tells us what information we need about all XYZ customers (see Figure 2).
Figure 2. Metadata: Information Needed About All XYZ Customers
Technical metadata includes the characteristics of intrinsic data that systems need to work with it, such as format, type, length, and location. It also includes technical documentation such as data models and system designs.
Models. A model is a representation of real-world things or events and how they interact with each other. XYZ maintains an enterprise data model that depicts the relationship between customers and other domains such as products and sales. It’s a high-level conceptual model, so it does not map all the ways that the company captures and stores customer data.
Schema. Schemas define the structure of specific data sets, detailing the attributes they capture. XYZ’s customer schemas include many attributes, such as name, mailing address, email address, birthday, occupation, and income range. However, since XYZ captures customer data in a variety of ways, they have many customer schemas. Their core insurance processing system (IPS) uses a relational database where the customer schema is predefined, a.k.a. “schema on write”. Their mobile app captures customer data in JSON format, to which they can apply an appropriate structure when they consume it for a particular purpose, a.k.a. “schema on read”. Their marketing team pulls CSV data from external sites such as Twitter, Instagram, and HubSpot.
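To make the schema-on-read idea concrete, here is a minimal Python sketch. The sample JSON record, the field names, the `MARKETING_SCHEMA` mapping, and the `apply_schema` helper are all hypothetical, invented for illustration rather than taken from XYZ’s systems. The point is only that the consumer, not the producer, decides which attributes matter and how to shape them.

```python
import json

# Hypothetical raw record as a mobile app might emit it (illustrative only).
RAW_EVENT = (
    '{"name": "Pat Lee", "email": "Pat@Example.com", '
    '"birthday": "1985-04-12", "app_version": "3.2"}'
)

# Hypothetical schema chosen at read time: attribute name -> converter.
MARKETING_SCHEMA = {
    "name": str,
    "email": str.lower,
    "income_range": str,
}

def apply_schema(raw_json, schema):
    """Project a raw JSON record onto a schema chosen by the consumer.

    Attributes missing from the record come back as None; attributes
    the schema does not mention are ignored.
    """
    record = json.loads(raw_json)
    shaped = {}
    for attribute, convert in schema.items():
        value = record.get(attribute)
        shaped[attribute] = convert(value) if value is not None else None
    return shaped

customer = apply_schema(RAW_EVENT, MARKETING_SCHEMA)
print(customer)
```

A different consumer could pass a different schema over the same raw record, which is exactly why several customer schemas can coexist for one data source.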
Lineage. Lineage defines how data moves and changes throughout its journey from a source to a target. XYZ, like many companies, has very limited lineage metadata. What they do have appears in a couple of different line-of-business reporting systems that show the warehouse sources or the embedded queries in particular dashboards, i.e., only part of the lineage story.
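One common way to think about lineage metadata is as a graph of targets and the sources they derive from. The sketch below assumes that representation; the object names in `LINEAGE` and the `upstream` helper are hypothetical and do not describe XYZ’s actual systems.

```python
# Hypothetical lineage edges: each target maps to the sources it derives from.
LINEAGE = {
    "customer_dashboard": ["warehouse.customer_summary"],
    "warehouse.customer_summary": ["ips.customer", "mobile_app.customer_events"],
}

def upstream(target, lineage):
    """Return every upstream source of a target, nearest first (breadth-first)."""
    ordered = []
    queue = list(lineage.get(target, []))
    while queue:
        node = queue.pop(0)
        if node not in ordered:  # skip repeats so cycles cannot loop forever
            ordered.append(node)
            queue.extend(lineage.get(node, []))
    return ordered

sources = upstream("customer_dashboard", LINEAGE)
print(sources)
```

If a reporting system records only the first hop (dashboard to warehouse table), a walk like this stops early, which is the "only part of the lineage story" problem in miniature.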
Business metadata uses business language to provide context for the data that appears in applications, reports, and dashboards. It includes elements such as terms, definitions, and classifications for data objects. Data catalogs are now the go-to solution for managing business metadata. But other applications, such as data access management and observability tools, also create and use business metadata.
Terms and Definitions. Terms and definitions provide business context for technical metadata. They answer questions such as “What is a customer?” or “What do we call the attribute that describes the job a person has?” XYZ implemented two data catalogs. A regional office saw the need years ago and stood up a cataloging tool that they’re now fully dependent on. The IT department followed suit a couple of years later but chose a different product that the regional office didn’t like. Now there are two sets of business metadata that have drifted apart.
Classifications. Classifications, such as PII, PHI, PCI, Privileged, and Confidential, are critical for managing data and complying with internal policies and government regulations. For example, XYZ classified customer email as PII, which applies to that attribute no matter where it’s used or stored. Unfortunately, XYZ classifies data differently in its two data catalogs. In one, they do it manually; in the other, an AI-driven process does the work. The organic and artificial intelligences sometimes make different choices.
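A simple way to surface disagreements like these is to compare the classification each catalog assigns to the same attribute. The sketch below assumes each catalog can export an attribute-to-label mapping; the two dictionaries and the `classification_conflicts` helper are hypothetical examples, not real catalog exports or a real catalog API.

```python
# Hypothetical classification exports from the two catalogs: attribute -> label.
regional_catalog = {
    "email": "PII",
    "income_range": "Confidential",
    "occupation": "Confidential",
}
it_catalog = {
    "email": "PII",
    "income_range": "PII",
    "mailing_address": "PII",
}

def classification_conflicts(a, b):
    """Return attributes whose classifications differ between two catalogs.

    An attribute missing from one catalog is reported with None on that side.
    """
    conflicts = {}
    for attribute in sorted(set(a) | set(b)):
        left, right = a.get(attribute), b.get(attribute)
        if left != right:
            conflicts[attribute] = (left, right)
    return conflicts

conflicts = classification_conflicts(regional_catalog, it_catalog)
for attribute, (regional, it) in conflicts.items():
    print(f"{attribute}: regional={regional}, it={it}")
```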
Rules. Data rules define such things as standards for data quality, ownership of domains, and access privileges. Most data catalogs can passively document data rules. But other classes of tools are more prevalent when it comes to enforcing rules. For example, data access management (DAM) solutions enforce granular rules about who can use what data and how. Quality standards are actively monitored by data observability systems that can take action when data fails to meet them. In addition to its two data catalogs, XYZ has both a DAM tool and an observability tool. They each integrate with one of the catalogs but not the other.
Social metadata captures the knowledge enrichment about data that comes from data consumers. It reflects their subjective experience with various data sets through tags, ratings, annotations, and comments.
Tags. Tags are words or phrases that data consumers associate with data objects that indicate the business subjects for which they found the data useful. Tags are typically stored and managed in data catalogs that use them as part of their data discovery functions. Since XYZ has two data catalogs, they have two curated lists of tags and two sets of object-to-tag associations.
Ratings. Most data catalogs also enable consumers to rate data in terms of trustworthiness. Both of XYZ’s data catalogs capture user ratings. However, one uses five stars while the other uses three stars. Neither rating system explains what the different star levels mean.
Annotations/Comments. Catalogs also enable consumers to collaborate and share insights about data through annotations and comments. Again, XYZ’s two data catalogs maintain separate bodies of annotations and comments.
Wrangling XYZ’s Unmanaged Metadata
Figure 3 below visualizes the state of XYZ’s unmanaged metadata landscape described above. If it seems complicated, consider that we left out components such as data lakes, data warehouses, and data pipelines to make the example simpler.
Figure 3. XYZ’s Intrinsic and Metadata Landscape
Several things stand out in this diagram:
The IT-managed data catalog has customer schema info for all three data sources; the regional office catalog does not.
If the two catalogs don’t reflect the same customer schema information, then the other metadata domains, such as terms and definitions, classifications, tags, and comments, will be different as well.
Manual syncs between the catalogs and the active metadata tools are likely causing drift between the two sets of access rules and data quality rules.
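The first observation above, that one catalog covers all three customer data sources and the other does not, can be checked mechanically once each catalog’s coverage is inventoried. The sketch below is illustrative only: the source names are hypothetical, and which sources the regional catalog lacks is an assumption, since the example doesn’t say.

```python
# Hypothetical names for the three customer data sources in the example.
SOURCES = {"ips_relational", "mobile_app_json", "marketing_csv"}

# Assumed inventory of which sources each catalog holds schema metadata for.
# Which sources the regional catalog lacks is not stated; this is a guess.
CATALOG_COVERAGE = {
    "it_catalog": {"ips_relational", "mobile_app_json", "marketing_csv"},
    "regional_catalog": {"ips_relational"},
}

def coverage_gaps(sources, coverage):
    """Map each catalog to the sorted list of sources it has no schema for."""
    gaps = {}
    for catalog, covered in coverage.items():
        missing = sorted(sources - covered)
        if missing:
            gaps[catalog] = missing
    return gaps

gaps = coverage_gaps(SOURCES, CATALOG_COVERAGE)
print(gaps)
```

Catalogs with full coverage are left out of the result, so an empty dictionary means no gaps were found.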
XYZ can approach these challenges in a couple of ways: standardize on one data catalog or improve automated synchronization between the two. For years, data teams had only one best-practice option: standardize. However, the world has changed to favor diverse and distributed data. It follows that the same preference can apply to metadata.
The company’s business and data leaders must evaluate whether there were good reasons at the time for not standardizing on one data catalog. Are those reasons still valid? Does the company have the technical prowess and capacity to automate synchronization processes for metadata between all the tools that use it?
Toward Metadata Recovery
The first step toward effective metadata management is understanding the nature and scope of the challenges, i.e., a current state assessment. Our fictitious case study of XYZ Insurance Company demonstrated a viable approach. Starting with a key intrinsic data domain, such as customer:
Create a diagram of how the domain’s intrinsic data is captured and stored.
Add related metadata domains to the diagram, including all the tools that store and use metadata, such as catalogs, observability tools, and access management tools.
Analyze the metadata sets according to the same data quality criteria that apply to intrinsic data: completeness, consistency, accuracy, uniqueness, and timeliness.
Illustrate metadata quality problems in the diagram.
Identify root causes for quality problems, focusing on what they have in common.
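The analysis step in the list above can be partly automated. The following Python sketch checks a handful of catalog entries for completeness, uniqueness, consistency, and timeliness. The entry structure, the `REQUIRED` fields, the one-year staleness threshold, and the `assess` helper are all assumptions made for illustration. Accuracy is left out because it requires comparing metadata against a trusted reference, which a script like this can’t supply.

```python
from datetime import date

# Hypothetical catalog entries for the customer domain (illustrative only).
ENTRIES = [
    {"attribute": "email", "definition": "Primary contact email",
     "classification": "PII", "updated": date(2023, 5, 1)},
    {"attribute": "email", "definition": "Customer e-mail address",
     "classification": "Confidential", "updated": date(2021, 1, 15)},
    {"attribute": "occupation", "definition": "",
     "classification": None, "updated": date(2023, 3, 9)},
]

# Assumed set of fields every entry should populate.
REQUIRED = ("definition", "classification")

def assess(entries, as_of, max_age_days=365):
    """Flag completeness, uniqueness, consistency, and timeliness problems."""
    findings = []
    first_seen = {}  # attribute name -> index of its first entry
    for index, entry in enumerate(entries):
        name = entry["attribute"]
        # Completeness: every required metadata field has a value.
        for field in REQUIRED:
            if not entry.get(field):
                findings.append((name, "completeness", f"missing {field}"))
        # Timeliness: the entry was updated within the allowed window.
        if (as_of - entry["updated"]).days > max_age_days:
            findings.append((name, "timeliness", "stale entry"))
        # Uniqueness and consistency: one entry per attribute, one label.
        if name in first_seen:
            findings.append((name, "uniqueness", "duplicate entry"))
            earlier = entries[first_seen[name]]
            if earlier["classification"] != entry["classification"]:
                findings.append((name, "consistency", "conflicting classification"))
        else:
            first_seen[name] = index
    return findings

findings = assess(ENTRIES, as_of=date(2023, 6, 1))
for attribute, criterion, detail in findings:
    print(f"{attribute}: {criterion} ({detail})")
```

Grouping the findings by criterion or by tool is one way to start looking for the common root causes that the last step asks for.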
The current state assessment described above applies common data analysis techniques to metadata and should be repeated for other key intrinsic data domains. Use other tried-and-true techniques in the data management toolbox to treat metadata like the mission-critical asset it’s become.