Fixing Metadata’s Bad Definition

ABSTRACT: “Metadata is data about data” is a bad definition. It’s vague and recursive. How can we manage metadata if we can’t even define it clearly?

I’ve always been bothered by the expression “metadata is data about data”. That’s a bad definition—vague and recursive. It’s like saying climate is the weather of weather. But if metadata is not data about data, then what is it? I pondered this while writing my recent post Metadata is Data, So Manage it Like Data. How can we manage metadata if we can’t even define it clearly? 


A bad definition has practical implications. It makes misunderstandings much more likely, which can infect important processes such as data governance and data modeling. Thinking about this became an annoying itch that I couldn’t scratch. What follows is my thought process working toward a better understanding of metadata and its role in today’s data landscape.

The problem starts with language. Our lexicon hasn’t kept up with modern data’s complexity and nuance. There are three main issues with our current discourse about metadata:

  • Vague language. We talk about data in terms of “data” or “metadata”. But one category encompasses the other, which makes it very difficult to differentiate between them. These broad, self-referencing terms invite different interpretations from different people. 

  • A gap in data taxonomy. We don’t have a name for the category of data that metadata describes, which creates a gap at the top of our data taxonomy. We need to fill it with a name for the data that metadata refers to. 

  • Metadata is contextual. The same data set can be both metadata and not metadata depending on the context. So we need to treat metadata as a role that data can play rather than a fixed category. 

Vague Language Leads to Confusion

The vague language we use about data and metadata doesn’t communicate the difference between the two, so people interpret metadata differently. For example, the National Information Standards Organization (NISO) says that the title, author, subject, genre, and publication date of a book are metadata. But these characteristics aren’t metadata; they’re attributes of a book. If these are considered metadata, then the characteristics of a customer, such as name, age, and income, are metadata too. But we don’t think of these as metadata; they are attributes of a customer.

To address this confusion, we must agree on what makes metadata different from the data it describes. But before we can answer that question, we have to figure out what to call the data that metadata refers to; otherwise the conversation will get very convoluted. 

The Taxonomy Gap

The term “data” is the root node of the data taxonomy. It encompasses “metadata” as well as what metadata refers to—the category we presently don’t have a name for. This is the gap in the taxonomy (see Figure 1). 

Figure 1. What Do We Call the Data That Metadata Describes?

We need a common understanding of what goes in the yellow box above to differentiate it from metadata. To fill the gap, I’ll propose the term “intrinsic data”. 

Intrinsic Data. Intrinsic data is generated by organizations pursuing their mission. It is data that’s intrinsic to their purpose, whether they sell shoes, provide health care, or operate a foodbank. We can break intrinsic data into two subcategories, business data and systems data (see Figure 2).

Figure 2. Metadata Describes Intrinsic Data

Business Data. Business data is generated when organizations deliver their products or services. It includes familiar domains such as sales, customers, products, services, suppliers, and logistics. It also includes data that comes from the administrative functions needed to manage an enterprise, such as finance, human resources, and legal. 

Systems Data. The systems that IT manages to run the business, such as networks, storage systems, servers, operating systems, databases, and business applications, also produce data. They generate usage events, job schedules, run times, errors, configurations, and data about many other aspects of their operations. 

If metadata describes intrinsic data, what, then, is the difference between metadata and intrinsic data?

Instances Versus Classes

Intrinsic data is about discrete instances of things and events—a specific customer, a sale that was just made, a certain router, or a data pipeline job failure from the last run. Metadata is about classes of things and events, such as the structure of a customer table, the business definition of “date of sale”, or the details that a pipeline error log should capture. 
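To make the distinction concrete, here is a minimal Python sketch. The `Customer` class, its fields, and the sample values are hypothetical, chosen only for illustration. The record `alice` is intrinsic data because it describes one specific customer; `customer_structure` is metadata because it describes the class of customers, not any one of them.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    """Hypothetical customer entity, used only for illustration."""
    name: str
    age: int
    income: float


def type_name(annotation) -> str:
    """Return a readable type name whether the annotation is a type or a string."""
    return annotation if isinstance(annotation, str) else annotation.__name__


# Intrinsic data: one specific customer (an instance).
alice = Customer(name="Alice", age=42, income=85000.0)

# Metadata: the structure shared by every customer (the class).
customer_structure = {f.name: type_name(f.type) for f in fields(Customer)}

print(alice)
print(customer_structure)
```

Note that `name`, `age`, and `income` on `alice` are attributes of a customer, not metadata. Only the description of what every customer record looks like plays the metadata role here.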

Intrinsic data describes specific instances of things and events 

Metadata describes classes of things and events

The principle of instances versus classes caused me to reflect on the metadata taxonomy I referenced in Metadata is Data, So Manage it Like Data. While the categories of metadata I described—technical, business, operational, and social—are commonly used, a couple of changes are necessary to incorporate the ideas I’ve floated. 

The first change is a simple one: rename business metadata to descriptive metadata. It’s accurate and it’s another label in common use for the metadata category that includes terms, definitions, and classifications. The other change has to do with operational metadata, which refers to things like job schedules, run times, and errors. According to the principle of instances versus classes, this category is not metadata. It’s intrinsic data, specifically systems data, because it refers to instances of systems, and their components and events. 

With these changes, we can start to construct a data taxonomy that harmonizes intrinsic data and metadata (see Figure 3). It provides a clear picture of, and names for, the data that metadata refers to, which allows us to have a coherent discussion about their differences. While this taxonomy is not complete, it provides a good starting point for drilling further down into more detailed subcategories. 

Figure 3. Harmonized Data Taxonomy
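As a rough sketch, the taxonomy described in the text can be written down as a nested Python dictionary. This is my reading of the categories named above, not a reproduction of the figure; the `DATA_TAXONOMY` name and the `path_to` helper are assumptions made for illustration. Operational metadata does not appear as its own node because, under the instances-versus-classes principle, it falls under systems data.

```python
# Each key is a category; its value holds that category's subcategories.
DATA_TAXONOMY = {
    "data": {
        "intrinsic data": {
            "business data": {},
            "systems data": {},  # absorbs what was called operational metadata
        },
        "metadata": {
            "technical metadata": {},
            "descriptive metadata": {},  # formerly business metadata
            "social metadata": {},
        },
    }
}


def path_to(category, tree=DATA_TAXONOMY, trail=()):
    """Return the path from the root to a category, or None if it is absent."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == category:
            return here
        found = path_to(category, children, here)
        if found:
            return found
    return None


print(path_to("systems data"))
print(path_to("descriptive metadata"))
```

Looking up a category shows which branch it belongs to, which is a quick way to check whether two people mean the same thing by a term.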

Metadata is Contextual

I’ll cover one last point for now. The same data set can be both intrinsic data and metadata depending on the context. Consider, for example, information in a data catalog describing a customer entity. From a catalog user’s perspective, this is metadata because it describes the class of “customer”, not specific customers. But the data catalog is a system that manages classes of data. The information it stores about the customer entity is an instance of the data it works with. Therefore, from that system’s perspective, it’s not metadata, it’s intrinsic data.


Other types of data change from metadata to intrinsic data depending on the lens through which you observe them. Data access management policies are metadata because their rules refer to classes of data objects and data consumers. However, the policies in a data access management system are instances of the access rules that system manages. The same is true for data quality specifications in data quality tools and object structures in data management systems.      

So metadata is a role that a data set can play depending on how it’s used, rather than a fixed category it belongs to. 
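One way to picture this is the sketch below, in which the role is not stored on the record at all; it is derived from how the record is used. The `CatalogEntry`, `View`, and `Role` names, and the sample definition text, are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    METADATA = "metadata"
    INTRINSIC_DATA = "intrinsic data"


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical data catalog record describing an entity."""
    entity: str
    definition: str


@dataclass(frozen=True)
class View:
    """The same record seen from one perspective (illustrative structure)."""
    record: CatalogEntry
    perspective: str
    used_to_describe_a_class: bool

    @property
    def role(self) -> Role:
        # The role follows from usage, not from the record itself.
        if self.used_to_describe_a_class:
            return Role.METADATA
        return Role.INTRINSIC_DATA


entry = CatalogEntry("customer", "A party that buys goods or services")

# A catalog user reads the entry as a description of the class "customer".
user_view = View(entry, "catalog user", used_to_describe_a_class=True)

# The catalog system handles the entry as one instance of the records it manages.
system_view = View(entry, "data catalog system", used_to_describe_a_class=False)

print(user_view.role.value)
print(system_view.role.value)
```

Both views wrap the identical `entry` object. Nothing about the data changes between them; only the lens does.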

Still Itching

I started out with the intention of gaining a better understanding of metadata and its role in today’s data landscape. That was the itch I couldn’t scratch. I got some relief from these three propositions:

  • The name for the data that metadata describes is intrinsic data. 

  • Intrinsic data describes specific instances of things and events. Metadata describes classes of things and events.

  • Metadata is a role that a data set plays when it’s used to describe classes. However, the same data set can be intrinsic data when it describes instances. 

But now the itch has moved! It’s become a new question. What practical value do these concepts have? I’ve already addressed the value of improving communication with clear language. And there’s more. How do these concepts affect data modeling, or the work of wrangling the many silos of metadata that come with the applications and tools we implement? That, my friends, is the subject of the next installment in this series.

Jay Piscioneri

Jay has over 25 years of experience in data technologies including data warehousing, business intelligence, data quality, and data governance. He's worked with organizations in a wide variety of industries...
