If Data is the New Oil, Metadata is the New Gold

What is Metadata?

Simply put, metadata is information about other data. For illustration, think of a phone call (Figure 1). On the one hand, you have the actual data, which in this case is the spoken words exchanged during the call. On the other hand, there is plenty of metadata surrounding the actual data, like the duration of the call, the geolocation of each participant, the volume of the conversation or even the phone models used by the callers.

Figure 1. Data and Metadata of a Phone Call

The phone call example also illustrates how powerful metadata can be. Research from Stanford University, for instance, has shown that the metadata of phone calls discloses a significant amount of personal information without access to the actual voice recordings. Graph analyses of phone call metadata can reveal relationships among people and even medical conditions of individuals [1].
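
To make the idea of graph analysis on call metadata more concrete, here is a minimal Python sketch. The records, the participant names and the scoring by total call duration are invented for illustration and are not taken from the study cited in [1]. The point is only that relationships can be inferred without any voice content.

```python
from collections import Counter

# Hypothetical call metadata: no audio, only who called whom and for how long.
calls = [
    {"caller": "alice", "callee": "bob", "duration_s": 320},
    {"caller": "alice", "callee": "clinic", "duration_s": 95},
    {"caller": "bob", "callee": "alice", "duration_s": 610},
    {"caller": "alice", "callee": "clinic", "duration_s": 140},
    {"caller": "carol", "callee": "bob", "duration_s": 45},
]


def contact_strength(records):
    """Sum the call duration per unordered pair of participants."""
    strength = Counter()
    for record in records:
        # Sort the pair so that a->b and b->a count as the same edge.
        pair = tuple(sorted((record["caller"], record["callee"])))
        strength[pair] += record["duration_s"]
    return strength


strength = contact_strength(calls)
strongest_pair, total_seconds = strength.most_common(1)[0]
print(strongest_pair, total_seconds)
```

Even this toy graph hints at a close relationship between two participants and at repeated contact with a clinic, which is the kind of inference the research describes at a much larger scale.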

Metadata in Business Intelligence and Analytics

For business intelligence and analytics (BIA), metadata is essential to manage datasets, assure data quality, and integrate and transform data to gain new insights. Ralph Kimball described metadata as the “DNA of the data warehouse” [2].

Metadata is commonly grouped into three types:

  1. Descriptive metadata that contains additional information to identify a dataset (e.g. title, description)
  2. Structural metadata that depicts how records of a dataset are related (e.g. pagination, parent records)
  3. Administrative metadata that provides information to operationally manage a dataset (e.g. author, creation time)

Kimball [2] provides a data warehouse (DW) oriented perspective on metadata. He distinguishes between (i) technical metadata like data structures, tables or data types, (ii) business metadata that provides information about the business meaning of data (e.g. the description of a KPI) and (iii) process metadata that contains data logged throughout the operation of a DW (e.g. transformation steps, errors, CPU consumption).
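
As a rough illustration of these three categories, the following Python sketch attaches technical, business and process metadata to a single DW column. All class and field names are hypothetical and chosen for this example only; they do not come from Kimball [2] or from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    table: str
    column: str
    data_type: str


@dataclass
class BusinessMetadata:
    name: str
    description: str


@dataclass
class ProcessMetadata:
    step: str
    rows_processed: int
    errors: int = 0


@dataclass
class ColumnMetadata:
    """Bundles the three categories for one column (illustrative only)."""
    technical: TechnicalMetadata
    business: BusinessMetadata
    process_log: list = field(default_factory=list)


revenue = ColumnMetadata(
    technical=TechnicalMetadata("fact_sales", "revenue", "DECIMAL(12,2)"),
    business=BusinessMetadata("Revenue", "Net revenue per order line"),
)
# Process metadata accumulates while the DW is running.
revenue.process_log.append(ProcessMetadata("load_sales", rows_processed=1200))
revenue.process_log.append(ProcessMetadata("aggregate_monthly", rows_processed=12, errors=1))

total_errors = sum(entry.errors for entry in revenue.process_log)
print(revenue.business.name, revenue.technical.data_type, total_errors)
```

Technical and business metadata are fairly static descriptions, whereas process metadata grows with every run, which is one reason the three are often stored and managed differently.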

Figure 2. Metadata Management in a BIA Architecture


Metadata is everywhere, and quite often there is more metadata than actual data. Accordingly, metadata management is a cross-sectional task that spans an entire BIA architecture (Figure 2). The goal is usually to establish centralized metadata management that provides a holistic view of the entire organization.

In summary, metadata in BIA serves first as documentation and second as the glue that holds a DW together. However, modern metadata management can provide more than that. The following use cases illustrate the value of metadata.

  • Data Lineage. A straightforward use case for metadata is data lineage. The goal of data lineage is to track data records over their entire life cycle, e.g. from an aggregated KPI in a report back to its original data source. This can increase trust in and acceptance of BIA by making data transparent and comprehensible for business users. Moreover, data lineage can be used to track down the causes of errors or to comply with laws and regulations (e.g. in banking and finance) [3].

For complete data lineage, it is necessary to track and store metadata for every step in the data life cycle. Therefore, holistic metadata management is essential to data lineage.
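
The tracing of a KPI back to its sources can be sketched in a few lines of Python. The lineage graph and all object names below are made up for illustration. The sketch assumes that every processing step has recorded its direct inputs and that the graph contains no cycles.

```python
# Hypothetical lineage metadata: each object maps to its direct upstream inputs.
lineage = {
    "report.monthly_revenue": ["dw.fact_sales"],
    "dw.fact_sales": ["staging.orders", "staging.customers"],
    "staging.orders": ["erp.orders"],
    "staging.customers": ["crm.customers"],
}


def trace_sources(node, graph):
    """Return the original sources of a node, i.e. ancestors without inputs.

    Assumes an acyclic graph; a cycle would cause infinite recursion.
    """
    upstream = graph.get(node, [])
    if not upstream:
        return {node}
    sources = set()
    for parent in upstream:
        sources |= trace_sources(parent, graph)
    return sources


sources = trace_sources("report.monthly_revenue", lineage)
print(sorted(sources))
```

If a single step fails to record its inputs, the trace stops at that object and reports it as a source, which is why lineage depends on metadata being captured at every step.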

  • Data Warehouse Automation. Data warehouse automation (DWA) has recently become a big topic. DWA automates steps of the DW lifecycle, from source system analysis through testing to documentation, to reduce the effort required to build and run a DW. Automation needs a lot of metadata, especially technical metadata about data structures, data types, relations and dependencies.

DWA can lead to significant cost savings and, if done right, can increase agility by accelerating development and change cycles. Moreover, automated tests can increase the quality and robustness of systems. Conversely, metadata management is not only a prerequisite for automating DWs, but can itself benefit from the automated extraction and transformation of metadata from data sources.
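
A very small example of metadata-driven automation is generating table definitions from technical metadata. The metadata layout and the function name below are assumptions made for this sketch; real DWA tools use their own, far richer metadata models.

```python
# Hypothetical technical metadata for one table.
table_meta = {
    "name": "dim_customer",
    "columns": [
        {"name": "customer_id", "type": "INTEGER", "nullable": False},
        {"name": "customer_name", "type": "VARCHAR(100)", "nullable": False},
        {"name": "segment", "type": "VARCHAR(30)", "nullable": True},
    ],
    "primary_key": ["customer_id"],
}


def build_create_table(meta):
    """Render a CREATE TABLE statement from the metadata dictionary."""
    lines = []
    for col in meta["columns"]:
        null_clause = "" if col["nullable"] else " NOT NULL"
        lines.append(f"  {col['name']} {col['type']}{null_clause}")
    if meta.get("primary_key"):
        lines.append(f"  PRIMARY KEY ({', '.join(meta['primary_key'])})")
    body = ",\n".join(lines)
    return f"CREATE TABLE {meta['name']} (\n{body}\n);"


ddl = build_create_table(table_meta)
print(ddl)
```

The same metadata could also drive generated tests or documentation, so a change to the metadata propagates to all derived artifacts.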

  • Advanced Analytics and Artificial Intelligence. In the era of big data, there is much talk of advanced analytics and innovative applications of artificial intelligence to get more out of data. However, with tons of unstructured data and concepts like data lakes [4], it becomes increasingly hard to stay on top of things. This is why metadata is necessary to ensure that big data does not just become a large collection of unusable trash data.

In this context, metadata enables algorithms to automatically analyze and match data, thereby making big data manageable and producing valuable results. Likewise, metadata helps human data scientists to explore datasets and extract insights. For instance, it can be hard for data scientists to understand certain data without knowing where it comes from, how it was calculated and what it means in business terms.
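
One simple way in which metadata lets an algorithm match data is by comparing column names and types across datasets. The catalog below is invented for this sketch, and matching on identical name and type is a deliberately naive assumption; real systems would also consider semantics and value distributions.

```python
from itertools import combinations

# Hypothetical catalog: dataset name -> {column name: data type}.
catalog = {
    "web_clicks": {"customer_id": "int", "url": "str", "ts": "datetime"},
    "crm_customers": {"customer_id": "int", "segment": "str"},
    "sensor_dump": {"device_id": "str", "reading": "float"},
}


def join_candidates(datasets):
    """Find dataset pairs that share columns with the same name and type."""
    candidates = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(datasets.items(), 2):
        shared = sorted(c for c in cols_a if cols_b.get(c) == cols_a[c])
        if shared:
            candidates[(name_a, name_b)] = shared
    return candidates


matches = join_candidates(catalog)
print(matches)
```

A dataset without any catalog entry would never show up as a candidate, which is one way data in a lake ends up unused.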

  • Open Data and Data Monetization. Metadata is often a good candidate for open data and data monetization, since actual business data is usually too confidential or too business-critical to share with external parties. For illustration, it is not very likely that a car manufacturer would share its customer data. However, metadata like abstracted GPS profiles or engine metadata is less critical and can be valuable to external partners, e.g. to improve route guidance systems or to provide more accurate insurance rates.

Moreover, there is a trend to provide selected data through open APIs to enable external parties to explore and use all kinds of data for new applications. Data shared here is mostly rehashed metadata coming from various processes throughout an organization.

Summary

This article illustrated that metadata is essential for operational data management but also holds a lot of potential for many other applications. Hence, it seems wise to place a certain importance on metadata management.

A starting point for this is a well-thought-out metadata concept that defines what metadata is collected, how it is structured and what to do with it. Ideally, such a concept is part of an overarching data governance framework that regulates data handling throughout an entire organization [5].

Due to the absence of generally accepted standards and of flexible tools that can deal with any kind of metadata, establishing a holistic metadata concept is not an easy task. However, as datasets grow and the number of heterogeneous data sources increases, the need for professional metadata management becomes even more pressing.

Notes and Further Reading

[1] Mayer, J., Mutchler, P. and Mitchell, J. C. (2016): Evaluating the privacy properties of telephone metadata. http://www.pnas.org/content/113/20/5536.full

[2] Kimball, R. (2008): The Data Warehouse Lifecycle Toolkit, 2nd Edition

Hawking, P., Sellitto, C. (2010): Business Intelligence (BI) critical success factors.  http://www.business.vu.edu.au/staff/paulhawking/publications/ACIS%20BI%20CSFfinal.pdf

[3] Jain, S., Thomson, B. (2013): Data Lineage: An Important First Step for Data Governance. http://www.b-eye-network.com/view/17023

[4] Devlin, B. (2017): Big Fish Swim in the Data Lake. https://www.eckerson.com/articles/big-fish-swim-in-the-data-lake

[5] Devlin, B. (2016): Data Governance Coming Out of the Dark. https://www.eckerson.com/articles/data-governance-coming-out-of-the-dark-part-1

Julian Ereth

Julian Ereth is a researcher and practitioner in the field of business intelligence and data analytics.

In his role as researcher he focuses on new approaches in the area of big...
