An Architectural View of Metadata Management
ABSTRACT: Most organizations actively manage data but are passive about metadata. We need to think about metadata management from an architectural perspective.
Most organizations today recognize the importance of active and disciplined data management. They view data as an asset and manage it with governance and architectural standards and controls. The problem is that, in contrast, most organizations are passive and casual about metadata management.
Data teams typically look to data catalogs as the answer to metadata needs. Looking from an architectural perspective, it is clear that data catalogs are only part of the solution, and frequently they are also part of the problem. Organizations manage data as an asset, but view metadata simply as a by-product of data management processes. This “data is managed, metadata happens” approach is fraught with risk. As data management complexity continues to increase, metadata management has become an essential discipline.
In this article I present an early draft of Metadata Management Architecture co-developed with my eLearningCurve colleague, Olga Maydanchik. I offer this architectural view as a thinking tool – that is, a means to begin to understand the scope and complexity of metadata management. It doesn’t provide solutions for all of the metadata management challenges. It is a beginning – not an end – and a tool to start finding solutions to metadata challenges such as silos, disparities, self-service difficulties, and poor data catalog adoption.
A Macro View of Metadata Management Architecture
Let’s begin with a big picture view of metadata management architecture. (See figure 1.) At the macro level, metadata management comprises three broad topics:
Metadata Subjects and Sources are things that are described by metadata (the subjects) and the things from which metadata is derived or created (the sources). These include the inventory of data that is managed by the organization and the processes by which it is managed.
Metadata Lifecycle is the path that metadata follows from inception, through various stages of processing and management activities, to the point of consumption and use.
Metadata Management Processes and Products are the tasks and activities performed to manage metadata and the tangible results of those tasks and activities.
Figure 1. Macro View of Metadata Management Architecture
Zooming In On Metadata Management Architecture
Now let’s take a closer look at each component of the metadata management architecture.
Metadata Subjects and Sources
As described above, metadata subjects and sources include the inventory of data that is managed, and the processes by which that data is managed. The data inventory of a typical organization is quite large and diverse. (See figure 2.) It includes both operational data that is used to run the business, and analytical data that is used to measure and manage the business. Both operational and analytical data consist of enterprise data that is generated internally, and external data acquired from partners and data providers.
Figure 2. The Data Inventory
The data inventory is obviously a primary subject of metadata. It is much of what is described by metadata – names, meanings, rules and constraints, etc. It is important to recognize that the inventory is also a metadata source. AI/ML algorithms can be used to extract metadata from the inventory through techniques such as semantic inference, tagging of privacy-sensitive and security-sensitive data, building knowledge graphs that show data relationships, and other kinds of automated metadata discovery.
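As a rough illustration of treating the inventory as a metadata source, the sketch below tags columns whose values look sensitive. It deliberately substitutes simple regular expressions for the AI/ML techniques mentioned above, and the pattern names, the 0.8 threshold, and the sample rows are all hypothetical.

```python
import re

# Hypothetical patterns; a real scanner would use far richer detection.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def tag_sensitive_columns(rows, threshold=0.8):
    """Return {column: [tags]} for columns whose values mostly match a pattern."""
    tags = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [str(r[col]) for r in rows if r.get(col) is not None]
        if not values:
            continue
        for tag, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                tags.setdefault(col, []).append(tag)
    return tags

# Made-up sample data standing in for a table in the data inventory.
sample = [
    {"id": 1, "contact": "ann@example.com", "tax_id": "123-45-6789"},
    {"id": 2, "contact": "bob@example.org", "tax_id": "987-65-4321"},
]
print(tag_sensitive_columns(sample))
# prints {'contact': ['email'], 'tax_id': ['us_ssn']}
```

The resulting tags are themselves metadata: they describe the data without changing it, and they could be written to whatever metadata store the organization uses.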
Metadata subjects and sources also include the data management processes that are used to manage the data inventory. (See figure 3.) These include (but are not limited to) the processes of operational systems, data warehousing, data lake management, master data management, data quality management, and data observability.
Figure 3. Data Management Processes
Data management processes are both subjects and sources of metadata. Ideally the core data systems – operational, data warehousing, data lake, and MDM – are built on metadata foundations such as data models and data definitions, and they are designed to generate metadata describing how data is created, updated, and deleted. Data quality management and data observability systems generate additional metadata about characteristics of data and processing of data.
The Metadata Lifecycle
The metadata lifecycle is the path that metadata follows from inception to consumption – a progression through activities of Metadata Collection, Metadata Storage, Metadata Access, and Metadata Consumption.
Metadata collection encompasses all of the activities of capturing metadata from sources and subjects. (See figure 4.) Those activities include metadata creation, metadata discovery, and metadata acquisition.
Figure 4. Metadata Collection
Metadata creation occurs when processes create new metadata. These may be computer processes such as data pipeline execution describing data lineage as metadata, or they may be human processes such as data modeling in systems design, source/target mapping in data warehouse design, and data governance processes to describe and tag data. Any task or activity that generates data describing data inventory or data management processes is a creator of metadata.
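One way to picture a computer process creating metadata is a pipeline step that records a lineage entry each time it runs. The following is a minimal sketch under stated assumptions: the `run_step` helper, the in-memory `lineage_log`, and the dataset names are hypothetical and do not correspond to any particular tool.

```python
from datetime import datetime, timezone

lineage_log = []  # stand-in for a real metadata store

def run_step(name, inputs, output, transform, data):
    """Run one pipeline step and record a lineage entry describing it."""
    result = transform(data)
    lineage_log.append({
        "step": name,
        "inputs": list(inputs),
        "output": output,
        "rows_in": len(data),
        "rows_out": len(result),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    })
    return result

# Hypothetical source rows and dataset names.
orders = [{"id": 1, "amount": 50}, {"id": 2, "amount": -5}]
valid = run_step(
    "filter_valid_orders",
    inputs=["raw.orders"],
    output="clean.orders",
    transform=lambda rows: [r for r in rows if r["amount"] >= 0],
    data=orders,
)
```

Here the lineage record is a by-product of execution, but it is captured deliberately rather than left to chance.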
Metadata discovery occurs when intelligent processes find metadata by looking at the data. Discovery may be in the form of AI/ML agents that crawl stored data to extract metadata – for example, discovering data semantics. This process is also known as metadata scanning. Discovery may also occur as part of processing data – for example, intelligent data lake ingestion that automatically catalogs data as it is brought into the data lake, and AI/ML-based auto-tagging of data at the time of ingestion. Manual discovery may also occur as part of activities such as data exploration by data scientists and data profiling by data quality analysts.
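A very simple form of discovery is profiling: looking at sample rows and deriving technical metadata from them. The sketch below is an assumption-laden stand-in for the intelligent scanning described above; the `profile_columns` function and its output fields are hypothetical.

```python
def profile_columns(rows):
    """Infer a simple type and null rate for each column from sample rows."""
    profile = {}
    columns = {key for row in rows for key in row}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        types = sorted({type(v).__name__ for v in present})
        if not types:
            inferred = "unknown"   # every sampled value was missing
        elif len(types) == 1:
            inferred = types[0]
        else:
            inferred = "mixed"     # more than one Python type observed
        profile[col] = {
            "inferred_type": inferred,
            "null_rate": 1 - len(present) / len(values),
        }
    return profile

# Made-up sample rows.
rows = [{"sku": "A1", "qty": 3}, {"sku": "B2", "qty": None}]
discovered = profile_columns(rows)
```

In this sample, `qty` is inferred as `int` with half of its values missing, which is the kind of fact a data quality analyst might otherwise discover by hand.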
Metadata acquisition encompasses the processes of collecting metadata when that metadata is not readily created or discovered. Acquisition is the work of acquiring metadata from sources, both human and digital. This includes manual recording of metadata such as curator annotations, and crowdsourcing of metadata to capture SME knowledge and data consumer experiences. Acquisition may also occur as metadata import processes to acquire metadata created by tools and processes that don’t readily interoperate with your enterprise metadata repository or data catalog.
Metadata storage includes the technologies that are used to store metadata and the locations where metadata is stored. (See figure 5.) These typically include data catalogs, metadata repositories, tool-specific metadata stores, and file and database management systems (including spreadsheets – a common but not ideal practice).
Figure 5. Metadata Storage
Metadata storage is an area where many metadata management problems exist. Note that each of the things listed above is expressed in plural form – catalogs, repositories, metadata stores, file systems, database management systems. This is the stuff of metadata silos, redundancy, disparity, and confusion. Multiple metadata stores are perhaps unavoidable with modern data management technology. Vendor-proprietary and tool-embedded metadata, data catalogs built into data preparation and analysis tools, and custom-built metadata solutions all contribute to the problem. Architecturally, we need to think about metadata interoperability and about designating a metadata system of record, such as a formally recognized enterprise data catalog.
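To make the system-of-record idea concrete, the sketch below merges the metadata that several stores hold about one data asset. It is only one possible policy, assumed for illustration: values from the designated system of record win, other stores fill gaps, and disagreements are reported rather than silently resolved. The store names and fields are hypothetical.

```python
def consolidate(records_by_store, system_of_record):
    """Merge per-store metadata for one data asset.

    Values from the system of record win; other stores only fill gaps,
    and any value that disagrees with the merged result is reported.
    """
    merged = dict(records_by_store.get(system_of_record, {}))
    conflicts = []
    for store, record in records_by_store.items():
        if store == system_of_record:
            continue
        for field, value in record.items():
            if field not in merged:
                merged[field] = value  # fill a gap from another store
            elif merged[field] != value:
                conflicts.append((field, store, value))
    return merged, conflicts

# Hypothetical silos describing the same "customer" data set.
stores = {
    "enterprise_catalog": {"name": "customer", "owner": "Sales Ops"},
    "bi_tool": {"name": "customer", "owner": "Marketing", "refresh": "daily"},
}
merged, conflicts = consolidate(stores, "enterprise_catalog")
```

The list of conflicts is arguably as valuable as the merged record, because it makes metadata disparity visible.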
Metadata access provides the capabilities needed for people and processes to find and use metadata. (See figure 6.) Access is provided through data catalogs, metadata connectors, metadata APIs, and metadata queries. Connectors and APIs may be provided by metadata management tools, and they may also be internally developed to simplify access and to embed access controls for metadata.
Figure 6. Metadata Access
Metadata access difficulties – problems with finding and accessing metadata – are compounded when metadata is stored in different forms across metadata silos. Architecturally, you may want to consider solutions such as a metadata registry or a metadata portal to partially mitigate the difficulties.
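A metadata registry can be thought of as a single place to ask which stores know about a given data asset. The sketch below is a toy version under stated assumptions: the `MetadataRegistry` class is hypothetical, and plain dictionaries stand in for real catalogs and tool-specific stores.

```python
class MetadataRegistry:
    """Maps each metadata store to a lookup function so callers query one place."""

    def __init__(self):
        self._stores = {}

    def register(self, store_name, lookup):
        """Register a callable that returns metadata for an asset, or None."""
        self._stores[store_name] = lookup

    def find(self, asset):
        """Return {store_name: metadata} for every store that knows the asset."""
        found = {}
        for name, lookup in self._stores.items():
            metadata = lookup(asset)
            if metadata is not None:
                found[name] = metadata
        return found

# Hypothetical silos, each holding different metadata about the same asset.
catalog = {"sales.orders": {"description": "Customer orders"}}
etl_tool = {"sales.orders": {"load_frequency": "daily"}}

registry = MetadataRegistry()
registry.register("catalog", catalog.get)
registry.register("etl_tool", etl_tool.get)
result = registry.find("sales.orders")
```

The registry does not remove the silos; it only gives people and processes one entry point, which is why it mitigates the access problem only partially.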
Metadata consumption encompasses all of the ways that metadata is used by people and by software and automated processes. (See figure 7.) Metadata may be used actively – that is, processes access metadata and use it to make run-time decisions. It may also be used passively – that is, people access metadata to understand data and to decide how to work with it.
Figure 7. Metadata Consumption
Finding and understanding data is a common use case for data analysts and self-service data consumers. Managing data lifecycles is a metadata-dependent activity that is central to the work of data managers, data governors, and automated tools such as those for data lake management and data pipeline management. Reporting, analytics, and AI/ML all depend on metadata – both for the human processes of design and development, and for the automated processes of operation and execution.
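Active use of metadata can be illustrated with a process that consults tags at run time to decide how to handle each column. In this minimal sketch, the `sensitive` tag, the `apply_policy` function, and the masking rule are assumptions made for illustration.

```python
def apply_policy(row, column_metadata):
    """Mask any column whose metadata carries the 'sensitive' tag."""
    masked = {}
    for col, value in row.items():
        tags = column_metadata.get(col, {}).get("tags", [])
        masked[col] = "***" if "sensitive" in tags else value
    return masked

# Hypothetical column-level metadata and a sample row.
column_metadata = {
    "email": {"tags": ["sensitive"]},
    "country": {"tags": []},
}
row = {"email": "ann@example.com", "country": "CA", "plan": "basic"}
safe_row = apply_policy(row, column_metadata)
```

The code contains no column names of its own. Changing the metadata changes the behavior, which is what distinguishes active use from passive lookup by a person.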
Metadata Management Processes and Products
Metadata Management Practices and Products are the tasks and activities performed to manage metadata and the tangible results of those tasks and activities. (See figure 8.) Products are the things in the metadata inventory. Practices are the activities of managing metadata – the processes that are executed and tasks that are performed.
Figure 8. Metadata Management Practices and Products
The metadata inventory includes business metadata, which describes the semantics and business meaning of data, associates data with business processes, and describes the business rules that establish data constraints. Technical metadata describes the data from a technology perspective, including database schema, data formats, data types, platforms and storage locations, and other technical aspects of data implementation. Operational metadata describes the processes that operate upon data and the results of those processes – for example, data transformations in data warehousing and data lineage as data moves through data pipelines. Social metadata describes the human aspects of data, answering questions such as:
Who are the data stewards?
Who are the data SMEs?
Who are the business SMEs?
Who are the frequent users of the data?
Although some may consider social metadata trivial or extraneous, making the human connection can be especially significant when working to improve data catalog adoption.
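The four kinds of metadata described above can be sketched as a single record per data asset. This is an illustrative data structure only; the `AssetMetadata` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AssetMetadata:
    """One inventory entry grouping the four kinds of metadata for a data asset."""
    name: str
    business: dict = field(default_factory=dict)     # meaning, rules, processes
    technical: dict = field(default_factory=dict)    # schema, types, location
    operational: dict = field(default_factory=dict)  # lineage, transformations
    social: dict = field(default_factory=dict)       # stewards, SMEs, users

    def missing_categories(self):
        """List the metadata categories that are still empty for this asset."""
        categories = ("business", "technical", "operational", "social")
        return [c for c in categories if not getattr(self, c)]

# Hypothetical entry with business and technical metadata only.
orders = AssetMetadata(
    name="sales.orders",
    business={"definition": "Confirmed customer orders"},
    technical={"format": "parquet"},
)
gaps = orders.missing_categories()
```

A check like `missing_categories` shows how an inventory could reveal where social metadata, such as the names of stewards and SMEs, has not yet been captured.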
Putting the Pieces Together – An Architecture View of Metadata Management
Bringing together all of the metadata management pieces discussed above creates the Metadata Management Architecture shown in figure 9.
Figure 9. Metadata Management Architecture
This diagram is admittedly small print and difficult to read. For readability, it is best to refer back to the individual diagrams of each part. The purpose of this diagram is to illustrate the scope and complexity – the number of components and the relationships between them – that make metadata management a challenging endeavor.
In closing, let me restate a goal from the beginning of this article: I offer this architectural view as a thinking tool – a means to begin to understand the scope and complexity of metadata management. It doesn’t provide solutions for all of the metadata management challenges. It is a beginning – not an end – and a tool to start finding solutions to metadata silos, metadata disparity, self-service data difficulties, poor data catalog adoption, and many other metadata challenges. I hope you find it useful in that way.