Operational Data Hub – Responding to Data Friction and Technical Debt
ABSTRACT: The operational data hub is a messaging hub where applications share information about data events using a publish-and-subscribe model.
In a previous article, Operational Data Architecture, I asserted that for decades we have focused almost exclusively on analytical data architecture while operational data architecture has been left to decay. I stressed the need to return attention to operational data architecture and to act to reverse the trends of increasing data sprawl, data friction, and technical debt. Improving operational data management has the potential to benefit both operations and analytics. Improved data quality, greater data cohesion, and enhanced embedded analytics are among the analytical benefits of a new focus on operational data architecture. In the previous article, I described three architectural styles (operational data hub, zero-copy data network, and data product based data sharing) as design patterns to meet that objective. This article, the first in a series of follow-ups, expands on the operational data hub pattern.
What is an Operational Data Hub?
An operational data hub is a pattern in data architecture that provides a central location and a standard protocol for operational systems to communicate about and share data among themselves. The terms “data hub” and “operational data hub” have many definitions and are used to describe several kinds of data management concepts. Here the term is used specifically to describe a communications hub where operational systems post messages about data events—adding, changing, or deleting data. The hub operates as a publish-and-subscribe model. Applications publish messages to inform other applications about data events. They subscribe to be informed about the data events of interest to them. (See figure 1.)
Figure 1. Operational Data Hub
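To make the publish-and-subscribe model concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the pattern, not a real messaging product: the InMemoryHub class, the message fields, and the "customer" class name are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryHub:
    """Toy stand-in for a messaging hub (hypothetical, not a real broker)."""

    def __init__(self) -> None:
        # message class name -> handlers registered by subscribing applications
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, message_class: str, handler: Handler) -> None:
        """Register an application's interest in one class of data events."""
        self._subscribers[message_class].append(handler)

    def publish(self, message_class: str, message: dict) -> int:
        """Deliver a message to every subscriber; return the delivery count."""
        handlers = self._subscribers.get(message_class, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


hub = InMemoryHub()
billing_inbox: List[dict] = []
marketing_inbox: List[dict] = []

# Two applications subscribe to customer data events.
hub.subscribe("customer", billing_inbox.append)
hub.subscribe("customer", marketing_inbox.append)

# A third application publishes one change event without knowing who listens.
delivered = hub.publish(
    "customer",
    {"event_type": "change", "entity": "customer", "identifier_value": "C-1001"},
)
```

The publisher never names its consumers. Adding a new consuming application means adding a subscription, not building another point-to-point feed.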
Messaging Protocol in the ODH
A messaging protocol is a fundamental requirement of the operational data hub (ODH). A standard protocol by which messages are constructed and interpreted is essential. Messages that describe data events must use standard language to describe event types (add, change, delete) and to identify the date and time at which an event occurred. Standard language is also required to describe the data itself, and that language is based on a common semantic model. (See the previous article for more about the semantic model.)
The body of the message must describe the entity (node in the ontology) and identifier (identifying property in the ontology) along with the value of that identifier. Attributes (properties in the ontology) and their corresponding values are also required for add and change events. Delete events apply only to entities. An attribute is deleted by sending a null value for it, and a relationship is removed by sending a null value for its foreign key.
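As one possible reading of these rules, the sketch below models a data event message as a Python dataclass and serializes it to JSON. The field names (event_type, occurred_at, identifier_value, and so on) and the sample attribute names are assumptions for illustration; the article does not prescribe a concrete message schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict

EVENT_TYPES = {"add", "change", "delete"}


@dataclass
class DataEventMessage:
    """One data event, described in the terms of the shared semantic model."""

    event_type: str        # "add", "change", or "delete"
    occurred_at: str       # ISO 8601 date and time of the event
    entity: str            # node in the ontology, e.g. "customer"
    identifier: str        # identifying property, e.g. "customer_id"
    identifier_value: str  # value of that identifier
    attributes: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")
        if self.event_type == "delete" and self.attributes:
            raise ValueError("delete events apply to entities, not attributes")
        if self.event_type != "delete" and not self.attributes:
            raise ValueError("add and change events require attributes")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


timestamp = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat()

# A change event that updates one attribute, deletes another with a null
# value, and removes a relationship with a null foreign key.
change = DataEventMessage(
    event_type="change",
    occurred_at=timestamp,
    entity="customer",
    identifier="customer_id",
    identifier_value="C-1001",
    attributes={
        "postal_code": "60601",
        "fax_number": None,
        "primary_branch_id": None,
    },
)
payload = change.to_json()
```

Validating at construction time keeps malformed events, such as a delete that carries attributes, from ever reaching subscribers.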
Messages must be categorized as members of a class, where each class identifies a topic of interest to subscribers. Kafka topics are one technology-specific example of message classes. Classes might decompose hierarchically into subclasses. Define the class structure based on the needs and interests of the subscribers. The ontology is a practical way to identify top-level classes: for example, customer as a message class for all data events affecting customer data, and customer account for all data events affecting customer account data. Some subclasses can be identified from taxonomies, such as deposit account and loan account, with further subclasses of deposit account such as checking account, savings account, and investment account. Other subclasses might be based on properties, or on properties combined with event types, such as customer address changes. At a fine-grained level you might occasionally find subclasses where value is a consideration, for example adding deposit transactions where the transaction amount exceeds $10,000. Data protection is another consideration. You might want to create public and protected subclasses for some messages, where the public class messages omit or obfuscate privacy-sensitive data.
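The class structure described above could be sketched as follows. This is an illustration under stated assumptions: the dotted class names, the set of sensitive attributes, and the helper functions are hypothetical, and only the $10,000 threshold comes from the example in the text.

```python
from typing import Any, Dict, List

LARGE_DEPOSIT_THRESHOLD = 10_000  # threshold from the example in the text
SENSITIVE_ATTRIBUTES = {"tax_id", "date_of_birth"}  # hypothetical list


def class_and_ancestors(message_class: str) -> List[str]:
    """Expand a dotted class name into itself and all of its parent classes.

    A subscriber to a parent class can then also receive subclass messages.
    """
    parts = message_class.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def classify_deposit_transaction(message: Dict[str, Any]) -> str:
    """Pick a value-based subclass for deposit transaction events."""
    base = "customer_account.deposit_account.transaction"
    amount = message["attributes"].get("amount", 0)
    if message["event_type"] == "add" and amount > LARGE_DEPOSIT_THRESHOLD:
        return base + ".large_deposit"
    return base


def to_public(message: Dict[str, Any]) -> Dict[str, Any]:
    """Build the public-class copy of a message with sensitive values masked."""
    masked = {
        name: ("***" if name in SENSITIVE_ATTRIBUTES else value)
        for name, value in message["attributes"].items()
    }
    return {**message, "attributes": masked}


classes = class_and_ancestors("customer_account.deposit_account.checking_account")

deposit = {"event_type": "add", "attributes": {"amount": 12_500}}
deposit_class = classify_deposit_transaction(deposit)

customer = {
    "event_type": "add",
    "attributes": {"name": "A. Sample", "tax_id": "123-45-6789"},
}
public_customer = to_public(customer)
```

Keeping classification in small functions like these makes it easier to add or retire subclasses as subscriber needs change.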
The illustration in Figure 1 shows RestMQ, a simple but highly versatile messaging method that uses HTTP as transport, JSON for message formatting, and a RESTful interface for publishers and subscribers. Alternatively, the ODH can be implemented with systems such as Apache Kafka, a distributed messaging system that blends features of a traditional message queue with an integral publish-and-subscribe model. Whether using a traditional message queue or Kafka, it makes sense to use change data capture (CDC) technology to identify data events and publish the messages. Coupling a real-time capable CDC tool with Kafka streaming makes it practical to deliver data change messages from databases in real time when needed, and to queue them for on-demand processing by applications not ready for real time. (See figure 2.)
Figure 2. Messaging with CDC and Kafka
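The step between CDC and the hub is a translation from a database-level change record to a message expressed in the semantic model. The sketch below shows one way that step might look. The change record shape, the table and column names, and the mapping tables are all hypothetical; real CDC tools each define their own record formats, and publishing to Kafka or another broker is deliberately left out.

```python
from typing import Any, Dict

# Database operation -> standard event type used in hub messages.
OP_TO_EVENT = {"insert": "add", "update": "change", "delete": "delete"}

# Hypothetical mapping from one physical table to the semantic model.
TABLE_MAP = {
    "CUST_MSTR": {
        "entity": "customer",
        "identifier": "customer_id",
        "key_column": "CUST_NO",
        "columns": {"CUST_NM": "name", "POSTAL_CD": "postal_code"},
    }
}


def change_record_to_message(record: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a (hypothetical) CDC change record into a hub message."""
    mapping = TABLE_MAP[record["table"]]
    event_type = OP_TO_EVENT[record["op"]]
    attributes: Dict[str, Any] = {}
    if event_type != "delete":
        # Keep only columns that the semantic model knows about.
        attributes = {
            mapping["columns"][column]: value
            for column, value in record["changed"].items()
            if column in mapping["columns"]
        }
    return {
        "event_type": event_type,
        "occurred_at": record["committed_at"],
        "entity": mapping["entity"],
        "identifier": mapping["identifier"],
        "identifier_value": str(record["key"][mapping["key_column"]]),
        "attributes": attributes,
    }


record = {
    "table": "CUST_MSTR",
    "op": "update",
    "key": {"CUST_NO": 1001},
    "changed": {"POSTAL_CD": "60601", "LAST_UPD_USER": "batch7"},
    "committed_at": "2024-01-15T09:30:00+00:00",
}
message = change_record_to_message(record)
```

Housekeeping columns such as LAST_UPD_USER have no counterpart in the semantic model, so they are dropped rather than leaked to subscribers.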
Implementing the ODH requires operational systems to change from sending and receiving point-to-point feeds to publishing, subscribing to, and processing messages. The effort is significant, but the benefits are real. The changes can be made over time, starting with the point-to-point interfaces that are the greatest pain points. As a practical matter, you’re not likely to eliminate all point-to-point interfaces, but each one that is removed is a step in the right direction.
Benefits of ODH
The operational data hub is a substantial step toward resolving operational data difficulties. It does not seek to integrate data into a single database or to force-fit data to a shared schema as is characteristic of the decades-old operational data store (ODS) pattern. The ODH uses only semantic integration in messages. Each consuming application can then interpret messages and translate them to its local dialect as part of processing. The ODH doesn’t eliminate database silos. They continue to exist, but consistency among the databases is improved with application-to-application communications. With communication through a messaging hub, it becomes practical to remove point-to-point interfaces, eliminating some points of data friction, reducing technical debt, improving responsiveness to change, and easing the maintenance burden of complex system interdependencies.
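On the consuming side, translating a message into a local dialect might look like the sketch below. The local field names, the in-memory store, and the message shape are assumptions for illustration only; a real application would apply the same mapping to its own database.

```python
from typing import Any, Dict

# Hypothetical mapping from semantic-model attribute names to the names
# that this one consuming application uses locally.
SEMANTIC_TO_LOCAL = {"name": "full_name", "postal_code": "zip"}


def apply_message(store: Dict[str, Dict[str, Any]], message: Dict[str, Any]) -> None:
    """Apply one customer data event to this application's local store."""
    if message["entity"] != "customer":
        return  # not an entity this application keeps
    key = message["identifier_value"]
    if message["event_type"] == "delete":
        store.pop(key, None)
        return
    row = store.setdefault(key, {})
    for attribute, value in message["attributes"].items():
        local_name = SEMANTIC_TO_LOCAL.get(attribute)
        if local_name is None:
            continue  # attribute this application does not track
        if value is None:
            row.pop(local_name, None)  # null value: attribute was deleted
        else:
            row[local_name] = value


local_store: Dict[str, Dict[str, Any]] = {}
base = {"entity": "customer", "identifier": "customer_id", "identifier_value": "C-1001"}

apply_message(local_store, {
    **base,
    "event_type": "add",
    "attributes": {"name": "A. Sample", "postal_code": "60601", "segment": "retail"},
})
apply_message(local_store, {
    **base,
    "event_type": "change",
    "attributes": {"postal_code": None},
})
snapshot = dict(local_store["C-1001"])
```

Each application owns its own mapping, so a schema change in one silo affects only that application's translation step and not the shared messages.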
This article describes the operational data hub, one of three operational data architecture patterns that I introduced in a previous article. In upcoming articles, I’ll expand on the zero-copy data network and data product based data sharing patterns.