Register for "A Guide to Data Products: Everything You Need to Understand, Plan, and Implement" - Friday, May 31,1:00 p.m. ET

Building an Effective Transactional Data Streaming System

Production transactional systems, including RDBMS and mainframe, are the workhorses of the modern enterprise. These systems drive financial bookkeeping, supply chain logistics, customer interactions, and many other daily operations.  

All that data needs to be copied to other repositories where it can be studied and measured to enable smart decisions without impeding production workloads. To make this happen, for decades enterprises have copied these database entries in periodic, duplicative batches to reporting databases, data warehouses and now data lakes.

Two changes have made batch untenable. First, business events are increasingly urgent: consider the location-based cross-selling opportunity created by a retail purchase, or the rippling supply-chain pressure created by a spike in orders for a hot product on Amazon. You cannot wait for hours or days to act on this data. Second, rising data volumes mean that you cannot afford the wasteful practice of re-copying unchanged data sets. Even if you don’t need to analyze data changes real-time, you should process them incrementally as they happen to avoid bottlenecks.

Hence the shift away from batch. While data streaming is not new, in the last five years Kafka and the cloud vendor variants Amazon Kinesis, Azure Event Hub and Google Pub/Sub have ushered in a new era of streaming to answer modern data requirements. These requirements include the following.

  • Real-time event response: The value of event data is increasingly short-lived.
  • Data distribution: Many users need to address many use cases on many platforms.
  • Asynchronous communication: Enterprises seek to capture data as it is created, but have applications consume it at their own pace.
  • Parallel consumption: Multiple parties often need copies of the same data for different uses.
  • Service modularity: A key principle of modern microservices architectures is that one service does not depend on another, for data sharing or anything else.

With these pain points in mind, we can easily see the value of a highly distributed, highly scalable streaming platform that asynchronously collects, buffers, and disseminates data records (a.k.a. messages). The producers and consumers of these records can be independent, working with an intermediate broker rather than one another. And multiple consumers can each process their own record streams as they need it. (For more detail on why and how Kafka is used, see Assessing the Kafka Data Streaming Opportunity. Also, be sure to check out Julian Ereth’s article on the rise of Streaming-First Architectures.)

Building and managing an effective enterprise streaming platform is neither easy nor straightforward. Let’s consider six common attributes of effective streaming systems for transactional data. These draw on the experiences of data teams at enterprises such as Capital One (see the presentation of their Data Streaming VP Chris D’Agostino at the Kafka Summit in October) and Bloomberg (also see their Kafka Summit presentation.)

Shared resource pool. A centralized architecture, often cloud-based, provides the most efficient platform for serving various streaming data consumers in an enterprise. Use cases, resource requirements, and usage patterns will vary by line of business and department. Supporting them with a common platform and ideally, an automated service menu improves the efficiency of infrastructure utilization. When it comes to cost, individual “shadow IT” cloud arrangements can rarely match the economics of an enterprise-wide, cloud-based approach.

Standardized process. IT personnel and business managers within the lines of business need clear guidance and guard-rails about how best to use the shared streaming infrastructure. Developers, DBAs and architects alike need a common set of steps, addressing planning, design, data quality, testing, and production. Developers need a common dictionary and ontology as they design their streaming analytics algorithm. With such repeatable processes in place, lines of business can roll out projects faster and focus their creativity on the business opportunity within their specialty.

Productive collaboration. Two sets of interactions are critical to the success of a transactional streaming system.  The first is the interaction of business and IT managers. As with any data initiative, business requirements must be translated plainly into architectural requirements. Typically, enterprise architects play this role, boiling down service requests and SLAs into a clear blueprint for Kafka clusters. This includes careful configuration of topic streams, partitions, and other components to reconcile often-competing requirements for data latency, throughput, durability, and availability. Second, the DBAs and data architects that manage production databases need to reduce the burden on ETL/Kafka developers wherever possible. For example, they can use automated change data capture solutions to configure the publication of database transactions to Kafka or other data streams with no manual scripting.

Data governance. There are several dimensions to data governance. For example, data producers are often the right accountable parties for data accuracy, using profiling tools, quality checks, etc. In this scenario, if Line of Business A publishes transactions to Kafka for use by Line of Business B, A rather than B is responsible for source data quality.  Compliance is more often a shared responsibility. For example, GPDR will require that all parties along the pipeline manage Personally Identifiable Information (PII) only in ways that are explicitly authorized by the original owners of that data. The central IT organization should define, monitor usage and enforce policies, then provide reports to the appropriate internal compliance officer. All this requires careful consideration of factors such as the “blast radius” and organizational politics.

Wide adoption. The more business managers you empower with real-time data, the smarter the decisions your organization will make. Establishing a self-service shared streaming platform and productive methods of interaction as described here can help drive this adoption throughout the enterprise.

Monitoring. It is critical to monitor state, memory utilization, throughput, latency, number of topics and lags in creation/population of partitions. These and other metrics will signal your ability to meet service level agreements from the business and maintain sustainable loads on the infrastructure.

With the right architecture in place, enterprise data teams can then focus on the processing of the transactional data itself. And ensuring that database records are handled appropriately can be tricky. Let’s examine the key aspects of an effective, efficient streaming data pipeline.

Data processing stages. Pipelines must be designed to address three phases. First, transactional record insertions, updates, and deletes, as well as source schema changes, must be injected into the appropriate topic stream. Various connectors and CDC offerings can achieve this. Second, an update processor will need to run data integrity checks, ensuring that inserts/updates/deletes are processed once and only once. Third, errors or processing failures must be handled and re-tried.  

Transactional consistency. A closely related requirement is ensuring that transactions are processed holistically, exactly once and in order. This entails several steps, including transactional boundary definition, committed status tracking and rollback where necessary. Data managers also should weigh their topic and partition configuration choices, which introduce tradeoffs in terms of processing overhead, coding requirements, etc. Certain topic/partition options might be faster to process, but require custom transformation logic to be sure transactional boundaries are correctly maintained on the consumption side.

A well designed and carefully implemented transactional streaming system can deliver significant benefits. Mastercard, for example, processes 150 data fields for real-time authentication checks on credit card purchases. Their system reduces both fraudulent transactions and false alerts, delivering benefits to the top and bottom line.

To learn more about how the Fortune 25 firm Express Scripts is using stream processing to enable microservices, feel free to check out their upcoming webinar with Attunity and Confluent.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...

More About Kevin Petrie