Why and How Data Engineers Realize DataOps Benefits with Apache Kafka Streaming
The rise of DataOps is based on nearly universal agreement that enterprise data management is rife with challenges. While solutions are far from universal, data streaming architectures based on Apache Kafka are becoming a preferred platform to achieve DataOps objectives.
The emerging discipline of DataOps seeks to build and manage efficient, effective data pipelines. DataOps applies the principles of DevOps, agile development, lean manufacturing and total quality management to data management. It offers hope to data teams that struggle to answer modern business demands for higher data volume, variety and velocity with uncompromised quality. By collaborating more efficiently, data managers and consumers can create new analytics value at a lower cost. As Wayne Eckerson writes in his report, Trends in DataOps: Bringing Scale and Rigor to Data and Analytics, the hunt for “faster, better and cheaper” data operations involves multiple tools and process changes.
Streaming-first data architectures based on Apache Kafka can greatly assist the effort by effectively enabling lean manufacturing for data pipelines. Kafka reduces waste and improves productivity by connecting data producers and consumers through granular topic streams. You can publish once to many consumers, then reconfigure topics to absorb new endpoints or business rules. You can identify and then remediate errors (poka-yoke in “lean” lingo) with continuous monitoring and stream processing. You can also continuously improve your environment with incremental code updates, akin to a Kaizen approach. (Also see my earlier article, Assessing the Kafka Data Streaming Opportunity.)
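To make the publish-once, consume-many pattern concrete, here is a minimal sketch using the standard Apache Kafka Java client. The topic name (“claims”) and consumer group ID are hypothetical; each additional service simply subscribes with its own group.id to receive its own full copy of the stream, so adding a new endpoint does not touch the producer.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ClaimsAnalyticsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each consumer group receives its own copy of the published records,
        // so the same "claims" topic can feed analytics, billing and support
        // services without the producer changing at all.
        props.put("group.id", "analytics-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("claims")); // hypothetical topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```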
This article explores a real DataOps success story that entails re-architecting one of the world’s largest data environments to be streaming-first. This Fortune 500 provider of prescription and pharmacy benefits, which we will call “GetWell,” is migrating to the Apache Kafka real-time data streaming platform to improve efficiency and data quality.
Like many well-established enterprises, GetWell had until recently become beholden to its legacy architecture. Over the years its data team had built a dedicated, custom system for each new services project, leading to redundant and conflicting data repositories. As silos accumulated, GetWell’s IT organization struggled with mounting operational costs and declining data quality. Multiple DataOps warning lights were flashing.
GetWell decided to set a new course and standardize on a Kafka-based microservices platform to which they would migrate the processing of various prescription and benefits offerings. Their mantra: build once and reuse many. They are now reusing datasets and infrastructure components across services to accelerate service rollout and delivery while reducing operational costs. GetWell is replicating data from their DB2 z/OS system of record and injecting those records into granular, persistent topic streams. Their services draw data as needed to support various real-time customer interactions.
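In Kafka terms, a “granular, persistent topic stream” is simply a narrowly scoped topic whose records are retained so any service can read (or re-read) them on demand. The sketch below, using the standard Java AdminClient, shows roughly what provisioning such a topic might look like; the topic name, partition count, replication factor and compaction choice are illustrative assumptions, not GetWell’s actual configuration.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreatePersistentTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One narrowly scoped subject area per topic ("granular"), compacted so the
            // latest record per key is kept indefinitely ("persistent") for reuse by
            // any current or future service.
            NewTopic memberEligibility = new NewTopic("member-eligibility", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Set.of(memberEligibility)).all().get();
        }
    }
}
```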
The technical requirements of this approach are not trivial. GetWell needed to identify and replicate transactional data from their mainframe system of record without slowing production operations or increasing MIPS consumption, which incurs ongoing fees. They needed to flexibly configure and re-configure transactional data streams to meet changing customer and business requirements while minimizing dependency on over-burdened developers. Finally, they needed to maintain transactional consistency (i.e., data quality) by accurately aggregating Kafka records for inserts, updates and deletes, so that transactions are processed holistically for customer service operations.
GetWell is using automated change data capture software to replicate live transactional data from their mainframe to their customer services SaaS platform in an efficient, granular and flexible fashion. They are non-disruptively identifying and capturing real-time inserts, updates, deletes and DDL schema changes on the mainframe production system.
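The article does not name the CDC tool or its on-the-wire message format, so the Java record below only suggests, with hypothetical field names, the kind of change event such a pipeline typically lands in a Kafka topic.

```java
// Hypothetical shape of one captured change event; a real CDC tool's schema will differ.
public record ChangeEvent(
        String table,        // source DB2 table, e.g. "MEMBER"
        Operation op,        // what happened on the mainframe
        String beforeImage,  // row state before the change (null for inserts)
        String afterImage,   // row state after the change (null for deletes)
        long commitTimestamp // when the source transaction committed
) {
    public enum Operation { INSERT, UPDATE, DELETE, DDL_CHANGE }
}
```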
These Kafka record streams, each supporting a specific customer service requirement, flow through a Java-based stream processor that performs light transformations and integrity checks before reinjecting the records into a second Kafka stream. A Java-based update processor then performs further integrity checks and delivers the transformed records to the appropriate service. A final Java-based processor “retries” any failed records to ensure they reach the right service through Kafka. The figure below illustrates this process flow for publishing, processing and consuming transactional data.
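GetWell’s processors are not published, so the Kafka Streams topology below is only a minimal sketch of the pattern described above: validate and lightly transform records from a source topic, publish the good ones to a second topic, and divert failures to a retry topic that a separate processor re-reads. The topic names and the validation rule are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class StreamProcessorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "benefit-stream-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("member-changes-raw"); // hypothetical topic

        // Records that pass the integrity check are lightly transformed and
        // reinjected into a second topic for the downstream update processor.
        raw.filter((key, value) -> passesIntegrityCheck(value))
           .mapValues(String::trim) // placeholder for the real "light transformation"
           .to("member-changes-validated", Produced.with(Serdes.String(), Serdes.String()));

        // Records that fail go to a retry topic, mirroring the retry processor above.
        raw.filterNot((key, value) -> passesIntegrityCheck(value))
           .to("member-changes-retry", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }

    private static boolean passesIntegrityCheck(String value) {
        return value != null && !value.isBlank(); // placeholder for real validation rules
    }
}
```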
The stream and update processors also perform additional, critical steps to ensure transactional consistency. GetWell has carefully configured time-based windows (extensible when needed) that aggregate all Kafka records related to a specific customer transaction into its own “state store” for integrated processing. They also have configured their Kafka system to send all inter-related events to a single topic stream and single partition. This enables records to be published in the correct sequential order and thereby more easily reassembled into accurate and comprehensive transactions, even if the records have different formats. GetWell partitions by “entity key,” for example a logical identifier such as a membership or account number, to ensure that all inter-related data for a given transaction fall into the same partition.
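Again as a hedged sketch rather than GetWell’s actual code: in Kafka Streams this pattern maps to re-keying records by an entity key (so related records hash to the same partition), grouping them, and aggregating them over a time window into a named state store. The topic name, window sizes, key extraction and merge logic below are all assumptions, and default String serdes are assumed as in the previous sketch.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TransactionAssemblySketch {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> validated = builder.stream("member-changes-validated");

        validated
            // Re-key by a hypothetical entity key (e.g. membership or account number) so
            // that every record belonging to one transaction lands in the same partition
            // and is handled by the same stream task.
            .selectKey((key, value) -> extractEntityKey(value))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            // A time-based window collects all inter-related records for a transaction
            // before they are processed as a whole; the window size is illustrative.
            .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofSeconds(30), Duration.ofSeconds(5)))
            // Concatenation stands in for whatever merge logic rebuilds the complete
            // transaction; the result is kept in a named, queryable state store.
            .aggregate(
                () -> "",
                (entityKey, value, aggregate) -> aggregate.isEmpty() ? value : aggregate + "|" + value,
                Materialized.as("transaction-state-store"))
            .toStream()
            .foreach((windowedKey, assembled) ->
                System.out.printf("entity %s -> %s%n", windowedKey.key(), assembled));
    }

    private static String extractEntityKey(String value) {
        // Placeholder: real code would parse the record and return its membership/account key.
        return value.split("\\|", 2)[0];
    }
}
```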
GetWell has reduced the redundancy and cost of service rollouts and now answers dynamic business requirements more flexibly. They have used Kafka to achieve DataOps results that Toyota, the creator of lean manufacturing, would respect.