
Best Practices for Real Time Data Pipelines with Change Data Capture and Spark

It is no secret that the most compelling operational analysis is increasingly real-time rather than historical. By studying thousands or millions of transactions as they occur, you can measure sales performance, customer behavior, and plant efficiency right now. Advanced techniques such as machine learning make operational predictions ever more accurate.

All this requires real-time pipelines that acquire, transform, and analyze high volumes of transactional data with minimal latency at each stage. Picture a futuristic version of the Panama Canal that moves water between locks instantaneously as needed.

The Apache Spark open source in-memory processing engine has rightfully gained a lot of attention for its ability to accelerate the transformation and analysis stages, running on data stores such as S3 and HDFS. For example, the Spark Streaming API can process data within seconds as it arrives from the source or through a Kafka stream.
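To make that concrete, here is a minimal sketch in PySpark Structured Streaming of the kind of low-latency read Spark can perform against a Kafka stream. The broker address and topic name are placeholders, and the job assumes the Spark Kafka connector package is available; it is an illustration, not a production configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Subscribe to a hypothetical topic of incoming transactions; the broker and topic
# names are placeholders, and the spark-sql-kafka connector must be on the classpath.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load())

# Kafka keys and values arrive as bytes; cast them to strings for downstream parsing.
decoded = events.selectExpr("CAST(key AS STRING) AS key",
                            "CAST(value AS STRING) AS value")

# A short processing-time trigger illustrates second-level latency from arrival to output.
query = (decoded.writeStream
         .format("console")
         .trigger(processingTime="5 seconds")
         .start())
query.awaitTermination()
```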

But there is a problem: latency often lurks upstream.

Real-time processing on the analytics target does not generate real-time insights if the source data flowing into Kafka and Spark is hours or days old. This is the logjam that change data capture (CDC) technology can break, especially for transactional data. CDC acquires live database transactions and sends copies into the pipeline at near-zero latency, eliminating slow and bloated batch jobs. It also reduces production processing overhead and cloud transfer bandwidth requirements.

Together CDC and Spark can form the backbone of effective real-time data pipelines. Here are best practices you can employ to make this happen.

  • Accept that new analytics workloads break old-fashioned pipelines and platforms. Machine learning best illustrates the voracious appetite of new algorithms. The more customer transactions, supply-chain checks and clinical records you process, the more likely you are to identify the telling patterns and thereby make accurate predictions. Traditional ETL tools, data warehouses, and even the batch-oriented Hadoop MapReduce engine cannot keep pace with large, fast-moving datasets.
  • Select a data store that best supports your transactional datasets. The tradeoffs here can be tricky. While databases are optimized for structured data, object stores such as Amazon S3 and Azure Blob Storage, along with Hadoop platforms, can be more cost-effective and scalable. In addition, data engineers often can resolve data quality issues (gaps, etc.) more flexibly on HDFS and supporting Hadoop components. If you require ACID (Atomicity, Consistency, Isolation, and Durability) compliance, select either a database or Hive, which now supports ACID merge operations within the Hadoop stack. CDC and Spark can run on either platform.
  • Consider Apache Kafka or another streaming platform for ingestion before persisting the data to object storage. While not required for ingestion, Kafka message queues often support many sources and high data volumes more easily. They can also buffer incoming data for configurable time periods when your processing intervals differ from source production. High numbers of sources, high data volumes, and differing source-target time intervals are common with database transactions, which are generated continuously but analyzed periodically. (A sketch of this ingest-then-persist pattern follows this list.)
  • Limit the number of targets and streams when using Kafka message queues. Data landed via CDC might require additional logic to maintain transaction consistency and ensure ACID compliance based on source database schemas and record/row parameters. This processing overhead rises quickly when that logic must be applied across multiple consumers that use data from multiple topics and multiple producers.
  • Consider cloud packages to reduce deployment and management complexity. AWS, Azure, and Google Cloud offer Spark through managed services such as Amazon EMR, Azure HDInsight, and Google Dataproc. In addition, Databricks provides a Spark-based platform as a service on which you can configure your own data pipeline and analytics system. These approaches can eliminate the need to deploy and manage computing and storage resources on premises.
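The ingest-then-persist pattern mentioned above can be sketched in a few lines of PySpark: Kafka buffers the incoming change records, and a streaming job drains them to object storage on its own schedule. The broker, topic, and S3 paths below are placeholders, so treat this as a sketch under those assumptions rather than a reference implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-object-store").getOrCreate()

# Kafka buffers the incoming change records; this job consumes them on its own schedule.
# The broker address and topic name are placeholders.
changes = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "cdc.orders")
           .option("startingOffsets", "earliest")
           .load()
           .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

# Persist micro-batches to object storage (a hypothetical S3 bucket) in Parquet;
# the checkpoint location lets the job resume where it left off.
sink = (changes.writeStream
        .format("parquet")
        .option("path", "s3a://example-analytics-bucket/cdc/orders/")
        .option("checkpointLocation", "s3a://example-analytics-bucket/checkpoints/orders/")
        .trigger(processingTime="1 minute")
        .start())
sink.awaitTermination()
```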

Here are two examples of companies that apply these best practices.

Accelerating the Supply Chain for an On-Premises Data Lake

Decision makers at an international food industry leader, which we’ll call “Suppertime,” needed a current view and continuous integration of production capacity data, customer orders, and purchase orders to efficiently process, distribute, and sell tens of millions of chickens each week. But Suppertime struggled to bring together these large datasets, which were distributed across several acquisition-related silos within ten SAP enterprise resource planning (ERP) applications. Using nightly batch replication, they were unable to match orders and production line-item data fast enough. The delays slowed plant operational scheduling and prevented sales teams from filing accurate daily reports.

To streamline the process, Suppertime first selected the right data store, in their case a cost-effective and scalable Hadoop platform. They adopted a new Hortonworks data lake based on Spark and CDC. They now use CDC to efficiently copy SAP record changes every five seconds, decoding that data from complex source SAP pool and cluster tables. CDC injects these source updates, along with any metadata updates, into a Kafka message queue that buffers the high volumes of incoming messages and sends them on request to HDFS and HBase consumers in the data lake. Suppertime has limited the number of topics (a.k.a. streams) to 10, one per source table, to minimize processing overhead.
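The one-topic-per-source-table convention keeps consumer logic simple. Suppertime's actual tooling is not shown here; the sketch below uses the confluent_kafka Python client with hypothetical table and field names purely to illustrate the routing idea.

```python
import json
from confluent_kafka import Producer  # assumes the confluent-kafka client is installed

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder broker address

def publish_change(table_name, change_record):
    """Route each change record to a topic named after its source table, so
    downstream consumers can subscribe per table and keep processing logic simple."""
    topic = f"sap.{table_name.lower()}"  # hypothetical naming convention
    producer.produce(topic,
                     key=str(change_record.get("primary_key")),
                     value=json.dumps(change_record))

# Example: a made-up order-header change decoded from a source table.
publish_change("VBAK", {"primary_key": "0001234567", "op": "UPDATE", "net_value": 1250.00})
producer.flush()
```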

After the data arrives in HDFS and HBase, Spark in-memory processing helps match orders to production on a real-time basis and maintain referential integrity for purchase order tables. As a result, Suppertime has accelerated sales and product delivery with accurate real-time operational reporting. It has replaced batch loads with CDC to operate more efficiently and more profitably. 
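The order-to-production matching itself is the kind of in-memory join and aggregation Spark handles well at this scale. The sketch below assumes hypothetical HDFS paths and column names, not Suppertime's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("order-production-matching").getOrCreate()

# Hypothetical tables landed in the data lake by CDC; paths and columns are placeholders.
orders = spark.read.parquet("hdfs:///lake/sales_order_items")
production = spark.read.parquet("hdfs:///lake/production_line_items")

# Roll up what the plants have produced per product.
produced = production.groupBy("product_id").agg(F.sum("produced_qty").alias("produced_qty"))

# Match ordered quantities against production and flag shortfalls for scheduling.
matched = (orders.join(produced, "product_id", "left")
           .withColumn("produced_qty", F.coalesce(F.col("produced_qty"), F.lit(0)))
           .withColumn("shortfall", F.col("ordered_qty") - F.col("produced_qty")))

matched.filter("shortfall > 0").show()
```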

Real-Time Operational Reporting

Another example (adapted from a real case study) is a US private equity and venture capital firm that built a data lake to consolidate and analyze operational metrics from its portfolio companies. This firm, which we’ll call “StartupBackers,” opted to host its data lake in the Microsoft Azure cloud rather than taking on the administrative burden of an on-premises infrastructure.

CDC remotely captures updates and DDL changes from source databases (Oracle, SQL Server, MySQL, and DB2) at four locations in the United States. The CDC solution then sends that data through an encrypted File Channel connection over a wide area network (WAN) to a virtual machine–based replication engine in the Azure cloud. This replication engine publishes the data updates to Kafka and on to the Databricks file system on request, storing the messages in JSON format.

The Spark platform prepares the data in micro-batches for consumption by the HDInsight data lake, a SQL data warehouse, and various other internal and external subscribers. These targets subscribe to topics that are categorized by source table. With this CDC-based architecture, StartupBackers now efficiently supports real-time analysis without affecting production operations.
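A micro-batch fan-out of this kind can be sketched with Structured Streaming's foreachBatch. The JSON schema, topic prefix, and storage path below are assumptions for illustration; the real shape of each change record depends on the CDC tool in use.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("json-cdc-fanout").getOrCreate()

# Hypothetical shape of a JSON change record; the real schema depends on the CDC tool.
change_schema = StructType([
    StructField("source_table", StringType()),
    StructField("operation", StringType()),
    StructField("payload", StringType()),
    StructField("changed_at", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribePattern", "portfolio\\..*")        # assumed per-table topic prefix
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), change_schema).alias("c"))
          .select("c.*"))

# foreachBatch lets one stream feed several downstream subscribers per micro-batch;
# the storage path below is a placeholder for the data lake location.
def publish(batch_df, batch_id):
    batch_df.write.mode("append").parquet("/mnt/lake/cdc_changes/")
    # additional writes (for example, to a SQL data warehouse via JDBC) could go here

query = (parsed.writeStream
         .foreachBatch(publish)
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()
```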

Implemented effectively, change data capture can serve as a powerful foundation for modern transactional pipelines built on Spark. That real-time data canal, with efficient locks for immediate transfer, is achievable today.

These customer examples are adapted from Streaming Change Data Capture: A Foundation for Modern Data Architectures (O'Reilly, 2018).

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
