Aug 13, 2021 / by Kevin Petrie Decoding Data Software

How to Design Streaming Data Pipelines for Open, Distributed, and Elastic Cloud Platforms

To build the right bridge, you need to understand where it goes. And to build the right streaming data pipeline, you need to understand its target, which is increasingly a cloud data platform.

Cloud data platforms rank as the most popular pipeline targets these days because they underpin many strategic initiatives. Enterprise data teams use them to modernize their businesses through digital transformation, application modernization, and real-time or advanced analytics.

This blog post explains how to design streaming data pipelines for cloud data platforms. It examines the key characteristics of cloud data platforms (open, distributed, and elastic) and how streaming data pipelines support them. For example, streaming data pipelines must integrate data across various APIs and data formats. They also need to support the high throughput requirements of distributed cloud data platforms, and quickly add or change targets as those platforms grow over time. Data engineers should keep these requirements in mind as they design and implement their pipelines.

The streaming data pipeline

A streaming data pipeline is a workflow that extracts data updates from a source and loads them into a target in real-time increments. The pipeline might also transform the data in flight, for example by filtering or joining data streams. Modern streaming data pipelines support many sources and targets, and they automate repetitive tasks related to the configuration, deployment, and monitoring of data pipelines. Finally, streaming data pipelines must minimize their impact on source operations, for example by scanning source database logs rather than imposing additional change tables on the source server.
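The log-scanning approach described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `ChangeEvent` class, the in-memory `source_log`, and the dictionary standing in for the target table are all hypothetical names chosen for the example. It shows the core idea of reading only the log entries recorded since the last checkpoint and applying them to the target in order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class ChangeEvent:
    """One entry in a hypothetical source change log."""
    position: int                # log sequence position
    op: str                      # "insert", "update", or "delete"
    key: str                     # primary key of the affected row
    row: Optional[dict] = None   # new row image (None for deletes)


def read_increment(log: List[ChangeEvent], after: int) -> List[ChangeEvent]:
    """Return only the events recorded after the last loaded position."""
    return [event for event in log if event.position > after]


def apply_increment(target: Dict[str, dict],
                    events: Iterable[ChangeEvent],
                    checkpoint: int) -> int:
    """Apply events to the target in log order; return the new checkpoint."""
    for event in sorted(events, key=lambda e: e.position):
        if event.op == "delete":
            target.pop(event.key, None)
        else:
            # Inserts and updates both write the latest row image.
            target[event.key] = event.row
        checkpoint = event.position
    return checkpoint


# Assumed sample data: a tiny log of order changes.
source_log = [
    ChangeEvent(1, "insert", "order-1", {"status": "new"}),
    ChangeEvent(2, "insert", "order-2", {"status": "new"}),
    ChangeEvent(3, "update", "order-1", {"status": "paid"}),
    ChangeEvent(4, "delete", "order-2"),
]

target_table: Dict[str, dict] = {}
checkpoint = 0
checkpoint = apply_increment(
    target_table, read_increment(source_log, checkpoint), checkpoint
)
```

Because the pipeline remembers its checkpoint, the next run asks the log only for newer entries, which is what keeps the load on the source small.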

Cloud data platforms consume streaming data in several ways. For example, a cloud database might run a mobile ecommerce application that ingests orders from customers’ smartphones and processes their credit-card transactions. A cloud analytics platform, meanwhile, might use a machine learning model to analyze IoT sensor data and service records for delivery trucks in order to identify breakdown risks.

Cloud platform characteristics and pipeline requirements

Cloud data platforms include operational databases and analytics platforms. Let’s walk through the key characteristics of cloud data platforms and the requirements they create for streaming data pipelines.

Open. Many cloud data platforms run on open-source code, which means that enterprises can download, use, and modify the source code to meet their custom needs. For example, Databricks built its Delta Lake analytics platform on top of Spark, which its founders created. Even proprietary platforms such as Snowflake now integrate with Spark.

In addition, cloud data platforms provide open APIs such as open database connectivity (ODBC), Java database connectivity (JDBC) and Representational State Transfer (REST) to assist application integration. They use open data formats such as comma-separated values (CSV), Apache Parquet and Avro to make it easier for applications to share data.

Pipeline requirement: Streaming data pipeline tools need to integrate with common software components, open APIs, and data formats. This integration includes the ability to convert data formats, for example by automatically converting data to the popular Parquet format.
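To make the format-conversion requirement concrete, here is a minimal sketch using only the Python standard library. Writing Parquet requires a third-party library, so this example swaps in newline-delimited JSON as the output format; the conversion step would sit in the same place in a pipeline that emits Parquet. The function name `csv_to_json_lines` and the sample order data are assumptions made for illustration.

```python
import csv
import io
import json
from typing import Iterator


def csv_to_json_lines(csv_text: str) -> Iterator[str]:
    """Convert CSV text with a header row into one JSON document per record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # csv yields every value as a string; a real pipeline would also
        # map each column to the type expected by the target format.
        yield json.dumps(row, sort_keys=True)


# Assumed sample data: two orders exported as CSV.
sample_csv = "order_id,region,amount\n1001,asia,25.50\n1002,emea,12.00\n"
json_lines = list(csv_to_json_lines(sample_csv))
```

The generator converts one record at a time, so the same function can sit in the middle of a stream without buffering the whole input.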

Distributed. Cloud data platforms, cloud databases in particular, distribute data (that is, partition or replicate it) to meet performance, availability, and sovereignty requirements. For example, you might partition a customer table so that each customer's records (table rows) reside on a cloud server in that customer's geographical region, which improves performance. In aggregate, these sprawling platforms consume large rivers of data.

Pipeline requirement: Cloud data platforms typically distribute data using their own native capabilities. However, streaming data pipeline tools must avoid becoming a bottleneck as they feed external data into these distributed platforms. They need a scalable architecture to handle rising data volumes, for example by automatically adding compute nodes and replication threads to increase throughput.
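One way a pipeline can avoid becoming a bottleneck is to split incoming records by partition key and load the partitions concurrently. The sketch below is illustrative only: `load_partition` is a hypothetical stand-in for a write to a regional target, and the worker count is a fixed parameter here, whereas the tools described above would adjust nodes and threads automatically.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def partition_by_region(records: List[dict]) -> Dict[str, List[dict]]:
    """Group records by their region so each group can be loaded separately."""
    partitions: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        partitions[record["region"]].append(record)
    return dict(partitions)


def load_partition(region: str, records: List[dict]) -> int:
    """Hypothetical stand-in for writing to a regional target; returns rows loaded."""
    return len(records)


def load_in_parallel(records: List[dict], max_workers: int = 4) -> Dict[str, int]:
    """Load every partition on its own worker thread and report rows per region."""
    partitions = partition_by_region(records)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            region: pool.submit(load_partition, region, rows)
            for region, rows in partitions.items()
        }
        return {region: future.result() for region, future in futures.items()}


# Assumed sample data: three orders across two regions.
orders = [
    {"order_id": 1001, "region": "asia"},
    {"order_id": 1002, "region": "emea"},
    {"order_id": 1003, "region": "asia"},
]
loaded = load_in_parallel(orders)
```

Raising `max_workers` is the knob that a scalable pipeline would turn as data volumes grow.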

Elastic. Cloud data platforms integrate their data models and applications with underlying storage and compute infrastructure owned by cloud providers such as AWS, Azure, and Google. They scale their usage of this elastic infrastructure up or down based on changing business needs. For example, they might increase compute resources to handle a boom in sales, or add storage resources to support regional expansion.

Pipeline requirement: Streaming data pipeline tools need to quickly add or change targets to integrate with cloud data platforms as they grow. If you launch a new ecommerce site or factory in Asia, you need to configure streaming data pipelines that ingest data into the new database targets that support those new operations.
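A pipeline that treats its targets as configuration rather than fixed code can pick up a new target without a redesign. The following sketch assumes a hypothetical `TargetRegistry` class and uses plain Python lists as stand-ins for database targets; it is one possible shape for this idea, not a description of any particular tool.

```python
from typing import Callable, Dict, List

Writer = Callable[[dict], None]


class TargetRegistry:
    """Hypothetical registry that fans each record out to every active target."""

    def __init__(self) -> None:
        self._writers: Dict[str, Writer] = {}

    def add_target(self, name: str, writer: Writer) -> None:
        """Register a new target; later records are delivered to it as well."""
        self._writers[name] = writer

    def remove_target(self, name: str) -> None:
        """Stop delivering records to the named target."""
        self._writers.pop(name, None)

    def publish(self, record: dict) -> None:
        """Send one record to every registered target."""
        for writer in self._writers.values():
            writer(record)


# In-memory lists stand in for real database targets.
us_db: List[dict] = []
asia_db: List[dict] = []

registry = TargetRegistry()
registry.add_target("us", us_db.append)
registry.publish({"order_id": 1001})

# A new operation launches in Asia: add its target while the pipeline runs.
registry.add_target("asia", asia_db.append)
registry.publish({"order_id": 1002})
```

After the second `publish` call, the original target holds both orders and the new target holds only the order that arrived after it was added.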

Summary

Streaming data pipelines must align with the open, distributed, and elastic characteristics of the cloud data platforms they feed. Data engineers who design pipelines with these targets in mind will help their business stream to the clouds and beyond.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
