Data Pipeline Design Patterns
Design patterns in software engineering are repeatable solutions to common software design requirements. A design pattern is an abstraction that does not translate directly into executable code. It is a problem-solving template that can be used as the foundation to design a solution. There are many benefits to using design patterns, including reuse of thought processes, foundations for reusable code, accelerated design and development, and greater consistency and ease of maintenance. Design patterns are used extensively in software engineering. Perhaps it is time to adopt pipeline design patterns as a core discipline of data engineering. Reusable, accelerated, consistent—those characteristics are certainly among the things needed to break the data engineering logjam.
DATA PIPELINE VARIETY
The purpose of a data pipeline is to move data from an origin to a destination. There are many different kinds of data pipelines: integrating data into a data warehouse, ingesting data into a data lake, flowing real-time data to a machine learning application, and many more. The variation in data pipelines depends on several factors that influence the shape of the solution. These factors include:
- Purpose—What is the reason for the pipeline? Moving data from what source(s) to what destination? Once only or repeating?
- Data—What is the expected data volume? Is the data stored or streaming? Is it structured or unstructured?
- Usage—For what use cases is the data intended? Reporting? BI? Analytics? Data science? AI/ML? Automation or robotics?
Considering these factors, it is obvious that there isn’t a one-size-fits-all data pipeline. The data flows that load data from an ERP system into a data warehouse are distinct from those that feed a real-time dashboard used to monitor a manufacturing facility. It is unlikely, however, that you’ll encounter a data pipeline so unique that it has no similarities to other data pipelines. Those similarities are the basis of design patterns. With that in mind, I propose eight fundamental data pipeline design patterns as a practical place to start bringing the discipline of design patterns to data engineering.
RAW DATA LOAD
Figure 1. Raw Data Load
A raw data load pipeline, as illustrated in figure 1, is built to move data from one database to another. These pipelines perform the bulk data movement needed for the initial loading of a database such as a data warehouse, or for migrating data from one database to another (from on-premises to cloud, for example). The term “raw data” implies data that has not been modified, so the raw data load pipeline pattern consists of two processes, extract and load, with no data transformation processing. Raw data loads can be slow and time-consuming, especially with large data volumes. These pipelines are typically built for one-time execution and are rarely run on a periodic or recurring schedule.
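As a rough illustration, the sketch below moves rows between two SQLite databases in batches with no transformation step. The database connections, the “customers” table, and its columns are placeholders, and the in-memory setup is included only so the example runs end to end; a real pipeline would connect to existing source and target databases.

```python
import sqlite3

# Minimal raw data load sketch: extract rows from a source database and load
# them unchanged into a target database. Table and column names are
# placeholders for illustration.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# Setup purely so the sketch runs end to end; a real pipeline connects to
# existing source and target databases instead.
source.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, email TEXT);
    INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com');
""")
target.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")

BATCH_SIZE = 10_000
cursor = source.execute("SELECT id, name, email FROM customers")
while True:
    rows = cursor.fetchmany(BATCH_SIZE)   # stream in batches to bound memory use
    if not rows:
        break
    # Extract and load only: rows are moved exactly as-is, with no transformation.
    target.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
target.commit()
```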
ETL AND VARIATIONS
Figure 2. ETL
Extract-Transform-Load (ETL), as shown in figure 2, is the most widely used data pipeline pattern. From the early 1990s it was the de facto standard for integrating data into a data warehouse, and it continues to be a common pattern for data warehousing, data lakes, operational data stores, and master data hubs. Data is extracted from a data store such as an operational database, then transformed to cleanse, standardize, and integrate it before loading into a target database. ETL is executed as scheduled batch processing, and data latency is inherent in batching. Mini-batch and micro-batch processing help to reduce that latency, but zero-latency ETL is not practical. ETL works well when complex data transformations are required. It is especially well suited to data integration when the data sources are not all ready at the same time. As each individual source becomes ready, it is extracted independently of the other sources. When all source extracts are complete, processing continues with the transformation and loading of the entire set of data.
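The sketch below illustrates that flow in Python, with two hypothetical in-memory extracts standing in for operational sources and a print statement standing in for the warehouse load. The table shapes, field names, and cleansing rules are assumptions made for illustration, not a prescribed implementation.

```python
from datetime import date

# Minimal ETL sketch with two hypothetical sources. Extraction happens per
# source as each becomes ready; transformation and loading run once all
# extracts are complete, mirroring the batch flow described above.

def extract_orders():
    # Stand-in for a query against an operational database.
    return [{"order_id": 1, "cust": " Ada ", "amount": "19.99", "dt": "2024-01-05"}]

def extract_customers():
    return [{"cust": "ada", "region": "EMEA"}]

def transform(orders, customers):
    regions = {c["cust"].strip().lower(): c["region"] for c in customers}
    cleaned = []
    for o in orders:
        key = o["cust"].strip().lower()            # cleanse and standardize keys
        cleaned.append({
            "order_id": o["order_id"],
            "customer": key,
            "amount": float(o["amount"]),          # standardize types
            "order_date": date.fromisoformat(o["dt"]),
            "region": regions.get(key),            # integrate across sources
        })
    return cleaned

def load(rows):
    for row in rows:                               # stand-in for a warehouse insert
        print("loading", row)

# Staged extracts complete before transform/load of the full data set.
load(transform(extract_orders(), extract_customers()))
```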
Figure 3. Streaming ETL
The ETL pattern that is typically used with stored data can also be applied, with some variation, when the source is a data stream. (See figure 3.) Instead of extracting from a data store, streaming data is parsed to separate individual events into distinct records, then filtered to reduce the data set to only the events and fields of interest for the use case. Parsing and filtering are followed by transformation, and the data is then loaded into a data lake. Batch processing is replaced by stream processing on a platform such as Apache Kafka. This pattern is particularly useful for machine learning use cases, which often focus on only a few fields in much larger data sets.
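A minimal sketch of the streaming variation follows. An in-memory list of JSON messages stands in for a stream consumer (in practice a client for a platform such as Kafka would supply the messages) and a local file stands in for the data lake; all event and field names are illustrative.

```python
import json

# Minimal streaming ETL sketch: parse each raw message into a record, filter
# to the events of interest, lightly transform, and append to a file that
# stands in for a data lake.

raw_stream = [
    '{"event": "temp_reading", "sensor": "s1", "celsius": 21.4, "fw": "1.0.3"}',
    '{"event": "heartbeat", "sensor": "s1"}',
    '{"event": "temp_reading", "sensor": "s2", "celsius": 85.0, "fw": "1.0.3"}',
]

def parse(message):
    return json.loads(message)                    # isolate one event per record

def keep(event):
    return event.get("event") == "temp_reading"   # filter to events the use case needs

def transform(event):
    return {                                      # keep only the fields of interest
        "sensor": event["sensor"],
        "fahrenheit": event["celsius"] * 9 / 5 + 32,
    }

with open("temps.jsonl", "a") as sink:            # placeholder for a data lake sink
    for message in raw_stream:
        event = parse(message)
        if keep(event):
            sink.write(json.dumps(transform(event)) + "\n")
```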
Figure 4. ELT
Extract-Load-Transform (ELT), as shown in figure 4, is a variation on ETL conceived to offset some of the latency that results from pure ETL processing. Waiting for all transformation work to complete delays the availability of data for business use; loading immediately after extraction and then transforming the data in place reduces that delay. ELT accelerates data availability most when multiple sources are not ready for extraction at the same time, sources that ETL processing would otherwise hold in a staging area. With ELT the data warehouse serves as the staging area, so each source becomes available for use as soon as it is extracted and loaded. Data transformation is performed in place in the data warehouse once all sources are loaded. Data quality and data privacy are sometimes concerns with ELT processing: when data is made available for use without transformation, it is widely exposed before data cleansing and the masking or obfuscation of sensitive data have been performed.
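The following sketch illustrates ELT using SQLite as a stand-in warehouse: raw extracts land in staging tables as soon as they are available, and the transformation then runs in place as SQL once all sources have arrived. The table and column names are assumptions for illustration.

```python
import sqlite3

# Minimal ELT sketch. Raw extracts are loaded immediately into staging tables;
# cleansing and integration run afterward, inside the warehouse, as SQL.
wh = sqlite3.connect(":memory:")
wh.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, cust TEXT, amount TEXT);
    CREATE TABLE raw_customers (cust TEXT, region TEXT);
""")

# Load step: each source lands as-is, as soon as it is extracted.
wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, " Ada ", "19.99"), (2, "bob", "5.00")])
wh.executemany("INSERT INTO raw_customers VALUES (?, ?)",
               [("ada", "EMEA"), ("bob", "AMER")])

# Transform step: cleansing and integration happen in place in the warehouse.
wh.executescript("""
    CREATE TABLE fact_orders AS
    SELECT o.order_id,
           LOWER(TRIM(o.cust))    AS customer,
           CAST(o.amount AS REAL) AS amount,
           c.region
    FROM raw_orders o
    LEFT JOIN raw_customers c ON LOWER(TRIM(o.cust)) = c.cust;
""")
print(wh.execute("SELECT * FROM fact_orders").fetchall())
```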
Figure 5. ETLT
The Extract-Transform-Load-Transform (ETLT) pattern shown in figure 5 is a hybrid of ETL and ELT. Each source is extracted when it is ready. A first stage of “light” transformations is performed before the data is loaded. These first-stage transformations are limited to a single data source and are independent of all other sources; data cleansing, format standardization, and masking of sensitive data are typical examples. Each data source becomes available for use quickly, but without the quality and privacy risks of ELT. Once all sources have been loaded, a second stage of transformation performs integration and other work that depends on multiple sources, in place in the data warehouse.
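A brief sketch of the first-stage, single-source transformations is shown below. The hashed email, trimmed name, and type cleansing are illustrative examples of “light” transforms; the second-stage, multi-source integration would then run in the warehouse, much as in the ELT sketch above.

```python
import hashlib

# Minimal ETLT first-stage sketch: single-source "light" transforms that run
# before loading and have no dependency on other sources. Field names and
# masking choices are placeholders for illustration.

def first_stage(record):
    return {
        "cust": record["cust"].strip().lower(),        # format standardization
        "email_hash": hashlib.sha256(                  # mask sensitive data before load
            record["email"].encode()).hexdigest(),
        "amount": float(record["amount"]),             # cleanse types
    }

print(first_stage({"cust": " Ada ", "email": "ada@example.com", "amount": "19.99"}))
```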
DATA VIRTUALIZATION
Figure 6. Data Virtualization
Data virtualization, illustrated in figure 6, serves data pipelines differently than the other patterns. Most pipelines create physical copies of data in a data warehouse, data lake, data feed for analytics, etc. Virtualization delivers data as views without physically storing a separate version for the use case. Virtualization works with layers of data abstraction. The source layer is the least abstract, providing connection views through which the pipeline sees the content of data sources. The integration layer combines and connects data from disparate sources providing views similar to the transformation results of ETL processing. The business layer presents data with semantic context, and the application layer structures the semantic view for specific use cases.
Unlike ETL processing, which is initiated by a schedule, virtualization processes are initiated by a query. The query is issued in the application and semantic context and is translated through the integration and source layers to connect with the right data sources. The response traverses the path in the opposite direction to acquire source data, transform and integrate it, present a semantic view, and deliver an application view of the data. Virtualization can work well when people want the freshest data possible. The data sources determine the degree of data freshness, and they also determine how much historical data is available.
Virtualization has the distinct advantage of integrating and transforming only the data that is actually requested and used, rather than all data that might be requested. It works well with relatively simple transformations and modest data volumes, but may struggle to perform with complex transformations or when large amounts of data are needed to respond to a query.
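To make the layering concrete, the sketch below uses SQLite views as stand-ins for the source, integration, and business/application layers. A query against the top view is resolved down through the layers at query time, so nothing is physically copied, and results are only as fresh as the underlying sources. The schemas and view names are purely illustrative.

```python
import sqlite3

# Minimal data virtualization sketch: layered views over source tables, with
# no physical copies of integrated data. All names are placeholders.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE crm_customers (id INTEGER, name TEXT, region TEXT);
    CREATE TABLE erp_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO crm_customers VALUES (1, 'Ada', 'EMEA');
    INSERT INTO erp_orders VALUES (10, 1, 19.99);

    -- Source layer: connection views over each data source
    CREATE VIEW src_customers AS SELECT id, name, region FROM crm_customers;
    CREATE VIEW src_orders    AS SELECT order_id, customer_id, amount FROM erp_orders;

    -- Integration layer: combine disparate sources
    CREATE VIEW int_customer_orders AS
    SELECT c.name, c.region, o.amount
    FROM src_customers c JOIN src_orders o ON o.customer_id = c.id;

    -- Business/application layer: semantic view shaped for a use case
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue FROM int_customer_orders GROUP BY region;
""")

# The query below is answered by traversing the layers at query time.
print(db.execute("SELECT * FROM revenue_by_region").fetchall())
```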
DATA STREAM PROCESSING
Figure 7. Stream Processing
Stream processing, as shown in figure 7, has two similar but slightly different patterns. In both, the data origin is a stream with a continuous flow of event data in chronological sequence. Processing begins by parsing the events to isolate each one as a distinct record. Individual events can then be evaluated to select only those appropriate to the use case. In many cases, especially with large data volumes or a high percentage of unneeded events, it is desirable to push parsing and selection to the edge of the network, close to the sensors where event data is captured, to avoid moving unneeded data across the network.
At the destination end of the data flow, the two patterns diverge slightly. For some use cases, the goal is to post events to a message queue, an event log, or an events archive. When events are posted, they become the origin or input to a downstream data pipeline where data consumption is somewhat latent. For other use cases, the goal is to push events to a monitoring or alerting application where information about the state of a machine or other entity is delivered in real time. Stream processing pipelines typically work with sensors and Internet of Things (IoT) data using technology that is optimized for high-volume, fast-moving data.
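The sketch below illustrates both destination variants with simple stand-ins: a Python queue for the message queue or event log, and a print-based alert function for the real-time monitoring application. Event types, field names, and the alert threshold are assumptions made for illustration.

```python
import json
import queue

# Minimal stream processing sketch showing the two destination variants.
# Parsing and selection run close to the source; selected events are either
# posted to a queue (stand-in for a message broker or event log) for latent
# downstream consumption, or pushed immediately to an alerting function.
event_log = queue.Queue()              # stand-in for a message queue / event log

def alert(event):                      # stand-in for a real-time monitoring app
    print(f"ALERT: {event['machine']} temperature {event['celsius']}C")

raw_stream = [
    '{"type": "temp", "machine": "press-4", "celsius": 71.2}',
    '{"type": "heartbeat", "machine": "press-4"}',
    '{"type": "temp", "machine": "press-7", "celsius": 96.5}',
]

for message in raw_stream:
    event = json.loads(message)        # parse: one distinct record per event
    if event.get("type") != "temp":    # select only events the use case needs
        continue
    event_log.put(event)               # variant 1: post for downstream pipelines
    if event["celsius"] > 90:          # variant 2: push to real-time alerting
        alert(event)
```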
CHANGE DATA CAPTURE
Figure 8. Change Data Capture
The Change Data Capture (CDC) pattern shown in figure 8 is a technique for increasing the freshness of data that is typically latent. Many of the operational data sources that flow into data warehouses and data lakes are processed with ETL, which is inherently latent. CDC technology, applied to those operational databases, identifies data changes as they occur and then delivers information about them in one of two forms. Pushing changes to a message queue makes them available for downstream processes to consume in mini- or micro-batch mode; CDC with a message queue can reduce data warehouse latency from days to hours or minutes. Streaming changes directly to the destination database makes them available immediately, shifting from batch latency to real-time data ingestion.
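As a simplified illustration of the idea, the sketch below captures changes by polling a timestamp column against a watermark and pushing change records onto a queue for mini- or micro-batch consumers. Production CDC tools typically read the database transaction log rather than polling, and all table, column, and field names here are placeholders.

```python
import sqlite3
import queue

# Minimal CDC sketch using a timestamp-watermark poll against SQLite. Each
# poll picks up only rows changed since the last watermark and pushes change
# records onto a queue for downstream consumers.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT, updated_at TEXT);
    INSERT INTO customers VALUES (1, 'ada@example.com', '2024-01-05T10:00:00');
    INSERT INTO customers VALUES (2, 'bob@example.com', '2024-01-05T11:30:00');
""")

changes = queue.Queue()                  # stand-in for a downstream message queue
watermark = "2024-01-05T10:30:00"        # last change already captured

rows = db.execute(
    "SELECT id, email, updated_at FROM customers "
    "WHERE updated_at > ? ORDER BY updated_at",
    (watermark,),
).fetchall()

for row_id, email, updated_at in rows:
    changes.put({"op": "upsert", "id": row_id, "email": email, "ts": updated_at})
    watermark = updated_at               # advance the watermark as changes are captured

print(changes.qsize(), "change(s) queued; new watermark:", watermark)
```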
USING THE PATTERNS
I encourage data engineers and data architects to enthusiastically adopt pipeline design patterns as a core discipline. You can begin with the set of design patterns described here and expand based on your experiences and needs. If you’re a data engineer tasked with developing data pipelines, start by knowing the data characteristics and the use cases for each pipeline. Then choose the best-fit pattern and look for reusable code or a code framework to implement it. If you find a framework, use it and perhaps improve it. If you don’t find one, then make the effort to produce one as a by-product of developing the pipeline. If you’re a data architect, take on the responsibility to build and publish frameworks. You may need to adapt existing frameworks, creating multiple frameworks for a single pattern, perhaps with different frameworks for different environments, technologies, or programming languages.
One final thought: If you’re a software vendor providing data pipeline technology, consider making design patterns a central part of your functionality and user interface. Offer a library of design patterns and corresponding code frameworks that supports all of the common patterns. Then include functions for users to build and embed their own custom patterns and frameworks.