Modern Data Pipelines: Three Principles for Success
ABSTRACT: For successful data pipelines, enterprises should learn from innovative startups, use suites where they can, and use point tools where they must.
Data often needs to move before it can yield value. Most enterprises place their data on multiple clouds as well as their own data center. They might replicate operational data to one cloud platform for analytics, then shift to another platform as business requirements evolve. They continuously manipulate, migrate, and synchronize data across these changing environments.
So data pipelines and the tools that manage them become a center of gravity in their own right. To meet dynamic business requirements, data teams must treat their pipelines as a living organism, and select flexible tools that can add, change, and remove pipelines to keep that organism healthy. This blog defines the market for data pipeline products and recommends three principles for using them successfully: (1) watch the innovative startups; (2) use suites where you can; and (3) use point tools where you must.
The data pipeline market comprises four segments: data ingestion, data transformation, DataOps, and orchestration. In each segment, vendors offer tools that standardize and automate tasks to simplify the data engineering involved. These tools, some of which incorporate open-source code, offer an efficient alternative to homegrown solutions. They manage data, schema, and metadata that flow from source to target across a mix of on-premises and cloud environments.
Let’s define these market segments, from the bottom up.
Data ingestion tools help configure, execute, and monitor the extraction and loading of data—both in batch and real-time increments—between platforms. Vendors in this category include Qlik, Fivetran, and Arcion.
Transformation tools help design and manage jobs that merge, format, filter, or otherwise prepare data for analytics and data-driven applications. Vendors such as dbt Labs and Coalesce offer tools in this category.
DataOps tools help build pipelines with continuous integration and continuous delivery (CI/CD) of code as well as testing and monitoring. DataOps.live and DataKitchen offer DataOps tools.
Orchestration tools help manage workflows across pipelines and the applications that consume their outputs. Prefect and Astronomer (with its package for Apache Airflow) play in this category.
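To make the four segments concrete, here is a minimal Python sketch of how they fit together in one toy pipeline. It is an illustration only, not any vendor's API: the in-memory `SOURCE_ROWS` and `warehouse`, the task names, and the `run_pipeline` helper are all hypothetical. The standard-library `graphlib` module stands in for an orchestrator by running tasks in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory "source" and "warehouse"; a real pipeline would
# read from and write to databases or cloud storage instead.
SOURCE_ROWS = [
    {"order_id": 1, "amount": "19.99", "status": "paid"},
    {"order_id": 2, "amount": "5.00", "status": "refunded"},
    {"order_id": 3, "amount": "42.50", "status": "paid"},
]
warehouse = {}


def ingest():
    """Ingestion: copy raw rows from source to target unchanged."""
    warehouse["raw_orders"] = [dict(row) for row in SOURCE_ROWS]


def transform():
    """Transformation: filter and format raw rows for analytics."""
    warehouse["paid_orders"] = [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in warehouse["raw_orders"]
        if r["status"] == "paid"
    ]


def check_output():
    """DataOps-style test: fail the run if the output looks wrong."""
    rows = warehouse["paid_orders"]
    assert rows, "paid_orders is empty"
    assert all(r["amount"] > 0 for r in rows), "non-positive amount found"


TASKS = {"ingest": ingest, "transform": transform, "check_output": check_output}

# Orchestration: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "ingest": set(),
    "transform": {"ingest"},
    "check_output": {"transform"},
}


def run_pipeline():
    """Run every task in dependency order and return the order used."""
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name]()
    return order


if __name__ == "__main__":
    print(run_pipeline())
    print(warehouse["paid_orders"])
```

Commercial tools add what this sketch leaves out, such as connectors, scheduling, retries, lineage, and monitoring, which is why they offer an efficient alternative to homegrown code.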
Now let’s explore three principles for putting these tools to good use.
1. Watch the startups
For innovative ideas about managing data pipelines, look to startups. Over the last decade many entrepreneurs built their digital businesses from day one on the cloud. Their backers include marquee VC firms, tech gorillas, and even celebrities. While they have SQL or Python coders build pipelines, they also empower business owners to consume data themselves. These cloud-native startups use advanced pipeline tools to drive innovation in areas such as customer analytics, data democratization, and data monetization.
Figure: Cloud-native startups use data pipelines to support their digital businesses and drive innovation
Acorns personifies this generation of cloud-native startups and their approach to data pipelines. Acorns offers a digital platform that 10 million customers have used to squirrel away $15 billion in savings, for example by rounding up small purchase amounts and investing the spare change. Backed by investors such as BlackRock, PayPal, and Dwayne Johnson, Acorns seeks to help consumers make smart, data-driven decisions every day, with benefits that accumulate over time.
To make this work, Acorns must study every digital step of the prospect and customer journey, which it does by analyzing both IT events and business transactions. Acorns’ data engineers stream event logs from the mobile application and website via RudderStack into a lakehouse, then create transformations with dbt that prepare the data for analysts to explore in Tableau. Analysts then query the assembled, refined data to examine users’ experiences. Did a customer create an account but choose not to subscribe to a service? What pages did they view before making that decision? By answering such questions, Acorns can convert, upsell, and delight a higher proportion of customers.
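The kind of question analysts ask of this event data can be sketched in a few lines of Python. This is a hedged illustration, not Acorns' actual schema or queries: the `EVENTS` records, their field names, and the `pages_viewed_by_non_subscribers` function are hypothetical. The sketch finds users who created an account but never subscribed, and lists the pages each one viewed in time order.

```python
from collections import defaultdict

# Hypothetical event log; field names and values are assumptions.
EVENTS = [
    {"user": "u1", "ts": 1, "event": "page_view", "page": "/home"},
    {"user": "u1", "ts": 2, "event": "account_created", "page": None},
    {"user": "u1", "ts": 3, "event": "page_view", "page": "/pricing"},
    {"user": "u2", "ts": 1, "event": "account_created", "page": None},
    {"user": "u2", "ts": 2, "event": "page_view", "page": "/plans"},
    {"user": "u2", "ts": 3, "event": "subscribed", "page": None},
    {"user": "u3", "ts": 1, "event": "page_view", "page": "/home"},
]


def pages_viewed_by_non_subscribers(events):
    """Map each user who created an account but never subscribed
    to the pages they viewed, in time order."""
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: (e["user"], e["ts"])):
        by_user[e["user"]].append(e)

    result = {}
    for user, user_events in by_user.items():
        names = {e["event"] for e in user_events}
        if "account_created" in names and "subscribed" not in names:
            result[user] = [
                e["page"] for e in user_events if e["event"] == "page_view"
            ]
    return result


if __name__ == "__main__":
    print(pages_viewed_by_non_subscribers(EVENTS))
```

In practice this logic would more likely live in a SQL model over the lakehouse tables, with a BI tool on top; the Python version just shows the shape of the question.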
To get a broader view of how startups innovate with data pipelines, check out the many customer sessions at dbt’s Coalesce 2022 event.
2. Use tool suites where you can
Data teams can boost productivity by managing multiple pipeline categories with one tool. Vendors, including many of those listed above, seek to help with converged products or integrations that help prepare data for analytics. This approach makes sense for cloud-based environments in particular because they have more accessible and homogeneous datasets. For example:
Keboola addresses all four categories, with a focus on CI/CD and orchestration. Keboola offers a “marketplace” in which hundreds of contributors have created and shared reusable pipeline artifacts. Data analysts and even business managers use Keboola to create quick pipelines for analytics projects.
Rivery also addresses all four categories, with a focus on ingestion and transformation. Enterprises use Rivery to integrate data from a wide range of SaaS applications, including digital ad platforms, and prepare it for use cases such as customer analytics.
Like many vendors, Fivetran partners with dbt to help data engineers and analytics engineers combine ingestion with transformation. They can use Fivetran to extract incremental data updates from software as a service (SaaS) applications and load them into a cloud data warehouse. This triggers an automated transformation job by dbt.
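The pattern described here, an incremental load that triggers a transformation, can be sketched in plain Python. This does not use the Fivetran or dbt APIs: the `Warehouse` class, the `updated_at` cursor field, and the `sync` function are hypothetical stand-ins. The sketch extracts only records changed since the last run, loads them, and reruns the transformation only when something new arrived.

```python
# Hypothetical SaaS records; "updated_at" is an assumed change-tracking field.
SAAS_RECORDS = [
    {"id": 1, "updated_at": 100, "plan": "free"},
    {"id": 2, "updated_at": 105, "plan": "pro"},
    {"id": 3, "updated_at": 110, "plan": "free"},
]


class Warehouse:
    """Toy stand-in for a cloud data warehouse."""

    def __init__(self):
        self.raw = {}          # raw table, keyed by record id
        self.plan_counts = {}  # transformed table for analytics
        self.cursor = 0        # highest updated_at value loaded so far


def extract_incremental(records, cursor):
    """Return only the records that changed after the saved cursor."""
    return [r for r in records if r["updated_at"] > cursor]


def load(wh, rows):
    """Upsert rows into the raw table and advance the cursor."""
    for r in rows:
        wh.raw[r["id"]] = dict(r)
    if rows:
        wh.cursor = max(r["updated_at"] for r in rows)
    return len(rows)


def transform(wh):
    """Rebuild the analytics table from the raw table."""
    counts = {}
    for r in wh.raw.values():
        counts[r["plan"]] = counts.get(r["plan"], 0) + 1
    wh.plan_counts = counts


def sync(wh, records):
    """Run one incremental sync; trigger the transform only on new data."""
    loaded = load(wh, extract_incremental(records, wh.cursor))
    if loaded:
        transform(wh)
    return loaded


if __name__ == "__main__":
    wh = Warehouse()
    print(sync(wh, SAAS_RECORDS), wh.plan_counts)
    print(sync(wh, SAAS_RECORDS), wh.plan_counts)
```

Skipping the transformation when nothing was loaded is a design choice in this sketch; it avoids recomputing downstream tables on empty syncs.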
3. Use point tools where you must
Large enterprises often have complex pipeline requirements that require specialized tools. This holds true across market segments. For example, Fivetran and Airbyte specialize in incremental data extraction from SaaS applications, while Qlik specializes in extraction from mainframe, SAP, and iSeries sources. The Airflow orchestration tool, meanwhile, helps data engineers and developers process batches of multi-sourced data for analytics and operational workflows. And Prefect specializes in helping data scientists orchestrate complex data pipelines for AI/ML projects with changing requirements. Tools such as these enable various stakeholders to simplify how they handle complex interfaces, formats, and interdependencies.
All four segments of the data pipeline market (ingestion, transformation, DataOps, and orchestration) are rich with innovation. Data teams can capitalize on this innovation and deliver more value to their business by observing the three principles outlined here. They should (1) observe and adopt the cutting-edge approaches of cloud-first startups, (2) employ consolidated tool suites to reduce complexity where possible, and (3) still invest in specialized tools to handle tricky integrations and sophisticated tasks. To learn more about this topic, be sure to register for our "CDO TechVent for Modern Data Pipelines: Practices and Products You Need to Know" on March 30.