Diving into DataOps: The Underbelly of Modern Data Pipelines

DataOps is a fast-growing discipline designed to tame unruly data pipelines that sprawl across most corporate landscapes. (See our recent blog and report which define DataOps and examine its benefits.)

Once upon a time… In the good old days, there were just two data pipelines: a data warehouse to support general purpose reporting and analysis activities and a financial reporting pipeline that produced audited numbers for investors and the board. Today, with the growth of self-service analytics and machine learning, companies have as many pipelines as they have data analysts, data scientists, and data-hungry applications. Each requires specialized data sets and data access rights to produce content. It’s pipeline-palooza!

Without DataOps, each data pipeline becomes a data silo, with little or no relation to other data pipelines, data sets, and data producers. There is no collaboration or reuse, lots of manual effort and rework, large numbers of errors and data defects, and exasperatingly slow delivery times. Business users don’t trust any data except their own, and many make decisions with little or no data—they just can’t wait any longer.

DevOps to the Rescue. The world of software engineering was plagued with similar issues until it introduced agile development and DevOps techniques. Today, DevOps pioneers, such as Amazon, Google, and LinkedIn, deploy software releases daily, if not hourly or faster—a development cadence unfathomable a few years ago. Amazingly, even though cycle times have accelerated, software bugs and defects have declined. The advent of containerization and microservices will further accelerate and harden software delivery cycles. In short, DevOps delivers better code, faster, at less cost.

Now For DataOps

The Data Challenge. Seeing an opportunity to break its own logjam, the data world is latching on to DevOps principles with a twist. Whereas DevOps manages the interplay of code, tools, and infrastructure to accelerate the delivery of application functionality, DataOps adds a fourth element—data—which is as unruly as the other three combined! In every pipeline, data must be identified, captured, formatted, tagged, validated, profiled, cleaned, transformed, combined, aggregated, secured, cataloged, governed, moved, queried, visualized, analyzed, and acted upon. Phew! And these tasks are getting more complex as organizations accumulate mountains of data from hundreds of sources.

Tools and People. Moreover, there are specialized tools to manage each of these tasks. They range from traditional ETL/ELT, data quality, and master data management tools, to data catalogs, data preparation, and data governance products, to reporting, data visualization, and data science tools. Each of these tools is targeted at different types of users: from systems engineers and database administrators in the IT department, to data engineers, data stewards, and report developers in the BI team, to data analysts, data scientists, and data consumers in business departments.

Coordinating all these tools, technologies, and people is a huge endeavor, especially in large organizations with sizable development teams, huge volumes of data from hundreds of sources, and large numbers of data analysts and data scientists in the field. This is where DataOps comes in.

DataOps Framework

Defining DataOps is as hard to do as nailing a jellyfish to the wall. It has a lot of moving parts and processes. Figure 1 tries to paint a simplified picture of the key components of a DataOps environment. To simplify this environment, some organizations prefer to source all the components from a single vendor, such as a large software or cloud provider (e.g., Microsoft, Amazon, Oracle, or IBM) or a big data engineering specialist, such as Infoworks. Others prefer a best-of-breed approach that stitches together both open source and commercial components using orchestration and monitoring tools.   

Figure 1. DataOps Components

 

Data Pipelines

The dark arrow in the middle of Figure 1 represents a typical data pipeline, which moves source data through three phases—data ingestion, data engineering, and data analytics. Collectively, these pipelines represent a data supply chain that processes, refines, and enriches data for consumption by a variety of business users and applications. One pipeline might populate an OLAP cube used by finance; another might deliver integrated customer data to a real-time web application; and another might create a pool of segmented, raw data for a data scientist building campaign response models.

Data Technologies

Below the data pipelines are the major categories of technologies used to ingest, refine, and analyze data. These four categories—data capture, data integration, data preparation, and data analytics—get the lion’s share of media attention. That’s because this is where the money is: software vendors generate billions of dollars annually selling data products! Unfortunately, the heavy emphasis on data technologies overshadows the even more important data processes that coordinate and drive those technologies. (See “Data Processes” below.)

Data Capture is a hot technology category these days as organizations move from batch to streaming architectures to support big data and the internet of things. Data Integration is mainstream now, having evolved from traditional data warehousing projects. Data Preparation is a newer category designed to help data analysts model their own data sets, ideally leveraging data in IT-managed repositories, such as data lakes. Data Analytics completes the cycle by giving business users the tools to query, analyze, visualize, and share insights.

Teams and Handoffs. The IT department starts the supply chain by ingesting and integrating data to create generic, subject-oriented data files. The data engineering team then queries and models that data to meet specific business needs and use cases. Finally, business users query and analyze targeted data sets to create reports, dashboards, and predictive models. Although linear in nature, the cycle is also very iterative with many intermediate steps and artifacts that must be stored, tracked, and managed.

Supporting data technologies and teams is data storage, which includes data warehouses, data lakes, and data sandboxes running largely on high-performance columnar databases. Below the data storage is a computing infrastructure, which increasingly is cloud-based, virtualized, elastic, and massively parallel. 

Data Processes

An organization that tries to build and manage pipelines with technology alone is doomed to fail. It needs well-defined processes and methods for building, changing, testing, deploying, running, and tracking new and modified functionality. It also needs to manage all the artifacts these processes generate: code, data, metadata, scripts, metrics, dimensions, hierarchies, and so on. And it needs to coordinate the data technologies and provision and monitor development, test, and production environments. This requires job scheduling, event triggers, error handling, and performance management to meet service level agreements.

Development and Deployment. The first two phases—development and deployment—are well defined by agile and DevOps methodologies. Here, the goal is to develop new functionality with self-organizing, business-infused teams that build fully tested, functional code in short sprints, usually two weeks or less. To synchronize development, teams store code in a central repository that applies version control to avoid overwrites and duplicate effort. They also use techniques and tools to seamlessly merge code and move it into production with minimal delay (i.e., continuous integration and continuous deployment, or CI/CD). Tool and system configurations are centrally stored and maintained in a configuration repository.

There are a host of tools to support the development and deployment processes, whether teams are building new applications and use cases from scratch or modifying existing ones. (See this DataKitchen article for a continuously updated list of DataOps products.) Git is the open source version control repository of choice for storing code and managing versions, while Jenkins is an open source tool that supports CI/CD processes (i.e., merging and deploying code from multiple developers). In the data world, there are also development and deployment tools geared to specific types of pipelines, including data warehousing development (i.e., data warehouse automation tools) and the creation of machine learning models.
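To make the CI/CD idea concrete, below is a minimal sketch, in Python, of the kind of gate a CI server such as Jenkins might run on every commit: execute the pipeline’s automated tests and block the merge or deployment if any fail. The repository layout, commands, and deployment hand-off are illustrative assumptions, not features of any particular tool (the script assumes pytest is installed).

    # Hypothetical CI "build" step for a data pipeline repository.
    # Assumes a tests/ directory of pytest tests; the deploy hand-off is a placeholder.
    import subprocess
    import sys

    def run_pipeline_tests() -> bool:
        """Run the pipeline's automated tests; return True only if they all pass."""
        result = subprocess.run([sys.executable, "-m", "pytest", "tests/", "-q"])
        return result.returncode == 0

    def main() -> int:
        if not run_pipeline_tests():
            print("Tests failed -- blocking the merge and deployment.")
            return 1
        # On success, a real job would hand off to a deployment step, e.g.,
        # pushing pipeline code and configuration to the target environment.
        print("All tests passed -- safe to merge and deploy.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())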

Orchestration. The heart and soul of DataOps is orchestration. Moving, processing, and enriching data as it goes through a pipeline requires a complex workflow of tasks with numerous dependencies. Notable data orchestration tools include Airflow (an open source project), DataKitchen, StreamSets, and Microsoft’s Azure Data Factory. A good orchestration tool coordinates all four components of a data development project: code, data, technologies, and infrastructure. In a DevOps environment, orchestration tools automatically provision and de-provision development, testing, staging, and production environments, using container management software (e.g., Kubernetes) to activate and coordinate the containers that support those processes.

In the data world, orchestration tools likewise provision new dev/test/production environments, but they are also responsible for moving data between stages in a pipeline and instantiating the data tools that operate on that data. They kick off jobs, monitor progress, and funnel errors and alerts to the appropriate interfaces. For example, in a cloud environment, a DataOps orchestration tool might do the following (a brief code sketch of such a workflow appears after the list):

  • Provision platforms (e.g., databases, storage capacity, access control lists, performance management tools, data catalogs, logging servers, and monitoring tools).
  • Trigger ingestion jobs. Monitor jobs (batch or streaming), detect and recover from failures, monitor capacity and trigger auto-scaling if needed.
  • Trigger data quality jobs. Profile and validate data, check lineage.
  • Kickstart data transformation. Once the ingested dataset gets a clean bill of health, the orchestration tool might kick off transformation code to combine, format, and aggregate data elements.
  • Trigger a BI tool to download the data into its own columnar store or send a notification that a new data set is ready for querying and analysis.
  • Monitor workflow. Funnel alerts to appropriate individuals and de-provision infrastructure upon successful completion of a workflow.
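
As a concrete (if simplified) illustration, here is what such a workflow might look like as an Apache Airflow DAG, written in Python. The task breakdown mirrors the list above; the function bodies are placeholders, and exact parameter names (e.g., schedule vs. schedule_interval) vary across Airflow versions.

    # A hedged sketch of the orchestration pattern above as an Airflow DAG.
    # Task names and function bodies are illustrative placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest():
        """Pull raw data from source systems into a landing zone (placeholder)."""
        print("ingesting source data...")

    def validate():
        """Profile and validate the ingested data; raise an error to fail the run (placeholder)."""
        print("running data quality checks...")

    def transform():
        """Combine, format, and aggregate the validated data (placeholder)."""
        print("transforming data...")

    def notify():
        """Tell downstream BI tools or users that a fresh data set is ready (placeholder)."""
        print("notifying consumers...")

    with DAG(
        dag_id="dataops_example_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
        t_validate = PythonOperator(task_id="validate", python_callable=validate)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_notify = PythonOperator(task_id="notify", python_callable=notify)

        # Dependencies mirror the workflow above: ingest -> validate -> transform -> notify.
        t_ingest >> t_validate >> t_transform >> t_notify

In production, the same DAG would also trigger provisioning steps and route failures to alerting channels, as described in the list above.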

Continuous Testing and Monitoring. The final component of a DataOps environment is the testing environment. Ideally, teams build tests prior to the development of any code or functionality. The orchestration tool then runs the tests before and after every stage in the pipeline. Catching bugs and problems upstream at the point of inception and integration saves oodles of time, money, and stress down the road. Tools such as Great Expectations and ICEDQ support a continuous testing environment. A related capability is the continuous monitoring of tools, applications, and infrastructure to ensure optimal uptime and performance. Unravel is one of many new application performance management (APM) products geared to big data processing. 
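
To show what such a test looks like in practice, here is a minimal, tool-agnostic sketch of a data quality check written in Python with pandas; the table, column names, and thresholds are hypothetical. Products such as Great Expectations express the same kind of checks declaratively and plug them into the orchestrated workflow.

    # Hypothetical data quality checks run before/after a pipeline stage.
    # Column names and thresholds are illustrative assumptions.
    import pandas as pd

    def check_customer_extract(df: pd.DataFrame) -> list[str]:
        """Return a list of failed checks; an empty list means the data set passed."""
        failures = []
        if df["customer_id"].isnull().any():
            failures.append("customer_id contains nulls")
        if df["customer_id"].duplicated().any():
            failures.append("customer_id contains duplicates")
        if not df["order_total"].between(0, 1_000_000).all():
            failures.append("order_total outside the expected range")
        return failures

    if __name__ == "__main__":
        sample = pd.DataFrame(
            {"customer_id": [1, 2, 2], "order_total": [120.0, 75.5, -10.0]}
        )
        # In a pipeline, the orchestration tool would fail the stage and route
        # an alert if any checks fail, rather than just printing them.
        print(check_customer_extract(sample) or "all checks passed")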

Development teams that use testing as a fundamental building block move faster than those that do not. This feels counterintuitive at first to developers who want to dive straight into a project rather than write tests before they start. But once tests are in place, developers can quickly identify issues before they get nestled deep in code where they are harder to fix. With continuous testing and monitoring, teams can set targets for performance, measure their output, and continuously improve their cycle times and quality.

Conclusion

The goal of DataOps is to bring rigor, reuse, and automation to the development of data pipelines and applications. DataOps can help data teams evolve from a wild west environment that spawns data silos, backlogs, and endless quality control issues to an agile, automated, and accelerated data supply chain that continuously improves and delivers value to the business.

Wayne Eckerson
