DataOps in Data Engineering

ABSTRACT: The disaggregation of the data stack has made the process of creating end-to-end data platforms increasingly complex and convoluted.

In a Complex Data Landscape, You Need DataOps and Data Engineering

The data ecosystem has become a data jungle waiting for a “datastrophe” to happen. As a case in point, Matt Turck’s 2023 MAD (Machine Learning, AI, Data) landscape has close to 1400 listed tools and frameworks, up from a mere 120 tools in 2012. The enterprise data landscape is crowded with shiny objects and dazzling buzzwords, all competing for customer attention and investment dollars. Through most of 2021, a data company got funding every 45 minutes. 

Data teams struggle to create data products and a functional, modern experience in this disparate data ecosystem. They need data engineering principles and DataOps processes to reduce the complexity of building end-to-end data products and to navigate and integrate the complex maze of data tools and frameworks across distributed platforms.

Why organizations need data engineering and DataOps

Data engineering has evolved beyond ETL to include new paradigms such as ELT and ETLT. Data pipelines are growing in number, volume, and complexity, with frameworks and tools constantly evolving and appearing. The lack of a single product or tool to build end-to-end data platforms, coupled with the ever-increasing demands to improve speed of adoption, is causing organizations to duct-tape different products and frameworks together. In the process, they ignore data engineering and operationalization principles.

Look at Figure 1, which shows the vertical stack of a data platform. Each of these layers is subject to various types of drift, which adds complexity to maintaining compatibility, controlling versions, and managing dependencies. Each layer affects the others in subtle but strong ways, creating numerous challenges for development and operationalization.

Figure 1. Data Platform Layers

Figure 2 shows a single end-to-end pipeline. Notice the number of handshake points between the systems, from ingestion all the way to consumption. Each system runs in a different data center, and most run on distributed platforms. Failures at the network or system level are difficult to identify, isolate, debug, and fix in a coordinated way. For example:

  • How does one address the lack of adherence to service level agreements (SLAs), service level objectives (SLOs), and specifications for non-functional requirements (NFR)? 

  • What happens when the number of rows ingested does not match the number of rows stored in the sink systems? (See the reconciliation sketch after Figure 2.)

  • How do data teams go about looking for the needle in the haystack?

Figure 2. Why DataOps?
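
As an illustration of the row-count question above, the following minimal sketch (in Python) reconciles counts at a single handshake point. The accessor functions, batch identifier, and tolerance are hypothetical placeholders rather than part of any specific product.

```python
# Hypothetical sketch: reconcile row counts across one handshake point.
# count_rows_ingested / count_rows_in_sink stand in for queries against the
# ingestion layer's metrics and the sink system, respectively.

def count_rows_ingested(batch_id: str) -> int:
    """Rows the ingestion layer reported for this batch (placeholder)."""
    return 1_000_000

def count_rows_in_sink(batch_id: str) -> int:
    """Rows actually landed in the sink for the same batch (placeholder)."""
    return 999_998

def reconcile(batch_id: str, tolerance: float = 0.0) -> None:
    ingested = count_rows_ingested(batch_id)
    stored = count_rows_in_sink(batch_id)
    if abs(ingested - stored) > ingested * tolerance:
        # Fail fast at the handshake point instead of letting the gap propagate downstream.
        raise ValueError(
            f"Row-count mismatch for batch {batch_id}: ingested={ingested}, stored={stored}"
        )

reconcile("orders-2024-01-01", tolerance=0.001)  # passes: 2 missing rows is within 0.1% tolerance
```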

How Should Organizations Approach Data Engineering & DataOps?

Data engineering is a discipline that adopts principles, best practices, processes, and culture from software engineering. DataOps, for example, introduces DevOps-like practices that revolutionized software development and brought velocity and agility to software product development.

Table 1 lists the parallel concepts that data engineering and DataOps adopt from the software product development discipline of DevOps.

DevOps                             | DataOps
-----------------------------------|------------------------------------------------
Data as Asset                      | Data as a Product
Code Testing                       | Data Testing / Data Validation & Verification
Code Versioning                    | Data Versioning
Technical Debt                     | Data Debt
Observability and Traceability     | Data Observability
DevSecOps                          | DataSecOps
SRE (Site Reliability Engineering) | DRE (Data Reliability Engineering)

The goals of data engineering and DataOps include the following:

  • Improve the velocity of delivery of data products

  • Minimize data debt and ensure data quality, validation, integrity, and consistency

  • Avoid data downtime and meet SLAs and SLOs

  • Automate, reproduce, rollback, and recover

  • Minimize operational fatigue and errors 

The principles of data engineering include the following:

  • Advocate architectural agility and using framework-agnostic solutions

  • Decouple data pipelines from deployment code

  • Adopt a configuration-driven approach to building data pipelines, called “declarative data pipelines.” Like SQL, a declarative pipeline specifies what you want, in contrast to an imperative approach in which you write code to implement the functionality (see the sketch after this list)

  • Invest in data testing, data validation, data orchestration, automation, repeatability, and traceability
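
To make the declarative-pipeline idea concrete, here is a minimal sketch in Python. The step registry, step names, and parameters are hypothetical examples; the same configuration could just as easily be loaded from a YAML or JSON file.

```python
# Minimal sketch of a declarative, configuration-driven pipeline.
# The pipeline is *described* as data; a small generic runner interprets it.
# Step names and functions are hypothetical illustrations.

from typing import Callable, Dict, List

# Registry mapping step types to implementations; the config never references code directly.
STEP_REGISTRY: Dict[str, Callable[[dict], None]] = {
    "ingest_csv": lambda params: print(f"ingesting {params['path']}"),
    "validate_rows": lambda params: print(f"validating, min_rows={params['min_rows']}"),
    "load_warehouse": lambda params: print(f"loading into {params['table']}"),
}

# Declarative description of *what* the pipeline should do, not *how* to do it.
pipeline_config: List[dict] = [
    {"step": "ingest_csv", "params": {"path": "/data/orders.csv"}},
    {"step": "validate_rows", "params": {"min_rows": 1}},
    {"step": "load_warehouse", "params": {"table": "analytics.orders"}},
]

def run(config: List[dict]) -> None:
    for spec in config:
        STEP_REGISTRY[spec["step"]](spec["params"])

if __name__ == "__main__":
    run(pipeline_config)
```

Because the configuration is plain data, adding, reordering, or removing a step is a configuration change rather than a code change, which is what makes this style attractive for loosely coupled pipelines.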

Modern data architectures are a mashup of vendor products, open-source frameworks, tools, and managed service components across data ingest, data storage, and analytics. These evolve at their own cadence with bug fixes, new releases, and patches. To enable agility for adding, removing, or retiring frameworks, components should be loosely coupled without having to reimplement foundational pieces of infrastructure. Some ways of achieving this include the following:

  • Identify the portions that are prone to change and abstract them out so pipeline components are easy to add, remove, or skip (see the sketch after this list)

  • Avoid long-running jobs

  • Capture metrics at handshake points across systems and minimize manual intervention

  • Minimize dependencies

  • Make data pipelines retryable with the right kind of checkpointing and the latest generation of data orchestration tools

  • Provide execution isolation between jobs
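
The sketch below illustrates the first and third points: change-prone stages sit behind a small common interface so they can be added, removed, or skipped, and a simple metric is captured at each handshake point between stages. Class names, metric names, and payload fields are hypothetical.

```python
# Hypothetical sketch: loosely coupled pipeline stages behind a common interface,
# with a metric emitted at each handshake point.

from abc import ABC, abstractmethod
from typing import List

class PipelineComponent(ABC):
    """Common contract for every pipeline stage."""

    name: str = "component"

    @abstractmethod
    def run(self, payload: dict) -> dict:
        """Transform the payload and return it for the next stage."""

class IngestStage(PipelineComponent):
    name = "ingest"
    def run(self, payload: dict) -> dict:
        payload["rows"] = 100  # stand-in for reading from a source system
        return payload

class LoadStage(PipelineComponent):
    name = "load"
    def run(self, payload: dict) -> dict:
        print(f"loading {payload['rows']} rows")
        return payload

def run_pipeline(stages: List[PipelineComponent], payload: dict) -> dict:
    for stage in stages:
        payload = stage.run(payload)
        # Capture a metric at the handshake point after each stage.
        print(f"metric: {stage.name}.rows = {payload.get('rows')}")
    return payload

# Swapping, skipping, or adding a stage is just editing this list.
run_pipeline([IngestStage(), LoadStage()], {})
```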

Organizations should ask the following questions before implementing DataOps practices.

  • What level of automation is needed?

  • What is the rollback strategy when components are changed and updated?

  • Where do we need data versioning?

  • How quickly are the data volumes and velocity going to change in a production setting?

  • How do we analyze infrastructure performance challenges in production?

  • What are realistic workload profiles and stress tests? 

  • What are relevant DataOps principles for us to adopt?

  • How early can we provide data checks?

Once the decisions are made, some of the best practices for DataOps include the following: 

  • Automate both unit and integration data tests and data validations.

  • Attempt to make deployments single-click.

  • Establish checkpoints and/or savepoints in long-running pipelines so failed jobs can restart from the point of failure.

  • Decouple the data orchestration of the pipeline from the business logic.

  • Always set up retries of failed tasks with exponential backoff (see the sketch after this list).

  • Incorporate data observability with the goal of detecting, correcting, and preventing data pipeline errors, reducing data downtime, and minimizing the cost of data errors.

  • Incorporate end-to-end monitoring and leverage ML to identify data issues and perform root cause analysis. 

  • Apply automated anomaly detection based on historical trends at the field level.

  • Centralize, standardize, and automate data quality monitoring processes.

  • Understand data completeness and data freshness criteria.

  • Routinely check whether SLAs/SLOs are satisfied.

  • Ensure timeliness of alerts.

  • Ensure alerting channels don’t go unnoticed.

  • Log failed test results for root cause analysis.
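
As one concrete example of the retry guidance above, here is a minimal, orchestrator-agnostic sketch of retries with exponential backoff and jitter. The flaky_extract task and the delay values are hypothetical stand-ins for a real pipeline step and its tuning.

```python
# Hypothetical sketch: retry a flaky task with exponential backoff and jitter,
# independent of any specific orchestrator.

import random
import time

def retry_with_backoff(task, max_attempts: int = 5, base_delay: float = 1.0):
    """Run `task`, retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # retries exhausted; let the orchestrator and alerting take over
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Example usage with a task that fails transiently about half the time.
def flaky_extract():
    if random.random() < 0.5:
        raise RuntimeError("transient source error")
    return "extracted"

print(retry_with_backoff(flaky_extract, base_delay=0.1))
```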

Organizations that aspire to become data-driven need to start with data engineering and DataOps and not ignore the principles and best practices around them. These processes and this culture are fundamentally important for organizations to adopt so they can move to the modern data stack, in the cloud or on-premises, and scale to build lasting data products.