DataOps: More Than DevOps for Data Pipelines
DataOps is a data engineering approach that is designed for rapid, reliable, and repeatable delivery of production-ready data for analytics and data science. Beyond speed and reliability, DataOps enhances and advances data governance through engineering disciplines that support versioning of data, data transformations, and data lineage. DataOps supports operational agility for business operations, with the ability to meet new and changing data needs quickly. It also supports portability and technical operations agility with the ability to rapidly redeploy data pipelines across multiple platforms in on-premises, cloud, multi-cloud, and hybrid data ecosystems.
The definition above is accurate but incomplete. It reflects a common misconception about DataOps—a singular focus on data engineering. The missing piece is attention to data consumption, especially data science applications. Let’s redefine it: DataOps is an engineering methodology and set of practices designed for rapid, reliable, and repeatable delivery of production-ready data and operations-ready analytics and data science models. DataOps enhances and advances governance through engineering disciplines that support versioning of data, data transformations, data lineage, and analytic models. DataOps supports business operational agility with the ability to meet new and changing data and analysis needs quickly. It also supports portability and technical operations agility with the ability to rapidly redeploy data pipelines and analytic models across multiple platforms in on-premises, cloud, multi-cloud, and hybrid ecosystems.
Although this definition is largely technical, it is important to recognize that DataOps has business drivers and benefits and that it also has substantial organizational and cultural impacts. Success with DataOps requires attention to four dimensions—business, process, culture, and technology. Wayne Eckerson’s DataOps Framework provides a process and technology perspective.
Based on DevOps—a proven methodology to increase the speed at which new software features are delivered—DataOps applies the same principles of a continuous build, test, and release cycle supported by automation. Software is built iteratively as a series of sprints that discover requirements, develop working models of the software, and test those models in collaboration with business stakeholders. Release promotes software from development to production when build and test processes deliver working software with sufficient functionality to be useful in business operations. The underlying methodology is called Continuous Integration / Continuous Delivery, or CI/CD.
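The release gate at the heart of a CI/CD cycle—promote to production only when automated tests pass—can be sketched in a few lines. This is a minimal illustration, not any particular CI tool; all function names here are hypothetical.

```python
# Minimal sketch of a CI/CD release gate: a build is promoted to
# production only when every automated test passes. Illustrative only.
def run_tests(build, tests):
    """Run each test against the build and collect the names of failures."""
    return [t.__name__ for t in tests if not t(build)]

def release(build, tests, deploy):
    """Deploy the build only if the automated test suite passes."""
    failures = run_tests(build, tests)
    if failures:
        return {"released": False, "failures": failures}
    deploy(build)
    return {"released": True, "failures": []}
```

The same gating pattern recurs at every test point in the DataOps cycles described below: a failed automated check stops promotion, a clean run allows it.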
Figure 1 illustrates CI/CD for DevOps software development and integration. On the development side, requirements come from two sources—traditional processes for new software requirements based on business need, and feedback from operational use of software that has been released. Both contribute to the product backlog that is addressed with rapid development projects using agile methods. When software goes into operations it initially reduces the operations backlog, but new needs and software deficiencies feed the product backlog to drive the next stages of development. The process is one of continuous development of new software capabilities and continuous integration of those capabilities into the existing operations environment. Collaboration between operations and development is critical.
Figure 1. DevOps for Software Development and Operationalization
Adapting the DevOps model for DataOps results in a somewhat more complex process model that consists of two interacting CI/CD loops—one to develop and operationalize analytic models and one to develop and operationalize data pipelines. (See figure 2.)
Analytics requirements typically come from business stakeholders. They populate the model backlog and drive CI/CD for reporting, business intelligence (BI), analytics and data science. Although we refer to analytic requirements and model backlog, the DataOps process can be applied for a broader scope of data products that includes reporting and BI as well as analytics and data science. Data pipeline requirements come from many sources including the model backlog. They populate the pipeline backlog and drive CI/CD for data pipelines. DataOps can be applied for the full range of data pipelines including batch processing such as ETL, real-time with change data capture (CDC), and streaming data.
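To make the batch and CDC pipeline styles mentioned above concrete, here is a hedged sketch of an incremental, CDC-style batch load: only rows changed since the last synchronization point are applied to the target. The row shape (`id`, `updated_at`) and function name are assumptions for illustration, not from any specific tool.

```python
# Sketch of an incremental (CDC-style) batch load. Assumes each source
# row carries an "updated_at" watermark; all names are illustrative.
def incremental_load(source_rows, target, last_sync):
    """Apply rows changed since last_sync to target; return the new watermark."""
    changed = [r for r in source_rows if r["updated_at"] > last_sync]
    for row in changed:
        target[row["id"]] = row            # upsert by primary key
    # Advance the watermark only as far as the data actually seen.
    return max((r["updated_at"] for r in changed), default=last_sync)
```

A full streaming implementation would read a change log continuously rather than scanning for a watermark, but the contract—apply only deltas, track a sync point—is the same.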
Figure 2. DataOps for Model and Pipeline Development and Operationalization
It is important to note that in this diagram the upper loop describes CI/CD for analytics and data science, and the lower loop describes CI/CD for data pipelines. The point is that analytics drives the demand for data pipelines, not the reverse. DataOps begins with analytics. The two backlogs are a model backlog and a pipeline backlog—both product backlogs, but for different kinds of products. When analytic models are operationalized, they are not sustainable without a fresh supply of data, so data pipelines are critical. Developing new models isn’t possible until data is available, so again data pipelines are critical. The red line in figure 2 represents a primary dependency between the two loops. When the model backlog is understood, it is used to identify the data pipelines required to develop and operationalize backlogged models. The model backlog becomes a new source of requirements that feeds the pipeline backlog.
DataOps and Automation
DataOps is not practical without automation. Robust DataOps technology offers features and functions for model orchestration, data pipeline orchestration, test automation, and deployment automation for data pipelines and analytic models. The emerging Data Fabric technologies with AI/ML-enabled automation will fill an essential role in the future of DataOps. Figure 3 illustrates the build-test-release cycles of DataOps and highlights the points at which automation is critical. Automation is essential to DataOps, with eight points at which test automation is critical, two points of deployment automation, and two points of operational orchestration.
Figure 3. Points of Automation in DataOps Build-Test-Release Cycles
Looking first at the CI/CD cycle for data science and analytics we can see that:
- The build activities occur as sprints where models (and reports, dashboards, scorecards, etc.) are built and trained. Unit testing and user testing are integral parts of each sprint and depend on test automation to achieve comprehensive testing at high speed.
- At conclusion of a series of sprints, integration testing may be needed to ensure that all of the software components from build activities work well together. Again, test automation is needed.
- Pre-deployment testing is performed to confirm that the models are ready to deploy and deliver operational value, and to provide a basis for post-deployment comparison. Where earlier stages of testing—unit, user, and integration—focus on building things right, pre-deployment testing focuses on deploying the right things. Pre-deployment testing may also include regression testing when preparing to deploy revisions to a previously deployed model. Test automation fills an essential role when preparing to deploy.
- Release occurs at the deploy step where models are promoted from development to production. Deployment automation supports the practices of source code management, versioning, and revision tracking that are important release disciplines.
- Post-deployment testing is performed to confirm that the model functions in the production environment in exactly the same way that it does in development and test environments, affirming that environmental factors have not influenced software behavior. Once again, test automation fills an important role.
- Model execution in the operational environment is supported by model orchestration technology that automates configuration, coordination, and management of the computing environment in which models are executed.
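The pre- and post-deployment testing steps above can be sketched as a baseline comparison: capture the model's outputs on a fixed validation set before release, then confirm the deployed copy reproduces them. This is a minimal illustration under assumed names; real test automation would cover many more checks.

```python
# Hedged sketch of pre/post-deployment comparison testing for a model.
# snapshot() runs before release; post_deployment_check() runs after.
def snapshot(model, validation_inputs):
    """Record predictions on a fixed validation set as the release baseline."""
    return [model(x) for x in validation_inputs]

def post_deployment_check(deployed_model, validation_inputs, baseline, tol=1e-9):
    """Confirm the production model matches the pre-deployment baseline."""
    return all(
        abs(deployed_model(x) - expected) <= tol
        for x, expected in zip(validation_inputs, baseline)
    )
```

A failing check signals that environmental factors—library versions, hardware, configuration—have altered model behavior between development and production.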
The CI/CD cycle for data pipelines follows a pattern closely parallel to that for analytic models:
- Build occurs as a series of sprints that include unit and user testing. Test automation helps to achieve fast but complete testing.
- Integration testing ensures that all pipeline components work well together. Test automation supports comprehensive testing at high speed.
- Pre-deployment testing is performed to confirm that the pipelines are ready to deploy and deliver the data needed by consuming applications. Pre-deployment tests also provide a basis for post-deployment comparison.
- Release occurs when pipelines are deployed and promoted from development to production. Deployment automation supports the practices of source code management, versioning, and revision tracking that are important release disciplines.
- Post-deployment testing confirms that the pipeline functions in the production environment exactly as it does in development and test environments, assuring that environmental factors do not change software behavior—another use case for test automation.
- Pipeline orchestration technology supports pipeline execution in the operational environment to automate configuration, coordination, and management of the computing environment.
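The orchestration role described above—coordinating pipeline steps so each runs only after its upstream dependencies complete—can be sketched as a simple dependency-ordered executor. This is an illustrative toy, not a production orchestrator, which would add scheduling, retries, and monitoring; all names are assumptions.

```python
# Minimal sketch of pipeline orchestration: execute steps in dependency
# order, passing each step the outputs of its upstream steps.
def orchestrate(steps, deps):
    """steps: name -> callable(upstream_results); deps: name -> [upstream names]."""
    results, done = {}, set()
    while len(done) < len(steps):
        ready = [n for n in steps
                 if n not in done and all(d in done for d in deps.get(n, []))]
        if not ready:
            raise ValueError("cycle or missing dependency in pipeline")
        for name in ready:
            upstream = {d: results[d] for d in deps.get(name, [])}
            results[name] = steps[name](upstream)
            done.add(name)
    return results
```

For example, an extract-transform-load chain declares `transform` dependent on `extract` and `load` dependent on `transform`; the executor then guarantees the correct run order regardless of how the steps are listed.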
DataOps Is Strategic
Evolving practices and processes for data pipeline development and deployment, analytic model development and deployment, and operation of pipelines and models should be part of your data strategy. The real payback comes from evolving your data and analytics culture to one where data is delivered quickly and reliably and where the knowledge needs of business managers are routinely satisfied with the right data and models at the right speed. Remember that the demand for data and analytics will continue to grow and that development, deployment, and operation without automation will not scale to meet it.