Getting Started with DataOps

This is a collaboration article that was originally published on Zaloni.com

DataOps is no longer optional. For companies that compete on analytics, the proliferation of artificial intelligence and other big data solutions has made it a matter of survival.

For years, data analytics has been an artisanal craft. At most companies, highly trained data engineers, scientists, and analysts build boutique solutions on an as-needed basis. Unfortunately, this model is ill-suited to the ever-growing demand for data. As data sources and targets multiply and use cases become more complex, the quality and speed of analytics drop. Data teams, even exceedingly skilled ones, just can’t keep up.

DataOps represents a sea change in how companies approach the development and deployment of data pipelines. Drawing from the DevOps and agile methodologies of the software world, it imagines a sustainable way forward. Rather than artisanal analytics, DataOps ushers in a world of industrialized data development.


The modern data pipeline consists of numerous tools that interact with one another in increasingly complicated ways. Although the modern pipeline is far more intricate than those of yesteryear, its component technologies still fall into one of three categories. The first consists of tools for connecting to and ingesting the data; the second, tools for modeling the data; and the third, tools for querying and analyzing the data. DataOps doesn’t change any of this.

Instead, DataOps focuses on the infrastructure needed to construct and automate pipelines. It’s about building the machine that makes the machine. To get started, companies must stand up four technical pillars—continuous integration and delivery (CI/CD), orchestration, testing, and monitoring. These functions don’t replace pipeline components; they complement them to support the entire system. (See figure 1.)

Figure 1. Model of a Data Pipeline Supported by DataOps

Continuous Integration/Continuous Delivery (CI/CD). Software left waterfall development behind years ago. The data world needs to do the same. CI/CD relies on a central repository that stores code, typically in Git. This allows teams to branch and modify code in a controlled and versioned way without affecting production. Once the changes have been tested, teams can deploy them seamlessly by merging them back into the production branch. The repository also provides a location for storing tool and system configurations. As a result of this centralization, teams can easily repurpose old code and develop new pipelines in tandem without duplicating work.
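
To make this concrete, here is a minimal sketch of the kind of check a CI job could run automatically on every branch before a merge into production is allowed. The `standardize_customer_ids` transformation and its test are purely hypothetical; the point is that the repository carries tests alongside the code, so every proposed change is verified the same way.

```python
# test_transformations.py: hypothetical example of a unit test a CI job
# (triggered on every push to a Git branch) might run before changes are
# merged back into the production branch.

def standardize_customer_ids(records):
    """Hypothetical transformation: trim whitespace and upper-case IDs."""
    return [{**r, "customer_id": r["customer_id"].strip().upper()} for r in records]


def test_standardize_customer_ids():
    raw = [{"customer_id": " ab123 "}, {"customer_id": "cd456"}]
    cleaned = standardize_customer_ids(raw)
    assert [r["customer_id"] for r in cleaned] == ["AB123", "CD456"]
```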

Orchestration. Orchestration tools coordinate software, code, and data. They connect to every part of the pipeline and shepherd data from one stage to the next. Without human input, they can provision platforms and environments, trigger new jobs, and pass data from one tool to another. This frees up developers to build new pipelines and enables one engineer to manage hundreds of pipelines in production. Given the complexity of modern data environments, DataOps simply couldn’t function without this component.
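
The sketch below illustrates the pattern in plain Python; the stage names and the `run_pipeline` runner are hypothetical stand-ins for what a dedicated orchestrator such as Apache Airflow or Prefect would provide, along with scheduling, retries, and environment provisioning.

```python
# orchestrate.py: bare-bones sketch of orchestration. Each stage is a
# callable, and a small runner hands the output of one stage to the next
# with no human in the loop. All names are hypothetical.

def ingest():
    # In practice this would connect to a source system.
    return [{"customer_id": "AB123", "amount": 42.0}]

def transform(records):
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in records]

def load(records):
    print(f"Loaded {len(records)} records")

def run_pipeline(stages):
    """Trigger each stage in order, passing data from one tool to the next."""
    first, *rest = stages
    data = first()
    for stage in rest:
        data = stage(data)
    return data

if __name__ == "__main__":
    run_pipeline([ingest, transform, load])
```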

Testing. Data teams expend surprisingly few resources on testing. Software organizations dedicate nearly 50% of their code and staff to testing and quality; for most data teams, 20% would be high. Not so with DataOps. DataOps requires engineers to bake tests into every stage of a data pipeline. These tests must check both data quality and pipeline functionality, and they run not only during development but also in production. In the long run, tests save time even though they require more work up front. A thoroughly tested pipeline delivers better data more consistently, increasing the trust of data consumers and reducing the time engineers spend fixing problems. And when pipelines do break, they are much easier to troubleshoot.
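
As a rough sketch of what “baking tests in” can look like (the field names and thresholds below are assumptions, not prescriptions), a quality gate embedded in a pipeline stage checks the data itself and fails fast rather than passing bad records downstream; the same check runs in development and in production.

```python
# quality_checks.py: hypothetical in-pipeline data-quality tests that run
# in both development and production, failing fast instead of passing bad
# data on to the next stage.

def check_batch(records, required_fields=("customer_id", "amount"), min_rows=1):
    """Validate a batch before it moves to the next pipeline stage."""
    if len(records) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(records)}")
    for i, row in enumerate(records):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            raise ValueError(f"Row {i} is missing required fields: {missing}")
    return records  # the check is a gate, not a transform
```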

Monitoring. The final pillar of DataOps involves monitoring the execution of pipelines in production environments. Monitoring tools provide insights into the underlying infrastructure of servers, CPUs, memory, and storage nodes that process the code and data. They help engineers find bottlenecks and breakages and optimize the impact of pipelines on shared resources. As with orchestration, monitoring requires specialized tools that can see across the entirety of complex data environments. The reduction in overhead from increased efficiency generally offsets the cost of these tools, however. This is especially true in the cloud, where monitoring tools can provide more accountability for resource usage. 
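
A minimal sketch of the idea, using only the Python standard library (a real deployment would typically export these measurements to a dedicated monitoring tool rather than a log): wrapping each stage so that it reports its duration and output size makes bottlenecks and breakages visible.

```python
# monitor.py: simple sketch that times each pipeline stage and logs its
# duration and record count, so slow or broken stages stand out.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def monitored(stage):
    """Wrap a stage so every run emits duration and record-count metrics."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = stage(*args, **kwargs)
        elapsed = time.perf_counter() - start
        count = len(result) if hasattr(result, "__len__") else "n/a"
        log.info("%s finished in %.2fs (%s records)", stage.__name__, elapsed, count)
        return result
    return wrapper
```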

Conclusion

Although the methodology has many other aspects, companies just embarking on their DataOps journey ought to concentrate on setting up these four functions first. A great way to do this is with a DataOps platform like Zaloni, which integrates multiple DataOps functions into a single solution. Even before implementing the rest of the philosophy, organizations should see a return on their investment. CI/CD will lead to quicker development times, orchestration will enable teams to support more projects, and testing and monitoring will improve the quality of solutions and resource efficiency. With these components in place, DataOps is poised to finally fulfill the promise of “faster, better, cheaper.”

Joe Hilleary

Joe Hilleary is a writer, researcher, and data enthusiast. He believes that we are living through a pivotal moment in the evolution of data technology and is dedicated to...
