DataOps Explained: A Remedy For Ailing Data Pipelines
When it comes to data analytics, you don’t want to know “how the sausage is made.” The state of most data analytics pipelines is deplorable. There are too many steps; too little automation and orchestration; minimal reuse of code and data; and a lack of coordination between stakeholders in business, IT, and operations. The result is poor quality data delivered too late to meet business needs.
Self-service tools were supposed to ameliorate this situation, but in many cases they have exacerbated the problem. Business users can now create fancy dashboards, but they still depend on IT to deliver workable data sets. And when ambitious analysts create their own data sets (easier than ever with the advent of self-service data preparation tools), they invariably create silos of conflicting data riddled with logic and other errors, leaving organizations operating in a veritable Tower of Babel.
New Approach Needed
These problems beg for a new approach to building data analytic solutions. As data pipelines become more complex and development teams grow in size, organizations need to apply standard processes to govern the flow of data from one step of the data lifecycle to the next – from data ingestion and transformation to analysis and reporting. The goal is to increase agility, reduce cycle times, and minimize data defects, giving developers and business users greater confidence in data analytics output.
This is the vision of DataOps, an emerging methodology for building data analytic solutions that deliver business value. Building on modern principles of software engineering, DataOps applies rigor to developing, testing, and deploying code that manages data flows and delivers analytic solutions. The goal is to foster greater collaboration among development, test, operations, and business teams and create a culture of continuous improvement.
DataOps uses automated testing, orchestration, collaborative development, containerization, and monitoring to continuously accelerate output and improve data quality. As one DataOps practitioner from a Fortune 50 company says, “DataOps consists of a stream of steps required to deliver value to the customer. We automate those steps where possible, minimize waste and redundancy, and foster a culture of continuous improvement.”
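To make this concrete, here is a minimal, hypothetical Python sketch of the pattern the practitioner describes: each pipeline stage is a discrete, testable step, and automated data quality checks run between stages so defects are caught before they reach the customer. The function names, checks, and sample records are illustrative assumptions, not taken from any particular DataOps product.

```python
# A hypothetical DataOps-style pipeline: discrete stages with
# automated quality checks between them. All names are illustrative.

def ingest():
    # In practice this would pull records from a source system or API.
    return [{"customer_id": 1, "revenue": 120.0},
            {"customer_id": 2, "revenue": 87.5}]

def check_not_empty(records):
    assert len(records) > 0, "quality check failed: no records"

def check_no_nulls(records, field):
    assert all(r.get(field) is not None for r in records), \
        f"quality check failed: null values in '{field}'"

def transform(records):
    # Example transformation: flag high-value customers.
    return [{**r, "high_value": r["revenue"] > 100} for r in records]

def publish(records):
    # In practice this would load a warehouse table or refresh a dashboard.
    print(f"published {len(records)} records")

def run_pipeline():
    raw = ingest()
    check_not_empty(raw)            # test the data, not just the code
    check_no_nulls(raw, "revenue")
    curated = transform(raw)
    check_not_empty(curated)
    publish(curated)

if __name__ == "__main__":
    run_pipeline()
```

In a real DataOps environment, an orchestration tool would schedule and monitor these steps, and the quality checks would run automatically on every execution rather than relying on manual inspection.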
Roots in Software Engineering
DataOps stems from the DevOps movement in the software engineering world, which bridges the traditional gap between development, QA, and operations so that technical teams can deliver high-quality output at an ever-faster pace. Similarly, DataOps brings together data stakeholders, such as data architects, data engineers, data scientists, data analysts, application developers, and IT operations.
DataOps also borrows heavily from Agile, Lean, DevOps, and Total Quality Management. Like Agile, DataOps emphasizes the use of self-organizing teams with business involvement and short (i.e., two- to three-week) development sprints that deliver fully tested code, as sketched below. Like Lean, DataOps focuses on efficiency, using version control systems and code repositories that foster parallel development and code reuse. And like Total Quality Management, DataOps espouses continuous testing, monitoring, and benchmarking to detect issues before they turn into major problems.
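Continuous testing in this sense means treating data quality rules like unit tests: checked into version control alongside the pipeline code and run automatically on every commit. The following pytest-style sketch is a hypothetical illustration; the transform function, sample records, and test names are assumptions, not taken from any specific toolchain.

```python
# Hypothetical data tests, written like unit tests so a CI server
# triggered by each commit can run them automatically.

def transform(records):
    # Stand-in for the pipeline's real transformation logic.
    return [{**r, "high_value": r["revenue"] > 100} for r in records]

def test_transform_preserves_row_count():
    sample = [{"customer_id": 1, "revenue": 120.0},
              {"customer_id": 2, "revenue": 87.5}]
    assert len(transform(sample)) == len(sample)

def test_transform_flags_high_value_customers():
    sample = [{"customer_id": 1, "revenue": 120.0}]
    assert transform(sample)[0]["high_value"] is True
```

Because the tests live in the same repository as the pipeline code, a broken transformation fails the build immediately instead of surfacing weeks later as a bad number on a dashboard.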
DataOps Tools
In its 2018 report, “DataOps: Industrializing Data and Analytics,” Eckerson Group interviewed numerous vendors that offer DataOps tools. DataOps is a broad category that encompasses a range of solutions, from data engineering platforms and orchestration toolsets to data warehousing automation and data science platforms.
Many provide a DataOps platform that makes it easy for data engineers and operations personnel to build simple or complex data analytics pipelines. Some, like DataKitchen, orchestrate the flow of data and code using whatever data engineering and analytics tools an organization already uses. Others, such as InfoWorks, Tamr, and Composable Analytics, provide a single, integrated, end-to-end solution that makes it easy for data engineers and even savvy business users to build, test, and monitor data pipelines. Nexla offers a cloud-based solution that manages and automates inter-company data pipelines.
In addition, some DataOps solutions are targeted at specific segments of the analytics market. For example, vendors such as TimeXtender and WhereScape target the data warehousing market, although their automation platforms now extend into big data and cloud platforms. Data science platforms, such as DataRobot, provide a one-stop shop for building, deploying, and monitoring analytical and machine learning models. Podium Data, Zaloni, Cloudera, Hortonworks, and others provide one-stop shops for managing big data pipelines.
(To see a DataOps technology framework that Eckerson Group created this year, see "Diving into DataOps: The Underbelly of Modern Data Pipelines".)
It’s All About the Culture
Although DataOps platforms can dramatically accelerate cycle times and free up engineering resources, most DataOps practitioners are quick to say that DataOps requires a cultural transformation. As one practitioner puts it, “Our goal is to create a culture of continuous improvement. That requires a transformation in the way people think, act, and communicate.”