Data Observability - A Crucial Property in a DataOps World
Data observability is a measure that provides the continuous, holistic view of a data landscape needed for a streamlined DataOps implementation. This article explains observability and how it applies to data.
What is DataOps?
Data is increasingly important to compete in today’s markets. Beyond traditional management decision support, analytics has entered operational applications and become essential for everyday tasks. Consequently, reliable and efficient operation of data and analytics is paramount. This, however, is hard to attain because data landscapes and processes are complex, datasets are growing, and the underlying infrastructures encompass a wide range of tools and technologies.
DataOps provides a solution to these challenges. It combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement [1]. You can learn more about DataOps by reading “DataOps Explained: A Remedy For Ailing Data Pipelines”.
What is Observability?
In classical system theory, observability describes “how well internal states of a system can be inferred from knowledge of its external outputs” [2]. In the IT sector, observability currently gets a lot of attention in the context of distributed and cloud-based systems, where hundreds or even thousands of microservices need to be monitored and orchestrated.
Figure 1. Observability as the extension of Monitoring
There is no generally accepted definition of the term observability, but it is possible to narrow it down by describing its components. First of all, monitoring and observability are two separate things. Monitoring is a process that discovers whether a system is working as expected. Observability, in contrast, is a general system attribute that summarizes ways to identify, predict, and avoid unwanted behavior of a system [3,4]. According to this understanding, many see observability as an extension of monitoring, with new methods like simulation, automated testing, or prediction that provide more insights and a broader overview of a system landscape (Figure 1).
Observability of Data
Observability is mostly discussed for software systems, and it is not obvious how the concept carries over to data. The requirements for software and data observability are similar. In both areas, it is essential to know whether everything is working (i.e. the system is available / the data is valid), and if something looks odd, the cause must be findable (i.e. find bugs / find flawed data sources or transformations). On closer examination, however, observability of data differs from observability of software and, consequently, requires different tools and components (Figure 2).
Figure 2. Components of Data Observability
Data Lineage. Data lineage enables data stewards and data engineers to track data records over their entire life cycle, e.g. from an aggregated KPI in a report to its original data source. For observability, this helps examine issues and trace them back to the original source or transformation.
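To make the idea of tracing a KPI back to its origin more concrete, here is a minimal sketch in Python. It assumes lineage is available as a simple mapping from each dataset to the datasets it is derived from. The dataset names and the `LINEAGE` structure are hypothetical and only serve as illustration; real lineage tools store far richer information.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets it is derived from.
LINEAGE = {
    "report.revenue_kpi": ["mart.daily_revenue"],
    "mart.daily_revenue": ["staging.orders", "staging.refunds"],
    "staging.orders": ["source.crm_orders"],
    "staging.refunds": ["source.payment_refunds"],
}


def upstream_datasets(dataset, lineage):
    """Return every dataset that `dataset` depends on, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


def original_sources(dataset, lineage):
    """Return the upstream datasets that have no parents of their own."""
    return [d for d in upstream_datasets(dataset, lineage) if not lineage.get(d)]


if __name__ == "__main__":
    print(upstream_datasets("report.revenue_kpi", LINEAGE))
    print(original_sources("report.revenue_kpi", LINEAGE))
```

When a KPI in a report looks wrong, a traversal like this narrows the search to the transformations and sources that actually feed it.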
Metadata. Metadata is additional information about data, like timestamps or authors, and can be used to assure data quality or to integrate and transform data to gain new insights [5]. Regarding observability, metadata is essential to understand datasets, reproduce transformations, and get deeper insights for examinations or predictions.
Data Governance. Data governance defines how data should be handled in an organization regarding data quality, data integrity, and data security. It relates to observability as it also provides guidelines, processes, and responsibilities needed to make analytics more transparent and the data landscape more observable.
Testing. Testing is often neglected in data and analytics, as building an adequate testing environment in complex analytics solutions can be hard. However, testing data increases observability and can help avoid foreseeable issues and improve the stability of data operations.
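As an illustration of what testing data can look like, the following sketch runs three common checks (completeness, uniqueness, and value range) against a small batch of records. The column names, the sample rows, and the allowed range are assumptions made up for this example. In practice such checks are usually expressed in a dedicated data testing framework rather than hand-written.

```python
def check_not_null(rows, column):
    """Return the rows in which `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def check_unique(rows, column):
    """Return the values of `column` that occur more than once."""
    counts = {}
    for row in rows:
        value = row.get(column)
        counts[value] = counts.get(value, 0) + 1
    return [value for value, n in counts.items() if n > 1]


def check_range(rows, column, minimum, maximum):
    """Return the rows whose `column` lies outside [minimum, maximum]."""
    return [
        row
        for row in rows
        if row.get(column) is not None and not (minimum <= row[column] <= maximum)
    ]


def run_tests(rows):
    """Run all checks and return only the ones that found offending records."""
    results = {
        "order_id is not null": check_not_null(rows, "order_id"),
        "order_id is unique": check_unique(rows, "order_id"),
        "amount within 0..10000": check_range(rows, "amount", 0, 10_000),
    }
    return {name: offenders for name, offenders in results.items() if offenders}


# Hypothetical sample batch with one problem per check.
orders = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": -5.00},
    {"order_id": 2, "amount": 42.00},
    {"order_id": None, "amount": 10.00},
]

if __name__ == "__main__":
    for name, offenders in run_tests(orders).items():
        print(f"FAILED: {name}: {offenders}")
```

Running checks like these as part of every pipeline run turns silent data errors into visible, attributable failures.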
Monitoring. Last but not least, monitoring is essential to data observability because it is necessary to observe data-related systems and transformation processes. Moreover, monitoring and visualizing data over its lifecycle can make analytics more understandable [6] and increase transparency and, in turn, observability.
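Two signals that are often monitored for datasets are freshness (when was the data last loaded?) and volume (did roughly the expected number of rows arrive?). The sketch below shows one simple way to compute both. The function names, the 24-hour freshness window, and the 50% volume tolerance are assumptions chosen for illustration, not recommended defaults.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at, now, max_age):
    """Return True if the dataset was loaded no longer than `max_age` ago."""
    return now - last_loaded_at <= max_age


def volume_anomaly(history, today, tolerance=0.5):
    """Return True if today's row count deviates from the historical mean
    by more than `tolerance` (expressed as a fraction of the mean)."""
    if not history:
        return False  # nothing to compare against yet
    mean = sum(history) / len(history)
    if mean == 0:
        return today != 0
    return abs(today - mean) / mean > tolerance


if __name__ == "__main__":
    # Hypothetical load timestamp and row counts.
    now = datetime.now(timezone.utc)
    last_load = now - timedelta(hours=30)
    print("fresh:", is_fresh(last_load, now, timedelta(hours=24)))
    print("volume anomaly:", volume_anomaly([1000, 1100, 900], 300))
```

A fixed tolerance around the mean is deliberately crude. Seasonal data would need a comparison against the same weekday or a proper forecasting model.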
Conclusion
Data observability is not a specific process or tool, but rather a measure of the transparency and maturity of data components. Using observability as a central measure when building data pipelines can increase the overall manageability of data landscapes. Accordingly, observability is complementary to DataOps and its goal of “a culture of continuous improvement” [1].
Further reading
[1] Eckerson, W. W. (2018): “DataOps Explained: A Remedy For Ailing Data Pipelines”
https://www.eckerson.com/articles/dataops-explained-a-remedy-for-ailing-data-pipelines
[2] Kalman, R. E. (1960): “On the General Theory of Control Systems”, Proc. 1st Int. Cong. of IFAC, Moscow 1960, vol. 1, p. 481, Butterworth, London 1961.
[3] Schwartz, B. (2017): “Monitoring Isn't Observability”
https://www.vividcortex.com/blog/monitoring-isnt-observability
[4] Sridharan, C. (2017): “Monitoring and Observability”
https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
[5] Ereth, J. (2017): “If Data is the New Oil, Metadata is the New Gold”
https://www.eckerson.com/articles/if-data-is-the-new-oil-metadata-is-the-new-gold
[6] Ereth, J. (2018): “Analytics Needs Explanation: Helping Users to Understand Underlying Data and Processes”
https://www.eckerson.com/articles/analytics-needs-explanation-helping-users-to-understand-underlying-data-and-processes