Stepping Up To Modern Data Management

Getting Started With Modern Data Management

I have spoken and written previously about the phenomenon that I call the data quake. The once stable world of data management has been shaken up by technology shifts such as big data, NoSQL, cloud computing, data lakes, self-service, data cataloging, machine learning, and more. Everyone involved in data management is affected by the changes. No person, role, or job is untouched. The scope of change is enormous, affecting many areas including:

  • Data architecture
  • Data quality management
  • Data modeling
  • Data governance
  • Data curation and cataloging

The urgency of modern data management is increasing for most organizations, yet many struggle with size and complexity and are uncertain about where and how to get started. There is no one-size-fits-all answer to that question, but a look at some best practices may spark ideas about where to begin. The topic is broad and complex, so I’ll only scratch the surface in this short blog, with the hope that it offers enough to help you get started.

Data Architecture

Today’s typical data management architecture is built on a foundation of 1990s principles—relational databases, data warehousing, batch ETL, data latency, etc.—that don’t address more recent developments in big data, NoSQL, and the like. Since the emergence of big data technologies, most organizations have patched new concepts onto the surface of the old architecture, and they continue to add patches in a way that makes the architecture increasingly fragile. Data warehouses and data lakes have become the new data silos, and connecting the dots among them is especially difficult. The time has come to step back and rebuild data management architecture from the ground up. My recent posting about Modernizing Data Management Architecture offers a deeper look at this topic, with more to be found in Wayne Eckerson’s Ten Characteristics of a Modern Data Architecture.

Data Quality Management

The data quality practices and techniques that we’ve traditionally used when working with structured enterprise data don’t work well for big data. Many of the data quality rules used with structured data—referential integrity rules, for example—don’t apply when the data is not organized and managed as relational tables.
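To make the contrast concrete, here is a minimal sketch, using hypothetical table and field names, of the kind of referential-integrity rule that is easy to state against relational tables but has no direct equivalent for schemaless documents:

```python
# A minimal sketch (hypothetical tables and column names) of a classic
# referential-integrity rule: every customer_id in orders must exist in customers.
customers = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 3},  # orphan row: no matching customer
]

known_ids = {row["customer_id"] for row in customers}
orphans = [row for row in orders if row["customer_id"] not in known_ids]
print(f"Referential integrity violations: {orphans}")

# The same rule has no obvious analogue for a schemaless document, where the
# "foreign key" may be missing, nested, renamed, or typed differently per record.
clickstream_event = {"session": "abc123", "payload": {"cust": "1", "page": "/home"}}
```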

Data quality practices from BI and data warehousing are geared toward cleansing data to improve content correctness and structural integrity in data that is used by query and reporting processes. In the big data world, quality is more elusive. Correctness is difficult to determine when using data from external sources, and structural integrity can be difficult to test with unstructured and differently structured (non-relational) data.

With big data, quality must be evaluated as fitness for purpose. With analytics, the need for data quality can vary widely by use case. The quality of data used for revenue forecasting, for example, may demand a higher level of accuracy than data used for market segmentation. Each use case has unique needs for data accuracy, precision, timeliness, and completeness. It is important to recognize that some kinds of analytics rely on outliers and anomalies to find interesting things in the data. Predictive analytics, for example, has greater interest in the long tails of a value frequency distribution curve than in the center of the curve. Traditional data cleansing techniques are likely to regard those outliers as quality-deficient data and attempt to repair them through cleansing.
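As a simple illustration, the sketch below uses synthetic data and an assumed three-standard-deviation rule to show how a routine cleansing step discards exactly the long-tail values that an anomaly-oriented model would want to keep:

```python
import random
import statistics

# Synthetic transaction amounts: mostly routine values plus a few extreme ones.
# (Illustrative data only; the threshold and field semantics are assumptions.)
random.seed(42)
amounts = [random.gauss(100, 15) for _ in range(1000)] + [950.0, 1200.0, 2400.0]

mean = statistics.mean(amounts)
stdev = statistics.pstdev(amounts)

# A traditional cleansing rule: treat anything beyond 3 standard deviations
# as bad data and drop it before loading.
cleansed = [a for a in amounts if abs(a - mean) <= 3 * stdev]

# An anomaly-oriented use case wants exactly the values that rule removes.
long_tail = [a for a in amounts if abs(a - mean) > 3 * stdev]

print(f"rows kept after cleansing: {len(cleansed)}")
print(f"long-tail values a predictive or fraud model might need: {long_tail}")
```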

Data quality practices for enterprise data are distinctly different from those needed for big data and analytics. (See figure 1.)

Figure 1. Differences in Data Quality Practices

Data Modeling

There is a profound difference between data modeling for relational data stores such as data warehouses and modeling for NoSQL stores used with data lakes. Long-standing data modeling practices are based on the idea of designing structures to store data in relational databases. We work from conceptual modeling, through logical design, and ultimately to a physical model that describes how data will be stored.

The big data world stands this on its head. The data already exists; it is already stored without need or opportunity for us to design. The modeling purpose changes from design to understanding. Instead of conceptual → logical → physical, we begin with the physical and attempt to deduce a logical model. Instead of starting with entities, then proceeding to attributes and relationships, we begin with fields and try to deduce the things that they describe and the relationships among those things.
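A minimal sketch of this reverse direction, using hypothetical document records, might scan the physical data to infer which fields actually occur and what types they hold, as a first step toward deducing entities and relationships:

```python
from collections import defaultdict

# A sketch of "modeling in reverse": start from physical records that already
# exist (hypothetical order documents) and deduce a candidate logical model.
records = [
    {"order_id": 100, "customer": {"id": 1, "name": "Acme"}, "total": 25.50},
    {"order_id": 101, "customer": {"id": 2}, "total": 18.00, "coupon": "SAVE10"},
    {"order_id": 102, "customer": {"id": 1}, "items": [{"sku": "A1", "qty": 2}]},
]

def observe(record, path="", seen=None):
    """Walk one document and record the types seen at each field path."""
    seen = seen if seen is not None else defaultdict(set)
    for key, value in record.items():
        field = f"{path}.{key}" if path else key
        seen[field].add(type(value).__name__)
        if isinstance(value, dict):
            observe(value, field, seen)
    return seen

inferred = defaultdict(set)
for rec in records:
    observe(rec, seen=inferred)

# Fields that appear in only some documents hint at optional attributes;
# repeated nested paths (e.g. customer.id) hint at related entities.
for field, types in sorted(inferred.items()):
    print(f"{field}: {sorted(types)}")
```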

I blogged about this topic at TDAN.com more than 2 years ago with discussion of Big Changes in the World of Data Modeling. You’ll also find thoughts from data modelers Pascal Desmarets and Ted Hills at TDAN.

Data Governance

Moving from the old world of data warehousing and IT projects to the new world of data lakes and self-service brings several differences in how we work with data. There are four significant areas of pressure where data governance change is essential: agility, speed, self-service, and autonomy. To respond to these pressures we must rethink governance approaches to policy enforcement, complexity management, decision rights and authorities, and process rigor. Without compromising very important data governance goals, we must find effective ways to accelerate projects, embrace autonomy, and reduce or eliminate bureaucracy.

Natural tensions exist between the needs of self-service data consumers and the traditional practices of data governance. (See figure 2.) Complexity and rigor are in conflict with agility and speed. Decision rights and policy enforcement inhibit autonomy. Self-service analysis is often iterative, exploratory, and discovery-oriented, which is not highly compatible with rigorous change management processes. These are but a few examples of the tensions between two data-related areas whose cultures and objectives are misaligned.

Figure 2. Tension between Self-Service and Data Governance

Data governance practices must change to adapt to the realities of self-service reporting, BI, and analytics. However, policy enforcement continues to be necessary. Autonomy is not anarchy. Data and analytics ecosystems are inherently complex. We can’t eliminate complexity. We must separate complexity from complication, then manage complexity and eliminate unnecessary complication. Decision rights continue to be needed, but decision models must change such that collaborative decision-making is common and authority-based decisions are the exception. Rigorous processes should evolve to become adaptive processes that focus on management instead of control. You can read more about this in my blog about The Next Generation of Data Governance.

Data Curation and Cataloging

In the not-too-distant past, nearly all of the data used by a business was created within the business – transaction data from OLTP systems subsequently transformed to populate data warehouses. Adoption of big data changed that reality, with an increasing share of data coming from external sources and, at least in an informal sense, being curated data. The trend toward data curation accelerates as the proportion of external data grows, and as datasets that blend internal and external data (data lake, data sandbox, etc.) gain attention and interest.

Data curation encompasses the activities needed to oversee a collection of data assets and make data available to and findable by data consumers. Cataloging is an essential curation activity to create and maintain a vital and valuable data resource with all of the metadata necessary to make data findable, understandable, and useful. Curating and cataloging work together to meet the data needs of business analysts, data analysts, and data scientists. To learn more about data cataloging, download The Ultimate Guide to Data Catalogs.
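For illustration only, here is a sketch of the kind of metadata a catalog entry might carry to make a dataset findable and understandable; the field names and the example dataset are assumptions, not drawn from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative catalog entry: the fields shown are typical kinds of metadata,
# not the schema of any specific catalog tool.
@dataclass
class CatalogEntry:
    name: str                    # how data consumers search for the dataset
    description: str             # business meaning, in plain language
    source: str                  # originating system or external provider
    owner: str                   # steward accountable for the dataset
    location: str                # where the physical data lives
    tags: List[str] = field(default_factory=list)  # findability keywords
    quality_notes: str = ""      # known fitness-for-purpose caveats

entry = CatalogEntry(
    name="web_clickstream_daily",
    description="Daily sessionized clickstream events from the public website.",
    source="external CDN logs",
    owner="digital analytics team",
    location="s3://data-lake/raw/clickstream/",  # hypothetical path
    tags=["clickstream", "marketing", "raw"],
    quality_notes="Bot traffic not filtered; timestamps in UTC.",
)
print(entry.name, entry.tags)
```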

Summary

Modern data management is big, complex, and difficult. It is a multifaceted discipline that must attend to architecture, quality, modeling, governance, curation, and cataloging. Over the past 2 or 3 years, nearly every consulting project that I’ve encountered has been challenged by data management problems that are a direct result of the data quake. We must modernize data management practices, and we must get started now. Data and technology will continue to change. Don’t let the gap between data opportunities and your data management practices continue to widen.

Dave Wells

Dave Wells is an advisory consultant, educator, and industry analyst dedicated to building meaningful connections throughout the path from data to business value. He works at the intersection of information...