Technologies for Data Governance in a Self-Service World

My January blog post, The Next Generation of Data Governance, described recent developments – agile, big data, cloud, and self-service – that challenge the practices of traditional data governance. In this post, I’ll zoom in on self-service as an especially challenging area and look at how the same technologies that bring the challenges also offer solutions.

The modern analytics world is rich with technologies, many of which include some data governance features and functions. Figure 1 shows the technology spectrum with several product examples in each category.

Figure 1 – The Analytics Technology Spectrum

Data Ingestion 

Data ingestion tools support the processes of bringing data into the analytics ecosystem. They are designed to move data from the original source to a “first stop” data store such as a staging area or landing zone. Common functions include:

  • Collection of data from sources.
  • Data stream ingestion.
  • Light filtering and cleansing.
  • Routing to data stores.

All ingestion tools have some level of data governance functionality, but they vary widely in the degree to which they are governance aware.

  • Open source tools generally offer fewer governance functions than proprietary tools.
  • The primary governance capability implemented in all ingestion tools is capture of ingestion metadata, which is the start of data lineage tracking (see the sketch after this list).
  • Interoperability with catalog and preparation tools is valuable because it extends the benefits of machine learning to data discovery.
  • Proprietary tools that go beyond simple ingestion to include discovery and profiling capabilities offer more robust data governance features than basic ingestion tools.
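To make the metadata-capture point concrete, here is a minimal Python sketch of what recording ingestion metadata might look like. The record structure and field names are illustrative assumptions rather than any specific tool’s format; real ingestion tools capture this information automatically as data moves to the landing zone.

    import hashlib
    import json
    from datetime import datetime, timezone

    def ingest_file(source_path: str, landing_zone: dict, source_system: str) -> dict:
        """Simulate ingesting a file and capturing lineage metadata.
        The landing_zone dict stands in for a staging-area data store."""
        with open(source_path, "rb") as f:
            payload = f.read()

        # Ingestion metadata: the starting point of data lineage tracking.
        # Field names are illustrative, not any specific tool's schema.
        metadata = {
            "source_system": source_system,
            "source_path": source_path,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "byte_count": len(payload),
            "content_hash": hashlib.sha256(payload).hexdigest(),
            "target": "landing_zone",
        }

        landing_zone[metadata["content_hash"]] = payload
        return metadata

    if __name__ == "__main__":
        landing_zone = {}
        # Create a small sample file so the example is self-contained.
        with open("orders.csv", "w") as f:
            f.write("order_id,amount\n1,19.99\n")
        print(json.dumps(ingest_file("orders.csv", landing_zone, "erp"), indent=2))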

Data Cataloging 

Data cataloging manages the inventory of data sets. Catalog tools help to collect and maintain the metadata that is the critical understructure of the analytics ecosystem. A data catalog connects people with data – finding, evaluating, understanding and acquiring needed data. The primary purpose of data cataloging is to help data consumers – especially self-service consumers – to find and access the datasets that they need. Data catalogs provide features to:

  • Collect and expose metadata about datasets.
  • Search the catalog by keywords, tags, and facets.
  • Preview, rate, and review datasets.
  • Perform extensive data curation.
  • Provide self-service access to datasets.
  • Track dataset usage.
  • Catalog and expose data preparation procedures related to datasets or combinations of datasets.
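As a rough illustration of the first two features and of usage tracking, the sketch below shows a toy in-memory catalog in Python that stores metadata about datasets and supports search by keyword and tag. The CatalogEntry fields are assumptions made for the example; commercial catalogs store far richer metadata and layer previews, ratings, and curation workflows on top.

    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        # Illustrative metadata fields; real catalogs capture much more.
        name: str
        description: str
        owner: str
        tags: set = field(default_factory=set)
        usage_count: int = 0          # supports dataset usage tracking

    class DataCatalog:
        def __init__(self):
            self._entries = {}

        def register(self, entry: CatalogEntry) -> None:
            self._entries[entry.name] = entry

        def search(self, keyword: str = "", tag: str = "") -> list:
            """Find datasets by keyword (name or description) and/or tag."""
            hits = []
            for entry in self._entries.values():
                keyword_hit = keyword.lower() in (entry.name + " " + entry.description).lower()
                tag_hit = not tag or tag in entry.tags
                if keyword_hit and tag_hit:
                    hits.append(entry)
            return hits

        def access(self, name: str) -> CatalogEntry:
            """Self-service access; each access increments usage tracking."""
            entry = self._entries[name]
            entry.usage_count += 1
            return entry

    if __name__ == "__main__":
        catalog = DataCatalog()
        catalog.register(CatalogEntry(
            name="orders_2023",
            description="Certified order transactions for 2023",
            owner="sales-data-team",
            tags={"certified", "sales"},
        ))
        for hit in catalog.search(keyword="orders", tag="certified"):
            print(hit.name, hit.owner, sorted(hit.tags))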

Data cataloging is tightly coupled with data curation and is inherently a core component of modern data governance. From data profiling and metadata discovery to usage tracking at time of consumption, catalog tools support governance at all stages of cataloging activity:

  • Initial build of the data catalog.
  • Routine data catalog maintenance.
  • Catalog search and self-service data access by data consumers.

Data catalogs are the key to collaborative data governance. The culture change needed for self-service – a shift from centralized to community governance – is supported by collaboration, crowd-sourcing, and machine learning capabilities of data catalogs.

Data Preparation

Data preparation tools are used to improve, enrich, format and blend data to make it ready for reporting and analysis. These tools include data protection functions as well as lineage tracking features. Typical features include:

  • A visual, business-friendly, code-free user interface for designing data preparation workflows.
  • A library of built-in data preparation functions for data transformation and data blending.
  • Machine learning that suggests data characteristics and data preparation operations.
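The short pandas sketch below is a simplified stand-in for what a preparation tool’s visual workflow does behind the scenes: a few common preparation steps with a running record of what was applied, which is the raw material for lineage tracking. The step names, the lineage list, and the sample data are assumptions made for illustration, not any particular product’s behavior.

    import pandas as pd

    def prepare_orders(raw: pd.DataFrame):
        """Apply a few common preparation steps and record each one."""
        lineage = []   # illustrative stand-in for lineage tracking

        df = raw.drop_duplicates()
        lineage.append("drop_duplicates")

        df = df.dropna(subset=["amount"])
        lineage.append("dropna(amount)")

        # Enrichment: derive a column used downstream for analysis.
        df = df.assign(amount_band=pd.cut(df["amount"],
                                          bins=[0, 50, 200, float("inf")],
                                          labels=["low", "mid", "high"]))
        lineage.append("assign(amount_band)")

        return df, lineage

    if __name__ == "__main__":
        raw = pd.DataFrame({
            "order_id": [1, 1, 2, 3],
            "amount": [19.99, 19.99, None, 250.0],
        })
        prepared, steps = prepare_orders(raw)
        print(prepared)
        print("lineage:", " -> ".join(steps))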

Reporting and Analysis

Reporting and analysis tools are used to explore, model, and visualize data to find patterns, trends, and insights – the keys to data value and impact. Tools in this category enable self-service reporting, dashboards, visualization, and analysis for data consumers ranging from non-technical business users to statisticians and data scientists. Common features include:

  • Connectivity to many kinds of data sources, with a broad set of built-in connectors and, in some tools, the ability to create custom connectors.
  • Support for many data types, from structured data in relational databases to text, documents, and a variety of big data sources.
  • Drag-and-drop visual user interface, often code-free, that allows non-technical people to perform advanced analysis and develop high-impact data visualizations.
  • Desktop, enterprise, and cloud deployments.

Ideally, reporting and analysis tools are governance aware, supporting security, privacy, and compliance constraints throughout their processes and functions, including:

  • Interactive data visualization and profiling.
  • “In the box” analysis and visualization.
  • Machine-learning discovery and recommendations.
  • Scheduling and publishing.
  • Collaboration and sharing.

Tools that are data lineage aware, with the ability to identify certified and trusted data sources and to watermark high-trust reports and analyses, are particularly valuable for governance in the self-service world.
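A very simplified sketch of that idea: before a report is published, check whether every source it draws on is certified and label the output accordingly. The certification registry and the watermark text here are assumptions for illustration; in practice this information would come from the data catalog and the reporting tool’s own lineage metadata.

    # Illustrative registry of certified sources; in a real deployment this
    # would come from the data catalog, not a hard-coded set.
    CERTIFIED_SOURCES = {"orders_2023", "customer_master"}

    def watermark_report(title: str, sources: list) -> str:
        """Return a report header labeled by the trust level of its sources."""
        uncertified = [s for s in sources if s not in CERTIFIED_SOURCES]
        if uncertified:
            label = "CAUTION - UNCERTIFIED SOURCES: " + ", ".join(uncertified)
        else:
            label = "CERTIFIED SOURCES ONLY"
        return f"{title}\n[{label}]"

    if __name__ == "__main__":
        print(watermark_report("Q4 Revenue Dashboard", ["orders_2023", "customer_master"]))
        print(watermark_report("Ad-hoc Churn Analysis", ["orders_2023", "web_clickstream"]))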

Data Pipeline Management

Data pipeline management begins at the point of ingestion. Pipeline management tools support data flow mapping, change management, data lineage tracking, data availability, and data accuracy from point of ingestion to point of application. Perhaps most important, pipeline management enables data flows to be operationalized. It is important to note that these tools have as many differences as they have similarities. Snowflake, for example, is a cloud-based relational data warehouse with data pipeline functions, while Zaloni Bedrock is a Hadoop-optimized data lake management tool.

Pipeline tools help BI and analytics designers, developers, and operators to build, execute, monitor, and manage movement of data through the ecosystem including:

  • Data flow and business logic.
  • Data integration.
  • Scheduling and task execution.
  • Resource and workload management.
  • Error handling and fault tolerance.
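To ground those functions, here is a bare-bones Python sketch of a pipeline runner: tasks execute in order, failures are retried, and each step is logged so the run can be monitored. It is a toy built on assumptions, not how any of the products named above work; real pipeline tools add dependency graphs, resource and workload management, and much richer lineage.

    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("pipeline")

    def run_pipeline(tasks, max_retries=2, retry_delay=1.0):
        """Execute (name, callable) tasks in order with simple retry-based error handling."""
        for name, task in tasks:
            for attempt in range(1, max_retries + 2):
                try:
                    log.info("running task %s (attempt %d)", name, attempt)
                    task()
                    break
                except Exception as exc:
                    log.warning("task %s failed: %s", name, exc)
                    if attempt > max_retries:
                        raise
                    time.sleep(retry_delay)

    if __name__ == "__main__":
        state = {"rows": 0}

        def extract():
            state["rows"] = 100                        # pretend we pulled 100 rows from a source

        def transform():
            state["rows"] = int(state["rows"] * 0.9)   # light filtering

        def load():
            log.info("loaded %d rows to the warehouse", state["rows"])

        run_pipeline([("extract", extract), ("transform", transform), ("load", load)])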

Continuity and cohesion of governance functions across the ecosystem help to minimize complexity and simplify governance processes. Ideally, these tools provide seamless governance capabilities across the full range of pipeline functions, including:

  • Pipeline design and development.
  • Scheduling and execution.
  • Monitoring and management.
  • Data security and privacy protection.
  • Data lineage tracking.

Self-service data governance is challenging. Finding the right mix of control and autonomy requires changes in organization, culture, processes, practices, and communications. Much of the difficulty is minimized when you understand and apply the governance features and functions that are built into your technology stack. For a deeper dive into these technologies, download our report on the Modern Analytics Ecosystem: The Evolving Information Supply Chain.

Dave Wells

Dave Wells is an advisory consultant, educator, and industry analyst dedicated to building meaningful connections throughout the path from data to business value. He works at the intersection of information...
