Data Governance in the Era of Generative AI
ABSTRACT: With the increasing adoption of Generative AI, learn how data governance will add value to and benefit from Generative AI.
Over the last few years, data governance (DG) has expanded its focus from just risk and compliance to also include enabling analytics and AI and driving business value with trusted data. Enabling teams with diverse skills to find, understand, and access relevant and trusted data in a self-service model has become as important as ensuring compliance and privacy.
As a result, moving from a rigid, one-size-fits-all approach to a more agile, flexible, right-sized approach to data governance based on the needs of the business use case has become paramount.
With that broader lens for modern data governance, the key capabilities for DG can be summarized as follows: data discovery and cataloging; association of business context with data (business terms/definitions/policies); data quality monitoring and management; understanding data lineage; and policy enforcement (for privacy/security/compliance).
These capabilities support the following key use cases:
(1) Regulatory compliance
(2) Data privacy management
(3) Data security
(4) Trusted business reporting/business intelligence
(5) AI and analytics
(6) Data democratization and sharing
Use cases 1-3 are tied to the historical focus on risk & compliance, and use cases 5-6 support the new charters of driving AI, analytics, and business value. Use case 4 straddles the two categories.
Any look at DG market trends over the next couple of years has to start with Generative AI (GenAI), because those trends will be shaped by how GenAI changes overall data management and consumption. Large Language Models (LLMs) and their conversational UIs have the potential to become the front end for most data access and consumption. Given this, data teams must increasingly focus on supplying accurate and trusted data to LLMs.
GenAI accelerates trends already evident with traditional AI: the importance of data quality and privacy, growing focus on responsible and ethical AI, and the emergence of AI regulations. This will create both new challenges and opportunities for DG. To understand the implications, we have to look at this from two angles:
How DG will support/add value for GenAI
How GenAI will support/add value for DG
How Data Governance Supports GenAI
As organizations adopt foundational LLMs, their differentiation will come from their own data and knowledge base as inputs to the LLMs. The growing popularity of fine-tuning and Retrieval Augmented Generation (RAG) for incorporating domain-specific data underscores a few key points:
“Traditional” data governance will continue to play a key role in addressing data privacy, security and compliance.
AI brings a whole new set of challenges such as fairness, transparency and AI ethics, and the need to comply with emerging AI regulations. To address these challenges, DG frameworks must rapidly evolve to support both traditional AI and GenAI.
Unstructured data such as text files is the dominant input to LLMs. This makes data discovery and classification capabilities for unstructured data a foundational governance requirement.
As techniques such as RAG see more adoption, the need for real-time DG - for instance, dynamically applying policies to relevant data in an LLM-RAG workflow - will become more important.
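To make the idea of real-time DG in an LLM-RAG workflow more concrete, here is a minimal Python sketch of a policy filter that runs between retrieval and prompt assembly. Everything in it is an illustrative assumption rather than a reference to any particular product: the Chunk type, the tag names, the POLICY table and the role names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    """A retrieved text chunk plus the governance tags attached to it."""
    text: str
    tags: frozenset = field(default_factory=frozenset)


# Hypothetical policy table: which roles may see data carrying each tag.
POLICY = {
    "pii": {"hr", "privacy_office"},
    "financial": {"finance"},
}


def allowed(chunk: Chunk, role: str) -> bool:
    """Allow a chunk only if the role is cleared for every tag on it.

    Untagged chunks pass; tags missing from POLICY are denied by default.
    """
    return all(role in POLICY.get(tag, set()) for tag in chunk.tags)


def filter_for_prompt(chunks, role):
    """Drop chunks the role may not see before they reach the LLM prompt."""
    return [c for c in chunks if allowed(c, role)]


retrieved = [
    Chunk("Office opening hours"),
    Chunk("Employee home address record", frozenset({"pii"})),
    Chunk("Quarterly revenue forecast", frozenset({"financial"})),
]
context = filter_for_prompt(retrieved, role="finance")
```

With the sample data, a user in the finance role keeps the untagged chunk and the financial chunk, while the chunk tagged as PII is dropped before prompt assembly. Denying unknown tags by default is one possible design choice; a real deployment would take its tags and rules from the organization's own catalog and policy engine.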
Traditional DG processes provide a well-trodden path for proper management and usage of data across organizations: discover and classify data to identify critical/sensitive data; map the data to policies and other business context; manage data access and security; manage privacy and compliance; and monitor and report on effectiveness.
Similarly, as DG frameworks expand to support AI governance, they have an important role to play across the GenAI/LLM value chain. Governance activities along this chain include:
Cataloging and inventorying models
Classifying risk level and mapping to associated policies and policy frameworks (e.g. EU AI Act, NIST AI Risk Management Framework)
Governing input data for accuracy, fairness and responsible use
Checking user access (e.g. a GenAI application for HR should only be accessed by HR teams)
Checking prompt inputs to prevent unintentional leakage of sensitive data or intellectual property
Checking outputs for unauthorized data and accuracy (including “human in the loop” workflows)
Checking for explainability and transparency
Auditing and reporting for compliance and other needs
Scaling these capabilities depends on policy and metadata-based automation.
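As a rough illustration of how a few of these checks could be automated, the Python sketch below combines a user access check, a scan of the prompt for sensitive values, and a small record that could feed an audit log. The application name, role names and regular expressions are hypothetical placeholders; a real deployment would rely on the organization's own classifiers and policy store rather than two simple patterns.

```python
import re

# Hypothetical patterns for sensitive values; deliberately simplistic.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical mapping of GenAI applications to the roles allowed to use them.
APP_ALLOWED_ROLES = {"hr_assistant": {"hr"}}


def check_access(app: str, role: str) -> bool:
    """Return True only if the role is listed for the application."""
    return role in APP_ALLOWED_ROLES.get(app, set())


def redact_prompt(prompt: str):
    """Replace sensitive matches with placeholders and report what was found."""
    findings = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        prompt, count = pattern.subn(f"[{label.upper()}]", prompt)
        if count:
            findings.append(label)
    return prompt, findings


def guarded_prompt(app: str, role: str, prompt: str) -> dict:
    """Run the access check and prompt check; return a record for auditing."""
    if not check_access(app, role):
        raise PermissionError(f"role {role!r} may not use {app!r}")
    clean, findings = redact_prompt(prompt)
    return {"app": app, "role": role, "prompt": clean, "redacted": findings}


record = guarded_prompt(
    "hr_assistant", "hr", "Contact jane@example.com about 123-45-6789"
)
```

Here the returned record carries the redacted prompt together with the labels of what was removed, so the same object can be passed to the model call and written to an audit trail. Output checks and human-in-the-loop review would sit after the model call and are not shown.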
How GenAI Supports Data Governance
Traditional AI/ML will continue to be critical for automating and scaling various DG processes. These include data classification; associating policy and business context with data; and detecting anomalies/issues and creating and applying data quality rules to fix them. Building on these capabilities, GenAI has the potential to turbocharge data democratization and drive dramatic gains in productivity for data teams.
Early applications of GenAI have focused on low-hanging fruit. These include delivering a natural language interface for data search, and auto-generating business glossary definitions and documentation for data. The generative aspect of GenAI has the potential to enhance and accelerate many other processes in DG. For example, GenAI can:
Explain lineage for a report or dataset to enhance trust
Classify and add metadata tags to unstructured data based on themes/type of content
Extract regulatory intelligence from policy documents to codify them as technical controls
Enable dynamic data access control based on policies, roles, permissions and usage context
Create synthetic data for model training, testing and development
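To illustrate the first item in the list above, explaining lineage, the Python sketch below walks a small lineage graph and turns the upstream paths into a prompt that could be handed to an LLM for a plain-language explanation. The asset names and the LINEAGE dictionary are hypothetical, the graph is assumed to be acyclic, and the call to an actual model is left out on purpose.

```python
# Hypothetical lineage metadata: each asset maps to the assets it is built from.
LINEAGE = {
    "sales_report": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
}


def upstream_paths(asset, lineage):
    """Return every path from `asset` back to a source with no upstream.

    Assumes the lineage graph has no cycles.
    """
    parents = lineage.get(asset, [])
    if not parents:
        return [[asset]]
    return [
        [asset] + path
        for parent in parents
        for path in upstream_paths(parent, lineage)
    ]


def lineage_prompt(asset, lineage):
    """Build a prompt that grounds the explanation in recorded lineage only."""
    lines = [" <- ".join(path) for path in upstream_paths(asset, lineage)]
    return (
        f"Explain in plain language how '{asset}' is derived, "
        "using only these lineage paths:\n" + "\n".join(lines)
    )


prompt = lineage_prompt("sales_report", LINEAGE)
```

Restricting the prompt to lineage paths recorded in the catalog is one way to keep the generated explanation tied to governed metadata rather than to the model's guesses.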
A New Era for Data Governance
Over the last few years, the focus of data governance has broadened to include enabling AI/analytics and data democratization in addition to risk & compliance. This drives the need for a more agile and flexible approach to data governance. Now with the exploding interest in AI/GenAI adoption, data governance is poised for the next stage of evolution. This evolution will be driven by the need to address data quality and privacy for AI, enable responsible and ethical AI, address new AI-centric risks such as fairness and transparency, and comply with emerging AI regulations.
While there are varied opinions on the adoption pace and trends for GenAI, there is universal agreement on one point: GenAI initiatives will not succeed without a strong data foundation that provides reliable and accurate input data and ensures responsible, compliant use of that data. Organizations will be hard-pressed to develop an AI strategy that is decoupled from their data strategy. As they move towards developing integrated data and AI strategies, data governance will need to evolve and expand to support integrated data and AI governance.