Mitigating the Risk of Bias in Synthetic Data for AI

ABSTRACT: Companies need to monitor and address biases that synthetic data generation techniques can introduce into artificial intelligence (AI) models.


As described in my prior blog, synthetic data is gaining popularity because it can help safeguard privacy and ease a shortage of training data for artificial intelligence (AI). However, some inherent limitations of synthetic data raise the question: does synthetic data increase or decrease the risk of bias in AI models? In this blog, I will elaborate on the nature of AI bias and discuss how to mitigate its risk in synthetic data.

Some data science teams use generative adversarial networks (GANs) to create synthetic data. A GAN is itself an AI model, and it generates training data for another model. This creates the risk that synthetic data inherits the bias of both the GAN and the real-world data used to train it.

Many experts expect the use of synthetic data to increase and even exceed the use of real-world data for AI projects. Given this, data science, engineering, product development, and risk management teams need to understand 1) how synthetic data can perpetuate AI bias and 2) how they can mitigate AI bias from synthetic data.

Types of AI Bias 

Multiple forms of bias can creep into AI models (see figure 1):

  1. Societal bias refers to discrimination based on membership in certain groups, categories, or classes (e.g., race, gender, religious affiliation, age).

  2. Data bias derives from data collection, processing, and usage. For example, the wording of a survey question can skew people's responses toward a certain opinion. And imputing missing fields can amplify the majority opinion at the expense of the minority opinion.

  3. Model bias occurs when an AI model produces different levels of sensitivity or predictive accuracy (e.g., false positives and false negatives) across subsets of a population. 

  4. Finally, interpretation bias arises when decision-makers consume the output of AI models without understanding the assumptions and limitations of those models. When decision-makers don’t examine the model to check for bias, they may make decisions that further perpetuate biased outcomes.

Figure 1: Sources of AI Bias
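To make model bias (the third type above) more concrete, here is a minimal sketch of how a team might compare false positive and false negative rates across groups. The function name `error_rates_by_group`, the group labels "A" and "B", and the toy records are all hypothetical and serve only as illustration.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return {group: (false_positive_rate, false_negative_rate)}.

    Each record is a (group, actual, predicted) tuple with boolean labels.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, actual, predicted in records:
        c = counts[group]
        if actual:
            c["pos"] += 1
            if not predicted:
                c["fn"] += 1  # actual positive predicted as negative
        else:
            c["neg"] += 1
            if predicted:
                c["fp"] += 1  # actual negative predicted as positive
    rates = {}
    for group, c in counts.items():
        fpr = c["fp"] / c["neg"] if c["neg"] else 0.0
        fnr = c["fn"] / c["pos"] if c["pos"] else 0.0
        rates[group] = (fpr, fnr)
    return rates


# Hypothetical toy data: (group, actually_repaid, predicted_to_repay)
records = [
    ("A", True, True), ("A", True, True), ("A", False, False), ("A", False, True),
    ("B", True, False), ("B", True, True), ("B", False, False), ("B", False, False),
]
rates = error_rates_by_group(records)
for group, (fpr, fnr) in sorted(rates.items()):
    print(f"group {group}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

In this toy data the model errs in different directions for the two groups, which is the kind of gap that overall accuracy alone would hide.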

Case Study for Systemic Risk: How Synthetic Data Can Augment AI Bias

Consider this hypothetical use case. A bank would like to automate a loan application process. The bank created machine learning models using historical customer data, which included sensitive categories such as race and gender. Unfortunately, the bank failed to properly clean the dataset, resulting in data quality issues and other factors that created two types of bias:

  • The data team imputed missing values in the “Race” attribute with the most frequent value, which happened to be “White”.

  • They imputed missing values in “Gender” by selecting the value from multiple data sources. Given inconsistent values (“male” in some systems and “female” in others), the imputation logic simply followed a predefined hierarchy of sources without verifying the accuracy of the selected value. It selected “male” more often than “female.”
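The first imputation problem above can be illustrated with a few lines of code. This is a minimal sketch with made-up counts (6, 4, and 10 missing), not the bank's actual data; the helper names `impute_with_mode` and `share` and the category "Other" are hypothetical.

```python
from collections import Counter


def impute_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def share(values, target):
    """Fraction of non-missing entries equal to target."""
    observed = [v for v in values if v is not None]
    return observed.count(target) / len(observed)


# Hypothetical column: 6 "White", 4 "Other", 10 missing
race = ["White"] * 6 + ["Other"] * 4 + [None] * 10

before = share(race, "White")
after = share(impute_with_mode(race), "White")
print(f"majority share before imputation: {before:.0%}")
print(f"majority share after imputation:  {after:.0%}")
```

Because every missing value is assigned to the majority category, the majority share grows from 60% of observed values to 80% of all rows, and any model or GAN trained on the result sees a more skewed population than the one that actually exists.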

The bank leveraged synthetic data using GANs in an effort to mitigate data quality and privacy concerns. However, the synthetic data further amplified biases in three ways. First, the ML model augmented the bias in the synthetically generated training data. Second, the synthetic data augmented the bias in the originating input data. And third, that originating data augmented social biases. (See figure 2.) 

Furthermore, under budget and time pressure, the bank deployed the ML model to the entire population before the development team conducted due diligence and field testing. As a result, the risk of AI bias rose sharply.

Figure 2: Systemic Risk of Bias Augmentation by Synthetic Data Plus AI

Practical Architecture & Governance Strategies

A common cause of AI bias is the lack of representation in the training dataset. Data scientists can create and use synthetic data to cover the underrepresented population(s). To reduce the risk of AI bias, teams need to embed controls to prevent, detect, and correct biases in the synthetic data pipeline. The data architects, governance officers, and digital product developers need to collaborate to stamp out synthetically induced bias in AI models. Here are some tips and strategies to consider.

  1. Use data observability tools to measure data bias. Data observability tools can provide a transparent view into the internal state of the data based on its outputs, including bias and fairness metrics. For example, the Synthesized team has developed fairlens, an open-source Python library to discover and measure data bias.

  2. Integrate additional tools. Test data management, automated generative models, and analytics-driven imputation all help provide visibility over your business. They enable teams to monitor and automatically refine their synthetic data to ensure it meets desired bias and fairness thresholds. For a more detailed example of this approach, see Brouton Lab's blog.

  3. Use MLOps to automate the remediation of bias in the data as detected by data observability tools. As explained in this DataRobot blog, MLOps can automate data remediation, such as addressing data and model drift. This helps when bias metrics in synthetic training data exceed acceptable thresholds.

  4. Innovate with user experience (UX) to reduce systemic bias. UX best practices on intelligent products and recommendation engines can provide trust and transparency. AI-enabled products and services should empower the "human in the loop" so users can continuously monitor and make informed decisions about model outcomes. This article by the AIXDesign team provides some examples of UX strategies for AI trust and transparency.
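As a rough illustration of the kind of bias metric an observability tool might report, the sketch below compares the gap in approval rates between groups in real data and in synthetic data. It does not use the fairlens API; the function `approval_rate_gap`, the field names `gender` and `approved`, and all counts are assumptions chosen for the example.

```python
def approval_rate_gap(rows, group_key="gender", outcome_key="approved"):
    """Largest difference in positive-outcome rate between any two groups."""
    totals, positives = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if row[outcome_key] else 0)
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def make_rows(gender, approved, denied):
    """Build hypothetical loan records for one group."""
    return ([{"gender": gender, "approved": True}] * approved
            + [{"gender": gender, "approved": False}] * denied)


# Hypothetical counts: the synthetic set has a wider gap than the real set
real = make_rows("male", 6, 4) + make_rows("female", 5, 5)
synthetic = make_rows("male", 7, 3) + make_rows("female", 4, 6)

real_gap = approval_rate_gap(real)
synthetic_gap = approval_rate_gap(synthetic)
amplified = synthetic_gap > real_gap
print(f"real gap={real_gap:.2f}, synthetic gap={synthetic_gap:.2f}, amplified={amplified}")
```

Tracking the same metric on both the source data and the generated data is one simple way to see whether the generator narrowed or widened an existing disparity.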

In figure 3 below, we present a holistic capability architecture for addressing the risk of bias due to synthetic data. To illustrate, here's an example use case of how these capabilities contribute to an AI-based credit assessment mobile app.

To start, real data feeds a GAN model that produces synthetic data. This in turn is scanned by the data observability component to measure the bias metrics in the model training, testing, and production data. An MLOps workflow ensures this measurement is continuous, and automatically alerts the data scientist or corrects the data if it exceeds the acceptable bias thresholds. 

If the data approaches the tolerance limit for suspected bias, the workflow tells the user interface (UI) to nudge users with a warning of a potential bias. The UI provides a link to relevant metadata, including the input sources, data lineage, assumptions, and explanations of the model intuition. In combination, the credit application may not necessarily eliminate bias, but it can help users understand the sources of bias.
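The decision logic in this workflow could look something like the sketch below. It is only an illustration under stated assumptions: the `Action` names, the `decide_action` function, and the `warning_margin` of 80% of the threshold are hypothetical, and a real pipeline would set its thresholds through governance review.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"            # metric is comfortably within tolerance
    WARN_USER = "warn_user"  # metric is approaching the limit: nudge the UI
    REMEDIATE = "remediate"  # metric exceeds the limit: alert or correct


def decide_action(bias_metric, threshold, warning_margin=0.8):
    """Map a measured bias metric to a workflow action.

    warning_margin is the assumed fraction of the threshold at which
    the UI starts warning users about potential bias.
    """
    if bias_metric > threshold:
        return Action.REMEDIATE
    if bias_metric >= warning_margin * threshold:
        return Action.WARN_USER
    return Action.PASS


# Hypothetical bias readings checked against an acceptable threshold of 0.20
for reading in (0.05, 0.17, 0.25):
    print(reading, decide_action(reading, threshold=0.20).value)
```

Keeping the warning band separate from the hard threshold lets the UI surface potential bias to users before the pipeline has to block or correct the data.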

Figure 3: Data Architecture and Governance Framework to Counter AI Bias

Conclusion – Beyond Intelligence and Fairness

Synthetic data and AI are powerful complementary technologies. However, because both can contribute to the risk of AI bias, enterprises should maintain a heightened level of control in the data pipeline. AI model development is inherently subject to biases, and synthetic data can either reduce or amplify those biases. Addressing these risks from an architecture and governance perspective is not a panacea. However, it can help an intelligent product team design a more transparent and explainable AI output, which is more likely to lead to justified decisions.

David Hendrawirawan

David helps clients architect data lakes, optimize BI reporting practices, automate data quality and master data management, and engineer cybersecurity, privacy, and responsible AI/ML controls. Having previously been with...
