Synthetic Data for AI: Definition, Risks, and Strategies
ABSTRACT: Many machine learning projects fail because data scientists don’t have the right data. Synthetic data offers a novel algorithmic approach to addressing these data and algorithmic risks.
Many machine learning projects fail because data scientists don’t have the right data. They train their ML models on inaccurate, inappropriate, or non-compliant data. When they put those models into production, they generate incorrect, inconsistent, or biased predictions that raise business risk. In essence, the problem is one of “garbage in, garbage out.”
ML and other types of Artificial Intelligence (AI) therefore need a good dose of “good” data for training and testing purposes. Getting good data is the most difficult and expensive part of building effective AI products and services. Studies such as IBM’s Data Quality Report have shown that bad data costs the US economy trillions of dollars, about 50% of which can be attributed to the time knowledge workers spend finding data, correcting errors, and confirming that they can trust it. And while the costs of complying with data rules such as PCI DSS and GDPR are hefty, the cost of non-compliance is 2.71 times higher.
This creates the opportunity for “good” data that:
Fits the right problem or question that the AI is intended to solve
Has sufficient quality (e.g., accuracy, completeness, and timeliness)
Avoids bias against certain genders, ethnicities, etc.
Obfuscates personally identifiable information
Synthetic data can help data science teams meet these requirements in a more cost-effective and scalable manner. Companies generate synthetic data using sophisticated algorithms that automatically discover and learn the statistical patterns and characteristics of real-world datasets, and then mimic them. By bypassing the need to collect data from the real world, teams can prepare robust synthetic training datasets with less effort and at lower cost.
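To make the “learn the patterns, then mimic them” idea concrete, here is a minimal, hypothetical sketch in Python. It models each column independently with simple statistics (mean and standard deviation for numbers, observed frequencies for categories); the column names and values are invented for illustration, and production generators such as GANs or copula-based tools model the joint distribution rather than columns in isolation, so treat this strictly as a toy.

```python
# Toy sketch: learn simple per-column statistics from a small "real"
# table, then sample synthetic rows that mimic those statistics.
# Column names and values are invented for illustration.
import random
import statistics

real_rows = [
    {"age": 34, "plan": "basic"},
    {"age": 41, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "premium"},
    {"age": 37, "plan": "basic"},
]

# "Discover and learn" the column characteristics.
ages = [row["age"] for row in real_rows]
age_mu, age_sigma = statistics.mean(ages), statistics.stdev(ages)
plans = [row["plan"] for row in real_rows]  # drawing from this preserves frequencies

def synthesize_row(rng: random.Random) -> dict:
    """Sample one synthetic row from the learned column statistics."""
    return {
        "age": max(18, round(rng.gauss(age_mu, age_sigma))),
        "plan": rng.choice(plans),
    }

rng = random.Random(0)
synthetic = [synthesize_row(rng) for _ in range(1000)]
```

The synthetic rows contain no real customer, yet their aggregate statistics track the original table, which is the property that makes synthetic data useful for training.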
Synthetic data techniques do not require labeling or feature engineering, which is one of the most expensive, time-consuming, and error-prone activities in AI development. Deep learning models such as Generative Adversarial Networks (GANs) can generate these features and labels as part of the synthetic dataset from the get-go.
The benefits might prove so compelling that synthetic data becomes the norm rather than the exception. Gartner predicts that by 2030, synthetic data will completely overshadow real data in the development of AI models.
Figure 1 – Gartner Prediction on the use of Synthetic Data for AI.
Data science teams can use synthetic methods to generate both structured (i.e., tabular) and unstructured data such as images, 3D spatial data, video, and audio. They then train AI models for use cases such as autonomous vehicles, deepfakes, customer behavior modeling, and fraud risk monitoring on that synthetic data.
Autonomous Vehicle (AV) companies such as Tesla use synthetic data to simulate various driving scenarios automatically. They accelerate and deepen the training process by incorporating many more condition permutations, such as switching pedestrian locations, varying the speed of surrounding vehicles, adjusting the weather, etc.
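A toy sketch of this permutation idea follows; the condition lists below are invented for illustration and are not any AV company’s actual simulator parameters:

```python
# Toy sketch: enumerate driving-scenario permutations for simulation.
# The condition lists are invented for illustration.
from itertools import product

pedestrian_positions = ["crosswalk", "roadside", "mid-block"]
traffic_speeds_kmh = [30, 50, 80]
weather_conditions = ["clear", "rain", "fog", "snow"]

# Every combination of conditions becomes a distinct training scenario.
scenarios = [
    {"pedestrian": p, "traffic_speed_kmh": s, "weather": w}
    for p, s, w in product(pedestrian_positions, traffic_speeds_kmh, weather_conditions)
]
# 3 positions x 3 speeds x 4 weather conditions = 36 scenarios
```

Even this tiny grid multiplies a handful of conditions into dozens of scenarios; real simulators vary far more parameters and produce the “many more condition permutations” described above.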
Some synthetic data can have unsettling implications. Websites such as https://thispersondoesnotexist.com/ generate synthetic media, also known as deepfakes, of human faces and speech. Deepfakes can offer benefits such as greater accessibility, education, and artistic value. However, without proper ethical development frameworks and legislation, deepfakes create the risk of fake news, non-consensual impersonation, and social engineering. This video shows actor Jordan Peele impersonating US president Barack Obama using deepfake technology, which raises concerns about the risk of spreading false information.
However, the positive use cases for synthetic data abound. Companies such as https://www.synthesized.io/ use synthetic data methods to create structured data for software testing, for machine learning model development and training, and for detecting fairness and bias in AI development datasets. The fairness and bias feature is becoming increasingly crucial as AI use becomes pervasive. In my interview with Marc Degenkolb and Don Brown of Synthesized.io, they asserted that “when AI models rely on limited, poor-quality or skewed data, its applications often fail to do the job for which they were created, and negatively impact under-represented groups.”
Risk and Compliance Considerations
For companies considering the use of synthetic data for AI, here are the key risks and legal compliance requirements to consider.
Privacy and re-identification: One way to anonymize or de-identify datasets containing personal information is to generate a synthetic dataset that possesses a structure and statistical characteristics similar to those of the originating dataset but excludes real people’s information. This is different from, and can be more reliable than, traditional anonymization techniques, which suppress, aggregate, or add noise to datasets to reduce identifiability.
AI Fairness and Bias: In 2015, Google’s image recognition algorithm was called out for mislabeling images of black people as “gorillas,” highlighting the risk that AI models may produce unfair outputs due to sampling bias in their training datasets. Companies can use synthetic data to fill population sample gaps with a more balanced and representative picture of the world. This helps train AI models to avoid gender, religious, and racial discrimination.
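A minimal sketch of the gap-filling idea, assuming a toy dataset where group “B” is badly under-represented; here jittered copies of existing records stand in for the output of a real generative model:

```python
# Toy sketch: detect an under-represented group in a dataset and top it
# up with synthetic records. Jittered copies stand in for records from
# a real generative model; groups and scores are invented.
import random

rng = random.Random(42)
# A skewed dataset: group "B" is badly under-represented.
rows = [{"group": "A", "score": rng.gauss(70, 5)} for _ in range(90)]
rows += [{"group": "B", "score": rng.gauss(68, 5)} for _ in range(10)]

counts = {}
for row in rows:
    counts[row["group"]] = counts.get(row["group"], 0) + 1
target = max(counts.values())  # bring every group up to the largest group's size

balanced = list(rows)
for group, n in counts.items():
    pool = [row for row in rows if row["group"] == group]
    for _ in range(target - n):
        base = rng.choice(pool)
        # A jittered copy stands in for one synthetic record.
        balanced.append({"group": group, "score": base["score"] + rng.gauss(0, 1)})
```

After the top-up, every group is equally represented, which is the sampling-gap correction described above; whether the filled-in records are a fair representation still needs human review.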
Intrusion, Safety, and Trust: AI models used in automated decision-making (e.g., credit scoring or fraud surveillance) or autonomous products and services (e.g., self-driving cars or delivery drones) carry higher risks of physical and psychological intrusion or unfair and unexplainable decisions. This leads to legal liabilities, customer and reputational loss, and systemic harm to society. While synthetic data does not necessarily reduce all types of model risk, it can generate massive amounts of real-world simulation data at a scale and speed that increase the rigor of model training and testing, and it can create different what-if scenarios to stress the model and expose its limitations.
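As a simple illustration of the what-if stress-testing idea, the sketch below perturbs the inputs of a toy stand-in model (`credit_model` and the 0.10 tolerance are invented for illustration) and flags scenarios where a small input change produces a disproportionate score swing:

```python
# Toy sketch: stress a scoring model with synthetic what-if inputs.
# `credit_model` is an invented stand-in for a deployed model.
import random

def credit_model(income: float, debt: float) -> float:
    """Return a score in [0, 1]; a real model would be far more complex."""
    return max(0.0, min(1.0, 0.5 + income / 200_000 - debt / 100_000))

rng = random.Random(7)
base = {"income": 60_000.0, "debt": 20_000.0}
base_score = credit_model(**base)

# Perturb the inputs slightly and flag scenarios where the score swings
# disproportionately; the 0.10 tolerance is an illustrative assumption.
unstable = []
for _ in range(500):
    scenario = {
        "income": base["income"] * rng.uniform(0.95, 1.05),
        "debt": base["debt"] * rng.uniform(0.95, 1.05),
    }
    if abs(credit_model(**scenario) - base_score) > 0.10:
        unstable.append(scenario)
```

Any scenarios collected in `unstable` would point testers at regions where the model behaves erratically, which is exactly the kind of limitation this stress-testing approach aims to expose before deployment.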
Strategies and Principles for Success
While synthetic data for AI creates a multitude of opportunities, the technology requires a principled approach toward adoption and mastery. Some strategies and guiding principles include the following.
Understand that synthetic data is itself an algorithmic-based technique. Therefore, a full model risk management program should oversee its development and use. Adopting organizations should commit to investing in rigorous development processes, data quality management, technical validations, continuous output monitoring, and a sound architecture with risk safeguards from design through deployment.
Deployment of synthetic data for AI requires business, legal, and ethical prudence and oversight. Given its novelty, the full impact is not yet estimable, and many assumptions are involved in its development. A comprehensive and contextualized threat modeling and risk mitigation plan is necessary to avoid unforeseen systemic harms and a false sense of compliance or safety.
Understand the limitations of synthetic data, depending on the use case. For example:
Privacy: While it is a powerful anonymization technique, synthetic data still bears a degree of semblance and traceability to the original data. GDPR might not consider it anonymized data. A privacy impact assessment or privacy-by-design review is required to provide assurance that a synthetic dataset does not carry re-identification risk.
Bias & Fairness: While it can be used to reduce bias in data, without an objective human in the loop, synthetic datasets could produce equally unfair and discriminatory outcomes by overcompensating for the original data biases.
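One simple, illustrative way to probe the re-identification risk noted under “Privacy” above is a distance-to-closest-record check: flag any synthetic record that sits suspiciously close to a real one. The records and threshold below are invented, and real privacy assessments use far richer attacks and formal criteria, so this is only a sketch:

```python
# Toy sketch: flag synthetic records that sit suspiciously close to a
# real record (distance-to-closest-record check). Records and the
# threshold are invented; real assessments use formal criteria.
def too_close(candidate, real_rows, threshold=0.5):
    """True if `candidate` lies within `threshold` (Euclidean) of any real record."""
    return any(
        sum((a - b) ** 2 for a, b in zip(candidate, row)) ** 0.5 < threshold
        for row in real_rows
    )

real = [(34.0, 70.0), (41.0, 65.0), (29.0, 80.0)]  # e.g., (age, weight) pairs
safe_candidate = (37.0, 73.0)    # plausibly synthetic, not near any real row
leaky_candidate = (34.1, 70.2)   # nearly a copy of the first real record
```

A generator that emits records like `leaky_candidate` is effectively memorizing the original data, which is why a privacy review should run checks of this kind before a synthetic dataset is treated as anonymized.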
The age of AI is here, and it has moved algorithmic risks into the categories of systemic, black swan, and potentially catastrophic. Techniques such as synthetic data offer a novel algorithmic approach to addressing these risks. Synthetic data is a risk-neutral technique in that its role in preventing AI issues comes from its scale, speed, and cost-reducing potential to produce better training and testing data. Like any other algorithmic technique, it must be subjected to rigorous AI and data governance controls. In the next blog, we will discuss how to measure and generate fair and unbiased synthetic datasets when the real-world datasets could be unfair and biased.