Should AI Bots Build Your Data Pipelines? Part II: Risks and Governance Approaches for Data Engineers to Use Large Language Models
ABSTRACT: This blog explores the risks of using LLM assistants for data engineering as well as approaches for governing them.
When our 10-year-old makes something up he gets dimples on his cheeks. It’s a cute sign of mischief. But when a large language model makes something up it flashes no such alerts. That's one of the not-so-cute risks with this emerging form of generative AI.
LLMs are hugely popular because they make humans more efficient and creative with fast, articulate responses to questions or instructions. But the risks to data quality, privacy, intellectual property, fairness, and explainability deserve equal attention. Vendors and companies must help LLM’s early adopters—including nearly half of data engineers at last count—maintain control of their businesses.
This blog, the second in a series about LLM assistants for data engineering, explores the risks of this new technology as well as approaches for governing it. The first blog defined LLMs and examined use cases for managing data pipelines. The third blog will dive deeper into LLM platforms and tools, and the fourth and final blog will recommend guiding principles for successful adoption. The good news: if governed well, LLMs offer much-needed productivity benefits for data teams that have long struggled to support modern analytics.
Well-governed LLMs boost the productivity of data engineering teams
Recap: how it works
An LLM is a type of neural network whose interconnected nodes share inputs and outputs as they collaborate to learn, summarize, and generate content. A trained LLM produces textual answers to natural language questions, often returning sentences and paragraphs faster than humans speak. While OpenAI’s commercial ChatGPT platform, whose “GPT” stands for “Generative Pre-trained Transformer,” gets most of the headlines, other notable LLMs include Google’s proprietary Bard and openly available models such as LLaMA by Meta, BLOOM by the BigScience project that Hugging Face coordinated, and StableLM by Stability AI. An exploding community of open-source contributors will only accelerate LLM development.
The speed, range, and sophistication of these platforms create an aura of magic. But if we look inside we see familiar elements: lots of training data, lots of compute, and a big fat number cruncher. An LLM breaks reams of text down into tokens, each representing a word, part of a word, or punctuation, and assigns a number to each token. During the training phase it studies how all those numbered tokens relate to one another in different contexts, and practices generating the next tokens in a string based on what came before.
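To make those two steps concrete, here is a toy sketch in Python. It is not how a production LLM works: real models use subword tokenizers and neural networks with billions of parameters, whereas this example simply counts which token follows which. The function names and the tiny corpus are invented for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase words and punctuation marks (a toy stand-in for a subword tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def train_bigrams(ids):
    """Count how often each token ID follows each other token ID."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(ids, ids[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token_id):
    """Return the most frequently observed follower, or None if the token has no recorded follower."""
    followers = counts.get(token_id)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text for the illustration.
corpus = "the pipeline loads the data. the pipeline cleans the data. the pipeline runs."
tokens = tokenize(corpus)
vocab = build_vocab(tokens)           # token -> number
ids = [vocab[t] for t in tokens]      # the text as a sequence of numbers
counts = train_bigrams(ids)           # "training": which number tends to follow which
id_to_token = {i: t for t, i in vocab.items()}

next_id = predict_next(counts, vocab["the"])
print(id_to_token[next_id])
```

In this corpus "pipeline" follows "the" more often than "data" does, so the sketch predicts "pipeline". The prediction reflects frequency in the training text and nothing more, which is the sense in which a model can sound fluent without knowing anything.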
After billions of attempts an LLM gets pretty darn good at creating strings of tokens that become logical sentences and paragraphs. But it doesn't “know” anything in the human sense of the term: rather, it models the statistical relationships among tokens. So when you put an LLM to work in the real world, you create new risks for data quality, privacy, intellectual property, fairness, and explainability. Here’s a rundown on each of those risks and best practices for data engineers to mitigate them. As you’ll see, expert oversight, vigilance, and a healthy dose of common sense go a long way.
- Data quality. LLMs can give inaccurate responses based on outdated inputs. OpenAI, for example, trained GPT-4, the model behind the current ChatGPT, on data from September 2021 and earlier. LLMs also “hallucinate” by returning falsehoods or nonsense based on gaps in training data or lack of real-world context. This can result in erroneous documentation, buggy pipeline code, or bad implementation advice for data engineers, who must find and fix such issues.
- Explainability. Most LLMs today cannot explain how they arrived at an answer. Ask ChatGPT for its reasoning and it gives vague responses, apologies, or disclaimers. Ask for sources, and you might get broken hyperlinks to outdated pages. Last month OpenAI explained how the 307,200 neurons in its older GPT-2 model spot patterns. But the current GPT-4 model remains a black box for now, so data engineers must inspect and validate everything it tells them.
- Privacy. Like most online platforms, LLMs track user inputs. While providers such as OpenAI and Google do not share those inputs with other users, they do store them to help train the next version of software. This creates privacy and compliance risks because hackers might steal the data—or it might somehow appear in the next software version. Therefore data engineers must avoid sharing sensitive corporate data or personally identifiable information (PII) about customers, employees, or other stakeholders.
- Intellectual property (IP). Tracking user inputs also poses the risk that LLMs mishandle valuable IP, creating additional legal or regulatory concerns for users and LLM vendors alike. LLMs also can run roughshod over trademarked, copyrighted, or otherwise protected material as they scrape the Internet for training data. While data pipeline code is unlikely to create liabilities in this area, data engineers should keep an eye out for such risks.
- Fairness. Like all forms of AI, LLMs can perpetuate bias that exists in training data. If Internet content underrepresents or misrepresents minorities, for example, LLM developers must balance that content with other inputs such as synthetic data to avoid generating unfair outputs. Data engineers as well as data analysts and scientists must mitigate this risk by inspecting and vetting inputs according to common standards of fairness. This matters, for example, when prompting an LLM to create transformation logic that describes customers.
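The privacy guidance above can be partly automated. The Python sketch below masks a few obvious PII patterns before a prompt leaves the company network. The patterns and the `redact` helper are illustrative assumptions, not a complete PII detector: regex scrubbing misses names, addresses, and many other identifiers, so it supports human review rather than replacing it.

```python
import re

# Illustrative patterns only; a real deployment would need a far broader, tested set.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # US Social Security number format
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # 10-digit US-style phone number
}

def redact(prompt):
    """Replace each matched pattern with a placeholder and report which PII types were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        prompt, n = pattern.subn(f"[{label}]", prompt)
        if n:
            found.append(label)
    return prompt, found

# Made-up prompt with fake identifiers.
raw = "Fix the join for jane.doe@example.com, SSN 123-45-6789, phone 555-123-4567."
clean, found = redact(raw)
print(clean)
print(found)
```

Returning the list of detected types alongside the scrubbed text lets a team log how often engineers attempt to share sensitive data, which is useful input for training and policy updates.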
Large language models pose risks to data quality, explainability, privacy, intellectual property, and fairness
Data engineers need consistent policies, tools, and techniques to apply these best practices and prevent bad or divergent outcomes. And companies can help them along by adapting their existing data governance programs.
Companies can adapt their existing data governance program to mitigate the risks of LLM usage by data engineers
So, what does this look like?
Governance teams should add LLM guidelines to their existing policies, train data engineers on those policies, and have data stewards enforce them. They should standardize on one LLM platform and build reusable templates for data team members to adopt. They also should consider implementing a center of excellence that shares and curates best practices for LLM-assisted data engineering along with other aspects of data management and analytics. Designed and enforced well, a governance program like this can balance openness with control to minimize risk, boost productivity, and create new business value with AI.
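One way to picture a reusable template is a shared prompt skeleton that bakes in the team's guidelines, so every engineer's request carries the same constraints. The sketch below uses Python's standard `string.Template`. The field names, the wording of the rules, and the `build_prompt` helper are hypothetical; each governance team would define its own.

```python
from string import Template

# Hypothetical team-wide template; the rules shown are examples, not a recommended policy.
TRANSFORM_PROMPT = Template(
    "You are assisting a data engineer.\n"
    "Task: write a $dialect SQL transformation that $goal.\n"
    "Source table: $source_table\n"
    "Rules:\n"
    "- Use only the columns listed: $columns\n"
    "- Do not invent table or column names.\n"
    "- Add a comment explaining each step.\n"
)

REQUIRED_FIELDS = ("dialect", "goal", "source_table", "columns")

def build_prompt(**fields):
    """Fill the shared template, refusing to proceed if any required field is missing or blank."""
    missing = [name for name in REQUIRED_FIELDS if not str(fields.get(name, "")).strip()]
    if missing:
        raise ValueError(f"missing template fields: {', '.join(missing)}")
    return TRANSFORM_PROMPT.substitute(fields)

# Example request with made-up table and column names.
prompt = build_prompt(
    dialect="PostgreSQL",
    goal="aggregates daily order totals",
    source_table="orders",
    columns="order_id, order_date, amount",
)
print(prompt)
```

Keeping the template in version control gives data stewards a single place to review and update the rules, rather than auditing each engineer's ad hoc prompts.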
Journey, not a destination
Very few companies claim to have mastered the art of data governance, much less AI governance. Proliferating data sources, platforms, applications, and use cases make governance a moving target in the best of times. But rapid LLM adoption makes it all the more imperative for companies to help their data engineers reduce the risks that LLMs pose to data quality, privacy, intellectual property, fairness, and explainability. They can do so by defining and enforcing the right policies as part of their existing data governance programs.
The next blog in our series explores the tools and platforms that make LLM-assisted data engineering a reality. As we’ll see, vendors such as Informatica are devising smaller LLMs that rely on governed, curated inputs to tackle focused use cases such as data discovery.