Should AI Bots Build Your Data Pipelines? Examining the Role of ChatGPT and Large Language Models in Data Engineering
ABSTRACT: Many data engineers already use large language models to assist data ingestion, transformation, DataOps, and orchestration.
Data engineers view large language models (LLMs) such as ChatGPT or Bard with a mix of excitement and fear. LLMs’ fast, articulate answers to expert questions can help data engineers discover datasets, write and debug code, document procedures, and learn new techniques as they build data pipelines. But the fear is real too. LLMs can derail projects and undermine governance programs by giving answers that contain errors, bias, or sensitive data.
Given this ambivalence, data engineers and leaders of data engineering teams need reliable guidance about where and how it makes sense to use LLMs as part of a governed program. This blog commences a series that explores the emergence of LLMs and LLM-based tools from data pipeline vendors, and their implications for the discipline of data engineering. The series will define the technologies involved, assess current tools and techniques, and recommend ways to realize the productivity benefits of LLMs while minimizing risk.
In this first blog we define LLMs, data engineering, and use cases for combining them.
The second blog will explore governance strategies to handle the inherent risks and tradeoffs of LLMs.
The third blog will dive deeper into emerging platforms and tools, including proprietary LLMs, open-source LLMs, and LLM-enabled data pipeline tools.
The fourth and final blog will recommend guiding principles for successful usage of LLMs by data engineers.
Throughout the series we will describe LLMs in all their forms as “assistants” to emphasize the need for vigilant human oversight and management. LLMs are not equal partners; rather, they must remain subordinate to the humans who make the key decisions and govern usage.
What is a Large Language Model?
A large language model (LLM) is a type of neural network whose interconnected nodes share inputs and outputs as they collaborate to learn, summarize, and generate content. A trained LLM produces textual answers to natural language questions, often returning sentences and paragraphs faster than humans can speak. OpenAI’s ChatGPT, whose “GPT” stands for “Generative Pre-trained Transformer,” gets most of the headlines. But other notable LLMs include Bard by Google and open-source models such as LLaMA by Meta and BLOOM from the Hugging Face-led BigScience project.
The speed, range, and sophistication of these platforms create an aura of magic. But if we look inside we see familiar elements: lots of training data, lots of compute, and a big fat number cruncher. An LLM breaks reams of text down into tokens—each representing a word, part of a word, or punctuation—and assigns a number to each token. During the training phase it studies how all those numbered tokens relate to one another in different contexts, and practices generating the next tokens in a string based on what came before. After billions of attempts an LLM gets pretty darn good at creating strings of tokens that become logical sentences and paragraphs. In the professional sphere, LLM outputs might range from advice about corporate strategy to technical analysis to computer code.
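To make the token idea concrete, here is a minimal sketch in Python. It is a toy and not how any production LLM works: it splits text on words and punctuation rather than subword pieces, and it "learns" by counting which token follows which rather than by training a neural network. The corpus and all function names are invented for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word and punctuation tokens (toy stand-in for subword tokenizers)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(tokens):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def train_bigram(ids):
    """Count how often each token ID follows each other token ID."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(ids, ids[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token_id):
    """Return the most frequent follower of token_id, or None if it was never seen."""
    followers = counts.get(token_id)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical three-sentence corpus.
corpus = "the pipeline loads data. the pipeline cleans data. the pipeline loads tables."
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
model = train_bigram(ids)

id_to_token = {i: t for t, i in vocab.items()}
next_token = id_to_token[predict_next(model, vocab["pipeline"])]
print(next_token)  # "loads" follows "pipeline" twice, "cleans" only once
```

A real LLM replaces the counting step with a neural network trained on vastly more text, but the loop is the same in spirit: turn text into numbered tokens, then predict what comes next.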
Many data engineers, ever starved for time, already embrace LLMs as productivity tools. In a recent LinkedIn poll, 43% of 61 practitioners told Eckerson Group they already use LLMs to assist data engineering. In another poll, more than half (54%) of 57 practitioners said they use ChatGPT to help write documentation, while 18% use it for designing and building pipelines and another 18% use it for learning new techniques. (These results are filtered by respondents’ job titles.)
Data pipeline vendors also embrace LLMs. Qlik demonstrated a prototype ChatGPT assistant at its QlikWorld user conference in April, and Informatica unveiled plans to integrate ChatGPT elements with its own AI engine at Informatica World in May. Other pipeline vendors are taking similar steps.
To understand how LLM assistants work in practice, let’s examine the four key segments of data engineering: data ingestion, data transformation, DataOps, and orchestration. In each segment, data engineers traditionally use a mix of tools, techniques, documentation, and tribal knowledge to manipulate data and metadata as it flows through data pipelines. The descriptions below cite example use cases for LLMs, with a focus on ChatGPT. In each segment LLMs can write documentation that describes the tasks involved.
Data Ingestion
Data engineers extract and load data, in both batch and real-time increments, between platforms that reside on premises or in the cloud. They use homegrown scripts and commercial tools to design, execute, and monitor the data pipelines that ingest data across these on-premises, hybrid, or even multi-cloud environments.
When prompted with specific instructions, an LLM assistant will create a detailed guide with starter code for the ingestion process. Senior data engineer Steven Russo blogged on Medium last month about his experience instructing GPT-4 to write PySpark code for streaming CSV sales records from Azure storage into an aggregated table in Databricks. While it took some fact-checking, error fixing, and re-prompting, Steven’s assistant generated many lines of raw code that saved him significant time.
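For a sense of what such starter code looks like, here is a minimal sketch of the same kind of task: read CSV sales records and load them into an aggregated table. It is not Russo’s code and it does not use PySpark, Azure, or Databricks. It is a plain-Python batch stand-in that runs anywhere, and the column names (order_id, region, amount) and sample rows are assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; a real pipeline would read files from cloud storage.
SAMPLE_CSV = """order_id,region,amount
1,east,100.50
2,west,20.00
3,east,9.50
"""


def ingest_sales(csv_file):
    """Read sales rows from a file-like object and total the amount per region."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["region"]] += float(row["amount"])
    return dict(totals)


totals = ingest_sales(io.StringIO(SAMPLE_CSV))
print(totals)
```

Even in a sketch this small, the review points are visible: the code assumes every row has a numeric amount, and it would fail on a malformed record. Those are the gaps a data engineer has to catch when fact-checking an assistant’s output.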
Data Transformation
As with ingestion, LLM assistants can generate starter pipeline code that handles transformation tasks such as merging, formatting, or filtering data. Steven Russo built his sales report by prompting ChatGPT to transform sales records across three zones in Databricks (bronze, silver, and gold) with increasing levels of refinement. Other data engineers tell me they use ChatGPT to write starter code, create Python script templates for use in Power BI, and prototype various pipelines and functions. “Usually the code is correct but from time to time it gets it wrong,” says one engineer. Because LLMs can act on natural language inputs, LLM assistants can help analysts start preparing data themselves rather than waiting on code-savvy data engineers.
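The bronze, silver, and gold zones can be sketched in a few lines of plain Python. This is an illustration of the layering idea only, not the code ChatGPT produced for Russo: raw rows arrive in bronze, get filtered and formatted into silver, and are aggregated into gold. The field names, sample rows, and function names are hypothetical.

```python
def to_silver(bronze_rows):
    """Clean raw rows: drop records with no amount, normalize region, cast amount to float."""
    silver = []
    for row in bronze_rows:
        amount = row.get("amount", "").strip()
        if not amount:
            continue  # filter out incomplete records
        silver.append({
            "region": row["region"].strip().lower(),
            "amount": float(amount),
        })
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into report-ready totals per region."""
    gold = {}
    for row in silver_rows:
        gold[row["region"]] = gold.get(row["region"], 0.0) + row["amount"]
    return gold


# Bronze: hypothetical raw records, exactly as they arrived.
bronze = [
    {"region": " East ", "amount": "10.0"},
    {"region": "west", "amount": ""},
    {"region": "EAST", "amount": "5.5"},
    {"region": "West", "amount": "2.0"},
]

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each zone is a plain function over the previous zone’s output, which keeps every refinement step easy to test on its own.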
DataOps and Orchestration
The emerging discipline of DataOps helps data engineers build pipelines using continuous integration and continuous delivery (CI/CD) of pipeline code, testing, and observability of both pipeline performance and data quality. Data engineers and application developers also orchestrate automated workflows by scheduling, observing, and optimizing sequential tasks across pipelines and the applications that consume pipeline outputs.
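At its core, orchestrating sequential tasks means running each task only after the tasks it depends on have finished. Here is a minimal sketch of that ordering step using Python’s standard-library graphlib module (Python 3.9 or later). It is not how Apache Airflow or any particular orchestrator is implemented, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each maps to the set of tasks that must finish first.
dependencies = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_check": {"transform"},
    "publish": {"quality_check"},
}


def run_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

If two tasks depend on each other, static_order() raises graphlib.CycleError, which is a useful early check on a workflow definition before anything is scheduled.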
LLM assistants can create step-by-step guides to help users apply DataOps principles to their teams, processes, and environments based on user inputs. They also can help inspect and debug pipelines or Apache Airflow workflows. Users can instruct ChatGPT to identify errors and recommend fixes to code that they enter into the prompt window. While users must inspect the responses for additional errors, ChatGPT and other LLMs do accelerate the overall debugging process.
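One way to inspect an assistant’s response is to treat the suggested code as untrusted and run it against a few checks you wrote yourself. The sketch below assumes a hypothetical helper, dedupe_orders, standing in for code an assistant might suggest. The checks and sample data are likewise invented for illustration.

```python
def dedupe_orders(rows):
    """Hypothetical assistant-suggested helper: keep the first row seen for each order_id."""
    seen = set()
    unique = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            unique.append(row)
    return unique


def review_checks(func):
    """Human-written checks to run before accepting the suggested code."""
    sample = [
        {"order_id": 1, "amount": 5},
        {"order_id": 1, "amount": 7},
        {"order_id": 2, "amount": 3},
    ]
    result = func(sample)
    return {
        "removes_duplicates": len(result) == 2,
        "keeps_first_row": result[0]["amount"] == 5,
        "handles_empty_input": func([]) == [],
    }


checks = review_checks(dedupe_orders)
print(checks)
```

Checks like these take minutes to write and fit naturally into a CI/CD pipeline, so the same tests that vet human-written code also vet assistant-written code.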
Many data engineers welcome the help of LLM assistants as they struggle to handle exploding business requests, rising data volumes, and ever-changing data environments. While early evidence shows the most promising results with basic data ingestion and transformation work, LLM assistants also can boost the efficiency of DataOps and orchestration practices. All these use cases offer intriguing productivity benefits. The challenge, of course, is that LLMs can get things wrong and damage your business. Like a smart teenager, they need expert oversight to stay on track. The second blog in this series will examine such risks and explore strategies for responsible governance.