Why and How Data Engineers Will Enable the Next Phase of Generative AI

ABSTRACT: Data engineers and data scientists must manage pipelines for unstructured data to ensure healthy inputs for language models.

Sponsored by Datavolo

To streamline operations and enhance digital interactions, companies are starting to apply generative AI to their own emails, documents, video recordings, and so on. But first their data engineers must parse, chunk, and transform all that unstructured data into useful embeddings for GenAI language models. They also must support the process of retrieval-augmented generation and the discipline of language model operations. 

This blog, the first in a series, explores why and how data engineers contribute to the success of GenAI. The second blog will compare homegrown approaches and commercial tools for building new pipelines for unstructured data to support GenAI. The third and final blog will define product evaluation criteria for this market segment. 

What is Generative AI? 

Let’s start by defining the technology. Generative AI (GenAI) relies on neural networks that learn to interpret and create digital content such as text, images, or audio. Much of the current GenAI excitement centers on the language model (LM). An LM is a huge calculator that generates content, often strings of words, after studying how words relate to one another in an existing corpus of similar content.


A language model is a huge calculator that generates new content based on what it learns from existing content


This gives humans a natural language assistant with powerful capabilities. Knowledge workers use LM platforms such as ChatGPT from OpenAI or Gemini from Google, as well as LM features within software tools such as Salesforce or GitHub, to get work done faster. And now companies are building GenAI applications to automate customer service, accelerate research-intensive processes such as drug development, and address myriad other specialized use cases.

Limitations of GenAI 

But GenAI has serious limitations. LMs create content that they think is likely to answer a given question or “prompt.” They have no certainty about facts because they are probabilistic rather than deterministic. LMs might “hallucinate” and make up facts to fill gaps in the data they were trained on. They also might generate outputs that expose personally identifiable information (PII), mishandle intellectual property, or propagate bias against certain populations.

Data Overcomes the Limitations 

Data engineers can mitigate the risk of these bad outcomes by giving LMs trustworthy inputs that represent the nuanced reality of modern business. Because GenAI largely consumes unstructured data, data engineers must build new pipelines that effectively process and deliver this type of data. This represents a new challenge because data engineers historically focused on structured tables rather than unstructured documents, images, or video files.  

Here is how to get started. 

1. Build new data pipelines 

Today unstructured data sloshes through email systems, CRM applications, videoconferencing software, and other parts of the organization. Companies need to consolidate, parse, and prepare this data for GenAI. Here is an example of an unstructured data pipeline that does this with text files.

  • Extract. First the pipeline parses and extracts relevant text and metadata from applications and files, including complex documents with embedded figures and tables. The metadata and structure include elements such as the document title, body, and footnotes.  

  • Transform. Next, the pipeline transforms the extracted documents. It divides the text into semantic “chunks” and creates vectors that describe the meaning and interrelationships of chunks. It also might enrich these document chunks with data from other systems and data platforms. (Some pipeline tools perform these transformation steps in an intermediate landing zone using an ELTL sequence.) 

  • Load. Finally, the pipeline loads these vectors into a vector database such as Pinecone or Weaviate, or a vector-capable platform such as Databricks or MongoDB. These vectors are now ready to support GenAI.
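As a rough sketch, the extract, transform, and load steps above might look like the following Python. The chunking function, embedding function, and in-memory store are deliberately simplified stand-ins: a real pipeline would use semantic chunking, an embedding model, and an actual vector database rather than the toy versions shown here.

```python
import hashlib
import math

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character chunks (a stand-in for semantic chunking)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(chunk: str, dim: int = 8) -> list[float]:
    """Toy deterministic embedding; a real pipeline would call an embedding model."""
    digest = hashlib.sha256(chunk.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class InMemoryVectorStore:
    """Minimal stand-in for a vector database such as Pinecone or Weaviate."""
    def __init__(self):
        self.records = {}
    def upsert(self, doc_id, vector, metadata):
        self.records[doc_id] = {"vector": vector, "metadata": metadata}

# Extract -> Transform -> Load for one hypothetical document
document = {"title": "Q3 earnings call", "body": "Revenue grew 12 percent ..."}
store = InMemoryVectorStore()
for i, chunk in enumerate(chunk_text(document["body"])):
    store.upsert(f"doc1-{i}", embed(chunk), {"title": document["title"], "chunk": chunk})
```

Note how each chunk keeps its metadata (here, just the document title) alongside its vector; that metadata is what lets downstream retrieval trace an answer back to its source.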

Example of an Unstructured Data Pipeline 

Data engineers must design, deploy, and monitor these pipelines, and orchestrate how they interact with vector databases and GenAI applications. They also might need to orchestrate how GenAI applications integrate with predictive ML models or other analytical functions as part of larger workflows. In addition, data engineers need to observe both data quality—for example, to ensure no vectors are lost or duplicated—and pipeline performance. Tools such as Datavolo help manage data pipelines in these various ways. 
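One such data quality check, verifying that no vectors are lost or duplicated, can be as simple as reconciling the chunk IDs the pipeline produced against the IDs actually loaded into the vector store. Here is a minimal sketch, assuming the pipeline assigns each chunk a stable ID:

```python
def check_vector_completeness(source_ids, loaded_ids):
    """Flag chunks that never reached the vector store and IDs loaded more than once."""
    missing = set(source_ids) - set(loaded_ids)
    seen, duplicated = set(), set()
    for doc_id in loaded_ids:
        if doc_id in seen:
            duplicated.add(doc_id)
        seen.add(doc_id)
    return {"missing": missing, "duplicated": duplicated}

report = check_vector_completeness(
    source_ids=["doc1-0", "doc1-1", "doc1-2"],
    loaded_ids=["doc1-0", "doc1-1", "doc1-1"],  # doc1-2 was lost, doc1-1 written twice
)
```

A monitoring tool would run a check like this on a schedule and alert when either set is non-empty.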

2. Support New Processes 

Data engineers must support the process of retrieval-augmented generation (RAG) and the discipline of LLMOps. 

Retrieval-augmented generation. RAG has become a preferred process for enabling GenAI applications to generate accurate, relevant responses. In this process, the GenAI application searches the vector database for content that relates directly to the user prompt. It retrieves various documents and files, ranks them by relevance, and injects the most relevant ones into the prompt.  

To make this work, RAG needs the vectors within the vector database to accurately represent the source content and its business context. Data engineers and data scientists or NLP engineers need to use the right transformation techniques in their pipeline. They need to divide their source content into the right-sized chunks, append the right metadata, and track lineage end to end. Data pipelines that meet such requirements, using the steps described earlier, improve the odds of success with GenAI. 
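The retrieval-and-ranking step of RAG can be illustrated with a toy example. The hand-written vectors below are stand-ins: a production system would embed the user prompt with the same model used in the pipeline and query a vector database rather than a Python list.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def retrieve(query_vec, store, top_k=2):
    """Rank stored chunks by similarity to the query vector; return the best ones."""
    ranked = sorted(store, key=lambda rec: cosine(query_vec, rec["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(question, passages):
    """Inject the most relevant passages into the prompt sent to the LM."""
    context = "\n".join(f"- {p['text']}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

store = [
    {"text": "Refunds are processed within 5 business days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Refund requests require an order number.", "vector": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "How do refunds work?"
top = retrieve(query_vec, store, top_k=2)
prompt = build_prompt("How do refunds work?", top)
```

The quality of this retrieval step depends entirely on the vectors the pipeline produced, which is why the chunking and metadata decisions described above matter so much.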

LLMOps. Data science teams use the emerging discipline of LLMOps to manage the lifecycle of the language model, including fine-tuning, deployment, monitoring, and optimization. Data engineers must design and continuously adjust data pipelines to support each of these stages. The transformation techniques described earlier help create accurate, relevant vectors to assist LM fine-tuning and production operations. Data engineers also must configure their pipelines and the underlying infrastructure to meet production requirements for latency, throughput, and reliability. By reusing modular pipeline elements, they can adapt to rapid changes in source data, LMs, or GenAI applications.  

Summary  

Data engineers have a lot of work to do, and data science teams are counting on them. They must learn to build and manage new pipelines for parsing, chunking, and embedding unstructured data, and must ensure those pipelines feed healthy inputs to language models. By doing so they can help companies enter a more challenging but more rewarding phase of GenAI adoption. The next blog in this series will explore how homegrown and commercial tools can help them achieve their goals. 

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
