Achieving Fusion: How GenAI and Data Engineering Help One Another
ABSTRACT: GenAI can help data engineers become more productive, and data engineering can help GenAI drive new levels of innovation.
Read time: 5 mins.
Sponsored by Informatica.
Data engineers have a problem. They struggle to prepare and deliver analytics data to support proliferating users, projects, devices, and applications. Fortunately, new generative AI tools can help them build and manage data pipelines more efficiently.
Data scientists also have a problem. They must build new pipelines that prepare semi-structured and unstructured data objects—text, images, audio, and so on—for consumption by generative AI applications. Fortunately, new data engineering tools and techniques can help them build these pipelines.
So, data engineering needs the help of GenAI, and GenAI needs the help of data engineering. Companies that solve these dual problems can realize powerful benefits. They can improve productivity and drive business innovation. Cool stuff! But achieving this synergy requires companies to foster best practices and enforce governance policies. This blog explores how data teams can foster these elements of success and realize these benefits.
The Fusion of Generative AI and Data Engineering
Definitions
Generative AI (GenAI) refers to neural networks trained to interpret and create digital content such as text, images, or audio. Much of the current GenAI excitement centers on the language model (LM). An LM is a huge calculator that predicts content, often strings of words, after studying how words relate to one another in an existing corpus of similar content. Knowledge workers use LM platforms such as ChatGPT from OpenAI or Gemini from Google, as well as LM features within software tools such as Salesforce or GitHub that act as natural-language assistants.
How GenAI Helps Data Engineering
The problem. Data engineers design, test, deploy, monitor, and optimize the pipelines that deliver data for analytics. An explosion in SaaS applications, mobile apps, IoT sensors, data platforms, analytical tools, and business-oriented users makes their job busier than ever. They struggle to manage the data ingestion and transformation jobs that integrate these myriad components.
The solution. LM platforms and new LM features within pipeline tools help solve the problem. Based on natural-language prompts from data engineers, they generate starter code for data pipelines, suggest ways to debug that code, then document those pipelines and the related datasets for cataloging. LMs also recommend rules for checking data quality and evaluate different architectural approaches to designing pipelines. In these ways, LMs save time as they automate, accelerate, and simplify the complex and tedious tasks of data engineering. For example, Informatica helps do this with CLAIRE GPT.
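To make the prompt-to-starter-code pattern concrete, here is a minimal Python sketch. The helper names are hypothetical, and `complete` stands in for any LM client call (for example, a chat-completions API); the stand-in below just echoes the prompt so the flow is visible.

```python
# Sketch: asking an LM for starter pipeline code.
# `complete` is any callable that sends a prompt to an LM and returns text.

def build_pipeline_prompt(source: str, target: str, transform: str) -> str:
    """Assemble a natural-language prompt that asks the LM for starter ETL code."""
    return (
        f"Write a Python data pipeline that extracts records from {source}, "
        f"applies this transformation: {transform}, "
        f"and loads the result into {target}. "
        "Include error handling and log each stage."
    )

def generate_starter_code(complete, source: str, target: str, transform: str) -> str:
    """Send the assembled prompt to the LM and return its suggested code."""
    return complete(build_pipeline_prompt(source, target, transform))

# Usage with a stand-in LM (a real client would call an API here):
fake_lm = lambda prompt: f"# starter code for: {prompt[:40]}..."
code = generate_starter_code(fake_lm, "a Salesforce export", "Snowflake",
                             "deduplicate records by email")
```

The value of the pattern is that the prompt template captures the engineer's intent once, so the same structure can be reused across sources and targets.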
How Data Engineering Assists GenAI
The problem. Innovative companies are starting to build their own GenAI applications. These applications include an LM (or an API that connects to one), a conversational UI for users to engage the LM, and additional functionality that executes tasks based on LM outputs. GenAI applications support use cases such as customer service, document processing, and marketing. But to deliver usable outputs, these GenAI applications need usable inputs. They need their pipelines to transform unstructured data objects into numerical vectors that enrich user prompts and assist LM fine-tuning.
The solution. Data engineers help solve this problem by building those pipelines, which comprise the familiar stages of extraction, transformation, and loading (ETL) or ELT. The sequence of extract and load, transform, and load again can prepare companies’ domain-specific data for usage by their GenAI applications.
Extract and load. These pipelines extract relevant text from applications and files, then load it into a landing zone on platforms such as the Databricks Lakehouse or Snowflake Data Cloud. To improve GenAI accuracy, this text should already align with master data and meet quality standards.
Transform. Next, they transform the data to prepare it for LM consumption. They convert words to numerical tokens, group the tokens into “chunks,” then create vectors (embeddings) that describe the meaning and interrelationships of chunks.
Load. Now they load the embeddings into a vector database such as Pinecone or Weaviate, or a vector-capable platform such as Databricks or MongoDB.
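The three stages above can be sketched in miniature. This is an illustrative Python sketch, not a production pipeline: the `embed` function is a deterministic toy stand-in for a real embedding model, and the load step simply builds records shaped like typical vector-database upserts.

```python
import hashlib
import math

def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into fixed-size word chunks (real pipelines often overlap chunks)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(chunk: str, dims: int = 8) -> list[float]:
    """Toy deterministic embedding; a real pipeline calls an embedding model here."""
    digest = hashlib.sha256(chunk.encode()).digest()
    vec = [b / 255.0 for b in digest[:dims]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-normalize, as many vector stores expect

def load_vectors(chunks: list[str]) -> list[dict]:
    """Build records shaped like typical vector-database upserts (id, vector, text)."""
    return [{"id": i, "vector": embed(c), "text": c} for i, c in enumerate(chunks)]

# Extract is simulated with an in-memory document:
document = "Quarterly revenue grew due to strong cloud demand. " * 30
records = load_vectors(chunk_text(document))
```

In practice, each function would be replaced by a real component (a document loader, an embedding model, a vector-database client), but the shape of the pipeline stays the same.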
Data teams then use the vectorized data to support GenAI applications in two primary ways. First, they implement retrieval-augmented generation (RAG), which finds relevant content within the vector database and adds it to user prompts so that the LM is more likely to give good answers. Second, they fine-tune the LM by adjusting its parameters to align with the vectorized text. Both RAG and fine-tuning increase the likelihood that GenAI applications will give accurate responses that reflect the right business context, with fewer hallucinations. Data teams also might help developers integrate these GenAI applications with other applications and agents.
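The retrieval step of RAG can be sketched as a similarity search followed by prompt augmentation. This is a minimal illustration, assuming the query and the stored chunks are embedded in the same vector space; the two-dimensional vectors below are toy values standing in for real embeddings.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], records: list[dict], top_k: int = 2) -> list[dict]:
    """Rank stored records by similarity to the query vector; keep the best top_k."""
    return sorted(records, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)[:top_k]

def augment_prompt(question: str, context_chunks: list[str]) -> str:
    """Prepend retrieved context so the LM answers from company data."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy store: two chunks with made-up 2-D embeddings.
records = [{"vector": [1.0, 0.0], "text": "Refunds take 5 days."},
           {"vector": [0.0, 1.0], "text": "Shipping is free over $50."}]
top = retrieve([0.9, 0.1], records, top_k=1)
prompt = augment_prompt("How long do refunds take?", [r["text"] for r in top])
```

A real vector database performs this ranking internally at scale; the sketch only shows why retrieval makes the LM "more likely to give good answers"—the relevant facts travel inside the prompt.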
Elements of Success
To make all this work, data science and data engineering teams must adopt best practices and strengthen their governance frameworks.
Best practices. The most knowledgeable data engineers should define best practices for using GenAI chatbots and features within pipeline tools. As they experiment, they should document their pitfalls and lessons learned. Which data sources, programming languages, and transformation scripts are most appropriate for GenAI assistance? What types of prompt techniques get the most accurate responses? And what are the most effective ways to validate LM responses? The more your leading data engineers codify answers to questions like these, assembling templates and tips along the way, the better they can train those who follow.
By the same token, data engineers and data scientists should define and evangelize best practices for managing the data pipelines that support GenAI applications. Which tools and techniques work best for various datasets and GenAI use cases? How should data engineers and data scientists filter, label, and manage metadata for various file types? As they answer questions like these, data teams will build a knowledge base that helps both sides of the equation. They’ll build better pipelines for GenAI, and use GenAI to build better pipelines faster.
Governance frameworks. GenAI poses a number of governance risks. For starters, it can “hallucinate” or give wrong answers, expose private data, or mishandle intellectual property. Data engineers, data scientists, data stewards, and compliance officers must define and enforce policies to mitigate these risks. For example, they might restrict LM usage to certain datasets, users, and applications. They should require data teams to document hallucinations and the prompts that caused them. They also should require GenAI applications to cite their data sources and lineage when responding to user prompts. Perhaps most importantly, they should cleanse and validate all GenAI inputs. Governance controls like these reduce risk on both sides of the equation: GenAI assisting data engineering and vice versa. In short, data and AI governance are critical.
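As one concrete example of the cleanse-and-validate control, here is a hedged sketch of redacting sensitive patterns from text before it reaches an LM, while recording which rules fired for an audit log. The patterns are illustrative, not exhaustive; production systems use dedicated PII-detection tooling.

```python
import re

# Illustrative redaction rules; a real policy engine would cover many more patterns.
REDACTIONS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def cleanse_input(text: str) -> tuple[str, list[str]]:
    """Redact sensitive patterns and report which rules fired (for audit logs)."""
    fired = []
    for name, pattern in REDACTIONS.items():
        if pattern.search(text):
            fired.append(name)
            text = pattern.sub(f"[{name.upper()} REDACTED]", text)
    return text, fired

clean, fired = cleanse_input("Contact jane@example.com, SSN 123-45-6789.")
```

Returning the list of fired rules alongside the cleansed text supports the documentation requirement above: compliance teams can see not just that inputs were cleaned, but which risks were detected.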
Conclusion
Fusion releases positive energy, provided it takes place in a controlled setting. The same holds for GenAI and data engineering. The best practices and governance measures described here can help companies turn this powerful technology to their advantage. GenAI can help data engineers become more productive, and data engineering can help GenAI drive new levels of innovation. To learn more about a vendor that contributes to this dual approach, check out this resource by Informatica.