DataOps for Generative AI Data Pipelines, Part III: Team Collaboration
ABSTRACT: Explore the reasons for data engineers to collaborate with data scientists, machine learning (ML) engineers, and developers on DataOps initiatives that support GenAI.
Sponsored by Matillion
Data teams are like large families that juggle endless requests from needy children—and generative AI is the neediest child of all. But these families can thrive if they adopt DataOps and work as a team.
This blog, the third and final in our series, describes how data engineers should collaborate with data scientists, machine learning (ML) engineers, and developers on DataOps programs that support GenAI. The first blog applied the four pillars of DataOps—testing, continuous integration and continuous delivery (CI/CD), orchestration, and observability—to data pipelines that prepare text and other unstructured data for GenAI. The second blog explained how DataOps makes these data pipelines modular, scalable, robust, flexible, and governed.
DataOps for GenAI Data Pipelines
Data teams manage data pipelines for all types of analytics use cases. It's a busy job thanks to multiplying data sources, projects, users, applications, and devices. In fact, 85% of 298 respondents to an informal poll by Eckerson Group in January said demand for data engineers would only increase in 2024. Generative AI pushes this demand higher, forcing data engineers to learn new data types and new preparation techniques.
And they need to get it right. To minimize governance risks such as hallucinations, GenAI requires timely, trustworthy inputs. Data engineers can use DataOps programs to deliver these trustworthy inputs to the vector databases that support language models within GenAI applications. These inputs support the critical processes of retrieval-augmented generation and model fine-tuning.
Retrieval-augmented generation, also known as RAG, retrieves relevant content from a vector database, then augments user prompts with that content. This enables the language model to generate responses grounded in trustworthy facts and contextual knowledge.
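The retrieve-then-augment flow can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the word-count "embedding" stands in for a real embedding model, the in-memory list stands in for a vector database, and the names embed, retrieve, augment, and DOCS are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def augment(query, documents, k=2):
    """Build a prompt that places the retrieved content ahead of the user question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base content
DOCS = [
    "Refunds are issued within 14 days of a return request.",
    "Our headquarters moved to Denver in 2021.",
    "Premium support is available on weekdays from 9 to 5.",
]
```

Calling augment("How many days do refunds take?", DOCS, k=1) yields a prompt that carries the refund policy sentence and leaves out the unrelated documents, which is the part of the flow that data pipelines most directly influence.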
Model fine-tuning, as the name suggests, refines pre-trained language models such as the GPT models from OpenAI or the open-source Llama 3 from Meta to better handle domain-specific tasks. Data teams iteratively adjust parameters such as the “weights” that determine how the model's inputs influence its outputs, with the goal of making the outputs more accurate.
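To make the idea of iteratively adjusting weights concrete, here is a deliberately tiny sketch: gradient descent on a model with a single weight. Real fine-tuning updates millions or billions of weights with a training framework, so treat this only as an analogy. The names fine_tune, PRETRAINED_WEIGHT, and DOMAIN_EXAMPLES are hypothetical.

```python
def fine_tune(weight, examples, learning_rate=0.01, epochs=200):
    """Gradient descent on a one-weight model y = weight * x with squared error."""
    for _ in range(epochs):
        for x, target in examples:
            prediction = weight * x
            # Derivative of (weight * x - target)^2 with respect to weight
            gradient = 2 * (prediction - target) * x
            weight -= learning_rate * gradient
    return weight

def mean_squared_error(weight, examples):
    """Average squared gap between the model's outputs and the targets."""
    return sum((weight * x - t) ** 2 for x, t in examples) / len(examples)

# Hypothetical starting point and domain data (the targets follow y = 3x)
PRETRAINED_WEIGHT = 1.0
DOMAIN_EXAMPLES = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
```

Running fine_tune(PRETRAINED_WEIGHT, DOMAIN_EXAMPLES) moves the weight from 1.0 toward 3.0, and mean_squared_error drops accordingly. The quality of the examples determines where the weight ends up, which is why the fine-tuning dataset matters so much.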
RAG and fine-tuning need a team effort to succeed. Data engineers collaborate with data scientists, ML engineers, and developers to ensure their DataOps programs provide the right support to GenAI projects. Here’s what that looks like.
DataOps Collaboration
Retrieval-augmented generation
RAG depends on the preparation and retrieval of fresh, accurate, and relevant inputs to augment user prompts. Data teams collaborate to make this happen, using the DataOps pillars of testing, CI/CD, orchestration, and observability.
Data engineer + data scientist
Data engineers collaborate with data scientists to understand how the language model within the GenAI application operates. What use cases does it support? What content was it trained and possibly fine-tuned on? Perhaps most importantly, what are the gaps in its training data? Based on the answers to these questions, data engineers use CI/CD to optimize pipelines that feed a trustworthy knowledge base. The GenAI application retrieves data from this knowledge base, enabling its language model to respond to domain-specific prompts with fewer hallucinations.
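One way to picture the CI/CD step is a validation gate that runs as a test before documents are loaded into the knowledge base. The sketch below is an assumption about what such a gate might check. The field names (text, source, updated_at), the 30-day threshold, and the function names are hypothetical and would come from the team's own agreements.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # hypothetical freshness threshold

def validate_document(doc, now):
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    if not (doc.get("text") or "").strip():
        problems.append("empty text")
    if not doc.get("source"):
        problems.append("missing source")
    updated_at = doc.get("updated_at")
    if updated_at is None or now - updated_at > MAX_AGE:
        problems.append("stale or undated")
    return problems

def split_batch(docs, now):
    """Separate documents that can be loaded from those that need review."""
    accepted, rejected = [], []
    for doc in docs:
        problems = validate_document(doc, now)
        if problems:
            rejected.append((doc, problems))
        else:
            accepted.append(doc)
    return accepted, rejected

# Hypothetical sample batch
NOW = datetime(2024, 6, 1, tzinfo=timezone.utc)
GOOD = {"text": "Refund policy v3", "source": "wiki/refunds", "updated_at": NOW - timedelta(days=2)}
STALE = {"text": "Old policy", "source": "wiki/old", "updated_at": NOW - timedelta(days=90)}
EMPTY = {"text": "  ", "source": None, "updated_at": NOW}
```

A CI job could fail the build whenever split_batch returns any rejected documents, so that stale or unattributed content never reaches the language model.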
Data engineer + machine learning engineer
Data engineers collaborate with ML engineers to optimize RAG processes. They profile source datasets so that ML engineers understand their relevance and lineage as they configure the right search and retrieval logic. ML engineers test this logic by running the RAG process, then observe what it retrieves and how that impacts the ultimate response of the language model. Data engineers advise ML engineers throughout this process to ensure the language model delivers accurate facts and places them in the right domain-specific context.
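Testing the search and retrieval logic usually means running labeled queries and checking whether the expected content comes back. The sketch below computes a simple hit rate at k. The corpus, the labeled queries, and the keyword-overlap retriever are hypothetical stand-ins for a real vector search.

```python
def hit_rate_at_k(retriever, labeled_queries, k=3):
    """Fraction of queries whose expected document ID appears in the top-k results."""
    hits = 0
    for query, expected_id in labeled_queries:
        if expected_id in retriever(query, k):
            hits += 1
    return hits / len(labeled_queries)

# Hypothetical corpus keyed by document ID
CORPUS = {
    "doc-refunds": "refunds are issued within 14 days",
    "doc-support": "premium support hours are weekdays 9 to 5",
    "doc-office": "headquarters moved to denver",
}

def keyword_retriever(query, k):
    """Stand-in retriever: rank documents by the number of words shared with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: len(words & set(CORPUS[doc_id].split())),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical labeled queries: (query, ID of the document that should be retrieved)
LABELED = [
    ("when are refunds issued", "doc-refunds"),
    ("premium support hours", "doc-support"),
    ("where is the warehouse", "doc-office"),
]
```

With k=1 the third query misses, because no document mentions a warehouse. A miss like that is a prompt for the data engineer and ML engineer to decide whether the source content or the retrieval logic needs to change.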
Data engineer + developer
Data engineers also collaborate with developers as they build the GenAI application that contains the language model. They help developers understand what data to expect from the RAG process, when, and in what format. With this knowledge in hand, developers can orchestrate the language model with the chatbot or other application components as part of a unified user experience. They can test the overall GenAI application with human users, then huddle with data engineers to revise source content or GenAI inputs based on user feedback.
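Agreement on "what data, when, and in what format" can be written down as a small data contract that both sides code against. The sketch below is one possible shape, not a standard. The class name RetrievedChunk, its fields, and the 0-to-1 score range are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class RetrievedChunk:
    """Hypothetical contract for one piece of content returned by the RAG process."""
    text: str
    source: str
    retrieved_at: datetime
    score: float

    def __post_init__(self):
        # Reject records that break the agreed contract
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0 and 1")

def to_payload(chunk):
    """Serialize a chunk into a JSON-friendly dictionary for the application."""
    return {
        "text": chunk.text,
        "source": chunk.source,
        "retrieved_at": chunk.retrieved_at.isoformat(),
        "score": round(chunk.score, 3),
    }

# Hypothetical example record
EXAMPLE = RetrievedChunk(
    "Refunds take 14 days.",
    "wiki/refunds",
    datetime(2024, 6, 1, tzinfo=timezone.utc),
    0.8731,
)
```

Because the contract raises an error on empty text or an out-of-range score, a developer learns about a broken input at the boundary rather than from a confusing chatbot response.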
Model fine-tuning
As with RAG, fine-tuning a language model requires fresh, accurate, and relevant data—and a collaborative approach to DataOps helps make that happen.
Data scientists find and retrieve the right inputs to assemble a fine-tuning dataset that represents “ground truth.” They apply the language model to this dataset, observe its responses to various prompts, then adjust parameters within the language model to align with that ground truth. Data engineers advise them throughout the process and refine the dataset to ensure it supports the target use cases.
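Comparing model responses with ground truth can start with something as simple as an exact-match rate. The sketch below assumes a ground-truth dataset of prompt and expected-answer pairs. The stub_model function is a hypothetical stand-in for a call to the language model, and real evaluations often use more forgiving metrics than exact match.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())

def exact_match_rate(model, ground_truth):
    """Fraction of prompts for which the model's answer matches the expected answer."""
    matches = sum(
        1
        for prompt, expected in ground_truth
        if normalize(model(prompt)) == normalize(expected)
    )
    return matches / len(ground_truth)

# Hypothetical ground-truth pairs: (prompt, expected answer)
GROUND_TRUTH = [
    ("What is the refund window?", "14 days"),
    ("Where is headquarters?", "Denver"),
]

def stub_model(prompt):
    """Hypothetical stand-in for a language model call."""
    answers = {
        "What is the refund window?": "14  Days",
        "Where is headquarters?": "Boulder",
    }
    return answers.get(prompt, "")
```

Here exact_match_rate(stub_model, GROUND_TRUTH) returns 0.5: the refund answer matches after normalization and the headquarters answer does not. Tracking this number across fine-tuning runs gives data scientists and data engineers a shared signal for whether dataset changes help.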
ML engineers manage the lifecycle of the fine-tuning process. They collaborate with data engineers to configure the environment, conduct fine-tuning tests, and observe operational metrics such as uptime. Data engineers refine data pipelines for them to ensure the fine-tuning dataset remains fresh and accurate.
Developers manage the lifecycle of the GenAI application. They orchestrate the fine-tuned language model with other application tasks, working with data engineers to ensure these tasks fit the ground truth of the fine-tuning dataset. Together they ensure the application user has an experience that aligns with ground truth.
Summary
Collaborative data families, using DataOps as a framework, can help GenAI initiatives deliver on their promise of enriching user experiences, streamlining operations, and spurring creativity. The key to achieving this is ensuring the language models within GenAI applications consume the right inputs from the vector databases underneath. Now that’s a happy, healthy, and well-fed family.