DataOps for Generative AI Data Pipelines, Part II: Must-Have Characteristics

Abstract: Companies that adopt DataOps increase the odds of success by making GenAI data pipelines what they should be: modular, scalable, robust, flexible, and governed.

Read time: 4 mins.

Sponsored by Matillion

We’re entering an ambitious new phase of adoption for generative AI (GenAI). Companies are doing more than just having their knowledge workers use OpenAI’s ChatGPT or other language models as productivity tools. To gain competitive advantage, they also are applying language models to their own domain-specific datasets—which requires the time-tested discipline of DataOps.

The data pipelines that support GenAI must be modular, scalable, robust, flexible, and governed. While those must-have characteristics are not new, GenAI introduces new requirements for each. This blog, the second in a series, explores how data/AI teams can use DataOps to ensure their GenAI data pipelines exhibit these must-have characteristics. It builds on the first blog, which defined DataOps and its applicability to GenAI. The third and final blog in our series will explore the prompting and fine-tuning of LMs, assisted by DataOps.

Recap 

The most popular form of GenAI centers on language models that interpret and generate content such as text or imagery in response to natural language prompts. GenAI often consumes text that has been transformed and loaded into a vector database. A common scenario is retrieval-augmented generation (RAG), in which an application retrieves relevant vectors and injects that content into a user’s prompt to make the response more accurate. Language models can also draw on data within the vector database as part of the fine-tuning process, so they better understand domain-specific data.
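To make the RAG flow concrete, here is a minimal Python sketch. It uses a toy bag-of-words "embedding" and an in-memory list as the vector store; the names `embed`, `retrieve`, and `build_prompt`, the stopword list, and the sample documents are all hypothetical stand-ins for a real embedding model and vector database.

```python
import math
import re

STOPWORDS = {"the", "is", "a", "of", "our", "what"}

# Toy "embedding": bag-of-words term counts. A real pipeline would call an
# embedding model and store dense vectors in a vector database.
def embed(text: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in STOPWORDS:
            counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Documents the data pipeline has transformed and loaded into the "store".
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The quarterly sales report is published every January.",
]
store = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    # Inject the retrieved content into the user's prompt.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The essential pattern is the last function: retrieved content is prepended to the user’s question, grounding the model’s answer in pipeline-delivered data.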

Both RAG and fine-tuning need timely, trustworthy data. And that’s where DataOps enters the scene. The discipline of DataOps adapts methodologies from DevOps, agile software development, and total quality management to improve the quality and timeliness of data delivery. It has four pillars: testing of data pipelines, continuous integration and continuous delivery (CI/CD), pipeline orchestration, and data observability. Many companies use DataOps to support rising business demands for analytics.  

DataOps for GenAI 

And now they need DataOps to support business demand for GenAI. Here is how DataOps ensures the data pipelines that feed GenAI applications are modular, scalable, robust, flexible, and governed.

DataOps and the Must-Have Characteristics of GenAI Data Pipelines

Modular

CI/CD is a methodology for frequently iterating software by branching, updating, and merging versions of code. CI/CD techniques make pipelines more modular by enabling data/AI teams to branch and merge different pipelines as they adjust individual elements. For example, they might change chunking techniques or add/remove data sources to optimize RAG or fine-tuning. Observability tools, meanwhile, assist modularity by isolating the root cause of issues with data quality or pipeline performance. Armed with this information, data/AI teams can fix or replace, then test and deploy the problematic module—perhaps a transformation script, server cluster, or runaway application. They also can reuse vetted modules.  
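One way to picture this modularity: if each pipeline stage is a swappable function, a team can revise a single module on a branch, test it, and merge it without touching the rest of the pipeline. The sketch below, with hypothetical names like `fixed_size_chunker` and `run_pipeline`, shows chunking as such a module.

```python
from typing import Callable

# A chunker is a function from document text to a list of chunks, so each
# chunking strategy is an isolated, testable module that CI/CD can swap out.
Chunker = Callable[[str], list[str]]

def fixed_size_chunker(size: int = 40) -> Chunker:
    def chunk(text: str) -> list[str]:
        return [text[i:i + size] for i in range(0, len(text), size)]
    return chunk

def sentence_chunker() -> Chunker:
    def chunk(text: str) -> list[str]:
        return [s.strip() + "." for s in text.split(".") if s.strip()]
    return chunk

def run_pipeline(docs: list[str], chunker: Chunker) -> list[str]:
    """The surrounding pipeline never changes; only the module does."""
    return [c for doc in docs for c in chunker(doc)]

docs = ["First sentence. Second sentence."]
chunks = run_pipeline(docs, sentence_chunker())
# Switching strategies is a one-line change, testable on a branch:
alt_chunks = run_pipeline(docs, fixed_size_chunker(10))
```

Because `run_pipeline` depends only on the `Chunker` interface, a vetted chunking module can be reused across pipelines, as the paragraph above suggests.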

Scalable

Orchestration makes pipelines more scalable by automatically synchronizing events and tasks across elements—clusters, GenAI applications, LMs, and so on—as data/AI teams add them. Orchestrating these workflows reduces the effort of expansion. Data/AI teams can automate the addition of text files that help fine-tune LMs, documents that support RAG, or users that rely on a popular GenAI application. In addition, observability helps optimize workloads as they scale by measuring the utilization and performance of pipeline elements.
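At its core, orchestration means declaring task dependencies and letting a scheduler run them in order, so adding an element is just adding a node. A minimal sketch (task names and the `run_dag` scheduler are illustrative; production teams use a dedicated orchestrator, and this toy version omits cycle detection and error handling):

```python
# Minimal task orchestrator: tasks declare their dependencies, and the
# scheduler runs each task after its dependencies complete.
def run_dag(tasks: dict[str, list[str]], actions: dict) -> list[str]:
    done: list[str] = []

    def run(name: str) -> None:
        if name in done:
            return
        for dep in tasks[name]:  # run upstream tasks first
            run(dep)
        actions[name]()
        done.append(name)

    for name in tasks:
        run(name)
    return done

log: list[str] = []
tasks = {
    "extract_docs": [],
    "chunk": ["extract_docs"],
    "embed": ["chunk"],
    "load_vectors": ["embed"],
}
actions = {name: (lambda n=name: log.append(n)) for name in tasks}
order = run_dag(tasks, actions)
```

Scaling out a new data source then amounts to adding one entry to `tasks` with its dependencies; the scheduler synchronizes everything else.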

Robust

Testing makes pipelines more robust by enabling data/AI teams to check LM latency, accuracy, and other KPIs before going live. Once in production, they can observe application and pipeline performance to ensure uptime, reliability, and availability. This vigilance requires a mix of technical and human controls: observability tools can spot and fix IT errors, while humans can spot-check responses to assess relevance and contextual meaning.
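Such pre-production KPI checks can be as simple as asserting against thresholds. In this sketch, `fake_model`, the test cases, and the SLA values are hypothetical placeholders; a real team would call its actual LM endpoint and use its own thresholds.

```python
import time

# Hypothetical SLA thresholds; real values come from the team's own KPIs.
MAX_LATENCY_SECONDS = 2.0
MIN_ACCURACY = 0.8

def fake_model(prompt: str) -> str:
    """Stand-in for a real LM call."""
    return "30 days" if "refund" in prompt else "unknown"

def check_latency(prompt: str) -> float:
    start = time.perf_counter()
    fake_model(prompt)
    return time.perf_counter() - start

def check_accuracy(cases: list[tuple[str, str]]) -> float:
    hits = sum(1 for prompt, expected in cases if expected in fake_model(prompt))
    return hits / len(cases)

cases = [
    ("What is the refund window?", "30 days"),
    ("Who is the CEO?", "unknown"),
]
# Gate the release: fail the CI/CD run if any KPI misses its SLA.
assert check_latency(cases[0][0]) < MAX_LATENCY_SECONDS
assert check_accuracy(cases) >= MIN_ACCURACY
```

Checks like these run automatically in CI/CD before deployment; the human spot-checks the paragraph mentions cover what assertions cannot, such as relevance and tone.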

Flexible

CI/CD improves flexibility by reducing the time and effort of changing pipeline elements. Data/AI teams might branch a pipeline out of production, revise a script for filtering inputs, then test the pipeline and merge it back into production. They also might use the branch-merge process to change vectorization processes or labeling techniques, thereby ensuring the right datasets are available for RAG and fine-tuning. Orchestration, meanwhile, improves flexibility by automating how new and old elements work together.
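One common way to keep such input-filtering scripts easy to revise is to drive them from configuration, so a branch changes only one filter or one config entry. A small sketch, with hypothetical filter names and config keys:

```python
# Input filters are selected by configuration, so revising one filter on a
# branch (or toggling it off) leaves the rest of the pipeline untouched.
FILTERS = {
    "drop_empty": lambda rows: [r for r in rows if r.strip()],
    "drop_short": lambda rows: [r for r in rows if len(r.split()) >= 3],
}

def filter_inputs(rows: list[str], config: dict) -> list[str]:
    out = rows
    for name in config.get("filters", []):
        out = FILTERS[name](out)  # apply each configured filter in order
    return out

rows = ["", "too short", "this row has enough words"]
clean = filter_inputs(rows, {"filters": ["drop_empty", "drop_short"]})
```

The same pattern applies to the vectorization and labeling choices mentioned above: encode each choice as a named, configurable step, and branch-merge cycles stay small and low-risk.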

Governed

The primary governance risks of GenAI initiatives relate to accuracy, privacy, bias, explainability, and handling of intellectual property. These risks require new governance policies and new levels of oversight. DataOps helps enforce such policies. For example, data/AI teams can test the ability of GenAI applications and pipelines to meet SLAs for each risk—i.e., minimal hallucinations, no PII exposures, and so on—both before and after they go into production. They also can observe application and workload KPIs, standing ready to intervene when they cross red lines. In addition, data/AI teams can use CI/CD to reuse vetted pipeline elements, reducing the risk and effort of governing them.
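As one illustration of a "no PII exposures" red line, a pipeline can gate records through a pattern scan before they reach a prompt or training set. The patterns and function names below are simplified stand-ins; production teams typically use dedicated PII scanners with far broader coverage.

```python
import re

# Hypothetical red-line check: scan pipeline outputs for PII patterns.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_findings(text: str) -> list[str]:
    """Return the kinds of PII detected in a piece of text."""
    return [kind for kind, pat in PII_PATTERNS.items() if pat.search(text)]

def gate(records: list[str]) -> list[str]:
    """Quarantine records that cross the PII red line."""
    return [r for r in records if not pii_findings(r)]

records = [
    "Contact alice@example.com for details",
    "Revenue grew 12% in Q3",
]
safe = gate(records)
```

Run as a pipeline test before deployment and as an observability check in production, a gate like this turns a governance policy into an enforceable, automated control.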

From Hype to Reality

Hyperbolic predictions about the transformative potential of GenAI mask the reality that data fundamentals matter more now than ever before. Disregarding these fundamentals will ensure GenAI transforms businesses in the wrong direction. Companies that adopt DataOps increase the odds of success by making GenAI data pipelines what they should be: modular, scalable, robust, flexible, and governed. Click here to learn how Matillion assists this approach.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
