DataOps for Generative AI Data Pipelines, Part I: What and Why

ABSTRACT: The success of Generative AI depends on fundamental disciplines like DataOps.

Sponsored By Matillion

Generative AI introduces new algorithms, use cases, and business opportunities. But its success depends on familiar and fundamental disciplines—such as DataOps.

The explosive rise of generative AI, whose language models interpret and create content using natural language prompts, makes data management more difficult than ever. GenAI requires diverse datasets, often text or imagery with minimal structure, rather than the tables that data teams typically handle. Data teams must prepare these inputs so that language models understand their nuanced meaning and context. If they get it wrong, they raise the risk of hallucinations or biased outputs.

This blog explains how the discipline of DataOps mitigates governance risks such as these. DataOps helps data engineers collaborate with data scientists, NLP engineers, and ML engineers to manage effective, efficient data pipelines in support of GenAI initiatives.


DataOps helps data teams manage effective, efficient data pipelines for GenAI 


What is DataOps?

The discipline of DataOps uses methodologies from DevOps, agile software development, and total quality management to improve the quality and timeliness of data delivery. It has four pillars: testing of data pipelines, continuous integration and delivery (CI/CD), pipeline orchestration, and data observability. Many companies have embraced DataOps over the last decade as they seek to meet voracious business demand for data that feeds operational applications, business intelligence, and now artificial intelligence/machine learning (AI/ML) initiatives such as GenAI.

Data pipelines for GenAI

While data pipelines for GenAI introduce new steps and data types, they comprise the familiar stages of extraction, transformation, and loading (ETL). A common sequence for GenAI pipelines is extract and load, transform, then load again.

  • Extract and load. Data teams use tools such as Matillion to extract relevant text from applications and files, then load it into a landing zone on platforms such as the Databricks lakehouse.

  • Transform. Next, they convert words to numerical tokens, group the tokens into “chunks,” then create vectors, also called embeddings, that describe the meaning and interrelationships of chunks. Pipeline tools, Hugging Face’s BERT tokenizer, and developer frameworks such as LangChain assist these steps.

  • Load. Finally, they load the embeddings into a vector database such as Pinecone or Weaviate, or a vector-capable platform such as Databricks or MongoDB. Once again they use pipeline tools such as Matillion to perform the loading.
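
To make the three stages concrete, here is a minimal Python sketch that uses only the standard library. Everything in it is a stand-in chosen for illustration: the whitespace tokenizer, the hash-based `embed` function, and the dictionary that plays the role of a vector database are assumptions, not how Matillion, a BERT tokenizer, or any vector database actually works.

```python
import hashlib
import math

def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for a real tokenizer)."""
    return text.lower().split()

def chunk_tokens(tokens, size, overlap=0):
    """Group tokens into chunks of `size`; each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks

def embed(chunk, dims=8):
    """Toy embedding: hash each token into one of `dims` buckets, then L2-normalize."""
    vec = [0.0] * dims
    for tok in chunk:
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def load(store, doc_id, chunks):
    """Write one record per chunk into a dict that stands in for a vector database."""
    for i, chunk in enumerate(chunks):
        store[f"{doc_id}:{i}"] = {"text": " ".join(chunk), "vector": embed(chunk)}
    return store

# Extract (here, a literal string), transform, then load.
tokens = tokenize("a b c d e f g h i j")
store = load({}, "doc1", chunk_tokens(tokens, size=4, overlap=1))
```

In a real pipeline the embedding would come from a trained model, so the vectors would capture meaning rather than just which words occur.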

Once the prepared data resides in the vector database, it is ready to support GenAI. A common scenario is retrieval-augmented generation (RAG), in which an application retrieves relevant vectors and injects that content into a user’s prompt to make the response more accurate. Language models also can query data within the vector database as part of the fine-tuning process so that they better understand domain-specific data.
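
The retrieval half of RAG can be sketched in a few lines. This example assumes word-count vectors and cosine similarity as the relevance measure, and the sample chunks and prompt template are hypothetical. A production system would instead query a vector database with model-generated embeddings.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent text as word counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks, k=2):
    """Inject the retrieved chunks into the prompt ahead of the user's question."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical chunks, as if already prepared by the pipeline.
chunks = [
    "refunds are issued within 14 days of purchase",
    "our office is closed on public holidays",
    "shipping takes 3 to 5 business days",
]
question = "how many days do refunds take"
prompt = build_prompt(question, chunks, k=1)
```

The language model then receives `prompt` rather than the bare question, which is what grounds its answer in the retrieved content.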


Once the prepared data resides in the vector database, it can support retrieval-augmented generation (RAG) or fine-tuning of language models


DataOps

Now let’s examine how the four pillars of DataOps (testing, CI/CD, orchestration, and data observability) make these GenAI data pipelines more efficient and effective. We base our example on text data.

Test

Data engineers use testing to examine and validate data pipelines. They inspect code, execute functions, then assess behavior and results. They also perform A/B tests that compare pipelines so they can choose the best version. By testing pipelines at each stage of development and production, they can spot and fix errors before they affect downstream users, models, or applications.

Data engineers, data scientists, NLP engineers, and ML engineers should test GenAI data pipelines using the same methodology. For example, they might run A/B tests of GenAI outputs to compare chunking techniques, and find that overlapping chunks convey contextual meaning more effectively than chunks that have no overlap. They also can compare vectorization techniques, and find that they improve the accuracy of vectors by accounting for different word frequencies in different source documents.
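
A simplified version of the chunking comparison might look like the sketch below. It does not score GenAI outputs. As a cheaper proxy, it assumes a hypothetical metric called `phrase_recall`: the share of key phrases that survive intact inside at least one chunk. The sample sentence and phrases are invented for illustration.

```python
def chunk_words(words, size, overlap=0):
    """Split a word list into chunks of `size`, stepping forward by size - overlap."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [words[i:i + size] for i in range(0, len(words), step)]

def contains_phrase(chunk, phrase):
    """True if the words of `phrase` appear consecutively inside `chunk`."""
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))

def phrase_recall(chunks, phrases):
    """Share of phrases that appear intact inside at least one chunk."""
    hits = sum(any(contains_phrase(c, p) for c in chunks) for p in phrases)
    return hits / len(phrases)

words = "the warranty covers battery failure but excludes water damage entirely".split()
phrases = [["battery", "failure"], ["excludes", "water", "damage"]]  # hypothetical key phrases

variant_a = chunk_words(words, size=4)             # A: no overlap
variant_b = chunk_words(words, size=4, overlap=2)  # B: two-word overlap

recall_a = phrase_recall(variant_a, phrases)
recall_b = phrase_recall(variant_b, phrases)
```

In this toy case, variant A cuts both phrases at a chunk boundary while variant B keeps them whole. A real A/B test would confirm the effect on end-to-end answer quality, since overlap also increases the number of chunks to store and search.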

Continuous Integration / Continuous Deployment (CI/CD)

Data engineers iterate both pipelines and datasets to maintain quality standards while relying on a single version of truth for production. They might branch a version of pipeline code into a development platform such as GitHub so they can fix errors or make enhancements, test it again, then merge the revised code back into production. In a similar fashion, they might branch versions of data, then revise and merge as needed.

CI/CD assists GenAI data pipelines in a similar fashion. Data teams can branch a pipeline version out of production if they see performance slipping, for example with customer service chatbots that start to hallucinate. They run A/B tests to compare different pipelines that have different source data, then merge the pipeline with the most accurate answers into production. These steps also can assist A/B tests that compare chunking and vectorization techniques as described above.
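
The merge decision can be expressed as a simple gate. The sketch below assumes each pipeline version can be wrapped in a function that answers a question, and that the team keeps a small labeled evaluation set. The names `select_for_merge` and `min_gain`, the branch names, and the exact-match scoring are illustrative assumptions rather than features of any CI/CD tool.

```python
def accuracy(answer_fn, eval_set):
    """Fraction of evaluation questions that a pipeline answers exactly as expected."""
    correct = sum(1 for question, expected in eval_set if answer_fn(question) == expected)
    return correct / len(eval_set)

def select_for_merge(production, candidates, eval_set, min_gain=0.0):
    """Pick the best candidate branch, but only if it beats production by more than min_gain."""
    prod_name, prod_fn = production
    baseline = accuracy(prod_fn, eval_set)
    scores = {name: accuracy(fn, eval_set) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    if scores[best] > baseline + min_gain:
        return best, scores[best]
    return prod_name, baseline

def make_pipeline(known_answers):
    """Stand-in for a chatbot backed by one version of the data pipeline."""
    return lambda question: known_answers.get(question, "unknown")

# Hypothetical evaluation set and pipeline versions built from different source data.
eval_set = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
production = ("production", make_pipeline({"q1": "a1"}))
candidates = {
    "branch_a": make_pipeline({"q1": "a1", "q2": "a2"}),
    "branch_b": make_pipeline({"q1": "a1", "q2": "a2", "q3": "a3"}),
}

winner, score = select_for_merge(production, candidates, eval_set)
```

Setting `min_gain` above zero keeps production stable when a candidate is only marginally better on a small evaluation set.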

Orchestrate

Data teams automate data delivery and consumption by grouping their tasks into workflows that transform and move data between various stores, algorithms, processors, applications, and microservices. Orchestration tools such as Apache Airflow help them build, schedule, track, and reuse workflows. By orchestrating these various workflows and elements, data engineers reduce the repetitive work of managing data pipelines.

Orchestration plays a critical role with GenAI initiatives because they rely on multiple data pipelines and comprise an ever-growing number of applications and analytical functions. For example, suppose a GenAI chatbot consumes data from a pipeline as part of a RAG process. Then the application assesses the customer’s satisfaction level with a different ML model, adjusts the chatbot response, and finally delivers the response to the customer. This type of workflow requires careful orchestration. Data teams must schedule and monitor the execution of each discrete task across all these elements.
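
The chatbot workflow above can be modeled as a dependency graph. This sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to run each task only after its upstream tasks finish. The task names, the canned retrieval result, and the fixed satisfaction score are hypothetical. A tool such as Apache Airflow would add scheduling, retries, and monitoring on top of the same idea.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the results and the run order."""
    context, log = {}, []
    for name in TopologicalSorter(dependencies).static_order():
        context[name] = tasks[name](context)
        log.append(name)
    return context, log

def retrieve(ctx):
    # Stand-in for the RAG lookup against the vector database.
    return ["refunds are issued within 14 days"]

def draft(ctx):
    return f"Based on our policy: {ctx['retrieve'][0]}."

def score(ctx):
    # Hypothetical satisfaction score that a separate ML model would produce.
    return 0.4

def adjust(ctx):
    # Soften the reply when the customer appears dissatisfied.
    if ctx["score"] < 0.5:
        return ctx["draft"] + " Sorry for the trouble."
    return ctx["draft"]

def deliver(ctx):
    return ctx["adjust"]

tasks = {"retrieve": retrieve, "draft": draft, "score": score, "adjust": adjust, "deliver": deliver}

# Each key maps a task to the set of tasks that must finish before it starts.
dependencies = {
    "draft": {"retrieve"},
    "score": {"draft"},
    "adjust": {"draft", "score"},
    "deliver": {"adjust"},
}

context, log = run_workflow(tasks, dependencies)
```

Declaring dependencies separately from task logic is what lets an orchestrator reuse, reschedule, and track individual tasks.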

Observe

Data teams observe both data quality and pipeline performance. They use data quality observability tools to assess data accuracy, completeness, consistency, and timeliness—then resolve issues that arise. They use data pipeline observability tools to monitor and optimize the performance, availability, and utilization of data pipelines as well as the underlying infrastructure.

GenAI data pipelines also require observability. Data teams must ensure the vectors that leave the landing zone match the vectors that arrive in the vector database, and that updates arrive on time. In addition, data engineers and business experts must append the right metadata to source files, documents, and even chunks, then work with NLP engineers and data scientists to ensure the vector database index accurately reflects the contextual meaning of that metadata. This aspect of data quality observability, even if manual, requires careful attention.
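
The match-and-timeliness checks lend themselves to automation. The sketch below assumes both sides can be exported as a mapping from record id to vector, and that the pipeline records when it last loaded data. The function names, the six-decimal rounding, and the sample values are illustrative choices, not part of any observability product.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def fingerprint(vector):
    """Stable hash of a vector, rounded to six decimals to ignore tiny float noise."""
    payload = ",".join(f"{x:.6f}" for x in vector)
    return hashlib.sha256(payload.encode()).hexdigest()

def reconcile(source, target):
    """Compare two {id: vector} mappings and report missing, unexpected, and altered ids."""
    shared = set(source) & set(target)
    return {
        "missing": sorted(set(source) - set(target)),
        "unexpected": sorted(set(target) - set(source)),
        "altered": sorted(k for k in shared if fingerprint(source[k]) != fingerprint(target[k])),
    }

def is_fresh(last_loaded_at, now, max_age):
    """True if the most recent load happened within the allowed window."""
    return now - last_loaded_at <= max_age

# Hypothetical snapshots of what left the pipeline and what the vector database holds.
sent = {"a": [0.1, 0.2], "b": [0.3, 0.4], "c": [0.5, 0.6]}
stored = {"a": [0.1, 0.2], "b": [0.3, 0.41], "d": [0.0, 0.0]}
report = reconcile(sent, stored)

last_load = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
check_time = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
fresh = is_fresh(last_load, check_time, max_age=timedelta(hours=1))
```

A check like this covers delivery and timeliness only. Whether the metadata and index reflect the right contextual meaning still needs the human review described above.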

While GenAI seems to usher in an unprecedented era, DataOps gives companies’ data teams well-established and adaptable methods for operationalizing this powerful new technology. I recommend learning more about how Matillion enables the methodology described here.

Kevin Petrie

Kevin is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data...
