The New Data Pipeline for Generative AI: Where and How It Works

ABSTRACT: Generative AI initiatives will require new data pipelines that prepare text files for querying by language models.

While a duck glides smoothly on the surface of a pond, it paddles hard underneath. Such is the case with Generative AI: it’s smooth on the surface but works hard underneath. This blog explores that hard work, which takes the form of new data pipelines that process data for language models.

The Rising Tide of Generative AI

After decades of false starts, the field of artificial intelligence has gained traction in recent years thanks to advances in algorithms, computing power, and data collection. These advances especially contribute to generative AI (GenAI), a type of artificial intelligence that generates digital content such as text, images, or audio based on what it learns from a corpus of existing content.

Within the sphere of GenAI, language models garner much of the attention. A language model (LM) is a type of neural network that interprets, summarizes, and generates text. Once trained, the LM produces smart answers to natural language prompts from humans. It quantifies how words relate to one another, then generates its own strings of words for humans to read.
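
To make "quantifies how words relate to one another" a little more concrete, here is a toy sketch in Python. It is not a neural network and not how any commercial LM works: it simply counts which word follows which (a bigram model) and then generates text by following the most frequent next word. The corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_counts(corpus):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts

def generate(counts, start, max_words=5):
    """Greedily append the most frequent next word until none is known."""
    words = [start]
    while len(words) < max_words and counts[words[-1]]:
        next_word, _ = counts[words[-1]].most_common(1)[0]
        words.append(next_word)
    return " ".join(words)

# Illustrative three-sentence corpus (an assumption, not real training data).
corpus = [
    "the duck paddles hard",
    "the duck glides smoothly",
    "a duck paddles hard underneath",
]
counts = train_bigram_counts(corpus)
print(generate(counts, "the"))  # prints "the duck paddles hard underneath"
```

A real LM learns far richer relationships than adjacent-word counts, but the two stages are the same: quantify relationships first, then generate.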


Vendors such as OpenAI, Google, Microsoft, Hugging Face, and Anthropic offer various commercial and open-source LM tools. People rushed to experiment with these LM tools after OpenAI’s release of ChatGPT in November 2022, with OpenAI alone amassing 100 million active users in just two months. Many employees now use generative AI tools such as ChatGPT, Microsoft 365 Copilot, and Salesforce Einstein GPT to boost their productivity. Users must still inspect outputs for accuracy and ensure compliance with governance policies.

The Next Wave

To improve accuracy and gain competitive advantage, some companies are working to embed GenAI functions (that is, LMs) into proprietary workflows that process their own domain-specific data. This approach, an example of what Eckerson Group calls small language models (see my earlier blog), requires data pipelines that prepare text files for querying by LMs. Data team members collaborate to design and implement the pipelines that make this happen. The pipelines span text sources, tokens, vectors, vector databases, and LMs.

Data Pipelines for Generative AI

  1. Text. Data scientists and data engineers assemble unstructured text from source files such as emails, customer service records, or audio transcripts from various applications. They use tools such as Unstructured.io, Airbyte, or Fivetran to connect those sources and ingest their content into a cloud platform such as Databricks. Data scientists and engineers also transform the text, for example by converting it to a common format such as Delta, JavaScript Object Notation (JSON), or Comma Separated Values (CSV) for efficient access and manipulation.

  2. Tokens. Natural language processing (NLP) engineers, data scientists, or ML engineers work with data engineers to convert the text into tokens, using tools such as Hugging Face’s BertTokenizer, Unstructured.io, Airbyte, and Prophecy. Each token represents a word, character string, or punctuation mark; the tokens might be stored, for example, as comma-separated values in a CSV file. These stakeholders also “chunk” tokens together so that the resulting vectors capture how the tokens relate to one another.

  3. Vectors. ML engineers, perhaps with the help of data scientists or NLP engineers, now transform the data further. They convert tokens and chunks into numerical vectors that describe their meaning and grammatical characteristics. Techniques such as Word2Vec and GloVe help with this process, as do tools from Airbyte, OpenAI, Hugging Face, and the LangChain framework.

  4. Vector database. Now the data engineers, ML engineers, or DBAs load the vectors into the vector database for storage and indexing. This might be a specialist platform such as Pinecone, Weaviate, or Qdrant, or a feature of a broader platform such as Redis or SingleStore. Again, tools such as Airbyte perform the loading.

  5. Language model. Game time! ML engineers, NLP engineers, and possibly software developers implement the language model on top of the vector database. When users enter prompts, the language model searches and queries the vector database. For example, it performs a nearest-neighbor search of vectors that describe past answers to similar prompts. It pulls up those vectors, possibly ranks or filters them, then gives a fresh and specific answer to that user’s question. This process is also known as retrieval-augmented generation (RAG) or grounding because it augments or “grounds” the responses with more specific data than what the LM was trained on.
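
The five steps above can be sketched end to end in a few dozen lines of Python. Treat this as a minimal illustration under loud assumptions, not a reference implementation: the "embedding" is a toy word-count vector standing in for learned embeddings such as Word2Vec, `InMemoryVectorStore` is a hypothetical stand-in for a real vector database, the sample documents are invented, and the final step only builds the grounded prompt rather than calling an actual LM.

```python
import json
import math
import re

def to_json_record(doc_id, raw_text):
    """Step 1: normalize whitespace and store the text in a common JSON format."""
    return json.dumps({"id": doc_id, "text": " ".join(raw_text.split())})

def tokenize(text):
    """Step 2a: split text into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def chunk(tokens, size=8, overlap=2):
    """Step 2b: group tokens into overlapping chunks (assumes size > overlap)."""
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def build_vocabulary(chunks):
    """Assign each distinct token its own position in the vector."""
    vocabulary = {}
    for tokens in chunks:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def embed(tokens, vocabulary):
    """Step 3: toy embedding, a unit-length vector of token counts."""
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

class InMemoryVectorStore:
    """Step 4: hypothetical stand-in for a vector database."""
    def __init__(self):
        self.rows = []  # each row is (vector, chunk_text, doc_id)

    def add(self, vector, text, doc_id):
        self.rows.append((vector, text, doc_id))

    def nearest(self, query_vector, k=2):
        """Brute-force cosine similarity (dot product of unit vectors)."""
        scored = [
            (sum(a * b for a, b in zip(query_vector, vector)), text, doc_id)
            for vector, text, doc_id in self.rows
        ]
        return sorted(scored, key=lambda row: row[0], reverse=True)[:k]

def build_prompt(question, hits):
    """Step 5: ground the prompt with retrieved chunks before calling an LM."""
    context = "\n".join(f"- {text}" for _, text, _ in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Invented sample documents, for illustration only.
documents = {
    "email-1": "Your order shipped on Monday and should arrive within five days.",
    "ticket-7": "To reset your password, open Settings and choose Reset Password.",
}
records = [json.loads(to_json_record(doc_id, raw)) for doc_id, raw in documents.items()]
chunked = [(r["id"], piece) for r in records for piece in chunk(tokenize(r["text"]))]
vocabulary = build_vocabulary(piece for _, piece in chunked)

store = InMemoryVectorStore()
for doc_id, piece in chunked:
    store.add(embed(piece, vocabulary), " ".join(piece), doc_id)

question = "How do I reset my password?"
hits = store.nearest(embed(tokenize(question), vocabulary))
prompt = build_prompt(question, hits)
print(prompt)
```

In this sketch both retrieved chunks come from the password ticket, because they share the tokens "reset" and "password" with the question. A production pipeline would swap each function for one of the tools named above and would use an approximate nearest-neighbor index instead of a brute-force scan.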

Vendors such as Prefect, Zipstack, and Vectara help configure, execute, and monitor workflows that span many or all of these pipeline steps. These new data pipelines for generative AI require close and extensive collaboration. Many initiatives will require a project manager who assigns and monitors execution of all these steps. In addition, a data analyst might help the data scientist or data engineer explore and interpret the available data.

We can expect these pipelines and generative AI elements to integrate with traditional structured data pipelines in many cases. When a customer poses a question in the service portal, the LM might need to query both the vector database, to generate a natural language response, and a tabular database, to insert the correct purchase information. Along these lines, Google recently announced querying capabilities that apply to both vector and tabular datasets.
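
Here is a rough Python sketch of that service-portal scenario, combining a vector lookup with a tabular lookup. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-made rather than produced by an embedding model, the `purchases` table and its row are invented, and SQLite stands in for whatever tabular database a company actually runs.

```python
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical vector store: (hand-made embedding, passage).
passages = [
    ([0.9, 0.1, 0.0], "Orders ship within two business days of purchase."),
    ([0.0, 0.2, 0.9], "Passwords can be reset from the account settings page."),
]

# Hypothetical tabular store holding structured purchase records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (customer_id TEXT, item TEXT, purchased_on TEXT)")
db.execute("INSERT INTO purchases VALUES ('c-42', 'espresso machine', '2023-10-02')")

def answer_context(customer_id, question_vector):
    """Combine the closest passage with the customer's purchase record."""
    _, passage = max(passages, key=lambda row: cosine(question_vector, row[0]))
    row = db.execute(
        "SELECT item, purchased_on FROM purchases WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    purchase = f"{row[0]} bought on {row[1]}" if row else "no purchase on file"
    return f"Policy: {passage}\nPurchase: {purchase}"

# In this toy space, a shipping question lands near the first passage.
print(answer_context("c-42", [0.8, 0.2, 0.1]))
```

The combined context would then be handed to the LM, which writes the natural language response around the retrieved policy text and the exact purchase details.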

Pipeline to the Future

AI will fast become a competitive differentiator for innovative companies. But that innovation depends on careful design, implementation, and integration of these new data pipelines—then adaptation to meet future needs. That’s the best way to build a future pipeline of business in the age of AI.

Kevin Petrie

Kevin is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data...
