The Data Leader’s Guide to Generative AI, Part I: Models, Applications, and Pipelines
ABSTRACT: Data leaders must prepare their teams to deliver the timely, accurate, and trustworthy data that GenAI initiatives need.
As with all disruptive technologies, generative AI offers both upsides and downsides to early adopters. In this case, the upsides include rich digital interactions and a healthy productivity boost. The downsides, however, range from confused customers to operational errors and privacy breaches.
Timely, accurate, and trustworthy data makes the difference between these two outcomes. Chief data officers (CDOs) and their teams therefore have a big role to play. To make generative AI (GenAI) initiatives successful, data teams must modernize their environments, extend governance programs, and collaborate tightly with their data science colleagues. This blog starts a three-part series that examines these requirements in the following areas.
Blog 1: language models and GenAI applications, as well as the data pipelines that support them.
Blog 2: the LLMOps lifecycle of building, training, deploying, and optimizing LMs and GenAI applications.
Blog 3: prompt enrichment, also known as retrieval-augmented generation (RAG).
Generative AI (GenAI) refers to a type of neural network that humans train to interpret and create digital content such as text, images, or audio. In 2017, researchers at Google introduced the “transformer,” an architecture that converts sequences of inputs into sequences of outputs. This gave rise to the language model (LM), which is essentially a huge calculator that predicts content, often strings of words, based on what it learned from existing content. The LM relies on an “attention network” whose parameters quantify how tokens (for example, words or punctuation marks) relate to one another. By weighing those relationships, the LM can generate fluent, context-aware responses to human prompts.
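To make the idea of attention concrete, here is a minimal sketch of how one token can be scored against others. It is a simplification under stated assumptions: the two-dimensional vectors are hypothetical, and a real transformer first passes its token vectors through learned projection matrices that this sketch omits.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product weights of one token vector over a list of token vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical vectors for illustration; a trained LM learns these from data.
token_vectors = {
    "bank": [1.0, 0.2],
    "river": [0.9, 0.1],
    "loan": [0.1, 1.0],
}

weights = attention_weights(token_vectors["bank"], list(token_vectors.values()))
for token, weight in zip(token_vectors, weights):
    print(f"{token}: {weight:.2f}")
```

With these made-up vectors, “bank” attends more strongly to “river” than to “loan” because their vectors point in a similar direction.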
OpenAI’s release of ChatGPT, built on its GPT-3.5 model, in November 2022 triggered today’s arms race among open source communities and vendors such as Google, Microsoft, Hugging Face, and Anthropic to build ever more powerful LMs. The recent chaos with OpenAI’s board and its investor Microsoft illustrates the high stakes, and the risks, as tech gorillas wrestle to capitalize on this technology.
To gain competitive advantage and deliver trustworthy results with GenAI, companies need to feed their LMs domain-specific data rather than high volumes of public Internet content. They can implement such domain-specific LMs in three ways. (Eckerson Group also calls these domain-specific models “small language models” because they’re customized to process smaller datasets.)
Build from scratch. Data science teams design a new LM and train it on their own domain-specific use of language as well as their own facts.
Fine-tune. They take a pre-trained LM such as Llama or BLOOM and fine-tune it on their domain-specific language and facts.
Enrich prompts. They inject domain-specific data into user prompts to ensure the LM gets the facts right.
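The third option, prompt enrichment, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the facts in DOMAIN_FACTS are invented, enrich_prompt is a hypothetical helper, and the simple word-overlap ranking stands in for the vector search that production systems typically use.

```python
import re

# Hypothetical domain facts; in practice these come from governed company data.
DOMAIN_FACTS = [
    "Standard shipping takes 5 business days.",
    "Returns are accepted within 30 days of delivery.",
    "Premium support is available on weekdays from 9 to 5.",
]

def words(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def enrich_prompt(question, facts, top_k=1):
    """Prepend the facts that share the most words with the question."""
    q_words = words(question)
    ranked = sorted(facts, key=lambda fact: len(q_words & words(fact)), reverse=True)
    context = "\n".join(f"- {fact}" for fact in ranked[:top_k])
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}"
    )

prompt = enrich_prompt("How long does shipping take?", DOMAIN_FACTS)
print(prompt)
```

The enriched prompt, rather than the user's bare question, is what gets sent to the LM, so the quality of the injected facts directly shapes the answer.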
Data leader requirements
Data leaders must train their teams, especially data engineers and data stewards, on GenAI technology and implementation options for domain-specific LMs. As part of this research process, data engineers should use GenAI tools to assist their own work and start exchanging ideas with their data science colleagues. Data teams should evaluate (1) the ability of their environment to support each implementation option, and (2) the ability of their data governance programs to reduce LM risks related to hallucinations, privacy, bias, or intellectual property. Based on these evaluations, data leaders can identify what people, processes, and tools must change to support the GenAI initiatives that business leaders and data science teams are considering.
The GenAI application combines an LM, a user interface such as a chatbot, and functionality that executes tasks based on LM outputs. Together these elements assist functions such as customer service, document processing, and specialized research. For example, this summer Priceline announced plans for a GenAI application that helps customers book travel, as well as internal applications that help employees develop software and create marketing content. Health providers at Meditech, meanwhile, use GenAI applications to summarize patient histories, auto-generate clinical documents, and place orders.
Data leader requirements
Developers must build GenAI applications that are modular, scalable, and configurable to ensure data science teams can rapidly iterate as they learn. Data teams should collaborate with developers, data scientists, and ML engineers to help them meet these requirements. To start, data teams must maintain an open architecture with easily adjustable elements, including open APIs, data formats, and tools. However, they also must implement safeguards to govern how GenAI applications access and consume data. This might include masking of personally identifiable information (PII), checks for content restricted by intellectual property (IP) rights, or role-based access controls for developers.
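As one illustration of such a safeguard, the sketch below masks two kinds of PII before text reaches a GenAI application. It is a simplified example, not a complete control: the patterns cover only email addresses and one US-style phone format, mask_pii is a hypothetical helper, and the sample record is invented.

```python
import re

# Simplified patterns for illustration; real PII detection needs broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def mask_pii(text):
    """Replace email addresses and phone numbers with placeholder labels."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

record = "Contact Ana at ana.diaz@example.com or 555-123-4567 about her refund."
masked = mask_pii(record)
print(masked)  # Contact Ana at [EMAIL] or [PHONE] about her refund.
```

Applying the mask inside the pipeline, before data is loaded for LM consumption, keeps the control in one governed place rather than in each application.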
Now we come to the bread-and-butter responsibility of data teams, especially data engineers: managing data pipelines. While pipelines for GenAI introduce new steps and data types, they comprise the familiar stages of extraction, transformation, and loading (ETL). The sequence of extract and load, transform, and load again can support GenAI processes such as LM fine-tuning and RAG.
Extract and load. Pipeline tools such as Prophecy, Matillion, or Airbyte help extract relevant text from applications and files, then load it into a landing zone on platforms such as the Databricks lakehouse.
Transform. Next, data engineers transform the data to prepare it for LM consumption. They convert words to numerical tokens, group the tokens into “chunks,” then create vectors, also called embeddings, that describe the meaning and interrelationships of chunks. Various pipeline tools, tokenizers (for example, the tokenizer for the BERT model, available through Hugging Face), and AI developer frameworks such as LangChain assist these steps.
Load. Finally, data engineers load the resulting vectors (embeddings) into a vector database such as Pinecone or Weaviate, or a vector-capable platform such as Databricks or MongoDB. Once again they use pipeline tools such as Prophecy, Matillion, or Airbyte to perform the loading.
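The transform and load steps above can be sketched in a few lines. This is a toy version under stated assumptions: word-level tokens stand in for the numerical tokens a model tokenizer produces, normalized word counts stand in for embeddings from a trained model, and InMemoryVectorStore is a hypothetical substitute for a vector database.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (real tokenizers emit numeric IDs)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def chunk(tokens, size=8):
    """Group tokens into fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(tokens):
    """Toy embedding: word counts scaled to unit length, stored sparsely."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())

class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.rows = []

    def load(self, text, vector):
        self.rows.append((text, vector))

    def search(self, vector):
        """Return the stored text most similar to the query vector."""
        return max(self.rows, key=lambda row: cosine(vector, row[1]))[0]

# Invented sample text standing in for extracted content in the landing zone.
document = (
    "Refunds are issued within ten days of a return. "
    "Orders ship from the Denver warehouse every weekday morning."
)

store = InMemoryVectorStore()
for piece in chunk(tokenize(document)):
    store.load(" ".join(piece), embed(piece))

best_match = store.search(embed(tokenize("when are refunds issued")))
print(best_match)
```

Note that the fixed-size chunks cut across sentence boundaries, which is one reason chunking strategy deserves testing during pilots.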
Data leader requirements
Data leaders must have their data engineers collaborate with data scientists, ML engineers, and/or natural language processing (NLP) engineers to design and build these data pipelines. These data science colleagues have the right expertise with the new unstructured data types, transformation techniques, and vector databases that GenAI requires. Data leaders also must have their teams ensure accuracy, for example by assigning the right metadata to text files and ensuring those files align with master data management processes. Perhaps the most important requirement for data leaders is to institute phased pilots with rigorous testing and rapid iteration. These cross-functional teams should get as many mistakes out of the way as possible before they go into production.
2024 will be the year in which GenAI demonstrates whether it’s worth the hype. And CDOs hold the answer to this question. Data leaders must prepare their teams now to deliver the timely, accurate, and trustworthy data that GenAI initiatives need to ensure they deliver results rather than disappointment. They can do so by modernizing their environments, extending data governance programs, and fostering tight collaboration with data science teams. The next blog in this series will apply these principles to the LLMOps lifecycle.