Enterprise Data and the Taming of the Generative AI Frontier
ABSTRACT: US frontier history had races, risks, and rewards. Generative AI's future will follow a similar path.
Sponsored by Prophecy
At noon on a spring day in 1889, thousands of settlers raced across the prairie to grab a stake in the Oklahoma Land Rush. Farms, schools, and churches soon followed, taming the frontier.
In November 2022, OpenAI opened a new frontier by releasing ChatGPT and demonstrating the possibilities of generative AI. As early adopters rush to embrace AI language models, companies are rapidly devising ways to tame them. The answer might lie in their own enterprise data and the governance programs they use to control it.
Let’s start with the definition of a language model, the heart of generative AI.
A language model (LM) is a type of neural network that summarizes and generates content by learning how words relate to one another. A trained LM produces textual answers to natural language questions or “prompts,” often returning sentences and paragraphs faster than humans can speak. While ChatGPT gets headlines, other notable LMs include Bard by Google and open source models such as LLaMA by Meta and BLOOM, from the Hugging Face-led BigScience project. These “large” language models are trained on massive text inputs and contain billions of “parameters” that encode the interrelationships of words.
A language model is a neural network that learns, summarizes, and generates content
Language models might unleash a new world of innovation. Knowledge workers of all types already use LMs as productivity assistants. For example, 43% of 61 data practitioners told Eckerson Group in a recent poll that they already use LMs to help document environments, build starter code for data pipelines, and learn new techniques. In another poll, 73% of 40 early adopters said LMs make them up to 30% more productive. (These results are filtered by respondents' job titles.) It’s no surprise, then, that Databricks acquired the startup MosaicML for $1.3 billion to help companies put language models into production.
This frontier, of course, has a wild side: LMs make things up—i.e., “hallucinate”—when they don't have the right inputs. An LM generates strings of words that become logical sentences and paragraphs. But it doesn't “know” anything in the human sense of the term: rather, it takes guesses based on the statistical interrelationships of words it has studied. This can become a big problem when users pose detailed questions to LMs that lack enterprise-specific context because they were only trained on public data.
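To make the "statistical guessing" idea concrete, here is a toy sketch in Python. It is not how production LMs work (they are neural networks that weigh far more context), and the tiny corpus and function names are invented for illustration. It only counts which word follows which, then samples a likely next word.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, length=8, seed=0):
    """Build a word sequence by repeatedly sampling a likely next word."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        followers = counts.get(output[-1])
        if not followers:
            break  # this word was never followed by anything in training
        choices, weights = zip(*followers.items())
        output.append(rng.choices(choices, weights=weights)[0])
    return " ".join(output)


# Illustrative corpus (an assumption, not real enterprise data).
corpus = (
    "the pipeline loads the table . "
    "the pipeline cleans the table . "
    "the settlers cross the prairie ."
)
model = train_bigrams(corpus)
print(generate(model, "the"))
```

Every adjacent word pair in the output appeared somewhere in the corpus, yet the sequence as a whole can be one the corpus never contained, such as a pipeline that "cleans the prairie." That gap between locally plausible and actually true is the hallucination problem in miniature.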
So when you put an LM to work in your enterprise environment, you create new risks for data quality, privacy, intellectual property, fairness, and explainability. Companies must adapt their data governance programs to mitigate these risks. They must feed accurate data into the LMs as part of the training process, natural-language prompting, or both.
Companies must feed accurate and domain-specific inputs into their language models as part of the training process, natural language prompting, or both
Time to civilize
To capture the upside and minimize the risk of language models, companies are training and prompting LMs with their own domain-specific data. I call these types of LMs small language models; others refer to them as private language models. Whatever you call them, they aim to civilize the frontier of generative AI. They differ from standard ChatGPT or Bard implementations in the following ways (although they might use their starter code).
They enrich user prompts by finding, retrieving, and injecting domain-specific data into a user’s question to make the response more accurate.
They are fine-tuned: vendors or companies train them further on detailed, domain-specific data, for example to assist with complex data engineering tasks.
They augment outputs by combining the results of multiple models. For example, an LM might have a complementary model that recommends additional reading so users can validate facts and gain contextual knowledge.
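The third approach, pairing an LM with a complementary model, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `fake_primary` and `fake_reading` are hypothetical stand-ins for real models, and the catalog titles are invented.

```python
from dataclasses import dataclass, field


@dataclass
class AugmentedAnswer:
    answer: str
    further_reading: list = field(default_factory=list)


def answer_with_reading(question, primary_model, reading_model, max_links=3):
    """Combine a primary model's answer with a second model's reading list."""
    answer = primary_model(question)
    reading = reading_model(question, answer)[:max_links]
    return AugmentedAnswer(answer=answer, further_reading=reading)


def fake_primary(question):
    """Hypothetical stand-in for a call to a language model."""
    return f"Draft answer to: {question}"


def fake_reading(question, answer):
    """Hypothetical stand-in for a recommender; matches keywords to titles."""
    catalog = {
        "pipeline": "Internal guide: building data pipelines",
        "governance": "Internal policy: data governance basics",
    }
    return [title for key, title in catalog.items() if key in question.lower()]


result = answer_with_reading(
    "How do I document a pipeline?", fake_primary, fake_reading
)
print(result.answer)
print(result.further_reading)
```

Keeping the two models behind plain function arguments means either one can be swapped for a real service later without changing how the answer and its reading list are combined.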
Companies can train small, domain-specific language models in a matter of hours or days rather than weeks or months, provided they have governed and accurate training inputs.
In our Eckerson Group webinar last month, Prophecy co-founders Raj Bains and Maciej Szpakowski explored how domain-specific data holds the key to generative AI. By enriching user prompts with their own governed, domain-specific data, they said, companies get better outputs, with a lower likelihood of hallucinations and all the attendant risks. In the world of generative AI, that sounds pretty civilized.
Maciej demonstrated how companies can build vector databases that store their text documents, drawn from Slack, HubSpot, or other popular applications in their environments, much like data warehouses store tables. They can use Prophecy to build streaming pipelines that combine vector databases such as Pinecone, Weaviate, or Vespa with LMs. These pipelines take user questions, find relevant documents (perhaps Slack messages or support tickets) that companies embed into their vector databases, and inject data from those documents into the prompts.
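The flow described above (embed documents, find the ones closest to a question, inject them into the prompt) can be sketched in a few lines of Python. This is not Prophecy's implementation and does not use the Pinecone, Weaviate, or Vespa APIs. As assumptions for illustration, a word-count vector stands in for a real embedding model, `InMemoryVectorStore` stands in for a vector database, and the sample documents are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self._items = []

    def add(self, doc):
        self._items.append((embed(doc), doc))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(
            self._items, key=lambda item: cosine(q, item[0]), reverse=True
        )
        return [doc for _, doc in ranked[:k]]


def build_prompt(question, store, k=2):
    """Inject the k most similar documents into the user's question."""
    context = "\n".join(f"- {doc}" for doc in store.search(question, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


store = InMemoryVectorStore()
store.add("Support ticket: password reset emails are delayed by the mail queue.")
store.add("Slack message: the quarterly planning meeting moved to Thursday.")
store.add("Support ticket: reset a password from the account settings page.")

prompt = build_prompt("How do I reset my password?", store)
print(prompt)
```

The enriched prompt would then be sent to the language model, a step omitted here. The design point is that the model sees only governed, relevant documents alongside the question, so the unrelated planning message never reaches it.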
With this specific guidance, LMs provide domain-specific natural language responses. This can work for a number of use cases. For example, a company might have a language model answer customer questions in a Slack channel based on their own product documentation and customer support records.
The history of the US frontier involves races, then risks, and finally rewards. The future of generative AI will follow a similar pattern. Check out our webinar to learn more.