Should AI Bots Build Your Data Pipelines? Part III: The Emergence of Small Language Models for Data Engineering
ABSTRACT: An emerging approach to generative AI will help data engineering teams achieve much-needed productivity gains while controlling risk.
An emerging approach to generative AI will help data engineering teams achieve much-needed productivity gains while controlling risk. This approach centers on what Eckerson Group calls the “small language model.”
This blog, the third in a series about language models for data engineering, describes this technique and an early offering from Informatica. It builds on the first and second blogs, which define use cases as well as risks and governance practices with language models. The fourth blog will conclude the series with guiding principles for data teams to achieve much-needed productivity benefits. Together these blogs explore the opportunity for language models to assist many aspects of data engineering, including data discovery, quality checks, ingestion, transformation, and documentation.
Let’s start by reviewing the definition of a large language model, a term that quickens the pulse of techies everywhere. A large language model (LLM) is a type of neural network that learns, summarizes, and generates content. Once trained, the LLM produces textual answers to natural language prompts, often returning sentences and paragraphs faster than humans speak.
A large language model (LLM) is a type of neural network that learns, summarizes, and generates content.
Despite the aura of magic, such capabilities boil down to lots of basic number crunching. An LLM breaks reams of text into “tokens,” each representing a word, part of a word, or punctuation, then assigns a number to each token. During the training phase it studies how all the numbered tokens relate to one another in different contexts, and practices generating the next token in a string based on what came before. When OpenAI debuted ChatGPT in November 2022, the world started to comprehend the astonishing potential of its outputs.
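To make that number crunching concrete, here is a toy Python sketch of the idea: it numbers tokens and learns which token tends to follow which. Everything here is illustrative; a real LLM uses subword tokenizers and a neural network rather than frequency counts.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the "reams of text" an LLM trains on.
corpus = "the pipeline loads data . the pipeline transforms data .".split()

# Step 1: tokenize and assign a number to each distinct token.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(corpus))}

# Step 2: study how tokens relate by counting which token follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Generate the most frequent next token given what came before."""
    return following[token].most_common(1)[0][0]

print(vocab["pipeline"])         # the numeric id assigned to "pipeline"
print(predict_next("pipeline"))  # the token that most often follows it
```

Scaled up to billions of tokens and replaced with learned neural weights instead of counts, this "predict the next token" loop is the core of what an LLM does.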
Enter the small language model
A small language model (SLM) uses the same techniques as an LLM but applies them to a specific domain. It might start with pre-trained LLM logic, either from a vendor or an open-source community, and customize that logic further. The following chart compares LLMs and SLMs in terms of governance and specialization. It builds on a concept that Snorkel AI founder Alex Ratner shared at The Future of Data-Centric AI Conference earlier this month (although he did not use the SLM term).
A small language model applies LLM techniques to small, specific domains
We can view LLMs and SLMs as two ends of a spectrum with overlap in between. Overall, SLMs distinguish themselves from LLMs in one or more of the following ways.
SLMs are more fine-tuned because vendors or companies train them on detailed, domain-specific data, for example to assist complex data engineering tasks.
They enrich user prompts, for example by injecting domain-specific data into a user’s question to make the response more accurate.
They augment outputs, for example by having multiple models generate outputs from different datasets to give users more contextual knowledge.
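The second pattern, prompt enrichment, can be sketched in a few lines of Python. The catalog contents and function name below are hypothetical, invented purely to illustrate injecting domain-specific metadata into a user's question before it reaches a model.

```python
# Hypothetical catalog metadata; table names and descriptions are illustrative.
catalog = {
    "sales_orders": "Fact table of orders; columns: order_id, customer_id, amount, order_date.",
    "customers":    "Dimension table; columns: customer_id, region, segment.",
}

def enrich_prompt(question: str) -> str:
    """Inject domain-specific metadata into a user's question so the
    model answers with the company's actual context in view."""
    # Retrieve descriptions of tables the question mentions.
    relevant = [desc for name, desc in catalog.items() if name in question]
    context = "\n".join(relevant) or "No matching tables found."
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = enrich_prompt("Which columns join sales_orders to customers?")
print(prompt)
```

The enriched prompt now carries column-level metadata, so the model can answer a join question from the company's own schema rather than from generic training data.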
Data pipeline vendors are building SLMs with these capabilities now, often alongside LLMs, to help companies tackle specialized data engineering problems with better governance. This will help data teams boost productivity while reducing risks related to data quality, fairness, and explainability. We should get ready for a boom of small language models in data engineering and many other fields.
Get ready for a boom of small language models
While it’s early days, data vendors such as Informatica are building SLMs into their data management platforms. They fine-tune them on small, curated, and labeled training inputs that consume limited compute resources and minimize privacy risks for customers. This diagram illustrates Informatica’s current plans. In some cases Informatica might also incorporate fine-tuned LLM code such as GPT-4 if it proves sufficiently accurate. Informatica plans to integrate its CLAIRE GPT offering, now in private preview, into its product portfolio later this year. Let’s work through the diagram from the bottom up.
Governed training inputs. Informatica created its training inputs by cleansing, anonymizing, and transforming users’ catalog searches to build metadata that describes potential user dialogs with CLAIRE GPT. Its team further enriched those inputs by adding synthetic metadata that mimics the real stuff. The result: a curated repository of SLM training inputs that describe how companies discover, validate, and explore datasets. This repository gives Informatica the governed inputs it needs to start training and fine-tuning four custom SLMs.
Small language models. Here we have some fun. Informatica is training CLAIRE GPT to assist the use cases of data discovery, data lineage and quality, data exploration, and pipeline design. While Informatica’s platform already assists users with AI, these four SLMs will make it easier for them to enter natural language commands, understand metadata, and manipulate datasets. CLAIRE GPT will therefore help business users and data teams do the following.
Discover data. Find and classify relevant data by querying the catalog for various types of tables, models, or analytical outputs.
Trace lineage and quality. Describe data lineage, including sources and how the data changed, with inferencing to fill gaps. Then observe its quality by checking factors such as completeness, consistency, and accuracy.
Explore data. Explore data to understand its context and suitability for analytics. Also explore data structures, for example by auto mapping and describing relationships between columns, tables, or databases.
Design pipelines. Configure pipelines, with a focus on extract, load, and transform (ELT) sequences that load data into a target before transforming and preparing it for analytics.
Natural language interface. On top of these SLMs sits the natural language interface that receives, interprets, and responds to user prompts. Rather than clicking onto a graphical interface—or in addition to doing so—users can just enter natural language questions or commands. An LLM within the interface then classifies the user’s intent and directs queries to the right SLM. If a data analyst asks for statistics about customer churn in Europe last quarter, the interface converts her natural language question to a graph query that instructs the discovery SLM to describe how the most relevant tables relate to one another.
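A minimal sketch of that routing step might look like the following Python. The intent labels and keyword lists are invented for illustration; a production interface would use an LLM classifier rather than keyword matching.

```python
# Hypothetical routing table mapping each SLM to cues in user prompts.
# A real system would classify intent with an LLM, not keyword matching.
SLM_ROUTES = {
    "discovery":   ("find", "classify", "search", "statistics"),
    "lineage":     ("lineage", "quality", "provenance", "source"),
    "exploration": ("explore", "relationship", "mapping"),
    "pipeline":    ("pipeline", "load", "transform", "elt"),
}

def route(prompt: str) -> str:
    """Classify the user's intent and direct the query to the right SLM."""
    words = prompt.lower()
    for slm, keywords in SLM_ROUTES.items():
        if any(keyword in words for keyword in keywords):
            return slm
    return "discovery"  # fall back to data discovery by default

print(route("Show statistics about customer churn in Europe last quarter"))
```

In this sketch, the analyst's churn question lands on the discovery SLM, while a follow-up question about provenance would route to the lineage SLM instead.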
Informatica designs its platform to support iterative sequences of user prompts. For example, if a data engineer now asks about the lineage and quality of those customer churn tables, the interface converts his question to a GraphQL query of the catalog that shows the tables’ provenance from source to target. Users also can enter longer instructions into this interface than they can with ChatGPT, which helps them get answers specific to their own metadata and requirements.
As with many new technologies, language models deliver better results when they get specific. Small language models, coupled with natural language interfaces and the right training inputs, aim to deliver results in the form of a productivity boost for data engineers. Informatica and peers such as TimeXtender and Illumex.ai are moving in this direction. In addition, vendors such as Predibase help companies create their own SLMs to address various data science or data engineering use cases.
Stay tuned for our final blog in this series, which defines guiding principles for data leaders to take advantage of both today’s LLMs and tomorrow’s SLMs.