
Should AI Bots Build Your Data Pipelines? Part IV: Guiding Principles for Success with Language Models and Data Engineering

ABSTRACT: This blog recommends guiding principles for successful implementation of language models to assist data engineering.

The irony of adding robots to your team is that they need lots of care and feeding, as do the humans that manage them. This holds true for data teams that use language models to build and manage data pipelines.

This blog, the fourth and final installment of our series, recommends guiding principles for successful implementation of language models to assist data engineering. The first blog defined language models and use cases; the second explored risks; and the third described the emergence of “small language models” that reduce those risks. Together these blogs explore why and how language models—the most popular form of generative AI—make data engineers more productive as they discover, validate, ingest, transform, and document data for analytics.

To recap, a large language model (LLM) is a type of neural network that learns, summarizes, and generates content based on statistical calculations about how words relate to one another. Once trained, the LLM produces textual answers to natural language prompts, often returning sentences and paragraphs faster than humans speak. Examples of LLMs include ChatGPT from OpenAI, Bard from Google, and BLOOM from Hugging Face. A small language model (SLM) applies the same techniques as an LLM, but also uses fine-tuning, enriched prompts, or augmented outputs to support more specialized use cases in a more governed fashion. Data vendors such as Informatica, TimeXtender, and Illumex offer SLMs. We can expect SLMs to become the standard approach to language models for data engineering.

Data leaders and engineers should adopt five guiding principles to achieve their intended results with language models, including LLMs and SLMs. They should manage these models like assistants; compare costs and benefits; embrace prompt engineering; train their teams; and adapt data governance programs to address language models. Let’s walk through each principle in turn.

Guiding Principle 1. Manage your language model like an assistant

A startup founder recently told me he has a team of six but it feels like more because they use LLMs to assist tasks such as brainstorming and content creation. He has the right mindset. Language model tools are like employees that we manage to improve team productivity. These tools do not replace humans; on the contrary, they need expert human oversight. An LLM is akin to a 20-year-old savant, long on cognitive powers but short on real-world judgment. It can do great things provided its manager trains it well, inspects its work, and incorporates that work into approved organizational processes. Data engineers and other practitioners should manage their language models in a similar way.

Language models are assistants that need human oversight

Guiding Principle 2. Compare costs and benefits

Any data engineer who has tried LLMs or early-stage SLM offerings understands well the primary benefit: productivity. LLMs can document data environments, build starter code for pipelines, find relevant public datasets, and so on, all of which helps overwhelmed data teams get more work done in less time. Other potential benefits include education on new techniques and elevation of team members to more strategic roles. For example, data engineers might become more like data architects as they enter natural language commands to design pipelines rather than writing scripts from scratch.

Then we come to costs. Open source software and free or discounted vendor prototypes minimize the upfront costs of language models. However, teams must account for the risks posed to data quality, privacy, intellectual property (IP), bias, and explainability, as well as the time it takes to mitigate those risks. Practitioners must inspect the quality of language model outputs to ensure they don’t break pipelines, deliver bad data to the business, or compromise privacy. They must ensure they don’t mishandle IP or propagate bias. Given the opacity of language models, teams also might need more time to explain how they handle data so business owners and external stakeholders such as investors, auditors, and customers have peace of mind.

So, what’s the net result? On the positive side of the ledger, data leaders should estimate the productivity boost their teams get from using LLMs and/or vendors’ initial SLM offerings. On the negative side, they should gather estimates of the time needed to reduce risks to an acceptable level—and predict the likelihood and cost of things going wrong. Tabulate all these estimates to assess the net benefit of language models before your data team and company make a formal commitment to a language model tool.
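The tabulation described above can be sketched in a few lines of arithmetic. The figures and variable names below are purely illustrative assumptions, not estimates from the article; substitute your own team's numbers.

```python
# Illustrative cost-benefit sketch for adopting a language model tool.
# All figures are hypothetical placeholders -- replace with your own estimates.

hours_saved_per_week = 30          # estimated productivity boost from LLM assistance
mitigation_hours_per_week = 10     # time spent inspecting outputs, privacy/IP checks
hourly_rate = 75.0                 # blended engineering cost per hour, USD

incident_probability = 0.05        # weekly chance a bad output slips through
incident_cost = 20_000.0           # estimated cost of one pipeline or data incident

# Positive side of the ledger: value of hours saved.
weekly_benefit = hours_saved_per_week * hourly_rate

# Negative side: risk mitigation labor plus expected cost of things going wrong.
weekly_cost = (mitigation_hours_per_week * hourly_rate
               + incident_probability * incident_cost)

net_weekly_value = weekly_benefit - weekly_cost
print(f"Net weekly value: ${net_weekly_value:,.2f}")
```

A positive net value under conservative assumptions is one reasonable threshold for making a formal commitment; a negative one suggests the team should first reduce mitigation time or incident risk.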

Compare the productivity benefits and risk mitigation costs before your data team makes a formal commitment to language models

Guiding Principle 3. Embrace prompt engineering

ChatGPT defines prompt engineering as “designing and refining prompts for artificial intelligence (AI) language models to generate desired outputs. It involves carefully crafting instructions or queries to elicit specific responses or behaviors from the model.” That’s not a bad definition. Data teams should adopt the following best practices for effective prompt engineering.

  • Give explicit commands. Users should enter precise prompts, for example by telling the language model to “play the role of a data engineer,” a “compliance officer,” and so on. Then they can instruct it to complete a task associated with that role, providing as much context and domain-specific detail as possible.

  • Guide their outputs. Users should align the model’s outputs with their company’s processes and templates. When requesting a pipeline design, they might tell the language model “your summary will start with an overview, then explain the sources, transformation logic, target system, data validation techniques, and error handling methods, in that order.”

  • Iterate. One-shot prompts rarely get the job done. Users should ask the same question in different ways, compare the answers, and interrogate the model about gray areas. Sequences of prompts like these help fix erroneous outputs, fill logical gaps, and build multiple layers of understanding so the user can take confident action. Bill Schmarzo of Dell Technologies recommends creating a Socratic dialogue in which users test assumptions, inspect evidence, and explore implications.

By adopting these best practices for prompt engineering, data teams can get more accurate and reliable results with language models.
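The first two practices, an explicit role command and a template that guides output structure, can be combined into a single prompt. The helper function and example values below are hypothetical illustrations, not part of any vendor's API.

```python
def build_prompt(role, task, context, output_sections):
    """Assemble a structured prompt: an explicit role command, the task,
    domain-specific context, and a template that fixes the order of the
    model's output (per the best practices above)."""
    sections = ", ".join(output_sections)
    return (
        f"Play the role of a {role}.\n"
        f"Task: {task}\n"
        f"Context: {context}\n"
        f"Your summary will cover, in this order: {sections}."
    )

# Hypothetical usage: requesting a pipeline design.
prompt = build_prompt(
    role="data engineer",
    task="Design a pipeline that loads daily orders into the warehouse.",
    context="Source: Postgres orders table; target: cloud warehouse; SLA: 6 a.m. ET.",
    output_sections=["overview", "sources", "transformation logic",
                     "target system", "data validation techniques",
                     "error handling methods"],
)
print(prompt)
```

The third practice, iteration, then amounts to varying the task and context fields across a sequence of such prompts and comparing the model's answers.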

Prompt engineering makes language model outputs more accurate and reliable

Guiding Principle 4. Train your team

Data teams need training on what works and what doesn’t. They can start by learning from the successes and failures of early adopters within the organization, then codifying those lessons for all to see. To that end, early adopters should contribute peer presentations and user templates, reinforced by corporate guidelines, that teach others how to swim. Data leaders also should consider implementing a center of excellence that trains business users on governed self service. All these training activities should become part of a formal corporate education program that has executive endorsement and oversight.

Data teams should learn from early adopters’ successes and failures with language models

Guiding Principle 5. Adapt your data governance program to address language models

As explained in blog 2 of our series, data engineers need consistent policies, tools, and techniques to prevent wrong outputs or negative outcomes. This means adapting your existing data governance program to address generative AI. Governance teams should enlist early adopters to identify and scope risks related to data quality, explainability, bias, privacy, and IP handling. What tools, techniques, and behaviors can reduce these risks? By answering questions like this, governance teams can build or adapt the right policies for risk mitigation and have data stewards enforce them. Designed and enforced well, a blended data/AI governance program like this can balance openness with control to minimize risk, boost productivity, and help drive new business value with AI.

A blended data/AI governance program can minimize risk and boost productivity

TLC for LLMs and SLMs

Language models, in particular the emerging standard of SLMs, need tender loving care from their human bosses. By adopting the guiding principles described here, data teams can capture the upside of generative AI and get more work done in less time. This includes managing models like assistants; running a cost-benefit analysis; embracing prompt engineering; putting a training program in place; and incorporating language models into existing data governance programs. With the right TLC, language models might help data engineers finally get on top of their workloads.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
