Analyst Series: Should AI Bots Build Your Data Pipelines?
Kevin Petrie, the Vice President of Research at Eckerson Group, and Dan O’Brien, research analyst, discussed large language models (LLMs), which are neural networks that analyze text to predict the next word or phrase. These models use training data, often from the internet, to understand word relationships and provide accurate answers to natural language questions.
Dan and Kevin discussed the use of LLMs as assistants in various data engineering tasks. They found that LLMs were most useful in tasks such as writing documentation, building sequences of tasks, and assembling starter code for data pipelines, but emphasized the need for careful inspection of their outputs.
Kevin discussed the costs and benefits of large language models. He mentioned that while the productivity benefits were significant, the costs included risks such as lack of explainability, privacy concerns, data quality issues, handling of intellectual property, and potential bias.
Dan and Kevin discussed the concept of small language models (SLMs) compared to large language models. They concluded that small language models can be fine-tuned on domain-specific data, enriched with context and information, and augmented with outputs from other models to achieve accurate and efficient results for specific tasks.
Dan thanked Kevin for his insights and recommended that readers check out Kevin's blog series on the Eckerson Group website. Kevin suggested following him on LinkedIn for daily updates and expressed interest in hearing the stories of software startup founders in the space.
My name is Dan O'Brien, and I'm a research analyst at Eckerson. Today I'm speaking with Kevin Petrie about his recent work on generative AI and large language models in his blog series and ebook “Should AI Bots Build Your Data Pipelines?”
Kevin is the Vice President of Research at Eckerson Group, where he manages the research agenda and writes independent reports on topics including data observability, machine learning, and cloud data platforms.
Hi, Kevin. Thanks for speaking with me today.
Hey, Dan. Thanks for having me.
Let's jump into it. So what is a large language model?
A large language model is a type of neural network that studies text in order to predict the next word or phrase in a string of words and phrases. It does this by taking a large corpus of training data, which oftentimes comes from public data on the internet, and tokenizing every distinct word, phrase, or punctuation mark. These tokens are numbers.
Then it starts statistically understanding the relationships, the relative importance of various words to one another. By studying all those interrelationships across millions and billions of words, it can start to predict with a fair amount of certainty and accuracy what an appropriate answer is to a natural language question.
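The mechanism Kevin describes can be sketched in miniature: assign each distinct word an integer token, count which token tends to follow which, and predict the most frequent successor. This toy bigram counter only illustrates the principle; real LLMs learn these relationships with neural networks over billions of tokens, and the tiny corpus here is invented for the example.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the "large corpus of training data".
corpus = "data pipelines move data and data pipelines transform data".split()

# Tokenize: assign each distinct word an integer id.
vocab = {word: i for i, word in enumerate(dict.fromkeys(corpus))}
tokens = [vocab[w] for w in corpus]

# Count how often each token follows each other token.
follows = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word after `word`."""
    token = vocab[word]
    best_token = follows[token].most_common(1)[0][0]
    inverse = {i: w for w, i in vocab.items()}
    return inverse[best_token]

print(predict_next("data"))  # "pipelines" follows "data" most often here
```

An LLM does the same kind of next-token prediction, but with learned parameters rather than raw counts, which is what lets it generalize far beyond its training text.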
It's a pretty fascinating and powerful effect that took the world by storm when ChatGPT was released by OpenAI in November 2022. The speed, the articulate answers, the range of questions that it could respond to really surprised people. There are obviously some downsides too that we can talk about as well.
Yeah, I mean they seem really, really impressive from what I've seen. But you do emphasize that these large language models should be used as assistants. Don't we want them to do our jobs for us?
We want their help, but we need to watch them very carefully. A comparison that I think makes sense is if you picture a 16- or 18-year-old savant that has amazing cognitive powers, but very little real-world context or judgment.
You want to harness that energy and have it help you with cognitive tasks, but you need to watch it like a hawk. Give it very specific instructions and make sure that you inspect the quality of its outputs.
Great. So you describe that you can use these LLMs in the data engineering process. So that's data ingestion, data transformation, data ops, and orchestration. In which of these four areas are LLMs most useful today?
It's interesting to see the amazingly fast adoption of ChatGPT and, in turn, Bard and Bloom and others. This has given us a fair amount of data about what the early adopters are doing. We've done some surveys on LinkedIn and found, among 60 or more data practitioners, that 43% of them are using ChatGPT, for one, to support engineering use cases.
One of the most popular use cases is writing documentation. So documenting pipelines or documenting their environment, that is the kind of tedious work that many, many data engineers don't like to do and they're very happy to get the assistance. ChatGPT can also help with building sequences of tasks to approach a project. It can also assemble starter code to help develop data pipelines. It all needs to be checked very carefully, but it's very helpful for productivity.
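As a hedged illustration of "starter code," here is the kind of extract-transform-load skeleton an assistant might draft for a pipeline. The column names and the positive-amount filtering rule are hypothetical, and as Kevin stresses, anything a model produces like this needs careful human review before it touches real data.

```python
import csv
import io

def extract(source):
    """Read CSV rows from a file-like object into dicts."""
    return list(csv.DictReader(source))

def transform(rows):
    """Keep rows with a positive amount and normalize the field to float."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount > 0:
            cleaned.append({**row, "amount": amount})
    return cleaned

def load(rows, sink):
    """Write transformed rows back out as CSV."""
    writer = csv.DictWriter(sink, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# In-memory stand-ins for real source and destination files.
raw = io.StringIO("id,amount\n1,10.5\n2,-3\n3,7\n")
out = io.StringIO()
load(transform(extract(raw)), out)
print(out.getvalue())
```

A draft like this saves typing, but an engineer still has to verify the schema assumptions, the error handling, and the business rule before deploying it.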
You look at a couple of guiding principles in your blogs, and you suggest comparing the costs and benefits of large language models. Everyone touts the benefits. But what are some of the costs of large language models, besides the direct compute costs?
Yeah. Great question. I'll first give a nod to the benefits. We found that three quarters of respondents to a recent poll, the early adopters, saw productivity gains of up to 30% in their ability to complete their jobs, and the rest saw gains of more than that. So it is profound what you can achieve.
The costs are very real. The cost is really in the form of risk and the time spent trying to reduce your risk to a manageable level. There are a number of different risks, all of which are familiar to the practitioners.
One is explainability: you don't understand why the model produced a given answer. This is the very definition of a black box in some sense. You're not able to explain to other stakeholders why you got a certain answer, and that can introduce different concerns, including ones related to compliance.
Privacy is another concern. A number of folks believe they've inadvertently exposed proprietary, confidential, maybe personally identifiable information to a public platform such as ChatGPT or Bard, which can introduce future concerns about compliance with PII regulations.
Other concerns relate to data quality. That's a big one. Hallucinations very much happen. And the real risk with ChatGPT is that it's very articulate in giving you an answer that could be complete nonsense or made up because it wants to please. It wants to create statistically delightful strings of words and phrases. And so the data quality risk is a significant one.
Another one is related to handling of intellectual property and making sure that you're not mishandling anything and breaching any responsibilities that you have there.
And the final one I'd point to is bias. This probably applies less when it comes to data pipeline building, unless of course you're taking transformation code that you got from a large language model and then applying it to data and coming up with outputs that treat certain groups of individuals unfairly. But bias is very much a risk in other cases as well.
You've coined a term: small language models. And what are these in comparison to large language models?
For folks there's an aura of magic. Let's look at GPT-4, for example. The estimates are that it took a trillion parameters, which are values that represent the relationships between the different tokens that you study. And so ChatGPT, it's fair to say, went out and studied a whole lot of data to produce its results.
But there are more recent studies finding that you actually don't need a ton of training data in order to come up with some pretty sophisticated results. And the reality is dawning on a lot of companies and a lot of software vendors that it's economical for them to build their own models. These models can be what I would call small. They're small in three ways.
One is that they're going to be fine-tuned on domain-specific data, often internal enterprise data. Because if you really want to do some hard work related to processing of documents, related to all kinds of functional use cases, you want to train a model on your data and make sure you're getting the most relevant results possible, reducing the risk of hallucinations.
Another is prompt enrichment. The more you can really feed domain-specific information and context into your prompts, the more you can increase the likelihood of having a safe, fair, accurate, and IP-responsible output. That prompt enrichment is a critical piece.
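A minimal sketch of what prompt enrichment can look like in practice, assuming a naive keyword-overlap retriever: before sending a question to a model, prepend the most relevant internal snippets so the answer is grounded in your own domain data. Production systems typically use embedding-based search instead, and the document snippets here are invented examples.

```python
# Hypothetical internal knowledge snippets about a data environment.
DOCS = [
    "Orders land in the raw_orders table every hour.",
    "The finance team owns the revenue pipeline.",
    "Customer emails are PII and must be masked.",
]

def retrieve(question, docs, k=1):
    """Rank docs by word overlap with the question; return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def enrich(question, docs):
    """Build a context-enriched prompt to send to the model."""
    context = "\n".join(retrieve(question, docs))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(enrich("Which table holds raw orders?", DOCS))
```

The enriched prompt steers even a general-purpose model toward domain-specific, grounded answers, which is exactly the effect Kevin describes.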
And then the third aspect of what makes a small language model small is that you're getting domain-specific by potentially augmenting outputs. You could have outputs from one language model and outputs from another language model, or from a different type of AI. That starts to triangulate outputs so that you can make sure that your outputs are actually accurate.
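One way to picture that triangulation is to ask two independent models the same question and only accept the answer when they agree. The model functions below are hypothetical stand-ins for real API calls; the point is just the agreement check.

```python
def model_a(question):
    """Stand-in for one model's answer (a real system would call an API)."""
    return "raw_orders"

def model_b(question):
    """Stand-in for a second, independent model's answer."""
    return "raw_orders"

def triangulate(question):
    """Return the answer if both models agree, else flag it for review."""
    a, b = model_a(question), model_b(question)
    return a if a == b else "NEEDS_HUMAN_REVIEW"

print(triangulate("Which table holds raw orders?"))
```

Disagreement doesn't prove either model wrong, but it is a cheap signal for routing an answer to a human reviewer instead of trusting it blindly.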
So through those three ways, even if you're starting with large language model code, you can get down into a small language model context and get domain-specific enough to get real work done: through prompt enrichment, through output enrichment, and through fine-tuned domain-specific training.
Fantastic. Thank you for your insight, Kevin. I really appreciate it. If you want to learn more, you can read Kevin's blog series “Should AI Bots Build Your Data Pipelines?” on our website, EckersonGroup.com.
And Kevin, what's the best way for listeners to follow what you're doing?
Look me up on LinkedIn. I post there daily. I love to engage with practitioners. We do periodic surveys of what adopters are up to with these new technologies, and that's a great way to find me. If you're a founder of a software startup in this space, I'd love to hear your story.