
Why and How Streaming Data Drives the Success of Generative AI

ABSTRACT: This blog defines streaming data, explains why companies need it, and explores how streaming data pipelines feed multi-faceted GenAI applications.

Read time: 4 mins.

Sponsored by Striim

Defying typical patterns, GenAI adoption is accelerating more than two years after it started with the debut of ChatGPT. Nearly all (98%) respondents to the latest Executive Benchmark Survey are increasing their investments in data and AI, and nearly half (46%) have achieved “significant” or even “transformational” business value. In addition, 94% say their interest in AI has increased their focus on data, underscoring a strong desire for clean and trustworthy model inputs.

This blog explains why and how streaming data pipelines deliver such inputs. We define streaming data, explain why companies need it, and explore how streaming data pipelines feed multi-faceted GenAI applications. The bottom line: these bleeding-edge models will fail unless they tap the real-time stream of modern business events.

What is streaming data?

Let’s start with the definition. A streaming data pipeline is a workflow that captures a digital “event”—perhaps a credit card purchase, website login, or factory sensor alert—and replicates that event to a target platform. For example, a streaming data pipeline might use change data capture (CDC) technology to identify and replicate a new transaction record from a source relational database such as Oracle to a target lakehouse such as Databricks.


A streaming data pipeline captures digital events and replicates them to a target platform
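To make the pattern concrete, here is a minimal Python sketch of the capture-and-replicate step. It polls the source table for new rows, whereas production CDC tools such as Striim read the database's transaction log, so treat the table name, columns and delivery callback as illustrative assumptions rather than a real implementation.

```python
# Illustrative only: production CDC reads the database transaction log rather than
# polling, but the capture-and-replicate flow looks roughly like this.
import time

def replicate_new_transactions(source_conn, deliver, last_id=0, poll_seconds=5):
    """Capture rows added since last_id and hand each one to a delivery callback."""
    while True:
        rows = source_conn.execute(
            "SELECT id, account, amount, created_at FROM transactions WHERE id > ?",
            (last_id,),
        ).fetchall()
        for row in rows:
            event = {"id": row[0], "account": row[1], "amount": row[2], "ts": row[3]}
            deliver(event)        # e.g. append to a lakehouse table or publish to a topic
            last_id = row[0]
        time.sleep(poll_seconds)  # "real time" here means every few seconds
```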


Streaming pipelines typically relay events in “real-time” increments—every few minutes, seconds or even milliseconds, whatever the business requires. They might perform light transformations along the way, for example by reformatting or merging streams of events to enable rapid analysis upon arrival at the target. Modern streaming data pipelines draw on a diverse ecosystem that includes commercial pipeline tools such as Striim and open-source elements such as the Apache Kafka distributed messaging system.
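For the open-source side of that ecosystem, the sketch below consumes raw events from a Kafka topic, applies a light reformatting step and republishes them for the target to load. It assumes the kafka-python client, a local broker, and hypothetical topic and field names.

```python
# A minimal sketch of a light in-flight transformation, assuming the open-source
# kafka-python client; topic names and field names are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-transactions",                       # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    event = message.value
    # Light transformation: normalize the schema so the target can analyze it on arrival.
    shaped = {
        "event_type": "purchase",
        "account_id": event.get("acct"),
        "amount_usd": round(float(event.get("amt", 0)), 2),
        "occurred_at": event.get("ts"),
    }
    producer.send("curated-transactions", shaped)  # a downstream sink loads this topic into the lakehouse
```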

Why streaming data?

Many of the popular use cases for GenAI language models, including customer service, document processing and supply chain optimization, depend on instant access to the very latest business facts. To engage meaningfully with customers, employees or partners, GenAI chatbots need the latest customer purchases, market interest rates and local currency rates. This is especially true for “agentic” applications, which make decisions and take action with little or no human oversight. They cannot afford to get things wrong.

Streaming data pipelines help meet this real-time requirement. They offer a lightweight method of manipulating and delivering myriad events to the data stores that underlie GenAI language models. Unlike legacy pipelines that process batch loads, streaming pipelines can mix and match different sequences of events before delivering them to the target. The result is a granular, sophisticated and real-time view of fast-changing business conditions.

To understand how it works, let’s consider retrieval-augmented generation (RAG).

How streaming data supports RAG for GenAI

Probabilistic by design, GenAI language models make guesses when prompted with questions or instructions outside their training. To minimize these damaging “hallucinations,” many GenAI adopters implement RAG. RAG workflows retrieve domain-specific data, use that data to augment user prompts and thereby help the language model generate more accurate responses. They often get that domain-specific data from streaming pipelines.
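In pseudocode-like Python, a RAG request boils down to retrieve, augment, generate. The two helper functions below are placeholders for whichever data store and language model a team actually uses, not any specific product's API.

```python
# A bare-bones RAG sketch; retrieve_records() and generate() stand in for a real
# vector-store lookup and language-model call, so treat them as placeholders.
def retrieve_records(question: str, top_k: int = 5) -> list[str]:
    """Placeholder: query a vector or relational store for domain-specific records."""
    ...

def generate(prompt: str) -> str:
    """Placeholder: call the GenAI language model of your choice."""
    ...

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve_records(question))
    # Augment the user prompt with retrieved facts so the model grounds its response.
    prompt = (
        "Answer using only the context below. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```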

Case study: WeMoveIt

We illustrate this with a fictional case study for a container shipping company called WeMoveIt. As climate change leads to more disruptive storms, WeMoveIt’s customers have started demanding real-time shipment tracking and arrival estimates to help them adjust supply chains in response. WeMoveIt’s data team implements a new chatbot-enabled routing application, assisted by GenAI, RAG and machine learning.

Streaming Data Pipelines for RAG and GenAI

The workflow begins with event sources. These include an SAP database that stores cargo records, a proprietary SaaS application that handles customer orders, an Elasticsearch log store that tracks RFID tag scanners and a third-party service that emails hourly weather updates for shipment routes. WeMoveIt’s data team configures a streaming pipeline to capture real-time events from these diverse sources, then reformat, filter and deliver them to Azure Synapse. This streaming data pipeline complements the batch pipeline that transforms static documents into embeddings within a vector database. Together, these pipelines support RAG and GenAI. 
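The reformat-and-filter step of such a pipeline often amounts to mapping heterogeneous source events onto one schema and discarding what the application does not need. The sketch below illustrates this for WeMoveIt's fictional sources; the source names and fields are assumptions, not a real Striim or Synapse configuration.

```python
# A sketch of the reformat-and-filter step in WeMoveIt's (fictional) pipeline;
# source names and fields are illustrative assumptions.
from datetime import datetime, timezone

def normalize(source: str, raw: dict) -> dict | None:
    """Map events from different sources onto one shipment-event schema; drop noise."""
    if source == "sap_cargo" and raw.get("status"):
        return {"shipment_id": raw["cargo_id"], "kind": "status",
                "value": raw["status"], "ts": raw["updated_at"]}
    if source == "rfid_logs" and raw.get("scan_ok", True):
        return {"shipment_id": raw["container"], "kind": "location",
                "value": raw["terminal"], "ts": raw["scanned_at"]}
    if source == "weather_feed":
        return {"shipment_id": raw["route_id"], "kind": "weather",
                "value": raw["alert_level"], "ts": datetime.now(timezone.utc).isoformat()}
    return None  # filter out events the routing application does not need
```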

The consolidated tables and files in Azure Synapse become the foundation for RAG. When a customer enters her natural-language request for a shipment update, or a fleet manager requests a re-routing, the application retrieves the appropriate records and injects them into the user prompt. Armed with this latest information, the GenAI language model within the application can hold a responsible and reliable conversation with the user. The retrieval workflow also supports a predictive ML model that analyzes weather indicators to anticipate delays, notify customers and suggest alternative routes.

Enriched by streaming data pipelines, WeMoveIt’s AI initiative improves customer satisfaction and streamlines operations.

Streaming ahead

The value of data streaming goes further. Streaming pipelines can also help fine-tune GenAI language models (that is, adjust their parameters) to better interpret domain-specific data and thereby support specialized use cases without relying exclusively on RAG. For example, data teams might stream database records into a training corpus to help the language model “learn” customer behavior based on the most relevant, current purchase and service records. This makes the language model's responses more accurate and further reduces the likelihood of damaging hallucinations.
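As a rough sketch, streaming records into a fine-tuning corpus can be as simple as appending prompt-completion pairs to a training file. The JSONL format and the field names below are assumptions rather than any specific vendor's training API.

```python
# A sketch of turning streamed customer records into fine-tuning examples; the
# JSONL prompt/completion format and field names are assumptions.
import json

def append_training_examples(events: list[dict], corpus_path: str = "finetune.jsonl") -> None:
    """Convert recent purchase and service events into prompt-completion pairs."""
    with open(corpus_path, "a", encoding="utf-8") as f:
        for e in events:
            example = {
                "prompt": f"Customer {e['customer_id']} history: {e['summary']}\n"
                          "What is the most likely next request?",
                "completion": e["observed_next_request"],
            }
            f.write(json.dumps(example) + "\n")
```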

While GenAI models represent a massive advance in cognitive power, their ultimate business success depends on a return to the fundamentals of data management. This starts with reliable streaming pipelines that weave together trusted real-time facts. To learn more about their options for managing such pipelines, data and AI leaders can request a demo of Striim’s platform.

Kevin Petrie

Kevin is the VP of Research at BARC US, where he writes and speaks about the intersection of AI, analytics, and data management. For nearly three decades Kevin has deciphered...
