
Streams Everywhere: Towards Streaming-First Architectures

It seems obvious that data is more valuable the more current it is. However, processing data in (near) real time is not as easy as it sounds. Traditional data architectures were mostly built to support strategic business decisions, where timeliness is not a critical factor and periodic batch scripts that extract and integrate data from various sources are perfectly sufficient. In today’s fast-moving and data-driven world, however, getting information as fast as possible is essential to compete, and the calls for real-time data are getting louder. This trend puts many analytics architectures in a predicament: traditional batch-oriented approaches can no longer meet business requirements, and many system landscapes need to incorporate streaming components in some way.

There are various approaches to integrating streaming data, ranging from isolated real-time systems through hybrid combinations of batch and streaming components to streaming-first concepts that introduce a whole new way of building data pipelines.

Figure 1. Different processing models

1. The Traditional Way: Batch-First

In most traditional data architectures, batch jobs periodically extract, transform, and load (ETL) data from various data sources. This data is then stored in a persistence layer (e.g., a data warehouse), which provides an integrated view for downstream analytics systems. This approach is straightforward and ensures consistency, as all data is checked before entering the data warehouse and analytics systems can access a single data source. For most non-time-critical use cases, such a traditional batch architecture is sufficient, but as soon as there is a need for more up-to-date or even (near) real-time information, these batch-oriented concepts often hit a wall.
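As a rough illustration of the batch model, the following sketch runs one extract-transform-load pass over a day's worth of records. The source rows, the `run_batch` function, and the dictionary standing in for a data warehouse are all hypothetical; they only show where validation happens and why records are not visible downstream until the job has run.

```python
# Hypothetical source rows; in practice these would come from operational systems.
SOURCE_ORDERS = [
    {"id": 1, "amount": "19.90", "day": "2024-01-01"},
    {"id": 2, "amount": "n/a", "day": "2024-01-01"},  # fails validation
    {"id": 3, "amount": "5.00", "day": "2024-01-02"},
]


def run_batch(source, warehouse, day):
    """Extract one day's rows, validate and transform them, then load them."""
    extracted = [row for row in source if row["day"] == day]
    loaded = 0
    for row in extracted:
        try:
            amount = float(row["amount"])  # transform: text to number
        except ValueError:
            continue  # rejected rows never reach the warehouse
        warehouse[row["id"]] = {"amount": amount, "day": row["day"]}
        loaded += 1
    return loaded


warehouse = {}  # stands in for the persistence layer
loaded_count = run_batch(SOURCE_ORDERS, warehouse, "2024-01-01")
```

Order 3 stays invisible to downstream systems until a later run picks up its day, which is the latency the following sections try to remove.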

That said, the need for real-time systems is not entirely new. For instance, operational analytics systems with real-time monitoring capabilities have been around for a long time. However, in traditional analytics architectures, such streaming systems mostly had very specific use cases and were often implemented in isolated environments outside the data warehouse landscape.

2. The Workaround: Hybrid “Two-Speed” Architectures

With the increasing importance and value of data, calls for integrating the real-time and batch worlds became louder, and various hybrid approaches emerged. One of the most popular hybrid concepts is the Lambda architecture [1]. It distinguishes between a batch and serving layer, which stores all records and provides pre-aggregated views, and a speed layer, which processes incoming streaming data in real time. The pre-aggregated and streaming data are then combined at query time to provide real-time results.

These hybrid structures solve many issues by connecting the traditional batch world with real-time streaming concepts. However, this comes at the price of increased complexity and a lot of redundancy, as code and systems for both the batch and the speed layer have to be developed and maintained.
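The query-time merge can be sketched in a few lines. This is a minimal in-memory illustration, not a reference implementation of the Lambda architecture; the class `LambdaView` and its method names are assumptions made for this example.

```python
from collections import Counter


class LambdaView:
    """Toy Lambda-style store: revenue per customer from a batch view plus a speed view."""

    def __init__(self):
        self.master = []             # batch layer: append-only record of all events
        self.batch_view = Counter()  # pre-aggregated view, rebuilt by each batch run
        self.speed_view = Counter()  # speed layer: events since the last batch run

    def ingest(self, customer, amount):
        self.master.append((customer, amount))
        self.speed_view[customer] += amount

    def run_batch(self):
        # Recompute the view from all records, then reset the speed layer.
        view = Counter()
        for customer, amount in self.master:
            view[customer] += amount
        self.batch_view = view
        self.speed_view = Counter()

    def query(self, customer):
        # Results are combined at query time.
        return self.batch_view[customer] + self.speed_view[customer]


store = LambdaView()
store.ingest("acme", 10)
store.ingest("acme", 5)
store.run_batch()
store.ingest("acme", 1)  # arrives after the batch run, served by the speed layer
total = store.query("acme")
```

Note that the aggregation logic appears twice, once in `ingest` and once in `run_batch`. Even in this toy, the two code paths have to be kept consistent by hand.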

3. Shifting Mindset: Streaming-First Architectures

Increasingly complex hybrid infrastructures cause a lot of hassle, while streaming technologies have made a lot of progress at the same time. Cutting out the redundant structures and using streaming as the primary approach therefore seems natural, and so the idea of the streaming-first architecture was born.

Streaming-first architectures do not actively pull data with regular batch jobs. Instead, they passively subscribe to data sources (e.g., with techniques like Change Data Capture [2]) and extract changes and new records to a log (also known as a message queue or event hub), from where the data is pushed to downstream analytics systems. As an evolution of the hybrid Lambda architecture, some also refer to this approach as the Kappa architecture [3, 4].
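To make the idea of a change log concrete, the sketch below keeps a downstream replica in sync purely by replaying change events. It is a simplification under stated assumptions: `SourceTable` and `apply_changes` are hypothetical names, and the toy table writes its own change events, whereas log-based Change Data Capture tools typically read them from the database's transaction log.

```python
class SourceTable:
    """Toy table that records every change in an append-only log."""

    def __init__(self, log):
        self.rows = {}
        self.log = log

    def upsert(self, key, value):
        op = "update" if key in self.rows else "insert"
        self.rows[key] = value
        self.log.append({"op": op, "key": key, "value": value})

    def delete(self, key):
        del self.rows[key]
        self.log.append({"op": "delete", "key": key, "value": None})


def apply_changes(replica, log, offset):
    """Apply all change events from `offset` onward and return the new offset."""
    for event in log[offset:]:
        if event["op"] == "delete":
            replica.pop(event["key"], None)
        else:
            replica[event["key"]] = event["value"]
    return len(log)


log = []      # stands in for the message queue / event hub
replica = {}  # downstream copy, e.g. a table in an analytics system
source = SourceTable(log)

source.upsert("c1", "Alice")
source.upsert("c2", "Bob")
offset = apply_changes(replica, log, 0)

source.upsert("c1", "Alicia")
source.delete("c2")
offset = apply_changes(replica, log, offset)  # only the two new events are read
```

Because the consumer tracks its own offset, it never rescans the source table; it processes only what changed since it last read the log.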

With this approach, data is processed as soon as it is available, not just when a batch job runs. Moreover, it provides a great deal of flexibility: one can build sophisticated, event-based data pipelines, easily integrate new systems, and also incorporate existing components, e.g., by pushing incoming data to an existing data warehouse.

Streams Everywhere – Towards an Asynchronous Analytics Architecture

It is important to see that the streaming-first idea is not only about integrating “fast data,” but about transforming the way analytics architectures are built. It does this by enabling asynchronous analytics data pipelines that put an end to traditional layer thinking. 

Figure 2. A stream-only analytics architecture

Figure 2 illustrates what such a streaming-first approach can look like. Basically, every component consumes and produces data streams. For this, an event hub usually provides queues into which components can push their information and to which other components can subscribe in order to be informed as soon as there are new records. This concept makes it easy to integrate various systems and build arbitrary data pipelines. For instance, an event-based data pipeline can immediately cleanse and transfer orders coming from an ERP system into a data warehouse, which is faster than a batch job that runs periodically. Moreover, a real-time dashboard could hook into the queue with the cleansed order data and directly integrate and visualize this information.

Besides the flexibility and acceleration through immediate processing, such an approach also complements new technological concepts like microservices and serverless computing [5] that help to break complex monoliths into manageable parts that can be managed individually and run on demand.

However, this way of building data pipelines also comes with challenges. For instance, distributed systems issues such as exactly-once delivery, security, fault tolerance, and semantic dependencies (e.g., you receive order data before the related customer information) become more pressing. That said, there has been a lot of progress lately, and many streaming products have matured and already handle many of these issues very well.
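The order example can be sketched with a minimal in-process publish/subscribe hub. Everything here is an illustrative assumption: the `EventHub` class, the topic names `orders.raw` and `orders.cleansed`, and the cleansing rules. Delivery is also synchronous and in memory, whereas a real event hub would be asynchronous and durable.

```python
from collections import defaultdict


class EventHub:
    """Toy event hub: handlers subscribe to topics and are called on publish."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)


hub = EventHub()
warehouse = []               # stands in for the data warehouse
dashboard = {"revenue": 0}   # stands in for a real-time dashboard


def cleanse(order):
    """Drop invalid orders, normalize the rest, and republish them."""
    if order.get("amount", 0) <= 0:
        return
    cleaned = {
        "id": order["id"],
        "customer": order["customer"].strip().title(),
        "amount": order["amount"],
    }
    hub.publish("orders.cleansed", cleaned)


def update_dashboard(order):
    dashboard["revenue"] += order["amount"]


hub.subscribe("orders.raw", cleanse)
hub.subscribe("orders.cleansed", warehouse.append)
hub.subscribe("orders.cleansed", update_dashboard)

# Orders arriving from the ERP system are processed immediately.
hub.publish("orders.raw", {"id": 1, "customer": "  acme corp ", "amount": 120})
hub.publish("orders.raw", {"id": 2, "customer": "globex", "amount": 0})
```

Adding another consumer of cleansed orders takes a single `subscribe` call and leaves the ERP side and the cleansing step untouched.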
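Two of these issues can be illustrated on the consumer side. The sketch below assumes a hypothetical `OrderEnricher` that ignores duplicate deliveries by remembering order IDs and holds back orders until the related customer record has arrived. It is one common way to handle these cases, shown in memory only; a production system would need to bound and persist this state.

```python
class OrderEnricher:
    """Toy consumer that tolerates duplicate and out-of-order events."""

    def __init__(self):
        self.customers = {}  # customer_id -> customer record
        self.pending = {}    # customer_id -> orders waiting for that customer
        self.seen = set()    # order IDs already processed
        self.output = []     # enriched orders emitted downstream

    def on_customer(self, customer):
        self.customers[customer["id"]] = customer
        # Release any orders that arrived before this customer record.
        for order in self.pending.pop(customer["id"], []):
            self._emit(order)

    def on_order(self, order):
        if order["id"] in self.seen:
            return  # duplicate delivery, already handled
        self.seen.add(order["id"])
        if order["customer_id"] in self.customers:
            self._emit(order)
        else:
            self.pending.setdefault(order["customer_id"], []).append(order)

    def _emit(self, order):
        name = self.customers[order["customer_id"]]["name"]
        self.output.append({"order_id": order["id"], "customer": name})


enricher = OrderEnricher()
enricher.on_order({"id": "o1", "customer_id": "c1"})  # customer not known yet
enricher.on_order({"id": "o1", "customer_id": "c1"})  # redelivered duplicate
waiting = list(enricher.output)                       # nothing emitted so far
enricher.on_customer({"id": "c1", "name": "Alice"})   # releases the buffered order
enricher.on_order({"id": "o2", "customer_id": "c1"})  # emitted right away
```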

Conclusion

Stream processing is certainly not always necessary, and there are many situations where traditional batch-oriented concepts are perfectly fine. However, there is a shift towards streaming-oriented analytics architectures, which overcome many weaknesses regarding speed and flexibility that previously could only be handled with complex hybrid constructs. Moreover, streaming-first introduces a new mindset that helps to build manageable and scalable analytics architectures that enable entirely new analytics use cases.

Want to learn more? Stay tuned for our upcoming report “Streaming-First Architecture - Building the Real-Time Organization”.

Having the right data at the right time is essential to compete. In today’s fast-moving and data-driven world, this simple maxim is truer than ever. The upcoming report “Streaming-First Architecture - Building the Real-Time Organization” provides an introduction to what data streaming is about and why it might be relevant for your organization. It illustrates the evolution of data architectures and gives hands-on advice for implementing streaming-first architectures in real-world use cases.


Further reading

[1] Marz, Nathan: “Big Data: Principles and best practices of scalable realtime data systems”, 2015

[2] Petrie, Kevin; Potter, Dan; Ankorion, Itamar: “Streaming Change Data Capture”, 2018

[3] Kreps, Jay: “Questioning the Lambda Architecture”, 2014

[4] http://kappa-architecture.com

[5] Ereth, Julian: “Serverless Computing: The Next Step Towards Analytics in the Cloud?”, 2017

Julian Ereth

Julian Ereth is a researcher and practitioner in the field of business intelligence and data analytics.

In his role as researcher he focuses on new approaches in the area of big...
