Planning Your Data Architecture for Streaming Analytics
Preventive maintenance needs, cross-sell opportunities and potential fraud are just a few of the many business events that demand fast attention. Whether to capture revenue or minimize risk, we increasingly need to understand and act on data in real time.
More and more organizations are starting to experiment with streaming analytics: analyzing data as it is produced, before it is stored. This approach yields higher business value and is more efficient than the traditional practice of analyzing batches of historical data stored in a data warehouse or data lake.
But you cannot just buy and turn on streaming analytics at enterprise scale. No single vendor solution can answer all technical requirements, address all business use cases, and integrate with all existing systems.
To succeed, organizations need to design an architecture that addresses requirements for each component: data producers, change data capture and/or APIs, event brokers, stream processors, data stores and visualization tools. They need to evaluate whether commercial or open-source software best addresses their needs, and where to customize. By selecting multi-component commercial offerings, data teams can reduce the number of vendors they need to manage.
The following reference architecture illustrates key components that enable streaming analytics. (See figure 1.) Commercial solutions often bake in open source software like Kafka or Spark, but still provide support, debugging and training. Pure open source software, downloadable from Apache or other community sites, includes no commercial support. Inhouse developers also can customize components such as the stream processor with homegrown code.
Figure 1. Data Architecture for Streaming Analytics
Here is a look at key architectural components and planning considerations for each.
Data Producers. Whatever the industry, most enterprises have a plethora of data to analyze. Database and mainframe applications process sales, supply chain, CRM and other traditional enterprise records. Sensors track operational KPIs for all types of machinery and equipment. Clickstream and social media trends signal customer behavior and preferences. Some of this data, like traditional enterprise records, already exists, while IoT and clickstream/social media data might not yet be captured. But most organizations already capture far more data than they actually use. They need to integrate and analyze what they already have.
Change Data Capture. Change Data Capture (CDC) technology can be an ideal method of publishing data such as database records to the stream for eventual streaming analytics. CDC identifies and replicates changes from a source in real time. Organizations can minimize disruptions to the performance of production applications by using CDC solutions that identify changes by scanning transaction logs. Such CDC solutions have a lower impact than those that query the database itself or track changes with extra "shadow" tables. Data teams also can reduce the burden on ETL developers by using automated CDC solutions that minimize manual scripting.
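To make the idea concrete, here is a minimal sketch of the log-based approach: committed entries from a transaction log are turned into change events ready for publication to a stream. The log format and event shape are illustrative only, not any vendor's actual format.

```python
# Minimal sketch of log-based change data capture (CDC).
# The log entry format and event shape here are illustrative.

def log_to_change_events(transaction_log):
    """Turn committed log entries into change events for a stream."""
    events = []
    for entry in transaction_log:
        if not entry.get("committed"):
            continue  # replicate only committed changes
        events.append({
            "op": entry["op"],            # "insert", "update", or "delete"
            "table": entry["table"],
            "key": entry["key"],
            "after": entry.get("after"),  # row image after the change
            "ts": entry["ts"],
        })
    return events

log = [
    {"op": "insert", "table": "orders", "key": 101,
     "after": {"amount": 40.0}, "ts": 1, "committed": True},
    {"op": "update", "table": "orders", "key": 101,
     "after": {"amount": 45.0}, "ts": 2, "committed": False},
]

print(log_to_change_events(log))  # only the committed insert survives
```

Because the reader scans the log rather than querying production tables, the source database does no extra work beyond what it already does to stay durable.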
APIs. Extensive APIs are available from open source communities and vendors to connect to a wide range of data sources, including IoT sensor feeds, SaaS applications and NoSQL stores. Organizations can implement APIs without much labor or complexity.
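As one example of how lightweight this can be, a Kafka Connect source connector is typically defined with a short JSON configuration rather than custom code. The sketch below assumes a hypothetical Postgres database and table names; field values are illustrative.

```json
{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db-host:5432/sales",
    "mode": "incrementing",
    "incrementing.column.name": "order_id",
    "table.whitelist": "orders",
    "topic.prefix": "sales-"
  }
}
```

Posted to the Connect REST interface, a configuration like this streams new rows from the source table into a Kafka topic with no application changes.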
Event Broker. Here we get to the core value of streaming. Apache Kafka provides a broker that asynchronously collects, buffers, and disseminates data events (a.k.a. messages). These brokers enable many data producers and consumers (such as processors) to share individual streams of data (called topics) while still operating independently. Kafka has proven more scalable, higher-performing and more fault-tolerant than many commercial predecessors. This simple diagram illustrates an example of how brokers connect producers and processors via granular topic streams. (See Figure 2.)
Figure 2. Kafka Broker Architecture
(To review the advantages and workings of Kafka, see Assessing the Kafka Data Streaming Opportunity).
Kafka has prompted architecturally similar offerings from the big three cloud service providers: AWS, Azure and Google. Because each cloud provider's offering tends to favor its own tools, Kafka remains the most open and flexible choice for addressing current and new requirements.
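The key broker properties described above, an append-only topic log with consumers that each track their own read position, can be sketched in a few lines. This is a toy in-memory stand-in, not Kafka itself; class and topic names are illustrative.

```python
# Toy in-memory broker sketch illustrating Kafka-style topics:
# producers append to a topic log; each consumer tracks its own offset,
# so many consumers read the same stream independently.

class ToyBroker:
    def __init__(self):
        self.topics = {}   # topic name -> append-only list of messages
        self.offsets = {}  # (topic, consumer) -> next offset to read

    def publish(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

    def poll(self, topic, consumer):
        """Return unread messages for this consumer and advance its offset."""
        log = self.topics.get(topic, [])
        start = self.offsets.get((topic, consumer), 0)
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]

broker = ToyBroker()
broker.publish("clicks", {"user": "a", "page": "/home"})
broker.publish("clicks", {"user": "b", "page": "/cart"})

print(broker.poll("clicks", "fraud-detector"))  # both messages
print(broker.poll("clicks", "fraud-detector"))  # nothing new
print(broker.poll("clicks", "dashboard"))       # independent offset: both again
```

The last two calls show why this decoupling matters: the fraud detector and the dashboard consume the same topic at their own pace without coordinating with each other or with the producer.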
Stream Processor. Once Kafka or another broker publishes data to a topic stream, that data can be processed for analytics. A variety of decision factors come into play at this stage. First, what type of stream processor should be used? Spark Streaming, one of the most common Apache open source options, is integrated and supported in many commercial offerings.
For larger and more sophisticated streaming architectures, component modularity and interoperability become critical requirements. This is less of a challenge with data collectors and brokers: there are readily available commercial solutions and API libraries to collect data from all major sources and integrate with Kafka. When it comes to processing, however, things get tricky as soon as you start to customize code. You will need inhouse developers to build, test and integrate components, ensuring the data flow can continue and scale reliably. Organizations will want to standardize code as much as possible across lines of business and use cases.
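The kind of stateful work a stream processor performs, such as a windowed aggregation, can be illustrated with a short sketch. This is plain Python standing in for an engine like Spark Streaming; the window size and event shape are illustrative.

```python
# Sketch of a tumbling-window aggregation, a typical stateful
# stream-processing operation: count events per key within fixed,
# non-overlapping time windows.

from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count (timestamp, key) events per key per fixed window."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "card-123"), (42, "card-123"), (61, "card-123"), (70, "card-456")]
print(tumbling_window_counts(events))
# {(0, 'card-123'): 2, (60, 'card-123'): 1, (60, 'card-456'): 1}
```

A real engine adds what this sketch omits, and it is exactly the hard part: distributing the state, handling late-arriving events, and recovering after failures. That gap is why customized processing code demands in-house development and testing effort.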
Data Store. Often, streaming analytics practitioners enrich their results by correlating real-time data with historical records from a data store such as a NoSQL database or a cloud-native data lake. Whether these repositories receive their feeds from a broker like Kafka or a batch ETL tool, they often contain data useful to streaming analytics. Analysts and data scientists can answer new questions. Is a retail store purchase unusual for that customer? Does the merchant processing a credit card transaction have a suspicious history? Typically, data teams can take advantage of storage repositories they already have. They can use tools like Elasticsearch to pull relevant historical data for correlation with real-time data.
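The enrichment pattern looks roughly like this: a real-time event is joined against a historical lookup before the decision is made. Here a plain dict stands in for a store such as Elasticsearch; the field names and the "unusual" threshold are illustrative assumptions.

```python
# Sketch of enriching a real-time event with historical context.
# A dict stands in for a historical store such as Elasticsearch.

historical_avg_purchase = {"cust-42": 35.0, "cust-77": 600.0}

def enrich(event, history, unusual_factor=3.0):
    """Flag a purchase that is far above the customer's historical average."""
    avg = history.get(event["customer"])
    enriched = dict(event)
    enriched["unusual"] = avg is not None and event["amount"] > unusual_factor * avg
    return enriched

# A $250 purchase from a customer who averages $35 gets flagged.
print(enrich({"customer": "cust-42", "amount": 250.0}, historical_avg_purchase))
```

In production the lookup would be a query against the existing repository rather than an in-memory dict, but the shape of the correlation is the same.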
Visualization. Visualization solutions generate real-time dashboards and charts by analyzing data with built-in algorithms or optional machine learning models. These graphical solutions can help organizations democratize data and increase data literacy by delivering intuitive graphical results to business-oriented users. Where data teams have the need and in-house expertise, they also can create custom visualizations with either open-source or commercial software. Generally, larger, more mature organizations or younger, Kafka-savvy ventures are most likely to have these in-house skills.
Organizations need to weigh classic tradeoffs of simplicity and customization with streaming analytics. Here are basic guidelines for IT leaders and data teams to follow, based on the complexity of the use case, organizational maturity, and available skills.
Phase One: Start simple. Data teams should select commercial offerings whenever the functionality is sufficient to meet requirements and address basic business-driven use cases. Commercially supported offerings, including those involving open source code, will help minimize inhouse developer work, maintenance, troubleshooting, and the risk of downtime. This works for basic, often departmental use cases that span fewer types of components and users. Organizations can further simplify things and reduce work by selecting multiple offerings from a single vendor.
Phase Two: Innovate and customize. As organizations seek to address increasingly sophisticated use cases, they will need to consider specialized open source components and potentially customize with homegrown code. They also will need to try new types of commercial producers, processors and visualization products. Organizations should carefully scope the staff, skills, and training required for these more sophisticated initiatives. Some organizations, especially younger ventures with a strong cloud presence, already have Kafka-savvy developers and therefore are able to customize out of the gate. They also might be able to use the very latest open-source software without waiting on a fully supported release from a commercial vendor.
The following table summarizes example offerings, both open source and commercial. (See Table 1.) Organizations choose to customize stream processors and visualization tools more frequently than other components, leveraging open source offerings in particular.
Table 1. Example Offerings for Streaming Analytics
Change Data Capture: Attunity Replicate (Qlik), Oracle GoldenGate, Striim
APIs/Connectors: Tibco APIs, Confluent Kafka Connect
Event Broker: Confluent Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub
Stream Processor (commercial): Tibco Streaming, Amazon Kinesis Data Analytics, Azure Stream Analytics, Confluent KSQL, Striim
Stream Processor (open source): Spark*, Flink*, Samza*
Data Store (cloud): Amazon S3, Azure Data Lake Storage, Google Dataproc
Data Store (open source): Cassandra*, HBase*, MongoDB
Visualization: Tibco Spotfire, Microsoft Power BI, Striim
*Apache Software Foundation project
Another critical dimension to consider is the role of machine learning (ML). ML packages, available through open source libraries or from vendors, can be used to create processors and test them on historical training data before being applied to a stream. Organizations also can use ML to have processors continuously learn and adjust themselves while analyzing data streams. ML can offer great advantages, provided organizations have sufficient in-house expertise and data quality to yield reliable results. But architects and developers should tread carefully so as not to automate bad real-time decisions.
Bringing It All Together
The most effective streaming analytics initiatives will start simple, using commercial solutions to address narrowly defined use cases. The broader-scope, higher-value projects will require careful assessment of both in-house developer capabilities and specialized open-source options. To succeed with any streaming analytics initiative, you need to plan the supporting data architecture end to end.