Assessing the Kafka Data Streaming Opportunity
Many enterprises are adopting Apache Kafka to capture and process higher volumes of data faster and thereby realize more value from business events. At last week’s Strata conference in New York, Kafka users and enthusiasts were out in force. While “Hadoop” is out of fashion, appearing zero times on the agenda, Kafka is still the belle of the ball, featuring prominently in sessions and booth promotions.
Numerous conversations I had with data engineers at Strata reinforced that, despite assertions by some that Kafka can become a central, persistent data store, this open source streaming system really serves a supporting role to other platforms. It enables multi-zone data pipelines by streaming data between production sources, data lakes, NoSQL stores, and various processing engines. A data manager at a Fortune 500 healthcare diagnostics firm described to me how Kafka streams patient and audit records from many sources to many users across his firm. Like most practitioners I spoke with, he is happy with Kafka.
Here is a look at why Kafka matters, how it works, and where it should be used. The principles generally apply to vendor-driven variants such as Amazon Kinesis and Azure Event Hubs, although those tend to be less scalable and focused on narrower sets of use cases. Kafka is by far the most widely adopted.
Why Kafka Matters
First, let’s consider the pain Kafka addresses. IT is under pressure to address several critical requirements in order to serve business clients:
- Real-time event response: Business events increasingly demand immediate attention.
- Data distribution: Many users need to address many use cases on many platforms.
- Asynchronous communication: We need to capture data as it is created, but enable applications to consume it at their own pace.
- Parallel consumption: Multiple parties often need copies of the same data for different uses.
- Service modularity: A key principle of microservices architectures is that one service does not depend on another, for data sharing or anything else.
With these pain points in mind, we can easily see the value of a highly distributed, highly scalable streaming platform that asynchronously collects, buffers, and disseminates data records (a.k.a. messages). The producers and consumers of these records can be independent, working with an intermediate broker rather than with one another. And multiple consumers can each process their own record streams as they need them.
This is the opportunity that Jay Kreps and his colleagues Neha Narkhede and Jun Rao at LinkedIn sought to address when they created and open-sourced Apache Kafka earlier this decade. (They went on to found Confluent in 2014 to provide software and services for Kafka platforms.)
How It Works
As with most great technologies, the messaging concept was hardly new. IBM developed message switching and routing systems for the financial services industry in the 1960s. Its MQSeries (now IBM MQ), first released in 1992, provides heterogeneous messaging for service-oriented architectures. More recent message brokers include Apache ActiveMQ and RabbitMQ, which leverage the Java Message Service (JMS) API and the Advanced Message Queuing Protocol (AMQP) standard.
But these systems lacked the necessary scale, performance, and fault tolerance for modern environments such as LinkedIn, whose platform shares billions of updates among millions of users every day. In 2010, Jay and his team specifically sought to monitor user activity, resource consumption, and performance, and to correlate those metrics. And the Kafka system they developed and open sourced, which surpassed 1 PB of consumption per day at LinkedIn by 2015, had much broader applications.
In a nutshell, Kafka “producers” publish data records to brokers that persist those records to file systems on disk until they are read, either in real time or later, by “consumers.” Records are divided into topics (a.k.a. streams) to which consumers can subscribe for selected use. Topics also can be partitioned to improve throughput via parallel consumption and to enable redundancy via partition replicas across multiple clustered brokers. Here is a sample architecture for one topic. While this shows multiple producers, one topic often maps to a single producer.
[Figure: Kafka architecture]
So we have a highly scalable cluster that can serve many applications across large enterprise environments. Brokers store records for configurable time periods, including forever.
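For readers who want to see the moving parts in code, here is a minimal sketch using the standard Java client. The broker address, topic name, and consumer group id are placeholders for the example, not details from any deployment described above.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer: publish one record to a topic; the broker persists it to disk.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("patient-records", "record-key", "record-value"));
        }

        // Consumer: subscribe to the topic and read records at its own pace.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "audit-service");            // placeholder consumer group
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("auto.offset.reset", "earliest");        // start from the oldest retained record
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("patient-records"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```

Retention itself is a per-topic setting (retention.ms); a value of -1 keeps records indefinitely, which is the “forever” option noted above.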
Key Components of Kafka
| Component | Role |
| --- | --- |
| Producer | Publishes data records to one or more brokers |
| Broker | Persists records to file systems on disk and serves them to consumers; brokers run in clusters |
| Topic | A named stream of records to which consumers subscribe |
| Partition | A subdivision of a topic that enables parallel consumption and replication across brokers |
| Consumer | Subscribes to topics and reads records, in real time or later |
Where To Use It
Kafka addresses a wide range of use cases. Here is a summary of the most common applications, listed in descending order of frequency and suitability. A given Kafka implementation often addresses more than one use case.
Streaming ingestion: Kafka has become a very popular and effective method of ingesting data in real time from databases, social media systems, IoT devices, and other sources into consumers such as data lakes. Kafka’s scalability and high performance make it an ideal ingestion platform.
Record broker: Data flows can quickly become very complicated in today’s enterprise environments, with many consumers contending for overlapping subsets of data from the same set of sources. Kafka is an ideal “traffic cop” (a term I took from Claudia Imhoff), as topics enable granular, high-scale sorting of records among all these endpoints.
Machine learning: Kafka helps transform data (cleansing, formatting, sampling, etc.) for machine learning by feeding processing engines such as Spark Streaming, KSQL, and Apex. Here again, the scalability and high performance are critical advantages, as is Kafka’s support for exactly-once processing semantics.
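As a hedged illustration of that hand-off, the sketch below uses Spark Structured Streaming’s Kafka source to read a topic as a streaming DataFrame for downstream feature preparation. The broker address, topic name, and job name are invented for the example, and the spark-sql-kafka connector package is assumed to be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToSpark {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-feature-prep")   // hypothetical job name
                .getOrCreate();

        // Read the Kafka topic as an unbounded DataFrame of key/value pairs.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
                .option("subscribe", "sensor-readings")               // placeholder topic
                .load()
                .selectExpr("CAST(value AS STRING) AS raw_event");

        // Cleansing, formatting, and sampling steps would go here before
        // handing features to a training or scoring job.
        StreamingQuery query = events.writeStream()
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```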
Microservices enablement: Microservices must be isolated and autonomous, each serving a single purpose in modular and often large environments. (Picture the dozens of individual services that a commercial bank makes available to its online customers.) Kafka brokers enable all those services to share data on demand, in real time when necessary, while remaining completely independent of one another.
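To make that independence concrete, here is a small sketch with hypothetical service and topic names: two services subscribe to the same topic under different consumer group ids, so each receives the full stream and tracks its own offsets without coordinating with the other.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IndependentServices {
    // Build a consumer for a given service; each distinct group.id receives its own
    // copy of the topic's records and manages its own offsets.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("account-events"));  // placeholder topic
        return consumer;
    }

    public static void main(String[] args) {
        // Hypothetical payments and notifications services read the same events independently.
        KafkaConsumer<String, String> payments = consumerFor("payments-service");
        KafkaConsumer<String, String> notifications = consumerFor("notifications-service");
        // ... each service polls and processes at its own pace ...
        payments.close();
        notifications.close();
    }
}
```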
Event stream processing: Events such as online transactions, stock trades, website traffic spikes, and mechanical equipment signals often require real-time scoring and alerting. For example, each transaction can be rated for its fraud risk, and each IoT sensor KPI can feed into preventive maintenance programs. Kafka assists these applications by integrating data points at near-zero latency and very high scale.
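A minimal Kafka Streams sketch of this pattern, assuming hypothetical topic names and a stand-in rule in place of a real fraud model: flagged transactions are filtered from the input stream and published to an alerts topic in near real time.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FraudAlerting {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-alerting");     // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw transactions, keep only those the (assumed) rule flags as risky,
        // and publish the flagged records to an alerts topic.
        KStream<String, String> transactions = builder.stream("transactions");
        transactions
                .filter((accountId, txnJson) -> looksRisky(txnJson))
                .to("fraud-alerts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Stand-in scoring rule for the sketch; a real deployment would call a model.
    static boolean looksRisky(String txnJson) {
        return txnJson.contains("\"country_mismatch\":true");
    }
}
```

Kafka Streams applications can also opt into exactly-once processing via the processing.guarantee setting, which ties back to the advantage noted under machine learning above.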
Data integration: Kafka can work with KSQL, Spark, and other stream processors to accelerate Extract, Transform, and Load (ETL) processes. The advantage here is that you can perform ETL and integrate application data on the same system rather than separately.
Video/image processing: Kafka also can buffer and persist video or image streams, enabling applications to consume those streams based on CPU and memory availability to avoid data loss.
Persistent data storage: Because Kafka can persist records indefinitely, there have been suggestions that it can become a new data store in its own right. Today, this seems a stretch because an unbounded record stream presents new challenges in terms of governance and security.
Perhaps the best endorsement of the Kafka concept is its primary open source imitator, Apache Pulsar, whose creators have launched Streamlio. Streamlio has packaged Pulsar, Apache Heron for analytics, and Apache BookKeeper for storage into a solution it positions as higher throughput, lower latency (see recent GigaOm and Streamlio tests), easier to manage, and more fault-tolerant than Kafka. For example, BookKeeper reduces data loss risk by writing record data to both storage and memory on multiple nodes, so data can be recovered in the event of power, hardware, or partition failures.
For now, Kafka has the time-to-market advantage and interoperates with a wide ecosystem of APIs and processing engines. It merits close consideration as the #1 enterprise data “traffic cop.”