Applying the ACID Test to Apache Kafka
Trips to San Francisco, my home for six years before house prices forced a move east, are always exciting. The ocean views, great Indian food, and memories of surfing make me nostalgic for married life before children. Even more stimulating is the buzz, urgency, and hyperbole about the latest hot technology.
On this trip, Confluent’s Kafka Summit was the source of excitement. As I wrote on Eckerson.com last month, the ultra-low latency, highly scalable, distributed Apache Kafka data streaming platform has ushered in a new era of real-time data integration, processing, and analytics. With Kafka, enterprises can address new advanced analytics use cases and extract more value from more data. They are implementing Kafka and streaming alternatives, such as Amazon Kinesis and Azure Event Hub, to enable data lake streaming ingestion, complex message queues with many Big Data endpoints, microservices data-sharing, and pre-processing for machine learning.
All good stuff. But does Kafka displace data architecture pillars, such as the database?
Martin Kleppmann, Researcher at the University of Cambridge, posed what he termed a “slightly provocative question” in his session last week, “Is Kafka a Database?” Kleppmann narrowed in on one defining feature of a database, namely its ability to support ACID-compliant datasets.
ACID, as many readers know, refers to four database properties – Atomicity, Consistency, Isolation and Durability – that are generally considered to be essential to transactional data validity. ACID databases maintain data integrity through errors, power outages or other component failures. Considered another way, a valid transaction requires that all its associated DB operations are ACID compliant. ACID has been the gold standard for databases for 35 years.
During a thoughtful 25-minute session, Kleppmann logically applied the ACID test to Kafka. I recommend watching the video. Here is a summary of his thinking.
Durability
Kleppmann started here because durability seems the simplest to address. Durable data remains available and committed even when systems crash, software fails, etc. Typically, this means the data and one or more copies are written to non-volatile memory such as a disk. Kafka brokers arguably meet this requirement by persisting records, typically in multiple copies across replicated partitions, to a disk-based file system. As with other data systems, those records also can be backed up to a remote location for additional durability in the event of a disaster.
Atomicity
An atomic database transaction operates as a single unit, either succeeding or failing completely. What does this mean? A transaction can never be partially committed. If all its writes are not fully completed, then everything must be rolled back. Kafka satisfies this property, Kleppmann reasons, because producers write each record to an immutable log in an all-or-nothing fashion. Database, cache and search indices all independently consume that record from the log, without interfering with one another. As another example, a transaction that includes a debit from one account and credit to another accountc can be processed atomically. This is accomplished by using a stream processing algorithm that emits these two inter-linked events, the credit, and the debit, for coordinated processing on the Kafka platform.
Isolation
Transactions are often executed concurrently, for example, being written to and read from multiple locations at the same time. This creates the need for isolation, or ensuring that concurrently executed transactions reflect the right sequence. Kafka can achieve serialization, the highest level of isolation, by building transaction logic into a stream processor that ensures transactions do not overlap or otherwise impact one another’s sequence of records. Serializable isolation effectively means transactions can behave as if the database is all theirs, with no interfering activity.
Consistency
The final ACID test for Kafka is consistency, which means that transactions are valid according to defined constraints and other rules. For example, a rule might specify that a username must be unique. Once again Kafka relies on streaming processing algorithms to clear this hurdle, for example to check the validity of any transaction request.
The Metamorphosis of Kafka
So with a lot of help from stream-processing algorithms, Kleppmann concludes Kafka can indeed be ACID compliant. This gives new confidence in the integrity of Kafka-managed data and effectiveness of downstream usage. However, while Kafka can persist data indefinitely, very few architects or developers expect Kafka to replace databases anytime soon. Indeed, Kleppmann suggested that ad-hoc queries are best left to the database and data warehouse realm, and stopped short of definitively stating that Kafka is in fact a database.
Kleppmann’s thesis relates to a larger tendency to view Kafka as much more than a data streaming platform.
Confluent CEO Jay Kreps suggests Kafka brings event processing to its rightful position as the foundation of data architectures.
A business, in fact, is essentially a series of events and reactions to those events. Data warehouses are based on fact tables, and facts are events, which makes the data warehouse “a pretty slow event stream” in contrast to Kafka. Here Kreps is following a longstanding Silicon Valley tradition of positioning new technology as a disrupter of the established order.
There is merit to the argument. As your ACID-compliant, scalable, ultra-low latency data streaming platform, Kafka can serve as a central enterprise enabler of microservices, event monitoring/analytics and real-time applications of all types.
Indeed, enterprises are adopting Kafka to supply data streams for use cases, such as streaming data lake ingestion, message queueing, machine learning pre-processing and microservices enablement. Some data-integration products automatically publish production database transactions to Kafka record streams to address these use cases. Kafka has assumed a critical role in these organizations’ data architectures.
Takeaways for Practitioners
For the foreseeable future, data streaming practitioners – including architects, developers and their managers – do not expect Kafka to supplant the database or data warehouse anytime soon. Instead, Kafka will play a critical role as a real-time canal system, moving data between platforms and across pipelines in today’s increasingly heterogeneous environments. You can configure one database producer to send topic streams to dozens of different consumers, ranging from Spark-driven data lakes to microservices platforms to various NoSQL repositories. In most cases those consumers are best equipped to manage the analytics.
So Kafka is a real-time canal system for data, and a darned good one, but not a database. If the Kafka Summit buzz is any indication, Kafka canals will be a compelling option for many organizations to invest in for years to come.
Kevin is a contributing analyst at Eckerson Group as well as Senior Director of Product Marketing at Attunity. To learn more about the role of CDC in Kafka, data lake and cloud environments, check out the book Kevin co-authored, Streaming Change Data Capture: A Foundation for Modern Data Architectures, O’Reilly 2018.