Streaming, Spark, and Governance Top Themes at Strata+Hadoop World NYC

The east coast confab for big data, otherwise known as Strata+Hadoop World in New York City, was abuzz with the digital literati, who were treated to hundreds of sessions and exhibitors. Out of the noise and excitement, three major trends emerged: streaming, Spark, and big data governance.

Streaming

Big data is quickly morphing into the internet of things (IoT). Organizations are busy attaching sensors to everything. To avoid drowning in a tsunami of machine-generated data, companies are turning to streaming and edge analytics to process and analyze data in real time.

At the event, it was clear that the so-called Lambda Architecture, which seeks to blend streaming/real-time and batch data processing into a single query and analytics environment, is gaining momentum. The market is getting crowded, both with open source streaming data platforms (including Storm, Spark Streaming and Flink) and with the message processing platforms beneath them (including Kafka, RabbitMQ and a host of proprietary on-premises and cloud-based solutions). Perhaps by next year, there will be an industry shakeout and we’ll have a clearer picture of how to manage IoT data.
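To make the Lambda Architecture idea concrete, here is a minimal sketch in plain Python. It assumes a toy workload (counting sensor events) and uses hypothetical names (`build_batch_view`, `SpeedLayer`, `query`) that do not come from any of the platforms mentioned above. A real deployment would put a batch engine and a streaming engine behind these roles.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer: recompute totals from the full historical event log."""
    return Counter(event["sensor"] for event in master_events)


class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event["sensor"]] += 1


def query(batch_view, realtime_view, sensor):
    """Serving layer: merge the batch and real-time views to answer one query."""
    return batch_view[sensor] + realtime_view[sensor]


# Hypothetical sample data: three historical events, one fresh event.
historical = [{"sensor": "a"}, {"sensor": "a"}, {"sensor": "b"}]
batch_view = build_batch_view(historical)

speed = SpeedLayer()
speed.ingest({"sensor": "a"})

print(query(batch_view, speed.realtime_view, "a"))  # prints 3
```

The point of the sketch is the merge in `query`: the caller sees a single answer and does not need to know which layer contributed which part of it.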

Spark

The big data industry has moved rapidly to meet enterprise data processing requirements. Five years ago, everyone was satisfied with batch processing using MapReduce. But soon, the big data airwaves became filled with interactive SQL technologies, led by Cloudera’s Impala. Last week at the Javits Center, all eyes were on Spark, the in-memory open source computing environment that promises to make everyone’s big data dreams come true.

The big question with Spark is whether it will run predominantly on Hadoop or in a standalone mode. Cloudera sees Spark and Hadoop as inseparable partners, and it's hard at work making that partnership stronger technologically. Databricks, the company whose founders created Spark, sees the technology's independent identity as the more important one. In fact, Databricks last week released the results of a survey that, according to the company, show the number of standalone Spark clusters has exceeded the number of Hadoop-based clusters running Spark.

Data Governance

Finally, this year, we’re beginning to see the words “big data” and “governance” used in the same sentence. Data lineage, fine-grained security, data quality management, metadata management, and the ability to audit administration of these features as well as general query and data manipulation activity have become priorities for vendors and buyers alike.

The emphasis on enterprise governance has fueled numerous startups. Podium and Zaloni are peddling good, old-fashioned, end-to-end data management tools that govern the flow of data from source to target in a big data environment. These new so-called data pipelining tools ingest, validate, clean, transform, merge, secure, and format data for analytical purposes. Most support batch, near real-time (i.e., micro-batch), and real-time (i.e., streaming) updates, as well as change data capture (CDC) to load just deltas and minimize data volumes. Most importantly, both vendors' tools collect metadata every step of the way, so both technical and business users can track data lineage, understand how and when data was transformed, and evaluate the impact of any changes on downstream applications.
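As a rough illustration of change data capture combined with lineage metadata, the sketch below applies only the deltas to a target table and logs one lineage record per change. It is not how Podium or Zaloni work internally. The function `apply_changes`, the change-record layout, and the source name `crm_db` are all hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone


def apply_changes(target, changes, lineage, source_name):
    """Apply deltas (insert, update, delete) to target and log lineage per change."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op == "delete":
            target.pop(key, None)
        elif op in ("insert", "update"):
            target[key] = change["row"]
        else:
            raise ValueError(f"unknown op: {op}")
        # Lineage metadata: where the change came from, what it did, and when.
        lineage.append({
            "source": source_name,
            "op": op,
            "key": key,
            "applied_at": datetime.now(timezone.utc).isoformat(),
        })
    return target


# Hypothetical target table keyed by customer id, plus three captured changes.
target = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
changes = [
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "insert", "key": 3, "row": {"name": "Cy"}},
    {"op": "delete", "key": 2},
]
lineage = []
apply_changes(target, changes, lineage, source_name="crm_db")
```

Only three change records move through the pipeline instead of a full reload of the table, and the `lineage` list records how and when each row was touched, which is the information needed to assess downstream impact.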

Given the pace of innovation in the big data industry, it’s hard to imagine what we will be talking about twelve months from now. But it’s safe to say that there will be a bewildering array of new technologies and products to see and evaluate.

Wayne Eckerson

Wayne Eckerson is an internationally recognized thought leader in the business intelligence and analytics field. He is a sought-after consultant and noted speaker who thinks critically, writes clearly and presents...
