Big Data Part II: Don't Bash the Data Warehouse

This is the second in a five-part series on the impact of Hadoop on business intelligence.

Last year, we could write off Hadoop as a giant, low-cost data processing pump for refining multi-structured data and delivering it to the data warehouse. No more. Hadoop 2, and in particular the Apache YARN project, changes everything.

Released in October 2013, Hadoop 2 turns the open source data management platform into a multipurpose operating system for big data. Rather than supporting just one type of data processing, Hadoop 2 supports any data processing application written to the YARN interface. As such, Hadoop 2 can support not only batch processing (i.e., MapReduce) but also real-time queries, search, in-memory computing, and whatever else anyone dreams up and writes to YARN.

The upshot is revolutionary: rather than move data to specialized applications and systems for processing, companies can process the data in Hadoop without moving it.

This message was trumpeted recently at an analyst day hosted by Cloudera, the first vendor to commercialize Hadoop services. In his opening remarks, Cloudera CEO Tom Reilly said that Hadoop 2 will change how companies architect analytic systems: “Rather than move data to compute resources, companies will move compute resources to data, saving enormous amounts of time and money.”
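To make the "move compute to data" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop or YARN code: the node names, the sensor records, and both function names are hypothetical stand-ins, and "items shipped" is only a rough proxy for network traffic. It compares shipping every record to a central system with running the aggregation where the data lives and shipping back only the partial results.

```python
from collections import Counter

# Hypothetical cluster: three nodes, each holding 1,000 (sensor_id, value)
# records. All names and data are invented for illustration.
partitions = {
    f"node-{n}": [(f"sensor-{i % 3}", i + n) for i in range(1000)]
    for n in range(3)
}

def move_data_to_compute(partitions):
    """Ship every record to a central system, then aggregate there."""
    shipped = [rec for records in partitions.values() for rec in records]
    totals = Counter()
    for key, value in shipped:
        totals[key] += value
    return dict(totals), len(shipped)

def move_compute_to_data(partitions):
    """Aggregate on each node; ship only one partial sum per key per node."""
    totals = Counter()
    items_shipped = 0
    for records in partitions.values():
        local = Counter()
        for key, value in records:
            local[key] += value
        items_shipped += len(local)
        totals.update(local)  # Counter.update adds the partial sums
    return dict(totals), items_shipped

central_totals, central_shipped = move_data_to_compute(partitions)
local_totals, local_shipped = move_compute_to_data(partitions)
print(central_totals == local_totals, central_shipped, local_shipped)
```

Both approaches produce the same totals, but in this toy setup the second one moves a handful of partial sums instead of every record, which is the saving Reilly is pointing to.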

The Data Lake

This new approach gives rise to the notion of a data lake, in which Hadoop not only stores all the data but processes it as well. Cloudera is one of the first companies to commercialize the concept of a data lake, which it calls an Enterprise Data Hub (EDH). With an annual subscription, Cloudera EDH customers can access premium components (or data processing engines), including batch processing (MapReduce), analytic SQL (Impala), search (Solr), machine learning (Spark), stream processing (Spark Streaming), and operational processing (HBase), with a raft of third-party applications on the way.

Converged Applications. The data lake spawns a new breed of “converged applications,” according to Reilly, that deliver enormous business value. For instance, a company can use Spark Streaming to stream data from a sensor network into an in-memory engine (Spark), where it is analyzed and turned into a model that gets embedded in a high-volume Web application (HBase). All the while, the data never leaves Hadoop, which greatly simplifies data processing and reduces costs.
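The shape of such a converged application can be sketched in a few lines of plain Python. This is an illustration only and uses no Spark or HBase APIs: `sensor_stream`, `fit_model`, `model_store`, and `is_anomalous` are hypothetical stand-ins for the streaming engine, the in-memory analytics step, the operational store, and the Web application's lookup, and the mean-plus-three-standard-deviations "model" is an assumption chosen for simplicity.

```python
from statistics import mean, pstdev

def sensor_stream():
    """Stand-in for a stream-processing engine: yields micro-batches."""
    yield [20.1, 19.8, 20.3, 20.0]
    yield [19.9, 20.2, 20.1, 19.7]
    yield [20.0, 20.4, 19.6, 20.1]

def fit_model(readings):
    """Stand-in for in-memory analytics: a simple threshold model."""
    mu, sigma = mean(readings), pstdev(readings)
    return {"low": mu - 3 * sigma, "high": mu + 3 * sigma}

# Stand-in for the operational store that serves the Web application.
model_store = {}

def run_pipeline():
    """Consume each micro-batch and refresh the stored model."""
    seen = []
    for batch in sensor_stream():
        seen.extend(batch)
        model_store["temperature"] = fit_model(seen)

def is_anomalous(reading):
    """What the Web application would call on each request."""
    model = model_store["temperature"]
    return not (model["low"] <= reading <= model["high"])

run_pipeline()
print(is_anomalous(20.0), is_anomalous(35.0))
```

The point of the sketch is that ingestion, modeling, and serving all read and write the same store; in the scenario Reilly describes, that shared store is Hadoop itself.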

“EDH enables customers to build new types of applications that weren’t feasible or cost-effective before,” says Reilly.

Although many claim that Hadoop is not yet ready to support enterprise-caliber, production applications, Cloudera says demand for EDH is high. In fact, it sold eight subscriptions within six weeks at the end of the first quarter in which EDH was commercially available. Clearly, some leading-edge companies are jumping headfirst onto the Hadoop bandwagon, clearing the trail for everyone else.

Evolving into Hadoop

However, most companies are adopting Hadoop gradually, says Amr Awadallah, co-founder and CTO of Cloudera. Their initial motivation in adopting Hadoop is to improve operational efficiency. They want to reduce the cost of storing large volumes of data, accelerate ETL processes that are being squeezed by shrinking batch windows, or optimize a data warehouse by offloading ETL workloads or moving unused data to archival storage.

After organizations squeeze the cost-efficiencies from their data architectures, Awadallah says they implement Hadoop strategically to deliver greater business value. At first, they use Hadoop to give business analysts, data scientists, and lines of business quicker access to data so they can solve pressing business problems. Rather than wait for the IT department to move data from Hadoop into the data warehouse or other downstream systems, business users query data directly in Hadoop using SQL-like data access and analytics tools.

Once business users are comfortable accessing data directly in Hadoop, organizations typically consolidate Hadoop clusters into a data lake and implement YARN-compliant engines so they can build converged applications, as described above, that deliver outsized competitive advantage.

Stages of grief. David McJannet, vice president of marketing at Hortonworks, Cloudera’s closest rival, reinforces Awadallah’s depiction of the Hadoop journey. He says most companies go through several “stages of grief” from denial to acceptance when confronted with the fact that Hadoop storage is 30 to 50 times cheaper than traditional systems.

But rather than take a bold leap into the unknown with a startup company, McJannet says Hortonworks customers usually recruit a trusted partner from the commercial world to help them navigate the new terrain and blend the new world with the old. This evolutionary approach to implementing Hadoop is the centerpiece of Hortonworks’ strategy. (See “The Battle for the Future of Hadoop.”)

McJannet also says that Hortonworks customers typically implement Hadoop to support new applications with multi-structured data, not to achieve operational efficiencies. “About 70% of our deals are for net new applications and 30% focus on data warehousing optimization,” he says.

Summary

Whatever the starting or ending point, it’s clear that Hadoop is shaking up the data management and analytics marketplace. In fact, the Hadoop vendors I visited last week, including Cloudera, Hortonworks, and MapR, all said they have experienced a rapid increase in the number of inquiries and deals in the past six to nine months.

If these claims are true, then customers are quickly moving beyond the “tire-kicking” stage and into production with Hadoop, and 2014 could be the year in which Hadoop goes mainstream. And this is even before most customers have implemented Hadoop 2.


Read the next article in this five-part series: "Hadoop 2.0 Changes Everything"

Wayne Eckerson

Wayne Eckerson is an internationally recognized thought leader in the business intelligence and analytics field. He is a sought-after consultant and noted speaker who thinks critically, writes clearly and presents...
