Big Data Part V: Hadoop: The Next Spreadmart?

This is the fifth in a multi-part series on the impact of Hadoop on business intelligence.

For decades, data analysts of all types have used low-cost, self-service analytical tools to access and manipulate data, identify trends and anomalies, and present insights. Although the types of tools have changed over the years, the result is almost always the same: a spreadmart or data shadow system built with unique rules, metrics, and definitions.

Most companies have tens of thousands of spreadmarts, each answering important local questions at a point in time. While invaluable to individual business units, spreadmarts drive CEOs and CFOs crazy. They ask a simple question, such as “How many customers do we have?” and get conflicting answers from spreadmart-touting line analysts and their business unit heads. Like Helen of Troy, whose face “launched a thousand ships,” the spreadmart phenomenon has caused thousands of executives to launch data warehousing initiatives to restore data consistency and enterprise order.

Spreadmart Tools

Spreadmart tools vary by type of analyst. For example, data analysts in finance, marketing, and sales rely primarily on Microsoft Excel and Access, earning them the nickname “Excel jockeys.” But recently, many have shifted to visual discovery tools from vendors such as Tableau and Tibco that replace pivot tables with interactive visualizations.

Another type of analyst—the statistician or data miner—has long opted for SAS and SPSS, which are more complex and pricier than desktop analyst tools but still better than the alternative: waiting for the information technology (IT) team to create a suitable data environment for data mining.

Likewise, business units, frustrated by the slow pace of corporate IT development and its perpetual backlog, hire their own technologists or local system integrators to build independent data marts using low-cost, self-service BI and data management tools, such as QlikView and Microsoft SQL Server.

Hadoop

Today, there is a new spreadmart on the block and it’s called Hadoop. The software is free, the hardware cheap, and analysts don’t have to know SQL or data modeling to use it. Analysts can dump the data into Hadoop and then use a high-level language, such as Hive or Pig, or a Hadoop-compliant BI or ETL tool to access, manipulate and analyze the data. And although there are many reasons to implement Hadoop, a primary one is to foster self-service data analysis without IT intervention. As a result, Hadoop is fast becoming the spreadmart of choice for sophisticated analysts and department heads.
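To make that workflow concrete, here is a minimal sketch of the kind of HiveQL an analyst might run against raw files dropped into Hadoop. The table, columns, and HDFS path (orders, order_date, /data/raw/orders) are hypothetical; the same question could just as easily be answered with a Pig script or a Hadoop-compliant BI tool.

```sql
-- Point a Hive table at raw CSV files already sitting in HDFS;
-- no up-front data modeling or ETL is required.
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  order_id    STRING,
  customer_id STRING,
  order_date  STRING,   -- e.g., '2014-08-29'
  amount      DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/raw/orders';

-- A typical spreadmart-style question: orders and revenue by month.
SELECT substr(order_date, 1, 7) AS order_month,
       COUNT(*)                 AS order_count,
       SUM(amount)              AS revenue
FROM orders
GROUP BY substr(order_date, 1, 7)
ORDER BY order_month;
```

Behind the scenes, Hive compiles queries like these into Hadoop jobs, which is why analysts can work this way without writing low-level code or waiting for IT to build a data mart first.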

Until now, there’s been minimal talk about how to ensure data governance inside Hadoop environments. The terms data quality, data consistency, conformed dimensions, and metadata management have yet to enter the Hadoop lexicon. That’s partially because Hadoop is so new and most companies are still evaluating its ability to support production applications. It’s also because its primary users—business analysts—have never been overly concerned with enterprise data governance and consistency and don’t require high levels of data quality to generate estimates and analyze trends.

So if Hadoop is a self-service free-for-all where analysts and business units can dump and access data willy-nilly without governance, what’s to keep Hadoop’s proverbial data lake from becoming a bunch of data puddles? In other words, will Hadoop further proliferate spreadmarts or help consolidate them?

The answer to that question is “both”.

Spreadmart on Steroids

First, many companies are using Hadoop to create a data lake—a low-cost repository to hold all data in an organization. As such, a data lake becomes a giant analytic sandbox that provides one-stop shopping for every analyst and business unit in an organization. Rather than hunt for data in multiple applications and systems, analysts can get everything they need by tapping into the data lake, saving time and money. Thus, the data lake makes it easier to create spreadmarts.

But, rather than proliferating tens of thousands of wholly ungoverned spreadmarts on various file servers across a company and beyond, Hadoop consolidates all analyst work in a single place on a single system. Of course, analysts can still download data from Hadoop and squirrel it away in spreadsheets, but by centralizing the data from which analysts draw, organizations can ameliorate the more deleterious side effects of spreadmarts. Moreover, consolidating analyst activity on a single platform offers greater economies of scale and sizable cost savings.

More importantly, the data lake gives IT and business managers visibility into what analysts are doing. One way to think about spreadmarts is to view them as instantiations of business requirements. Hidden spreadmarts make it difficult for IT managers to see what’s important to the business and support those requirements in future versions of the data warehouse and enterprise reports. By centralizing analytical activity in a data lake, Hadoop exposes analyst activity and makes it easier for IT departments to partner with business users and proactively meet their needs.

Enterprise Analytical Ecosystem

However, Hadoop is much more than a collection of spreadmarts. It’s a scalable, flexible data processing platform that can meet most enterprise analytical requirements. It’s like the Swiss Army knife of data processing: it’s a generic tool that can do almost anything, although nothing optimally, at least at the moment.

As a scalable data processing system, Hadoop can store all enterprise data, not just a subset like a spreadmart or data warehouse. And with the advent of YARN (Yet Another Resource Negotiator), released last fall, Hadoop is now a highly flexible data processing platform. (See my prior column “Hadoop 2.0 Changes Everything.”) With YARN, Hadoop can support any type of data and any analytical processing engine written to the YARN interface, ranging from real-time SQL query and graph processing engines to in-memory computing and streaming analytics engines. Although Hadoop 2.0 needs time to mature, the future is clear: companies can store all their data in Hadoop and process it there, too.

This is revolutionary. Astute IT and data warehousing managers quickly recognize the implications. With Hadoop 2.0, their future analytical architecture revolves around Hadoop, not a relational database. By extension, their existing analytical systems become specialty databases that eventually disappear as Hadoop matures and subsumes their functionality.

“If we put all our data into Hadoop, we no longer have to move the data to process it,” think savvy IT managers. “This simplifies our environment, while speeding delivery times and saving money.”

The Future

At least that’s the vision. Lots of development and experimentation must happen before organizations transform their current analytical ecosystems into a data lake fueled by Hadoop 2.0. Existing analytical systems also have a long shelf life, even after their costs have been fully depreciated, and embedded skill sets and corporate inertia make it difficult for companies to jettison existing data platforms. Moreover, Hadoop may never live up to its promise, or another technology may take its place as the analytical heir apparent.

But things happen fast in the Hadoop world. Today, Hadoop is quickly becoming the de facto enterprise data repository and preferred spreadmart platform (or analytical sandbox). Soon, it could be the predominant platform for building analytical applications and the centerpiece of most analytical ecosystems.

A version of this article was published August 29, 2014 on searchbusinessanalytics.techtarget.com.

Wayne Eckerson
