Big Data Needs Structure: Optimizing Scalability and Performance with OLAP on Hadoop
Abstract: Hadoop is a work in progress, a distributed file system trying to match and surpass the analytic capabilities of relational systems. SQL-on-Hadoop technology turns Hadoop’s batch environment into an interactive one. Now, OLAP-on-Hadoop products dimensionalize big data, optimizing both performance and scalability at scale, allowing users to query and analyze large volumes of information at the speed of thought.
In my last blog, I discussed the tradeoff between performance and scalability faced by business analytic vendors. (See “The Big Data Analytics Tradeoff”.) I said vendors either take a “direct query” approach where the tools query source data directly to achieve high scalability, or an “extract and query” approach where tools extract and load data into a local, high-performance data store to accelerate queries. I then outlined optimization strategies vendors use to minimize the downsides of each approach.
OLAP on Hadoop
This blog focuses on a subset of analytic tools that revive an old concept—online analytic processing (OLAP)—and retrofit it for big data. In the process, these products—from vendors, such as AtScale, Kyvos Insights, Platfora, and SAS (Visual Analytics)—have not only achieved higher levels of both performance and scalability than other solutions, but they’ve partially rewritten one of big data’s chief calling cards—schema on read.
These so-called OLAP-on-Hadoop products dimensionalize data and present it in a business-friendly format. With OLAP, business users view metrics by common dimensions. For example, executives might examine sales by product, region, and time. With a mouse click, they can swap metrics, add or filter dimensions, pivot axes, and drill from summary views of business performance to the raw data. In other words, OLAP makes it easy for business users to analyze data the way they view the business.
Most OLAP on Hadoop products not only model the data in advance, but materialize it.
To dimensionalize data, OLAP-on-Hadoop products require designers to model the data—parse, combine, integrate, clean, and validate—before users query it. This is certainly not schema on read. In fact, most OLAP on Hadoop products not only model the data in advance, but materialize it. They create new aggregate data structures that get loaded into memory or high-performance columnar databases. This is schema on write, which if you listen to the big data community, is no longer fashionable, but it’s certainly useful for querying big data, as we will see.
Optimizing Scalability and Performance
By modeling, calculating, and storing dimensional aggregates in advance, OLAP-on-Hadoop products achieve both scalability and performance in a big data environment. They solve the scalability problem by keeping data in Hadoop where storage is cheap, enabling them to generate huge dimensional cubes with terabytes or more of data. And they solve the performance problem by pre-aggregating data in high-speed data caches, providing speed-of-thought analysis against big data.
Many companies turn to OLAP-on-Hadoop products once they discover that other approaches to big data analytics don’t work. “We get a lot of customers that want to use Hadoop to replace a data warehouse. But they quickly discover that running Tableau against a Hadoop cluster, even with a SQL-on-Hadoop product like Cloudera Impala, doesn’t work. The tools just don’t respond when you create a dashboard pick list on 480 billion rows of data,” says Dave Mariani, CEO of AtScale.
Tableau recently published a case study of Yellow Pages, which uses AtScale to query a database in Hadoop where one table contains 57 billion rows. “When you’re trying to do analysis, you don’t want to slow people’s line of thought,” says Richard Langlois, director of enterprise data management at Yellow Pages. With AtScale, queries ran 10 to 20 times faster than using Cloudera Impala alone, says Langlois.
Dissecting OLAP-on-Hadoop Products
Much like the ROLAP versus MOLAP wars of the 1990s, there is a split in the OLAP-on-Hadoop community about the best way to dimensionalize data in Hadoop.
Physical Cubes. Most products use a MOLAP approach where they create a physically distinct dimensional data structure—or cube—inside Hadoop. Once the cube is created, performance is consistently fast. But the downside is that it can take hours to calculate the cube, and queries only run against the cube, not the raw data that formed the cube, which means users don’t get the freshest data possible.
Virtual Cubes. Other products, namely AtScale, create virtual cubes against the raw data, which is more of a ROLAP approach. This gives users access to real-time data and avoids the upfront cost of building cubes. But it potentially torpedoes query performance since it builds dimensional views on the fly. To ensure high performance, AtScale dynamically generates aggregates as users query the system, and it lets designers predefine aggregates for commonly used queries to avoid initial performance hits.
Besides architectural approaches, OLAP-on-Hadoop products differ in other ways. Some have their own visualization tools (Platfora and SAS), others don’t (AtScale) or give customers the option of using a third party tool (Kyvos). Some materialize cubes in memory (Platfora, SAS), while others store aggregates in disk-based data structures (AtScale). Some store cubes in proprietary formats (Platfora and Kyvos), while others use open source formats, such as Apache Parquet or ORC (AtScale). Some have their own metadata repositories and analytic engines (Platfora, Kyvos, and SAS) while others use open source tools, such as HCatalog, Hive, Presto, and Drill (AtScale).
Is OLAP the Future of Big Data?
OLAP is anathema to the big data crowd, redolent of old school approaches to data management and analytics. But schema on read and other big data techniques only go so far if everyone in an organization is going to benefit from big data. Sure, data scientists (if you have them) might prefer writing Java or SQL against raw data in Hadoop or Spark, and some analysts might like running SQL on Hadoop products against raw data. But most business people will prefer to access dimensional structures designed and built in advance by a data architect.