The Big Data Analytics Tradeoff
Big data poses a thorny problem for traditional reporting and analysis tools. If the data volumes are truly large—say in the terabyte or petabyte range—queries can’t resolve within seconds, as users increasingly expect.
Therefore, business intelligence (BI) vendors have to make a tough choice when querying big data. Either they query all the data and suffer slow queries, or they query a subset of data downloaded into a local data store that delivers fast performance. In other words, they must choose either scalability or speed. (See figure 1.)
This tradeoff is not a new phenomenon unique to big data. The BI market has pivoted on this architectural hinge for decades. Remember the OLAP wars of the 1990s that pitted relational OLAP (ROLAP) vendors, such as MicroStrategy with its multi-pass SQL, against MOLAP vendors, such as Arbor Software with its Essbase multidimensional database (now owned by Oracle)? Big data merely magnifies the scalability and performance issues of this perennial debate.
Figure 1. Big Data Query Tradeoffs
Direct Query Architecture
On one side of the debate are vendors that favor a direct query approach. These include Zoomdata, Datameer, and most traditional BI vendors that use a ROLAP architecture, including MicroStrategy, SAS, SAP BusinessObjects, and IBM Cognos Analytics.
Benefits. The primary benefit of a direct query approach is that it queries all the data. Business users get the freshest data possible and don’t have to settle for a pre-selected view of data. They get raw, real-time data unmediated by a data architect or data engineer.
Drawbacks. On the downside, query performance degrades as data volumes and query complexity grow. And there is little incentive to apply complex transformations to the data at query time, since doing so dramatically slows performance. Consequently, direct query tools work best against a single source of big data that is relatively clean.
Extract and Query Architecture
At the other end of the spectrum are vendors that extract subsets of data and load them into high-speed local data stores. If necessary, data architects can model and transform the data to ease data access and speed query performance. Vendors such as Platfora, Tableau, SiSense, GoodData, Qlik, and Birst all extract and load data into a local database that end users query.
Benefits. The primary benefit of the extract and query architecture is speed. Moving a subset of data into a high-speed database guarantees faster, more consistent queries. Plus, data architects can model, clean, combine, and aggregate the data to speed queries and simplify data access.
Drawbacks. The primary downside of this approach is that users query a snapshot of the data rather than the live source data. First, a data architect has to create and curate the data replica, which costs time and money and creates potential synchronization issues. Second, users don’t get real-time data. Depending on the snapshot interval, the data may be minutes, hours, or days old.
Hybrid Approaches
Truth be told, most vendors pursue a hybrid approach. They start at one end of the scalability-versus-speed spectrum and modify their query architectures to minimize its inherent downsides.
Optimizing Direct Query. Direct query vendors use a number of strategies to optimize performance when running queries against large, complex data sets.
The most popular approach is to leverage in-memory engines, such as Spark, to process queries at high speed and cache the results, which accelerates subsequent queries that request the same data. The in-memory cache is an interim data store that is transparent to users.
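A minimal PySpark sketch of that caching pattern appears below. The data path, table, and column names are hypothetical, and a production deployment would tune the cache far more carefully; the point is simply that the first query pays the full scan cost and repeat queries hit the in-memory result.

```python
# Minimal sketch of in-memory caching with Spark (PySpark).
# Paths, tables, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bi_direct_query_cache").getOrCreate()

# Read the large source data directly (e.g., Parquet files on a data lake).
events = spark.read.parquet("s3://datalake/web_events/")

# Build the aggregate the dashboard needs and mark it for in-memory reuse.
daily_counts = (
    events.groupBy("event_date", "country")
          .count()
          .cache()   # first action pays the full cost; repeats hit the cache
)

# The first action materializes the cache; subsequent queries reuse it.
daily_counts.filter(daily_counts.country == "US").show()
daily_counts.filter(daily_counts.country == "DE").show()
```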
In addition, some vendors use native database APIs rather than generic ODBC/JDBC connectors to accelerate query performance and exploit native features of the source application or database. Others create virtual or materialized views in the source databases to accelerate commonly used queries.
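To illustrate the materialized-view tactic, the sketch below creates and refreshes a materialized view in the source database through a generic Python database connection. The connection string, table, and column names are assumptions, the syntax shown is PostgreSQL's, and other databases expose the same idea with different commands.

```python
# Sketch: precompute a commonly used aggregate as a materialized view
# in the source database (PostgreSQL syntax; all names are hypothetical).
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=bi_reader")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS sales_by_region AS
        SELECT region,
               date_trunc('day', order_ts) AS order_day,
               SUM(amount) AS total_sales
        FROM   orders
        GROUP  BY region, order_day
    """)
    # Refresh periodically so the BI tool's queries stay reasonably fresh.
    cur.execute("REFRESH MATERIALIZED VIEW sales_by_region")
```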
Zoomdata has one of the most innovative solutions offered by a direct query vendor. Its patented “data sharpening” feature works like video streaming technology. It projects query results visually in real time and “sharpens,” or dynamically updates, them until the query completes. Meanwhile, users can view and interact with the visual representation of the data as the query resolves, providing real-time access and analysis against petabytes of data.
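Zoomdata's implementation is proprietary, but the general idea of progressive results can be conveyed with a purely conceptual sketch: process the data in chunks and update the on-screen aggregate after each one, so the answer "sharpens" as more data is scanned. This is not Zoomdata's method, and the file and column names are invented for illustration.

```python
# Conceptual sketch only: progressively refine an aggregate as data streams in.
# NOT Zoomdata's implementation; file name and schema are hypothetical.
import pandas as pd

running_sum, running_count = 0.0, 0

# Read a large file in chunks and update the running average after each chunk.
for chunk in pd.read_csv("web_events.csv", chunksize=1_000_000):
    running_sum += chunk["session_seconds"].sum()
    running_count += len(chunk)
    # In a BI tool, this partial result would redraw the chart immediately.
    print(f"estimate so far: {running_sum / running_count:.2f} sec/session")
```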
Optimizing Extract and Query. Vendors that use an extract and query approach use a variety of strategies to increase the scalability of their systems.
The simplest, though often the most expensive, strategy is to throw more hardware and memory at the problem. Some vendors link or daisy-chain databases to give users a complete view of all the data rather than a subset of it. Others populate the local database with aggregate data only and provide a drill-through mechanism that lets users query atomic data stored in the source system. In addition, to improve the freshness of local data stores, some use continuous or trickle loads that update local data in micro-batches every few seconds or minutes, as the sketch below illustrates.
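The trickle-load idea reduces to a watermark-based loop: copy only the rows that changed since the last load into the local store. The connection strings, table, column names, and five-minute interval below are assumptions chosen for illustration, not any vendor's actual pipeline.

```python
# Sketch of a trickle (micro-batch) load: every few minutes, copy only rows
# newer than the last high-water mark from the source system into the local
# analytic store. Connection details, table, and column names are hypothetical.
import time
import pandas as pd
import sqlalchemy as sa

source = sa.create_engine("postgresql://bi_reader@source-db/app")  # big source system
local = sa.create_engine("sqlite:///local_store.db")                # fast local store

high_water_mark = pd.Timestamp("1970-01-01")

while True:
    # Pull only the rows that changed since the last load.
    delta = pd.read_sql(
        sa.text("SELECT * FROM orders WHERE updated_at > :ts"),
        source,
        params={"ts": high_water_mark},
    )
    if not delta.empty:
        delta.to_sql("orders", local, if_exists="append", index=False)
        high_water_mark = delta["updated_at"].max()
    time.sleep(300)  # micro-batch interval: five minutes
```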
Tableau straddles the fence between the two approaches described above. It gives customers the option to query data directly or download data into a persistent local data store that users can refresh on demand or at pre-defined intervals. However, for big data, most Tableau customers download a subset of data into the tool’s local data store and query it from there.
Conclusion
Before wading into the waters of big data, organizations should understand the query architecture of their preferred reporting and analysis vendor. Does it use a direct query or extract and query architecture? If the former, what strategies does it use to optimize performance? If the latter, what techniques does it employ to increase scalability?
Understanding the architectural nuances of business intelligence tools is critical to selecting a product that best meets the information requirements of your target users. In some cases, you may need two tools to cover the spectrum of users in your organization.