Whither the Data Warehouse? Reflections From Strata NYC 2017

What’s in a name? That which we call a rose by any other word would smell as sweet.

From Romeo and Juliet

Our industry won’t rest until it finds a suitable replacement for the venerable, if ailing, data warehouse.

I attended the Strata Data Conference in New York City last month and found that the subtext of most of my briefings was a desire to reinvent the data warehouse (DW) using modern technology.

Of course, most vendors I spoke with never uttered the words “data warehouse”. For vendors, that term is old-school at best; at worst, it’s a liability. The few who mentioned the data warehouse said their product eliminates the need for it, as well as its sidekicks: extract, transform, and load (ETL) tools.

Truth be told, the traditional data warehouse has its problems: too slow, too costly, hard to change, and it has never delivered (and never will deliver) a single version of the truth. (See my colleague Steve Smith’s litany of DW ailments in his provocative article, “The Demise of the Data Warehouse.”) But many DW practitioners, myself included, see value in the data warehouse, although we admit it needs a significant overhaul. (See Dave Wells’ articles, “The Data Warehouse is Still Alive” and “The Future of the Data Warehouse,” plus various other responses. This is a hot topic at Eckerson Group!)

What I discovered at Strata NYC is that most vendors today focus on overhauling the part of a data warehouse they control (the technology) while largely ignoring its most valuable part: the conceptual and logical constructs.

Re-Imagining the Data Warehouse

All this raises the question, “What is a data warehouse?”

In my humble opinion, a data warehouse is a business view of data that supports decision making. At a logical level, it’s a data model that is subject-oriented, integrated, time-variant, and non-volatile—Bill Inmon’s definition from the 1990s, which is still valid. (See figure 1.) Organizations create multiple business views from a DW data model. This lets different groups in an organization view a common data set in unique ways. This fosters reuse, avoids data silos, and eliminates duplicate effort. The mantra is model once, view in multiple ways.

Figure 1. A Data Warehouse

The vendors exhibiting at Strata don’t focus on the conceptual or logical layers of a data warehouse. However, if you dig deep, you might hear them say, “Well, yes, someone still has to create the schema” (a.k.a. data warehouse data model) or “create the business view” (a.k.a. BI semantic layer) or “create aggregations” (a.k.a. materialized views) or “move and integrate the data” (a.k.a. ETL, DW automation, replication, streaming) or “add reference data” (a.k.a. master data management).
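
To make those constructs concrete, here is a minimal sketch of “model once, view in multiple ways.” It uses PySpark purely for illustration; the storage paths, column names, and view definitions are hypothetical, and the same idea applies to any relational database or BI semantic layer.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dw_views_sketch").getOrCreate()

    # Hypothetical conformed subject area: one sales fact table plus a shared
    # customer dimension, modeled once in the warehouse layer.
    sales = spark.read.parquet("s3a://warehouse/sales_fact/")        # assumed path
    customers = spark.read.parquet("s3a://warehouse/customer_dim/")  # assumed path

    # One conformed data set that every group works from.
    sales.join(customers, "customer_id").createOrReplaceTempView("sales_conformed")

    # Different groups get different business views of the same conformed model.
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW finance_revenue AS
        SELECT order_month, SUM(net_amount) AS revenue
        FROM sales_conformed
        GROUP BY order_month
    """)

    spark.sql("""
        CREATE OR REPLACE TEMP VIEW marketing_segments AS
        SELECT customer_segment, COUNT(DISTINCT customer_id) AS active_buyers
        FROM sales_conformed
        GROUP BY customer_segment
    """)

Each view answers a departmental question, but both resolve to the same modeled data, which is what fosters reuse and avoids silos.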

Most vendors assume data scientists or skilled data analysts are the principal users of the data. So, they can feed these skilled data users the raw data. But most business users get lost in that morass. So, someone has to model the data so it makes sense to business users. In the past, IT did this, but now data scientists and data analysts can do it using powerful, self-service tools. But the real question is: does a data scientist or analyst think locally or globally? Do they create a model that supports just their use case, or do they think more broadly about how this data set can support other use cases?

Use Case Driven Development. Most developers (whether IT or analysts) today follow the path of least resistance. They build only as much as they need to support a specific use case. This is reminiscent of the early DW days when we said, “Build a data warehouse one data mart at a time.” But we instructed early developers to think globally: pull more data than you need, and use conformed dimensions and metrics instead of creating new ones. Today, developers need to balance the requirements of the here and now with the need for reuse and conformance. From where I sit, the global view is in short supply.

DW Technology

The technology layer is where vendors are aggressively reinventing the data warehouse. A traditional data warehouse runs on a relational database (with data marts often running on OLAP databases). But relational and cube technology proved insufficient early on to handle the volume, variety, and velocity of data or to meet modern business requirements for self-service data access, preparation, and analytics.

The vendor and open source communities have breathed new life into how a data warehouse is constructed (more specifically, how data is assembled, stored, and queried), using new technologies ranging from Hadoop and Spark running on scale-out architectures to streaming and real-time environments running on in-memory grids. These technologies have given rise to new ideas and new ways of building data warehouses:

  • Never move or copy the data. There are different levels of zealousness here. Hadoop vendors, such as Cloudera and Hortonworks, and data lake vendors, such as Podium and Zaloni, advocate loading the data into one giant data lake (i.e., a Hadoop repository) and creating virtual views (i.e., Hive tables) on top (see the first sketch after this list). Data virtualization vendors, such as Dremio, argue that you should never move data from its source system or application unless performance issues require a materialized view.
  • Decouple data storage from processing. Modern data architectures store data in a low-cost, scale-out environment, such as HDFS or S3, and then run query and other processing engines in separate servers, clusters, or virtual machines to gain infinite scalability without bottlenecks. Snowflake does this for cloud data warehouses, and most Hadoop vendors do as well.
  • Go to the Cloud. The easiest way to get the value of the two approaches above is to migrate your data environment to the cloud. The cloud also makes it easier to support dynamic elasticity, so you can spin processes up and down on demand.
  • Analyze data at ingest. As data volumes increase and business velocity accelerates, organizations need to design data architectures for real-time processing. Streaming vendors, such as Striim, advocate doing integration and analytics in flight, at the point of ingest (see the second sketch after this list). They make a good case that a streaming-first data architecture is the future. Of course, they still need a persistent data store (which Striim provides) as well as modeled historical data (i.e., a data warehouse) and reference data (i.e., MDM) to do more complex analytics.
  • Catalog and Prepare. Preparing data is now a shared responsibility between IT and data users. Data lake vendors, such as Podium and Zaloni, data catalog vendors, such as Alation and Waterline, data preparation vendors, such as Paxata and Trifacta, and traditional data integration vendors, such as Informatica and Talend, are reinventing data pipelines. They’re using data catalogs and machine learning to auto-tag and categorize incoming data and make it available to data analysts and data scientists to transform, clean, and analyze for their own projects, and then promote or publish their output back into the pipeline.
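
To illustrate the first two ideas, here is a minimal sketch using PySpark and a hypothetical S3 bucket: the raw files stay in cheap object storage, a separate compute engine queries them in place, and a virtual view stands in for a warehouse table. The paths and column names are assumptions for illustration only.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake_view_sketch").getOrCreate()

    # Data stays in low-cost object storage, decoupled from the compute cluster.
    orders = spark.read.parquet("s3a://data-lake/raw/orders/")  # assumed layout

    # A virtual view over the lake: nothing is copied into a separate store;
    # queries run against the files where they sit.
    orders.createOrReplaceTempView("orders")

    top_products = spark.sql("""
        SELECT product_id, SUM(quantity) AS units_sold
        FROM orders
        GROUP BY product_id
        ORDER BY units_sold DESC
        LIMIT 10
    """)
    top_products.show()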
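
And here is a comparable sketch of analyzing data at the point of ingest, this time with Spark Structured Streaming rather than any particular vendor’s product; the Kafka broker, topic, and event schema are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming_ingest_sketch").getOrCreate()

    # Read a hypothetical clickstream topic as an unbounded table.
    # (Requires the spark-sql-kafka connector package on the classpath.)
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "clickstream")
              .load())

    pages = events.select(
        F.get_json_object(F.col("value").cast("string"), "$.page").alias("page"),
        F.col("timestamp"))

    # Aggregate in flight: page counts per one-minute window, computed at ingest
    # rather than after the data lands in a warehouse table.
    counts = (pages
              .withWatermark("timestamp", "10 minutes")
              .groupBy(F.window("timestamp", "1 minute"), "page")
              .count())

    counts.writeStream.outputMode("update").format("console").start().awaitTermination()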

Some experts think the modern data warehouse is a real-time, in-memory layer paired with a persistent storage layer, both running on scale-out architectures. While they foresee the near-term demise of the analytic databases, such as Netezza, Greenplum, and Exadata, that power many traditional data warehouses, this doesn’t mean the data warehouse is dead.

Beyond Analytics

Hybrid Databases. Today’s database vendors aren’t satisfied with just addressing analytics. Most see an opportunity to become the platform for ALL data processing. They usually do this by supporting two data processing models: row-based for transactions and column-based for analytics.

Splice Machine, MemSQL, Actian, and other database vendors represent a new breed of hybrid database that can process transactions (such as orders) and analytical workloads (such as machine learning models) simultaneously, with fast performance. Hadoop vendors are also moving into this realm. MapR offers MapR-DB, which supports both analytical and transactional workloads, while Cloudera’s hybrid database is Kudu, which supports updates, inserts, and deletes on Hadoop data. (Cloudera also offers Impala for complex SQL queries and HBase for operational lookups.) Hortonworks is banking on Hive and its new ACID/Merge functionality to support mixed workloads.

Whither the Relational Data Warehouse? 

So where does this leave the traditional relational data warehouse? As relational databases scale up, support non-standard data types like JSON, and run functions written in Python, Perl, and Scala, they won’t be easy to replace. Migration costs are substantial. Cloudera even admits that it “never replaces Teradata.” Although it offloads Teradata ETL, which is a series of SQL queries, it can’t run the queries that support hundreds or thousands of Teradata reports.

Today, a relational database still beats most competitors when performing complex, multi-way joins. Given that most analytic queries are just that, a traditional data warehouse still might be the right choice. In most cases, migrating a data warehouse to the cloud can deliver most of the benefits of big data without retooling technology and skills.

However, it’s clear that many companies are subsuming data warehouses inside data lakes. Here, the traditional relational data warehouse disappears, replaced by a database view running on Hive tables and powered by Spark, Impala, Kudu, or other engines. Others are virtualizing data warehouses using tools, such as Composite or Dremio, while others are running some (and someday maybe all) of their analytics in flight using streaming.

Conclusion

No matter what technology is used to power the data warehouse, someone, somewhere still needs to think about the business users who will consume the data. That person needs to model and refine the data into products suitable for different types of business users. And I’d argue that 90% of business users need highly refined data that is clean, integrated, subject-oriented, and time-variant.

Now the question becomes, who delivers data to the 90% of business users? In other words, who creates the data warehouse? In the old days, IT did all the work, but that led to huge bottlenecks that forced most business users to build their own data sets. And vendors followed suit, giving those users more powerful tools to service their own needs.

The best organizations today make the creation of a data warehouse a joint responsibility between IT and the business. The key is whether organizations do this in a systematic, governed way, with a minimum of duplication and a maximum of reuse. The secret is to create federated organizational models and bi-directional development processes that promote self-service and governance simultaneously. But this is a topic for another day.

In the end, a data warehouse is not just about the technology; it’s about people working together and communicating effectively. Vendors would be wise to remember this while pushing their technologies to customers and prospects.


Wayne Eckerson
