May Data Warehouses Live In Peace: A Conversation Between Analysts

Last week, Steve Smith, practice director of data science at Eckerson Group, Dewayne Washington, CIO at The Business of Intelligence, and Praveen Kankariya, CEO of Impetus Technologies, participated in a webinar hosted by Eric Kavanagh, CEO of The Bloor Group, titled 'Is The Traditional Data Warehouse Dead?' The webinar was a joint effort between The Bloor Group and Eckerson Group.

During the webinar, Smith argues that the traditional data warehouse is dead because it failed to meet its initial promise: giving users speedy access to a single version of the truth. Smith concludes that a combination of a data lake and metadata management could deliver on the original promise of the data warehouse.

Washington argues the data warehouse is alive and well; however, it takes considerable time and effort to make it work, which leads many to give up too quickly and look for workarounds (that eventually fail). Washington also counters Smith, saying sole reliance on data lakes is dangerous because data in a data lake is still in transit and not clean.

Kavanagh offers a compromise, saying the traditional data warehouse and data lake environments complement each other: data lakes are for broader analytics and exploration, and data warehouses are for deterministic numbers used in reports and for informing board members.

Kankariya concludes by saying, "The destination is more important than the means." Each organization needs to evaluate itself and determine the best way to get as close as possible to a single version of the truth.

The webinar inspired Ted Hills, a data management consultant at Concepts and Objects LLC, to chime in via email, which resulted in an enlightening email chain, published below. Perhaps surprisingly, both sides found common ground, enabling them to take the discussion further than it had gone before.

From: Ted Hills

Sent: Wednesday, March 21, 2018 5:47 PM

Hi Guys,

Great webinar with the dueling duo, a vendor on the side, and Eric as referee! And thanks for the shout-out, Dewayne.

I am motivated by the discussion to take a few minutes to share my thoughts. 

Steve, I agree with Dewayne that your last slide was the perfect starting point for him, and that it undercuts your message that the traditional DW is dead. In reality, the DW is still alive and well; it is just sharing space with data lakes, MDM, and other data facilities.

Dewayne, I think your marriage analogy is spot on. The world is littered with bad marriages, which has led some to give up on the idea of marriage altogether. But many of us enjoy good marriages and realize full well that marriage itself is not the problem; in fact, there’s nothing like a good marriage to deliver real value over the long term. It’s the same with data warehouses: done badly, they can be enough to make you give up on the concept altogether, but those who have been fortunate to see them done well understand the value they deliver and keep on delivering.

Data Lakes Are Great, However...

As for data lakes, I think that many of the motivations for them are spot on:

  • Ease of access to original data: By copying data in its raw form from systems of record, where it’s often trapped, to a central place where everyone can play with it, a lot of fast progress can be made in understanding and harnessing data. And this can be done without a big initial investment in modeling, cleansing, ETL, etc. That’s why ELT works better here (see the sketch after this list).
  • Adaptability: Since the data lake just holds a copy, it’s cheap to enlarge it and it’s OK to throw useless parts away. That means it can respond quickly to changing needs. 
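
To make the ELT point concrete, here is a minimal sketch of landing a raw source export in a lake and deferring transformation to read time. Everything in it is hypothetical: the paths, file names, and the email column are invented for illustration.

    import shutil
    from pathlib import Path

    import pandas as pd

    # Hypothetical locations; a real lake would more likely sit on object storage.
    SOURCE_EXPORT = Path("/exports/crm/customers_2018-03-21.csv")
    LAKE_RAW_ZONE = Path("/datalake/raw/crm")

    # Extract + Load: land the file exactly as the source produced it -- no cleansing,
    # no modeling, no conforming. The cheap, faithful raw copy is the whole point.
    LAKE_RAW_ZONE.mkdir(parents=True, exist_ok=True)
    shutil.copy2(SOURCE_EXPORT, LAKE_RAW_ZONE / SOURCE_EXPORT.name)

    # Transform: happens later, on read, and only when an analyst actually needs it.
    def customers_with_email(raw_file: Path) -> pd.DataFrame:
        df = pd.read_csv(raw_file)
        return df[df["email"].notna()]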

However, a data lake does not usually hold conformed data: data that has been validated to use official corporate master / reference data, so that data from disparate systems can be meaningfully combined. In my mind, this is the chief job of the warehouse: to bring data together around conformed dimensions.
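
As a rough illustration of what bringing data together around conformed dimensions buys you, the sketch below joins facts from two hypothetical systems through one shared customer dimension; all of the names and figures are invented.

    import pandas as pd

    # One conformed customer dimension, keyed on the official master customer_id.
    dim_customer = pd.DataFrame({
        "customer_id": [101, 102],
        "customer_name": ["Acme Corp", "Globex"],
        "region": ["East", "West"],
    })

    # Facts from two disparate systems, already validated against the master IDs.
    fact_orders = pd.DataFrame({"customer_id": [101, 102, 101], "order_amount": [500, 250, 125]})
    fact_support = pd.DataFrame({"customer_id": [102, 101], "tickets": [3, 1]})

    # Because both fact sets share the conformed dimension, they can be meaningfully combined.
    report = (
        fact_orders.groupby("customer_id")["order_amount"].sum().to_frame()
        .join(fact_support.set_index("customer_id"))
        .join(dim_customer.set_index("customer_id"))
    )
    print(report)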

Even a data warehouse can be built in an agile fashion, with a focus on delivering the biggest business value items as early as possible and building out value incrementally. 

So I think the DW naysayers need to understand the bigger picture: a DW coupled with a data lake and strong MDM is the best recipe for success.

What Needs To Change

So far, I think I might just be channeling my inner Dewayne. What I really wanted to get around to is what I think should be done differently.

Upstream data quality. I think that the traditional DW approach has been to try to improve the quality of data after it’s arrived at the DW. Instead—and I think someone said this—the quality problem should be pushed upstream all the way to the systems of record. Systems, processes, and thinking should be reengineered so that the detection of bad data arriving at the DW is considered a major data system failure, and responded to accordingly.
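
One way to picture that shift, sketched below with invented column names and rules: the same validation runs at the source system before data ships, and again on arrival at the DW, where any failure is escalated as an upstream incident rather than quietly cleansed away.

    import logging

    import pandas as pd

    logger = logging.getLogger("dw.intake")

    # Hypothetical rules; in practice they would come from the source system's own contract.
    def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
        if df["customer_id"].isna().any():
            raise ValueError("customer_id must be populated at the source")
        if (df["order_amount"] < 0).any():
            raise ValueError("order_amount must be non-negative")
        return df

    # The source runs validate_orders() before shipping data downstream. The warehouse
    # runs it again on arrival, but a failure here is treated as a major data system
    # failure upstream -- something to alert on, not something to patch silently in ETL.
    def land_in_warehouse(df: pd.DataFrame) -> pd.DataFrame:
        try:
            return validate_orders(df)
        except ValueError as err:
            logger.critical("Bad data reached the DW; an upstream quality process failed: %s", err)
            raise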

Commitment to single version of the truth. There needs to be a strong MDM program that’s focused on making it easy to get the official, high-quality master / reference data, and that drives out any unofficial copies or bad sources. This needs to be done with a tie to business value, or it becomes a playground for control freaks. But the DW, the data lake, and everybody else need to be committed to a single version of master / reference data truth. The DW, data lake, and other systems should be able to reconfigure data analysis in many different ways, but they shouldn’t be allowed their own private view of the master data / conformed dimensions that make all the data joinable.
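
A minimal sketch of what that commitment can look like in practice, using hypothetical systems and identifiers: every local key resolves through one official cross-reference to the master customer_id, and anything that fails to resolve is raised as an error instead of being given a private mapping.

    import pandas as pd

    # The single, official mapping from each system's local keys to the master customer_id.
    xref = pd.DataFrame({
        "source_system": ["crm", "crm", "billing"],
        "local_id": ["C-17", "C-42", "9001"],
        "master_customer_id": [101, 102, 101],
    })

    def to_master_ids(df: pd.DataFrame, source_system: str) -> pd.DataFrame:
        mapping = xref[xref.source_system == source_system]
        out = df.merge(mapping, left_on="customer_ref", right_on="local_id", how="left")
        unmatched = out["master_customer_id"].isna()
        if unmatched.any():  # an unknown or unofficial key is a data quality event, not a shrug
            raise ValueError(f"Unmapped keys: {out.loc[unmatched, 'customer_ref'].tolist()}")
        return out.drop(columns=["source_system", "local_id"])

    billing_invoices = pd.DataFrame({"customer_ref": ["9001"], "invoice_total": [320.0]})
    print(to_master_ids(billing_invoices, "billing"))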

Model history bitemporally. I think the traditional DW set of slowly changing dimension types never really worked. I favor dimensions built on the bitemporal model, which covers every possible need for historical data and supports strong audits too.
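
For readers unfamiliar with the bitemporal model, here is a toy sketch with invented data: each row carries a valid-time range (when the fact was true in the world) and a transaction-time range (when the warehouse believed it), so any past report can be reproduced exactly and every correction leaves an audit trail.

    import pandas as pd

    OPEN = pd.Timestamp.max  # stand-in for "no end date yet"

    # Customer 101 was recorded as "East" on 2017-01-05, then corrected to "West"
    # on 2018-02-10 when we learned the original entry had been wrong all along.
    dim_customer = pd.DataFrame({
        "customer_id": [101, 101],
        "region": ["East", "West"],
        "valid_from": pd.to_datetime(["2017-01-01", "2017-01-01"]),
        "valid_to": [OPEN, OPEN],
        "tx_from": pd.to_datetime(["2017-01-05", "2018-02-10"]),
        "tx_to": [pd.Timestamp("2018-02-10"), OPEN],
    })

    def as_of(dim: pd.DataFrame, valid_at: str, known_at: str) -> pd.DataFrame:
        """What was true at valid_at, as the warehouse understood it at known_at?"""
        v, k = pd.Timestamp(valid_at), pd.Timestamp(known_at)
        return dim[(dim.valid_from <= v) & (v < dim.valid_to)
                   & (dim.tx_from <= k) & (k < dim.tx_to)]

    print(as_of(dim_customer, "2017-06-01", known_at="2018-01-15"))  # East: the January view
    print(as_of(dim_customer, "2017-06-01", known_at="2018-03-23"))  # West: the corrected view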

Keep historical data in master data sources. But while we’re on the topic of historical data, it’s critical that master data sources keep the history of their own data. That’s another problem that should be pushed upstream. In the bad old days, source systems couldn’t handle the extra storage and consequent overhead of keeping history. These days, that’s not a problem, and master data sources should be the single source of truth for their data both past and present. It probably still makes sense for the DW to keep the history of transactional data, but in some cases even that responsibility should move upstream. 

The DW’s core mission remains intact: to support unique combinations of data for reporting and analysis. The data lake’s role is as the sandbox where you figure out what reporting should be institutionalized. Combining the data lake with a DW landing area is, I think, quite feasible.

Build better models in a better way. What about data modeling? After all, I think we should @ModelItAll (Ted Hills' Twitter handle). I think that models should be built in an agile way, just like warehouses and everything else—but models should be built! Even though I think everything should be modeled, I don’t think it should be modeled all at once, and certainly not in a waterfall way, where the whole company is put on hold until the data modelers figure everything out. I also think that it should be possible to have a single contextual (conceptual) model that covers the concepts and objects that are common across the enterprise. Such a model should include semantics—the meaning of the data. Only the COMN notation can document semantics and data in a single graphic in a way that’s applicable to SQL, NoSQL, knowledge graph, property graph, and software production.

With that, I’ll end here. I enjoyed the interchange. Now I’ll retreat to my data lakehouse and enjoy the view.  


From: Stephen Smith

Sent: Thursday, March 22, 2018 6:58 AM

Thanks, Ted - great minds are in agreement. The only thing I’d quibble with is whether the "traditional" data warehouse should collect all the credit for data cleansing, MDM, governance, etc. It is a bit like the stone soup fable, where you keep throwing all the good stuff into the data warehouse pot just because it is the incumbent.

I like your idea of making the data source responsible for keeping history. 

In some ways, the negotiation and enforcement of SLAs between the DL/EDW and data sources, both internal and external, might go a long way toward solving this problem. GDPR might in fact make this more possible (force this) within the organization. The big philosophical change is that accountability needs to be pushed upstream, and internal data sources must be viewed as enterprise-wide corporate assets, not just the by-product of some business unit that gets dumped into the data tributary flowing into the data lake …


From: Dewayne Washington

Sent: Thursday, March 22, 2018 10:03 AM

To put it in an analogy: all water flows downhill, so let’s say it flows from the mountain to the lake. The snow on the mountain is the source, the stream is the data lake, and the lake is the EDW (stay with me on this one).

If your lake (EDW) is contaminated, it may be from the source, and when you convince the company on the mountain to stop polluting the air, that will, by default, clean up your stream and lake as well.

So fixing things as close to the source as possible is key, not only for data quality but for MDM functions as well, making it easier to conform the data later.


From: Ted Hills

Sent: Thursday, March 22, 2018 11:12 AM

It sounds like we’re all in agreement about where the focus should be on data quality—at the source. 

So, Steve, I’m not sure I get your quibble, since I think we agree that data quality is an upstream responsibility. I do think that the DW is a convenient place to check whether incoming data is of high quality, but as I said, in a properly functioning data ecosystem the data warehouse should NEVER detect bad data coming in. If it does, that means an upstream data quality process has failed. So, the data warehouse gets no credit for being part of the data quality ecosystem, other than as a final check on the upstream data quality processes.


From: Stephen Smith

Sent: Friday, March 23, 2018 8:20 AM

Hi Ted – Yes, I think that is a big idea that most companies need to change their thinking about – not battling over whether they call it a data warehouse or a data lake, but rather pushing data quality accountability upstream via data governance and SLAs that enforce it. Agreed that no data should enter the DL or DW without undergoing some data quality testing – basically, treat internal data just as you would data from an external data provider.
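
A minimal sketch of such an SLA-as-code check follows; the contract, column names, and feed are invented, and the same gate applies whether the feed is an internal by-product or an external purchase.

    import pandas as pd

    # A hypothetical, minimal data contract: the written SLA, expressed as checks.
    ORDERS_CONTRACT = {
        "required_columns": {"order_id", "customer_id", "order_amount"},
        "not_null": ["order_id", "customer_id"],
        "non_negative": ["order_amount"],
    }

    def check_contract(df: pd.DataFrame, contract: dict) -> list:
        violations = []
        missing = contract["required_columns"] - set(df.columns)
        if missing:
            violations.append(f"missing columns: {sorted(missing)}")
        for col in contract["not_null"]:
            if col in df.columns and df[col].isna().any():
                violations.append(f"nulls in {col}")
        for col in contract["non_negative"]:
            if col in df.columns and (df[col] < 0).any():
                violations.append(f"negative values in {col}")
        return violations

    # Reject the feed at the gate -- internal or external -- if the contract is broken.
    violations = check_contract(pd.read_csv("orders_feed.csv"), ORDERS_CONTRACT)
    if violations:
        raise ValueError(f"Feed rejected; SLA violations: {violations}")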

If you subscribe to that thinking, then one of the biggest arguments against a data lake (i.e., that it might have corrupt data) is taken off the table, and the data warehouse is mostly about pre-compiling and moving the data.

Do you think it's part of a data lake's job to perform data quality assurance before it is accepted into the lake?


From: Ted Hills

Sent: Friday, March 23, 2018 8:46 AM

Hi Steve, 

It sounds like we’re more aligned than it appeared at first. 

To your last point first: I think that it is essential that a data lake NOT perform any checks on the data coming into it. As I see it, the real value of a data lake is to make source data more accessible. It’s a physical thing. A data lake makes it easy for analysts to obtain data. It’s superior to having each analyst create a point-to-point data provisioning system for each data set they want to analyze.

To achieve its mission, a data lake should reflect source data in all its glory (or messiness), exactly as it is in the source system. That’s because the data lake makes no assumptions about how data is to be used. Analysts can go crazy doing whatever they want with the data, so there’s no value in re-organizing the data for unknown multiple purposes.

The reality is—and I know that Dewayne has made this point in the past—once analysts have figured out what analyses they want to run repeatedly, then the data needs to be organized to make those analyses efficient and repeatable. Many analyses that have little to do with each other will want to use the same master/reference data, organized in dimensions, and—voila!—all of a sudden you get something that looks like a data warehouse.

I think that traditional data warehouse landing areas look a lot like data lakes, but they were built to accommodate a single consumer—the data warehouse. I do think there’s an opportunity to replace data warehouse landing areas with data lakes, and take some overhead out of the ecosystem.


From: Stephen Smith

Sent: Friday, March 23, 2018 2:27 PM

I’d still say that good practice would be to have some form of SLA in place for every data source coming into the enterprise, even if it just says: "No guarantees as to data quality" – at least that would be a start. Do you know of any companies that use SLAs with their internal data sources?

To your point about looking at data usage and then optimizing retrieval for that data and those queries: why are there no compilers that do a better job of this? This would effectively be an adaptive "data model compiler" – similar to what Akamai does for internet queries, building fast local caches to support lots of similar requests.
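
Purely as a toy illustration of that idea (the class, names, and threshold are invented and reflect no existing product), a store could watch which query shapes recur and materialize a pre-aggregated cache once a shape gets hot.

    from collections import Counter

    import pandas as pd

    class AdaptiveStore:
        """Toy 'data model compiler': caches results for frequently repeated query shapes."""

        def __init__(self, fact: pd.DataFrame, materialize_after: int = 3):
            self.fact = fact
            self.pattern_counts = Counter()
            self.caches = {}
            self.materialize_after = materialize_after

        def total_by(self, dim_col: str, measure: str) -> pd.Series:
            pattern = (dim_col, measure)
            if pattern in self.caches:          # serve from the "compiled" cache
                return self.caches[pattern]
            self.pattern_counts[pattern] += 1
            result = self.fact.groupby(dim_col)[measure].sum()
            if self.pattern_counts[pattern] >= self.materialize_after:
                self.caches[pattern] = result   # this access path is hot: cache it
            return result

    orders = pd.DataFrame({"region": ["East", "West", "East"], "amount": [100, 50, 75]})
    store = AdaptiveStore(orders)
    for _ in range(4):
        store.total_by("region", "amount")      # the fourth call is served from the cache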


From: Ted Hills

Sent: Friday, March 23, 2018 12:40 PM

Steve, great question on the need for an adaptive data model compiler. Vendors haven’t yet seen the need in the proper light. What you describe is similar to data virtualization, but unfortunately data virtualization’s performance was not stellar, and that has set us back from pursuing this goal. Additionally, each NoSQL vendor is so bullish about their one approach to physical data organization being superior to all others that they’re reluctant to pursue a logical data layer that can work across multiple physical data models.

I have a vision for the future that involves a data language that can describe data purely logically, but then can also map that logical data description to any physical data organization—tabular (traditional, key/value, columnar, or wide-column), document, knowledge graph, or property graph. In fact, those are the principles underlying the Concept and Object Modeling Notation (COMN), which is sort of a down payment on the language. Given such a language, DBMS vendors would stop competing on who has the “best” physical model—since that depends on the use case—and instead compete on who can most effectively translate one logical design to the most efficient physical organizations, complete with caching. That would be the “data model compiler” you’re envisioning. So, I’m totally on board with your suggestion.


Keep The Conversation Going And Ask Questions Below!