Data Vault Modeling Increases Data Warehouse Agility


What is Data Vault?

When talking about data modeling for data warehousing, most organizations implement either a dimensional (Ralph Kimball [1]) or a normalized (Bill Inmon [2]) modeling technique. Both approaches have been around for more than 20 years and have proven their practical use over time. In the last several years, however, market forces have made it imperative for business intelligence and analytics processes to become more agile. This trend comes with many challenges. One major issue is that dimensional and normalized data models are not built for rapid change. For example, a BI manager I talked to recently told me that his company can take a few months to add a new field in the data warehouse due to governance processes and extensive testing.

These types of problems have spurred interest in the Data Vault approach. Dan Linstedt, the inventor of Data Vault, describes his method like this:

“… It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.” [3]

The underlying idea of Data Vault is to separate the more stable part of entities (business keys and relations) from their descriptive attributes. For this purpose, it introduces the concepts of hubs, links, and satellites.

A hub contains the actual business key (one or more fields that uniquely identify an entity for the business, e.g. a customer number) and a surrogate key that is used to connect this table with other structures (equivalent to a primary key). Moreover, it can contain metadata such as timestamps or information about the data source. Links then connect hubs via a simple many-to-many table consisting of the respective surrogate keys.

In this way, hubs and links represent the more stable part of a model and are enriched with so-called satellites. A satellite is connected to a hub via its surrogate key and contains one or more descriptive attributes, usually grouped by source system, business context, or rate of change. Additionally, a satellite table can include different types of metadata, such as validity periods and information about its source.
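The hub/link/satellite pattern can be sketched as plain relational tables. The following is a minimal illustration using Python's built-in SQLite; all table and column names (`hub_customer`, `customer_sk`, and so on) are hypothetical naming conventions chosen for this example, not a fixed Data Vault standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: surrogate key + business key + load metadata
CREATE TABLE hub_customer (
    customer_sk   INTEGER PRIMARY KEY,
    customer_no   TEXT NOT NULL UNIQUE,  -- business key
    load_date     TEXT,
    record_source TEXT
);
CREATE TABLE hub_order (
    order_sk      INTEGER PRIMARY KEY,
    order_no      TEXT NOT NULL UNIQUE,
    load_date     TEXT,
    record_source TEXT
);

-- Link: a simple many-to-many table of surrogate keys
CREATE TABLE link_order_customer (
    order_sk    INTEGER REFERENCES hub_order(order_sk),
    customer_sk INTEGER REFERENCES hub_customer(customer_sk),
    load_date   TEXT
);

-- Satellites: descriptive attributes, grouped by source or context
CREATE TABLE sat_customer_master (
    customer_sk INTEGER REFERENCES hub_customer(customer_sk),
    name        TEXT,
    city        TEXT,
    load_date   TEXT
);
CREATE TABLE sat_customer_web (
    customer_sk INTEGER REFERENCES hub_customer(customer_sk),
    smartphone  TEXT,
    load_date   TEXT
);
""")
```

Note how the hubs and the link carry nothing but keys and metadata, while everything that is likely to change lives in the satellites.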

Figure 1 shows a Data Vault model that depicts a very simple order hub and a customer hub as well as related links and satellites. In this model, the customer hub has two satellites: one with master data that possibly comes from the CRM system, and one with an attribute called smartphone that may have come from a web analytics system.

Figure 1. Sample Data Vault Model

Why Data Vault? 

The main driver for Data Vault is the need for agility while keeping data consistent and preserving changes. The idea of separating the stable and the temporal parts of a data model provides a great solution for that issue by making the model updateable and extensible. In addition, with a few extra metadata fields, such as “load_date”, “valid_from” or “valid_to”, it is easy to keep track of volatile data and manage slowly changing dimensions without deleting data.
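A short sketch of that idea, again with SQLite for illustration: instead of overwriting a satellite row when an attribute changes, the old version is closed via “valid_to” and a new version is appended. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_customer_master (
    customer_sk INTEGER,
    city        TEXT,
    load_date   TEXT,
    valid_from  TEXT,
    valid_to    TEXT   -- NULL marks the currently valid version
);
""")

# Initial load: customer 1 lives in Berlin.
conn.execute("INSERT INTO sat_customer_master VALUES (1, 'Berlin', '2020-01-01', '2020-01-01', NULL)")

# The customer moves: close the old version, then append a new one.
conn.execute("UPDATE sat_customer_master SET valid_to = '2021-06-01' "
             "WHERE customer_sk = 1 AND valid_to IS NULL")
conn.execute("INSERT INTO sat_customer_master VALUES (1, 'Hamburg', '2021-06-01', '2021-06-01', NULL)")

# The full history is preserved; the current state is the open row.
current = conn.execute("SELECT city FROM sat_customer_master "
                       "WHERE customer_sk = 1 AND valid_to IS NULL").fetchone()
print(current[0])  # Hamburg
```

Nothing is ever deleted, so both the current state and the full change history can be queried from the same satellite.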

Another benefit of Data Vault is simplified data integration. On the one hand, satellites provide a flexible way to manage changing data schemas: new fields can simply be added in a new satellite, with no need to modify existing data structures. Moreover, because satellites are isolated from one another, they can be updated without mutual interference. On the other hand, Data Vault models can easily produce different data structures by joining varying satellites depending on the requirements. In this way, Data Vault increases flexibility by moving the actual integration from the loading phase to the reading phase (also known as schema-on-read).
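The read-time integration can be sketched as follows: two satellites that were loaded independently (here standing in for a CRM feed and a web analytics feed) are only combined when a query joins them through the hub's surrogate key. Again, the names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer        (customer_sk INTEGER PRIMARY KEY, customer_no TEXT);
CREATE TABLE sat_customer_master (customer_sk INTEGER, name TEXT);        -- e.g. from CRM
CREATE TABLE sat_customer_web    (customer_sk INTEGER, smartphone TEXT);  -- e.g. from web analytics
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'C-1001')")
conn.execute("INSERT INTO sat_customer_master VALUES (1, 'Alice')")
conn.execute("INSERT INTO sat_customer_web VALUES (1, 'yes')")

# Each satellite can be loaded or changed independently; the
# integrated view only materializes in the query result.
row = conn.execute("""
    SELECT h.customer_no, m.name, w.smartphone
    FROM hub_customer h
    LEFT JOIN sat_customer_master m ON m.customer_sk = h.customer_sk
    LEFT JOIN sat_customer_web    w ON w.customer_sk = h.customer_sk
""").fetchone()
print(row)  # ('C-1001', 'Alice', 'yes')
```

The LEFT JOINs mean a customer still appears in the result even if one of the source systems has not delivered a satellite row yet.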

Since the flexible combination of hubs and satellites requires a large number of joins when reading data, the main criticism of Data Vault is performance. Consequently, most Data Vaults are used to support data lakes and staging areas that feed other structures designed for fast, interactive user queries. Moreover, these underlying layers often run on massively parallel processing (MPP) databases and scale-out infrastructure, where joins are much less of a performance factor due to the way MPP works [4].

A data warehouse architecture based on Data Vault

Figure 2 illustrates a data warehouse architecture entailing Data Vault that is based on my observations in companies and discussions with BI experts.

On the lowest level, the architecture comprises the actual data warehouse (also referred to as operational data storage, staging area, or data lake), which is based on Data Vault. On top, a business warehouse provides a clean subset of the data that is highly governed and modeled in the third normal form according to Inmon. On the highest abstraction level, data cubes contain application-dependent data extracts modeled using Kimball’s dimensional approach.

The interesting part of this approach is the distinction between a data and a business warehouse, which enables a new level of agility. In the data warehouse (Data Vault), new data sources can easily be attached and existing data schemas can be changed without impacting the business warehouse. After changes have been checked and approved according to governance rules, they can be migrated to the business warehouse (when necessary).

Figure 2. Data warehouse architecture entailing Data Vault

The separation of a data and a business warehouse is also useful in the context of big data, where it can be pointless to load massive data sets into a third-normal-form business warehouse. Another question that often arises in relation to big data is what to do with data that is not needed immediately but could be of use sometime in the future. With the depicted architecture, you can store such data in your data warehouse and think about integration later.

When it comes to implementation, I have seen various approaches regarding the interaction between the layers. Most common is batch-based ETL that transfers data from a Data Vault storage to a business warehouse. However, there are also approaches where the business warehouse itself is implemented with Data Vault (discussed under the term Business Vault [5]), or where all superior structures are created on the fly with data virtualization on top of the Data Vault warehouse (which increases agility even further [6]).

Summary

I am convinced that we will see more organizations adopt Data Vault modeling in the coming years. And we’ll also see Data Vault modeling used in other contexts besides data warehousing, such as streaming analytics and the Internet of Things.

 

Further reading

[1] Kimball, Ralph (1996). The Data Warehouse Toolkit. Wiley.

[2] Inmon, Bill (1992). Building the Data Warehouse. Wiley.

[3] Dan Linstedt: Data Vault Series 1 – Data Vault Overview
http://tdan.com/data-vault-series-1-data-vault-overview/5054

[4] Dan Linstedt: Data Vault has Many Joins: Part 1
http://danlinstedt.com/allposts/datavaultcat/data-vault-has-many-joins-part-1/

[5] Kent Graziano: The Business Data Vault
http://www.vertabelo.com/blog/technical-articles/data-vault-series-the-business-data-vault

[6] Bob Eve: Data Vault and Data Virtualization: Double Agility
http://blogs.cisco.com/datacenter/data-vault-and-data-virtualization-double-agility

Julian Ereth

Julian Ereth is a researcher and practitioner in the field of business intelligence and data analytics.
