Data Virtualization in Business Intelligence and Analytics
What is data virtualization?
The main idea of data virtualization is to provide a unified, abstracted, and encapsulated view of data coming from a single data store or a heterogeneous set of data stores [1, 2]. To do so, it applies the concept of virtualization, which is well established in the IT industry, where all kinds of resources are virtualized, e.g. multiple virtual machines sharing one physical infrastructure.
In the domain of business intelligence and analytics (BIA), data virtualization and data federation are often used as synonyms. However, since data federation presupposes multiple data sources, it is rather one aspect of the more general idea of data virtualization, which can also be applied to a single data source. Data virtualization covers a broad spectrum and is also closely related to topics like Enterprise Information Integration, logical data warehousing, data encapsulation and service-oriented architectures. (If you want to learn more about the interplay of these terms, I recommend Rick van der Lans's book [1].)
From an architectural perspective, data virtualization is usually located in an additional layer between data consumers and data sources (see Figure 1). In BIA, data consumers are usually reporting or analytics tools, while data sources can be data warehouses, data marts, and even web services or classical spreadsheets. The virtualization layer abstracts and integrates these data sources so that it becomes a single entry point for queries coming from data consumers. To do so, the virtualization layer first processes and distributes incoming queries (from data consumers) and then integrates and transforms the returning results (from data sources).
Figure 1. Typical structure of a data virtualization setup
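To make this structure more concrete, here is a minimal sketch in Python of what the virtualization layer does conceptually: it offers a single entry point, distributes an incoming query to several sources, and integrates the partial results. The two in-memory SQLite "source systems" and all table and function names are made up for illustration; real data virtualization servers are, of course, far more sophisticated.

```python
import sqlite3

# Two independent "source systems", simulated here as in-memory SQLite databases.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
dwh.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EMEA", 120.0), ("APAC", 80.0)])

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
crm.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EMEA", 30.0), ("AMER", 95.0)])

def virtual_query(sql):
    """Single entry point: distribute the query to all sources,
    then integrate the partial results into one answer."""
    rows = []
    for source in (dwh, crm):
        rows.extend(source.execute(sql).fetchall())
    return rows

# The consumer sees one logical 'sales' relation spanning both systems.
print(virtual_query("SELECT region, revenue FROM sales"))
```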
For these tasks, data virtualization servers usually contain a design module to define the underlying mappings and transformation processes, as well as a runtime module that actually does the processing later on. The design module is typically a straightforward modelling tool for defining data mappings and transformation routines, similar to other ETL tools but with some specialties of data virtualization (e.g. the possibility to define different caching strategies). The runtime module uses caching and query optimization methods to process queries and data efficiently. For instance, it decides which join method to use, or whether certain results or slowly changing dimensions can be temporarily stored to speed up recurring queries.
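As a hedged illustration of such a runtime decision, the following sketch implements a simple time-to-live caching strategy for recurring queries. The TTL value and all function names are assumptions for this example, not taken from any particular product.

```python
import time

CACHE_TTL_SECONDS = 60   # assumed caching strategy: results expire after 60 seconds
_cache = {}              # maps query text -> (timestamp, result)

def run_against_sources(sql):
    # Placeholder for the actual distribution/integration logic shown above.
    return [("EMEA", 150.0)]

def cached_query(sql):
    """Serve recurring queries from the cache while the entry is fresh;
    otherwise re-execute against the sources and refresh the cache."""
    now = time.time()
    if sql in _cache:
        timestamp, result = _cache[sql]
        if now - timestamp < CACHE_TTL_SECONDS:
            return result                 # cache hit: no source access needed
    result = run_against_sources(sql)     # cache miss: query the sources
    _cache[sql] = (now, result)
    return result

print(cached_query("SELECT region, revenue FROM sales"))  # hits the sources
print(cached_query("SELECT region, revenue FROM sales"))  # served from cache
```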
Depending on how capable a data source is, the data virtualization server has more or less work to do. If, for instance, the source is a full-fledged data store with sophisticated query capabilities, the virtualization server can delegate a lot of processing to the source system. However, if a source offers little or no query functionality (e.g. a spreadsheet), the processing has to be done in the virtualization layer itself. One issue with this is that often (large) data sets have to be transferred entirely before they can be processed. To deal with this, some products offer the possibility to place subsidiary components that process data near the source and transfer only filtered subsets to the virtualization layer (similar to the idea of Analytical Atoms discussed in my last article [3]).
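The difference between the two cases can be sketched as follows: for a capable source the filter is pushed down into the source query, so only matching rows travel, while for a "dumb" source (here a CSV spreadsheet) the whole data set has to be transferred and filtered inside the layer. All names and data are again invented for illustration.

```python
import csv
import io
import sqlite3

# A capable source (SQL database) and a "dumb" source (a CSV spreadsheet).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 500.0)])

spreadsheet = io.StringIO("id,amount\n3,20.0\n4,900.0\n")

def query_orders(min_amount):
    rows = []
    # Capable source: push the filter down, so only matching rows are transferred.
    rows += db.execute(
        "SELECT id, amount FROM orders WHERE amount >= ?", (min_amount,)
    ).fetchall()
    # Dumb source: the whole data set is transferred, then filtered in the layer.
    spreadsheet.seek(0)
    for record in csv.DictReader(spreadsheet):
        if float(record["amount"]) >= min_amount:
            rows.append((int(record["id"]), float(record["amount"])))
    return rows

print(query_orders(100.0))   # -> [(2, 500.0), (4, 900.0)]
```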
In addition to the actual integration aspect, many data virtualization products provide further functionality such as advanced security for authorization and authentication or tools for monitoring and load balancing.
Applications of data virtualization in business intelligence and analytics
Besides the obvious advantage of integrating heterogeneous data sources, the main benefit of data virtualization in BIA is the simplification of existing environments. The rationale is that with a virtualized structure, BIA environments become more agile and easier to manage. However, as always, there are two sides to the coin.
At first sight, manageability and cost seem to benefit from data virtualization. Firstly, maintenance of metadata can be simplified with a central metadata repository in the virtualization server that is shared across all data sources. Moreover, an adequate data virtualization architecture usually makes some ETL scripts, data marts or even whole systems redundant, and thereby reduces complexity as well as license and maintenance costs. Lastly, the whole environment becomes more agile, as it is easier to change data structures or migrate technology when all processing happens on a logical level in the virtualization layer. However, data virtualization also adds another layer to the BIA architecture, and a data virtualization server has to be managed and maintained and usually comes with license costs of its own. Furthermore, because of a lack of standards, most data virtualization products come with proprietary tools and development languages. This in turn leads to migration costs as well as the risk of vendor lock-in.
Another heavily discussed point in data virtualization is performance. As described above, data virtualization adds an extra layer that processes data and naturally requires resources. How much depends on various factors. First, most of the ad-hoc processing happens in-memory, which is very fast but requires a suitable infrastructure and enough RAM. Secondly, the distribution of work matters: if the virtualization layer can push most of the work down to the data sources, there should not be a big difference compared to classical ETL processing. However, if a lot of processing takes place in the virtualization layer itself, query performance depends on how sophisticated the transformations are, how many and which data sources are affected, and how efficiently caching methods can be exploited. For instance, it makes a big difference whether a query needs real-time information, so that all data has to be loaded from the source systems again and again, or whether certain parts can be preprocessed and cached. Nevertheless, the performance impact is usually not critical, especially in relation to the benefits that come with data virtualization.
There are many uses for data virtualization in the domain of BIA. Below are some common use cases, ranging from very specialized sandboxes up to entirely virtualized logical data warehouses:
1. Virtual analytical sandbox
A sandbox is an isolated environment for developers or data scientists. It can be used to explore data, develop or test new models and prototypes, or for similar exploratory purposes. Such sandboxes are often temporary, e.g. existing only for the course of a certain project. With data virtualization, sandboxes can be created quickly and changed rapidly without reloading all the data, as the sketch below illustrates.
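As a minimal sketch of the idea, assuming a SQL-capable virtualization layer (here simulated with SQLite), a sandbox is essentially just a set of views, so creating, changing, and tearing it down is cheap and never copies the underlying data. The view and table names are purely illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EMEA", 120.0), ("APAC", 80.0)])

# Creating the sandbox is just defining a view; no data is copied or reloaded.
con.execute("CREATE VIEW sandbox_emea AS "
            "SELECT * FROM sales WHERE region = 'EMEA'")
print(con.execute("SELECT * FROM sandbox_emea").fetchall())  # [('EMEA', 120.0)]

# Changing or removing the sandbox after the project is equally cheap.
con.execute("DROP VIEW sandbox_emea")
```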
2. Virtualized data marts
The main purpose of data marts is to provide an analytics-ready, application-specific excerpt of data (e.g. from a data warehouse). With data virtualization, such data marts can be set up and managed easily. The main benefit is that loading scripts become unnecessary and, depending on the caching strategy, a data mart can even be up to date at any time, as the following sketch shows.
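The following sketch illustrates this, again using SQLite as a stand-in for the virtualization layer: the "data mart" is an aggregated view computed at query time, so new source data is visible immediately without any loading script. All names are made up for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.execute("INSERT INTO orders VALUES ('EMEA', 100.0)")

# The 'data mart' is an aggregated view; no loading script is needed.
con.execute("CREATE VIEW mart_revenue AS "
            "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region")
print(con.execute("SELECT * FROM mart_revenue").fetchall())  # [('EMEA', 100.0)]

# New source data shows up immediately, without any reload.
con.execute("INSERT INTO orders VALUES ('EMEA', 50.0)")
print(con.execute("SELECT * FROM mart_revenue").fetchall())  # [('EMEA', 150.0)]
```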
3. Virtual/Logical data warehouse
Thinking about virtualized data marts quickly leads to the idea of a completely virtualized data warehouse. There are different approaches to such logical data warehouses.
One is to attach the virtualization layer directly on top of the source systems. Whenever a query needs some data, the virtualization layer directly accesses the source's database, e.g. of ERP or CRM systems. In practice, this approach can lead to problems: firstly, there is a high risk that analytical queries interfere with the performance of the source systems. Secondly, some systems may not keep a long-term history of their data (e.g. operational manufacturing execution systems), which prevents analyses over time and can lead to incorrect values.
This is why another approach is to persist all raw data coming from all sources in a central repository (often referred to as a data lake), and then use data virtualization to access and transform that data in an analytical way (newer modelling techniques such as Data Vault are often used to accomplish this [4, 5]).
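As a simplified sketch of this second approach, assume the raw data is persisted in Data-Vault-style hub and satellite tables (which keep the full history of changes), while the virtualization layer exposes an analytics-friendly view of the current state. The table and column names are assumptions for this illustration, not a full Data Vault model.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Raw data persisted centrally in Data-Vault-style tables (hub + satellite);
# the satellite keeps the full history of attribute changes.
con.execute("CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY)")
con.execute("CREATE TABLE sat_customer (customer_key TEXT, name TEXT, load_ts TEXT)")
con.execute("INSERT INTO hub_customer VALUES ('C1')")
con.executemany("INSERT INTO sat_customer VALUES (?, ?, ?)",
                [("C1", "Acme Ltd", "2015-01-01"),
                 ("C1", "Acme Inc", "2016-06-01")])

# The virtualization layer exposes an analytics-friendly view of the latest state.
con.execute("""
    CREATE VIEW dim_customer AS
    SELECT h.customer_key, s.name
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_key = h.customer_key
    WHERE s.load_ts = (SELECT MAX(load_ts) FROM sat_customer
                       WHERE customer_key = h.customer_key)
""")
print(con.execute("SELECT * FROM dim_customer").fetchall())  # [('C1', 'Acme Inc')]
```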
4. Aggregated views across multiple systems and external parties
Data virtualization also supports analytics scenarios where data has to be shared with external systems and across company boundaries. On the one hand, most data virtualization products can easily create custom web services that allow external access to virtualized views. On the other hand, data virtualization can be used to integrate data from various heterogeneous data sources distributed over multiple organizations, e.g. to enable comprehensive process mining or analytics within a group of companies.
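As a minimal sketch of the first idea, the following example exposes a virtualized view as a small JSON web service using only Python's standard library. Real products generate such services for you; the view, endpoint, and data here are purely illustrative.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
con.execute("INSERT INTO sales VALUES ('EMEA', 120.0)")
# The view restricts what the external party is allowed to see.
con.execute("CREATE VIEW partner_view AS SELECT region, revenue FROM sales")

class ViewHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the virtualized view as JSON to the external consumer.
        rows = con.execute("SELECT region, revenue FROM partner_view").fetchall()
        body = json.dumps([{"region": r, "revenue": v} for r, v in rows]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# Blocks and serves http://localhost:8000/ until interrupted.
HTTPServer(("localhost", 8000), ViewHandler).serve_forever()
```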
Summary
This article outlined what data virtualization is and how it can be used in business intelligence and analytics. However, this was just a brief overview of the underlying idea and common use cases. There is much more depth to the technology, and there are many other scenarios for data virtualization.
Due to the advantages that come with virtualization and the fast-moving technological developments, I think that data virtualization will establish itself as a fixture in most BIA architectures. Furthermore, it can be a key technology for specific future use cases, e.g. process mining in the Internet of Things. However, even though there are promising approaches to completely virtualized data warehouses, I think certain obstacles still have to be overcome before we see entirely virtualized environments, especially in more complex BIA scenarios.
Further reading
[1] van der Lans, R. F. (2012): Data Virtualization for Business Intelligence Systems.
[2] White, C. (2005): Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated Enterprise. Business Intelligence Journal, 10(1).
[3] Ereth, J. (2016): Edge Analytics in the Internet of Things, http://eckerson.com/articles/edge-analytics-in-the-internet-of-things
[4] Ereth, J. (2016): Data Vault Modeling Increase Data Warehouse Agility, http://eckerson.com/articles/data-vault-modeling-increase-data-warehouse-agility
[5] Linstedt, D. & Olschimke, M. (2015): Building a Scalable Data Warehouse with Data Vault 2.0.