Data Virtualization’s Extending Reach

In today’s highly distributed, multi-platform world, the data needed to solve any particular decision-making need is increasingly likely to be spread across a wide variety of sources. As a result, traditional manual approaches, which require prior collection, storage and integration of extensive sets of data in the analyst’s preferred exploration environment, are becoming less useful. Data virtualization, which offers transparent access to distributed, diverse data sources, provides a valuable alternative in these circumstances.

Data virtualization is not wholly new; it is today’s preferred moniker for the broad set of functions allowing access to disparate and distributed data. It has existed under various names and in various forms since relational databases and distributed computing environments emerged in the 1980s. Large enterprises, in particular, soon began to store data in different systems, usually for better update or access performance, only to realize rapidly that such data often needed to be used together. Federated data systems, both homogeneous (among instances of a single RDBMS) and heterogeneous (among disparate RDBMSs or even different types of DBMS), were developed by the early 1990s and even featured in IBM’s Information Warehouse announcement of 1991. However, such approaches were generally dismissed by the broader data warehouse industry for almost 20 years, due to issues with data consistency and cleanliness, as well as security and performance impacts.

Enterprise Information Integration (EII) represented a second wave of interest in the topic in the late 1990s and early in the new millennium. Again, IBM invested in the technology, under the banner of DB2 (later WebSphere) Information Integrator, extending the scope of integration to file stores and other “non-DBMSs”. Information Integrator was withdrawn in 2006 and replaced by the WebSphere (later InfoSphere) Federation Server and Replication Server products.

Of course, data virtualization functionality is today widely offered by various database and non-database vendors. My focus above on IBM’s history in the space (beyond my personal involvement in it at the time) is because IBM has this year introduced another product offering virtualization function: IBM Fluid Query, based on the IBM PureData System for Analytics (built on Netezza technology) and IBM BigInsights (built on Hadoop) platforms.

My new white paper (also available on my website) describes the problems of multi-source data discovery through the eyes of a typical business analyst, highlighting the difficulties encountered when using traditional manual methods of prior data integration. A modern, high-level architecture, the integrated information platform, is introduced that positions and explains data virtualization. The value of this technology in delivering business insight from data is then presented.

[Figure: data virtualization in the integrated information platform]

The architecture shown in the accompanying figure is based on that developed in my book “Business unIntelligence”, emphasizing that there exist at least three very different data pillars in which today’s data resides: the relational environment containing core business data for operations and reporting; fast and specialty machine-generated data (think Internet of Things); and deep analytic, human-sourced information (think social media). These three storage and processing environments comprise an integrated information system that demands data virtualization functionality to allow easy access to and use of all the data and information resources by business users and applications alike.
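To make the idea concrete, the sketch below shows how the two non-relational pillars might be exposed as virtual tables at a single relational endpoint, alongside the core business data. It uses generic SQL/MED-style foreign-table syntax; the server, table and column names are hypothetical illustrations, and the actual mechanism and syntax vary by product.

    -- A minimal sketch: exposing the non-relational pillars as virtual
    -- tables alongside core relational data. All names are hypothetical,
    -- and the DDL varies by virtualization product.

    -- Pillar 2: machine-generated data (e.g. sensor events).
    CREATE FOREIGN TABLE sensor_readings_v (
        device_id  VARCHAR(32),
        reading_ts TIMESTAMP,
        metric     VARCHAR(64),
        value      DOUBLE PRECISION
    ) SERVER machine_data_srv;              -- hypothetical remote source

    -- Pillar 3: human-sourced information landed in Hadoop.
    CREATE FOREIGN TABLE product_reviews_v (
        product_id  INTEGER,
        review_ts   TIMESTAMP,
        rating      SMALLINT,               -- e.g. 1 (poor) to 5 (excellent)
        review_text VARCHAR(4000)
    ) SERVER hadoop_srv;                    -- hypothetical Hadoop source

    -- Local warehouse tables (pillar 1) and these foreign tables can now
    -- be queried and joined as if they were one database.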

The value of data virtualization technology is summarized in five broad use cases:

  1. Light-speed business. From the data warehouse, a business analyst sees a snapshot of the business as of the end of yesterday, due to business requirements or technical limitations. The analyst arrives at a problem determination and remedial action, but cannot instantly see the results of that action in the operational environment. Data virtualization allows operational data to be queried and combined with the data warehouse, without adding real-time loading capabilities and costs to the warehouse.
  2. Bridging relational data islands. Most enterprises run multiple relational databases, dictated by application needs or historical developments. Data virtualization offers the ability to bridge across relational database systems to access and combine the data required.
  3. Creating deep context. Numbers from the data warehouse alone don’t always tell the full story. Consider the query: “What are our top selling products that get good or better reviews?” This query spans both core business data and human-sourced information: top selling comes from sorting sales volumes in the data warehouse or mart, while good or better reviews requires analysis of social media and other data landed in Hadoop. In many cases, textual (or even image) information creates a much deeper context around the numbers for business. This type of analysis allows SQL queries originating in the relational environment to access the big data landed in Hadoop stores, pushing processing down to MapReduce and taking advantage of the power of the applications that reside there, such as pattern recognition and predictive analytics. A sketch of such a query appears after this list.
  4. Agile activities. Who has time to build a new data mart every time the business needs expand or change? Sometimes you just need an extra detail or two on an existing data mart. Or maybe you simply want to check if adding a new column would be valuable. The ability to join data from one mart to that in another can give quick answers, discovering if the result is useful without requiring an IT project to create a new mart. Agility to react to new needs or changes is vital today.
  5. Historical support. In the big data era, it makes sense to regularly move old, less frequently used data to a cheaper environment. But just because the data is less popular doesn’t mean the business will be happy if it takes hours to retrieve when needed. Moving older data regularly to Hadoop, while allowing access through a virtualized query from the data warehouse, offers the best of both worlds: lower storage costs with seamless user access to the data when needed. A sketch of such a spliced view also follows this list.
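To make the deep-context use case concrete, here is a minimal sketch of such a query in generic SQL. All table and column names (sales_fact, product_dim, product_reviews_v, rating and so on) are hypothetical illustrations, not Fluid Query specifics; product_reviews_v is the Hadoop-backed foreign table sketched earlier.

    -- Use case 3: top selling products with good or better reviews.
    -- A rating of 4 or 5 (on a 1-5 scale) stands in for "good or better".
    SELECT p.product_name,
           SUM(s.sales_amount) AS total_sales,
           AVG(r.rating)       AS avg_rating
    FROM   sales_fact s                        -- warehouse-resident
    JOIN   product_dim p  ON p.product_id = s.product_id
    JOIN   product_reviews_v r                 -- Hadoop-resident, virtualized
           ON r.product_id = s.product_id
    GROUP  BY p.product_name
    HAVING AVG(r.rating) >= 4                  -- good or better
    ORDER  BY total_sales DESC
    FETCH  FIRST 10 ROWS ONLY;                 -- top sellers

The single statement reads as ordinary SQL; the virtualization layer decides which parts run in the warehouse and which are pushed down to Hadoop.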
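Similarly, the historical-support use case can be sketched as a view that splices warm warehouse data onto cold data offloaded to Hadoop; again, all names are assumptions for illustration only.

    -- Use case 5: one logical sales table spanning warm and cold storage.
    -- sales_fact holds recent rows in the warehouse;
    -- sales_fact_archive_v is a hypothetical foreign table over
    -- older rows offloaded to Hadoop.
    CREATE VIEW sales_all AS
        SELECT * FROM sales_fact               -- recent, warehouse-resident
        UNION ALL
        SELECT * FROM sales_fact_archive_v;    -- older, Hadoop-resident

    -- Reports query sales_all without knowing (or caring) where rows live.
    SELECT order_year, SUM(sales_amount) AS yearly_sales
    FROM   sales_all
    GROUP  BY order_year;

Users see one logical table, while the cheaper platform quietly holds the bulk of the history.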

My white paper explores IBM’s new offering, IBM Fluid Query, in the light of the architectural considerations and use cases above. It lists the data virtualization functionality offered through IBM PureData System for Analytics, IBM BigInsights and other IBM data management products, and shows the direction of IBM’s thinking in this emerging area of data management function and value enablement.

Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988....
