How Zone-based Data Processing Turns Your Monolithic Data Warehouse into a Flexible, Modern Data Architecture

ABSTRACT: A zone-based data refinery creates an agile, adaptable data environment that supports new and unanticipated business requirements quickly.

Many organizations are saddled with legacy data warehouses that are hard to update and maintain. Adding new data elements or data sources can take weeks, if not months, and requires extensive testing to ensure the changes don’t break anything. This frustrates business users, who want and need an agile data environment that adapts quickly to their changing requirements. As a result, most take matters into their own hands and build data silos to get their jobs done.

Feeling the brunt of customer dissatisfaction, data architects latch on to novel architectural approaches. Several years ago, many flirted with data lakes, which became data swamps. Today, some have embraced the data mesh and are praying it doesn’t turn into data mush.

Zone-based Data Processing

There is a better way, one that is tried and true and works with either on-premises or cloud data platforms. It’s called zone-based data processing, and it’s the hallmark of modern data architectures. The notion is simple: you reserve areas in your data platform for specific types of processing and data output. You move data from zone to zone to create the data output required to support a specific use case.

At Eckerson Group, we call this a data refinery. Just as an oil refinery processes a raw material (crude oil) into a multiplicity of outputs (e.g., gasoline, lubricant, asphalt, jet fuel), a data refinery converts raw data from source systems into clean data, integrated data, wide flat tables, dimensional data, graph data, and so on to support a variety of business use cases. A data refinery turns a monolithic data warehouse into a flexible data environment that gracefully adapts to new and unanticipated business requirements while maximizing reuse and standards. (See figure 1.)

Figure 1. Zone-Based Data Refinery

Although you can build a data refinery on premises, we highly recommend using cloud data platforms because they can spin up new virtual machines on demand. This lets architects add zones or scale processing within zones up or down as needed. In addition, the cloud makes it easier to virtualize the zones across different types of hardware, database engines, and even cloud platforms to gain maximum efficiency.

Types of Zones

Data architects use different names to describe various types of data processing zones, and some prescribe more or fewer zones depending on requirements. Cloud data engineers use extract, load, and transform (ELT) products such as Matillion to move data from zone to zone and transform it into its new state. Here are descriptions of four zones that exist in most modern data architectures:

  • Raw Zone: This is the most important zone because it stores raw data from source systems in its native form without transformation or cleansing. As a permanent store, the raw zone provides the foundation for all other zones, allowing data to be repurposed for new and unanticipated use cases. Data architects need to design a flexible ingestion process that captures data in batch, near real time, and real time (streaming) to support all use cases now and in the future.

  • Refined Zone: This zone takes data from the raw zone and cleans, integrates, and masters it into subject areas (e.g., customer or product). The refined zone provides the core building blocks for other zones and datasets.

  • Trusted Zone: This zone takes data from the refined zone and curates it. Data architects and stewards apply an extra layer of validations, checks, and approvals so the data can be exposed to business users and developers without issue.

  • Analytic Zone: Data in this zone has been explicitly modeled to support a business use case, such as an operational dashboard in the finance department. A data architect uses data from the refined or trusted zone to create a use case-specific schema or semantic layer. The analytic zone might contain dozens of schemas (i.e., data marts).
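
To make the flow through these four zones concrete, here is a minimal Python sketch. It is an illustration only, not a description of Matillion or any other ELT product: the sample records, field names, and the functions refine, curate, and build_mart are all hypothetical, and each zone is modeled as a plain in-memory list.

```python
from collections import defaultdict

# Hypothetical raw-zone records, stored exactly as they arrived from the source.
RAW_ZONE = [
    {"cust_id": "17", "name": " Ada Lovelace ", "amount": "120.50"},
    {"cust_id": "17", "name": "ada lovelace", "amount": "30"},
    {"cust_id": "42", "name": "Alan Turing", "amount": "not-a-number"},
]


def refine(raw_rows):
    """Raw -> refined: clean and standardize; the raw rows are never modified."""
    refined = []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            amount = None  # keep the row, but mark the value as unusable
        refined.append({
            "cust_id": int(row["cust_id"]),
            "name": row["name"].strip().title(),
            "amount": amount,
        })
    return refined


def curate(refined_rows):
    """Refined -> trusted: keep only rows that pass the validation checks."""
    return [r for r in refined_rows
            if r["amount"] is not None and r["amount"] >= 0]


def build_mart(trusted_rows):
    """Trusted -> analytic: model the data for one use case (spend per customer)."""
    totals = defaultdict(float)
    for r in trusted_rows:
        totals[(r["cust_id"], r["name"])] += r["amount"]
    return [{"cust_id": c, "name": n, "total_spend": t}
            for (c, n), t in sorted(totals.items())]


refined_zone = refine(RAW_ZONE)
trusted_zone = curate(refined_zone)
analytic_zone = build_mart(trusted_zone)
print(analytic_zone)
```

Because the raw zone is left untouched, a new use case can start again from RAW_ZONE with a different refine or build_mart step, which is the reuse the raw zone is meant to provide.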

Besides the above four zones, there are specialized zones that data architects create to support unique business requirements: 

  • Integration Zone: Also called a real-time delivery zone, this zone pulls data primarily from the raw zone, but from other zones as well, to support operational or near real-time use cases. This is what we used to call an operational data store.

  • Discovery Zone: Sometimes called a sandbox, this zone provides an isolated area in which authorized users (e.g., data scientists) can copy enterprise data and blend it with their own to run queries and train and test models. The goal is to explore and experiment, not create production systems, and thus, access typically expires after 90 days. 

  • Formatted Zone: This zone prepares data for distribution to a target data environment or application. For example, this zone will convert Parquet files into a format required by Amazon Redshift if that is the destination of a data pipeline.
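
The expiring access described for the discovery zone can be sketched in a few lines of Python. This is one possible way to track it, assuming a simple grant record per user and dataset; the SandboxGrant class, its fields, and the sample values are hypothetical, and a real platform would more likely enforce expiry through its own access-control features.

```python
from dataclasses import dataclass
from datetime import date, timedelta

SANDBOX_TTL = timedelta(days=90)  # the typical 90-day window mentioned above


@dataclass(frozen=True)
class SandboxGrant:
    """Hypothetical record of one user's access to a copied dataset."""
    user: str
    dataset: str
    granted_on: date

    @property
    def expires_on(self) -> date:
        return self.granted_on + SANDBOX_TTL

    def is_active(self, today: date) -> bool:
        # Access ends on the expiry date itself.
        return today < self.expires_on


def active_grants(grants, today):
    """Return only the grants that have not yet expired."""
    return [g for g in grants if g.is_active(today)]


grants = [
    SandboxGrant("data_scientist_a", "customer_copy", date(2023, 1, 1)),
    SandboxGrant("data_scientist_b", "product_copy", date(2023, 3, 1)),
]
still_active = active_grants(grants, today=date(2023, 4, 15))
print([g.user for g in still_active])
```

Running a check like this on a schedule is one way to keep sandboxes from quietly becoming production systems.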

Summary – Governance Needed!

A data refinery with zone-based data processing adds flexibility and reuse to a monolithic data environment, speeding development and increasing agility in ways that delight business users. It’s imperative, however, that a data architect governs the environment to preserve the integrity of each zone so that it remains true to its purpose. Without such governance and oversight, a data refinery will slowly ossify and turn into a monolithic environment that is hard to change and modify.
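
One lightweight way to support this kind of governance is to write down which zones may feed which, and check every pipeline step against that list. The Python sketch below assumes a hypothetical ALLOWED_SOURCES table: the entries for the refined, trusted, analytic, and integration zones follow the descriptions above, while the entries for the discovery and formatted zones are assumptions that an architect would need to set for their own environment.

```python
# Which zones may feed each target zone. Entries marked "assumption" are not
# stated in the article and are placeholders for a local policy decision.
ALLOWED_SOURCES = {
    "raw": set(),                                   # loaded only from source systems
    "refined": {"raw"},
    "trusted": {"refined"},
    "analytic": {"refined", "trusted"},
    "integration": {"raw", "refined", "trusted"},   # primarily raw, others allowed
    "discovery": {"raw", "refined", "trusted", "analytic"},  # assumption
    "formatted": {"refined", "trusted", "analytic"},         # assumption
}


def check_pipeline(steps):
    """Return the zone-to-zone moves that break the policy.

    `steps` is a list of (source_zone, target_zone) pairs.
    """
    violations = []
    for source, target in steps:
        if source not in ALLOWED_SOURCES.get(target, set()):
            violations.append(f"{source} -> {target}")
    return violations


good = [("raw", "refined"), ("refined", "trusted"), ("trusted", "analytic")]
bad = [("raw", "analytic"), ("analytic", "raw")]

print(check_pipeline(good))
print(check_pipeline(bad))
```

A check like this could run in code review or as a deployment step, so that a pipeline that skips a zone or writes back into the raw zone is flagged before it erodes the purpose of each zone.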

Wayne Eckerson

Wayne Eckerson is an internationally recognized thought leader in the business intelligence and analytics field. He is a sought-after consultant and noted speaker who thinks critically, writes clearly and presents...