A Fresh Look at Data Modeling Part 1: The What and Why of Data Modeling
ABSTRACT: Many organizations abandoned data modeling in recent years. They are now finding that it really is needed. But today’s data modeling is different from that of the past.
Many organizations abandoned data modeling as they shifted from the data management practices of the past to big data, data lake, and NoSQL technologies. Past practices focused on relational data and were typically limited to logical and physical design of new databases. Today’s data modeling has a much larger scope, driven by many factors. These include advances in analytics and data science, rapid growth in the volume and variety of data, a shift from working primarily with enterprise-generated data to acquiring large amounts of external data, semantic disparity of operational data as operational systems become predominantly SaaS applications, and the pursuit of data lakes and NoSQL technologies.
These factors influence data modeling practices in three significant ways: (1) modeling to understand the content and structure of existing and acquired data as well as modeling to design new databases, (2) semantic and conceptual modeling as well as logical and physical modeling, and (3) modeling for all types of data, including key-value, document-oriented, knowledge graph, and property graph data.
With those differences in mind, my goal with this article is to make the case that data modeling is not dead. It is more important than ever before. And it is more interesting than ever before.
The Data Modeling Process
Data modeling is the process of constructing data models. That simple definition expresses the reality, but not the complexities, of data modeling. It is important to recognize that a data model is more than a diagram. It is a description of the content and structure of a collection of data: a diagram (or set of diagrams) supported by descriptive text and definitions. Furthermore, it describes that content and structure from a particular perspective – semantic, business, system, or technical. Those perspectives partially align with the multiple levels of data modeling that have been practiced for decades. (See figure 1.)
Figure 1. Levels of Data Modeling Past and Present
Today’s data modeling practices expand on those of the past in several significant ways. The addition of a semantic modeling level is the most visible change from past to present. Semantic modeling provides context for other data representations in the form of controlled vocabulary and shared meaning (more about this later).
The flow of the modeling process has also changed from past to present. Traditional data modeling was performed in a top-down sequence – from conceptual, to logical, and then to physical – with conceptual modeling often neglected and sometimes only physical modeling practiced. Modern data modeling is a more fluid process – sometimes top-down, sometimes bottom-up, and sometimes middle-out. It begins at the levels where you have knowledge of data, and progresses to the levels where you need to understand and document data content and structure.
When modeling for a new application, you may follow a top-down process from conceptual (business concept) to logical (system design), and then to physical (technical design). When seeking to understand undocumented data, you may begin at technical specification (e.g., discover the schema), then reverse engineer physical and logical models. When working to achieve data interoperability, you may begin with existing physical models, create logical models to understand the data in business system context, then create a semantic model to establish controlled vocabulary and shared meaning across diverse data sources. These are but a few examples of a fluid modeling process. Begin with what you know and model to fill the gaps where you need additional knowledge.
The Structure of Data
Earlier in this article, I defined a data model as a description of the content and structure of a collection of data. That definition typically causes someone to ask “What about unstructured data?” It seems to be impractical to describe the structure of something that is defined as unstructured. That is a common reason (or excuse) to avoid the work of data modeling. It is a common reason, but it is not a good reason because all data has structure.
Yes, all data has structure. Unstructured data does not exist. Stop thinking unstructured, and instead think differently structured data. The term structured data is typically used to describe data that is organized as rows and columns—most commonly relational data. Structured data also includes multidimensional data, which is relational data with additional constraints.
Much of the data that is managed today is not organized as rows and columns. Yet we process, analyze, and derive value from this data. Processing and analysis always require that we understand how the data is organized. That organization is the structure of the data. It is not unstructured data. It is differently structured data. There are six common data structure patterns. (See figure 2.) Only two of those patterns conform to the commonly accepted meaning of structured data. Understanding the variety of data structure patterns is a key to modern data modeling. You need to know the type of data structure to determine the right types of data models to represent the structure.
Figure 2. Data Structure Patterns
Relational and dimensional structures organize data as rows and columns. These data structures are readily modeled using long-standing entity-relationship and dimensional modeling techniques. Although the modeling techniques have existed for decades, in most organizations the practice is typically limited to physical design of database tables. Expanding the practice to include logical and conceptual modeling helps to resolve data disparity and improve data integration efforts. Reverse engineering models from tables in SaaS and ERP systems captures knowledge needed for data integration and interoperability efforts.
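As a minimal sketch of reverse engineering a model from existing tables, the Python example below reads table, column, and foreign-key metadata from a database catalog. It assumes an in-memory SQLite database as a stand-in for a real SaaS or ERP data store; the `customer` and `sales_order` tables and the `reverse_engineer` helper are hypothetical names chosen for illustration.

```python
import sqlite3

def reverse_engineer(conn):
    """Describe each user table: its columns and its foreign-key relationships."""
    model = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [(c[1], c[2], bool(c[5]))
                   for c in conn.execute(f'PRAGMA table_info("{table}")')]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        foreign_keys = [(fk[3], fk[2], fk[4])
                        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")')]
        model[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return model

# Hypothetical tables standing in for an undocumented source system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales_order (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    order_date TEXT
);
""")

model = reverse_engineer(conn)
for table, description in model.items():
    print(table, description)
```

The recovered columns and relationships are only the physical view. Turning them into a logical or conceptual model still requires adding business names and definitions by hand.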
Dynamically structured data is characterized by frequent change of data content and organization. In a relational table, every row contains the same set of columns in exactly the same sequence. In a dynamically structured dataset, not all records contain all of the same fields. When the same fields are present, they don’t always occur in the same sequence. These datasets are adaptive and self-describing, with schema description embedded in the data records. Tagged files such as XML and JSON are dynamically structured data. NoSQL databases, including key-value stores, document stores, wide column databases, and graph databases, are dynamically structured. Dynamically structured data is well-suited to schema-on-read use cases, where understanding of data structure is coupled with data access instead of with data storage. Models of dynamically structured data are valuable when creating NoSQL databases and when preparing to use existing data. Modeling these kinds of data is ideally a part of every data engineer’s skill set.
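To make the schema-on-read idea concrete, here is a small Python sketch that discovers the structure of self-describing JSON records at read time. The sample records and the `infer_schema` helper are hypothetical; real tools use more sophisticated inference, so treat this as an illustration only.

```python
import json
from collections import defaultdict

def infer_schema(records):
    """Summarize which fields occur, with which value types, and whether each is optional."""
    seen = defaultdict(lambda: {"types": set(), "count": 0})
    for record in records:
        for field, value in record.items():
            seen[field]["types"].add(type(value).__name__)
            seen[field]["count"] += 1
    total = len(records)
    return {
        field: {"types": sorted(info["types"]), "optional": info["count"] < total}
        for field, info in seen.items()
    }

# Hypothetical records: fields vary, field order varies, and one type is inconsistent.
raw = [
    '{"id": 1, "name": "Ada", "email": "ada@example.com"}',
    '{"name": "Grace", "id": 2, "phones": ["555-0100"]}',
    '{"id": "3", "name": "Linus"}',
]
records = [json.loads(line) for line in raw]
schema = infer_schema(records)
for field, description in schema.items():
    print(field, description)
```

The inferred summary shows that `email` and `phones` are optional and that `id` arrives as both a number and a string. A model of dynamically structured data would document exactly this kind of finding.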
Semantically structured data focuses on data meaning and controlled vocabulary. Meaning of the data is expressed using language constructs, most commonly as triples, which take the form of subject-predicate-object. Customer-places-order and patient-receives-treatment are examples of the subject-predicate-object form of expressing data meaning.
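The subject-predicate-object form can be sketched in a few lines of Python. The first two triples come from the examples above; the third triple, the `Triple` type, and the `match` helper are hypothetical additions for illustration, not part of any particular semantic modeling standard or tool.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

triples = [
    Triple("Customer", "places", "Order"),
    Triple("Patient", "receives", "Treatment"),
    Triple("Order", "contains", "Product"),  # hypothetical, added for illustration
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

# What is asserted about Order, as either subject or object?
about_order = match(triples, subject="Order") + match(triples, obj="Order")
for t in about_order:
    print(t.subject, t.predicate, t.obj)
```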
All data, regardless of its native structure, can be viewed through a semantic lens and incorporated into a semantic data model. Ontology models represent the subject-predicate-object assertions of semantic models. Taxonomies expand on ontology to represent standard classification structures. Ontologies are readily modeled as knowledge graphs, and taxonomies as hierarchical structures. Collectively, the models describe the standard terminology and agreed meaning of data in a problem domain.
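As a rough illustration of a taxonomy as a hierarchical structure, the Python sketch below stores a classification as child-to-parent links and walks upward from a term to its broader categories. The account terms and the `ancestors` helper are hypothetical and assume the hierarchy contains no cycles.

```python
# Hypothetical classification: each term maps to its broader (parent) term.
taxonomy = {
    "Savings Account": "Deposit Account",
    "Checking Account": "Deposit Account",
    "Deposit Account": "Account",
    "Loan Account": "Account",
}

def ancestors(term, taxonomy):
    """List the broader terms above `term`, nearest first (assumes no cycles)."""
    path = []
    while term in taxonomy:
        term = taxonomy[term]
        path.append(term)
    return path

print(ancestors("Savings Account", taxonomy))
```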
A semantic data model is the understructure of a semantic data layer that hides obscure technical language and provides business-friendly data access. Mapping disparate data stores to a common semantic model builds a strong foundation for data interoperability, data exchange, and data sharing. Ideally, semantic modeling is a part of every data architect’s skill set.
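One way to picture mapping disparate data stores to a common semantic model is a simple field-name translation, sketched below in Python. The source names (`crm`, `billing`), their field names, and the `to_semantic` helper are all hypothetical; a real semantic layer would also reconcile data types, units, and meaning, not just names.

```python
# Hypothetical mappings from each source's technical field names to shared business terms.
SEMANTIC_MAP = {
    "crm": {"cust_no": "customer_id", "cust_nm": "customer_name"},
    "billing": {"AcctId": "customer_id", "AccountHolder": "customer_name"},
}

def to_semantic(source, record):
    """Rename a record's fields to the shared vocabulary; unmapped fields keep their names."""
    mapping = SEMANTIC_MAP[source]
    return {mapping.get(field, field): value for field, value in record.items()}

crm_view = to_semantic("crm", {"cust_no": 7, "cust_nm": "Ada"})
billing_view = to_semantic("billing", {"AcctId": 7, "AccountHolder": "Ada"})
print(crm_view)
print(billing_view)
```

Once both sources are expressed in the same terms, their records can be compared, exchanged, or combined without each consumer needing to know the native field names.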
Geospatial and multimedia data are two other types of differently structured data. Geospatial data describes locations and places, both as geometric coordinates and as named places. Multimedia data encompasses images, audio, and video as well as bio-sequence data such as genetic sequencing. Data organization varies with media type, generally using objects, links, and sequences to describe data structure.
Models for each of these data structure patterns are large topics individually—too much to address in a single article. Each may be the topic of a future article in this series. First, I’ll look at dynamically structured data because it is central to modern analytics use cases and data engineering challenges. Next, I’ll cover semantically structured data with attention to data interoperability. Then I’ll move to modeling for the remaining structural patterns to complete the series.
A Look Ahead
At this point, I have introduced several types of data models: entity-relationship, dimensional, key-value, document, graph, ontology, and taxonomy. Figure 3 shows examples of some of those models.
Figure 3. Data Model Examples
The number of model types is substantial, and it is compounded when you consider modeling data at multiple levels – semantic, conceptual, logical, and physical. There is much to be learned to understand the full scope of modern data modeling. In future articles, I will describe the “what and why” of various data models, but the “how” is better suited to training than to articles. Watch for my upcoming live and on-demand Dataversity data modeling courses, soon to be announced for 2024.