A Fresh Look at Data Modeling Part 2: Rediscovering the Lost Art of Data Modeling
ABSTRACT: Many data engineering teams lack the core skills of data modeling. It is time to rediscover the lost art of data modeling.
Data modeling is a core skill of data engineering, but it is missing or inadequate in many data engineering teams. Today’s data engineers focus primarily on data pipelines—the processes to move data from one place to another. That is a natural consequence of recent focus on data lakes and data science. But the result is loss of focus on shaping and structuring data—the stuff of data modeling and database design.
Most modern data engineers are process engineers, not product engineers.
Most modern data engineers are process engineers, not product engineers. But modern data management needs to have both process engineering and product engineering. The need for product engineering is apparent with the rising interest in data products and the need for product thinking. To achieve the right balance of process and product focus we need to rediscover the lost art of data modeling.
Attention to data modeling diminished as data pipelines and schema-on-read became mainstream practices.
I refer to data modeling as a lost art because it has existed as a data analysis and design discipline for decades. Entity-Relationship Modeling (ERM), introduced by Dr. Peter Chen in the mid-1970s, is the foundation of most modern data modeling techniques. ERM was widely practiced from the 1980s to around 2010 when data lakes became a data management priority. Attention to modeling diminished as data pipelines and schema-on-read became mainstream practices. It is time to rediscover the lost art of data modeling, starting with the foundational concepts of ER Modeling.
In the first article of this series, I stated that today’s data modeling is different from the past. Today’s data modeling extends data modeling practices of the past to work with the variety of data that we work with now. That first article describes several kinds of data structures including relationally structured, dimensionally structured, dynamically structured, and semantically structured. This article describes modeling techniques for relationally structured data—a logical place to begin because it is foundational. Many data professionals who are mid-career or beyond are already familiar with relational modeling techniques, and should not think these skills are outdated. Relational data is still predominant in operational systems, and most BI and analytics projects still rely on relational data. For those who have not yet experienced relational data modeling—especially data engineers—this is an important skill set to acquire.
In future articles I’ll discuss dimensional, dynamic, and semantic modeling techniques. There you will see how these more recent modeling techniques build on a relational data modeling foundation.
What Is Relational Data?
With apologies for stating something that may be obvious, I’ll briefly describe relational data to be sure that we’re all on the same page. Relational data is structured data that represents real-world things and the relationships that exist among them. Relational data structures organize data as rows and columns in tables, with rows representing instances of things and columns representing facts about those things. Throughout this article I use examples of a fictional car rental company to illustrate data and data models. Figure 1 shows examples of relational data for customers and vehicles.
Figure 1. Relational Data Tables
The Components of ER Models
Entity-Relationship models incorporate three primary components: entities, relationships, and attributes.
An entity is a person, place, thing, concept, or event about which data is (or needs to be) collected, stored, and managed. Entities are represented in an ER model as labeled rectangles. The labels show entity names, which are expressed as singular nouns.
Relationships are important associations among entities that are (or need to be) collected as data. Relationships are represented as labeled lines that connect entities. Labels express the nature of the relationship, often but not always in verb form. Relationships have a property called cardinality (more about this shortly).
Attributes are characteristics of entities. These are the facts about entities that are (or need to be) collected as data elements. Ideally they are named descriptively.
All of the model components are illustrated in the vehicle rental data model shown in figure 2.
Figure 2. Entity Relationship Data Model Example
Returning to the concept of relationship cardinality, think of it as an expression of a particular type of business rules in data model context. Cardinality expresses the minimum and maximum number of occurrences of each entity type allowed to participate in a relationship. Cardinality is represented in the model as symbols at each end of relationship lines. For example, the relationship between vehicle and rental says that one rental provides exactly 1 vehicle, stating that 1 is both the minimum and the maximum number of vehicles that can participate in one rental. Looking at the relationship in the other direction, it says 1 vehicle is provided for zero, 1, or more rentals. In this case the minimum is zero and the maximum is many (meaning a number greater than 1). This example also introduces the concept of optional relationships, with zero meaning that a vehicle is not required to participate in any rental relationships. Looking from the perspective of a vehicle, the relationship is optional.
In the ER diagram, cardinality is shown with “crows foot” symbols. Cardinality may be any of:
Reading the Diagram
At a high level, an ER model done well can be expressed as a series of simple sentences that describe each relationship in both directions. These sentences describe the business concepts represented as entities, relationships, and cardinalities. The model in figure 2 yields these statements:
One customer contracts one or more rentals.
One rental is contracted with exactly one customer.
One rental provides exactly one vehicle.
One vehicle is provided for zero, one, or more rentals.
One customer makes zero, one, or more payments.
One payment is made by exactly one customer.
One rental requires one or more payments.
One payment is required for exactly one rental.
One rental has exactly one pickup location.
One location is a pickup location for one or more rentals.
One rental has exactly one return location.
One location is a return location for one or more rentals.
One vehicle is owned by exactly one location.
One location owns one or more vehicles.
Sentences such as these are expressions of business concepts and business rules. Describing a model in this form is a great way to test it and to guide business alignment of data content and data structures. Attributes are the remaining components of the model. Each attribute represents a fact about the entity with which it is associated.
ER modeling has three primary activities: (1) identify, name, and describe entities, (2) identify, name, and describe relationships, (3) identify, name and describe attributes. The way these activities are performed varies with the nature of the modeling effort. The first is the question of what data needs to be modeled.
When modeling for a new application or database, the process is generally top-down—conceptual modeling, leading to logical modeling, and then to physical modeling. (See the first article in this series for more about levels of modeling.) Starting with and developing only a physical model is a common bad practice that pursues technical design without business context. The conceptual model provides a big picture view—a small number of the primary entities and their predominant relationships. A logical model adds detail to understand the data from an application or system perspective, but without implementation specifics. Finally, a physical model translates the logical model to a technical design.
When developing data models for existing data, use a bottom-up process of reverse engineering data models by looking at implemented data. This kind of modeling is common to understand the content and structure of legacy databases, ERP systems, SaaS applications, legacy data warehouses, and relational tables in data lakes. Starting with schema, you’ll find attributes by identifying what fact is contained in each column. Then you’ll find entities by asking what real world thing each fact describes. Finally, you’ll look for relationships by examining foreign keys and cross-table dependencies. Extracting a physical data model, then rolling up to a logical model is sufficient for most use cases, but if needed it is practical to further abstract to create a conceptual model.
There is, of course, a lot of detail behind these brief descriptions of modeling processes—techniques such as normalization, entity abstraction, attribute abstraction, data naming, data typing, and more. That is too much detail for an article of this type, but those are learnable skills. I recommend Steve Hoberman’s Data Modeling Made Simple as a good place to start learning.
This article provides a quick and high-level look at data modeling for relational data. It is the most mature of today’s data modeling techniques, and it is important because it is the foundation for more recent techniques. ER modeling knowledge and skill is something that is needed in every data engineering team. If you’re an experienced data modeler, you already know this stuff. Get out there and share that knowledge. It is not obsolete. If you’re a modern age data engineer, broaden your horizons and complement your skills to be both a process engineer and a product engineer.
A Look Ahead
With two articles in this series, I have introduced several types of data models: entity-relationship, dimensional, key-value, document, graph, ontology, and taxonomy. And I have expanded here on entity-relationship modeling. In future articles I’ll describe modeling concepts and techniques for the remaining types of data models.