SQL for NoSQL Data

TableauQlikPower BI, and other self-service tools are great for easy, code-free data analysis and visualization. But sometimes, you want to get close to the data and work with it more directly. Today’s business organizations often have data-savvy people who are comfortable and productive working directly with data and whose data goals are not limited to producing visualizations and dashboards. They frequently use structured query language (SQL) to find answers in data, to ask questions about the data, and to manipulate data and prepare it for downstream work. 

Unfortunately, the people with these SQL skills are limited to working exclusively with relational data. Yet much of today’s interesting and valuable data is stored in non-relational forms—NoSQL databases such as key-value stores and document stores. The mismatch of SQL skills and NoSQL data limits the value that can be derived from the data and from the talents of the workforce. What we really need is a SQL language for NoSQL data. 

The Powerful Thing about SQL

The SQL language is based on the relational model created by E.F. Codd in the 1960s. Prior to the relational model, knowledge of the data – structure, content, relationships, etc. – was embedded in program code. With the relational model, data knowledge was separated from program code and instead coupled with the data. That is the underlying power of SQL. All of the SQL commands—SELECT, INSERT, UPDATE, DELETE, JOIN, COUNT, GROUP BY, and many more—use the metadata described by the data model to operate on the data itself. A mathematically-based model founded in set theory—the branch of mathematics that deals with collections of objects—makes SQL possible for relational data. 

Some 40 years after Codd introduced the relational model, we experienced big shifts in the world of data management. The mid-2000s brought NoSQL databases, Hadoop, MapReduce, data lakes, and the rush to big data. Today we’re well aware of the many benefits (and the many challenges) of the big data explosion. We are perhaps less aware of one significant side-effect—one area where we took a giant leap backward. With the adoption of NoSQL databases, we put the knowledge of data back in the code! We need a new data model that separates data knowledge from code, a new data model that is mathematically based with a set theory foundation. That model is the basis of a SQL language for NoSQL data.

The Promise of KeySQL

I was recently introduced to an emerging technology that steps up to the SQL for NoSQL requirement. KeySQL by Keyark applies the concepts of mathematically based data modeling to capture data knowledge independently of program code. Keyark describes KeySQL as “a structured query language that employs a keyobject (k-object) data model. The keyobjects work as data definitions for the k-object instances, which represent the data itself.” Based on the k-object data model, KeySQL uses high-level data manipulation statements—INSERT, UPDATE, DELETE, etc.—similar to those of SQL. Joins, grouping, and other set-based operations are fully supported. But KeySQL can be applied to NoSQL data that includes nesting, hierarchy, embedded arrays, lists, and other data constructs not practical in relational databases. In short, KeySQL brings SQL capabilities to data of any complexity.

K-Object Data Model

Just as SQL is built on the foundation of a data model, KeySQL is also model-based. The relational model is a generalization of the mathematics of set theory relations. The underlying data model for KeySQL has a similarly strong mathematical foundation that is based on hereditarily finite sets. For those of us who are not mathematicians, this may be a pretty imposing term, so let’s break it down.

Hereditarily Finite Sets

In simple terms, and without getting into the depths of mathematics theory, a finite set is a set with a finite number of elements. When applied in data modeling, a finite set represents actuals (what exists) as opposed to possibilities (what might exist). This principle aligns neatly with the big data concept of schema-on-read vs. schema-on-write. When designing databases for schema-on-write, it is necessary to support things that don’t yet exist. When understanding databases for schema-on-read, all that matters is actual database contents. 

Now, let’s look at the hereditary aspect of hereditarily finite sets. A hereditary set is a set whose elements are all hereditary sets—all elements of the set are also sets. (See figure 1.) A property of a set is hereditary if it is inherited by all subsets. There are many similarities with class hierarchy and inheritance in object-oriented design principles. Considering this, it is not surprising that the KeySQL data model has a distinctly object-oriented feel. The state-based (as opposed to method-based) concepts of object orientation apply.

Figure 1. Hereditary Sets

K-Objects

Now we have the context to better understand the description of KeySQL as “a structured query language that employs a keyobject (k-object) data model.” A k-object is expressed as a key-value pair, for example:

{YEAR: 2019}

The example above is an elementary k-object that describes an instance of stored data. More complex k-objects are composed from elementary k-objects:

{VEHICLE: {YEAR: 2014, VIN: ‘5TGAW5F12EX597834’, MAKE: ‘Toyota’}}

This is an example of a composite k-object that is composed of the elementary k-objects YEAR, VIN, and MAKE. Below is an example of a multi-composite k-object. Multi-composite k-objects are created as the composition of one k-object whose instances are repeated several times (e.g., several instances of the k-object VEHICLE). 

With these examples, you see hereditarily finite sets in action. Every member of the VEHICLE_INVENTORY set is also a set, and every member of each VEHICLE set is a set that represents an elementary object: sets, composed of sets, composed of sets. A next level of multi-composition could represent a data store with members of different types:

In some respects, this is an elegantly simple data model. Sets, contain sets, contain sets, etc. giving the ability to represent flat data, lists, nesting, arrays, and more—a data model that supports the realities of the data that we work with today. 

The Power of the K-Object Data Model

With a first glance at the examples, you might have the impression that this looks a lot like JSON or XML and be tempted to ask “what’s so special about that?” You would certainly be correct that it has similarities with JSON and XML. That is good news because that is part of the power of this data model. It describes actual stored data in a familiar way and is able to describe both relational and non-relational data. But it goes beyond descriptions of stored data to describe system concepts such as data stores and business concepts such as inventory and vehicle. JSON and XML are data formats. The KeySQL data model is an abstract model that is able to support operations on data, much like the relational model does with SQL. I see three very powerful capabilities in this modeling approach:

  • Key-Value Pairs – K-objects instances are created as key-value pairs. This is a natural fit for NoSQL data because every type of NoSQL database—document store, graph, and wide-column—is built on key-value pairs. 

  • Bottom-Up Modeling – The modeling process begins with actual stored data and systematically progresses upward to higher level concepts representing system and business views of the data.

  • Universal Modeling – One model represents business, system, and physical views of the data. There is no need to create conceptual, logical, and physical models and to manage the transitions and translations from one modeling level to another. The path from data to business semantics, and from business semantics to stored data, is clear and direct. 

Beyond Structured Query Language 

In this article, I’ve introduced KeySQL as a data manipulation language for non-flat data and explored the underlying data model. The Keyark product suite includes a comprehensive query language, as well as a Universal Database, KeySQL Server, KeySQL Studio, APIs, and data connectors. These things support analysts, administrators, and developers and make it practical to integrate KeySQL into your existing technology stack without a “rip and replace” approach. Discussion of KeySQL query language, environment, and operations capabilities are topics for future articles. 

Dave Wells

Dave Wells is an advisory consultant, educator, and industry analyst dedicated to building meaningful connections throughout the path from data to business value. He works at the intersection of information...

More About Dave Wells