Knowledge Graphs – Part I: What is a Knowledge Graph?
When I say “graph,” what do you think of? Something with x and y axes? Maybe a chart? Unless you took a fair amount of college mathematics, the odds are high that the terms vertices and edges don’t pop into your brain first. But as graph databases become more widespread and knowledge graphs begin to power business-critical use cases like fraud detection, customer 360, market intelligence, content management, drug discovery, and data integration, we’d do well to get a better handle on these concepts and how they work.
Relegated to academics, tech giants, and niche enthusiasts for the last couple of decades, graph technology is going mainstream. Just last month, graph vendor Neo4j raised $325 million in the largest funding round ever for a database of any kind. But what exactly is a graph in this context?
In this blog I will introduce the basic terminology for understanding the world of graphs and define knowledge graphs—a specific application of this technology. In later articles in this series, I will highlight key use cases for knowledge graphs, go over best practices for implementing them, and offer predictions about the long-term role of these tools.
The Mathematical Definition of a Graph. Our story starts in Königsberg, Prussia in 1735. Peerless mathematician Leonhard Euler solves a puzzle involving seven bridges that cross the Pregel River and connect the four landmasses, or districts, of the city. As part of his proof that there is no way to walk through the city while crossing every bridge exactly one time, Euler treats each landmass as a vertex, or node, and each bridge as an edge, or link. His approach became the basis of graph theory. (See figure 1.)
Figure 1. Euler’s Map of Königsberg
Since then, at least in the realm of mathematics, a graph has been defined as a set of objects (vertices) together with a set of relationships (edges), each of which links a pair of those objects. And, as it turns out, you can describe a whole lot more than bridges with a graph.
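To make the definition concrete, here is a minimal Python sketch of the Königsberg puzzle as a graph. The landmass labels and the helper names (`degrees`, `has_euler_walk`) are my own, chosen for illustration. The check leans on the well-known result from graph theory that a connected graph can be walked using every edge exactly once only when zero or two of its vertices touch an odd number of edges.

```python
from collections import Counter

# Each bridge is an edge between two landmasses (vertices).
# Labels are illustrative: "north" and "south" are the river banks,
# "island" is the central island, and "east" is the eastern landmass.
bridges = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"),
    ("island", "east"),
]


def degrees(edges):
    """Count how many bridge ends touch each landmass."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def has_euler_walk(edges):
    """Assumes the graph is connected: a walk that uses every edge
    exactly once exists only if 0 or 2 vertices have odd degree."""
    odd = [v for v, d in degrees(edges).items() if d % 2 == 1]
    return len(odd) in (0, 2)


print(dict(degrees(bridges)))
print(has_euler_walk(bridges))
```

All four landmasses touch an odd number of bridges, which is why the check fails for Königsberg.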
The Modern Evolution of Knowledge Graphs. Let’s fast forward to the late 1990s/early 2000s. The internet has arrived and Tim Berners-Lee’s World Wide Web links millions (soon to be billions) of webpages in what is no less than a giant graph. At that scale, keeping track of what all the objects represent is nearly impossible, so Berners-Lee makes a proposal in the May 2001 issue of Scientific American for a new approach—the Semantic Web.
The Semantic Web would use standards to structure metadata about webpages and the links between them, making the knowledge stored in these relationships machine-readable. Unfortunately, the Semantic Web developed a reputation for being too academic and never fully caught on, but the idea behind the proposal remained sound.
In fact, search engines and social media companies in the mid-to-late 2000s quickly realized they were dealing with extremely large graphs and that, in order to make sense of them, they needed a systematic way to map general knowledge about the nodes and their relationships. Google can ultimately take credit for rebranding the Semantic Web idea and popularizing the term “knowledge graph.” In 2012, it used the term to introduce the Google Knowledge Graph, which powers its search engine and provides the information found in the small knowledge boxes that appear after most Google searches. (See figure 2.)
Figure 2. Eckerson Group Google Knowledge Box
Google wasn’t the only company to start using knowledge graphs, however. In addition to its competitors in search, Microsoft and Yahoo, companies including Facebook, LinkedIn, Amazon, and Airbnb all started to rely on knowledge graphs internally to manage their core offerings. But while large tech companies had the expertise to build their own in-house graph solutions, other industries needed more out-of-the-box platforms to build their knowledge graphs.
In recent years, the number of graph technology vendors has proliferated, lowering the bar for smaller and less technologically savvy organizations in various industries to take advantage of knowledge graphs. New offerings from older vendors, fresh start-ups, and large cloud vendors, such as AWS, have reduced the time to value, increased ease of use, and improved query performance at scale. As we move into the future, I anticipate knowledge graph adoption will continue to grow, becoming a standard solution that runs under the hood of many of the point tools that already exist. In a later post, I will dig into this final phase in more detail. The chart below summarizes the trajectory of the adoption.
Figure 3. The Adoption of Knowledge Graphs
What is a Knowledge Graph? Now that we’ve covered the origins of knowledge graphs, it’s time to dig into their basic attributes. Essentially, a knowledge graph maps the relationships between objects (data) and provides information that helps humans and machines understand what the data actually means. A knowledge graph is data plus metadata (or semantic information) linked in a graph. This concept is easier to understand visually, so I have crafted a small knowledge graph below:
Figure 4. Knowledge Graph about the State of Maine
From this graph, I can not only learn about Maine (where I live), but also uncover new information that was not explicitly entered. For example, nowhere does it say that Canada borders New England. However, because I know from the graph that Maine borders Canada and is a part of New England, I can logically infer that Canada must border New England. In practice, most knowledge graphs are orders of magnitude larger than the one I’ve depicted here, so it makes sense to use software to keep track of them. Storing them in software also lets machine learning models ingest the information they contain and power a wide range of applications.
Most knowledge graphs reside in a graph database. Although one could theoretically build a knowledge graph on any type of data store, it’s best to use a store specifically designed to handle graph data. Graph databases differ from traditional relational databases because they treat relationships between objects as first-class citizens. There are two main types of graph databases:
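The inference about Canada and New England can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the facts are stored as plain tuples, the relationship names `borders` and `part_of` are hypothetical, and the single hand-written rule stands in for the far richer reasoning that real knowledge graph software offers.

```python
# Facts as (subject, relationship, object) tuples; names are illustrative.
facts = {
    ("Maine", "borders", "Canada"),
    ("Maine", "part_of", "New England"),
}


def infer_borders(facts):
    """Rule: if X borders Y and X is part of Z, then Y borders Z,
    unless Y is itself recorded as part of Z."""
    inferred = set()
    for x, rel, y in facts:
        if rel != "borders":
            continue
        for x2, rel2, z in facts:
            if rel2 == "part_of" and x2 == x and (y, "part_of", z) not in facts:
                inferred.add((y, "borders", z))
    # Report only facts that were not explicitly entered.
    return inferred - facts


print(infer_borders(facts))
```

The rule produces a fact that was never entered by hand, which is the kind of derived knowledge the paragraph above describes.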
RDF Triple Stores
Labelled Property Graphs
At a high level, RDF Triple Stores lend themselves toward standards and interoperability with open data, while Labelled Property Graphs focus on performance and fast start-up times. This is a gross generalization, and, as the approaches evolve, their capabilities are quickly converging.
RDF Triple Stores. Triple stores are the direct descendants of Berners-Lee’s Semantic Web initiative. Using the Resource Description Framework (RDF) specification created for the Semantic Web, triple stores break graphs down into sets of three, consisting of two nodes and an edge. (See figure 5.)
Figure 5. A “triple”
What makes RDF triples special is that each piece of the triple (subject, predicate, and object) is given a Uniform Resource Identifier (URI) that gives it a unique, machine-readable identity. Standard ontologies and taxonomies give these triples structure and make them interoperable with other RDF-based knowledge graphs. This standardization makes it easier to tap into open data sources and third-party data. Triple stores also have a standard query language, SPARQL, so engineers only have to learn one language regardless of which database they use.
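As a rough sketch of the triple idea, the Python below stores a few facts as (subject, predicate, object) tuples of URIs and looks them up by pattern. Everything here is an assumption made for illustration: the `http://example.org/` namespace is a placeholder, the facts are my own, and the `match` helper is a toy stand-in that only loosely resembles a single SPARQL triple pattern. It is not SPARQL and not how any particular triple store works internally.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Every part of each triple is identified by a URI.
triples = [
    (EX + "Maine", EX + "borders", EX + "Canada"),
    (EX + "Maine", EX + "partOf", EX + "NewEngland"),
    (EX + "Augusta", EX + "capitalOf", EX + "Maine"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything stated about Maine as a subject.
for triple in match(triples, subject=EX + "Maine"):
    print(triple)
```

Because every element is a globally unique identifier, two graphs that use the same URIs can be merged by simply combining their lists of triples.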
Labelled Property Graphs. Labelled Property Graphs initially came into being in Sweden as part of an effort to build an enterprise content management system. Unlike RDF, with its focus on standards and interoperability, labelled property graphs, or just property graphs, emphasized storage and query speeds. Property graphs differ from RDF not only in their lack of standardization but also because they allow both nodes and relationships to have, well, properties. Instead of having a triple for every piece of information, property graphs permit elements of the graph to have internal structure. For instance, in my example, I might represent founding dates as an attribute or property, rather than a linked node. (See figure 6.)
Figure 6. Properties in a graph
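A property graph can be sketched in Python with two small data classes. This is a minimal illustration, not the data model of any specific graph database: the `Node` and `Relationship` classes, the label and type names, and the property keys are all hypothetical. The point is that the founding date lives inside the node as a property rather than as a separate linked node, and that the relationship can carry properties of its own.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


# The founding date is an attribute of the node, not a node of its own.
maine = Node("State", {"name": "Maine", "founded": 1820})
canada = Node("Country", {"name": "Canada"})

# Relationships can hold properties too (the value here is a placeholder).
borders = Relationship(maine, "BORDERS", canada, {"note": "illustrative edge property"})

print(maine.properties["founded"])
print(borders.start.properties["name"], borders.type, borders.end.properties["name"])
```

In a triple store, the same founding date would typically be expressed as its own triple attached to Maine.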
This approach allows for greater flexibility but requires more building from scratch. Organizations are largely responsible for creating their own ontologies and taxonomies to manage the graph and cannot rely on open-source resources to the same degree. Query languages for property graphs tend to be more approachable for developers than SPARQL, but they lack standardization and essentially every database has its own language.
Functional convergence. Over time, each approach has adopted features of the other. RDF-star, an emerging standard for RDF graphs, permits the addition of metadata to edges, something that previously could only be done in a property graph. At the same time, GQL, an initiative to standardize the query language for property graphs, will create greater interoperability between property graph databases, giving them some of the advantages that RDF currently enjoys. Ultimately, we are rapidly reaching a point where the distinction between RDF and property graphs will blur as the databases on both sides begin to interface with one another and convert from one format to the other. In the meantime, either can store a knowledge graph.
The Upshot. A knowledge graph is essentially a map of everything an organization knows about a given topic. It can be confined to a particular domain or, in the case of an enterprise knowledge graph, map everything the company knows about everything. Knowledge graphs combine data with semantic information to provide context, yielding insights that both humans and machines can interpret. Graph databases, both property and RDF, serve as the storage layer for knowledge graphs and provide the means to efficiently query massive graphs.
The knowledge graph industry has grown enormously since the days of the Semantic Web Manifesto, and knowledge graphs are an increasingly attainable solution for organizations outside the world of big tech. In my next article, I will dive more into the various applications of knowledge graphs and the business use cases they enable.