Graph Theory and Graph Database

In “The Big Bang Theory” TV show S12:E16, Leonard created a whole network of celebrities who are likely to play Dungeons and Dragons with Whil Wheaton. He uses Graph Theory, a branch of Mathematics to map social networks of Whil Wheaton by using his twitter and other online media networks.

Graphs are mathematical structures used to study complex relational data. It is widely used in social network models, recommendation systems, Google maps, genomics etc. With the Big data era, latest storage spaces like Neo4j and advanced algorithm such as GNN, GraphEDM under the field of study Geometric Deep Learning, Graph has become an integral part of any organization. Google’s Knowledge graph is one such example.

Graph Theory dates back to 1736 when Leonhard Euler (1707–1783) solved the well-known K¨onigsberg bridge problem. This problem asked for a circular walk through the town of K¨onigsberg connected to 4 territories by 7 bridges (now Kaliningrad, Russia) in such a way as to cross over each of the seven bridges spanning the river Pregel once, and only once. Euler proved that the problem is unsolvable by representing the four territories by points (called vertices), and the bridges by curves joining the respective points, then the problem looks like a graph as shown below.

Euler’s bridge problem

Mathematical notations

Graph

A graph G is a pair G = (V,E) consisting of a finite set V and a set E of two-element subsets of V . The elements of V are called vertices. An element e = {a, b} of E is called an edge with end vertices, a and b. We say that a and b are incident with e and a and b are adjacent or neighbors of each other, and write e = ab.

The degree of v, deg v, is the number of edges incident with v

Directed and undirected Graphs

Undirected graph is a graph without an orientation, e.g. is the graph above, V1 to V2 is via e1 and vice versa, same for all vertices.

Directed graph or digraphs has specific orientation, below is an e.g. of digraph, arrow pointers to specify the direction.

Weighted Graph

A weighted graph is a graph with weights assigned to the edges connecting the vertices. It is denoted by (V,e,w), where w is the weight.

Walks, Trails and Paths

Let v1,v2…vn be the vertices and e1,e2,…en be the edges of a graph G, if the vertices are connected by edges as ei = vi-1vi for i=1,2,…n then Graph G is called a Walk. (Graph below is a walk). A walk for which the edges are distinct is called a Trail. If both vertices and edges are distinct, then it’s called a Path.

Cycles, Connectedness and Trees

A trail with n ≥ 3, for which the vertices are distinct with v0 = vn, is called a cycle.(graph below). A graph is called acyclic if it does not contain a cycle.

Two vertices vi and vj of a graph G are called connected if there exists a walk with start vertex vi and end vertex vj. If all pairs of vertices of G are connected, G itself is called connected.

A graph which is acyclic, connected and has n-1 edges is called a Tree. A vertex of T with degree 1 is called a leaf. A forest is a graph whose connected components are trees. E.g. decision trees, random forest.

Bipartite and Complete Graph

A bipartite graph is one whose vertices can be split into 2 independent groups, such that every edge connects between these groups. In the graph below (V1, V2) and (V3, V4) are the 2 independent vertices. This is a very useful graph in networking.

A complete graph has a unique edge between every pair of nodes. A complete graph with n vertices is denoted as Kn.

Graph Database

A graph database is a storage engine that combines a basic graph structure of vertices and edges and a query language for easy storage and retrieval of information. Graph emphasize on the relationship between entities.

What are the other available databases?

We know of RDBMS (Relational Database Management System) as a common storage, and has been existent a long time, however with big data era, there were need for different types of storages other than relational DB. Below are the storage engine used currently according to the data structure:

Key-value — this data has a unique identifier (a key) and an associated data object (the value). Examples include Berkeley DB, RocksDB, Redis, and Memcached.

Wide-column (or column-oriented) — Stores data in rows with a potentially large number or possibly varying numbers of columns in each row. Examples include Apache HBase, Azure Table Storage, Apache Cassandra, and Google Cloud Bigtable.

Document — Stores data in a uniquely keyed document that can have varying schema and that can contain nested data. Examples include MongoDB and Apache CouchDB.

Relational — Stores data in tables containing rows with strict schema. Relationships can be established between tables allowing the joining of rows. Examples include PostgreSQL, Oracle Database, and Microsoft SQL Server.

Graph — Stores data as vertices (nodes, components) and edges (relationships).Examples include Neo4j, Apache TinkerPop’s Gremlin Server, JanusGraph, and TigerGraph

Taken from “Graph Databases in Action”, an excellent book by Dave Bechberger and Josh Perryman.

Why can’t we use RDBMS and SQL for querying complex relational data?

Relational database, otherwise what its name suggests is not really good at addressing complex linked relations such as social networking, hierocracy, shortest path search etc. when data becomes more inter-related one must use multiple and complex joins and self-joins, which is confusing and expensive task. Graph on the other hand focuses on the relationships which makes it easier to analyze complex relations. Index-free adjacency, a property of graph database allows queries to traverse millions of nodes per second, offering response times that are several orders of magnitude faster than with relational databases for connected queries.

Graph database comes with Gremlin as its querying language, similar to SQL for relational database. Recursive querying is one of the specialty of Gremlin.

Data science enthusiast.