Richard Lemmens website

Graph databases

A B C D
       
       
       
Upon hearing the word database, many people picture a relational database in their head. These are well established, versatile and well known. However, there are other types of databases too. Collectively these are called "NOSQL", which stands for "Not Only SQL", though that is sometimes translated as "No SQL". This bunch includes document stores, XML databases, object databases, key-value stores, wide column stores and other types. Most have not made into the mainstream, though NoSQL databases are used by some very large companies for very large sets of data with very good results. In my experience one of the most interesting are graph databases, especially the nice implementation from Neo4j.

graph Graph databases do not consist of tables, but of nodes and edges. A graph is a different structure than a table, though both can be used for the same thing: storing data. Graph databases are good at querying relationships. Without writing a single line of code, you can see why through a thought experiment for a typical relation query:

Suppose a database stores friend-relationships, like 'Ayla is a friend of Bud' (and vice versa); 'Bud is a friend of Chris' (and vice versa); etcetera. In a relational database, a straightforward way to store these relations is to make a two-column table where each friend-relation fills one row, with the people in both columns. In a graph database, a straightforward way would be to store the people in nodes and their relations in edges. Suppose there are 1,000,000 such relationships between 150,000 people in the database. In the relational database, that would mean 1,000,000 records; in the graph database 150,000 nodes connected by 1,000,000 edges.

Both types of database handle simple queries like 'Give me all friends of Bud' equally well. The relational database will probably use an index to find all records with Bud either left or right quickly. These records contain all Bud's friends. The graph database will probably also use an index to find the single Bud-node quickly. After that, all edges radiating out from that node should be traversed, which can also be done quickly because they are few (on average 50 in this example).

The two database types start to diverge when more complex queries are used, for example 'Give me all friends of the friends of the friends of Bud'. Now the relational database starts to struggle. First it must run a query to find the friends of Bud. Say this yields 50 people. Then it must run a second query to find the friends of those friends. This is more difficult because it must match 50 values instead of 1. Say this yields 2,400 unique people. Finally it must run a third query to fiend the friends of the friends of the friends. It must match 2,400 values, too much to handle for an index, forcing it to revert to a slow table scan, which may take a lot of time for 1,000,000 records. After a lot of work some 100,000 records are produced.
The graph database has less trouble. A first query yields 50 nodes related to the Bud-node. A second query follows all edges connected to those 50 nodes, traversing 2,500 and yielding 2,400. A third query does the same for those 2,400, traversing 120,000 and yielding 100,000.
So the relational database runs two index scans, one for 1 value and one for 50, plus a table scan for all 1,000,000 records. The graph database traverses 50 + 2,500 + 120,000 = only 122,550 edges, an order of magnitude lower than the relational database.

Of course both query and database size in the example above were deliberately chosen to make the graph database shine, but the example does represent a common real-life situation. This article has not touched on (advantages of) other types of NOSQL databases, like quering with XPath in XML databases or the robust redundancy of tuple stores. So if you are grounded in the world of relational databases, the next time you have to choose a database, consider the NOSQL variants too!