In previous posts on data models, we saw that when your data has mostly one-to-many relationships (a tree structure), or no relationships between records at all, the document model fits best.
When many-to-many relationships appear, the relational model handles the simpler cases, but as the connections within your data become more complex, it becomes more natural to model the data as a graph.
A graph consists of vertices and edges. The vertices represent the objects, and the edges represent the relationships between them.
For example, Facebook's data model is a heterogeneous graph: a single graph holds many different types of vertices (such as people and comments) and many different types of edges between them (such as friendships and likes).
Property Graph
In the property graph model, each vertex consists of:
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties (key-value pairs)
Each edge consists of:
- A unique identifier
- The vertex at which the edge starts (the tail vertex)
- The vertex at which the edge ends (the head vertex)
- A label to describe the kind of relationship between the two vertices
- A collection of properties (key-value pairs)
You can think of a graph store as consisting of two relational tables, one for vertices and one for edges. The head and tail vertex are stored for each edge; if you want the set of incoming or outgoing edges for a vertex, you can query the edges table by head_vertex or tail_vertex, respectively.
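To make the two-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made for illustration; a real graph store would not necessarily be laid out this way.

```python
import sqlite3

# Hypothetical schema: names and sample data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (
        vertex_id  INTEGER PRIMARY KEY,
        properties TEXT                  -- JSON-encoded key-value pairs
    );
    CREATE TABLE edges (
        edge_id     INTEGER PRIMARY KEY,
        tail_vertex INTEGER REFERENCES vertices (vertex_id),  -- where the edge starts
        head_vertex INTEGER REFERENCES vertices (vertex_id),  -- where the edge ends
        label       TEXT,
        properties  TEXT
    );
    -- Indexes so that edges can be found from either end.
    CREATE INDEX edges_tails ON edges (tail_vertex);
    CREATE INDEX edges_heads ON edges (head_vertex);
""")

conn.executemany(
    "INSERT INTO vertices VALUES (?, ?)",
    [(1, '{"name": "Lucy"}'), (2, '{"name": "Idaho"}'), (3, '{"name": "Alain"}')],
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?, ?)",
    [(10, 1, 2, "born_in", "{}"), (11, 3, 1, "married_to", "{}")],
)


def outgoing_edges(vertex_id):
    """Edges that start at the vertex: match on tail_vertex."""
    rows = conn.execute(
        "SELECT edge_id FROM edges WHERE tail_vertex = ? ORDER BY edge_id",
        (vertex_id,),
    )
    return [row[0] for row in rows]


def incoming_edges(vertex_id):
    """Edges that end at the vertex: match on head_vertex."""
    rows = conn.execute(
        "SELECT edge_id FROM edges WHERE head_vertex = ? ORDER BY edge_id",
        (vertex_id,),
    )
    return [row[0] for row in rows]
```

Here `outgoing_edges(1)` looks up the edges whose tail is Lucy, and `incoming_edges(1)` looks up the edges whose head is Lucy. The two indexes are what keep both lookups cheap.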
Some important aspects of this model are:
- Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.
- Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph (that is, follow a path through a chain of vertices) both forward and backward.
- By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.
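These three properties can be sketched with a small in-memory structure. The `PropertyGraph` class, its method names, and the sample vertices below are hypothetical, chosen only to show forward and backward traversal over labeled edges; this is not how any particular graph database is implemented.

```python
from collections import defaultdict, deque


class PropertyGraph:
    """Toy in-memory property graph (illustrative sketch only)."""

    def __init__(self):
        self.properties = {}               # vertex id -> dict of properties
        self.outgoing = defaultdict(list)  # tail vertex -> [(label, head vertex)]
        self.incoming = defaultdict(list)  # head vertex -> [(label, tail vertex)]

    def add_vertex(self, vertex_id, **properties):
        self.properties[vertex_id] = properties

    def add_edge(self, tail, label, head):
        # No schema check: any vertex may be connected to any other vertex.
        self.outgoing[tail].append((label, head))
        self.incoming[head].append((label, tail))

    def reachable(self, start, label, forward=True):
        """Vertices reachable from start by following edges with this label."""
        index = self.outgoing if forward else self.incoming
        seen, queue = set(), deque([start])
        while queue:
            current = queue.popleft()
            for edge_label, neighbor in index.get(current, ()):
                if edge_label == label and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen


graph = PropertyGraph()
graph.add_vertex("lucy", type="person")
graph.add_vertex("idaho", type="state")
graph.add_vertex("usa", type="country")
graph.add_vertex("north_america", type="continent")
graph.add_edge("lucy", "born_in", "idaho")
graph.add_edge("idaho", "within", "usa")
graph.add_edge("usa", "within", "north_america")
```

Calling `graph.reachable("idaho", "within")` walks the `within` edges forward, while `graph.reachable("usa", "within", forward=False)` walks them backward. The `born_in` and `within` edges live in the same graph and are told apart only by their labels.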
Cypher is a declarative query language for property graphs, created for the Neo4j graph database.
Triple-Stores and SPARQL
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.
The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:
- A value in a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. For example, (lucy, age, 33) is like a vertex lucy with properties {"age": 33}.
- Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in (lucy, marriedTo, alain) the subject and object lucy and alain are both vertices, and the predicate marriedTo is the label of the edge that connects them.
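The two cases above can be sketched as a small conversion from triples to a property graph. The function name and the idea of passing in a set of known vertex IDs are assumptions made for this example; RDF stores normally tell vertices apart from primitive values by other means, such as URIs versus literals.

```python
def triples_to_property_graph(triples, vertex_ids):
    """Split triples into vertex properties and labeled edges.

    Assumption for this sketch: an object counts as a vertex only if it
    appears in vertex_ids; every other object is a primitive value.
    """
    properties = {vertex_id: {} for vertex_id in vertex_ids}
    edges = []  # (tail vertex, label, head vertex)
    for subject, predicate, obj in triples:
        if obj in vertex_ids:
            # Object is another vertex: the predicate is the edge label.
            edges.append((subject, predicate, obj))
        else:
            # Object is a primitive value: the predicate is a property key.
            properties[subject][predicate] = obj
    return properties, edges


triples = [
    ("lucy", "age", 33),
    ("lucy", "marriedTo", "alain"),
    ("alain", "name", "Alain"),
]
properties, edges = triples_to_property_graph(triples, {"lucy", "alain"})
```

With this input, `properties` holds the age and name values keyed by vertex, and `edges` holds the single `marriedTo` edge from lucy (tail) to alain (head).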
SPARQL is a query language for triple-stores using the RDF data model. (It is an acronym for SPARQL Protocol and RDF Query Language, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar.
As we discussed in a previous post, both the network model and the relational model can represent many-to-many relationships. Are graph databases the second coming of CODASYL (the network model) in disguise?
No. They differ in several important ways:
- In CODASYL, a database had a schema that specified which record type could be nested within which other record type. In a graph database, there is no such restriction: any vertex can have an edge to any other vertex. This gives much greater flexibility for applications to adapt to changing requirements.
- In CODASYL, the only way to reach a particular record was to traverse one of the access paths to it. In a graph database, you can refer directly to any vertex by its unique ID, or you can use an index to find vertices with a particular value.
- In CODASYL, the children of a record were an ordered set, so the database had to maintain that ordering (which had consequences for the storage layout) and applications that inserted new records into the database had to worry about the positions of the new records in these sets. In a graph database, vertices and edges are not ordered (you can only sort the results when making a query).
- In CODASYL, all queries were imperative, difficult to write and easily broken by changes in the schema. In a graph database, you can write your traversal in imperative code if you want to, but most graph databases also support high-level, declarative query languages such as Cypher or SPARQL.
Datalog
Datalog is used in a few data systems. For example, it is the query language of Datomic, and Cascalog is a Datalog implementation for querying large datasets in Hadoop.
Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object).
Datalog uses rules to tell the database about new predicates. These predicates are not triples stored in the database; instead, they are derived from the data or from other rules. Rules can refer to other rules, just as functions can call other functions or call themselves recursively. In this way, complex queries can be built up one small piece at a time.
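To show how a recursive rule derives new facts, here is a rough Python sketch that evaluates one rule by repeating it until nothing new appears. The predicate names (`within`, `born_in`, `within_recursive`) and the sample facts are assumptions for illustration, and the loop is a naive evaluation strategy, not how Datomic or Cascalog actually execute queries.

```python
# Facts written as predicate(subject, object), stored as sets of pairs.
within = {("idaho", "usa"), ("usa", "north_america")}
born_in = {("lucy", "idaho")}


def within_recursive(within_facts):
    """Derive the facts implied by two rules:

        within_recursive(X, Y) :- within(X, Y).
        within_recursive(X, Z) :- within(X, Y), within_recursive(Y, Z).
    """
    derived = set(within_facts)  # first rule: every stored fact is derived
    while True:
        # Second rule: join within(X, Y) with within_recursive(Y, Z).
        new_facts = {
            (x, z)
            for (x, y) in within_facts
            for (y2, z) in derived
            if y == y2
        } - derived
        if not new_facts:  # nothing new was derived, so we are done
            return derived
        derived |= new_facts


closure = within_recursive(within)

# A query built on top of the derived predicate:
# who was born somewhere within north_america?
born_in_north_america = {
    person for (person, place) in born_in if (place, "north_america") in closure
}
```

The pair `("idaho", "north_america")` is never stored as a fact; it only shows up in `closure` because the second rule derives it from the two stored `within` facts.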
Thanks for stopping by! I hope this gives you a brief overview of graph data models. I'm eager to hear your thoughts, so please leave a comment below and we can discuss.