Neo Technology Just Closed $20M in Funding
blog.graphenedb.com

I do believe graph databases are here to stay. However, they are overhyped right now, leading everybody to think their problem is a graph problem, just as the hype around Hadoop led everybody to believe their problem was a MapReduce problem.
A good rule of thumb is whether you ever have a need to query the relations as well as the data. If you ever get to that point, you can be pretty well assured that a graph database will help you. Until then, the safer decision is to stick with a relational db for prod.
Also, check out OrientDB. I have found it to be superior in every way to Neo4j.
I've been using Neo4j for around 3 months, but I'm unconvinced by the "graph" model as having any significant advantage over the relational model. A graph can be trivially modelled relationally as a "vertex" table and an "edge" table. A particular problem I have with the graph model is that edges cannot be referenced by other entries in the database; thus the schema designer must prefer using vertices to model all concepts, in case future data has to make reference to it.
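To make that concrete, here is a minimal sketch of that encoding (table and column names are mine, purely illustrative). Because an edge is just a row with its own key, other data can reference the relationship itself, which is exactly what the graph model makes awkward:

    CREATE TABLE vertex (
        id   bigserial PRIMARY KEY,
        data jsonb
    );

    CREATE TABLE edge (
        id   bigserial PRIMARY KEY,
        src  bigint NOT NULL REFERENCES vertex (id),
        dst  bigint NOT NULL REFERENCES vertex (id),
        kind text   NOT NULL
    );

    -- Unlike in most graph databases, an edge is itself addressable:
    -- e.g. an annotation can point at a specific relationship.
    CREATE TABLE edge_annotation (
        edge_id bigint REFERENCES edge (id),
        note    text
    );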
The "graph model" was, in the 60s and 70s, called the "network model" [1]. This is the environment in which Codd introduced the relational model [2], where he wrote:
> In previous work there has been a strong tendency to treat the data in a data bank as consisting of two parts, one part consisting of entity descriptions (for example, descriptions of suppliers) and the other part consisting of relations between the various entities or types of entities (for example, the supply relation). This distinction is difficult to maintain when one may have foreign keys in any relation whatsoever. In the user’s relational model there appears to be no advantage to making such a distinction (there may be some advantage, however, when one applies relational concepts to machine representations of the user’s set of relationships).
It's quite funny that after the elegance of Codd's work, we're thinking about going back to the network model and all its problems. If we're going to consider that, we need very good reasons, but I haven't seen any.
I do see some features of Cypher as nice compared to SQL, specifically those that implement graph algorithms (e.g. shortest path). I do think there is space in the relational world for languages other than SQL. For example, it is disappointing that no relational databases implement Datalog as a query language. However, it is important not to conflate "SQL" and "the relational model".
[1]: https://en.wikipedia.org/wiki/Network_model
[2]: http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
The advantage may just be the way you interface with the data: SQL vs. Cypher/Gremlin (Groovy). I know with Gremlin one can easily walk the graph, write lambda functions to evaluate certain points as you traverse, build simple breadth-/depth-first traversals, etc., all as a query. Those things just don't seem possible with SQL alone. Given a node table and an edge table, how could I write a query that starts at a node, follows all edges of a given type X levels deep and Y nodes wide, and keeps reference points I can go back to and restart the query from? Maybe with some sort of SQL function, but those things are hideous.
It would be interesting to see a script-type language on top of a traditional relational db.
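For what it's worth, the fixed-type, depth-limited part of that traversal can be written as a recursive CTE in PostgreSQL. This is only a rough sketch, assuming vertex(id) and edge(src, dst, kind) tables like the ones described upthread; per-node breadth limits and restartable reference points are where it gets genuinely ugly:

    -- Follow 'FRIEND' edges up to 3 hops from vertex 1, keeping the path
    -- so a later query could be restarted from any node on it.
    WITH RECURSIVE walk (id, depth, path) AS (
        SELECT v.id, 0, ARRAY[v.id]
        FROM vertex v
        WHERE v.id = 1
      UNION ALL
        SELECT e.dst, w.depth + 1, w.path || e.dst
        FROM walk w
        JOIN edge e ON e.src = w.id
        WHERE e.kind = 'FRIEND'
          AND w.depth < 3
          AND e.dst <> ALL (w.path)   -- avoid cycles
    )
    SELECT * FROM walk;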
The main reason is speed. Graph traversal is constant time per hop, as opposed to O(log n) per index lookup in a relational database. The speed only becomes relevant later, and you can therefore add a graph database as a slightly delayed replica on top of your primary DB for the things you need to traverse fast.
This is true, however I think this again conflates two things: the relational model and the "tabular" data layout. I gather that databases like Neo4j obtain constant time traversal by storing each node's list of "relationships" at the same location as the node itself. We might be able to generalize this to a relational database, where every row in the database is stored together with the list of addresses of all other rows which reference it via a foreign key.
How? Describe a detailed solution.
PostgreSQL has an "array" data type (http://www.postgresql.org/docs/9.4/static/arrays.html), as do other databases. You could use that for that purpose (I know of research prototypes that have attempted this). It is unlikely to work well in general, especially given that degree distributions are often skewed (i.e., there is very high spread in the number of relationships per node) -- if I had the time, I would do a proper empirical comparison.
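A rough sketch of what that could look like in PostgreSQL, purely illustrative (the table and maintenance strategy are my own assumptions): keep a denormalized array of neighbour ids on each row, maintained by the application or a trigger, and use a GIN index for reverse lookups. The skew problem mentioned above shows up as a few enormous arrays.

    CREATE TABLE person (
        id         bigint PRIMARY KEY,
        name       text,
        friend_ids bigint[] NOT NULL DEFAULT '{}'   -- denormalized adjacency list
    );

    -- GIN index for reverse lookups over the arrays.
    CREATE INDEX person_friends_gin ON person USING gin (friend_ids);

    -- Reverse lookup ("who lists person 42 as a friend?"), served by the GIN index:
    SELECT id, name FROM person WHERE friend_ids @> ARRAY[42]::bigint[];

    -- Forward one-hop neighbourhood of person 42, no edge table needed:
    SELECT p.*
    FROM person p
    JOIN person me ON me.id = 42
    WHERE p.id = ANY (me.friend_ids);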
Couldn't this be improved with new data types and types of indexes in a relational db, similar to spatial types/indexes?
Datalog is slow; even if performance were improved, it still wouldn't be competitive with SQL. That said, it is really handy.
I'm told that Datalog is equivalent to SQL with recursive subqueries, which means that any Datalog query could be compiled to SQL. (Whether it's feasible to compile to efficient SQL, I don't know.)
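Here's the standard textbook example of that correspondence, with the Datalog shown only in comments and an assumed edge(src, dst) table; whether a compiler would actually produce SQL this clean is another matter:

    -- Datalog transitive closure:
    --   reach(X, Y) :- edge(X, Y).
    --   reach(X, Z) :- reach(X, Y), edge(Y, Z).
    WITH RECURSIVE reach (src, dst) AS (
        SELECT src, dst FROM edge
      UNION
        SELECT r.src, e.dst
        FROM reach r
        JOIN edge e ON e.src = r.dst
    )
    SELECT * FROM reach;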
I'm not aware of such a thing.
Really, this is based on my reading and a few experiments; I'm not 100% sure that Datalog can't be made faster, and SociaLite (see below) seems to claim I'm wrong.
Datalog is the native query language of Datomic. I can't put my finger on it, but I think there is a project for querying Cassandra with Datalog + Clojure.
This is the most recent work I'm aware of:
- Python querying of Hadoop with a Datalog-like language: http://socialite-lang.github.io/
- Overview: http://fr.slideshare.net/Hadoop_Summit/t-325p211seo
Also, the best way to learn Datalog: http://www.learndatalogtoday.org/
I've only had a brief look at Neo4J, but I really like what I see. Graph databases seem to be marketed as an ideal way to represent a social network or similar, where you want to represent the connections between entities.
I see them as much more than this. I love using document databases because I don't have to worry about strict schemas. However, I lose a lot when I have multiple entities that are related in some way. Say I have a list of books and a list of authors. In MongoDB the book document would just hold the author's id, and I would then need to do a separate query to get the author's details.
Before graph databases, this sort of data would be best represented in a relational database as they have a really efficient way to join related data together.
However, with graph databases I get the best - and even better - of both of these worlds. My book entity contains a pointer to the author(s), so querying a book and its authors is even faster than in a relational database.
In addition, there is no fixed schema, so I am not tied down to a strict schema, which lets me iterate faster. This could of course also be seen as a disadvantage; however, I see no reason why a graph database couldn't build in the facility for strong validation of the schema for specific entity types.
What is there not to like about Neo4J?
> However, with graph databases I get the best - and even better - of both of these worlds. My book entity contains a pointer to the author(s), so querying a book and its authors is even faster than in a relational database.
Many folks take this as a given, but the performance difference is often not that large, and may even be in the other direction, if you build the right index (e.g., on the AuthorID in this case). The index is essentially those pointers as well. If the data fits in memory (Neo4j, in the past at least, has pretty much required that), then the B+-tree index lookup is also very, very fast. You would be surprised at how fast PostgreSQL can do even seemingly complex graph-natured queries, e.g., finding, for an author, the co-author with whom they have written the most books.
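To make that example concrete, here is roughly what that query looks like in plain SQL; the book_authors schema and the author id are made up for illustration:

    -- Hypothetical schema: one row per (book, author) pair.
    CREATE TABLE book_authors (
        book_id   integer NOT NULL,
        author_id integer NOT NULL,
        PRIMARY KEY (book_id, author_id)
    );
    CREATE INDEX ON book_authors (author_id);   -- the "pointer" index discussed above

    -- For author 42, find the co-author with whom they share the most books.
    SELECT other.author_id, count(*) AS shared_books
    FROM book_authors AS mine
    JOIN book_authors AS other
      ON other.book_id = mine.book_id
     AND other.author_id <> mine.author_id
    WHERE mine.author_id = 42
    GROUP BY other.author_id
    ORDER BY shared_books DESC
    LIMIT 1;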
In my experience with Neo4J a few years ago, I naively used it as my primary DB. The latency (not surprising given it's JVM-based) was painful, and there were often classic relational aspects that we tried to capture in the graph.
This is ENTIRELY MY FAULT - not Neo4j's; they never market it as that - just as MongoDB never markets itself as a true replacement for relational problems.
But that said, there is plenty not to like about Neo4j if you don't use it right - so make sure you use it for what you need and not for what you don't.
I was at a Neo4j intro meetup earlier in the week. They do market it as a primary DB. Maybe the capabilities are different now.
I personally really like RDF graph databases, because while they look like graph databases such as Neo4J, they might be implemented completely differently. For example, Virtuoso is one that is fast for some query types, and underneath it is a columnar relational database. GraphDB by Ontotext is a custom SPARQL-only implementation. Queries that work in one will work in the other (bugs excepted :( ); the only thing that changes is what they are fast at :). A side benefit is that you get a decent JSON API for free with JSON-LD and JSON-SPARQL.
Whenever I try Neo4J and Cypher, it just feels slow and limited in comparison to the semweb competition.
previous discussion of Neo4j: https://news.ycombinator.com/item?id=7946024
Congrats to the team at Neo Technology. If you're interested in discovering how to visualize Neo4j graph databases, check out Linkurious: http://linkurio.us/
Or just install Neo4j, start it, and go to http://localhost:7474 in your favorite web browser.