Apache AGE, a PostgreSQL extension with graph database functionality
github.comAlways bet on PostgreSQL!
I hope AGE matures a bit in the future. There are lots of use cases for Graph Databases. One I'm interested in is bitemporality. It's easy to use ltree or CTE for tree-like structures. But what if you want to move nodes in the graph at certain times? Like a device being scheduled to be in different rooms across time. And also the history of those schedules. In a graph database you can label edges with temporal attributes and then query for a view of the graph at a certain point in time and in a certain history state by filtering the edges.
That's a really interesting idea.
Can you recommend any good references for bitemporality in graph dbs?
Theres XTDB
Thank you!
XTDB Core2 looks very interesting. I've worked a lot with Arrow, which makes it extra appealing.
I'm hesitant with databases like these because they are less battle-proven. I normally want my database to be as boring as possible.
What subset of the openCypher dialect specification will Apache Age support?
Every implementation is all over the place and completely non-portable. And neo4j performance leaves much to be desired.
My personal go-to is RedisGraph paired with RedisInsight for the instant visualizations. It just feels "right" and, while not perfect, is overall intuitive.
I've been following RedisGraph with great interest since I first heard about it. Fell in love with cypher but not a big fan of neo4j. Most recently I've been playing with EdgeDB (not cypher-based) which scratches my graphDB itch pretty well but it's still a little too early for me to consider adopting it in a serious project.
I'm glad there seems to be continued interest in graphDBs, I think there's a lot of potential in that space and I'm eagerly awaiting a clear winner to emerge.
We might see a clear winner in query language, not product. What is the clear winner in relational databases? SQL certainly but what product?
The Apache AGE is inspired by AgensGraph, a multimodel database fork of PostgreSQL. So it uses AgensGraph dialect.
AGE is an extension, not a fork. You can have relational tables and graphs in the same database, probably you can also use both in the same query.
That's right, AGE is designed to use both queries either separately or together. SQL (PostgreSQL), openCypher (AGE)
Related:
Apache Age: A Graph Extension for PostgreSQL - https://news.ycombinator.com/item?id=26345755 - March 2021 (45 comments)
Apache AGE: PostgreSQL-based graph database - https://news.ycombinator.com/item?id=26309560 - March 2021 (11 comments)
Anyone tried this? How does the performance compare to neo4j and RedisGraph?
I’m about to give RedisGraph a try and I guess I will try this one a go as well.
AGE 1.1 should perform better or at least similar to Neo4j. Not sure about RedisGraph.
Did you try dgraph? For our use cases it won over neo4j. Didn’t try redisgraph.
We’ve been using Dgraph in production for a few years now.
The project is fine for hobby projects but it is NOT production ready.
Don’t take my word for it, though… I invite you read through some of the issues reported in their discussion forums and to take a look at their Github contributions over the past year.
There was major turmoil in Dgraph Labs (the project’s maintainers) last year which resulted in the CEO and 95% of the engineers exiting the company. They are currently in a rebuilding phase, with limited staff and runway.
There are several critical bugs, which lead to either data loss, data corruption or cluster instability, which the current maintainers have failed to fix. Additionally, their customer support is often either unresponsive or unhelpful (even for paying customers).
Running a Dgraph cluster is expensive, with heavy memory utilization and favoring vertical scaling. If you need scale, then be prepared to spend big.
The documentation is not great and because very few people use this project in production, help is extremely limited.
Best of luck to you should you choose Dgraph and to anyone currently using it already.
There is also a graph layer for Cassandra. It was based on Titan ... JanusGraph.
I also played around with a graph-document database hybrid when I had downtime, but never got it close to anything usable.
A json document database with relations between documents is basically a property graph. I've seen a lot of the document databases (rethinkdb, orientdb, elasticsearch, etc) that seem close to realizing this too, but no one has run with it.
Most document databases have some sort of nested "walker" api, and if your json doc has properties that are subdocuments, will walk those. That's basically a graph api.
I wrote it as a "streaming api" so a large document/property graph could be serialized out to the client as the lookup engine walked the graph, and you don't need to fully load a complex set of documents in the query layer memory before sending it out to the client.
But I just didn't have the development horsepower to get to the various query and index capabilities. I think the general distributed design was decent and offered hybrid plain-old-table, document, and graph capabilities all in one. And cassandra, PITA that it is, does linearly scale.
> There is also a graph layer for Cassandra. It was based on Titan ... JanusGraph.
It also was monstrously hard to reason about the components in that thing, and our experience was that it didn't take any efforts to delineate which parts run "locally" versus which parts are evaluated on the Janus server
At the very least, our experience was that the (Janus/Gremlin/Tinkerpop) client library who just wished to run some Gremlin needed to be aware of all the nuances of the backing Cassandra store, and not (as a reasonable person would presume) queries sent to Janus and Janus talks to Cassandra using its own auth
0/10 I would quit over someone trying to make me use this again
Didn't dgrapgh dissolve? Based on open functionality discussed on their help forum and stbility issues I wouldn't touch it any more.
A product which lived up for VC money and little more.
No.
We benchmarked all 3; redisgraph was fastest by far.
Downside is it's harder to keep all data in memory and Redis has less sophisticated query facilities. Not sure how AGE stacks up in that regard, especially interesting would be joining graph data and non-graph data.
Regarding query facilities RedisGraph supports Cypher
Ok, but on what metrics? I don't mean I disbelieve, I would just like to know that it was fastest on the things I might care about.
it's an in-memory graph, I'd kind of expect it to be faster than the others
Thanks, will try it soon.
This is written in C. I wonder how common it is to write PG extensions in safer langugaes and what would be the most suitable.
I'm somewhat wary of using nontrivial C extensions, having seen so many of them sometimes seg fault the backend (eg PostGIS). There seem to be PG backend crashes described in this projects issues as well.
PG has a special memory manage rule, named MemoryContext. All memory allocated in a context will disappear when it leaves that context. this means that you can safely not free memory, or your memory will be freed in unexpected places. this is a big conflict with the way rust manages memory. write extension in rust won't improve it much.
And in PG, there is a special method to create a process, creating threads is not possible because the logging system makes heavy use of setjmp().
> creating threads is not possible because the logging system makes heavy use of setjmp().
Naive question from a non-c user, setjmp/longjmp just manipulate the stack and since each thread has its own execution stack, that should be completely safe ISTM - so why is it unsafe/impossible? I'm missing something.
The general issue is that it is not designed for multi threaded environment and there is bunch of global and frequently accessed state. One could probably comb through the entire codebase and make all that state thread-local, but the benefit probably does not outweight the amount of work.
There's https://github.com/tcdi/pgx for writing extensions in rust.
Postgres itself is written in C. I suppose its every internal interface is in C. I wonder how many unsafe sections would an extension written in Rust have to have to use these interfaces.
I wish something like Lua + LuaJIT could be used to write such extensions; at least it's memory-safe. OTOH mapping these C interfaces to Lua structures, and making them work with GC may happen to be non-trivial.
PG ships with Lua support: https://www.postgresql.org/docs/current/external-pl.html
(Also Python, Javascript, and Java)
I don't know specifics about the API coverage. It seems this extension mostly just implements new SQL visible functions and data types, which should be doable from those languages as well. Composite types might have to be defined as PG records (or json) instead of C level new PG object types.
You can write stored procedures in it. Which is fine.
You cannot though write a new storage engine, a new kind of index, or something else that takes an extension.
I think very few PG extension use cases need new storage engines. Looking at https://age.incubator.apache.org/overview/ it says it's using the normal PG storage.
Regarding indexes - (1) you can implement most custom indexe scenarios based on supplied expressions involving custom functions: https://www.postgresql.org/docs/current/indexes-expressional... (2) I suspect for the graph use case you could leverage json indexes: https://www.postgresql.org/docs/current/datatype-json.html#J...
This is not to claim with any high degree of certainty that this could be implemented in PG/Lua, just from my armchair there doesn't seem anything immediately obvious that would prevent it.
For example the querying in AGE is syntactically implemented in the function cypher() that is used in the examples to receiving custom syntax as strings with the "dollar quoting"[1] syntax:
[1] https://www.postgresql.org/docs/current/sql-syntax-lexical.h...SELECT * FROM cypher('graph_name', $$ MATCH (n) WHERE exists(n.surname) RETURN n.first_name, n.last_name $$) as (first_name agtype, last_name agtype);
Timescale wrote their Promscale extension for Postgres in Rust. So it exists.
I'm considering using a graph database for a SaaS product. If I used Apache AGE, I would probably have a "graph" for each customer to partition the data. Are there any downsides or limitations to having thousands of separate graphs?
From the documentation it seems that each graph will use a separate "namespace" in Postgres. Are there any performance costs of switching namespaces for each query?
Or do you recommend that we use a single graph with a label per customer? This option seems like it could open up some security issues if some queries forget to add this label. By using a separate graph per customer, the query will need to have a valid graph name for a customer to return any data. If it is filtered by a label, you can easily forget to add it and think everything is OK because it actually returns results.
I might try running this with Docker just to try it out, but probably this is the type of project to watch and wait for maturity.
I am a big fan of graph databases. Professionally I have used RDF data stores with SPARQL queries and Google’s Knowledge Graph with a pattern matching query mode. I play around with Neo4J, but no one has paid me to use it yet.
I think it very likely that in a year or two AGE will get better Cypher query language support and other changes, and should be a wonderful platform for combining relational and graph data stores.
"Apache AGE is currently being developed for the PostgreSQL 12 release"
Well sorry, we have PostgreSQL 15 already.
If you keep reading, the words following your quote make it clear that PG15 should be supported:
"and will support PostgreSQL 13 and all the future releases of PostgreSQL."
and I believe the more recent the version is, the less stable it is. Sure there will be additional nice features, but older versions are still in progress. And yeah, if you kept reading, AGE sounds like it will support more versions in the future.
Apache AGE Discord
I’ve been waiting to see Apache AGE on HN. Looks amazing, thanks for all the great work!
Could it be efficient to use Apache AGE for e.g. retrieving all comments on an article?
Currently I’m using materialized paths to efficiency return all commments but would be keen to know if AGE can help query comments for an article more powerfully.
The ltree extension (https://www.postgresql.org/docs/current/ltree.html) is perfect for this usecase.
How does ltree compare with jsonb?
Have you tried recursive CTEs with a simple id, parent_id etc schema? These should perform very well if those columns are in an index.
Afaik this is pretty much the canonical way to store recursive comment trees. Or any kind of DAG.
As long as comments are a tree, there’s only one path from the root (the post) to an individual comment. How would a recursive CTE perform better than a prefix scan on an indexed string column?
Storing a pointer to each node’s parent or using sorted sets seems like it would make the parent poster’s query slower. Those approaches would make it easier to reparent comments, though, and they’d support arbitrarily deep trees (whereas the materialized path implementations I’ve seen limit path length).
Especially in postgres, you can use a hash index to get constant time index lookups rather than log(n) with a b-tree index, which is the best you could do with prefix lookups on a column. And you're storing and indexing less data since the tree is inherent in the structure and not expressed as part of the data.
Honestly I'd test both implementations to see which is fastest, but my gut is still that recursive CTEs would be fastest, while also simplest, structure-wise. You'd also still be getting all the benefits you'd expect from database-native functionality, like referential integrity and schema enforcement, etc. Presumably there's structure in these indexed string columns that the database knows nothing about, and thus can't enforce any constraints or optimize outside of plain string prefix lookups.
My experience has been that people don't try recursive CTEs because they don't realize they exist, and so they reach for all sorts of exotic or bespoke implementations of essentially that same concept. So I at least try to make sure people know they're a really solid tool to keep in your tool belt, and one that hasn't let me down yet even in very large data sets.
This comment gave me a flashback to Celko's SQL for Smarties. I believe the updated books are split off into a few smaller books? But the section/book on trees in a relational database helped me greatly once in a galaxy far far away.
AGE can handle that and recursive ctes can as well, but AGE has mechanisms to handle cyclic graphs as well.
Why not a trigger that maintains this in a simpler query in a separate table? Sounds more performant to me!
Recursive CTEs sounds like something you would do if your total comment count in the db is not in the six figures or something. What does HN do?
I once achieved it by setting a parent_id string column - for root comments it’ll just be article id. For replies it’ll be “parent_comment.parent_id || parent_comment.id”. Then it’s just a single list no recursion needed to get the full hierarchy at whatever level. Can also be easily migrated to dynamodb and get infinite scaling and zero downtime costs.
Interesting. What are some good extensions for pg? I have only used UUID and postgis.
> I have only used UUID and postgis.
If you used "uuid-ossp" to get uuid_generate_v4(), then this is no longer necessary since Postgres 13 as there is now a built-in gen_random_uuid()
Oh wow TIL thank you
The ltree extension is fantastic if you have data like comments or any other hierarchical structure.
Well, I wouldn't call it interesting, but I like the citext extension for case insensitive comparison.
TimescaleDB
pg_stat_statements
This is a great project, but last time I checked it was lacking a lot of CYPHER features and wasn’t moving very fast forward. But I’m hoping it will catch up to the point it will become useful.
Cypher as in the Neo4J query language?
There is now an OpenCypher specification and AGE seems to strive towards supporting most of it.
Does this construct edge tables (many-to-many) for every relationship behind the scenes? if so, can attributes be added to the edges?
I'd hope so! You can do that in relational DBs.
has anyone here worked on graph neural networks ? basically creating embeddings for node based on their edge connectivity (or reachability) and using that for neural networks ?
how do you do this at scale ?its generally a NP hard problem, but wondering whether something like AGE helps.
not sure how Google, etc or even someone on fraud detection does this at scale
You subsample. One package I used made N 'random walks' for each node. The random walks are written out as 'sentences', where the node id's are words.
That results in a huge text file, that you then embed as if it were a normal text. The result is a normal 'word embedding' where the words are in reality the node id's. Works like a charm. Highly scalable.
really ? so u keep subsampling as the data becomes larger and larger.
instead of ...well...throwing more hardware that seems to be easier and easier these days.
P.S. not trolling. im genuinely wondering if there is a better way to split the problem heuristically
All I'm saying is that you don't take into account all paths for each node. Just for ex 100 random walks starting at each node. And that results in an embedding that is 'good enough'.
Of course it is better to throw more hardware at the issue. But at a certain point the added value of being more precise or adding more hardware becomes moot, because you gain 0.1%
That is what I meant by 'it scales'. You can solve 'reasonably complex issues' with 'reasonably cheap hardware'
How are graph edges and nodes exposed to the Postgresql type system?
There is a documentation for AGE that has some basic examples on how cypher and sql can be mixed. Regarding nodes and edges: they are sub-types to a type called Agtype. More details are in the documentation.
AGE uses Agtype l, which is a superset of JsonB for its uses, the primary addition is the edge and vertex type.
I create a DAG using recursive sql. I assume that saving data in a graph and querying the graph with a native graph language would be faster. Has anyone benchmarked performance differences between the two?
Depends on the use case. Recursive SQL can be good enough, maybe even faster, for certain use cases. The problem isn't so much the query language but the indexing. Graph engines index the nodes and edges in a particular way so that traversal is fast.
Most examples of Recursive SQL I've seen will only involve nodes on exactly one Table an with exactly one kind of a relationship/edge (for example a tree with "parent" edges). Graph DBs allow you to relate multiple different types of nodes using multiple kinds of edges. The edges can have queryable attributes like an intermediary table in a many-to-many relationship. And all of that is still indexed efficiently.
Wish Postgres had an optional data type (tagged union), Does someone know an extension which implements that?