Querying an RDF triplestore using SPARQL and Relational pipes
For research, I created experimental RDF storage on top of Parquet and Apache Spark for querying big graphs[1]. It converts the RDF graph into a sort of property graph, with a row for each entity and a column for each possible property. The trick is to use a columnar format with the proper encoding (in our case Parquet) to cope with the very large number of columns and the huge NULL space. With this representation we can eliminate costly joins for most common queries, and also reduce the size of the joins that remain.
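A minimal sketch of the property-table idea described above (the data and names are invented for illustration): each entity becomes one row and each RDF property becomes a column, so star-shaped lookups need no joins, and a columnar format like Parquet stores the many NULLs cheaply.

```python
# Pivot a list of (subject, predicate, object) triples into one
# wide row per subject, one column per property. Toy data only;
# a real implementation would write this out as Parquet.

triples = [
    ("alice", "name", "Alice"),
    ("alice", "age", 34),
    ("bob", "name", "Bob"),
    ("bob", "city", "Paris"),
]

properties = sorted({p for _, p, _ in triples})
rows = {}
for s, p, o in triples:
    rows.setdefault(s, dict.fromkeys(properties))[p] = o

# "All properties of alice" is now a single row lookup instead of
# one self-join per property over a triples table.
print(rows["alice"])  # {'age': 34, 'city': None, 'name': 'Alice'}
```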
Wikidata has a great SPARQL editor with plenty of examples and built in visualizations:
My experience is that it's an opaque query tool, and the Wikidata editing process does nothing to make it easier. Maybe it works if you know the ins and outs of your data and have at least a basic understanding of the query language.
This is an awesome post! I'll be rereading this a couple of times ;-)
> Several ways of connecting to an RDF triplestore
There is another way to connect to an RDF triple store [1]. It's done by a company called Triply.
Triply made a product that is basically a GUI that allows you to host RDF data and use SPARQL to query it. You can also query other public datasets [2]. Currently, it's a B2B offering only, but there is a consumer version in the works. I wouldn't know when it comes out.
For now, you can try out the SPARQL querying feature on public datasets [2].
Disclaimer: I recently started working there; this post is my own opinion, not Triply's.
[1] That uses Virtuoso or Jena under the hood.
[2] This example uses DBPedia: https://triplydb.com/wikimedia/dbpedia/sparql/dbpedia
I was wondering if there are use cases for this in the real world where a relational database doesn't cut it?
I was tech lead for an RDF/Linked Data based Library Discovery Interface called Prism.
The dominant Library Interchange Format is MARC, which predates the relational database by a few years and is a hierarchical document structure. It's possible, but not particularly helpful, to store it normalised in a relational database.
I gave a presentation almost a decade ago about our work. [1]
The takeaway is that Linked Data is just a more natural fit for modelling Library Data, especially when you use external sources like VIAF[2] to help anchor your identifier URIs.
Sadly, whilst RDF has seen some uptake in the library world (e.g. BIBFRAME[3]), the full potential was never tapped whilst I was working in the space. Quad Stores are very fun to tinker with though, and features such as inference and property paths in SPARQL 1.1[4] allow you to do some interesting things that are difficult, or non-idiomatic, in a relational system.
[1] - here: https://www.slideshare.net/philjohn/linked-library-data-in-t... [2] - http://viaf.org/ [3] - https://www.loc.gov/bibframe/ [4] - https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#prop...
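Property paths are a good example of the "difficult or non-idiomatic in a relational system" point: a SPARQL 1.1 path like `?x ex:broader+ ?y` walks a hierarchy to any depth in one pattern. A minimal sketch of what that evaluates to, with an invented subject-heading hierarchy:

```python
# What a property path with the "+" modifier computes: the
# transitive closure of one predicate. The terms below are made up
# for illustration.

broader = {  # child -> parent
    "microbiology": "biology",
    "biology": "science",
    "chemistry": "science",
}

def ancestors(term):
    """All terms reachable by following 'broader' one or more times."""
    out = []
    while term in broader:
        term = broader[term]
        out.append(term)
    return out

print(ancestors("microbiology"))  # ['biology', 'science']
```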
Potentially, if there is a need for the description logic supported by OWL. Some of the logic would be pushed from the application into the data layer, e.g. as a replacement for something like Prolog in the app. If you're dealing with a database and not a "knowledge base" of facts, then it's probably not for you. If you need a graph database, consider regular graph databases instead.
I have never seen a case where a triple store was used because it was necessary to achieve an outcome; it was always part of the premise, chosen to show that a certain task can be achieved using triple stores. The "semantic" label is also problematic: some people think "semantic" technology is magic that will somehow solve their problems for them.
My experience with semantic technology is from university and a commercial project. An architect who didn't talk to the engineering team decided that "semantic" technology should be used. The project was a catastrophe; we spent most of the time trying to get the technology to work for the simplest things. The situation improved when we started working around the semantic stuff: using a relational DB internally allowed us to improve performance by a few orders of magnitude.
Because all data is stored as triples of subject, predicate and object, the indexing options for improving query performance are limited compared to relational databases. While it's possible to change the graph structure to speed things up, the structure is usually chosen for semantics and defined in an ontology. A change in the structure is also a change of the semantics of the graph.
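For context on those indexing options: triple stores typically compensate for the fixed (subject, predicate, object) shape by maintaining several permutation indexes, commonly SPO, POS and OSP, so any triple pattern with one or two constants can be answered by a lookup. A simplified, dictionary-based sketch of that layout (toy data, invented names):

```python
# Three permutation indexes over the same triples, so that each of
# the patterns (s, ?, ?), (?, p, ?) and (?, ?, o) has a cheap
# entry point. Real stores use sorted on-disk structures instead.
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("alice", "age", 34),
    ("bob", "knows", "carol"),
]

spo = defaultdict(list)  # lookup by subject
pos = defaultdict(list)  # lookup by predicate
osp = defaultdict(list)  # lookup by object

for s, p, o in triples:
    spo[s].append((p, o))
    pos[p].append((o, s))
    osp[o].append((s, p))

# Pattern (?s, knows, ?o) is served by the POS index:
print(pos["knows"])  # [('bob', 'alice'), ('carol', 'bob')]
```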
Given an undocumented triple store, it's quite difficult to figure out the graph structure stored inside. In a relational database you can just run the equivalent of "SHOW TABLES" and go from there. In the semantic world, you need a manual for the ontologies used. It's sad because the whole point of "semantic" technology was to attach meaning to data.
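The closest SPARQL analogue to "SHOW TABLES" is enumerating the predicates actually used, e.g. `SELECT DISTINCT ?p WHERE { ?s ?p ?o }`. A sketch of that query over an in-memory triple list (data invented for illustration), which also shows the limitation: you get the vocabulary, but its intended meaning still lives in the ontology documentation.

```python
# Enumerate the distinct predicates in a triple set, the rough
# equivalent of listing tables/columns in a relational database.

triples = [
    ("alice", "rdf:type", "ex:Person"),
    ("alice", "ex:name", "Alice"),
    ("bob", "ex:name", "Bob"),
]

predicates = sorted({p for _, p, _ in triples})
print(predicates)  # ['ex:name', 'rdf:type']
```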
Triple stores also receive way less attention than databases like Postgres or MariaDB and I'd rather use something proven in production scenarios.
I work in the energy space. A relatively common format for information exchange is the Common Information Model ("CIM")[1] which is usually serialized as RDF XML documents.
In Europe, this format is used as the standard data model for transmission system operators (TSOs) to share network information needed to run the Europe wide electrical grid. ENTSO-E [2] publishes a set of RDFS profiles that more tightly scope the data model to their use cases. [3]
With that use case in mind, why would an RDF database be useful here? Data size. It's quite easy to get into millions of objects in the data graph for a single distribution feeder. A large utility might have thousands of such feeders, plus associated sub-transmission and transmission infrastructure (the hierarchy is nicely shown in [4]).
This can be represented in a relational database, but a lot of the queries start to become recursive. If object A connects to object B which connects to object C, and you want to query everything connected to A, you don't know that C exists until you've found object B.
Refs:
1. https://en.wikipedia.org/wiki/Common_Information_Model_(elec...
2. https://en.wikipedia.org/wiki/European_Network_of_Transmissi...
4. https://en.wikipedia.org/wiki/Electric_power_distribution#/m...
> This can be represented in a relational database, but a lot of the queries start to become recursive.
Recursive queries are natively supported in SQL via Common Table Expressions (CTEs). It's not like a triplestore is doing anything different underneath.
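For concreteness, here is the connectivity query from the parent comment as a recursive CTE, runnable on the stdlib sqlite3 module (the schema and node names are invented for illustration):

```python
# Find everything transitively connected to 'A' using WITH RECURSIVE
# over a plain edge table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE connects (src TEXT, dst TEXT)")
con.executemany("INSERT INTO connects VALUES (?, ?)",
                [("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")])

rows = con.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'A'
        UNION
        SELECT c.dst FROM connects c
        JOIN reachable r ON c.src = r.node
    )
    SELECT node FROM reachable WHERE node <> 'A'
""").fetchall()

print(sorted(n for (n,) in rows))  # ['B', 'C', 'D']
```

Note that C is reached even though it was unknown when the query started; the edge (X, Y) is correctly left out.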
Most of my academic and current research career involves RDF and SPARQL, and I wish they were used a bit more outside of academia and esoteric library/information-science fields. Currently I'm taking a bunch of heterogeneous data from different providers, most of which is in some kind of relational database, transforming it into RDF and making it queryable with SPARQL. The main benefit is the creation of aggregated entities from related resources across providers, plus the possibility of doing some advanced querying with SPARQL, though I don't know enough about relational databases to say whether it's just as doable there.
I'll say one disadvantage of RDF is the lack of any well-implemented reification of triples.
RDF* (edge properties) fixes this definitively, and it has spread to most RDF systems. See http://stardog.com/docs/ for example.
SPARQL is nicer than SQL in a few cases:

* Federation is better than ETL + data warehousing
* Data is to be integrated on demand by the user
A practical example of mine is the UniProt database in the life sciences and the European Patent Office SPARQL endpoints.
These two datasets have some intersection of data. Combining them in a classical data warehouse with ETL pipelines would cost a few million in start-up costs (full data fidelity, small team, one year of work, optimistic). The same with non-federated RDF/SPARQL is 3 days of work. With federated querying it is 2 minutes of work.
SQL has a richer ecosystem, with many more people confident in its usage, more deployment options, etc. Which is why you will often see tools like Stardog Virtual Graphs, which turn existing SQL DBs into SPARQL endpoints (by translating SPARQL to SQL) for in-organization federated knowledge graphs, i.e. no or minimal ETL pipelines, with direct querying on (standby copies of) primary data sources.
In some domains the "business" analysts know SQL; more rarely, they even know SPARQL. Letting them ask any query they can think of, not bounded by one specific database, can be extremely valuable. For organizations that can extract that value, the lower market penetration of SPARQL is sad but not a real issue. This works in practice for SPARQL but not for SQL, because what a user can state with a SERVICE clause in SPARQL requires a DBA to set up a foreign-table connection in the SQL world.
Another example is when a few (N) hospitals need to exchange data. They have a few more relational databases (N+2) in house holding this data for their patient groups. Upon project commencement they notice that all are SQL, but none are alike, due to vendor differences and, more problematically, modelling differences. Transforming their data to RDF is no more complicated than standardizing on one new schema. But the RDF gives immediate integration with SNOMED CT, ICD-10 and LOINC, which allows easy queries that take into account the hierarchical knowledge in those medical standards. Those queries would be possible to write by hand in SQL, but are easier in RDF/SPARQL when attaching a minimal OWL reasoner. Integrating with geospatial data is then again easier, because in this country that is provided as RDF as well.
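The hierarchical-query benefit described above boils down to expanding a subclass hierarchy before filtering, which is what a subclass-aware reasoner does for you. A minimal sketch with invented codes and patient data:

```python
# "All patients with any kind of diabetes" over an ICD-10-style
# class hierarchy: expand the subclasses of the target concept
# first, then filter. Codes and patients are made up.

subclass_of = {          # child code -> parent concept
    "E10": "diabetes",   # type 1
    "E11": "diabetes",   # type 2
    "I10": "hypertension",
}

patients = [("p1", "E10"), ("p2", "I10"), ("p3", "E11")]

def descendants(concept):
    """The concept itself plus every transitive subclass."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

wanted = descendants("diabetes")
print([p for p, code in patients if code in wanted])  # ['p1', 'p3']
```

In SPARQL this whole expansion collapses to a property path such as `?code rdfs:subClassOf* :Diabetes`; in SQL it would be a hand-written recursive CTE per terminology.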
NOTE: A full-fidelity UniProt schema (including all sub-datasets, e.g. UniParc) would require 150-200 tables, depending on some modelling choices. I assume EPO to be in the same order.
NOTE2: While federated querying is a standard SPARQL feature this can of course be limited/turned off depending on the security/legal context.
RDF is apparently heavily used in bioinformatics and the life sciences. I'm not well versed enough in the subjects involved to say whether a relational DB would not cut it, though.
Being able to create cross-links between data independent of a predefined schema seems useful for research, though. With RDF, you can just define a predicate on your own and link two objects/subjects in completely different, independent triple stores, if I understand correctly.
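That cross-linking really is this lightweight: because every RDF resource is a global URI, linking two datasets is just adding one triple with a predicate you mint yourself. A toy sketch (all URIs invented):

```python
# Two independent datasets, linked by a single new triple with a
# predicate we define ourselves; no schema change anywhere.

dataset_a = [("http://a.example/gene42", "ex:label", "BRCA1")]
dataset_b = [("http://b.example/paper7", "ex:title", "A BRCA1 study")]

link = ("http://a.example/gene42",
        "http://my.example/discussedIn",   # our own predicate
        "http://b.example/paper7")

merged = dataset_a + dataset_b + [link]
print(len(merged))  # 3
```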
RDF, SPARQL and Linked Data are standards for interchange. They don't care about the underlying technology, you could definitely expose a relational SQL database using a Linked Data endpoint.