Realtime funnel analysis using Solr and Cassandra
blog.getjaco.com
> For funnel analysis, it’s not feasible to use this data model for getting back a summary of the funnel steps and the sessions matching it, since there’s no option in Solr to run a recursive query, which would allow going over each session and checking whether it’s a match for the funnel.
I don't think this approach scales, even in an environment that supports recursive queries like PostgreSQL.
The more scalable approach would be to either use a commercial database system with explicit support for pattern matching, or encode the conversion path as a string (e.g. "top page -> product page with SKU=1337 -> Purchase" becomes "T_SKU1337_P") and use REGEX/GROUP BY.
In all cases, this sounds like a suboptimal use case for either Solr or Elasticsearch.
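To make the string-encoding idea above concrete, here is a minimal Python sketch. The event names, the one-token codes and the `sessions` layout are all made up for illustration; in a real setup the grouping and regex matching would run inside the database rather than in application code.

```python
import re
from collections import Counter

# Hypothetical event-to-token mapping; these codes are assumptions, not from the post.
EVENT_CODES = {
    "top_page": "T",
    "product_page:1337": "SKU1337",
    "purchase": "P",
}

def encode_path(events):
    """Join a session's ordered events into one string, e.g. 'T_SKU1337_P'.
    Events outside the mapping become the placeholder token 'X'."""
    return "_".join(EVENT_CODES.get(e, "X") for e in events)

# Funnel: top page, then product page SKU=1337, then purchase,
# with any number of other events allowed in between.
FUNNEL = re.compile(r"(?:^|_)T(?:_[^_]+)*?_SKU1337(?:_[^_]+)*?_P(?:_|$)")

def funnel_counts(sessions):
    """Group sessions by encoded path (the GROUP BY part), then count
    how many sessions have a path matching the funnel regex."""
    by_path = Counter(encode_path(events) for events in sessions.values())
    matched = sum(n for path, n in by_path.items() if FUNNEL.search(path))
    return by_path, matched

sessions = {
    "s1": ["top_page", "product_page:1337", "purchase"],
    "s2": ["top_page", "search", "product_page:1337", "purchase"],
    "s3": ["top_page", "purchase"],
    "s4": ["top_page", "product_page:1337", "purchase"],
}
by_path, matched = funnel_counts(sessions)
```

Grouping first means the regex runs once per distinct path rather than once per session, which is where most of the saving would come from when many sessions share the same path.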
Why do you think this approach isn't scalable? I would love to hear your input on that. Also, which commercial database systems do you think would be good for this?
The suggested approach most likely requires a lot of recursive backtracking. Of course, there's an efficient way to implement this, and that's what most commercial databases' path analytics features do. Here's one example by Oracle: https://docs.oracle.com/database/121/DWHSG/pattern.htm
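As a rough illustration of matching a funnel without backtracking (this is not Oracle's implementation, just a sketch with hypothetical event names and helper functions), an ordered funnel with arbitrary events allowed between steps can be checked in a single pass over each session:

```python
def funnel_depth(events, steps):
    """Return how many funnel steps the session completed, in order.
    One left-to-right pass; each event is compared against the next
    step still outstanding, so there is nothing to backtrack over."""
    depth = 0
    for event in events:
        if depth < len(steps) and event == steps[depth]:
            depth += 1
    return depth

def funnel_summary(sessions, steps):
    """For each step, count the sessions that reached at least that step."""
    reached = [0] * len(steps)
    for events in sessions.values():
        for i in range(funnel_depth(events, steps)):
            reached[i] += 1
    return reached

steps = ["top_page", "product_page:1337", "purchase"]
sessions = {
    "s1": ["top_page", "product_page:1337", "purchase"],
    "s2": ["top_page", "search", "product_page:1337", "purchase"],
    "s3": ["top_page", "purchase"],
    "s4": ["top_page", "product_page:1337", "purchase"],
}
summary = funnel_summary(sessions, steps)
```

This only covers the simplest case. Constraints such as time windows between steps or "no other event in between" would need more state per session.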
I've always found it befuddling why so many developers want to use Solr/Elasticsearch for analytics heavy lifting. It's probably because:
1. SQL is not the most intuitive (although it is the most pervasive) API for data analysis
2. Much of the data is already in Solr/Elasticsearch to make it searchable, perform simple roll-ups and filtering, etc., so it'd be great if you could run more complex analytics against it as well
As to why Solr/Elasticsearch is not ideal: superior alternatives exist, namely OLAP databases.
Why do people use C* in addition to ES? It seems like in this case most of the data could directly be piped into ES?
I understand that ES can lose data, or have some data storage problems, but one could just as well store all the incoming data on Hadoop or something similar, without having to bother with C*, no?
C* makes it much easier to manage a Solr cluster as the data grows (specifically with DSE), since the tight integration gives you all the benefits of C* (HA, eventual consistency, multi-DC replication, etc.).