Outlier Detection in SQL
periscopedata.com
I've recently been doing a bunch of stuff with sports stats, which involves lots of GIS data. This sort of thing comes up a lot - trying to find a player's 'territory' based on coordinates of their actions in a game, without including outlying events that cause you to overestimate the area.
There's a concept in animal behaviour called a 'home range' which is more or less the same thing - GPS attached to tigers in the wild etc. Some of the algorithms there are quite interesting, from simply drawing a bounding box around the data points, to working out the probability density, to things like LoCoH, which sort of recursively build up convex hulls from nearest neighbours.
All of these things are pretty much possible in SQL to one degree of performance or another. But ultimately I'm fascinated by things like SQL Server's R support - you can get far simpler, more natural implementations of these things in R (or indeed in custom aggregates or functions in other languages). I think in the long term, database engines that offer this sort of extensibility are going to thrive for analytics work, be they SQL based or otherwise.
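The bounding-box flavour, for example, is trivial in plain SQL. A rough sketch, assuming a hypothetical events(player_id, x, y) table of action coordinates:

    -- crude 'territory': axis-aligned bounding box per player
    -- (events and its columns are made-up names for illustration)
    select
      player_id,
      min(x) as min_x, max(x) as max_x,
      min(y) as min_y, max(y) as max_y,
      (max(x) - min(x)) * (max(y) - min(y)) as bounding_box_area
    from events
    group by player_id;

In practice you'd want to trim the extreme few percent of x and y first, otherwise the outlying events blow the box up - which is exactly the overestimation problem the article is getting at.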
I hope it works out but they need to do a lot of work on it between these CTPs and RTM. They've made a critical mistake in the way they've marketed and documented it.
* They call every user a "data scientist". This is fun and assuredly deserved for R experts, but it's going to scare a hell of a lot of potential users away.
* The installation notes give you a lot of steps but zero insight into why you're doing them. Install these two packages; okay but what exactly are they doing? We need to know this kind of thing. R people might know...
* There's very little troubleshooting documentation, either on the installation/initial setup (and there are a LOT of bugs in getting it going; trust me, I spent most of last week digging through them) or on how, once it's operational, it's going to be tracked and managed in the context of everything else going on in the SQL instance - memory pools, wait states, resource governor pools, etc. We kind of need to know this stuff. Otherwise we see a server that gets reported as slow, we know it uses R, but we don't know where to look to determine whether R is the cause or what we can do about it.
* AFAIK it's single-threaded. Considering most places have super-downsized their CPUs and are going for massively threaded servers these days (which is arguable given Microsoft's 2012+ per-core licensing model; but I'm talking about servers in very big enterprises that have unlimited licensing agreements), that's not going to end well. It's very possible this will be fixed before RTM though.
* And finally, because they've bought Revolution Analytics and rebranded it as Microsoft R a few days ago, and it's now split into SQL and non-SQL products, there's going to be a lot of confusion and a long wait for appropriate training materials to catch up for all of us non-R users.
* I haven't been impressed with the current tutorial materials. As non-R users we're going to need a lot more to really understand how to use it. I'm sure there are fantastic R resources available, but if you're coming at it from a Microsoft SQL Server background it's a slightly different story...
Can you easily fit things like glm across a view or large table with the SQL Server R addon?
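I'm imagining something along these lines (just a sketch, untested; dbo.events and its columns are made up), with the query result handed to R as a data frame:

    -- fit a logistic regression over a table via SQL Server R Services
    exec sp_execute_external_script
      @language = N'R',
      @script = N'
        model <- glm(outcome ~ x1 + x2, data = InputDataSet, family = binomial)
        OutputDataSet <- data.frame(term = names(coef(model)), estimate = coef(model))
      ',
      @input_data_1 = N'select outcome, x1, x2 from dbo.events'
    with result sets ((term varchar(128), estimate float));

But I have no idea how that behaves once the table stops fitting in memory.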
Interesting - I just helped on a similar project that showed football (soccer) player zones of control using Voronoi diagrams for a CES demo: http://www.geekwire.com/2016/sensoria-microsoft-soccer-colla...
We used a custom PowerBI visualization (d3 + TypeScript). For dev purposes I just averaged the data with some simple SQL queries, but the real thing used (I believe) 10k datapoints for each player over the course of the match.
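Nothing fancy for the dev version - per-player averages along these lines (positions(player_id, x, y) is an approximation of the real schema):

    -- rough per-player centroid from the raw position samples
    select player_id, avg(x) as center_x, avg(y) as center_y
    from positions
    group by player_id;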
Is it just me or is Periscope allergic to saying what databases it supports? I can find nothing on the website, and I'm really hesitant to sign up for more info ever since companies started actually calling my phone after I did that.
Also, none of those queries will work on MySQL.
Periscope Data cofounder here. I wouldn't say we're "allergic" necessarily. ;) Like a lot of startups, we're very young and haven't gotten around to building out the website as much as we'd like to.
jplitz and ajones are right. As of now we support MySQL, Postgres, Amazon Redshift, Vertica, SQL Server, Oracle, MemSQL, Sybase, Exasol and Google BigQuery. We add more all the time.
The examples worked well in Hive also.
I've never used Periscope but have followed and read the blog for a while. Always great content.
Shouldn't be too much work to copy/paste that onto the website, should we expect to see it there soon? :)
I recognize the syntax in "date_trunc('day', created_at)::date as day,", namely the cast "::date", as PostgreSQL syntax. Could be something else as well but I've only seen it with PostgreSQL.
They support Postgres, Microsoft SQL Server, Redshift, and Oracle I believe.
I know that they also support MySQL. We have Periscope connected to a MySQL database.
That's useful info, I may sign up for a demo then.
They also support MySQL. At least they did when I tried it some time ago.
PostgreSQL or its derivatives.
The code would work on most engines that support CTEs, with some changes to the type casting syntax.
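For example, the shorthand ::date cast from the post would need to become a standard CAST elsewhere (using a made-up events table):

    -- Postgres / Redshift shorthand
    select date_trunc('day', created_at)::date as day from events;

    -- portable ANSI equivalent
    select cast(date_trunc('day', created_at) as date) as day from events;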
Thus most except MySQL.
The examples are either PostgreSQL or Amazon Redshift (which is based on the same syntax).
It's Redshift, based on the contents of previous blog posts.
Is it just me or is Periscope allergic to showing the "Pricing" page? (SaaS anti-pattern) Or is it free?
The blog content is good though
I won't blow it by sharing the pricing they quoted me, but I thought it was pretty reasonable.
That said, I do have one major gripe with their sales process: Their free trial lasts 7 days. There is no way a company can evaluate a completely new BI system in a week. I begged and pleaded and they would not extend the trial.
I'm not just saying this, but I really really do think we would have bought it if we could have had 30 days. I just can't build enough organizational momentum to spend money on a new system in 7 days.
Sorry about that! Just sent you an email.
You can get a free trial, but it's most definitely not free (on the contrary, the price is a lot higher than I would've expected).
It's not extremely expensive, but it's clearly aimed at enterprise-level companies and not at small businesses.
I would like to know about more sophisticated techniques for outlier detection. These are stat 101 level. Z scores? You can get into a lot of trouble assuming a normal distribution.
What are credit card companies doing? What's the best way to combine multiple variables that are predictive of an event for outlier detection? Is there a simple framework to automate reporting of these events in real-time?
One way to model this might be to treat the outcome ("Is this event an outlier?") as a 0/1 variable and use one of the many ways to model that type of data - random forests, logistic regression, neural networks, etc. The problem is that this isn't really "outlier" detection anymore.
Anomaly detection is very different across different domains. For CC fraud / risk you have discrete transactions, so the problem is one of classification, and it's generally approached with supervised learning.
I don't know what you mean by combining multiple variables. Do you mean analysis methods that work with multiple variables (instead of 1-dimensional z-scores), or methods that combine multiple variables into one to reduce input dimensions (e.g. principal component analysis)?
Because data and data reporting platforms are so different across companies, there's no 'simple framework' to do reporting. You probably want something like https://github.com/etsy/skyline.
You also describe an ensemble method for outlier detection, which is what Skyline uses. I want to note that there is no reason to consider ensembles "not" outlier detection.
I meant using multiple variables to categorize outlier events. What was shown here are also techniques for categorizing discrete events ("Does this day cross some threshold?"). I was guessing supervised learning methods would apply.
These are some cool SQL tricks! I like it.
The big caveat with the standard deviation technique is that it assumes a normal distribution. Many datasets are not actually distributed normally (power-law, Poisson, beta, etc, etc) and so the technique won't work. It's a much harder problem to 'generically' detect outliers without knowledge of the underlying distribution.
I don't have any idea how to do it (though a former colleague came up with nice idea of building a histogram and searching for values that occurred after some number of empty bins, implying an outlier). Is there an accepted state-of-the-art for general-purpose outlier detection? Or is that such a broad question as to be meaningless?
Other than the normal distribution assumption, there is another assumption - homoscedasticity - that doesn't hold true for most time series related to human activity, nature, or scheduling. If heteroscedasticity is present, you cannot use the same standard deviation for the entire series. A more practical approach is to compute the variance for each calendar period separately.
Here's an example - the expected variance for the number of SWIFT payments processed during non-banking hours is 0, so a transaction count greater than 0 is an outlier.
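In SQL that per-period baseline looks something like this (hourly_counts(ts, tx_count) is a made-up table, and the 3-sigma threshold is arbitrary):

    -- mean/stddev per (day of week, hour of day) bucket, then flag points
    -- that fall far outside their own bucket's baseline
    with baseline as (
      select
        extract(dow  from ts) as dow,
        extract(hour from ts) as hod,
        avg(tx_count)    as mean_count,
        stddev(tx_count) as sd_count
      from hourly_counts
      group by 1, 2
    )
    select h.ts, h.tx_count
    from hourly_counts h
    join baseline b
      on extract(dow  from h.ts) = b.dow
     and extract(hour from h.ts) = b.hod
    where h.tx_count > b.mean_count + 3 * b.sd_count
       or (b.sd_count = 0 and h.tx_count > b.mean_count);

The last condition is the SWIFT case: if a bucket's baseline is flat zero, any activity at all gets flagged.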
This is a very good point (to me as someone without any stats knowledge).
I think the article should begin by providing code that can validate that the values you're going to feed the technique actually fit a normal distribution well enough to make the outlier detection worthwhile. Is that possible?
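Maybe something as simple as checking sample skewness and excess kurtosis first? A sketch (measurements(value) is a made-up table; both numbers should sit near 0 for roughly normal data):

    -- quick sanity check before trusting stddev-based outlier cutoffs
    with stats as (
      select avg(value) as mu, stddev_pop(value) as sigma
      from measurements
    )
    select
      avg(power((value - mu) / sigma, 3))     as skewness,
      avg(power((value - mu) / sigma, 4)) - 3 as excess_kurtosis
    from measurements, stats;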
Your colleague is describing something similar to kernel density estimation, which would be your first port of call google-wise.
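A crude SQL version of that histogram-and-gaps idea might look like this (measurements(value), the value range and the gap threshold are all invented):

    -- bucket into a 100-bin histogram, then treat any value whose bucket
    -- sits past a run of 5+ empty bins as an outlier (upper tail only)
    with histogram as (
      select width_bucket(value, 0, 1000, 100) as bucket
      from measurements
      group by 1
    ),
    gaps as (
      select bucket,
             bucket - lag(bucket) over (order by bucket) as gap
      from histogram
    )
    select m.value
    from measurements m
    where width_bucket(m.value, 0, 1000, 100) >= (
      select min(bucket) from gaps where gap > 5
    );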
Standard deviations assume normal distributions... but ultimately, when you look for something two standard deviations from the mean, you're just looking for events outside the range that covers roughly 95% of the data. A technique that works regardless of distribution is percentiles. PostgreSQL supports these now as well.
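For example, in Postgres 9.4+ (events(value) is a made-up table and the 95th percentile is an arbitrary cutoff):

    -- distribution-free cutoff via an ordered-set aggregate
    with cutoff as (
      select percentile_cont(0.95) within group (order by value) as p95
      from events
    )
    select e.value
    from events e, cutoff c
    where e.value > c.p95;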
I wish they went into more detail within the context of what we're discussing. I've looked at the wiki pages and my eyes glazed over.
Trying to pick out "bad" stats from the gigabytes of information I gather from the servers I manage is always a top-of-the-list thing that I want to do.
But when we're talking about 95% - well, it's fine for showing on a graph, but how can you run alerting on something like that?
Maybe I'm just unfamiliar with some of the Postgres syntax.. but shouldn't there be a GROUP BY in the first query?
Thanks! Fixed.
More please.
Try using moving statistics, looking for outliers in more than one dimension, and so on.
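Window functions cover the moving-statistics part - e.g. a trailing 7-day mean/stddev with a 3-sigma flag (daily_counts(day, n) is a made-up table; the window length and threshold are arbitrary):

    -- rolling baseline: compare each day to the previous 7 days only
    with rolling as (
      select
        day,
        n,
        avg(n)    over w as moving_avg,
        stddev(n) over w as moving_sd
      from daily_counts
      window w as (order by day rows between 7 preceding and 1 preceding)
    )
    select day, n
    from rolling
    where moving_sd > 0
      and abs(n - moving_avg) > 3 * moving_sd;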