Stop using the elbow criterion for k-means

arxiv.org

79 points by Anon84 a month ago


whatshisface - a month ago

The elbow method has a simple theoretical justification. If the data actually has k well-separated clusters, the gains from splitting halves of clusters should be dramatically less than from splitting distinct clusters. If that isn't true, the data doesn't have well-separated clusters anyway.
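A rough sketch of that argument on toy data. The three 1-D blobs and the hand-rolled Lloyd's loop with a deterministic start are my own assumptions for illustration, not anything from the paper:

```python
def kmeans_1d(xs, k, iters=100):
    """Plain Lloyd's algorithm on 1-D data; returns the within-cluster sum of squares."""
    xs = sorted(xs)
    n = len(xs)
    # Deterministic init (an assumption): spread the starting centers over the sorted data.
    centers = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            groups[nearest].append(x)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min((x - c) ** 2 for c in centers) for x in xs)


# Three well-separated blobs around 0, 10 and 20.
offsets = [-0.4, -0.2, 0.0, 0.2, 0.4]
data = [c + o for c in (0.0, 10.0, 20.0) for o in offsets]

wcss = {k: kmeans_1d(data, k) for k in range(1, 6)}
# gains[k] is how much the objective drops when going from k-1 to k clusters.
gains = {k: wcss[k - 1] - wcss[k] for k in range(2, 6)}
for k, gain in gains.items():
    print(k, round(gain, 3))
```

On data like this the gain up to k=3 comes from separating distinct blobs, and the gain past it comes from cutting a blob in half, which is far smaller. Whether real data ever looks this clean is a separate question.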

joshdavham - a month ago

Since when did researchers decide to start titling their papers like clickbait YouTube videos?

The “Stop doing [conventional thing]!” title formula is, for whatever reason, the title that I always find the most annoying.

djoldman - a month ago

> ... much better alternatives such as the variance-ratio criterion (VRC) of Calinski and Harabasz [6], the Bayesian Information Criterion (BIC), or the Gap statistics should always be preferred instead.

That's fair.

And: clustering algorithms are unsupervised by definition, therefore there is no correct answer.

In my experience, the use case almost always controls which algorithm (k-means, DBSCAN, etc.) will be best as well as the parameters chosen, e.g. k if that is available.
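For reference, the VRC quoted above is cheap to compute by hand: between-cluster dispersion over (k - 1), divided by within-cluster dispersion over (n - k). A minimal 1-D sketch, where the function name and toy data are mine, not the paper's:

```python
def variance_ratio(xs, labels):
    """Calinski-Harabasz variance-ratio criterion for 1-D data; higher is better."""
    clusters = {}
    for x, lab in zip(xs, labels):
        clusters.setdefault(lab, []).append(x)
    n, k = len(xs), len(clusters)
    if k < 2 or k >= n:
        # The (k - 1) and (n - k) denominators make the index undefined here.
        raise ValueError("VRC needs 2 <= k < n")
    grand_mean = sum(xs) / n
    between = within = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - grand_mean) ** 2
        within += sum((x - mean) ** 2 for x in members)
    # Note: this divides by zero if every cluster has zero spread.
    return (between / (k - 1)) / (within / (n - k))


xs = [0.0, 0.2, 0.4, 10.0, 10.2, 10.4]
vrc_good = variance_ratio(xs, [0, 0, 0, 1, 1, 1])  # labels follow the two blobs
vrc_bad = variance_ratio(xs, [0, 1, 0, 1, 0, 1])   # labels ignore the blobs
print(vrc_good, vrc_bad)
```

Like the silhouette, it cannot score k=1, so it only ranks partitions with at least two clusters.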

tetris11 - a month ago

I always used the silhouette method (mostly because it generated cool art), and I'm happy to see it works well for large K, though it fails at K=1.
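The K=1 failure falls straight out of the definition: each point's score compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), and with one cluster there is no b. A hand-rolled 1-D sketch, with a function name and toy data that are my own:

```python
def silhouette_1d(xs, labels):
    """Mean silhouette score for 1-D data using absolute distance."""
    clusters = {}
    for x, lab in zip(xs, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        # No "nearest other cluster" exists, so b is undefined.
        raise ValueError("silhouette needs at least 2 clusters")
    scores = []
    for x, lab in zip(xs, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(x - y) for y in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


xs = [0.0, 0.2, 0.4, 10.0, 10.2, 10.4]
score_k2 = silhouette_1d(xs, [0, 0, 0, 1, 1, 1])
try:
    silhouette_1d(xs, [0] * len(xs))
    k1_defined = True
except ValueError:
    k1_defined = False
print(score_k2, k1_defined)
```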

pbronez - a month ago

The paper points out that k-means is a special case of Gaussian mixture modeling that makes lots of assumptions about the data. As always, understanding the statistical assumptions behind your model is absolutely essential.
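The textbook version of that connection, as I understand it: take a mixture with equal weights and one shared spherical variance, then shrink the variance, and the soft responsibilities collapse into k-means' nearest-center assignment. A small sketch under exactly those assumptions (the function and the numbers are made up for illustration):

```python
import math


def responsibilities(x, centers, sigma):
    """Soft assignments for equal-weight 1-D Gaussians sharing one variance sigma**2."""
    sq = [(x - c) ** 2 for c in centers]
    lowest = min(sq)
    # Subtract the smallest squared distance before exponentiating, for numerical stability.
    weights = [math.exp(-(d - lowest) / (2 * sigma ** 2)) for d in sq]
    total = sum(weights)
    return [w / total for w in weights]


centers = [0.0, 3.0]
soft = responsibilities(1.0, centers, sigma=10.0)  # wide components: nearly a coin flip
hard = responsibilities(1.0, centers, sigma=0.1)   # narrow components: nearest center wins
print(soft, hard)
```

Equal weights, equal variance, and spherical shape are the assumptions k-means bakes in without telling you.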

egberts1 - a month ago

But, but BUT credit card fraud divisions make heavy use of these "hockey-sticks" in K-means.

esafak - a month ago

Clustering is a poorly defined task. What constitutes a cluster: separation by Euclidean distance? That's not scale invariant.

The representation and clustering algorithm should be optimized for the downstream task.
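A quick illustration of the scale problem, with made-up (height, weight) records: changing the height unit from metres to centimetres changes which point is the nearest neighbour under Euclidean distance, and so can change the clusters.

```python
import math


def nearest(query, candidates):
    """Index of the candidate closest to query under Euclidean distance."""
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))


# Hypothetical (height in metres, weight in kg) records.
people_m = [(1.60, 60.0), (1.90, 61.0), (1.61, 65.0)]
# The same records with height expressed in centimetres.
people_cm = [(h * 100, w) for h, w in people_m]

# Who is closest to the first person, in each unit system?
nearest_m = nearest(people_m[0], people_m[1:])
nearest_cm = nearest(people_cm[0], people_cm[1:])
print(nearest_m, nearest_cm)
```

In metres the weight axis dominates and the much taller person looks closest; in centimetres the height axis dominates and the answer flips. Nothing about the data changed, only the representation.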

Der_Einzige - a month ago

Stop using k-means. HDBSCAN in its GPU implementation in cuML is probably superior, yes, even on your "very large" datasets (A100s can be rented for less than $1 an hour).

rgavuliak - a month ago

The elbow method never worked well in practice. In over a decade of experience, I have yet to see a clear elbow in the chart.

bbstats - a month ago

Also, stop using k-means

jb1991 - a month ago

This seems a bit like a knee-jerk reaction.