Scaling product analytics built on ClickHouse
(posthog.com)

ClickHouse is awesome, but most of the benefits come from columnar storage, and you need to design around that. Be aware of how the thing works and how computer architecture works, because sympathy with the machine is what reaps rewards.
You want to minimize the number and size of columns touched when filtering and aggregating. If you need the source data, store it relationally or in a document store and only select the key from CH. Don't put JSON in CH; fat columns don't make sense. And CH can be just as slow as MySQL if you select a whole wide row but only apply predicates to a handful of columns. Only touch the columns you need.
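A rough sketch of the difference, with hypothetical table and column names:

    -- Fast: only two narrow columns are read and decompressed
    SELECT event_id
    FROM events
    WHERE team_id = 42 AND event_date >= '2022-01-01';

    -- Slow: every column in the wide row is read, even though the
    -- predicate only touches two of them
    SELECT *
    FROM events
    WHERE team_id = 42 AND event_date >= '2022-01-01';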
Joins are super expensive because it costs a whole lot of instructions to look up a row on a per-value basis. CH can use vectorized operations to eliminate or aggregate multiple "rows" with instruction-level parallelism because the column data is contiguous. Joins are going to be an order of magnitude slower, with memory latency from randomly hopping around a hash table. Insert data prejoined; use the low cardinality string column, substitute (i.e. precalculate) conditions on low cardinality relations with integer IN tests, denormalize high cardinality relations.
Partitions and various other storage-level tricks are a way to eke out better perf for mutations when they're needed. Rebuild a subset of the data and swap it in and out. This is common on Hadoop-based columnar stores like Parquet, and last I looked at CH it was getting better ways to shuffle partitions around, "attach partition from" and so on.
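Roughly, the swap-in pattern looks like this (table and partition values are made up; the exact partition commands depend on your CH version):

    -- Staging table with the same schema and partitioning as the live table
    CREATE TABLE events_rebuilt AS events;

    -- Rebuild one month of data into the staging table
    INSERT INTO events_rebuilt
    SELECT * FROM events WHERE toYYYYMM(event_date) = 202201;

    -- Swap the rebuilt partition into the live table
    ALTER TABLE events REPLACE PARTITION 202201 FROM events_rebuilt;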
In practice, product requirements can get in the way of technically ideal solutions. One example of this is that analytics products let users pass in an arbitrary number of user properties to do analysis on - more than even a columnar database can handle. The current solution of storing JSON as a column does have a very significant performance trade-off, but it's also needed to power the queries users need to run. This is also why we're really excited about the new Object data type that landed in 22.3, as it handles these cases gracefully by creating dynamic subcolumns.
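For illustration (the type was experimental when it landed in 22.3, hence the setting; table and column names are made up):

    SET allow_experimental_object_type = 1;

    CREATE TABLE events_json
    (
        timestamp DateTime,
        properties JSON  -- a.k.a. Object('json')
    )
    ENGINE = MergeTree
    ORDER BY timestamp;

    -- ingested keys become dynamic subcolumns you can query directly
    SELECT properties.browser, count()
    FROM events_json
    GROUP BY properties.browser;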
On JOINs - again, requirements bite us in different ways. Product analytics data ingestion pipelines can get quite complicated due to needing to handle merging anonymous and signed-in users, and user properties changing over time. Handling that via JOINs as a go-to-market strategy helps avoid that upfront cost by centralising the logic in SQL, but it does come with a significant cost in scalability. Delaying that work in turn lets you focus on building the tools users need. That said, every loan needs to be repaid at some point, and we're currently knee-deep in re-architecting everything to avoid these joins.
Also note that, in our experience, JOINs don't work the way you described - rather, the right-hand side of the join gets loaded into memory. The bottleneck there is memory pressure rather than I/O, given a good ORDER BY on the table.
All that said, what a great summary of all the different things to keep an eye on. Thanks for reading and sharing your thoughts!
I talked about memory latency with respect to joins - that the random access into the hash table is much slower per "row" than vectorized operations on columns. I didn't say I/O would be slow. The fact that I said "hash table" implies that one side is fully loaded into RAM.
Perhaps your experience is different from mine but JSON string columns are very common in ClickHouse. A common schema idiom is to put the JSON in a string and then selectively materialize commonly queried properties into regular table columns. You can index the column with a bloom filter index to do needle-in-a-haystack searches on the entire row's data. Bloom filters are tricky to parameterize correctly but I have seen this approach used to good effect in very large applications.
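As a sketch of that idiom (the table, columns and index parameters are placeholders, not a recommendation):

    CREATE TABLE events
    (
        event_id UUID,
        event_date Date,
        properties String,  -- the raw JSON blob
        -- frequently queried properties materialized into real columns
        browser String MATERIALIZED JSONExtractString(properties, 'browser'),
        -- token bloom filter over the raw JSON for needle-in-a-haystack searches;
        -- the (size, hash count, seed) parameters need tuning for your data
        INDEX properties_tokens properties TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, event_id);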
Yeah, I think my exploration of the design space is different to other people who are using CH to e.g. store event data.
If you can eliminate most "rows" with criteria on materialized columns, then the occasional dip into JSON shouldn't be too bad.
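Something along these lines, assuming a materialized browser column extracted from a properties JSON string:

    -- the cheap materialized column does the filtering; the JSON is only
    -- parsed for the rows that survive
    SELECT JSONExtractString(properties, 'utm_campaign') AS campaign, count()
    FROM events
    WHERE event_date >= '2022-01-01' AND browser = 'Firefox'
    GROUP BY campaign;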
> use the low cardinality string column, substitute (i.e. precalculate) conditions on low cardinality relations with integer IN tests, denormalize high cardinality relations.
Could you please elaborate on this one?
String columns are expensive to compare, certainly more expensive than integers.
ClickHouse has a low cardinality option when defining a column and it's particularly helpful for strings, especially those which are logically enums or a denormalized lookup from a small source table.
https://clickhouse.com/docs/en/sql-reference/data-types/lowc...
In essence it stores integers instead of strings (or whatever) so it can use a fast int comparison when you're using equality conditions on the column.
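A minimal sketch (names are made up):

    CREATE TABLE events_lc
    (
        event_date Date,
        -- stored as a dictionary of distinct values plus small integer codes
        country LowCardinality(String),
        plan LowCardinality(String)
    )
    ENGINE = MergeTree
    ORDER BY event_date;

    -- the equality test runs against the integer codes, not the strings
    SELECT count() FROM events_lc WHERE country = 'DE';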
If you want fast scanning you don't want joins. So you want to denormalize attributes that you'd normally reach for across a join. But if you can't denormalize those attributes - e.g. they change (so they'd be prohibitively expensive to update in ClickHouse) - then you can store the primary key in ClickHouse, calculate your conditions separately, then use an IN check.
Let me give an example. Suppose we have a Label table in a relational database and an Event table in ClickHouse, and we want to find events which have a particular label name (or owner, or possibly some other attribute even more joins away).
Instead of using foreign tables in ClickHouse and joining Event to Label, or importing the Label table, you can evaluate your condition in your relational database to get a list of label ids, and then use an IN (label_id, ...) in the ClickHouse query. That way you avoid the join.
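Sketched out (table and column names and the ids are illustrative):

    -- in the relational database: resolve the condition to a list of ids
    SELECT id FROM label WHERE name = 'checkout';
    -- => say 17, 23, 101

    -- in ClickHouse: no join, just a filter on the stored label_id column
    SELECT count()
    FROM event
    WHERE event_date >= '2022-01-01'
      AND label_id IN (17, 23, 101);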
We store billions of JSON documents in CH and it performs beautifully.
Not sure where you’re getting your experience from.
If you look up a single blob by id, you're going to have a bad time.
My experience comes from using CH as a secondary store for relational data, where it could be filtered, sorted and aggregated much faster. The downside was that pulling out all the columns resulted in poor performance - like I said, it was just as slow as a relational store when doing a SELECT *.
This is impossible. I suggest reading through CH's JSON functions and seeing how they combine with materialized views.
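For example, a materialized view can flatten the hot JSON fields at insert time (a sketch with made-up names):

    CREATE TABLE events_raw
    (
        timestamp DateTime,
        payload String  -- raw JSON
    )
    ENGINE = MergeTree
    ORDER BY timestamp;

    CREATE TABLE events_flat
    (
        timestamp DateTime,
        user_id UInt64,
        browser LowCardinality(String)
    )
    ENGINE = MergeTree
    ORDER BY (browser, timestamp);

    -- extracts the fields once, on insert into events_raw
    CREATE MATERIALIZED VIEW events_flat_mv TO events_flat AS
    SELECT
        timestamp,
        JSONExtractUInt(payload, 'user_id') AS user_id,
        JSONExtractString(payload, 'browser') AS browser
    FROM events_raw;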
I like Posthog (and ClickHouse).
If the author of this post is reading, just a recommendation: when you write posts talking about performance increases, the reader kind of expects to see some type of before/after graph that pictorially shows the improvements.
Not this reader, FWIW. The headings "10x faster", "speeding up by 25x", and "23% win" painted a clear picture for me.