Ten years of ClickHouse in open source

329 points by saisrirampur 18 days ago · 109 comments

Reader

drchaim 15 days ago

I discovered ClickHouse around 2017-18 and built a PoC to replace Elasticsearch: 5x better storage and qps, in a couple of weeks.

Managers rejected it because it wasn't well known and was seen as "some database made by Russians."

On a personal level, it's quite sad to have seen that train coming so early and not been able to get on board.

oooyay 14 days ago

I had the same experience recently. Turns out ClickHouse would reduce our DB operations by 60%, remove the need for a TSDB, and reduce query times from ~300-500ms (and sometimes ~3s) to roughly ~75ms. Lastly, and most impressively we were already seeing a ridiculous level of compression and our storage cost benchmarks were reduced to the cost of S3. This took a $2-3M storage layer down to one measured in the single thousands per month.
ClickHouse is no panacea but if you understand how your data is accessed and thus how to arrange it you will get so many miles out of it.
ashu1461 14 days ago

Same we are also stuck with ES wish could migrate to clickhouse but not able to do so because of the legacy load.
- cloudie78 14 days ago
  
  What do you not like about ES?
arunmu 14 days ago

Were you using it for simple grep search or actually required advanced searching for eg: BM25. Clickhouse will only help you with grep like search from what I understand.
- drchaim 14 days ago
  
  Actually, there was no search, only on-the-fly aggregations/filtering over "big data". ES was kind of famous at the time, although not the best tool for that job.
  afaik CH introduced FTS rececently.
fsuts 14 days ago

Can clickhouse to search? If not why did you seek to replace elastic with it
- sdairs 14 days ago
  
  Yes https://clickhouse.com/blog/clickhouse-full-text-search-obje...
  - fsuts 14 days ago
    
    Thanks
cynicalsecurity 14 days ago

Supply chain risk.

himata4113 15 days ago

ClickHouse recently has been a breath of fresh air compared to using timescaledb for a long time. Although psql is the greatest there is and I really enjoyed the fact that I could rely on a single database system to run everything, when it came to migration maintenance and deployment it's really a pain and it also feels like development on timescaledb is a bit wishy washy with all the structural changes from version to version and it really feels like an alpha product sometimes.

k_bx 14 days ago

I was using TimescaleDB some very long time ago, things have changed quite a lot since (it's now even named differently).
In my current setup I was thinking on doing both: upgrading postgresql to timescaledb (to archive old data etc.), and to deploy ClickHouse in parallel. I'm still considering whether to go big on PeerDB to get ClickHouse mirror or just deploy it separately without additional fragility layer.
Would you not recommend using timescaledb at all? I definitely want to avoid alpha-quality software pain, since PostgreSQL is one of the most rock-solid parts of the stack at the moment.
- wkrp 14 days ago
  
  In my (minor) experience Timescale works fine. The developer experience is good and it is very convenient to be able to JOIN against your hypertables. My only real complaints are operational (no logical replication, normal postgres update complaints), but man Clickhouse is really slick. I wrote some small reviews of the two in my submission history if you want a bit more detail.
- saisrirampurOP 14 days ago
  
  We at PeerDB/CickPipes tried to make that mirroring as frictionless as possible. It is validated at scale across 1000s of customer moving over half a PB of a data per month from Postgres to ClickHouse. You should give it a shot and you might be surprised.
  Side note: May be there is way even more native than CDC/logical decoding that you never have to worry about keeping PG and CH in sync. Stay tuned for some updates from us on that end!
  - Sai from ClickHouse
- himata4113 14 days ago
  
  I would just run both and decomission the old one when a) all data is migrated, b) old data is no longer relevant and can be archived
- __s 14 days ago
  
  Worked on peerdb. If you're able to batch changes on your end & push to both postgres & clickhouse, do that. Only move to peerdb when you know you need cdc
  - fsuts 14 days ago
    
    Just looked up PeerDB expecting a Db as per its name.
    But it’s a ETl tool. Stupid naming
    
    saisrirampurOP 14 days ago
    
    I know I know. Some people have loved it as it captures what it does (peering dbs) and some haven't because of the exact reason you called out. So we get it! :)
philippemnoel 14 days ago

There is significant work happening in the PostgreSQL ecosystem to make "use a single system to run everything" possible. ParadeDB is one such system pushing on full-text/vector search, with some light aggregations, at the index level. There is also work being done by DuckDB folks with pg_duckdb and others like Xata. (disclaimer: I work at ParadeDB)
- cpursley 14 days ago
  
  What’s the closest extension or tooling that provides functionality similar to ClickHouse? I’m looking for something like that before making the full jump and adding another dependency.
  - saisrirampurOP 13 days ago
    
    You should check this message that talks about everything we are doing to make adoption of ClickHouse with Postgres easy. It captures all the tools we’ve released and are actively working on.
    https://news.ycombinator.com/item?id=48599644

ksajadi 14 days ago

For our metrics and autoscaling engine at Cloud 66, we went through 5 iterations before settling on Clickhouse: 1. Redis 2. Cassandra 3. Handrolled: Ruby + RabbitMQ 4. Handrolled: Go + RabbitMQ 5. Clickhouse

Every time we reached some limit or huge optimization burdens that were unfeasible. Clickhouse has been rock solid for the past 4 years.

murkt 14 days ago

I am having a hard time picturing what could be the problem that you were solving.
Redis, Cassandra, RabbitMQ and Clickhouse. RabbitMQ looks like a black sheep in this lineup.
- tills13 14 days ago
  
  Also like redis and clickhouse are diametrically different platforms. Wild to go from one to the other.
- cpursley 14 days ago
  
  I’m gonna guess that it’s because Ruby is dog slow and not concurrent.

jaysh 15 days ago

ClickHouse replacing Loki finally made our observability stack feel 'right'. It really is a powerhouse for logs and general analytical queries.

oulipo2 14 days ago

How do you use it for visualization? Do you use ClickStack? or something else?
- jaysh 14 days ago
  
  Still via Grafana. I ran it side-by-side with Loki and despite trying to optimise Loki and using ClickHouse out of the box - it really was shocking how much faster ClickHouse was for every single query (e.g. in the last 12 hours give my the frequency of logs with a particular JSON event or even "find this log entry, then join back and find the number of times a different entry appears within the same correlation_id)
  - CubsFan1060 14 days ago
    
    What does the layout in click house look like? Do the input logs need to have a very defined structure?
    
    jaysh 14 days ago
    
    Not really, ClickHouse is super forgiving so you can do something like:
    CREATE TABLE default.events ( `timestamp` DateTime `event` String -- e.g. 'product.updated' or empty/null `message` -- human readable message `raw` -- the raw message - this is very useful when pushing logs that aren't JSON - you just let the `event` be null and dump the entire message here ) ENGINE = MergeTree PARTITION BY toDate(timestamp) ORDER BY (timestamp, event) TTL timestamp + toIntervalMonth(6)
    ClickHouse is extremely performant even in the cases of e.g.: SELECT count(*) FROM `events` WHERE `raw` LIKE '%hello world%'
    Of course, the more columns you splat out (e.g. like correlation_id, user_id, order_id, etc) the better you can index and expect those queries to perform but in general I don't bother outside the obvious core domain ones (exampled above), the performance is so good that unindexed queries are significantly faster than indexed queries in Loki. I have reached the point where I JSON extract on-the-fly for the WHERE clause with very large queries with no meaningful performance issues.
  - oulipo2 14 days ago
    
    Interesting, so you can bind a Clickhouse table as an extension to Grafana? Would you make a little Gist / post about it to show?
    
    jaysh 14 days ago
    
    You only need the plugin: https://clickhouse.com/docs/observability/grafana - then you get basically everything natively.
- malkia 14 days ago
  
  There is HyperDX - search is not fastest, but it could be something that we do too - haven't checked deeply if high-cardinality is big issue with ClickHouse, but seeing some high cardinality data with what we post.
- aleks_me2 14 days ago
  
  I have used SigNoz https://signoz.io/ for that
  - jauntywundrkind 14 days ago
    
    Worth noting both hyperdx and maple too for other observability on clickhouse options. https://www.hyperdx.io/ https://maple.dev/
- byoj 14 days ago
  
  We recently moved to openobserve for due to cost, but visualisations are good enough too.
- usrme 14 days ago
  
  Same question here!
  - jaysh 14 days ago
    
    Just replied to that question! Let me know if you have other questions.
tills13 14 days ago

We are fully embracing LGTM ourselves but this is really interesting. Loki for us has been great, though, so what is better about CH other than maybe sql being more expressive than LogQL?
- jaysh 14 days ago
  
  Off the top of my head:
  - substantially better performance on the same hardware, even moreso for larger range queries (multiple days)
  - no new query language to learn
  - significantly more expressive as you said
  - agents for scraping logs use way less CPU (I used to use grafana-agent which used about 80%, vector uses sub 5%)
  - very intuitive to manage TTLs - I can keep some logs for 10 years, and some for 1 week based on the event in the JSON
  - more compact storage, I didn't check scientifically but CH storage is better compressed at least 4-5x for us
  - no running into maximum stream limits - struggled with these even on Grafana Cloud and didn't realise we silently lost a lot of logs
  Honestly: why wouldn't you. Loki always felt like a mistake to me. A brand new query language, really counter-intuitive configuration, large ramp-up time for complex queries, lots of arguing about labels/cardinality etc. It all goes away when you drop it. I think logging should not be exotic or behave in unexpected ways.

lazyasciiart 14 days ago

> You can open a pull request as an experiment, without aiming for it to be merged - it will be tested with the same level of scrutiny as production releases. Found a new memory allocator, a new compression library, a new hash table, a data format, or a sorting algorithm? - bring it to ClickHouse, and it will expose it inside-out

Wow

benjamkovi 14 days ago

ClickHouse dev here, but this is true. ClickHouse contributed finding several bugs on our third-party libs (jemalloc, librdkafka for 100%, there much more, but I only worked on these), in linux kernel and basically everywhere. We have very rigorous fuzzers (yes, multiple fuzzers on multiple levels), running tests in insane number of configurations. I think the last number I heard a year ago is around 400 hours for a complete CI run for a single commit (not PR, but commit). So yeah, pretty insane, in the good way.
- dmurray 13 days ago
  
  Does it ever get abused?
  "This PR introduces the ability for Clickhouse to mine Bitcoin...":
  - zX41ZdbW 13 days ago
    
    The CI only runs after one of the maintainers adds a "can be tested" label. When the label is already added, it stops when any of the infrastructure-related files are modified, like Dockerfile, CI configurations, etc. This is quite ok, but not 100% bulletproof, as you can easily do weird things by modifying the code or using a bug in the compiler. However, the CI infrastructure runs on isolated machines inside an isolated account.
    Before adding the label, we have to check the diff for suspicious things. There were a few abuse attempts (all of them were from now vanished GitHub accounts).
    
    zX41ZdbW 13 days ago
    
    Found one example that survived, and it was legitimate: https://github.com/ClickHouse/ClickHouse/pull/99522/changes

adsharma 14 days ago

It's interesting that the blog post places SQLite and Ladybird on the spectrum, but omits it's chief open source rival: DuckDB.

Agree that Level 3 is what inspires confidence. But we need to invent new business models to sustain in the era of vibe-coded databases.

aaronblohowiak 14 days ago

while ClickHouse can scale down to compete with duckdb, I dont believe (but happy to be corrected) that duckdb can scale up like ClickHouse can.
most people dont have that scale problems, but when you do...
nasretdinov 14 days ago

I think the main advantage of ClickHouse over DuckDB is *MergeTree family. It lets you sort data in the background, which allows for absurd levels of compression and performance when done right. ClickHouse can easily be 10x as performant as DuckDB querying Parquet when querying non-indexed columns, and obviously infinitely faster than DuckDB when you're touching primary key.
There are so many comparisons between the two, but realistically ClickHouse and DuckDB occupy completely separate niches, where DuckDB is just a really powerful analytics _engine_, and ClickHouse is a full database management system, with replication, MergeTree engine, etc.

kjuulh 14 days ago

Love clickhouse it is just super performant. I've had to tweak a few queries for performance here and there but it has been more than stellar. I'd initially set up a real time pipeline ingest for us to handle our larger incremental ingests when i started (used redshift in the past, very expensive and quite slow in comparison), it has been unnecessary so far, as clickhouse has just been able to chew through tons of data and large transformations without breaking a sweat.

Only issue i had was that the default configuration had some quite heavy handed tracing turned on which tanked performance on the relatively little machine we had for it, it has since been scaled up and is the core of our data stack.

I can't recommend clickhouse enough. If you were truly large I'd probably choose something else, but as long as we're staying on a few nodes it is manageable complexity and we enjoy using it.

mscdex 13 days ago

What "heavy handed tracing" was turned on by default?
- kjuulh 13 days ago
  
  Cant quite remember, sadly, i followed a post about it. Might have been this one https://kb.altinity.com/altinity-kb-setup-and-maintenance/al....
  It drastically reduced our cpu usage and disk as well. At this point it was on a small t3 on aws
  - mrngm 12 days ago
    
    Probably trace_log (we had a similar experience), but you'll only notice with limited storage space, and a regular keen eye on how your database fills up.

tdiff 14 days ago

It is sad they are afraid to mention on the page that "data processing for a web analytics system ... similar to Google Analytics" was actually something used in Yandex.

corentin88 14 days ago

Elsewhere on the page, they avoid mentioning Yandex. In fact, do they ever mention Yandex?
That’s probably not to advertise for that company. I don’t see why it’s sad?
- civet_java 14 days ago
  
  Or could it be because hecklers might call them "un-American" if they did list Yandex? The two aren't the same.

brunojppb 14 days ago

Clickhouse has been a game changer for some of the companies i have worked in the past. This reminds me of this podcast episode (1) from the Rust in Production pod about their Rust adoption.

1. https://open.spotify.com/episode/0TBKDUhO0KihBxEzZqnQx1

mial 14 days ago

On top of being a really good OLAP database, what has been a game changer for me is the built-in connectors for bringing in data from remote sources. It can manage automatic recurring import of a s3 folder containing parquet/json files and it can also connect directly to Postgres. For our data warehouse at a medium sized newspaper, we switched from Druid+postgres+trino to just one big clickhouse node and I’ve never looked back. Much more performant, practical, and a lot less maintenance.

ddorian43 15 days ago

Clickhouse is *really* gatekeeping the "zero copy replication" where you store data on object-storage and have high availability from the open source version.

nasretdinov 14 days ago

That feature is central for ClickHouse Cloud offering which essentially is what lets others have (the rest of) ClickHouse for free. If you need that feature I think it's quite obvious you'll have to pay. I don't think it's an unreasonable stance at all.
P.S. I don't believe you are right to call it "zero copy" either. The object storage itself is replicated and definitely does copy and/or split data a lot under the hood. In some sense it's no different from ClickHouse's replication, apart from that ClickHouse can't use erasure codes to reduce the data footprint, whereas many advanced object stores can
- rastignack 14 days ago
  
  The thing is you are going to be comoditized on this too. And it’s going to be faster to migrate off your database with AI. You are going to need to have a plan.
- ddorian43 14 days ago
  
  It is called "zero copy" in clickhouse land.
  That feature, is becoming central for an OLAP db, even a free/open-source one.
pepperoni_pizza 14 days ago

I think that is just the nature of the open core business - but like most such businesses, they're not very clear about how that is what they are, pretending to be open source business instead.
- fsuts 14 days ago
  
  Vc funded with recent rounds so 10 years hasn’t been enough time to make money
  https://en.wikipedia.org/wiki/ClickHouse
orian 14 days ago

This is the main driver for their cloud ;-)
nvartolomei 14 days ago

How? Have you tried contributing a reasonable implementation with test coverage and it was rejected?

orta 15 days ago

I've been using clickhouse for the last year for in-house analytics and found it a really pleasant experience, thanks for all the progress you've made

dandellion 15 days ago

Same. We replicated some data from Postgres, it was easy to set up, similar enough that the transition was trivial, and really good performance out of the box. One of those good "use the right tool for the job" experiences.

baq 15 days ago

clickhouse is the low key amazing tech people are busy using instead of posting about. keep it up!

rozenmd 14 days ago

I used to keep all of OnlineOrNot's timeseries data entirely in a hot postgres db with the rest of the relational data.

Used to take a few seconds to get a week's uptime data and do some useful analysis.

Since moving to Clickhouse I think I can grab a full year's data in around 200ms (probably less if I try optimising it). Still completely blows my mind everyday.

spprashant 14 days ago

If your data is too big for postgres, it seems like moving straight to Clickhouse is the best option. We have been through an whole array of distributed database technologies, and Clickhouse might be first one that doesn't have too many compromises.

fsuts 14 days ago

What do you mean?
Postrgesql is a relational and row based db, ClickHouse is columnar
Clickhouse doesn’t replace postgresql:
- saisrirampurOP 14 days ago
  
  Sai from ClickHouse here. Totally with you here, ClickHouse isn't a replacement for Postgres. Most use-cases are co-existence - Postgres for OLTP and ClickHouse for OLAP, basically right tool for the right job situation. Both are purpose-built technologies with a similar OSS ethos/story. Btw on an interesting co-incidence, Postgres turned 30 this year and ClickHouse turned 10.
  Above is exactly why we are embracing the Postgres + ClickHouse stack and are investing heavily to make workflows across both these DBs very easy for developers - PeerDB for native CDC, pg_clickhouse extension for querying CH from PG, pg_stat_ch for query PG observability from ClickHouse and more such are planned for future. And recently we also announced ClickHouse Managed Postgres which pacakages this entire stack as a fully managed service https://clickhouse.com/cloud/postgres
- spprashant 14 days ago
  
  This is a extremely common issue that happens in growing firms.
  You start off with everything in Postgres, it makes the most sense. Soon you realize some tables are growing really huge - usually some sort of time-series or log data reaching 10TB+. You can no longer fit it in one node. You can try you luck with some sharding extensions, but they add complexity to upgrades.
  In that case it makes total sense to move these large tables off Postgres, and I think Clickhouse is a straight up replacement here. You can still keep your relational heavy tables in Postgres.
  Yes it affects you ability to cleanly join data, and guarantee 100% consistency. With some smart application code, and schema design, you can replace parts of Postgres with Clickhouse for the big data problem.
  - ezekg 14 days ago
    
    I did this exact thing this year. I moved about >5 TB of event and log data into Clickhouse. I went from having to increase Postgres storage capacity every few months ($$$) to never thinking about it. Migrating to Clickhouse took me from constant timeouts on analytics queries to all queries returning in well under 100ms, and automated data retention using TTLs going from nightly pruning jobs that failed constantly (Postgres hates DELETEs on large tables) to again, never thinking about it. Because of this, I was able to increase retention per-account from a mere 30 days like I had -- just to keep performance and storage costs under control -- to much, much longer. Huge fan.
- gempir 14 days ago
  
  You can keep "columnar" data in a row based database like postgres, it's just more expensive. But with little data that's fine and reduces infrastructural complexity. When you reach too much data it gets to a point where you then actually want to use the correct database for your usecase.
- eklavya 14 days ago
  
  Not to mention ACID and CAP and all that. I use clickhouse AND postgres. Clickhouse is not a replacement for postgres at all.

dmix 14 days ago

We use Clickhouse in a rails app for our customer facing dashboard analytics, logging, and datalake type stuff where Postgres is too heavy and expensive. The web admin panel they built is great and we’ve had solid performance.

tarun_anand 14 days ago

How does CH compare with the recent announcements made by Databricks Reyden...

rubenvanwyk 13 days ago

Clickhouse will blow it out of the water any day.
sakesun 14 days ago

One actually exists. The other doesn't

bdavid21wnec 14 days ago

CHDB deserves a lot of credit too, wish they would unmark the rust version as experimental. The python version works very well!

computersuck 14 days ago

Has anyone tried using ClickHouse to build a SIEM or security related tool?

__s 14 days ago

Yes
https://clickhouse.com/resources/engineering/siem
https://clickhouse.com/blog/how-huntress-improved-performanc...

zuzululu 14 days ago

what are you guys using it for other than collecting analytics?

throwaway012377 15 days ago

and yet it's too difficult to package https://repology.org/project/clickhouse/versions

gkbrk 13 days ago

There's no need to package it because ClickHouse is just a single binary.

haeseong 14 days ago

The query speed deserves the praise, but the JSON ingestion path has quiet footguns nobody mentions here. Every numeric column comes back as a string over JSONEachRow, so a forgotten Number() cast silently turns arithmetic into string concatenation, and with input_format_skip_unknown_fields enabled a single typo in a column name drops that field with no error at all. Worth wiring an assertion that inserts a row and reads it back into CI before trusting the dashboards.

charrondev 14 days ago

We’ve done our JSON ingestion by keeping a schema in the app for all the types we expect, and injecting the types into the query builder.
Then as needed we have materialized columns on our different tables.

Talpur1 14 days ago

10 Years! quite a long journey, specailly observeability part is need of hour

Settings

Ten years of ClickHouse in open source

Keyboard Shortcuts