Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse

78 points by super_ar a year ago · 33 comments · 3 min read

Reader

Hi HN! We are Ashish and Armend, founders of GlassFlow. We just launched our open-source streaming ETL that deduplicates and joins Kafka streams before ingesting them to ClickHouse https://github.com/glassflow/clickhouse-etl

Why we built this: Dedup with batch data is straightforward. You load the data into a temporary table. Then, find only the latest versions of the record through hashes or keys and keep them. After that, move the clean data into your main table. But have you tried this with streaming data? Users of our prev product were running real-time analytics pipelines from Kafka to ClickHouse and noticed that the analyses were wrong due to duplicates. The source systems produced duplicates as they ingested similar user data from CRMs, shop systems and click streams.

We wanted to solve this issue for them with the existing ClickHouse options, but ClickHouse ReplacingMergeTree has an uncontrollable background merging process. This means the new data is in the system, but you never know when they’ll finish the merging, and until then, your queries return incorrect results.

We looked into using FINAL but haven't been happy with the speed for real-time workloads.

We tried Flink, but there is too much overhead to manage Java Flink jobs, and a self-built solution would have put us in a position to set up and maintain state storage, possibly a very large one (number of unique keys), to keep track of whether we have already encountered a record. And if your dedupe service fails, you need to rehydrate that state before processing new records. That would have been too much maintenance for us.

We decided to solve it by building a new product and are excited to share it with you.

The key difference is that the streams are deduplicated before ingesting to ClickHouse. So, ClickHouse always has clean data and less load, eliminating the risk of wrong results. We want more people to benefit from it and decided to open-source it (Apache-2.0).

Main components:

- Streaming deduplication: You define the deduplication key and a time window (up to 7 days), and it handles the checks in real time to avoid duplicates before hitting ClickHouse. The state store is built in.

- Temporal Stream Joins: You can join two Kafka streams on the fly with a few config inputs. You set the join key, choose a time window (up to 7 days), and you're good.

- Built-in Kafka source connector: There is no need to build custom consumers or manage polling logic. Just point it at your Kafka cluster, and it auto-subscribes to the topics you define. Payloads are parsed as JSON by default, so you get structured data immediately. As underlying tech, we decided on NATS to make it lightweight and low-latency.

- ClickHouse sink: Data gets pushed into ClickHouse through a native connector optimized for performance. You can tweak batch sizes and flush intervals to match your throughput needs. It handles retries automatically, so you don't lose data on transient failures.

We'd love to hear your feedback and know if you solved it nicely with existing tools. Thanks for reading!

hodgesrm a year ago

How is this better than using ReplacingMergeTree in ClickHouse?

RMT dedups automatically albeit with a potential cost at read time and extra work to design schema for performance. The latter requires knowledge of the application to do correctly. You need to ensure that keys always land in the same partition or dedup becomes incredibly expensive for large tables. These are issues to be sure but have the advantage that the behavior is relatively easy to understand.

Edit: clarity

super_arOP a year ago

Good question! RMT does deduplication, but its dependency on background merges that you can't control can lead to incorrect results in queries until the merge is complete. We wanted something that cleans the duplicates in real time. GlassFlow moves deduplication upstream, before data hits ClickHouse. If you think of it from a pipeline perspective, we believe it is easier to understand, as it is a block before ClickHouse.
- hodgesrm a year ago
  
  RMT does not depending on background merges completing to give correct results as long as you use FINAL to force merge on read. The tradeoff is that performance suffers.
  I'm a fan of what you are trying to do but there are some hard tradeoffs in dedup solutions. It would be helpful if your site defined exactly what you mean by deduplication and what tradeoffs you have made to solve it. This includes addressing failures in clustered Kafka / ClickHouse, which is where it becomes very hard to ensure consistency.

caust1c a year ago

How does the deduplication itself work? The blog didn't have many details.

I'm curious because it's no small feat to do scalable deduplication in any system. You have to worry about network latencies if your deduplication mechanism is not on localhost, the partitioning/sharding of data in the source streams, and handling failures writing to the destination successfully, all of which cripples throughput.

I helped maintain the Segmentio deduplication pipeline so I tend to be somewhat skeptical of dedupe systems that are light on details.

https://www.glassflow.dev/blog/Part-5-How-GlassFlow-will-sol...

https://segment.com/blog/exactly-once-delivery/

ashishbagri a year ago

Thanks for your question. In GlassFlow, we use NATs Jetstream to power deduplication (and KV store for joins as well). I see from your blog post that segment used rocksDB to power their deduplication pipeline. We actually considered using rocksDB but used NATs JS because of added complexity in scaling with rocksDB (as rocksDB is embedded in the worker process). Their indeed is a small network latency in our deduplication pipeline but our end-end latency measured is under 50ms.
- caust1c a year ago
  
  Thanks for clarifying, best of luck!

oulipo a year ago

Seems interesting, but I'm not sure what duplication means in this context? Is Kafka sending several time the same row? and for what reasons?

Could you give practical examples where duplication happens?

My use-case is IoT with devices connecting on MQTT and sending batches of data, each time we ingest a batch we stream all corresponding rows in database, because we only ingest a batch once, I don't think there can really be duplicates, so I don't think I would be the target of your solution,

but I'm still curious at in which case such things happen, and why couldn't Kafka or Clickhouse dedup themselves using some primary key or something?

super_arOP a year ago

Thanks for asking those questions. Duplicates often come from how systems interact with Kafka, not from Kafka itself. For example, if a service retries sending a message after a timeout or if you collect similar data from multiple sources (like CRMs and web apps), you can end up with the same event multiple times. Kafka guarantees delivery at least once, so it doesn't remove duplicates.
ClickHouse doesn't enforce primary keys. It stores whatever you send. ReplacingMergeTree and FINAL are concepts on ClickHouse, but they are not optimal for real-time streams due to the background merging process that needs to be finished to ensure correct query results.
With GlassFlow, you clean the data streams before they hit ClickHouse, ensuring correct query results and less load for ClickHouse.
In your IoT case, a scenario I can imagine is batch replays (you might resend data already ingested). But if you're sure the data is clean and only sent once, you may not need this.
- oulipo a year ago
  
  Thanks interesting! In my case each batch has a unique "batch id", and it's ingested in Postgres/Timescale so it will dedup with the key

maxboone a year ago

Very cool stuff, good luck!

I didn't quickly find this in the documentation, but given that you're using the NATS Kafka Bridge, would it be a lot of work to configure streaming from NATS directly?

ashishbagri a year ago

Yes it would be easily possible to configure the tool to stream directly from NATs and skip Kafka completely. The reason we started with a managed Kafka connector (via the NATS Kafka Bridge) is because most of the early users sending data to clickhouse in real time had already a Kafka in place

the_arun a year ago

Congratulations!!

Questions:

1. Why only to ClickHouse, can’t we make it generic for any DB? Or is it reference implementation for ClickHouse?

2. Similarly, why only from Kafka?

3. Any default load testing done?

ashishbagri a year ago

Thanks for taking a look! 1. The current implementation is just for clickhouse as we started with the segment of users building real time analytics with clickhouse in their stack. However we already learned during the way that streaming deduplication is a challenge for other destination databases as well. The architecture of our tool is designed in a way that we can extend the sinks and add additional destinations. We would just have to write the sink component specific for that database. Do you have a specific DB in mind that you would like to use?
2. Again, we started with kafka because of our early target users. But the architecture inherently supports adding multiple sources. We already have experience in building multiple source and sink connectors (from our previous project) so adding additional sources would not be so challenging. which source do you have in mind?
3. Yes, running the tool locally on a macbook pro M2 docker, it was able to handle 15k requests per second. We have built a load testing infrastructure and happy to share the code if you are interested.
nine_k a year ago

AFAICT, there are native connector implementations for ClickHouse and Kafka, so it's plug and play with them specifically.
OTOH for deduplication you mostly need timestamps and a good hash (like SHA512), you don;t need to store the actual messages, so a naive approach should work with basically any even source; all you need is to look up the hash, compare the timestamps, and skip the message if the hashes match. But you need to write your own ingestion and output logic, maybe emulating whatever protocol you're using if you want the whole thing to be a drop-in node in your pipeline.
- ashishbagri a year ago
  
  Yes its true that if you just want to send data from Kafka to clickhouse and do not worry about duplicates, then there are several ways. we even covered them in a blog post -> https://www.glassflow.dev/blog/part-1-kafka-to-clickhouse-da...
  However, the reason for us to start building this was because duplication is a sad reality in streaming pipelines and the methods to clean up duplicates on clickhouse is not good enough (again covered extensively on our blog with references to cickhouse docs).
  The approach you mention about deduplication is 100% accurate. The goal in building this tool is to enable a drop-in node for your pipeline (just as you said) with optimised source and sink connectors for reliability and durability

saisrirampur a year ago

Neat project! Quick question, will this work only if the entire row is a duplicate? Or even if just a set of columns (ex: primary key) conflict and you guarantee only presence of the latest version of the conflict? I’m assuming former because you are deduping before data is ingested into ClickHouse. I could be missing something, wanted to confirm.

- Sai from ClickHouse

super_arOP a year ago

Thanks, Sai! Great question. The deduplication works based on the user-defined key, not the entire row. You can specify which field (e.g. a primary key like event_id) to use as the deduplication key. Within a defined time window, GlassFlow guarantees that only the first event with a given key will be forwarded to ClickHouse. Subsequent duplicates are rejected. Our idea was to keep ClickHouse as clean as possible.
- saisrirampur a year ago
  
  Got it. Thanks for the clarification. That might not work if the ingested row represents an UPDATE. We do this in Postgres CDC by replicating an UPDATE as a new version of the row and that is what you want to retain. For most customers using FINAL (with the correct ORDER KEY as needed) works well for deduplication and query performance is still great. But in cases where it isn't, customers typically resort to tuning faster merges with ReplacingMergeTree or Materialized Views (either aggregating or refreshable) to manage deduplication.
  Anyway, great work so far! I like how well you articulated the problem. Best wishes.

YZF a year ago

How do you avoid creating duplicate rows in ClickHouse?

- What happens when your insertion fails but some of the rows are actually still inserted?

- What happens when your de-duplication server crashes before the new offset into Kafka has been recorded but after the data was inserted into ClickHouse?

ashishbagri a year ago

- We used our custom clickhouse sink which inserts records in batches using clickhouse native protocol (as recommend by clickhouse). Each insert is done in a single transaction so if an insertion has failed, partial record do not get inserted on clickhouse. - The way the system is architected this cannot happen. If the deduplication server crashes, the entire pipeline is stopped and nothing is inserted. Currently when we read a data successfully from Kafka into our internal NATs JS, we acknowledge the new offset into Kafka. And the deduplication and insertion happens after. The limitation currently is that if our system crashes before inserting into clickhouse (but after ack to kafka) we would not process this data. We are already working towards finding a solution for this.
- YZF a year ago
  
  Right. I think this is fundamental though. You can minimize the chance of duplicates but not avoid them completely under failure given ClickHouse's guarantees. Also note transactions have certain limitations as well (re: partitions).
  I'm curious who your customers are. I work for a large tech company and we use Kafka and ClickHouse in our stack but we would generally build things in house.
zX41ZdbW a year ago

ClickHouse provides idempotent atomic inserts, so the insertion of the same batch can be safely repeated. The client can provide its own idempotency token or rely on the block hash.
- YZF a year ago
  
  Afaik this is always best effort, e.g.: https://clickhouse.com/docs/operations/settings/settings#ins...
  "For the replicated tables by default the only 100 of the most recent blocks for each partition are deduplicated"
  This doesn't work under failure conditions either (again afaik), e.g. if the clickhouse server fails.
  - zX41ZdbW a year ago
    
    100 is the default, and can be changed at runtime.
    The deduplication works regardless of server restarts, and it does not matter when a request goes to another replica, as it is implemented with a distributed consensus (RAFT) via clickhouse-keeper.
    
    YZF a year ago
    
    ah. interesting. So some hash of the batch is recorded in the distributed log after the batch has been written? to disk? Isn't there still a race there?
    At least intuitively this seems very hard to guarantee something more than "at least once" but I might be missing something.
    
    zX41ZdbW a year ago
    
    It is more complex.
    The batch is written to a temporary directory. Then we have to do an atomic commit into three different places: on disk, which can be external storage as well (rename the temporary directory), in memory (to the data structure containing the snapshot), and in Keeper (which contains metadata, including these hashes, and is the only source of truth).
    The metadata in Keeper is the only place that decides which data exists and when. The whole operation is done by committing the data files first, then committing to Keeper, then committing to memory, and then responding to the client.
    Now you have to analyze what happens, and what have to be resolved, if the server is killed between these steps.
    If it is killed after writing data files and before writing to Keeper, the data is not considered written, but we have garbage in the storage to collect.
    If it is killed after writing data files and after writing to Keeper, but before committing to memory or answering to the client, the data is written, but the client does not know that. This is the situation when the client has to retry, and the retry will be deduplicated.
    If the data is written to the storage, but the transaction to Keeper (the only source of truth) did not go through due to a network error after subsequent retries, the changes in the storage will be attempted to roll back straight ahead, and the client will receive an exception. If the client will not receive an exception due to another network error, the client will have to retry.
    It is quite easy, if you look at it as Keeper is the central place, and the only place that does distributed transactions, and everything else is around it.

ram_rar a year ago

What uses cases would this be effective compared to using replacing merge tree (RMT) in clickhouse that eventually (usually in a short period of time) can handle dups itself? We had issues with dups that we solved using RMT and query time filtering.

super_arOP a year ago

Great question! RMT can work well when eventual consistency is acceptable and real-time accuracy isn't critical. But in use cases where results need to be correct immediately (dashboards, alerts, monitoring, etc.), waiting on background merges doesn't work.
Here 2 more detailed examples:
Real-Time fraud detection in logistics: Let's say you are streaming events from multiple sources (payments, GPS devices, user actions) for a dashboard that should trigger alerts when anomalies happen. Now you have duplicates (retries, partial system failure, etc.). Relying on RMT means incorrect counts until merges happen. This situation can lead to missed fraud, later interventions, etc.
Event collection from multi-systems like CRM + E-commerce + Tracking: Similar user or transaction data can come from multiple systems (e.g., CRM, Shopify, internal event logs). The same action might appear in slightly different formats across streams, causing duplicates in Kafka. ClickHouse can store these, but it doesn't enforce primary keys, so you end up with misleading results until RMT resolves.

darkbatman a year ago

Are there any load test results available, we would like to use this at zenskar but at high scale really need it to work.

System merges and final are definitely unpredictable so nice project.

super_arOP a year ago

Great to hear that you are considering it for zenskar. We don't have a publicly available load test, but in internal checks it was able to handle 15k requests per second (locally on a MacBook Pro/M2 Docker). What is the load that you are expecting? Happy to connect.

brap a year ago

Just wanna say I dig the design. In-house or outsourced?

super_arOP a year ago

It is a combination of both. We have a fantastic product designer colleague who takes care of the product, and a few friends who designed the website. I will forward your message to them. I am sure you've made their day. Thank you! :)

Settings

Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse

Keyboard Shortcuts