Kafka can be so much more


This is my first time working at a company (Confluent) that was born out of open source (Kafka). I have known about Kafka for some time now but had never spent much time trying to understand its evolution and piece together why it went in certain directions. Being at Confluent is the perfect time to do that.

The core idea was simple: the fundamental unit of data should not be the state stored in a database, but the stream of events that led to that state.

Just like a bank account: the balance itself is less interesting than all the transactions that led to it being what it is now. Wait! What? Did I just say the bank balance is less interesting? Well, what I meant was that you can arrive at the balance by carefully tallying your deposits and withdrawals.
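That tallying is just a fold over the event stream. A minimal sketch in Python, with made-up transaction amounts:

```python
from functools import reduce

# Each event records a change, not the resulting state.
# Positive amounts are deposits, negative are withdrawals (illustrative values).
transactions = [+100, -30, +50, -20]

# The current balance is derived by replaying the events in order.
balance = reduce(lambda acc, amount: acc + amount, transactions, 0)
print(balance)  # → 100
```

The state (the balance) is disposable; as long as the event stream is retained, it can always be recomputed.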

But, I digress.

Anyway, the implementation of this core idea was a distributed, immutable commit log that decouples the producers of the data in the log from its consumers.
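To make the decoupling concrete, here is a toy in-memory append-only log, not Kafka's actual implementation: producers append records, and each consumer reads from its own offset at its own pace, without blocking the producer or each other.

```python
class Log:
    """A toy append-only log. Records are never modified once written."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Producer side: append a record, return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Consumer side: read the record at an offset, or None if not yet written."""
        return self._records[offset] if 0 <= offset < len(self._records) else None


log = Log()
log.append("user_signed_up")
log.append("user_clicked_buy")

# Two independent consumers track their own positions in the log.
analytics_offset = 0
billing_offset = 1

print(log.read(analytics_offset))  # → user_signed_up
print(log.read(billing_offset))    # → user_clicked_buy
```

A producer knows nothing about who reads the log, and a slow consumer simply lags behind at a smaller offset; that independence is the decoupling the paragraph above describes.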

The simplicity and wide applicability of these concepts meant that Kafka found immediate use in many technical scenarios, several of which were likely not even originally considered by its creators (I have not verified this as a fact with my CEO yet). But that is the beauty of building horizontal infrastructure technologies: your users invariably surprise you. Log aggregation for SIEM, fraud detection in transaction data, metrics monitoring, and real-time operational monitoring all became popular use cases, in addition to the original website activity tracking use case.

However, Kafka still suffers from a problem that is hard to explain, especially for a horizontal infrastructure technology. Even the most ardent fans of the technology tend to pigeonhole it into a narrow zone of utility, rather than recognize its value as something deeper and more central.

I use Kafka as a message queue

Great!

I use Kafka to build out my data lake

Awesome!

I use Kafka to build event-driven micro-services

Nicely done!

I use Kafka as a data integration layer

Cool!

Yes, yes, yes, and yes. Why not all of the above and much more?

If Kafka were a person, they would be like that really successful drama actor that bemoans the lack of good comedy roles because they have been typecast as a serious actor.

Kafka is hard. Yes. Most good infrastructure technologies are. But once you have paid the initial setup and learning cost, it can be enormously powerful. Which is why it is all the more perplexing to see customers use Kafka for a narrow set of scenarios, not realizing its full potential. I often hear the argument, “oh, Kafka is overkill for what we are trying to do in this scenario.” But that argument would be a lot more credible if the team sitting right next to yours weren’t already running a massive Kafka cluster to solve a problem. You don’t drive from SF to NYC just because “getting on a flight is overkill for a single person.” You just get on the plane like normal people.

There is an ecosystem evolving around Kafka now, something I refer to as Kafka++. This ecosystem includes the benefits Kafka users can get from other open source technologies like Apache Flink (analyzing and processing data in motion) and Apache Iceberg (making data in streams available for downstream analytical use cases). Kafka++ will help widen the aperture through which users look at Kafka. While there is tremendous scope for engineering teams to leverage their Kafka investment for more operational use cases, an even bigger unlock can happen when their data counterparts (on the data engineering, data science, and analytics teams) see the power of Kafka-delivered real-time data in their world.
