GCP and Confluent partner to deliver a managed Apache Kafka service

cloud.google.com

102 points by jto1218 8 years ago · 30 comments

georgewfraser 8 years ago

Why would someone want to use Kafka rather than PubSub? For a nonexpert, it seems like PubSub is just a tighter abstraction---same basic operations, fewer knobs to turn. Are there important features in Kafka that aren't in PubSub and can't be easily built on top of PubSub?

  • SmirkingRevenge 8 years ago

    I upvoted your question - not sure why it's been downvoted, as it seems like a good-faith question, and a very reasonable one at that.

    Subtle differences in the semantics of pub/sub and message passing services can have really significant consequences for their use-cases.

    Google Pubsub is at-least-once delivery, and best-effort ordering. That means any consumer pipelines need to be able to tolerate duplicates and out-of-order messages - in many cases, that's not trivial to handle. If the order of your messages isn't that important, and if you can make the operations of consumers idempotent - pub/sub is awesome. But more often than not, it becomes just another message passing service to add to your plethora of message passing services, because its limitations keep it from covering all of your use-cases. (I really want one ring to rule them all - Kafka gets the closest, IMHO)
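
    The usual workaround is to make the handler idempotent by deduplicating on a message ID, so a redelivery becomes a no-op. A rough sketch (class and names made up; the in-memory map stands in for whatever store you actually have, e.g. Redis or a DB table):

        import java.util.concurrent.ConcurrentHashMap;

        // An at-least-once consumer made idempotent by deduplicating on a message ID.
        class DedupingHandler {
            private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

            void handle(String messageId, byte[] payload) {
                if (seen.putIfAbsent(messageId, Boolean.TRUE) != null) {
                    return; // duplicate delivery: drop it instead of re-applying the side effect
                }
                process(payload); // the only place side effects happen
            }

            void process(byte[] payload) { /* application logic goes here */ }
        }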

    Kafka is basically pub/sub on top of an ordered, append-only log, and consumers read the log stream very much like a single process reads/seeks on a file handle - using offsets. Given infinite storage, your entire data stream can be replayed, and any data can be recreated from scratch - that's a pretty awesome thing.
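
    To make the file-handle analogy concrete, here's a minimal sketch with the Java client (topic name made up, error handling omitted): assign a partition, seek back to offset 0, and the whole log streams past again:

        import java.time.Duration;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.TopicPartition;

        public class ReplayFromZero {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    TopicPartition tp = new TopicPartition("events", 0);
                    consumer.assign(List.of(tp));
                    consumer.seek(tp, 0L); // rewind, like seek() on a file handle
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                        System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
                    }
                }
            }
        }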

  • manigandham 8 years ago

    Pubsub has no ordering guarantees, no concept of keys, partitions, or compaction (if needed), and no replay abilities (although "snapshot" functionality is in alpha).

    If you just want an event stream with loose ordering, no throughput or partition management, and the ability to 'ack' each individual message, then GCP pubsub is a pretty good choice. It can also do push-based subscriptions. The client libraries are rather buggy though and end up with memory leaks and network failures.

    Kafka meanwhile is much more compatible with many other services that are typically run these days, from stream processing to databases. That being said, Kafka is also pretty easy to run and VMs are extremely reliable on any cloud, so I don't see much value in managed solutions for that, unless someone just wants to completely de-risk. The bandwidth costs add up quickly for even moderate usage.

    • diminoten 8 years ago

      Confluent is more than just Kafka though, it's also got a schema registry that lets you validate your messages, as well as an enhanced REST API for interaction with the Kafka service.

      Confluent is run by the same folks from LinkedIn who built Kafka originally (the article talks about this a bit more), so you're in good hands with regard to the product's vision remaining consistent with the original innovation.

      • manigandham 8 years ago

        Yes I know who Confluent is, and their open-source distribution already includes the schema registry and HTTP interface if you want it: https://www.confluent.io/product/compare/

        There are plenty of hosted Kafka providers these days; it's not that difficult to run, and the advantage of Confluent being the original authors is really just marketing. They're also working on a Kafka operator (one of many in the market) to make it even easier to run on Kubernetes.

  • lbradstreet 8 years ago

    PubSub is fundamentally a different abstraction from Kafka, which is log-oriented. With Kafka you can maintain the ability to replay your log/stream, in order, at any time.

    This enables different kinds of use cases, and can be easier to reason about.

    • georgewfraser 8 years ago

      Super interesting, I admit I haven't spent much time understanding Kafka---it seems like it's almost a hybrid between a message queue and a database?

      • oskari 8 years ago

        Kafka allows clean decoupling of event producing and consuming logic unlike typical message queues. Your clients can keep pushing new messages to Kafka's persistent logs without having to know who and what will process those messages.

        This makes it possible to easily update your event processing logic or add new components to the mix while maintaining a clean architecture.

        I gave a talk about this at PostgresConf US a few weeks ago. The talk's not yet available as an article, but you can get the slides from https://aiven.io/blog/aiven-talks-in-postgresconf-us-2018/ if you're interested.

      • wenc 8 years ago

        Yes. Kafka is a horizontally-scalable persistent queue (among other things). This sounds kinda pedestrian (like something out of a computer science textbook), but people are rapidly discovering uses for it as a systems architecture component.

        The concept of "stream-table duality" tells us tables can be thought of as a materialized view of a stream of change-data operations (INSERT/UPDATE/DELETE). Kafka can be used as a buffer for streaming data that can be materialized into a relational table at any time.
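
        A toy version of the duality, just to fix the idea (names made up): folding a key/value changelog into a map materializes the "table" as of that point in the stream, with a null value acting as a DELETE tombstone:

            import java.util.HashMap;
            import java.util.Map;

            // Replaying a changelog of (key, value) updates left-to-right
            // materializes the current state of the "table".
            class Materializer {
                private final Map<String, String> table = new HashMap<>();

                void apply(String key, String value) {
                    if (value == null) {
                        table.remove(key);     // tombstone = DELETE
                    } else {
                        table.put(key, value); // INSERT or UPDATE
                    }
                }
            }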

        One of the more interesting use-cases is multi-target replication: feed your change-data-capture data into Kafka, and replay it on any other backend data store (SQL, Graph Database, NoSQL, etc.)[1]

        Conceptually this lets you ingest data into a stream, write the data to multiple backends and keep everything in sync. Martin Kleppmann has written a simple (PoC-quality) tool for doing this with Postgres databases.

        [1] https://www.confluent.io/blog/bottled-water-real-time-integr...

      • lbradstreet 8 years ago

        Yes, I like to think of it as the transaction log without any of the logic. If you’re interested, I can highly recommend Martin Kleppmann’s talk “Turning the database inside out” https://www.youtube.com/watch?v=fU9hR3kiOK0

      • manigandham 8 years ago

        I thought you used it at Fivetran? It's a logging system where you send messages as a key/value of bytes to a topic, which is just a logical group of partition files spread over the nodes as master + replicas.

        Consumers read from the topic (the underlying partitions) and maintain their last-read offset themselves, allowing for easy replay, strict ordering within a single partition, and completely decoupled pacing from producers. The optional "key" for each message allows for compaction, so only the latest key/value pair is kept in the topic.
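
        Compaction, for what it's worth, is just a per-topic setting. A rough sketch with the Java AdminClient (topic name and sizing made up):

            import java.util.List;
            import java.util.Map;
            import java.util.Properties;
            import org.apache.kafka.clients.admin.AdminClient;
            import org.apache.kafka.clients.admin.NewTopic;

            public class CreateCompactedTopic {
                public static void main(String[] args) throws Exception {
                    Properties props = new Properties();
                    props.put("bootstrap.servers", "localhost:9092");
                    try (AdminClient admin = AdminClient.create(props)) {
                        // cleanup.policy=compact: broker keeps only the latest value per key
                        NewTopic topic = new NewTopic("user-profiles", 6, (short) 3)
                                .configs(Map.of("cleanup.policy", "compact"));
                        admin.createTopics(List.of(topic)).all().get();
                    }
                }
            }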

        It's definitely not a database but works well for replicating them, as well as doing most of the work of a typical message queue / service bus system.

        • georgewfraser 8 years ago

          Nope, no Kafka at Fivetran - nearly all the sources we pull from are intrinsically batch, support retry, and come out of the source already partitioned by user. Our internal architecture is basically just a farm of batch processors.

          We have been hearing more and more about people using Kafka to support streaming analytics. I haven't spent a ton of time studying it - I was under the impression that it was just a giant queue - but ordering + retention is basically a database in the way I think about it.

        • dvlsg 8 years ago

          If consumers need to keep track of their own offsets, do you run into complications when trying to run multiple instances of the same consumer? Or do you typically run single processes when consuming a specific topic?

          • manigandham 8 years ago

            This is what partitions are for. They are used to scale out both publishing and consumption by assigning each consumer its own partition(s). This logic used to be complicated client-side code but is now included in Kafka itself, along with automatic offset saving on some interval (saved as entries in another topic).

            Also, consumers can be in a "consumer group", so you can have multiple groups each reading the entire log separately, while the partitions are shared within each group.
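
            With the Java client it's all config now - start N copies of this with the same group.id and the partitions get divided between them (names made up, untested sketch):

                import java.util.List;
                import java.util.Properties;
                import org.apache.kafka.clients.consumer.KafkaConsumer;

                public class GroupedConsumer {
                    public static void main(String[] args) {
                        Properties props = new Properties();
                        props.put("bootstrap.servers", "localhost:9092");
                        props.put("group.id", "billing-pipeline");    // same id => share the partitions
                        props.put("enable.auto.commit", "true");      // offsets saved automatically...
                        props.put("auto.commit.interval.ms", "5000"); // ...on this interval, to an internal topic
                        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

                        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                        consumer.subscribe(List.of("events")); // coordinator assigns partitions and rebalances
                    }
                }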

            Also I'd recommend looking at Apache Pulsar for a next-generation architecture that combines Kafka's log semantics with the low-latency routing and individual message acknowledgements of message queues: https://pulsar.incubator.apache.org/

      • asavinov 8 years ago

        > it seems like it's almost a hybrid between a message queue and a database?

        If Kafka supported data processing (queries or whatever) then it would be much closer to databases. Also, databases are normally aware of the structure of the data (for example, columns).

        Therefore, Kafka can hardly be viewed as a DBMS because it explicitly separates two major concerns:

        * data management - how to represent data (Kafka)

        * data processing - how to derive/infer new data (Kafka Streams - a separate library)

        Theoretically, if they could combine these two layers of functionality in one system then it would be a database.
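
        To illustrate the split: the data management side is just topics, while the processing side is a library you embed in your own app. A minimal Kafka Streams sketch (topic names made up):

            import java.util.Properties;
            import org.apache.kafka.common.serialization.Serdes;
            import org.apache.kafka.streams.KafkaStreams;
            import org.apache.kafka.streams.StreamsBuilder;
            import org.apache.kafka.streams.StreamsConfig;

            public class ErrorFilter {
                public static void main(String[] args) {
                    Properties props = new Properties();
                    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "error-filter");
                    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
                    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
                    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

                    StreamsBuilder builder = new StreamsBuilder();
                    // Derive new data: read one topic, filter, write the result to another
                    builder.<String, String>stream("events")
                           .filter((k, v) -> v != null && v.contains("error"))
                           .to("errors");
                    new KafkaStreams(builder.build(), props).start();
                }
            }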

        • Ezku 8 years ago

          For me, the best way to think about Kafka is as a distributed log. That said, it does feature rather robust transformations of log entries - does KSQL count as “queries or whatever”? https://www.confluent.io/product/ksql/

          • techno_modus 8 years ago

            As far as I understand, KSQL is not an integral part of Kafka - it is based on Kafka Streams (an independent library). So Kafka is not one system, hence some ambiguity when referring to its functionality. In particular, "Kafka is a distributed log" means only Kafka core, not Kafka Streams and KSQL.

      • jto1218OP 8 years ago

        I love this article about the log abstraction and some of the motivation behind Kafka. It's a great resource for getting a feel for what it is and how you can use it:

        https://engineering.linkedin.com/distributed-systems/log-wha...

  • iampims 8 years ago

    Log compaction is one important feature missing from both Google Pub/Sub and AWS Kinesis streams.

  • strictnein 8 years ago

    At the upper end of the scale (300k events per second, PBs of data per month) the pricing on Pub/Sub begins to get insane. We'd be at something like $500k-$600k a month.

    • jkaplowitz 8 years ago

      At that scale, you wouldn't be paying list price with any provider. These things are negotiable for significant customers.

  • free652 8 years ago

    Kafka is "free", it's scales much better and it's quiet resilient. It's pretty light on the servers resources like RAM and CPU.

    The downside is Kafka doesn't provide guaranteed delivery out of the box: the producers buffer messages, and that's where loss can happen - if you need guaranteed delivery then you need to write extra code.
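
    The usual belt-and-braces producer setup looks roughly like this (Java client, topic name made up) - stronger acks plus a send callback so you notice when a buffered message didn't make it:

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class SafeProducer {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("acks", "all");                // wait for all in-sync replicas
                props.put("retries", Integer.MAX_VALUE); // retry transient broker failures
                props.put("enable.idempotence", "true"); // retries won't create duplicates
                props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    producer.send(new ProducerRecord<>("events", "key", "value"), (meta, err) -> {
                        if (err != null) {
                            err.printStackTrace(); // a buffered message was lost: retry/alert here
                        }
                    });
                    producer.flush(); // block until the buffer has actually reached the brokers
                }
            }
        }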

  • wink 8 years ago

    A lot of people are already running Kafka - why would you switch your stack?

    I know I found it a bit non-trivial to run, so if I were still using Kafka on GCP boxes anyway I'd absolutely try this out.

lbradstreet 8 years ago

I love Kafka and the log-oriented streaming model, but I often have to think twice before recommending it to clients who would have to manage the ops themselves. Having a managed service on GCP, and Confluent's existing cloud offering on AWS, really brings down the barrier to entry. There aren't really any AWS/GCP serverless equivalents (Kinesis has a 7-day retention maximum, no key compaction, and less surrounding tooling such as KStreams/KSQL).

  • regnerba 8 years ago

    May I ask why you wouldn't recommend it to teams that have to manage it themselves? I haven't used it myself but my team is currently looking at using it internally. The first project will be integrating it into our log pipeline between nodes and our logstash instances.

    • beepbeepbeep1 8 years ago

      There is absolutely no reason other than the overhead of self-managing the service, like you would any other internal service.

      If you are comfortable with operations you'll be fine. Some teams are not good at ops, so outsourcing the problem and making the ops side someone else's issue can also be useful.

      Self hosting will offer far more options when it comes to scaling and tweaking. Overall, on bare hardware it's cheaper and faster, although up-front costs will be higher.

      Kafka use cases are rarely elastic, so you don't gain that advantage in the cloud. Also, Kafka's lack of tiered storage makes it expensive if you're storing large volumes of data.

    • sologoub 8 years ago

      In addition to what others have stated, it’s also a question of productivity - when your team has to maintain ops, you are not doing something else. Is it your highest and best use to be tuning and maintaining Kafka? It could be, but only you and your team can really answer that.

      In practice, it’s better to offload things that are not core to what your company is making money on, until you hit constraint points from scaling. As one of the comments mentioned, at PB-scale processing, even bare metal may make sense. (But not always - one of my former employers went down this path early and ended up losing all productivity in R&D because of people fighting to keep that baremetal setup alive for months on end. This really hurt future revenue growth and distracted eng. leadership from key changes in their industry.)

      Like with most complex questions, the immediate answer is “it depends”.

    • lbradstreet 8 years ago

      It's not that I wouldn't recommend it, I would just think twice recommending it for very small ops teams for certain use cases. It's not so hard to manage, but having it be managed by Confluent is a great option.
