Show HN: KQ – Simple Job Queue for Python Using Kafka

github.com

116 points by joohwan 9 years ago · 30 comments

Xorlev 9 years ago

One of the biggest problems with treating Kafka as a job queue is that you suffer from head-of-line blocking. Kafka doesn't expose per-message visibility/acknowledgement semantics the way RabbitMQ, Redis PUSH+POP, or SQS do. Each consumer group tracks offsets into the partitions of a log (aka a topic). This offset is just a number that points to a specific message in the Kafka partition. If you get stuck on message 123, you either block and can't proceed to 124, proceed without committing your offset (risking a replay of 124), or skip 123 entirely.
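
The three options can be sketched with a toy consumer loop (a minimal model, not the Kafka consumer API; the `consume` function and its commit semantics are illustrative simplifications):

```python
# Toy model of a partition consumer facing a poison message (not real
# Kafka; offsets here index a plain list and "commit" is just a counter).

def consume(log, handler, on_failure="block"):
    """Process messages in order; return the last committed offset."""
    committed = -1  # -1 means nothing committed yet
    for offset, message in enumerate(log):
        try:
            handler(message)
        except Exception:
            if on_failure == "block":
                return committed  # head-of-line blocked: later offsets never run
            # "skip": drop the bad message and move on, losing it forever.
            # (The third option, proceed without committing, instead trades
            # this for a replay of everything after the last commit.)
        committed = offset
    return committed

def handler(message):
    if message == "poison":
        raise RuntimeError(message)

log = ["ok", "ok", "poison", "ok"]
consume(log, handler, on_failure="block")  # -> 1: stuck before offset 2
consume(log, handler, on_failure="skip")   # -> 3: offset 2 is lost
```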

A great many of our services publish to Kafka. Consuming services that treat individual records as tasks (or bundles of tasks), as opposed to a linear log, must either skip failures or push them onto SQS for background retry. Our batching consumers have to track out-of-order completion of work and commit up to the lowest completed offset, meaning a slow task can delay offset commits. If a consumer is stopped before finishing that slow task, we have to replay work, which means all work has to be idempotent. In practice it works well enough, but it's still some gymnastics.

I suspect this is why Google invested so much in making PubSub scalable despite per-message semantics. It's considerably simpler in many ways, even if you have to bake in your own ordering/monotonically increasing identifiers.

  • joohwanOP 9 years ago

    Very true. I indeed found the lack of per-message visibility very painful when I was building this. One way I tried to alleviate the issue was by providing a consumer "callback", to make it easier for users to plug in their own code to handle job failures (like your example of using SQS).
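
A failure callback along these lines might look like the following sketch (hypothetical names, not KQ's actual API; `send_to_dead_letter` stands in for an SQS publish):

```python
# Hypothetical failure-routing callback (illustrative only): on job
# failure, forward the job to a dead-letter destination such as SQS.

failed_jobs = []

def send_to_dead_letter(job):
    """Stand-in for publishing the failed job to SQS."""
    failed_jobs.append(job)

def on_job_done(status, job, result, exception):
    """Callback a worker could invoke after each job attempt."""
    if status == "failure":
        send_to_dead_letter(job)

# A worker would call the callback like this after a failed attempt:
on_job_done("failure", {"id": 1, "func": "send_email"}, None, RuntimeError("boom"))
len(failed_jobs)  # -> 1
```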

    I've also thought about reserving a topic + consumer group specifically for failed jobs and baking the retry logic into KQ itself. But that's an area I must explore more.

    I'm not sure if I understand what you are saying about batching consumers. What do you mean by batching in this context? Thanks for your input.

    • Xorlev 9 years ago

      We have some consumers which treat log entries as tasks, and often it's handy to debounce some of the work into larger chunks that can be executed in parallel. The chunks can be linear or they can be grouped by some property of the message (e.g. account id). In that case, we have batches of messages with multiple non-consecutive offsets, e.g. [123, 145, 155], [122, 124, 144]. In practice, that means inserting each message offset into a per-partition sorted set of pending work. When a batch completes, all the offsets in that batch are marked as "complete" and we commit the lowest safe offset. Using the example above, if the batch [122, 124, 144] completed, we'd still have [123, 145, 155] outstanding, which means the lowest safe offset is 122* even though 124 and 144 also completed in batch 1. Until that second batch completes, 123 is still outstanding, making it the barrier to committing a higher offset.

      Our batching consumers provide pluggable behavior for handling a failing batch, but usually it's pushed onto SQS since those can cycle around a few times until we notice and fix whatever condition is preventing progress on that work.

      * - 123 actually, since if you commit offset 123 the consumer will fetch offset 123 again on start, but that's implementation esoterica
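
The bookkeeping described above can be sketched as follows (an illustrative model, not Xorlev's actual code): pending offsets live in a per-partition sorted structure, and the lowest pending offset is the commit barrier.

```python
# Sketch of per-partition out-of-order completion tracking: the commit
# position is capped by the lowest offset still pending in any batch.
import bisect

class PartitionTracker:
    def __init__(self):
        self.pending = []  # sorted offsets handed out but not completed

    def start_batch(self, offsets):
        for o in offsets:
            bisect.insort(self.pending, o)

    def complete_batch(self, offsets):
        for o in offsets:
            self.pending.remove(o)

    def safe_commit_offset(self):
        # Per the footnote: committing offset N makes the consumer
        # refetch N on restart, so the lowest pending offset itself
        # is the safe position to commit.
        return self.pending[0] if self.pending else None

t = PartitionTracker()
t.start_batch([123, 145, 155])    # batch 1
t.start_batch([122, 124, 144])    # batch 2
t.complete_batch([122, 124, 144])
t.safe_commit_offset()            # -> 123: still the barrier
```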

pfarnsworth 9 years ago

Anyone have any opinions on the long-term viability of Kafka? I've been lurking on the kafka dev mailing lists and I'm fairly turned off by the attitude from Confluent's employees in what I've read. There was a recent thread about their open source status in which they backhanded Apache's open source philosophies; I'm wondering if they are thinking of moving away from open source in the future.

  • boredandroid 9 years ago

    I'm one of the original Kafka creators and am an evil corporate shill at Confluent :-)

    That was a super silly discussion and I'd be annoyed reading it too. It does not, however, represent any kind of "move away from open source" by Confluent.

    The discussion was not actually about open source vs non-open source but rather around how to curate the Kafka ecosystem. Currently there are several hundred independent open source projects that act as connectors, clients, monitoring tools, processing libraries and frameworks, etc. Some of these are fantastic and active communities, some are built by companies that use Kafka, some by vendors that use Kafka in a product in some way. This has always been the structure of this ecosystem, since even before we moved Kafka into Apache. These projects are mostly Apache licensed and on github, but mostly not part of the ASF. You can see a subsample of them here: https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem

    The proposal from a group at Hortonworks was to duplicate one of the modules Confluent built, a REST layer, and build a new one in the main project. I don't think this is inherently a bad thing to do, but there wasn't really any concrete rationale...there was no complaint with the code in that module, nor any kind of complaint with the governance, nor any complaint with the community, etc. As a result a lot of us at Confluent kind of felt it was an odd exercise that had more to do with Hortonworks' policy of only shipping Apache projects, which in turn comes out of the weird fractious politics of the Hadoop ecosystem. The reality is that policy just doesn't fit with the Kafka ecosystem as it stands today.

  • winteriscoming 9 years ago

    I am not associated with the Kafka dev team or Confluent in any way but have been using Kafka since their 0.7.x days. As for their open source commitment so far, I haven't seen any problems in the way they deal with the community. They have been open to bug fixes, feature enhancements and other contributions.

    Many of the Confluent employees you see currently started off by contributing to the open source code, and still do, from what I can see.

    >> There was a recent thread about their open source status

    I am not sure which thread you are talking about, but if it is the one about adding a REST server to Kafka core, then I completely back what many of the Confluent employees and other community members decided on that topic. From what I could see in that thread, the whole rationale for "we should bring the REST server into Kafka core" was not technical but hypothetical: "there's a project out there which already supports this REST feature, but what if they don't like my contributions and don't allow me to push features that I like into that repo". That proposal to bring the REST server into Kafka core was, IMO, rightly rejected, but at the same time users were invited to state the technical reasons why they want that feature in core.

    Given any production-usable project with a large user base, discussions and decisions like these are common, and that doesn't necessarily mean they are moving away from open source. Overall, I have high respect for many of the members of the Kafka dev team, many of whom are currently employed at Confluent, for the way they have so far dealt with suggestions for enhancements to the project.

    Of course, Confluent builds on top of Kafka and has/will have its own commercial interests, so some of the features they develop might/will be commercial.

    • pfarnsworth 9 years ago

      Yes, I think that's the one, but my biggest takeaway from what I remember was "Apache is okay but..." I remember reading about half a dozen Confluent employees toeing the party line and parroting the exact same argument about how being under Apache was stifling their innovation. And yet Kafka is thriving under Apache, so their arguments smelled fishy as hell.

      It sounds like they are laying the foundation to pull something like what Oracle did with MySQL: take control after countless outside helpers turned the product into something rock solid.

      • SEJeff 9 years ago

        There are plenty of outside helpers, but the heavyweights doing the overwhelming majority of the work are either Jay, Jun, or Neha and/or work for one of those 3... All at Confluent. I've met both Jay and Jun in person, and watched Neha present @ MesosCon last year. If they're doing something with tech, it is going to be good.

stratospherein 9 years ago

Don't you need zookeeper to use Kafka? Perhaps you should mention that in your dependencies.

jondot 9 years ago

To me, Kafka as a job queue is a painful impedance mismatch. To achieve that, you need to:

1. Figure out which Kafka broker version you're running. The consumer concept and consumer APIs in 0.7 are different from 0.8, which is different from 0.9, which is different again from 0.10 - ranging from non-existent, to quirky, to finally a good design that works. But you need to make sure offsets are committed in time.

2. Offset management is a thing. If you're lucky and using recent brokers you're good, but you'll still have to make sure the timings for submitting offsets and heartbeats interleave in a way that doesn't make Kafka think your consumers are dead. If you had trouble imagining this scenario - exactly. This was a very hard race condition to find, and it's solely up to you and your motivation to fix it rather than write it off as "a random glitch".

3. Kafka clients are still radically different, supporting different versions. You need to be lucky enough to use a language and platform that is in harmony with the latest consumer APIs. I'm certain the clients will converge; the ecosystem will converge more slowly.

4. For the reasons Xorlev mentioned, you will find yourself up against the wall, making sure each job task is idempotent. Suddenly, this becomes a people-management problem too.

All of these can (and probably _will_) be solved; however, I feel that (2) and (4) will always be there, because they're part of why Kafka is so great.
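
The idempotency requirement in point (4) can be sketched as a dedupe guard (a minimal illustration; a real system would persist processed job ids durably, not in a process-local set):

```python
# Replays force idempotent tasks: guard side effects behind a
# processed-id check so a redelivered job becomes a no-op.

processed_ids = set()  # a real system would persist this durably
charges = []

def handle_job(job_id, payload):
    if job_id in processed_ids:
        return  # redelivered after an uncommitted offset: do nothing
    processed_ids.add(job_id)
    charges.append(payload)  # the side effect we must not repeat

handle_job(123, "charge card $10")
handle_job(123, "charge card $10")  # replay: no double charge
charges  # -> ['charge card $10']
```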

In addition, I think Kafka is one product for which you _must_ read the "whitepaper"[0] before building consumers. The first reason is that it's an innovative design that might come in handy in everyday life if you're an engineer. The second is to understand the founding context in which it was created - logs - and why so many tradeoffs were made for it to be amazing at that, and to realize that this original founding context was _not_ transactional jobs.

Switching gears now. Many organizations find Kafka a much-needed cure for data processing pipelines, and for making events and messaging a first-class citizen in the organization from a _data_ point of view. For that, Kafka is amazing. With it, you can realize the dream of having an "event mart" where groups and teams consume and publish their view of the world, processed, as a message stream, and someone can pick up that stream and build a completely different product on top of it (not a perfect example, but one we can all relate to: think of Twitter's firehose).

The perception problem is that once this floods the organization, it's tempting to apply the same mindset to building _operational_ and transactional queue systems, where you don't process events or data but perform tasks. Unfortunately, the mindset doesn't transfer. I'd be happy if there were stronger education about this from Kafka's side.

For the kq project - I wish you the best of luck and I'd be interested to see it unfold. The code is very clean and I feel it's inviting to just read and learn from - kudos!

[0] http://www.longyu23.com/doc/Kafka.pdf

  • joowani 9 years ago

    Thank you for the excellent feedback and insight. I will definitely give the PDF a read. I agree with all of your points, which admittedly I was not fully aware of when I first embarked on this project. As you've implied, there are some nuances that may forever be inevitable due to the inherent design of Kafka. But I wouldn't want to dismiss it as unsuitable for job queues so early. It depends strongly on the use case, of course (e.g. jobs that are idempotent or without a hard requirement to be processed), but I am hoping that Kafka's API matures further with finer control over messages, and that this could work fairly well for the most part. For now I will take note and update the documentation to clearly explain what KQ is (and what it is not), and what best practices and use cases must be taken into account before using it. Thanks again!

reubano 9 years ago

Why should I use this over the simpler RQ?

  • joowani 9 years ago

    RQ is certainly useful (with finer control over messages), but we've been having a lot of problems with it lately in production due to it being memory-bound (Redis). Improper code deploys or insufficient/stuck workers would quickly explode the queues and make the broker go OOM in a matter of hours. Memory is also very expensive. With KQ/Kafka I was hoping we'd get a lot more buffer for human error and better scaling.
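
A back-of-envelope illustration of that failure mode, with made-up numbers (the rates, job size, and memory limit below are assumptions, not figures from the comment):

```python
# If stuck workers dequeue slower than producers enqueue, an in-memory
# broker's backlog grows linearly until it hits the RAM ceiling.

enqueue_rate = 1_000          # jobs/sec produced (assumed)
dequeue_rate = 200            # jobs/sec with stuck workers (assumed)
job_size = 2 * 1024           # bytes per serialized job (assumed)
redis_limit = 16 * 2**30      # 16 GiB of broker RAM (assumed)

growth = (enqueue_rate - dequeue_rate) * job_size  # bytes/sec of backlog
hours_to_oom = redis_limit / growth / 3600
round(hours_to_oom, 1)  # -> 2.9: OOM in about three hours
```

A disk-backed log like Kafka's pushes the same arithmetic out to disk capacity and retention settings instead of RAM, which is the extra buffer being described.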

    • reubano 9 years ago

      Ok, that makes sense. I've never dealt with the scale you're talking about, so I have yet to run into these issues with Redis. It may be best to put this info in the README so people can decide for themselves when to use RQ or KQ.

  • brianwawok 9 years ago

    Or the proven Celery

    • dozzie 9 years ago
      • al2o3cr 9 years ago

        Dunno why people would blame the Celery devs for that bug - if you're setting the clock on production servers to a non-monotonic TZ, you're going to have a bad time.

        • dozzie 9 years ago

          So you basically blame people for living in countries that use daylight saving time? O_O

          Also, I would understand if it were tasks between 2AM and 3AM that ran twice in the transition from summer time to winter time (there are two 2:30s that day, after all), but the bug affected tasks scheduled for the middle of the day.

      • brianwawok 9 years ago

        I mean, Celery is a pretty amazing library. It has flaws, but I doubt this new project has no flaws. They are just unknown still.

        For that specific bug in question, does KQ have logic around DST switch? I was not able to find any. So it looks like you would have the same bug in KQ, right?

dozzie 9 years ago

What kind of jobs do people put into such queues? Because I've only seen job queues used as poor man's RPC brokers and a (poor?) replacement for crontab. I think I may be missing some use cases where this is a valid choice.

  • chrishacken 9 years ago

    If you don't like it, don't use it. I read through your profile's comments; all you do is complain and bash other people's ideas.

    • dozzie 9 years ago

      > If you don't like it don't use it.

      For our web application, we're already in the middle of a transition from Celery to a proper RPC system, so your comment (aimed at, I don't know, making me ashamed? making me shut up and go do some real work?) somewhat misses the mark.

      Though I really do want to know if there are any sensible uses for a task queue, specifically with web applications in mind, since that's what most people seem to use task queues for.

      > [...] all you do is complain and bash other peoples ideas.

      Only the stupid or incomplete ones, like somebody boasting about their system without describing what it does or how to install and use it.

tobych 9 years ago

Typo: lightweight in the README should not be hyphenated :-)

crucialsnippet 9 years ago

Pretty clean code. Nice job.

smegel 9 years ago

> It is backed by Apache Kafka and designed primarily for ease of use.

That's a bold claim.

  • rads 9 years ago

    He means the software requires Kafka as a dependency, not that the foundation is sponsoring it. The claim of "designed primarily for ease of use" is a personal one, not hard to make.

azundo 9 years ago

Getting a Permission Denied page for the docs at http://kq.readthedocs.io/en/master/
