Thread-Safe Lock Free Priority Queues in Golang

scottlobdell.me

92 points by slobdell 9 years ago · 31 comments

fmstephe 9 years ago

It is interesting to me how popular linked lists are for non-blocking data structures. They are often slow and always so hard to reason about and implement safely.

Here's a very quick alternative:

    Push all messages into a channel IN.
    A goroutine reads from IN and inserts into a sorted TREE.
    The same goroutine reads the max from TREE and pushes it to channel OUT.
    Workers read from OUT.
This is a pretty mediocre implementation, but I post it because the numbers posted in this article were so low. I think we can do better with something far less sophisticated.

https://gist.github.com/fmstephe/4fdc930ff180be3e92c693ad5a2...

I timed that to write 600,000 messages concurrently and read 600,000 sequentially in 1.5 seconds. Not a perfect measurement but I'm on a train and I'm about to get to my stop.

If we need really high performance we can substitute faster, simpler queues for Go's channels, and we can very likely improve the red-black tree that does the sorting. Each of these components is simpler, more likely to be correct, and from what I can see substantially faster than the sophisticated lock-free queue described in the article.

I don't want to come across as snarky. But I am surprised by the popularity of linked-list based lock-free data structures like this.

EDIT: The tone of my post came across nastier than I really intended. I quite enjoyed the article, and I'd be interested to see how far the performance can be improved.

  • Saltor 9 years ago

    I don't think that the implementation here performs correctly...

    For example, the following:

        q := NewPrioq()
        q.In <- timestamp(20)
        q.In <- timestamp(10)
        q.In <- timestamp(30)
        o := <-q.Out
        fmt.Println("Read", o)
    
    Prints "Read 20" rather than "Read 30".

    Furthermore, this input:

        q := NewPrioq()
        for i := 0; i < 202; i++ {
            q.In <- timestamp(i)
        }
        o := <-q.Out
        fmt.Println("Read", o)
    
    will result in a deadlock.

    As far as I can tell, this code behaves like a FIFO queue with capacity 201. Each iteration of the loop in Prioq.run() will read once from the input channel, insert into the empty tree, remove the "max" element from the single item tree, and then insert into the output channel.

    Reading from the input channel and writing to the output channel are performed in lockstep rather than as needed. A correct implementation of this pattern would need a select block to avoid the deadlock, and a way to signal writes to the output channel as needed so as to avoid the stale max value problem.
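
    For illustration, a minimal sketch of such a corrected loop, using container/heap as a stand-in for the gist's tree (the heap type, channel sizes, and int64 payloads here are assumptions, not the gist's actual code):

        package prioq

        import "container/heap"

        // maxHeap stands in for the sorted tree; index 0 always holds the max.
        type maxHeap []int64

        func (h maxHeap) Len() int            { return len(h) }
        func (h maxHeap) Less(i, j int) bool  { return h[i] > h[j] }
        func (h maxHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
        func (h *maxHeap) Push(x interface{}) { *h = append(*h, x.(int64)) }
        func (h *maxHeap) Pop() interface{} {
            old := *h
            x := old[len(old)-1]
            *h = old[:len(old)-1]
            return x
        }

        type Prioq struct {
            In  chan int64
            Out chan int64
        }

        func NewPrioq() *Prioq {
            // Out is unbuffered so the max is chosen when a worker is ready.
            q := &Prioq{In: make(chan int64, 64), Out: make(chan int64)}
            go q.run()
            return q
        }

        // run owns the heap, so the heap itself needs no locking. The select
        // drains In and feeds Out as each side becomes ready, instead of in
        // lockstep, which removes the deadlock.
        func (q *Prioq) run() {
            h := &maxHeap{}
            for {
                if h.Len() == 0 {
                    heap.Push(h, <-q.In) // nothing to send; block for input
                    continue
                }
                select {
                case v := <-q.In:
                    heap.Push(h, v)
                case q.Out <- (*h)[0]:
                    heap.Pop(h)
                }
            }
        }

    Keeping Out unbuffered means the max is selected only when a worker is ready to receive, which shrinks (but does not eliminate) the window for stale maxes.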

    • fmstephe 9 years ago

      You are right.

      But the purpose of that code was just to indicate the rough performance that was possible using much less sophisticated parts, compared to a lock-free linked list with sorting built in.

      But, thinking about it further, it does seem like solving the 'stale max value' problem would be non-trivial.

      If I were implementing this for a real system I would side-step that problem by redefining the requirements (which could be seen as cheating).

      We can agree that the Out channel is full of stale max values and isn't really well sorted. But it will only ever hold len(Out) stale values, and the values that replace them as Out drains are reasonably well ordered. At any time we have at most len(Out) out-of-order tasks.

      I am thinking of the queue in two different scenarios.

      1: Workers are keeping up with new tasks. The queue provides FIFO-ish semantics and no effective priority ordering, but it doesn't matter: if workers are keeping up, priority is irrelevant.

      2: Workers are falling behind. The red-black tree fills up once Out is at capacity, and as Out drains, new tasks come out in (roughly) priority order.

      Because of the vagueness of the guarantees given above you'd need to think hard about whether that is enough. But I would strongly prefer something along these lines (i.e. avoiding a lock-free linked list with built-in sorting) if I thought I could live with it.

      What do you think?

  • slobdellOP 9 years ago

    Thanks for posting... I will try the implementation you linked to and see the effect on performance.

    Channels end up locking under the hood as well, so from a purity standpoint I wanted to avoid them... if channels are indeed faster, then clearly I need to rework my approach.

    • adekok 9 years ago

      Locking isn't the problem. Contention is the problem.

      Even "lock free" algorithms can still fail to scale due to contention. Read this paper for a good summary:

      http://queue.acm.org/detail.cfm?id=2991130

      In short, the queue is a linked list where each entry is itself a circular buffer. Insertions go into a circular buffer... unless there's contention, in which case the next circular buffer is used.

      The result is that insertions are mostly uncontended, removals are mostly uncontended, and it scales very well.
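
      A rough structural sketch of that layout in Go (field names and sizes are illustrative, not the paper's):

          package fastqueue

          import "unsafe"

          // segment is one node of the linked list: a fixed-size ring
          // buffer whose slots are claimed with atomic index increments.
          type segment struct {
              slots [1024]unsafe.Pointer // circular buffer of entries
              head  uint64               // consumer index, advanced atomically
              tail  uint64               // producer index, advanced atomically
              next  unsafe.Pointer       // *segment, linked in when this one fills or contends
          }

          // queue tracks the segments producers fill and consumers drain.
          type queue struct {
              headSeg unsafe.Pointer // *segment being drained
              tailSeg unsafe.Pointer // *segment being filled
          }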

    • fmstephe 9 years ago

      If you want to push the performance of that approach (a single goroutine prioritising your messages for you, with a queue in and a queue out), I have built a set of lock-free queues on a ring buffer.

      https://github.com/fmstephe/flib/tree/master/queues/spscq

      Those are only single-producer single-consumer (SPSC), so not flexible enough for your needs right now, but I am working on expanding into multi-producer/multi-consumer variants, which could be a good fit.

      The single producer/consumer queues get up to 100 million messages per second (on microbenchmarks) and I expect the multi variants to be above 10-20 million per second.
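
      The core SPSC idea fits in a few lines (a simplified sketch, not the flib code: each index has exactly one writer, so atomic loads and stores are enough):

          package spsc

          import "sync/atomic"

          // Q is a bounded single-producer single-consumer queue on a ring
          // buffer. Only the producer writes tail and only the consumer
          // writes head, so no CAS is needed, just ordered loads/stores.
          type Q struct {
              buf  []interface{}
              mask int64 // len(buf) - 1; len(buf) must be a power of two
              head int64 // next slot to read, advanced by the consumer
              tail int64 // next slot to write, advanced by the producer
          }

          func New(size int64) *Q {
              return &Q{buf: make([]interface{}, size), mask: size - 1}
          }

          // Put returns false when the queue is full.
          func (q *Q) Put(v interface{}) bool {
              t := atomic.LoadInt64(&q.tail)
              if t-atomic.LoadInt64(&q.head) == int64(len(q.buf)) {
                  return false
              }
              q.buf[t&q.mask] = v
              atomic.StoreInt64(&q.tail, t+1) // publish the write
              return true
          }

          // Get returns false when the queue is empty.
          func (q *Q) Get() (interface{}, bool) {
              h := atomic.LoadInt64(&q.head)
              if h == atomic.LoadInt64(&q.tail) {
                  return nil, false
              }
              v := q.buf[h&q.mask]
              atomic.StoreInt64(&q.head, h+1)
              return v, true
          }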

      I also have a hand-built red-black tree which does around 10 million inserts/pops per second:

      https://github.com/fmstephe/matching_engine/tree/master/matc...

      Although that is very specialised for another purpose, I would be happy to try stripping it down if you wanted. But start with the LLRB in that gist.

    • jasonwatkinspdx 9 years ago

      My experience is that you can lock and unlock a Go mutex in around 130ns, do a CAS in around 60ns, and do a fetch-and-increment in around 30ns. So building something atop the atomic primitives is potentially faster, but the difference is not so dramatic that you shouldn't try a simple implementation first.
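
      Numbers like those are easy to sanity-check on your own hardware with a throwaway benchmark (a sketch; results vary by CPU and Go version — save as prim_test.go and run `go test -bench=.`):

          package prim

          import (
              "sync"
              "sync/atomic"
              "testing"
          )

          // BenchmarkMutex times an uncontended lock/unlock pair.
          func BenchmarkMutex(b *testing.B) {
              var mu sync.Mutex
              for i := 0; i < b.N; i++ {
                  mu.Lock()
                  mu.Unlock()
              }
          }

          // BenchmarkCAS times a compare-and-swap that always succeeds.
          func BenchmarkCAS(b *testing.B) {
              var n int64
              for i := 0; i < b.N; i++ {
                  atomic.CompareAndSwapInt64(&n, n, n+1)
              }
          }

          // BenchmarkFetchAdd times an atomic fetch-and-increment.
          func BenchmarkFetchAdd(b *testing.B) {
              var n int64
              for i := 0; i < b.N; i++ {
                  atomic.AddInt64(&n, 1)
              }
          }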

      • greenleafjacob 9 years ago

        Lock-free structures typically have worse average, or even uniformly worse, performance. The issue is what happens under contention, and predictability: lock freedom guarantees that at least one process makes progress, whereas with locks a process can acquire a lock and then get switched between cores or GC-paused while holding it.

  • jalfresi 9 years ago

    That's a really interesting approach, thanks for posting.

theaustinseven 9 years ago

These queues are thread-safe and lock-free, but they are far from efficient. My biggest issue here is that the runtime evaluations are wrong. The insert operation definitely has an average runtime of O(N) (where N is the number of elements in the queue) and a worst case of infinity (since the insert operation is restarted if a race condition occurs, an insert could be repeated forever). I don't know if I missed anything, so I could be wrong, but this implementation seems incredibly flawed.

That said, it is always nice to see new data-structures and I really wish there were more posts like this. I always like to see the different ways that people try to solve these problems.

  • jasonwatkinspdx 9 years ago

    Yeah, something's up with the OP's numbers/design. I didn't dig into it further because I found the writing style off-putting. A basic lock-free queue or linked list in Go should run at 10+ million ops per second. I suspect even with the parallelization the priority queue implementation is spending too much time reorganizing. A skiplist of 64-byte or 4KB buckets would probably be both simpler and faster.

  • slobdellOP 9 years ago

    I agree, these are currently far from efficient (I hope to resolve this). My suspicion right now is that a lot of time is being spent somewhere in spinning loops waiting for progress to be made.

    The average runtime of insertion is constant because the linked list does not grow beyond a certain size, and inserting into the priority queue is constant. The worst case is not infinity because there is always at least one goroutine making progress.

    Thank you for the comment about wanting to see posts like this :)

    I do not think this implementation is flawed, in the sense that at least one goroutine is always making progress, there are no locks, and results are returned in order if dequeues are not saturated.

    This is not something that would make it to a white paper, because the implementation does not return deterministic results and I could not characterize mathematically how far out of order the returned results can be, since they're not guaranteed to be in order.

    I should have also included the profiling results. Currently 25% of time is spent garbage collecting, which is expected because the priority queues effectively use multi-version concurrency control, but this will come in handy when I want to persist data to disk with as little locking as possible (and for general simplicity in avoiding corruption).

    Another 15% or so is spent sleeping, which would be the spinning loop part.

    • theaustinseven 9 years ago

      The insert is still an O(N) operation, since N in this case is the number of nodes in the list. It would only be O(1) if the list were of constant size (and even that would be misleading). The worst case is still infinity, because you are concerned with a single insert operation, not with whether some operation somewhere is making progress. We are concerned with the progress that a given insert operation makes, and there could potentially be an infinite loop. You can reduce the probability of this happening by shortening the retry loop around the CAS operation.

      Otherwise I think it would be awesome to see profiling results; I suspect that the garbage collection time can be reduced significantly. I have recently been implementing a locking thread-safe queue, and I think it would be interesting to compare the differences in speed and garbage produced, just to get a better idea of the pros and cons. I suspect that if you shortened your retry loop, your implementation would be faster.

      When I said flawed, I meant that it was not really enforcing order on insertion, but for many applications this doesn't matter that much. Close enough is often good enough (and the probability of this going wrong is fairly small, except in extreme cases).

bjacokes 9 years ago

I get that this is dry subject matter and some humor helps to lighten it up, but I found the writing style distracting. Some paragraphs are dense with algorithm details; others just have a picture of a tech office, a story about running into someone on BART, or descriptions of data structures as "lame" or "losers". That made it really tough to retrace my steps and get context from the preceding section.

Animats 9 years ago

Can you write lock-free code in Go without assembly language support? Sometimes you need fence instructions or hardware compare-and-swap.

  • jasonwatkinspdx 9 years ago

    Yes. Portable primitives are provided in the sync/atomic package. They've been careful about the details (e.g., inserting fences on architectures like ARM). Most applications shouldn't touch this stuff, but it's there if you want to try to write a lock-free data structure or algorithm.
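
    For example, the retry loop that most lock-free structures are built on looks roughly like this (a sketch using only sync/atomic):

        package lockfree

        import "sync/atomic"

        // casIncrement retries until its compare-and-swap wins. This
        // load/CAS loop is the basic building block of lock-free code.
        func casIncrement(addr *int64) {
            for {
                old := atomic.LoadInt64(addr)
                if atomic.CompareAndSwapInt64(addr, old, old+1) {
                    return
                }
            }
        }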

    • mappu 9 years ago

      There are some interesting caveats to using sync/atomic. Some code that worked on x86-64 panicked on x86-32, because the target struct member was no longer 64-bit aligned.
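
      A hypothetical struct shows the trap: on 32-bit platforms only the first word of an allocated struct is guaranteed to be 64-bit aligned, so field order matters:

          package counters

          // bad panics in atomic.AddUint64(&b.total, 1) on x86-32: the bool
          // pushes total to a 4-byte offset, so it is not 8-byte aligned.
          type bad struct {
              ready bool
              total uint64
          }

          // good works everywhere: 64-bit fields placed first inherit the
          // guaranteed alignment of the struct's first word.
          type good struct {
              total uint64
              ready bool
          }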

      • jasonwatkinspdx 9 years ago

        Yeah. I believe the API docs do call that out. There's not much that can be done about it either, short of forcing 8-byte alignment for all Go objects on 32-bit platforms.

        • mappu 9 years ago

          Not all, only those that can be proven to be passed to atomic calls.

          Given the rejuvenated emphasis on compile performance, it's probably not worthwhile to figure that out :) Overall Go's "don't sugarcoat everything with abstractions" attitude suits me fine for now.

      • fmstephe 9 years ago

        mappu, do you have a need for 64-bit-aligned data?

        I have a repo that provides a set of packages for doing things like that, and if this is a real need I would be interested in adding it to the library.

        • mappu 9 years ago

          I have a need for quickly aggregating counters across multiple goroutines. `atomic.AddUint64(&central, threadLocal)` works quite well for that, better than pushing to a channel, better than deferring the aggregation until a sync point.

          Any needs I have for alignment are derived solely from the requirements of the atomic package.
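
          Concretely, the pattern is something like this (a sketch; the batch shape is illustrative):

              package agg

              import "sync/atomic"

              var central uint64 // shared total, touched only via sync/atomic

              // worker sums its batch locally and folds the result into the
              // shared counter once, instead of contending on every item.
              func worker(batch []uint64) {
                  var local uint64
                  for _, v := range batch {
                      local += v
                  }
                  atomic.AddUint64(&central, local)
              }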

jalfresi 9 years ago

Hats off, this is a GREAT post! Lots of detail and I'm way out of my depth but lots to dig into and learn from!

Thanks for posting!

happytrails 9 years ago

I didn't make it past the bar pic :(

  • employee8000 9 years ago

    Comments like this are NOT appreciated on Hacker News, especially if you haven't taken the time to read the article.

    • softawre 9 years ago

      Really? This seems like (potentially poorly worded) feedback that might be useful to the author.

    • _Codemonkeyism 9 years ago

      Good that you know what is appreciated on Hacker News to tell us.

      Oh, you meant that the comment police from the ministry of truth will kill his karma; yes, they surely will. They always do.

      PS: Downvoting below zero is not moderation, it's punishment, and yes, I know they enjoy it.

    • oneloop 9 years ago

      I appreciated it.

      I notice that in about 80% of your comments on your comment history you mention how old you are. Maybe you're just getting grumpy?
