High performance services using coroutines

medium.com

69 points by gmosx 11 years ago · 15 comments

themartorana 11 years ago

Writing high performance server software is a whole other world from writing services. In the first, data copies can cause unacceptable slow-downs of a fraction of a millisecond - and those copies may be hidden all the way down at the driver level. When writing services (which are just as often written in Ruby or Python) trading off a few milliseconds for safety is often a worthy thing.

I spend most of my time writing services, and when looking at a language like Go, there's a reason the default is pass-by-value. Passing around pointers to multiple co-routines is asking for a race condition, and in non-GC languages, for dangling pointers. Services are rarely written with the precision of high-performance servers.

I don't envy the server writers (although I can see how it would be fun!). Giving up a few milliseconds per request to make sure my co-routines aren't sharing pointers is worthwhile, and I appreciate the safety that gives me. I'm sure someone will mention that Rust could give me the safety I was looking for in a non-GC language, but that's the point, isn't it - that by being able to game the system, you can gain a few precious microseconds here and there that enforced safety might cost you.

  • kibwen 11 years ago

      > I'm sure someone will mention that Rust could give me 
      > the safety I was looking for in a non-GC language, but 
      > that's the point, isn't it
    
    I'm not sure what you're trying to say here. There is no need to game the system, as Rust lets you pass pointers between threads while enforcing that they are used safely. Rust lets you be as fast as an unsafe language while being as safe as a managed language; the tradeoff is that you have to give more thought to your design up front (which, given the pain of trying to reproduce and fix race conditions, is more than worth the effort IMO).

aaronlevin 11 years ago

This sounds very similar to warp, the Haskell web server known for its high performance on many-core machines. The Web Application Interface (WAI) is also coroutine-based (I believe): http://www.aosabook.org/en/posa/warp.html

marktangotango 11 years ago

>> If not, the coroutine will yield, and at the same time, be scheduled(migrated) to another set of threads responsible for the ‘slow’ requests(ones that need to block). Those will just schedule the coro in their own scheduler, and perform the read again and continue.

>> Alternatively, we could just yield the coro, waiting for a new special coroutine that would run in a background thread just executing readahead() and then re-scheduling(making runnable) the original coro, waiting for readahead.

Seems to me this scheme will ultimately be limited by the slower requests. Mixing fast and slow operations is essentially Little's Law[1] territory, where the average service time dominates. However, if the slow/blocking reads were also async, I think you'd eventually be limited by I/O speed?

[1] http://en.wikipedia.org/wiki/Little%27s_law
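
As a rough worked illustration of the Little's Law point (all numbers hypothetical): with an arrival rate of λ = 1,000 requests/sec, where 90% of requests complete in 1 ms and 10% stall for 100 ms:

  W ≈ 0.9 × 1 ms + 0.1 × 100 ms = 10.9 ms
  L = λ × W ≈ 1,000/s × 0.0109 s ≈ 11 requests in flight

The slow 10% contribute over 90% of the average service time, which is the sense in which the slower requests dominate.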

  • markpapadakis 11 years ago

    I am familiar with Little's Law. The key concept is optimistic sequential execution. The overwhelming majority of requests do not need to page in cold data or otherwise stall for long. Also, in cases where they could stall, e.g. processing a lot of data in a loop, they can yield back to the scheduler, thereby being fair to all other accepted coroutines/requests and giving others a chance to complete earlier. In practice, at least based on my prototype and tests, this works extremely well. If only about 10-15% of requests would end up blocking a thread, while the rest can read kernel-cache-resident data and proceed, that's really huge in terms of performance gains. The background threads that take ownership of and responsibility for running slow/blocking coros can rely on a work-stealing scheme to keep themselves all busy; if you don't offload to them, thereby removing contention/stalls from the 'fast' threads, you are eventually going to suffer stalls and latency spikes.
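
    To make the slow-pool idea concrete, here is a minimal work-stealing sketch (this is not the article's code; SlowPool, Coro, and the mutex-guarded deques are simplifications): each background thread drains its own queue and, when it runs dry, steals from a sibling, so the pool stays busy.

      // Minimal sketch of a work-stealing pool for slow/blocking coros.
      // All names here (SlowPool, Coro) are hypothetical; a real scheduler
      // would use lock-free deques and park idle workers instead of sleeping.
      #include <atomic>
      #include <chrono>
      #include <cstddef>
      #include <deque>
      #include <functional>
      #include <mutex>
      #include <thread>
      #include <vector>

      using Coro = std::function<void()>; // stand-in for a resumable coroutine

      class SlowPool {
          struct Queue { std::deque<Coro> q; std::mutex m; };
          std::vector<Queue> queues;
          std::vector<std::thread> workers;
          std::atomic<bool> stop{false};

      public:
          explicit SlowPool(std::size_t n) : queues(n) {
              for (std::size_t i = 0; i < n; ++i)
                  workers.emplace_back([this, i] { run(i); });
          }
          ~SlowPool() {
              stop = true;
              for (auto &w : workers) w.join();
          }

          // Fast threads call this to migrate a coro that is about to block.
          void submit(std::size_t hint, Coro c) {
              auto &qu = queues[hint % queues.size()];
              std::lock_guard lk(qu.m);
              qu.q.push_back(std::move(c));
          }

      private:
          void run(std::size_t me) {
              while (!stop) {
                  Coro c;
                  if (pop(me, c) || steal(me, c))
                      c(); // run our own work first, otherwise steal
                  else
                      std::this_thread::sleep_for(std::chrono::microseconds(50));
              }
          }
          bool pop(std::size_t me, Coro &out) { // LIFO from our own queue
              auto &qu = queues[me];
              std::lock_guard lk(qu.m);
              if (qu.q.empty()) return false;
              out = std::move(qu.q.back());
              qu.q.pop_back();
              return true;
          }
          bool steal(std::size_t me, Coro &out) { // FIFO from a sibling's queue
              for (std::size_t i = 1; i < queues.size(); ++i) {
                  auto &qu = queues[(me + i) % queues.size()];
                  std::lock_guard lk(qu.m);
                  if (!qu.q.empty()) {
                      out = std::move(qu.q.front());
                      qu.q.pop_front();
                      return true;
                  }
              }
              return false;
          }
      };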

SixSigma 11 years ago

Syscalls are slow. See my previous HN post

https://news.ycombinator.com/item?id=8961582

rubiquity 11 years ago

The end implementation sounds a lot like... Erlang, with the exception that this solution avoids allocations that Erlang probably makes due to immutability.

  • markpapadakis 11 years ago

    This is a C++ implementation. Erlang and other high-level languages use processes (in the case of Erlang) or other such constructs to encapsulate logic, and those are indeed scheduled in either a cooperative or a preemptive fashion.

    The main goal of this system is to minimize the potential for blocking or otherwise slow processing steps that can stall the processing of other requests, especially when this happens for many requests per second.

    Specifically, the 'optimistic sequential execution' idea is that you run the request on the thread it was accepted on, use mincore() to determine if it will _likely_ stall if you read from it, and if so migrate the coro to the 'slow' coros thread list; otherwise you just keep going (which should be the fast path here).
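
    A minimal sketch of that residency check, assuming Linux and mmap'ed data (illustrative only, not the article's code):

      // Ask the kernel which pages of [addr, addr+len) are resident.
      // If they all are, a read should not fault and the coro can stay
      // on the fast path; otherwise, migrate it to the slow pool.
      #include <sys/mman.h>   // mincore()
      #include <unistd.h>     // sysconf()
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      bool likely_resident(const void *addr, std::size_t len) {
          const std::size_t page = sysconf(_SC_PAGESIZE);
          auto p = reinterpret_cast<std::uintptr_t>(addr);
          auto base = p & ~(page - 1);          // mincore() wants page alignment
          std::size_t span = (p + len) - base;
          std::size_t pages = (span + page - 1) / page;

          std::vector<unsigned char> vec(pages);
          if (mincore(reinterpret_cast<void *>(base), span, vec.data()) != 0)
              return false;                     // be pessimistic on error
          for (auto v : vec)
              if (!(v & 1)) return false;       // this page would fault
          return true;
      }

      // Roughly, in the scheduler:
      //   if (likely_resident(p, n)) { /* read inline: fast path */ }
      //   else                       { /* migrate coro to slow pool, yield */ }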

    This is a solution to a specific class of problems. The similarity with Erlang's process scheduling is that a process is a lot like a coroutine: it encapsulates logic and data, it's scheduled by a user-space scheduler, there can be thousands of them in a runnable state, and each has its own stack. But that's really the definition of stackful coroutines.

amelius 11 years ago

Coroutines are essentially a form of cooperative multitasking. I'm not sure we should be using that in this day and age, especially considering that requests are not purely I/O bound in all cases, and could depend on actual computation.

  • markpapadakis 11 years ago

    The fewer the trips to the kernel and the more work you can do in user space, the fewer the syscalls and context switches, and the better the throughput and performance.

    Even if a request is completely CPU-bound, it's a good idea to be fair to the other requests accepted on the thread; that is, if one request takes 2 seconds of processing time while others would take a few microseconds, the others shouldn't all have to wait until that one request is processed.

    Instead, yielding back to the scheduler where appropriate will give them a chance to complete in time proportional to the effort required to process them, not to the time required to handle long-running requests that would otherwise stall the processing of other coros.

    • amelius 11 years ago

      Yielding will take CPU time, even when no yielding is necessary. And where to put these yields? In an inner-loop? Should all loops be rewritten to ensure a yield takes place at least every N instructions? Considering all this, I think a better approach would be to optimize the thread scheduler in the kernel, assuming it is as inefficient as you are suggesting.

      • markpapadakis 11 years ago

        In practice, you do indeed yield often enough (there is no perfect formula, I believe): in loops, or elsewhere you are spending too much time doing one thing. You don't need to do it perfectly; you just need to yield often enough to give other coros a fair chance at completion. Relying on the kernel is a bad idea; that's where you lose on performance. Also, doing what you are suggesting would mean using many, many OS threads, with the kernel managing them all in a preemptive fashion so you don't have to do it yourself; but that means paying the context-switching, memory-footprint, etc. costs. It won't scale and will be slow.
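
        As a rough sketch of what "yield often enough" can look like with C++20 coroutines (a toy round-robin scheduler; Task and Yield are made-up names, not the article's API):

          // Toy cooperative scheduler: a CPU-bound coro yields every 1024
          // iterations so a short request queued behind it is not starved.
          // Compile with -std=c++20.
          #include <coroutine>
          #include <cstdio>
          #include <deque>

          std::deque<std::coroutine_handle<>> run_queue;

          struct Yield {
              bool await_ready() const noexcept { return false; }
              void await_suspend(std::coroutine_handle<> h) const noexcept {
                  run_queue.push_back(h); // go to the back of the line
              }
              void await_resume() const noexcept {}
          };

          struct Task {
              struct promise_type {
                  Task get_return_object() { return {}; }
                  std::suspend_never initial_suspend() noexcept { return {}; }
                  std::suspend_never final_suspend() noexcept { return {}; }
                  void return_void() {}
                  void unhandled_exception() {}
              };
          };

          Task crunch(const char *name, long iters) {
              for (long i = 0; i < iters; ++i) {
                  // ... real per-iteration work would go here ...
                  if ((i & 1023) == 1023) // every 1024 iterations,
                      co_await Yield{};   // give everyone else a turn
              }
              std::printf("%s done\n", name);
          }

          int main() {
              crunch("slow", 1 << 20); // long-running request
              crunch("fast", 4096);    // short request accepted later
              while (!run_queue.empty()) {
                  auto h = run_queue.front();
                  run_queue.pop_front();
                  h.resume();
              }
              // prints "fast done" well before "slow done"
          }

        Note that the (i & 1023) check costs almost nothing per iteration, which speaks to the "yielding will take CPU time" concern above.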

        • amelius 11 years ago

          The memory footprint is there only to provide stack space. This makes me wonder: how do coros implement stacks? And don't they effectively suffer from the same problem?

          Another point: what if your code calls a library function for which you don't have control over when it yields?
