How Rust optimizes async/await

tmandry.gitlab.io

351 points by tmandry 6 years ago · 129 comments

Reader

weiming 6 years ago

As a newcomer to Rust, wishing that this post was one of the first ones I've read about this topic. It took scouring through many many posts, some of them here on HN, to be able to grasp some of the same idea. (I may not be alone, judging from the very long discussion the other day: https://news.ycombinator.com/item?id=20719095)

hathawsh 6 years ago

Development of high quality async support in Rust is happening right now, so remember to wear a hard hat. ;-) I like to watch https://areweasyncyet.rs/ and https://this-week-in-rust.org/ to see where things are.
steveklabnik 6 years ago

Part of this is just that the design has been in the works for four years and has changed significantly during that time; it’s only now that things are almost stable that it’s worthwhile to write these kinds of things.
GolDDranks 6 years ago

Well, it's got still three months until it lands stable, maybe it's just that the time hasn't been ripe for great, understandable posts about the feature until recently.

zackmorris 6 years ago

This is one of the most concise tutorials on how generators, coroutines and futures/promises are related (from first principles) that I've seen.

I'm hopeful that eventually promises and async/await fade into history as a fad that turned out to be too unwieldy. I think that lightweight processes with no shared memory, connected by streams (the Erlang/Elixer, Go and Actor model) are the way to go. The advantage to using async/await seems to be to avoid the dynamic stack allocation of coroutines, which can probably be optimized away anyway. So I don't see a strong enough advantage in moving from blocking to nonblocking code. Or to rephrase, I don't see the advantage in moving from deterministic to nondeterministic code. I know that all of the edge cases in a promise chain can be handled, but I have yet to see it done well in deployed code. Which makes me think that it's probably untenable for the mainstream.

So I'd vote to add generators to the Rust spec in order to make coroutines possible, before I'd add futures/promises and async/await. But maybe they are all equivalent, so if we have one, we can make all of them, not sure.

tmandryOP 6 years ago

It's the same underlying mechanism for generators as for futures: they are stackless coroutines. All the space they need for local variables is allocated ahead of time.
In my experience, the fact that they are stackless is not at all obvious when you're coding with them. Rust makes working with them really simple and intuitive.
- truncate 6 years ago
  
  Debugging can be pain though, as you may not know the right stack, and makes it harder to follow how the code executed in that context. But yes, rather than writing an async state machine with callbacks, I would prefer this.
  - steveklabnik 6 years ago
    
    https://crates.io/crates/tracing is attempting to solve this issue!
jnwatson 6 years ago

Regarding determinism, async is way more deterministic than multiple threads, because you don’t have arbitrary point where execution contexts can change.
- zackmorris 6 years ago
  
  That's true in a way, but only for multithreaded code. Multi-process code with full isolation uses different metaphors like joining threads within higher order functions to achieve parallelism in code that looks single-threaded.
  For example, lisp-based languages like Clojure can be statically analyzed and then parallelized so that all noninteracting code runs in its own process. This can also be done for code that operates on vectors like MATLAB and TensorFlow.
  For me, isolated processes under the Actor model in languages like Elixer/Erlang and Go is much simpler conceptually than async/await, which is only one step above promises/futures, which is only one step above callback hell. I know that the web world uses async/await for now, but someday I think that will be replaced with something that works more like Go.
adwn 6 years ago

> I think that lightweight processes with no shared memory, connected by streams [...] are the way to go.
No, at least not in general. There are a lot of problems in the real world for which "no shared memory" is incompatible with "efficient parallelism".
- zackmorris 6 years ago
  
  That's actually not true - the efficiencies due to copying can be overcome with the runtime or abstractions.
  For example, the copy on write (COW) mechanism of unix where memory pages are mapped to the same location until a forked process writes to one, in which case the virtual memory manager makes a mutable copy.
  There's also Redux and Closure's immutable state tree that makes copies under a similar mechanism to COW but through code, since they run at a level of abstraction above C++ or Rust.
  My feeling is that these techniques run within a few percent of the speed of hand-optimization. But in the real world, I've seen very little human code remain optimized over the long term. Someone invariably comes along who doesn't understand the principles behind the code and inadvertently does a manual copy somewhere or breaks the O() speed of the algorithm by using the wrong abstractions. Meanwhile immutable code using mechanisms like COW avoids these pitfalls because the code is small and obvious.
  I feel that the things that Rust is trying to do were solved long ago under FP, so I don't think it's the language for me. That's also why I moved away from C#, Java, C++, etc. Better languages might be Elixer or Clojure/ClojureScript, although they still have ugly syntax from a mainstream perspective compared to say Javascript or Python. I love that Rust exists as a formal spec of a mature imperative programming language. I think it's still useful in kernels and the embedded space. But I'm concerned that it's borrowing ideas like async/await that trade determinism for performance.
  - ryani 6 years ago
    
    > I feel that the things that Rust is trying to do were solved long ago under FP
    I don't think this is true at all. Statically analyzed Sync/Send are an amazing tool that I don't see in any other language. In fact, your point of view w.r.t. the actor model is extremely well-supported in Rust, due to Sync/Send traits. Immutable objects can be shared between threads using simple pointers, and mutable objects can have ownership moved between threads with just a pointer copy.
    Those threads may have 'shared memory' in the strict sense of the term. But compared to other languages, Rust makes working with shared memory 'feel' like working with independent processes, except with extremely efficient message passing. The language statically guards against threads interacting directly as long as you don't use primitives like Mutex<T>.
  - adwn 6 years ago
    
    > For example, the copy on write (COW) mechanism of unix where memory pages are mapped to the same location until a forked process writes to one
    That is shared memory – memory which is shared between two or more threads or processes.
    > There's also Redux and Closure's immutable state tree that makes copies under a similar mechanism to COW but through code
    That's also shared memory.
    "Shared memory" does not necessarily mean "shared mutable memory". Rust, for example, statically prevents memory from being simultaneously shared and mutable (except through synchronization primitives like mutexes etc.). A pure actor-based language, in contrast, would prevent even immutable memory from being shared.
truncate 6 years ago

>> dynamic stack allocation of coroutines, which can probably be optimized away anyway
This seems interesting. Do you have any pointers to places/papers I can look more into this? I'm also curious, since the stacks have to be rather small when you are running several thousands of coroutines (like Go), how often people get into issues of running out of stack because of some big stack allocation somewhere and stuff like that.
- zackmorris 6 years ago
  
  I haven't studied it deeply, but a breadcrumb would be that cooperative threads (green threads) are equivalent to coroutines.
  Ok it looks like current techniques are stackless runtimes and compiling coroutines to stackless continuations:
  https://en.wikipedia.org/wiki/Stackless_Python
  https://stackless.readthedocs.io/en/v3.6.4-slp/library/stack...
  http://jessenoller.com/blog/2009/02/23/stackless-you-got-you...
  https://engagedscholarship.csuohio.edu/cgi/viewcontent.cgi?r...
  https://pdfs.semanticscholar.org/b9aa/49e4b7a00e6c9f0d8c18ba...
  https://www.osnews.com/story/9822/protothreads-extremely-lig...
  This looks like a rare gem, although I just started reading it:
  https://cs.indiana.edu/~dfried/dfried/mex.pdf
  I grew up with the cooperative multithreading of classic Mac OS and was really shocked when I first saw Javascript back in the 90s and it had no notion of it (because it didn't have generators). That sent us down the callback hell evolutionary dead end, through promises/futures and finally to async/await where we are now. That could have been largely avoided if we had listened to programming language experts!
  - truncate 6 years ago
    
    Oh I think I misundetstood you (or you me), and I didn't really articulate my question well. I'm well aware of how stackless coroutines are implemented.
    My main concern was whenever you do that, you lose the ability to look at backtrace when something goes wrong. So some implementations keep a separate stack for each coroutine (like Go), and switch SP whenever context is switched. That way you don't have bunch of random function calls in your stack trace. The problem though is that these individual stack for each couroutine has to be pretty small in general (since you would spawn hundreds of thousands of these). Go solved this temporarily using fragmented stack, and later ditched that in favor of copying the whole stack over. And I thought, you were talking about optimizations around this whole "stackfull" coroutines thing so that I can have my backtrace.

pcwalton 6 years ago

As a reminder, you don't need to use async/await to implement socket servers in Rust. You can use threads, and they scale quite well. M:N scheduling was found to be slower than 1:1 scheduling on Linux, which is why Rust doesn't use that solution.

Async/await is a great feature for those who require performance beyond what threads (backed by either 1:1 or M:N implementations) can provide. One of the major reasons behind the superior performance of async/await futures relative to threads/goroutines/etc. is that async/await compiles to a state machine in the manner described in this post, so a stack is not needed.

woah 6 years ago

I find async/await easier to reason about than threads for anything more involved than the 1 request per thread web server use case. This is because you avoid bringing in the abstraction of threads (or green threads) and their communication with one another. You trade syntactical complexity (what color is your function, etc), for semantic complexity (threads, channels, thread safety, lock races).
- Matthias247 6 years ago
  
  I would agree to the statement for languages where it's an either/or. E.g. in Javascript there are only callbacks and async/await, so we can avoid all the complexity of threads - which is great!
  However in multithreaded languages it's always an AND. Once you add async/await people need to know how traditional threading as well as how async/await works. Rust will also make that very explicit. Even if you use async/await on a single thread the compiler will prevent accessing data from more than one task - even it those are running on the same thread. So you need synchronization again. With maybe the upside that this could be non-thread-safe (or !Sync in Rust terms) in order to synchronize between tasks that run on the same thread. But however also with the downside that utilizing the wrong synchronization primitives (e.g. thread-blocking ones) in async code can stall also other tasks.
  Overall I think using async/await in Rust is strictly more complex. But it has its advantages in terms of performance, and being able to cancel arbitrary async operations.
- kccqzy 6 years ago
  
  They have the same semantic complexity. You have tasks in async/await and you still need to deal with inter-task communication, locking, etc.
  - lalaland1125 6 years ago
    
    The main difference is that in async/await you control where context switches occur and the syntax (.await) explicitly points them out. This means you can often avoid locks and do things in a more straightforward manner.
    
    GolDDranks 6 years ago
    
    Note that this applies only to tasks that are !Sync. If they aren't, two tasks accessing the same state could be moved to different threads and access that state racily. In that case .await tells you nothing about accessing non-local state. However, for purposefully single-threaded task pools avoiding locks this way certainly seems possible.
- ori_b 6 years ago
  
  Async/await are effectively threads, the switches are just scheduled statically.
jeffdavis 6 years ago

Async/await is also good for integration with other systems where starting new threads is not practical or you are calling non-threadsafe FFI functions. Tokio offers the CurrentThread runtime, which allows concurrency without creating any new threads.
dangxiaopin 6 years ago

To implement, no. To make a performant server- yes. Most systems have limited number of threads (low limit constant compiled with the kernel), and each thread is triggered by the scheduler, not by network events, which is very uneconomical.
You with application level concurrency you get 100x performance boost for network servers.
- pcwalton 6 years ago
  
  On extreme workloads, perhaps. But we have people happily running Rust code with thousands of threads per second in production.
  M:N was experimentally found to be slower than 1:1 in Rust.
jnordwick 6 years ago

epoll io loop performs better for most network io though and is simplier to manage when you have to start dealing with out of band issues (like efficient hearbeats - every time I've had a conversation without how to move some of the heartbeat code over to async it comes down to just accepting it isn't going to be as efficient as my c++ implementation and either strain heavily of accept over publication).
Last time I saw there was still a couple extra allocations going on too in the compiler (I was told they were being worked on) and basically the default executor, tokio, wasn't very efficient at changing events in the queue (requiring a an extra self signal to get the job done).
I'd be interesting to see how little cost these are, because there is defintely a cost to the generator implementation. Yes, if I wrong a generator to do this, I couldn't write it better, but I wouldn't write a generator (and that would be a very odd definition of zero-cost there anything can be called zero cost even GC as long as it is implemented well - well, that depends on if rustc saves unnecessary state).
> Additionally, it should allow miri to validate the unsafe code inside our generators, checking for uses of unsafe that result in undefined behavior.
This is really useful as a lot of the buffer handing code needs to use unsafe for efficienty issues. And the enums sharing values is nice too - hopefully the extra pointer derferences can be optimized out.
I do worry though about all this state sitting on the head and ruining cache locality on function calls though.
- pcwalton 6 years ago
  
  If you want to use epoll directly in Rust, go right ahead. Nothing is stopping you. You don't have to switch to C++ to use epoll.
  It's pretty clear that most people don't want to write and maintain that kind of code, though—that's what async/await is for. Personally, I hate writing state machines.
  - jnordwick 6 years ago
    
    They're not very well done. Rust kind of skipped the whole non-blocking thing and went straight to tokio, then had to backtrack a little to mio+tokio, then when mio is only half way jupmed to async/await. (mio was originall tokio rolled into it - it was the whole thing and you had to use mios sockets and mios timers on the mio event loop - tokio was then split out - metal io, it was not - this was because there wasnt a non-blocking interface in rust yet).
    Async is deinitely a distraction. Rust should have stabilized other pieces more before jumping to yet another paradigm. It is gaining this collection of half-implemented things. I'm guessing generators are going to the the next move before async/await is even in the print book.
    Rust has moved from zero-cost abstraction to "an fairly efficient implemention of everything you can think of".
    
    pcwalton 6 years ago
    
    Rust didn't implement async/await for fun. The team implemented it because it was perhaps the #1 requested feature from users, including me.
    You can see it here on HN. What's the #1 comment that always appears on Rust async threads? Invariably it's "why not just use goroutines? That kind of code is so much easier to write!" Now for various reasons M:N threading like in Go doesn't work in Rust. What does work, and offers most of the ergonomics of threads/goroutines, is async/await. That's why the team has prioritized it.
    There are people who care less about ergonomics than optimizing every CPU cycle for socket servers. For those people, the focus on getting the ergonomics right before microoptimization of mio and tokio might be frustrating to see. But the overwhelming majority of users just want easy-to-use async I/O. Async/await is an enormous improvement in readability and maintainability over hand-written state machines. Rust would be doing a disservice to its users if it focused on microoptimization of tokio instead.
    
    saghm 6 years ago
    
    > Rust kind of skipped the whole non-blocking thing and went straight to tokio, then had to backtrack a little to mio+tokio, then when mio is only half way jupmed to async/await.
    Not sure exactly what you mean here; mio predates tokio by a couple years, and has been possible to use standalone since before tokio ever became a thing.
    
    devit 6 years ago
    
    Early versions of mio had a bunch of non-essential things like a timer wheel built-in, that were later removed.
    
    rapsey 6 years ago
    
    What does tokio have to do with that?
    
    SomeOldThrow 6 years ago
    
    This strikes me as needlessly hostile without contributing anything.
- rapsey 6 years ago
  
  Honestly I don’t get it why simply using mio (epoll, kqueue, iocp wrapper) is so unpopular.
  - oconnor663 6 years ago
    
    See these examples of how async/await simplifies looping and error handling: https://docs.rs/dtolnay/0.0.3/dtolnay/macro._01__await_a_min.... Await syntax makes it much easier to compose asynchronous code from different libraries, in the same way that ordinary function calls compose synchronous code. Talking to epoll/mio directly is certainly possible, but it's hard to make two different libraries work well together if they both need to talk to a single epoll handle.
    
    rapsey 6 years ago
    
    Yeah but you’re talking about things not even stable. Old tokio with closures is terrible imo.
    
    steveklabnik 6 years ago
    
    It’s extremely close to being stable; it just missed a release train, because there’s some last minute polish that needs to be done. It will be on its way soon!
    
    jnordwick 6 years ago
    
    everybody always points to the 5 line basics, never to how complex async/await gets in a real system with timers, tuneouts, multiple errors types that need to be handled in different ways, heartbeats, side channel info, etc... It isn't better at that point.
    Cool, you can write a 5 line echo server easier, but good luck with a high performance server handling important transactions.
    (If there is documentation on how to do some of these or good example code, please point me to it, because my last attempt at getting send/rcv heartbeats efficiently didn't work.)
    
    jcranmer 6 years ago
    
    The networking implementations I've had to do were all made a lot easier moving to async/await from hand-written state machines. It's a reduction in LOC terms of 6x or so, and the logic is much, much easier to follow.
    State machines are viable if your transition function is such that effectively every combination of (state, input) leads to a different state. If the transition function is mostly describing "go to the (single) next state or error," then you're essentially asking the user to do a lot of bookkeeping that compilers are good at and users are bad at.
    
    littlestymaar 6 years ago
    
    Having ported a full peer-to-peer engine in JavaScript from old-school callbacks to async/await, I strongly disagree with your comment.
    Async/await really shines when the complexity arises.
    
    pcwalton 6 years ago
    
    People are doing a lot more with async/await than five-line echo servers. They're converting entire codebases and seeing dramatic reductions in noise.
    
    leshow 6 years ago
    
    Probably because you're just looking at examples; that's selection bias. The combinators/manual implementing Future approach to async IO gets even worse at scale, async/await is infinitely easier. If you don't like it, nobody is forcing you to use it.
  - pcwalton 6 years ago
    
    Because most people don't like writing state machines by hand.
    
    rapsey 6 years ago
    
    Pre-await Tokio code looks way worse to me.
    
    jcranmer 6 years ago
    
    A lot of that, I suspect, is that Rust's borrow checker makes specifying ownership correctly for callback chains extremely painful and unergonomic.
    In my experience, closure-based async programming is only sufficiently painless in managed languages to be worth using. When you're dealing with manual memory management (or, rather, smart pointer memory management), you spend a lot more time trying to make the captures hold onto stuff.
    
    tmandryOP 6 years ago
    
    I'd agree with this, and emphasize the point that this stuff is really tricky to get right without GC. Fighting the borrow checker is somewhat expected when you're dealing with this level of inherent complexity in your memory management.
    One of the key reasons for shipping async/await is that it erases almost all of this difficulty and lets you write straight-line code again.
- carllerche 6 years ago
  
  > tokio, wasn't very efficient at changing events in the queue (requiring a an extra self signal to get the job done).
  Could you expand on that?
omeid2 6 years ago

Curious what is the fundamental difference that makes Go do M:N thread efficiently? Considering that the compiler has far less information than Rust about the program.
- pcwalton 6 years ago
  
  Go is more efficient at M:N then Rust can be mostly for two reasons:
  1. Go can start stacks small and relocate them, because the runtime needed to implement garbage collection allows relocation of pointers into the stack by rewriting them. Rust has no such runtime infrastructure: it cannot tell what stack values correspond to pointers and which correspond to other values. Additionally, Rust allows for pointers from the heap into the stack in certain circumstances, which Go does not (I don't think, anyway). So what Rust must do is to reserve a relatively large amount of address space for each thread's stack, because those stacks cannot move in memory. (Note that in the case of Rust the kernel does not have to, and typically does not, actually allocate that much physical memory for each thread until actually needed; the reservation is virtual address space only.) In contrast, Go can start thread stacks small, which makes them faster to allocate, and copy them around as they grow bigger. Note that async/await in Rust has the potential to be more efficient than even Go's stack growth, as the runtime can allocate all the needed space up front in some cases and avoid the copies; this is the consequence of async/await compiling to a static state machine instead of the dynamic control flow that threads have.
  2. Rust cares more about fast interoperability with C code that may not be compiled with knowledge of async I/O and stack growth. Go chooses fast M:N threading over this, sacrificing fast FFI in the process as every cgo call needs to switch to a big stack. This is just a tradeoff. Given that 1:1 threading is quite fast and scalable on Linux, it's the right tradeoff for Rust's domain, as losing M:N threading isn't losing that much anyway.
- newacctjhro 6 years ago
  
  Go needs to allocate a growing stack on the heap, needs to move it around, etc. It's not as efficient as Rust's async.
  - omeid2 6 years ago
    
    But is it as efficient or more than the linux threads?
    
    littlestymaar 6 years ago
    
    I don't have the full answer (and I would love if someone more knowledgeable could jump in this thread) but I'd say it depends since there are a few antagonistic effects :
    - goroutines are (unless it changed since last time I used it) cooperatively scheduled. It's cheaper than preemptive scheduling, but it can lead to big inefficiencies on some workload of you're not careful enough (tight loops can hold a (Linux) thread for a long time and prevent any other goroutines from running on this thread).
    - goroutines start with a really small (a few kB) stack which needs to be copied to be grown. If you end up with a stack as big as a native stack, you'd have done a lot of copies in the process, that wouldn't have been necessary if the stack was allocated upfront.
- staticassertion 6 years ago
  
  I don't think the implication was that Go does it efficiently.
adaszko 6 years ago

> One of the major reasons behind the superior performance of async/await futures relative to threads/goroutines/etc. is that async/await compiles to a state machine in the manner described in this post, so a stack is not needed.
That's a great optimization but doesn't that mean it also breaks stack traces?

saurik 6 years ago

Does anyone know how Rust's implementation compares to C++2a's? The C++ people seem to have spent a lot of time creating an extremely generic framework for async/await wherein it is easy to change out how the scheduler works (I currently have a trivial work stack, but am going to be moving to something more akin to a deadline scheduler in the near future for a large codebase I am working with, which needs to be able to associate extra prioritization data into the task object, something that is reasonably simple to do with await_transform). I am also under the impression that existing implementation in LLVM already does some of these optimizations that Rust says they will get around to doing (as the C++ people also care a lot about zero-cost).

tmandryOP 6 years ago

Disclaimer: I'm not an expert on the proposal, but have looked at it some, and can offer my impressions here. (Sorry, this got a bit long!)
The C++ proposal definitely attacks the problem from a different angle than Rust. One somewhat surface-level difference is that it implements co_yield in terms of co_await, which is the opposite of Rust implementing await in terms of yield.
Another difference is that in Rust, all heap allocations of your generators/futures are explicit. In C++, technically every initialization of a sub-coroutine starts defaults to being a new heap allocation. I don't want to spread FUD: my understanding is that the vast majority of these are optimized out by the compiler. But one downside of this approach is that you could change your code and accidentally disable one of these optimizations.
In Rust, all the "state inlining" is explicitly done as part of the language. This means that in cases where you can't inline state, you must introduce an explicit indirection. (Imagine, say, a recursive generator - it's impossible to inline inside of itself! When you recurse, you must allocate the new generator on the heap, inside a Box.)
To be clear, the optimizations I'm talking about in the blog post are all implemented today. I'll be covering what they do and don't do, as well as future work needed, in future blog posts.
One benefit of C++ that you allude to is that there are a lot of extension points. I admit to not fully understanding what each one of them is for, but my feeling is that some of it comes from approaching the problem differently. Some of it absolutely represents missing features in Rust's initial implementation. But as I say in the post, we can and will add more features on a rolling basis.
The way I would approach the specific problem you mention is with a custom executor. When you write the executor, you control how new tasks are scheduled, and can add an API that allows specifying a task priority. You can also allow modifying this priority within the task: when you poll a task, set a thread-local variable to point to that task. Then inside the task, you can gain a reference to yourself and modify your priority.
- je42 6 years ago
  
  > In C++, technically every initialization of a sub-coroutine starts defaults to being a new heap allocation. I don't want to spread FUD: my understanding is that the vast majority of these are optimized out by the compiler. But one downside of this approach is that you could change your code and accidentally disable one of these optimizations.
  I don't think this is correct. C++ 20 allows a lot of choices to implement it without forcing a heap allocation.
  see https://lewissbaker.github.io/2017/11/17/understanding-opera... also see this video that goes in depth how to have billions of coroutines with C++: https://www.youtube.com/watch?v=j9tlJAqMV7U
- saurik 6 years ago
  
  Thanks for the information!!
  On your last paragraph, the thing I'm concerned by is where this extra priority information is stored and propogated, as the term "task" is interesting: isn't every single separate thing being awaited its own task? There isn't (in my mental model) a concept that maps into something like a "pseudo-thread" (but maybe Rust does something like this, requiring a very structured form of concurrency?), which would let me set a "pseudo-thread" property, right?
  As an example: if I am already in an asynchronous coroutine and I spawn of two asynchronous web requests as sub-tasks, the results of which will be processed potentially in parallel on various work queues, and then join those two tasks into a high-level join task that I wait on (so I want both of these things to be done before I continue), I'd want the background processing done on the results to be handled at the priority of this parent spawning task; do I have to manually propagate this information?
  In C++2a, I would model this by having a promise type that is used for my prioritize-able tasks and, to interface with existing APIs (such as the web request API) that are doing I/O scheduling; I'd use await_transform to adapt their promise type into one of mine that lets me maintain my deadline across the I/O operation and then recover it in both of the subtasks that come back into my multi-threaded work queue. Everything I've seen about Rust seems to assume that there is a single task/promise type that comes from the standard library, meaning that it isn't clear to me how I could possibly do this kind of advanced scheduling work.
  (Essentially, whether or not it was named for this reason--and I'm kind of assuming it wasn't, which is sad, because not enough people understand monads and I feel like it hurts a lot of mainstream programming languages... I might even say particularly Rust, which could use more monadic concepts in its error handling--await_transform is acting as a limited form of monad transformer, allowing me to take different concepts of scheduled execution and merge them together in a way that is almost entirely transparent to the code spawning sub-tasks. The co_await syntax is then acting as a somewhat-lame-but-workable-I-guess substitute for do notation from Haskell. In a perfect world, of course, this would be almost as transparent as exceptions are, which are themselves another interesting form of monad.)
  - tmandryOP 6 years ago
    
    The concept of a pseudo-thread you're referring to is a task. A task contains a whole tree of futures awaiting other futures. So no manual propagation is necessary.
    Of course, it's possible for tasks to spawn other tasks that execute independently. (To be clear, if you are awaiting something from within your task, it is not a separate task.) For spawning new tasks, there's a standard API[1], which doesn't include any executor-specific stuff like priority. You'll have to decide what you want the default behavior to be when someone calls this; for example, a newly spawned task can inherit the priority of its parent.
    To get more sophisticated, you could even have a "spawn policy" field for every task that your first-party code knows how to set. Any new task spawned from within that task inherits priority according to that task's policy. The executor implementation decides what tasks look like and how to spawn new ones, so you can go crazy. (Not that you necessarily should, that is.)
    To summarize the Rust approach, I'd say you have 3 main extension points:
    1. The executor, which controls the spawning, prioritization, and execution of tasks
    2. Custom combinators (like join_all[2]), which allow you to customize the implementation of poll[3] and, say, customize how sub-futures are prioritized (This is at the same level as await, so per-Future, not per-Task.)
    3. Leaf futures (like the ones that read or write to a socket). These are responsible for working with the executor to schedule their future wake-ups (with, say, epoll or some other mechanism). For more on this, see [4].
    [1]: https://doc.rust-lang.org/1.28.0/std/task/trait.Executor.htm...
    [2]: https://rust-lang-nursery.github.io/futures-api-docs/0.3.0-a...
    [3]: https://doc.rust-lang.org/1.28.0/std/future/trait.Future.htm...
    [4]: https://boats.gitlab.io/blog/post/wakers-i/
    
    saurik 6 years ago
    
    Thank you so much for the context here!! And yeah: a big terminology clash is that many of the libraries for C++ come from a common lineage (almost entirely from Lewis Baker, who has been involved in the STL abstractions, wrote cppcoro, and then got allocated by Facebook to work on Folly) and use the term "task" to essentially mean "a future that you can efficiently co_await". What I'm seeing so far seems reasonably sane and arguably similar to the abstraction I have been building up using lower-level components in C++; which all makes me very happy, as I'm anticipating eventually wanting to rewrite what I have done so far in Rust.
  - anp 6 years ago
    
    In a Rust implementation of your example, you might not necessarily spawn the two sub-operations as separate tasks. Awaiting them directly in a parent async function (probably using FuturesUnordered, like Promise.all in JS) will cause all of their work to be scheduled, prioritized, cancelled, etc. together because they’ll be a part of the same task. There’s a 1-many relation from tasks to Futures in Rust.
    
    saurik 6 years ago
    
    FWIW, what I meant by "join those two tasks into a high-level join task" was "call something akin to Promise.all to compose the two futures into a single one upon which I could await". It sounds like I need to learn more about the concept Rust has for "tasks", as maybe they are providing this "pseudo-thread" abstraction I was discussing in passing. I am seeing terms like "leaf Futures" and "top-level Futures" in some of the documentation I am quickly turning up.
  - comex 6 years ago
    
    I don’t fully understand your use case, but in case this helps:
    - Future is a trait, not a type, so you can write your own future types.
    - Future’s poll method takes a context argument, and async/await-based futures pass that argument unchanged to any sub-futures that they await. The context argument is a pointer to a standard library type (std::task::Context), which itself contains a pointer to a “waker” object that’s meant to be supplied by the executor, with some predefined methods. There’s some room for customization here, but it’s not as flexible as it probably should be – e.g. for now, as far as I can tell, you can’t get the pointer back out of the waker, only call the standard methods on it. Thread-local storage is an option, of course.
  - steveklabnik 6 years ago
    
    A “task” is the thing you submit to an executor, that is, the sum of all async/awaits in a single computation. Each await is not its own task.
- skybrian 6 years ago
  
  It's interesting that support for recursion is no longer the default here. A partial reversal of what happened going from Fortran to Algol?
Rusky 6 years ago

Aside from the high-level similarity of the "function -> state machine" transformation, Rust's is quite a bit different (and IMO both simpler and more flexible).
A C++ coroutine chooses a particular promise type as part of its definition. Its frame defaults to a separate heap-allocation per coroutine, with some allowance for elision. At a suspension point, it passes a type-erased handle to the current coroutine to an `await_suspend` method, which can either `return false` or call `handle.resume()` to resume the coroutine. A stack of `co_await`ing coroutines (or "psuedo-thread" as you call it) is thus a linked list of `coroutine_handle`s stored in the coroutine frames of their await-ees, rooted with whoever is responsible for next resuming the coroutine.
A Rust async function does things inside out, in a sense. It has no promise type; calling one directly returns its frame into the caller's frame, as a value with an anonymous type that implements the `Future` trait. This trait has a single method called `poll`, which resumes the function and runs it until its next suspension point. `poll` takes a single argument, a handle which is used to signal when it is ready to continue. This handle is threaded down through a stack of `poll`s (a "task" or pseudo-thread), and stored with whoever is responsible for notifying the task it should continue.
One implication of the Rust approach is that the "executor" and the "reactor" are decoupled. An executor maintains a collection of running tasks and schedules them. A reactor holds onto those handles and notifies executors of relevant events. This lets you control scheduling without language hooks like await_transform- you can associate your prioritization data with a task when you spawn it on a particular executor, and it can await any reactor without losing that information.
Another implication is that you have a choice of whether to a) `await` a future, making it part of the current task, or b) spawn it as its own task, to be scheduled on its own, much like OS thread APIs. Option (a) can get really interesting with multiple concurrent sub-futures (with things like Promise.all or select); it can be as simple as having the caller poll all its children every time it wakes up, or as complex as wrapping `poll`'s handle argument and implementing your own scheduling within a task.
Matthias247 6 years ago

My understanding is that C++ opted more for a coroutine-first design, where a very generic coroutine abstraction is in the center of the design, and other things (like generators and async functions) are built around it. That makes it very universal - but probably also harder to understand if one only has a specific use-case.
Rusts design compared to that was not only focused on async functions as the main design goal, but also on maintaining compatibility with a "struct Future" type which also can be implemented by hand and represents a state-machine.
The latter will allow Rust to reuse lots of async infrastructure that had been built in the (combinators & state-machine) Futures world in the last years (e.g. the tokio library and everything on top of it).
One downside of Rusts approach might be that some parts feel a bit awkward and hard, e.g. the 2 different states of Futures (one where it hasn't been executed and can be moved around and one where it has been started executing and can't be moved anymore) and the pinning system. As far as I understand C++ exposes less of those details to end-users - this might be something where the implicit allocations might have helped it.
As far as I understand the async function flavor of C++ coroutines also have run-to-completion semantics and can't be cancelled at any yield point like Rusts Futures can be. This has the advantage of being able to wrap IO completion based operations in a more natural fashion than Rust. But it then again has the downside that users need to pass CancellationTokens around for cancellation, and that some code might not be cancellable.

continuational 6 years ago

I don't quite follow. What exactly is the overhead that other languages have for futures that is eliminated here?

weiming 6 years ago

Going down the same rabbit hole earlier this week, found this to be a good explanation:
All of the data needed by a task is contained within its future. That means we can neatly sidestep problems of dynamic stack growth and stack swapping, giving us truly lightweight tasks without any runtime system implications. ... Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state—working just like hand-rolled code. [1]
[1] https://aturon.github.io/blog/2016/09/07/futures-design/
- davidw 6 years ago
  
  > Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state
  How are those tasks implemented, and what's scheduling them?
  - weiming 6 years ago
    
    You need a third-party library to provide an executor. Rust does not come with one to keep the runtime size small. The community seems to have centralized on https://tokio.rs/ (under the hood it uses epoll/whatever OS-specific functionality to schedule M:N)
    See: https://news.ycombinator.com/item?id=20722297
    
    davidw 6 years ago
    
    How's that play out in terms of, say, performing some long-running calculations, rather than something that performs IO?
    
    weiming 6 years ago
    
    (Of course, this is Tokio-specific)
    Cooperative scheduling is used to schedule tasks on executors. A single executor is expected to manage many tasks across a small set of threads. There will be a far greater number of tasks than threads. There also is no pre-emption. This means that when a task is scheduled to execute, it blocks the current thread until the poll function returns.
    Because of this, it is important for implementations of poll to only execute for very short periods of time. For I/O bound applications, this usually happens automatically. However, if a task must run a longer computation, it should defer work to a blocking pool [1] or break up the computation into smaller chunks and yield back to the executor after each chunk. [2]
    "poll" here refers to the callback function in the future that actually does the work.
    [1] https://docs.rs/tokio-threadpool/0.1.15/tokio_threadpool/fn....
    [2] https://tokio.rs/docs/internals/runtime-model/
    
    oconnor663 6 years ago
    
    When you want to do long-running blocking work (either computation, or synchronous IO from some library that doesn't use futures), you usually want to farm that work out to a thread pool. I think Tokio provides convenience functions for doing that.
    It's also possible to write a long-running computation as a future, which can yield after some smaller amount of work to let some IO get done. But I'm not sure if that's the recommended approach.
    
    ori_b 6 years ago
    
    You're essentially implementing statically scheduled cooperative userspace threads on top of a single thread with async/await.
    So, for computation, this is a bad solution.
- tmandryOP 6 years ago
  
  This is a great quote, and one that I missed while first writing the post! I've added it now.
tmandryOP 6 years ago

Most languages allocate every future (and sub-future, and sub-sub-future) separately on the heap. This leads to some overhead, allocating and deallocating space to store our task state.
In Rust, you can "inline" an entire chain of futures into a single heap allocation.
- pjmlp 6 years ago
  
  In .NET something similar is possible via ValueTask.
  - Matthias247 6 years ago
    
    ValueTask is more of an optimization for scenarios where the Future (or Task<T>) is often intended to be ready immediately (synchronously). For those cases its wasteful to first allocate a continuation on the heap, and then not to use it - because the code after the await can directly be executed.
    The introduction of ValueTask allowed C# code to only perform dynamic allocations whenever the task definitely needed to be yielded - opposed to one allocation for every Task<T>. However it doesn't allow for guaranteed avoidance of allocations - like Rusts model can do.
    However on the positive side removing the allocations for those cases is probably good enough since the other yields are of low-frequency (when waiting for IO). And Rust code currently requires allocations like Box<dyn Future> in order to make certain things work (like type-erasure and dynamically dispatched interfaces) that are not necessarily required in .NET in the non-yielded case.
    From an ergonomic perspective I definitely prefer .NETs easy-by-default model which is still highly optimizable up to the point of avoiding all allocations. But I understand why this wouldn't have worked for Rust and it's constraints (no GC, no classes, no runtime).
    
    pjmlp 6 years ago
    
    Thanks for the thoughtful reply.
    You are right, however there is a scenario that you have forgotten.
    The possibility that the JIT might eventually optimise the code path given enough PGO data, and apply RVO or similar approaches.
    Naturally I don't know if RyuJIT can already do this kind of optimization, given that only recently they have strengthened their focus on this area.
    However this kind of optimizations are already doable in Graal, so it is possible that Microsoft also adds them to RyuJIT.
    Which in any case would rule out some of the Rust deployment scenarios I guess.
- dom96 6 years ago
  
  Has anyone ever done a comparison to see how much overhead this actually adds? I'd be really curious to see this represented in concrete terms.
zwkrt 6 years ago

So for example, in Kotlin each piece of synchronous code within an async function is compiled into what is essentially a Java Runnable object, which must be allocated on the heap.
- Matthias247 6 years ago
  
  As far as I understand in Kotlin continuations are only allocated on the heap when necessary (the suspending function can not be executed synchronously). Therefore most allocations should be avoided until blocking for IO.
- continuational 6 years ago
  
  Ah. What kind of asynchronous task executes so fast that a heap allocation is measurable?
  - shepmaster 6 years ago
    
    I like to think of it from the other direction. I'm super excited to use futures and async/await in embedded hardware, where "the heap" might not even exist. Hardware interrupts can be treated as events that wake the executor. That lets me write some code like this:
    async fn bracketed_echo() -> ! { loop { let mut buf = 0; Serial.rx(&mut buf).await; Serial.tx(b'>').await; Serial.tx(buf).await; Serial.tx(b'<').await; } }
    This reads from the serial port and echos it back inside of `><`. The code reads very cleanly, does not require the heap, and should be very efficient.
    The caveat is that I am also trying to target a platform that Lust / LLVM doesn't support completely yet, so I haven't gotten this to actually work well. I bet the people using the Cortex processor are far ahead of me though.
    
    zwirbl 6 years ago
    
    That's exactly what I am looking forward to, the state machine generation can make a lot of code targeting embedded platforms, especially the IO part, much more comfortable. Might I ask which controller you are using? With Rust I'm still on a Cortex, but looking to apply it on other architectures in the future
  - gpderetta 6 years ago
    
    A network read or write can be in the order of hundreds of nanoseconds when using an accelerated userspace networking stack. Allocation can be a good chunk of that.

emmanueloga_ 6 years ago

Settings

How Rust optimizes async/await

Keyboard Shortcuts