C++20: Building a Thread-Pool with Coroutines

blog.eiler.eu

122 points by MichaEiler 5 years ago · 62 comments

tele_ski 5 years ago

Really nice write-up! I'm excited to see more coroutine tutorials and guides come out; I think this C++20 feature has huge potential to make C++ easier to use over the next decade. I will also say I was a bit surprised to see libcoro linked in the article! I'm glad you found it useful, but I need to give most of the credit to Lewis Baker's cppcoro as well -- I learned most of what I implemented in libcoro from his fantastic repository and then tuned the coroutine primitives to how I'd want to use the library for an HTTP web server. I just generally find there is no better way to truly learn a difficult concept than to roll your own version.

  • MichaEilerOP 5 years ago

    I completely agree!

    Why not benefit from Lewis's work on cppcoro? He has obviously thought through the most important issues one would otherwise also stumble across. Unfortunately cppcoro doesn't look like it is actively maintained, which is why I was looking for other implementations. I'm excited to see how your library will progress in the future!

secondcoming 5 years ago

I'll admit I find coroutines difficult to grok. It seems to me that 'callback hell' is turning into 'coroutine hell'. The only plausible use-case I can see is enabling functionality similar to that of Python's `yield`.

Does threadpool::thread_loop() not have to check if the popped coroutine is suspended before attempting to resume it?

Are they really more efficient than normal callbacks when doing async?

  • drenvuk 5 years ago

    You're not the only one. Coroutines are complicated as hell and have too much boilerplate, BUT once you handle it for a general enough case you get JavaScript-esque async/await syntax, which is very, very nice.

    Take for instance, this code which relies on libuv for its event loop and co_await to retain its state during its execution: https://gist.github.com/Qix-/09532acd0f6c9a57c09bd9ce31b3023...

    Let's say that you want to batch a bunch of database operations into one transaction. You could queue them up over the course of a few milliseconds, run the transaction, and then, for each context that relied on the different db operations, simply return to each one's previous suspension point instead of having to call a handler. Granted, the handler is now inside of the `await_transform` needed to work with `co_await`, but think of the possibilities. No weirdly separate callback function, no real need to make a class that encapsulates all of the operations for, let's say, a user's post request, and to top it all off, you can do this on a single thread. It's a tool for cleaner code but I'll be damned if it is really easy to understand.

    It's just so much stupid boiler plate and a strange way of putting it together.

    • asdfasgasdgasdg 5 years ago

      It still boggles the mind that they made this stuff so hard to use in terms of boilerplate. Hopefully that will all be abstracted into a library that handles I/O, networking, multiprocessing and synchronization, so most of us can just focus on writing the bits that do stuff. But I will never understand how the std maintainers could not manage to do this when every other language is adding async/await support to its standard library I/O utilities.

      • sebastos 5 years ago

        I think the reason it's like this is that the C++ maintainers are super focused on noticing whenever there is a choice to be made, and being careful not to make it for you. Given the option between foisting situationally bad performance on you and foisting lots of syntax and interface complexity on you, they always choose the latter. Which I suppose makes sense: C++'s niche is that it's the Big Gun you pull out when there's serious work to be done. So they figure you're already going to be dealing with massive complexity anyway, and you're going to be prepared for it. May as well give you _all_ the access rather than fumbling at the 1-yard line by making some opinionated choice that is a non-starter for somebody, somewhere.

        • einpoklum 5 years ago

          > I think the reason it's like this is because the C++ maintainers are super focused on noticing whenever there is a choice to be made, and being careful not to make it for you

          If you think that's the case, explain why I can't decide I want to pass an argument to a function in a register:

          https://stackoverflow.com/q/58339165/1593077

          Also, modern C++ has actually made it possible to write much more elegant code, for many delicate tasks, without sacrificing the performance benefits. So a convenient default doesn't necessarily have to contradict non-opinionated nature.

      • DSingularity 5 years ago

        Most of that boilerplate will be hidden in the libraries which use it. Unless you are a library author, I don't know why you would care about the boilerplate-like elements from the standard; you will only use co_yield and co_await.

        • j16sdiz 5 years ago

          C++ compilers are famous for their cryptic error messages. Standard Libraries are the reason.

          • pjmlp 5 years ago

            Lack of concepts was the reason, it will only get better now.

            • tialaramex 5 years ago

              So far as I can see in 2026 the C++ programmers will be assuring us that (some feature from C++26) is going to fix all the awful clag in the C++23 programs which, in turn, offered (some feature from C++23, maybe it's a simpler exception mechanism) but alas instead created more clag, despite their promise to clear up the mess from C++20...

              I am an old man, and so I remember when left and right every C++ programmer was excited about how the Standard Template Library was going to make everything OK and those of us who were still jeering would be writing C++ soon. How did that go?

              • pjmlp 5 years ago

                I see it differently.

                Many will keep hating C++, while ignoring that Java, .NET (C#, F#, VB, C++/CLI), Python, JavaScript, PHP, Ruby, ... all suffer from similar complexity, spread across 30-40 years of language evolution and ecosystems.

                Others will cling to their outdated toolchains because the language owners played a Python 3 on them.

                While some will understand that the world isn't perfect and make do with what is out there.

                • tialaramex 5 years ago

                  ISO C++ strongly resists simplification. It can grow. I have no doubt that every brilliant idea I see in programming languages can be added to C++ but the problem is that it can't shrink and so what happens instead is that mostly "C++ programmers" speak a dialect of C++ and those dialects become mutually incomprehensible, and so then what was the benefit of C++ as a language rather than just the abstract agreement that you could, in principle, graft every conceivable feature on to something that looks kinda like C?

                  An example is right in the name. Today we know that "clever" operators like the pair of ++ increment operators in C++ are a bad idea. They too easily allow mistakes to hide in plain sight, the programmer writes ++n where they actually needed n++ or vice versa, and a reviewer's brain overlooks this and so it gets shipped.

                  If you're playing Code Golf then these operators are a big benefit, but we aren't playing Code Golf, we're writing actual software that will be used in the real world, so explicitly spelling out what you meant is good.

                  As a result, some modern languages deliberately do not have these operators. And, e.g., as I understand it, Swift actually removed these operators from the language. But C++20 still has both operators of course; it's just that your local dialect might forbid one or both of them.

                  • pjmlp 5 years ago

                    Modern languages start from scratch.

                    Python 3 will stay in history as a canonical example of what happens when those wishes turn into reality.

              • bluGill 5 years ago

                That is the way of everything in the world.

      • PaulDavisThe1st 5 years ago

        One reason could be that C++ tends to be used more in performance-critical contexts, and frequently the abstractions present in "every other language" can get in the way of performance.

        • cobalt 5 years ago

          also the memory management model in c++ is much more complex - most other languages just use garbage collection to deal with reference lifetimes

          • PaulDavisThe1st 5 years ago

            garbage collection is one of the abstractions that can interfere with performance. it's out of the question for real time contexts, for example.

            • pjmlp 5 years ago
              • PaulDavisThe1st 5 years ago

                The first link covers cases most simply described as "it needs to be fast, as fast as possible, but things won't go wrong if there's a random 8msec delay".

                Then there's "things will go wrong if there's a random 8msec delay".

                Then there's "people will die and/or property will be destroyed if there's a random 8msec delay".

                I was referring only to the last two. And yes, that means no malloc.

                • pjmlp 5 years ago

                  Depends how relevant 8msec is on a battleship weapon guidance system:

                  https://www.ptc.com/en/blogs/plm/ptc-perc-virtual-machine-te...

                  In any case, that is the reason why I started with "depends"; just like everyone wants to do big data that fits on a USB floppy, there are cases where CircuitPython would be more than enough, yet people insist on using Assembly.

                  In some cases only Ada/SPARK or MISRA will do, others not, yet all of them might fall under real time and embedded deployment.

                  • PaulDavisThe1st 5 years ago

                    No, it doesn't depend. That fits clearly into the first category, in which things do not go wrong when there's an 8msec delay.

                    By contrast, using a DAW with typical settings, an 8 msec delay is catastrophic within the scope of the task, even though nobody gets hurt and nothing gets destroyed.

                    • pjmlp 5 years ago

                      I guess we are having a language problem here, because when we have three options to refer to, it is obvious that it depends.

              • hedora 5 years ago

                The first one targets “sub-millisecond”, which is an eternity in current systems.

                According to the third link:

                > we observed a negative correlation between MISRA rule violations and observed faults

                So, MISRA isn’t really helping with real time at all.

                The second link claims pauses are capped at 10us, which (if true) is actually competing with careful use of malloc.

      • pjmlp 5 years ago

        On Windows that library is WinRT.

  • MichaEilerOP 5 years ago

    There are a lot more possibilities. The reason why they are so complicated is simply to give library developers all possible options. This feature is, afaik, targeted at them, so that boost, asio and others can implement powerful solutions without language restrictions.

    The only way to enqueue a coroutine is to call schedule() within a co_await statement/expression. In this process the coroutine is suspended. Therefore there should not be any coroutine in the queue which we cannot immediately resume.

    I'm afraid I don't have any numbers available to compare coroutines with other approaches. But nevertheless, in my opinion coroutines are beneficial because they keep their state (the coroutine frame, local variables) alive. If you use callbacks, you have to handle all these things yourself. Think about a generator for a sequence of numbers: you would have to store at least the counter variable manually. With a coroutine this happens automatically.

  • contravariant 5 years ago

    For what it's worth Python's yield pretty much is a coroutine, as it has a (very rarely used) feature that allows information from the caller to be passed back to the generator.

    The C++ implementation seems closer to Lisp's 'call with current continuation', though as far as I can tell all implementations achieve more or less the same thing (though thread safety might vary among the options).

    Actually, continuation-passing style (callbacks) is another way of doing the same thing, though it has the disadvantage of requiring large structural changes to the code. It wouldn't surprise me if callback hell can therefore also occur in all versions, though some might make it easier than others (Python's implementation in particular makes it somewhat less likely by encouraging information to flow one way).

ptr 5 years ago

Does anyone know when Coroutines are expected to show up in compilers without enabling experimental flags?

sannysanoff 5 years ago

I used a similar thing, baked on top of the cppcoro library (wonderful thing). My application is heavily threaded with hundreds of thousands of short-lived micro-tasks; it's an interpreter of highly parallel expressions, and values are large matrices containing expressions, so it's highly parallelizable.

I moved to C++ coroutines from a composable futures (CF) library that had a few thread pool implementations, if memory serves (and before CF everything was written with callback hell). CF out of the box had extra CPU overhead because its internal implementation was not efficient enough for my use: too many templates and too much copying when switching tasks. Also, spawned tasks had to reference shared pointers in user space (my app code), and frequent unneeded shared-pointer copying added overhead.

I later rewrote the CF implementation completely, so before coroutines my app used the CF API extensively, but with the internals reimplemented; the shared-pointer copying, however, was still far from perfect.

In addition to that I had some abstraction (like async/await/spawn/wait_all) on top of CF API, so transformation of application code was not painful. I had to rewrite synchronization primitives to use mutexes which came with cppcoro, and change my own internal scheduler to use some other new primitives.

I was afraid that storing local variables in coroutine frames (instead of stack frames) would affect performance, but for some reason it did not.

I also expected compilation time to increase, but for some reason it mostly did not. Probably template expansion takes all time, so coroutines code transformation fades in comparison.

Since then, I have stopped using C++ coroutines.

I dropped them for the following reasons:

1) Unable to debug. The debugger does not have access to local variables, or I cannot enable it (reference time point: around 9 months ago). Also, stack traces: they are missing, and of course there's no help from tools. You have a core file, go figure.

2) g++ support was missing in the early days when I employed coroutines (clang 9 was just released), but even the clang 10 compiler produced wrong code when using suspended lambda functions. I use lambdas a lot, and as suspended functions spoil the code base, lambdas inevitably became spoiled too. So it was occasional SIGSEGVs or wrong values. There was a workaround to move 100% of the lambda body into a separate function and then call it from the lambda, but that destroys all the lambda beauty.

I moved to the Chinese libgo (can be found on GitHub). I don't use the syscall interceptors it offers; I just use the cooperative scheduler it provides, along with its synchronization primitives. It's stackful cooperative multitasking which keeps all the yummy things. And yes, it seemingly performs slightly better in my case. And yes, I had to patch it slightly.

TLDR: dropped C++ stackless coroutines in favor of stackful coroutines (cooperative stack switching), what a relief!

  • MichaEilerOP 5 years ago

    Having only briefly looked through the code base of libgo, it looks like they use boost::context, which is the only good stackful coroutine implementation I've come across. Not being familiar with your project, I'm slightly confused by the statement about "hundreds of thousands of short-lived micro-tasks". This is usually a no-go with stackful coroutines, as you would waste too much memory as well as have a lot of overhead.

    Regarding your debugging issues: I'd be surprised if this doesn't improve over the next year or two. Clang afaik isn't even fully compatible with the final version of coroutines yet. Microsoft has done a lot of work on the compiler itself. I'd assume that Visual Studio will likely ship improvements once they release VS2022(?). Of course these are only guesses on my side.

    Summing it up, it sounds to me like you suffered from the curse of being an early adopter. It would be interesting to see whether you'd have fewer issues once tooling and compiler support have improved enough.

    • sannysanoff 5 years ago

      I have an internal scheduler which prevents spawning too many of them. In any case, it's only stack & context allocation (and freeing), which is not that CPU-expensive at my rate (it does not show up much in the profiler). Also, multiple concurrent processes in the worst case use a lot of virtual memory because of stack allocations, but in fact not much resident memory.

      > once compiler support has improved enough

      I give it a minimum of 5 years. It's already been a few years since it landed in clang. I don't believe it will be fixed soon in gdb/lldb. You need to introduce many non-generic things: at least new stack-chaining debug information for proper call stacks, which is (and will be!) threadpool-implementation specific, because otherwise it would have to be part of the standard, part of the compiler implementation, which is even worse. With local vars it's slightly easier, however.

    • sannysanoff 5 years ago

      It does not look like it's using boost::context. At least I never saw it at runtime. It's using its own asm routines to save/restore the context.

      • MichaEilerOP 5 years ago

        As far as I can tell the developer is using Fibers on Windows and boost::context on all other operating systems. You can see that he has a forward declaration in "libgo/context/fcontext.h" and then links against the respective boost assembly files in the CMakeLists.txt.

        • sannysanoff 5 years ago

          Ah, correct. He's using the asm files for fcontext from boost::context. It looks like those are copied into his tree from boost during the build phase, and those files were the only ones I found. Thanks for pointing that out. I may need to test it on ARM one day; now I have peace of mind regarding ARM.

  • einpoklum 5 years ago

    > chinese libgo

    I believe you mean this one? :

    libgo -- a coroutine library and a parallel Programming Library

    https://github.com/yyzybb537/libgo

    (no information about the main contributor unfortunately)

  • pjmlp 5 years ago

    Well, they have been central to WinRT since the early days, and it was Microsoft input that largely contributed to the design.

    So more an issue of tooling than anything else.

sys_64738 5 years ago

Go-style coroutines and native JSON support would pretty much consign Go to history, IMO.

  • jerf 5 years ago

    If things could be "consigned to history" by C++ adding more features, there'd hardly be any other languages left! It has all the features. That's its biggest problem.

    • logicchains 5 years ago

      No, it's missing something basic: struct introspection. Unlike almost every other modern language, it's not possible to write a generic "ToJson" or "ToString" that will work for any struct.

    • pjmlp 5 years ago

      Usually only C++ gets the blame, but that is only because most don't pay attention to other language's reference manuals and standard libraries.

      I love to pick on Python for such examples, because it is considered to be the new BASIC, yet when I pick up the standard language reference + standard library, the number of pages outgrows that of ISO C++.

      Then there is the list of breaking changes that have happened even across minor versions since Python 1.0.

  • hedora 5 years ago

    JSON for modern C++ and this article pretty much cover it.

    C++ tends to be the second best language for everything, and this is no exception. Go beats it at Go’s own niche: it has great compilation times and it forces you down a sane asynchronous programming path.

    C++ fails on both of those criteria. However, once you fall off the happy path in Go, you're probably completely screwed, whereas with C++, you're already using the second-best language for whatever your new problem is.

cletus 5 years ago

C++20 coroutines confuse me. Like it's not clear to me what problem they solve.

For the last few years I've been doing Hack (Facebook's PHP fork) professionally and async-await as cooperative multitasking is pervasive. IMHO it's a really nice model. Generally speaking, I've come around to believing that if it ever comes down to you spawning your own thread, you're going to have a Bad Time.

Go's channels are another variant of this.

The central idea in both cases is that expressing dependencies this way is often sufficient and way easier to write than true multithreaded code.

C++20 coroutines don't seem to solve this problem as best as I can tell.

    It actually seems like C++20 coroutines are closer to Python generators. Is this the case? Or is this a classic case of "a camel is a horse designed by committee", where the C++ standards committee tried to create primitives to handle these and possibly other use cases? I honestly don't know.

  • boardwaalk 5 years ago

    They solve exactly the problem you describe; I'm not sure what you're missing. What are you thinking you can't do? I have a project where I've put C++ coroutines on top of libuv and I can do essentially anything I could do in JS/C#/Rust async/await with task/co_await/etc with the imperative style you'd expect.

    You may have looked at them at too low a level. Check out something like cppcoro to see what you can do. I don't use it myself, but I've stolen a few things, like task<>, which is a pretty core thing that the stdlib does not provide.

    Goroutines are not cooperative multitasking, by the way; they're non-OS/"green" threads. Until you do something silly like run CPU-bound code that doesn't hit any yield points, and then you have to put them in yourself (at least the last time I used Go, it's been a while).

    • harikb 5 years ago

      While the starvation behavior of cooperative green threads isn't as good as that of native threads, the idea is that

      1. Properly written code will perform well, whether async/await or Go style.

      2. Making async easy makes one use it in more places. In addition, having the caller decide whether to run something sync or async also makes it way more useful. In the async/await model that can only work if all methods are declared async, which is very costly in complexity.

      • jimsimmons 5 years ago

        Good way to put what's wrong with Python async, and probably by extension C++ too. The sync-async interaction needs to be symmetrical in both directions for it to make sense. Right now in Python, async can interact with sync code quite comfortably, but the opposite direction is a literal black hole. I have had to read up on how sync can interact with async multiple times, but I'm still not sure what the idiomatic way is, and it always feels like it's on a knife edge even when I get something working.

        Obviously, the caveat is that I’m just stupid and don’t understand Python async well enough. But I have a feeling that this is common experience

  • ninepoints 5 years ago

    You're missing the big idea a bit. Coroutines in C++ can be used to implement generators or goroutines or async/await, etc. They are intended for library authors as a lower level construct. See for example: https://www.jeremyong.com/cpp/2021/01/04/cpp20-coroutines-a-...

    • cletus 5 years ago

      I literally said this at the end of my comment about coroutines being a low-level primitive.

      • dundarious 5 years ago

        You raised two hypotheticals and said you honestly don't know if either/both/neither are true.

        1. Is it like Python generators?

        2. Is it a bad example of design by committee and meant as a set of lower level primitives?

        GP responded that 2 is correct (but without the sass on design by committee) and that 2 allows for them to implement such things as 1.

        I think they gave their opinion as to what parts of your hypotheticals are true, and that is a valuable contribution, and very much not the same as just repeating parts of your comment ("I literally said this").

  • saurik 5 years ago

    I use them constantly in my work on Orchid, and, FWIW, one of your coworkers--Lewis Baker--was seemingly hired to help champion the standards and work on Folly Coro. I use the feature much the same way I would in, say, JavaScript (doing stuff like "await Fetch(...)" or "await Query(...)"); I honestly am sufficiently confused by your question that I am having a hard time saying much more... the C++ version of this primitive is great because it doesn't have any baggage about "executors" or anything and lets me have complete control over how the task switching happens. You can build generators with the feature, but I almost never do; there are a couple things I wish the C++ people had done with their implementation (which involves a weirdly thick-feeling interface to implement, and which makes it difficult to truly avoid memory allocation)--in particular, including some of Lewis Baker's interests in asynchronous deconstruction, without which I am finding myself often boxed into a corner (though what I really want, honestly, are true "linear" types)--but by and large I love the C++ implementation of this for its complete flexibility.

  • harikb 5 years ago

    Main problems with async await model is that the callee decides whether something should run sync/async. In goroutines model, caller decides

    • Matthias247 5 years ago

      I don't think so. Coroutine awaits are just like normal function calls in Go (since all Go functions are implicitly suspendable).

      I guess you meant the "go" statement? That is more of a coroutine spawn thing, and this would be a separate function in C++ too.

      • harikb 5 years ago

        > since all of Go functions are implicitely suspendable

        Right. we are saying the same thing about Go. The comparison was with those languages that have explicit async keyword.
