Thread Pools on the JVM

gist.github.com

217 points by ovis 4 years ago · 119 comments

cogman10 4 years ago

Loom can't land fast enough!

The current issue the JVM has is that all threads have a corresponding operating system thread. That, unfortunately, is really heavy memory wise and on the OS context switcher.

Loom allows Java to have threads as lightweight as a goroutine. It's going to change the way everything works. You might still have a dedicated CPU bound thread pool (the common fork join pool exists and probably should be used for that). But otherwise, you'll just spin up virtual threads and do away with all the consternation over how to manage thread pools and what a thread pool should be used for.
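
For a concrete picture, here is a minimal sketch of that thread-per-task style, assuming the virtual-thread API as it eventually shipped (Executors.newVirtualThreadPerTaskExecutor, JDK 21); the sleep is a stand-in for blocking IO:

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  class VirtualThreadsSketch {
      public static void main(String[] args) {
          // One cheap virtual thread per task; no pool sizing, no separate IO pool.
          try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
              for (int i = 0; i < 100_000; i++) {
                  exec.submit(() -> {
                      try { Thread.sleep(100); } catch (InterruptedException ignored) { } // stand-in for blocking IO
                  });
              }
          } // close() waits for all submitted tasks to finish
      }
  }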

  • cbsmith 4 years ago

    > That, unfortunately, is really heavy memory wise and on the OS context switcher.

    So, there was a time where a broad statement like that was pretty solid. These days, I don't think so. The default stack size (on 64-bit Linux) is 1MB, and you can manipulate that to be smaller if you want. That's also just virtual memory. The actual memory usage depends on your application. There was a time where 1MB was a lot of memory, but these days, for a lot of contexts, it's kind of peanuts unless you have literally millions of threads (and even then...). Yes, you can be more memory efficient, but it wouldn't necessarily help that much. Similarly, at least in the case of blocking IO (which is normally why you'd have so many threads), the overhead on the OS context switcher isn't necessarily that significant, as most threads will be blocked at any given time, and you're already going to have a context switch from the kernel to userspace. Depending on circumstance, using polling IO models can lead to more context switching, not less.
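
    A quick sketch of those knobs (my illustration, not the parent's code): the JVM-wide default can be lowered with -Xss, and an individual thread can request a smaller stack through the four-argument Thread constructor (the size is only a hint to the VM):

      // run with e.g.:  java -Xss256k SmallStack
      class SmallStack {
          public static void main(String[] args) throws InterruptedException {
              Thread t = new Thread(null,
                      () -> System.out.println("running on a small stack"),
                      "small-stack",
                      64 * 1024);   // request ~64 KiB for this thread
              t.start();
              t.join();
          }
      }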

    There's certainly circumstances where threads significantly impede your application's efficiency, but if you are really in that situation you likely already know it. In the broad set of use cases though, switching from a thread-based concurrency model to something else isn't going to be the big win people think it will be.

    • kllrnohj 4 years ago

      > So, there was a time where a broad statement like that was pretty solid.

      That time is approaching 20 years old at this point, too. Native threads haven't been "expensive" for a very, very long time now.

      Maybe if you're in the camp of disabling overcommit it matters, but otherwise the application of green threads is definitely a specialized niche, not generally useful.

      > In the broad set of use cases though, switching from a thread-based concurrency model to something else isn't going to be the big win people think it will be.

      I'd go even further and say it'll be a net-loss in most cases, especially with modern complications like heterogeneous compute. If your use case is specifically spinning up thousands of threads for IO (aka, you're a server & nothing else), then sure. But if you aren't, there's no win here, just complications (like times when you need native thread isolation for FFI reasons, like using OpenGL).

      • cbsmith 4 years ago

        > That time is approaching 20 years old at this point, too. Native threads haven't been "expensive" for a very, very long time now.

        It depends on the context, but yes. I worked on stuff throughout the 2000's where we ran into scaling problems with thread based concurrency models. At the time, running 100,000 threads was... challenging. But yeah, by 2010 we were talking about the C10M problem, because the C10K problem wasn't a problem any more. There are some cases where you really do need to handle 10's or 100's of millions of threads, but there aren't a lot of them.

        > Maybe if you're in the camp of disabling overcommit it matters, but otherwise the application of green threads is definitely a specialized niche, not generally useful.

        Yup, but everyone is still stuck on the old mental model of "threads are bad", partly driven by the assumption that whatever is being done to handle those extreme cases is what one should be doing to address their own problem space. :-(

        > I'd go even further and say it'll be a net-loss in most cases, especially with modern complications like heterogeneous compute.

        Even more so if you're doing polling based I/O rather than a reactive model. The look on people's faces when I point out to them that there's good reason to think that for the scale they are working at, they'll likely get better performance if they just use threads to scale...

        It's so weird how we talk about the context switching costs between threads without recognizing that the thread that does the poll is not the same thread that processed the IO request in the kernel.

      • samus 4 years ago

        > I'd go even further and say it'll be a net-loss in most cases, especially with modern complications like heterogeneous compute. If you're use case is specifically spinning up thousands of threads for IO (aka, you're a server & nothing else), then sure. But if you aren't there's no win here, just complications (like times when you need native thread isolation for FFI reasons, like using OpenGL)

        Virtual threads are going to be an /option/, not a requirement. Threads have to be explicitly created as virtual threads. If this is not done, nothing will change.
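
        For illustration, a sketch of that opt-in as it later shipped (the Thread.Builder API; the printlns are just placeholders):

          Thread platform = Thread.ofPlatform().start(() -> System.out.println("plain OS thread, as before"));
          Thread virtual  = Thread.ofVirtual().start(() -> System.out.println("explicitly opted in to a virtual thread"));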

        • cbsmith 4 years ago

          Having it be optional increases, not decreases, complexity. ;-) It also increases the propensity for people to use the feature blindly.

          • samus 4 years ago

            For the JVM developers for sure. Implementing Project Loom must have been quite a ride. But even if it is used blindly, there are only three obvious issues I see:

            * It's a no-no for computational workloads. As you said, they are concurrent, but not necessarily parallel.

            * As you said, care has to be taken to use the right thread when interacting with certain low-level APIs.

            * It becomes easier to overload upstream systems by sending too many queries concurrently.

            • cbsmith 4 years ago

              Oh there's a bunch of other problems as well. Developers will "solve" problems by increasing the number of virtual threads, that actually should be solved in other ways. Tons of code is going to suddenly discover assumptions about its underlying runtime model are no longer true, leading to subtle and potentially complex problems. New software will need to either take on the burden of choosing a runtime model or adopt the complexity from having to consider a mixture of both...

    • vbezhenar 4 years ago

      Your words might be true, but the world jumped on the async wagon a long time ago and is going all in. Nobody likes threads, everyone wants lightweight threads. Emulating lightweight threads with promises (optionally hidden behind async/await transformations) is very popular. So demand for this feature is here.

      I don't know why; I, personally, never needed that feature and good old threads were always enough for me. It's weird for me to watch non-JDBC drivers with an async interface, when it was common knowledge that a JDBC data source should use something like 10-20 threads maximum (depending on DB CPU count); anything more is a sign of bad database design. And running 10-20 threads, obviously, is not an issue.

      But demand is here. And lightweight threads are probably a better approach than async/await transformations.

      • cbsmith 4 years ago

        It's madness.

        • an_opabinia 4 years ago

          Is it? Benchmarks routinely show the async native database drivers outperforming JDBC ones in Java, and evented (async) IO is king in the only other contenders, C++ and Rust runtimes for RESTful and other server apps.

          • cbsmith 4 years ago

            ...and in how many circumstances are the database drivers the limiting factor in application performance?

            As I said in the beginning, you will absolutely win in the extreme cases (and accordingly, those tend to be the drivers that are tuned more for performance). In most cases it won't really make much of a difference one way or the other, and in some cases it will actually inhibit performance.

    • user5994461 4 years ago

      >>> The default stack size (on 64-bit Linux) is 1MB

      The default thread stack size is 8 or 10 MB on most Linux.

      The exception is Alpine, which is below 1 MB.

    • paulddraper 4 years ago

      Yep.

      Granted there are scenarios where you want 100,000 "threads of execution." And that clearly is going to be impractical for system threads.

      But if you're worried about the overhead of your pool of 50 threads, stop it.

      • cbsmith 4 years ago

        > Granted there are scenarios where you want 100,000 "threads of execution." And that clearly is going to be impractical for system threads.

        100,000 was impractical in the 2000's. Today, even with the default Java stack size of 1MB, 100,000 * 1MB = 100 GB of virtual memory. For IO bound tasks, actual memory usage would typically be a fraction of that, possibly under 2GB. That's definitely practical for a modern server.

        > But if your worried about the overhead of your pool of 50 threads, stop it.

        Yeah, people seem to misunderstand how thread pools work out these days. They're more limits on concurrency than anything else.

  • Spivak 4 years ago

    You are ignoring the downside to green threads, which is that they’re cooperative. If the thread doesn’t yield control back to the event loop then the real OS thread backing the loop is now stuck.

    Which leads to dirty things like inserting sleep 0 at the top of loops and dealing with really unbalanced scheduling of threads that don’t hit yields often enough. Plus, with Loom it might not be obvious that some function is a yield since it’s meant to be transparent, so if you grab a lock and yield you make everyone wait until you’re scheduled again.

    Green threads are great! I love them and they’re the only real solution to really concurrent IO-heavy workloads, but they’re not a panacea and they trade one kind of discipline for another.

    • cogman10 4 years ago

      Which is why the advice would be "Don't use virtual threads for CPU work".

      It just so happens that a large number of JVM users are working with IO bound problems. Once you start talking about CPU bound problems the JVM tends not to be the thing most people reach for.

      Loom doesn't remove the CPU bound solution by adding the IO solution. Instead, it adds a good IO solution and keeps the old CPU solution when needed.

      In fact, there's already a really good pool in the JVM for common CPU bound tasks: `ForkJoinPool.commonPool()`.
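
      A minimal sketch (my illustration, not from the comment) of putting CPU-bound work on that pool, which is sized by default to roughly the number of available cores:

        import java.util.concurrent.ForkJoinPool;
        import java.util.stream.LongStream;

        class CommonPoolSketch {
            public static void main(String[] args) {
                // Parallel streams execute on ForkJoinPool.commonPool() by default.
                long sumOfSquares = LongStream.rangeClosed(1, 10_000_000).parallel().map(x -> x * x).sum();
                System.out.println(sumOfSquares + " (parallelism: " + ForkJoinPool.commonPool().getParallelism() + ")");
            }
        }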

    • sudhirj 4 years ago

      Sleep 0 sounds like quite a hack, Go has the neater https://pkg.go.dev/runtime#Gosched instead, and I assume there will be a Java equivalent as well. And if most stdlib methods and all blocking methods call it, it's going to be pretty difficult to hang a green thread.
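
      (The closest Java analogue is presumably Thread.yield(); a tiny sketch, assuming a virtual thread on which yield() acts as a scheduling point - `done` and crunchABit() are hypothetical:)

        while (!done) {
            crunchABit();      // CPU-heavy step with no blocking calls
            Thread.yield();    // give the scheduler a chance, similar in spirit to runtime.Gosched()
        }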

    • saurik 4 years ago

      FWIW, while you are probably correct in the context of Loom--a specific implementation that I honestly haven't looked at much--you shouldn't generalize to "green threads" of all forms, as you not only can totally implement this well but Erlang does so: as you are working with bytecode and a JIT anyway, you instrument the code to check occasionally whether it should be preempted (I believe Erlang does this for every potentially-backward jump, which is sufficient to guarantee even a broken loop can be preempted).

    • hn_throwaway_99 4 years ago

      Agreed, but you have other single-threaded server languages like NodeJS which have the same problem (a new request can only be handled if the current request gives up control, usually waiting for IO) and people have figured out how to handle it.

      I see Project Loom as really providing all the benefits of single threaded languages like Node (i.e. tons of scalability), but with an easier programming model that threads provide as opposed to using async/await.

    • brokencode 4 years ago

      I was under the impression that Loom was implementing preemptable lightweight threads. Is that not the case?

      • Spivak 4 years ago

        So Loom uses interesting terminology when talking about this. They say that they’re preemptive and not cooperative because there’s no explicit await/yield keyword that you call from your code, but that isn’t the whole story, because threads are only preempted when they perform IO or are synchronized. So you as an author can’t know for sure where the yield points are and aren’t supposed to rely on them, but they’re still there. You’re not going to be forcefully preempted in the middle of number crunching.

        I think most people would consider this a surprising notion of preemption where it’s out of your control-ish but also not arbitrary like it is for OS threads which still leads to basically the same problems and constraints as cooperative threads.

        • cogman10 4 years ago

          Yeah... this is a place where I disagree with how the Loom devs define "preemptive". They are basically defining it as "most tasks will give up control when they hit a blocking operation". Yet, it's been my understanding that preemption means the scheduler can stop a currently operating task from running and switch to something else. That's not what happens with loom.
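
          To make that concrete, a sketch (my illustration; it assumes the default scheduler with its parallelism pinned to 1 via the jdk.virtualThreadScheduler.parallelism system property): nothing forcibly deschedules the spinning virtual thread, so the second one may never run.

            Thread.startVirtualThread(() -> { while (true) { /* pure CPU spin, no blocking calls */ } });
            Thread.startVirtualThread(() -> System.out.println("may never be printed"));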

        • pron 4 years ago

          > So loom uses interesting terminology when talking about this.

          That is a common terminology. Wikipedia says: [1]

          The term preemptive multitasking is used to distinguish a multitasking operating system, which permits preemption of tasks, from a cooperative multitasking system wherein processes or tasks must be explicitly programmed to yield when they do not need system resources. ... The term "preemptive multitasking" is sometimes mistakenly used when the intended meaning is more specific, referring instead to the class of scheduling policies known as time-shared scheduling, or time-sharing.

          > threads are only preempted when they perform IO or are synchronized

          First, they can be preempted by any call, explicit or implicit, to the runtime (or any library, for that matter). For all you know, class loading or even Math.sin might include a scheduling point (although that is unlikely as that's a compiler intrinsic). We make no promises on when scheduling can occur. Not only do threads not explicitly yield, code cannot statically determine where scheduling might occur; I don't believe anyone can consider this "cooperative."

          Second, Loom's virtual threads can also be forcibly preempted by the scheduler at any safepoint to implement time sharing. Currently, this capability isn't exposed because we're yet to find a use-case for it (other than one special case that we want to address, but isn't urgent). If you believe you have one, please send it to the loom-dev mailing list.

          The reason it's hard to find good use cases for time slicing is as follows:

          1. If you have only a small number of threads that are frequently CPU bound. In that case, just make them platform threads and use the OS scheduler. Loom makes it easy to choose which implementation you want for each thread.

          2. If you have a great many threads, each of which can infrequently become CPU-bound, then the scheduler takes care of that with work-stealing and other scheduling techniques.

          3. If you have a great many threads, each of which is frequently CPU-bound, then your cores are oversubscribed by orders of magnitude -- recall that we're talking about hundreds of thousands or possibly millions of threads -- and no scheduling strategy can help you.

          It's possible that there could arise real-world situations where infrequent CPU-boundedness might affect responsiveness, but we'll want to see such cases before deciding to expose the mechanism. Even OSes don't like relying on time-sharing (it happens less frequently than people think on well-tuned servers), and putting that capability in the hands of programmers is an attractive nuisance that will more likely cause a degradation in performance.

          [1]: https://en.wikipedia.org/wiki/Preemption_(computing)#Preempt...

          • lmm 4 years ago

            Isn't the point of preemption to degrade gracefully when the cores are oversubscribed? E.g. the first system I worked on ran potentially CPU-heavy work from various clients, and used per-client threads to isolate them; every so often clients would find ways to get their thread stuck doing a large amount of CPU work (e.g. regex backtracking) and although these were in some sense bugs (and we did fix them), it was very useful that even if one or two clients blocked all their threads (which was often more than our number of physical cores), this wouldn't completely block other clients' threads from running.

            • pron 4 years ago

              There's no doubt forced preemption could help, but I'm still unsure about what the right algorithm is; probably not time sharing.

              Suppose you have 100K threads, and only 1% of them become CPU-bound for 100ms. That could take down your 32-core server for 3 seconds, which is bad. But suppose we had 10ms time-slices. Then, those busy threads' latency might go from 100ms to as high as a few minutes, which means effectively taking them down. The scale has a qualitative effect here. So, rather than time-sharing, it might be better to optionally install some other preemption policy -- maybe something that indefinitely suspends threads that behave badly too often and puts them in some collection.

              The point is that time-slicing will probably not be helpful in sufficiently many cases, and we don't yet know what will. We'd like to gather more data before offering something. In some other languages/runtimes it might be worthwhile to just expose a capability and see what people do with it, but with Java, within five minutes you'll have twenty libraries doing time-sharing, and thousands of people using them blindly whether it's good or bad for them (just because they say they do time-sharing, and that's good, no?), and now there's just noise and bad habits everywhere. This is nanny-state governance, but we've learned our lesson, and you can't be too careful with an ecosystem this big.

              • lmm 4 years ago

                Sure. Ultimately you've got the same problem as an OS scheduler and recognising whether threads are CPU-bound or IO-bound and treating them separately is probably going to be part of that.

                I appreciate not wanting to do things until you can do them right, but equally if you advertise this as a preemptive runtime, people are going to expect that they can use it to throw 32 CPU-spinning threads onto 8 cores and have it behave gracefully. It sounds like from a user's point of view on day 1 this runtime will be the worst of both worlds - you need to take care to not do big chunks of CPU work without yielding, but you don't get the full control that a traditional "userspace" cooperative multitasking framework would give you.

                • pron 4 years ago

                  But the experience people have already been having with the Early Access is overwhelmingly positive. Even without forced preemption, "preemptive" is far less misleading than cooperative, even considering the common confusion between preemptive scheduling and time-sharing.

                  While OS threads might indeed handle 32 spinning threads on an 8-core machine more gracefully, switching between implementations of threads is easy so such a "mistake" is inconsequential, and no OS handles 320,000 spinning threads gracefully, and people know that that is the scale of threads that virtual threads exist to serve.

                  • lmm 4 years ago

                    You're right that calling it "cooperative" would be worse. Still, I suspect Early Access users are paying a lot more attention to the details (and are more knowledgeable users in general) than GA users will; switching thread implementation might be "easy", but I suspect most users will want to use Loom without tuning anything at all. So safe defaults are very important (and I'd suggest that for the default config, safely handling 32 spinning threads on 8 cores is more important than handling 320,000 mostly-sleeping IO-bound threads).

                    Which is not to say I have a better idea (other than "make the defaults magically do everything right", which is obviously hard).

                    • pron 4 years ago

                      I think we have the best defaults currently possible for the use-cases Loom targets, say, more than a few thousand concurrent tasks. The cases where you might observe some downside compared to the OS (before we choose to expose forced preemption) are not in that class. The only thing to consider is whether you have many concurrent tasks or a few, and if the answer is many, the choice is simple. Otherwise, you can experiment with different implementations, but the few-tasks case is not our initial focus.

                      Having said that, I'm interested in hearing about real-world cases (involving many tasks, not 32) where forced preemption, and possibly time sharing or maybe another strategy, can be useful. The "accidentally misbehaving subset" is a good example, but time-sharing probably isn't what we need to address it.

      • mikepurvis 4 years ago

        It sounds like it is: https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1....

        But the other side of that is that sometimes non-preemption is also a desirable property— like in JavaScript, or Python asyncio, knowing that you don't need to lock over every little manipulation of some shared data structure because you're never going to yield if you didn't explicitly await.

      • clhodapp 4 years ago

        I think that's not quite it:

        I believe that loom is implementing cooperative lightweight threads and simultaneously reworking all of the blocking IO operations in the Java standard library to include yields. I guess this means that you could, for example, hold an OS-level thread forever by writing an infinite loop that doesn't do any IO...

        • kaba0 4 years ago

          I believe both the OS and the JVM are free to reschedule it. Yields are just an explicit way of possibly changing the thread.

    • kaba0 4 years ago

      When you have a runtime, you have proper information about whether there is work being done on a given virtual thread - so in the case of Loom, afaik any blocking call will turn non-blocking auto-magically (other than FFI, but that is very rare in Java), since the JVM is free to wait on that asynchronously behind the scenes and do some other work in the meantime.

    • neeleshs 4 years ago

      :) sleep 0! I was trying to see if there is a way to preempt stuck threads (infinite loops etc), and wrote a small while loop replacement

        pwhile(()-> loop predicate, ()-> {loop body});
      
      
      All it does is add a Thread.isInterrupted() check to the predicate. At this point, best to switch to Erlang!
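
      A minimal sketch of what that helper might look like (my reconstruction, not the original code):

        import java.util.function.BooleanSupplier;

        final class Loops {
            // Runs the body while the predicate holds AND the current thread has not been
            // interrupted, so a supervisor can stop a runaway loop via Thread.interrupt().
            static void pwhile(BooleanSupplier predicate, Runnable body) {
                while (!Thread.currentThread().isInterrupted() && predicate.getAsBoolean()) {
                    body.run();
                }
            }
        }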

  • christkv 4 years ago

    Are we coming full circle, going back to a variant of the original Java green threads?

    • dragonwriter 4 years ago

      > Are we coming full circle going back a variant of the original Java green threads?

      There are basically two kinds of green threads:

      (1) N:1, where one OS thread hosts all the application threads, and

      (2) M:N, where M application threads are hosted on N OS threads.

      Original Java (and Ruby, and lots of other systems before every microcomputer was a multicore parallel system) green threads were N:1, which provides concurrency but not parallelism, which is fine when your underlying system can't do real parallelism anyway.

      Wanting to take advantage of multicore systems (at least, in the Ruby case, for underlying native code) drove a transition to native threads (which you could call an N:N threading model, as application and OS threads are identical.)

      But this limits the level of concurrency to the level of parallelism, which can be a regression compared to N:1 models for applications where the level of concurrency that is useful is greater than the level of parallelism available.

      What lots of newer systems are driving toward, to solve that, are M:N models, which can leverage all available parallelism but also provide a higher degree of concurrency.

      • cbsmith 4 years ago

        Java had M:N green thread models a LOOOOONG time ago.

        And Linux tried M:N thread implementations specifically to improve thread performance.

        In both cases, it turned out that just using 1:1 native threads ended up being a net win.

        • hawk_ 4 years ago

          I am not aware of an M:N thread model built into Java, even from a long time ago, at least not in a way where you could control N.

          • cbsmith 4 years ago

            The old JDK 1.1 Developer's Guide had a page on the different thread models: https://docs.oracle.com/cd/E19455-01/806-3461/6jck06gqk/inde...

            At the time, Solaris had the only "certified" JVM that did M:N threads, so they really liked to make a big deal about it.

            You could control N through a JNI call to thr_setconcurrency. Not portable, but it worked. That particular capability was almost always not helpful.

          • truffdog 4 years ago

            It was Solaris only, so there is definitely an asterisk somewhere.

            • cbsmith 4 years ago

              It was all very long ago, but the NGPT project did M:N threading on Linux (https://web.archive.org/web/20020408103057/http://www-124.ib...).

              There were also a number of M:N JVM implementations that were particularly popular in the soft-realtime space back in the early 2000's.

              One of the fun trends with computing is that as hardware, software, and applications evolve, ideas that were once not terribly useful suddenly become useful again. It's entirely possible that M:N threads for the JVM is one of those cases, but it's NOT a new idea.

      • jjtheblunt 4 years ago

        I worked in Solaris internals for a while at Sun during the early java era, and Solaris threading definitely did multiplexing of userspace onto os, and then os onto cores.

        Do you have a citation (because I can't find one) specifying your assertion that original Java green threads were not analogous to Solaris user -> os -> hardware multiplexing?

        • dragonwriter 4 years ago

          > Do you have a citation (because I can't find one) specifying your assertion that original Java green threads were not analogous to Solaris user -> os -> hardware multiplexing?

          I was writing from memory of second-hand after-the-fact recitations of the history. Doing some followup research prompted by your question, if I understand this document [0] correctly, Java initially had N:1 green threads on Solaris, then M:N green threads on Solaris with 1:1 native threads on Unix and Windows.

          [0] https://docs.oracle.com/cd/E19455-01/806-3461/6jck06gqe/

    • AtlasBarfed 4 years ago

      Basically yes.

      Longer answer: devs back in the day couldn't really grok the difference between green and real threads. Java made its bones as an enterprise language, which can have smart programmers, but they are conversely not close-to-the-metal knowledge-wise. Too many devs back in the day expected a Java thread to be a real thread, so Java re-engineered to accommodate this.

      I think the JDK/JVM teams also viewed it as a maturation of the JVM to be directly using OS resources so closely across platforms, rather than "hacking" it with green threads.

      These days, our high performance fanciness means the devs are demanding green thread analogues, and go/elixir/others are seemingly superior because of those.

      So to remain competitive in the marketplace, Java now needs threads that aren't threads even though Java used to have threads that weren't threads.

    • cogman10 4 years ago

      Yes and no.

      The new Loom threads will be much lighter weight than the original Java green threads. Further, the entire IO infrastructure of the JVM is being reworked for Loom to make sure the OS doesn't block the VM's thread. What's more, Loom does M:N threading.

      Same concept, very different implementation.

      • iamcreasy 4 years ago

        So, with Loom, now we can tell exactly in which order these threads were executed, as it's not up to the OS to decide thread execution order anymore?

        • cratermoon 4 years ago

          > we can tell exactly in which order theses threads were executed

          Everything I've ever been taught about multi-threading, parallelism, and concurrency says never to make any assumptions about execution order. What are you doing that you care about it?

    • hashmash 4 years ago

      Not quite. The original green threads were seen as more of a hack until Solaris supported true threads. Green threads could only support one CPU core, and so without a major redesign, it was a dead end.

    • avereveard 4 years ago

      More like the many-to-many threading model of the Solaris implementation of the JVM

  • lmilcin 4 years ago

    I have discovered ReactiveX for Java and Reactor in particular.

    I am working with Kafka and MongoDB and it is normal for my app to have a million in flight transactions at various stages of completion.

    In the past it required a lot of planning (and a lot of code) but Reactor lets me build these processes as pipelines with whatever concurrency or scheduler I desire, at any stage of the processing.

    We are even doing tricks like merging unrelated queries to MongoDB so that sometimes thousands of the same queries are executed together (one query with a huge in() or one bulk write rather than separate ones).

    This is improving our throughput by orders of magnitude while the pipeline pulls millions of documents per second from the database.
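
    For readers unfamiliar with that merging trick, a rough Reactor sketch (my illustration; incomingIds and findByIds(...) are hypothetical stand-ins for the real pipeline and repository):

      import java.time.Duration;
      import reactor.core.publisher.Flux;

      Flux<Object> merged = incomingIds                        // Flux<String> of document ids
              .bufferTimeout(1_000, Duration.ofMillis(5))      // batch up to 1000 ids, or wait at most 5 ms
              .flatMap(batch -> findByIds(batch), 8);          // one bulk in() query per batch, 8 batches in flight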

    I just don't see how Loom helps.

    Loom could help if you had blocking APIs to start, but you get much better results if you just resolve to use async, non-blocking wrapped in ReactiveX.

    • geodel 4 years ago

      Loom will help folks who prefer writing straightforward Java code instead of some random reactive library with obscure exception handling and poor to impossible debuggability.

      Now I get it is hard for many folks to understand that part. Just like at my workplace people think it is impossible to write micro service without SpringBoot.

      > Loom could help if you had blocking APIs to start, but you get much better results if you just resolve to use async, non-blocking wrapped in ReactiveX.

      There might be billions of lines of legacy code which would adapt to Loom with minimal changes but which would be impossible to turn into ReactiveX etc. without enormous investment and risk. Your ideas are rather simplistic for the real world.

  • dikei 4 years ago

    Yup, Loom will simplify the Producer-Consumer pattern on I/O operations a lot. With virtual threads, it's basically free to block on consumer threads, so you would only need 1 bounded pool for the consumers.

    Currently, for efficiency, you would need at least 2 pools: 1 small bounded pool for dequeuing the requests and creating the IO operation, and 1 unbounded pool for actually executing the IO operation.
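
    A rough sketch of that difference (my illustration, not from the gist; imports from java.util.concurrent):

      // Today: a small bounded pool to dequeue requests, plus a separate pool that blocks on the IO.
      ExecutorService dispatchers = Executors.newFixedThreadPool(4);
      ExecutorService ioPool      = Executors.newCachedThreadPool();   // effectively unbounded
      // With virtual threads, the second pool (and the hand-off) goes away: the consumers just block.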

  • ovisOP 4 years ago

    What benefits does loom provide vs using something like cats-effect fibres?

    • _old_dude_ 4 years ago

      You can actually debug the code you write because you get a real stacktrace, not a few frames that show the underlying implementation.

      • clhodapp 4 years ago

        Admittedly, loom will do much better but cats-effect does try its best within the limitations of the current JVM: https://typelevel.org/cats-effect/docs/2.x/guides/tracing

      • Nullabillity 4 years ago

        On the other hand, you'll spend a lot more time debugging Loom code, because it reuses the same broken-by-design thread API.

        • elygre 4 years ago

          What is broken-by-design about the api?

          • Nullabillity 4 years ago

            Fundamentally, an async API is either data-oriented (Futures/Promises: tell me what data this task produced) or job-oriented (Threads: tell me when this task is done). You can think of it like functions vs subroutines.

            Since you typically care about the data produced by the task, threads require you to sort out your own backchannel for communicating this data back (such as: a channel, a mutexed variable, or something else). Unscientifically speaking, getting this backchannel wrong is the source of ~99% of multithreading bugs, and they are a huge pain to fix.

            You can implement futures on top of threads by using a thread + oneshot channel, but that requires that you know about it, and keep them coupled. The point of futures is that this becomes the default correct-by-default API, unless someone goes out of their way to do it some other way.

            On the other hand, implementing threads on top of futures is trivial: just return an empty token value.

            There are also some performance implications: depending on your runtime it might be able to detect that future A is only used by future B, and fuse them into one scheduling unit. This becomes harder when the channels are decoupled from the scheduling.
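
            A minimal sketch of that "thread + oneshot channel" coupling (my illustration; the computation is a stand-in):

              import java.util.concurrent.ArrayBlockingQueue;
              import java.util.concurrent.BlockingQueue;

              class OneshotSketch {
                  public static void main(String[] args) throws InterruptedException {
                      BlockingQueue<Long> oneshot = new ArrayBlockingQueue<>(1);   // the back-channel
                      Thread worker = new Thread(() -> {
                          long result = 6L * 7L;                                   // stand-in for real work
                          try { oneshot.put(result); } catch (InterruptedException ignored) { }
                      });
                      worker.start();
                      System.out.println(oneshot.take());   // caller blocks until the worker hands the data back
                  }
              }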

            • azth 4 years ago

              Good points, but as far as I can tell, there's nothing preventing you from spawning a bunch of Loom-thread backed `CompletableFuture`s and waiting on them.

              • Nullabillity 4 years ago

                True, but Loom won't really help you there since CompletableFuture.runAsync already uses a pooling scheduler. Same for cats-effect and zio, for that matter.

                (And that's aside from CompletableFuture having its own separate problems, like the obtrude methods)

                • derefr 4 years ago

                  > already uses a pooling scheduler

                  A bounded pooling scheduler. (The ForkJoinPool.commonPool.)

                  Loom, I believe, "dummies out" the ForkJoinPool.commonPool — ForkJoinTasks/CompletableFutures/etc. by default just execute on Loom's unbounded virtual-thread executor.

                  (Which happens to be built on top of a ForkJoinPool, because it's a good scheduler. Don't fix what ain't broke.)

            • samus 4 years ago

              Project Loom's scope explicitly encompasses more than just virtual threads. To that end, the concept of structured concurrency[1] was introduced. There /are/ going to be new APIs.

              [1]: http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.h...

    • ackfoobar 4 years ago

      For the team that I am in, I can see a huge productivity boost if my teammates can write in direct style instead of wrapping their heads around monads.

      • hamandcheese 4 years ago

        Scala for-expressions make it pretty easy to write "direct style" code. Someone on the team should probably understand what's going on, though. I've had decent success with ZIO on my team, and it seems perfectly teachable/learnable.

        • ackfoobar 4 years ago

          I am the someone who "understands what's going on". My experience of the knowledge transfer was not pleasant at all. Maybe it's my ability to explain, maybe it's my teammates, maybe it's ZIO having better names for combinators than Cats.

          For-comprehension does help. But the alternative is callback hell all the way, so that's not saying much. It is still clunky compared to the regular syntax.

  • bestinterest 4 years ago

    What's the difference between goroutines and Project Loom? Is there any?

    • cogman10 4 years ago

      Terminology mostly :D

      I've not looked into the goroutine implementation, so I couldn't tell you how it compares to what I've read loom is doing.

      Loom is looking to have some extremely compact stacks, which means each new "virtual thread", as they are calling them, will end up having only bytes' worth of memory allocated.

      Another thing coming with loom that go lacks is "structured concurrency". It's the notion that you might have a group of tasks that need to finish before moving on from a method (rather than needing to worry about firing and forgetting causing odd things to happen at odd times).
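
      (A sketch of the structured-concurrency API as it later previewed (StructuredTaskScope in java.util.concurrent); fetchUser, fetchOrder, handle and id are hypothetical:)

        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            var user  = scope.fork(() -> fetchUser(id));    // runs in its own virtual thread
            var order = scope.fork(() -> fetchOrder(id));   // ditto
            scope.join().throwIfFailed();                   // everything finishes before the method moves on
            handle(user.get(), order.get());
        }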

      • jayd16 4 years ago

        >structured concurrency

        That's good to hear. You see a lot of these Loom discussions talk about implicit and magical asynchronous execution. I was afraid fine-grained thread control would be left out. It's super useful if you want to interface with how most GUI frameworks function (i.e. a Main thread), or with important OS threads like a thread with a bound GL context or what have you.

        • cogman10 4 years ago

          Yeah, while virtual threads are the bread and butter of Loom, they are also adding a lot of QoL things. In particular, the notion of "ScopedVariables" will be a godsend to a lot of concurrent work I do. It's the notion of "I want this bit of context to be carried through from one thread of execution to the next".

          Beyond that, one thing the loom authors have suggested is that when you want to limit concurrency the better way to do that is using concurrency constructs like semaphores rather than relying on a fixed pool size.
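
          That suggestion might look roughly like this (a sketch; `tasks` is a hypothetical collection of work items, imports from java.util.concurrent):

            Semaphore permits = new Semaphore(50);   // cap concurrency at 50 instead of sizing a pool
            try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
                for (Runnable task : tasks) {
                    exec.submit(() -> {
                        permits.acquireUninterruptibly();
                        try { task.run(); } finally { permits.release(); }
                    });
                }
            }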

      • ccday 4 years ago

        Not sure if it counts as structured concurrency but Go has the feature you describe: https://gobyexample.com/waitgroups

    • _old_dude_ 4 years ago

      Unlike goroutines, Loom virtual threads are not preempted by the scheduler. I believe you may be able to explicitly preempt a virtual thread, but the last time I checked it was not part of the public API.

    • jayd16 4 years ago

      The biggest difference is probably that the JVM will support both OS and lightweight threads. That's really useful for certain things like talking to the GPU in a single-thread context.

  • jeffbee 4 years ago

    Are you quite certain that a (linux, nptl) thread costs more memory than a goroutine? You've implied that but it's not obviously true.

    • dragontamer 4 years ago

      Wouldn't any Linux/NPTL thread require at least the register state of the entire x86 (or ARM) CPU?

      I don't think goroutines would need all of that information. Because a goroutine switches at known points, the compiler knows whether "int foobar" is currently being stored in "rbx" or has already been saved on the stack. Therefore, rbx doesn't need to be saved blindly.

      ------

      Linux/NPTL threads don't know when they are interrupted. So all register state (including AVX512 state if those are being used) needs to be saved. AVX512 x 32 is 2kB alone.

      Even if AVX512 isn't being used by a thread (Linux detects all AVX512 registers to be all-zero), RAX through R15 is 128-bytes, plus SSE-registers (another 128-bytes) or ~256 bytes of space that the goroutines don't need. Plus whatever other process-specific information needs to be saved off (CPU time and other such process / thread details that Linux needs to decide which threads to process next)

      • jeffbee 4 years ago

        I don't think the question is dominated by machine state, I think it would be more of a question of stack size. They are demand-paged and 4k by default for native threads, 2k by default for goroutines but stored on a GC'd heap that defaults to 100% overhead, so it sounds like a wash to me.

        • dragontamer 4 years ago

          Hmmm.

          It seems like you're taking this from a perspective of "Pthreads in C++ vs Coroutines in Go", which is correct in some respects, but different from how I was taking the discussion.

          I guess I was taking it from a perspective of "pthreads in C++ vs Go-like coroutines reimplemented in C++", which would be pthreads vs C++20 coroutines. (Or really: it seems like this "Loom" discussion is more of a Java thing but probably a close analog to the PThreads in C++ vs C++20 Coroutines)

          I agree with you that the garbage collector overhead is a big deal in practice. But it's an aspect of the discussion I was purposefully avoiding. But I'm also not the person you responded to.

          • jeffbee 4 years ago

            Right, I admit there are better ways to do it, but I don't think it's obviously true that goroutines specifically are either more compact or faster to switch between. The benefits might be imaginary. The Go runtime has a thread scheduler that kinda sucks actually (it scales badly as the number of runnable goroutines increases) and there are also ways of making native threads faster, like SwitchTo https://lkml.org/lkml/2020/7/22/1202

            • Thaxll 4 years ago

              Have you tried to context switch between 100k native threads? Good luck with that; in the meantime Go has no problem doing that with 1m goroutines.

jackcviers3 4 years ago

The author mentions Scala. Both ZIO[1] and Cats-Effect[2] provide fibers (coroutines) over these specific threadpool designs today, without the need for Project Loom, and give the user the capability of selecting the pool type to use without explicit reference. They are unusable from Java, sadly, as the schedulers and ExecutionContexts and runtime are implicitly provided in sealed companion objects and are therefore private and inaccessible to Java code, even when compiling with ScalaThenJava. Basically, you cannot run an IO from Java code.

You can expose a method on the scala side to enter the IO world that will take your arguments and run them in the IO environment, returning a result to you, or notifying some Java class using Observer/Observable. This can, of course take Java lambdas and datatypes, thus keeping your business code in Java should you so desire. It's clunky, though, and I wish Java had easy IO primitives like Scala.

1. https://github.com/zio/zio

2. https://typelevel.org/cats-effect/versions

jfoutz 4 years ago

I'm wary of unbounded thread pools. Production has a funny way of showing that threads always consume resources. A fun example is file descriptors. An unexpected database reboot is often a short outage, but it's crazy how quickly unbounded thread pools can amplify errors and delay recovery.

Anyway, they have their place, but if you've got a fancy chain of micro services calling out to wherever, think hard before putting those calls in an unbounded thread pool.

  • sk5t 4 years ago

    And you should be wary! Prefer instead a bounded thread pool with a bounded queue of tasks waiting for service, and also decide explicitly what should happen when the queue fills up or wait times become too high (whatever "too high" means for the application).
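
    Concretely, that's the long ThreadPoolExecutor constructor (a sketch; the sizes and the caller-runs policy are illustrative choices, not recommendations - imports from java.util.concurrent):

      ThreadPoolExecutor pool = new ThreadPoolExecutor(
              8, 8,                                        // bounded: a fixed number of workers
              60, TimeUnit.SECONDS,                        // keep-alive for idle threads
              new ArrayBlockingQueue<>(1_000),             // bounded queue of tasks waiting for service
              new ThreadPoolExecutor.CallerRunsPolicy());  // explicit decision when full: push back on the producer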

  • jeffbee 4 years ago

    Unbounded thread pools are bad, bounded thread pool executors with unbounded work queues are bad, and bounded thread pools with bounded queues, FIFO policies, and silent drops are also bad. There are many bad ways to do this.

    • dimitrov 4 years ago

      > and bounded thread pools with bounded queues, FIFO policies, and silent drops are also bad.

      Care to elaborate please? Seems like the author is recommending unbounded thread pools with bounded queues for blocking IO. Isn't that pretty similar?

      • jfoutz 4 years ago

        I can't speak for the parent, but some things stand out to me:

        1. On k8s and bare metal, when you make a bunch of threads, things get slower. With the FIFO case, you can have pending requests in the queue that don't get their connection-canceled event, while the same user puts another request in the queue.

        2. Silently dropping is bad, you want an alert - really you want an alert when you get close, so you can add more capacity

        3. bounded queue with unbounded threads is really just an unbounded queue - a short line with a mob pushing to get in line

        Then, you know, memory on k8s, pod gets OOM killed. that sucks cause you have to reschedule and restart. all the pending requests are dropped.

        It's very easy to make something that works, but is actually quite detrimental when things are on fire. A little extra gasoline helps get over the hills, but when things are on fire, gasoline makes a bigger fire.

charleslmunger 4 years ago

Another tip - If you have a dynamically-sized thread pool, make it use a minimum of two threads. Otherwise developers will get used to guaranteed serialization of tasks, and you'll never be able to change it.
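
A sketch of that tip with ThreadPoolExecutor (my illustration; imports from java.util.concurrent): keep the core size at 2 or more so callers can never come to rely on tasks running one at a time.

  ThreadPoolExecutor pool = new ThreadPoolExecutor(
          2, 32,                        // core size of two, grows up to 32 on demand
          60, TimeUnit.SECONDS,
          new SynchronousQueue<>());    // hand-off queue, as in Executors.newCachedThreadPool()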

0xffff2 4 years ago

This seems like good advice in general. Is any of it really specific to the JVM? If I was doing thread pooling with CPU and IO bound tasks, I would approach threading in a similar way in C++.

  • cogman10 4 years ago

    It'll depend on if your language has either coroutines or lightweight threads.

    Threadpooling only matters if you have neither of those things.

    Otherwise, you should be using one or the other over a thread pool. You might still spin up a threadpool for CPU bound operations, but you wouldn't have one dedicated to IO.

    As of C++ 20, there are coroutines which you should be looking at (IMO).

    https://en.cppreference.com/w/cpp/language/coroutines

    • dragontamer 4 years ago

      Threadpools are probably better for CPU-bound tasks (or CPU-ish bound tasks, like RAM-bound ones) without any I/O.

      Coroutines / Goroutines and the like are probably better on I/O bound tasks where the CPU-effort in task-switching is significant.

      --------

      For example: Matrix Multiplication is better with a Threadpool. Handling 1000 simultaneous connections when you get Slashdotted (or "Hacker News hug of death") is better solved with coroutines.

      • cogman10 4 years ago

        I agree.

        Coroutines MIGHT be more efficient if what you end up building is a state machine anyway (as that's what most of those coroutines are turned into by the compiler). Otherwise, if it's just pure parallel CPU/memory burning with few state transitions/dependencies, then a dedicated CPU pool fixed to roughly the number of CPU cores on the box will be the most efficient.

        Heck, it can often even yield benefits to "pin" certain tasks to a thread to keep the CPU cache filled with relevant data. For example, 4 threads handling the 4 quadrants of the matrix rather than having the next available thread pick up the next task.

        • dragontamer 4 years ago

          The one that gives me a headache is thinking about how to oversubscribe a GPU (or worse: 4 GPUs, as in the case of the Summit supercomputer).

          It's I/O to send data to and from a GPU, and therefore it's somewhat of an I/O-bound task. But there's also a significant amount of CPU work involved. Ideally, you want to balance CPU-work and GPU-work to maximize the work being done.

          Fortunately, CUDA-streams seems like they'd mesh pretty well with coroutines (if enough code were there to support them). But if you're reaching for the "GPU-button", everything is compute-bound (if not, you're "doing it wrong"). So now you have a question of "how much to oversubscribe?"

          Then again, that's why you just make the oversubscription-factor a #define and then test a lot to find the right factor.... EDIT: Or maybe you oversubscribe until the GPU / CPU runs out of VRAM / RAM. Oversubscription isn't really an issue with coroutines that are executed inside of a thread-pool: you aren't spending any CPU-time needlessly task-switching.

          • cogman10 4 years ago

            And, TBF, a lot of the IO stuff comes down to specifically talking about what sort of device you are talking to and where.

            For a lot of the programming I do (and I'm sure a lot of others on HN) IO is almost all network IO. For that, because it's so slow and everything is working over DMA anyways, coroutines end up working really well.

            However, once you start talking about on-system resources such as SSDs or the GPU, it gets more tricky. As you rightly point out, the GPU is especially bad because all GPU communication ends up being routed through the CPU. At least for an HDD, there's DMA, which cuts down on the amount of CPU work that needs to be done to access a bit of data.

        • jsmith45 4 years ago

          Only stackless co-routines require the state machine transformation. Stackful co-routine based user-mode threading generally just changes out the IO primitives to issue an asynchronous version of the operation, and immediately calls into the user-mode scheduler to pick some ready-to-resume co-routine to switch the stack to and resume. They might include a preemption facility (beyond just the OS's preemption of the underlying kernel threads), but that is not required and is largely a language/runtime design decision.

          The big headaches with stackful co-routine based user-mode threading come from two sources. One is allocating the stack. If your language requires a contiguous stack then you either need to make the stacks small, and risk running out, or make them big, which can be a problem on 32-bit platforms (you can run out of address space) or on some other platforms (those with strict commit-charge based memory accounting). Both can be mitigated by allowing non-contiguous stacks or re-locatable contiguous stacks (to allow small stacks to grow later without headaches), although obviously that can have performance considerations.

          The other stackful co-routine headache is in calling into code from another language (i.e. FFI) which could be making direct blocking system calls and end up starving you of your OS threads.

          I do agree that in purely CPU or memory bound applications a classical thread pool makes better sense. The main advantages of either type of co-routine based user mode threading primarily apply to IO-heavy or mixed workloads.

    • 0xffff2 4 years ago

      >As of C++ 20, there are coroutines which you should be looking at (IMO).

      Ha! Maybe in 20 years. Sadly, I'm still writing new code targeting C++98 on one project. The most current project I'm a part of is on C++11.

  • valbaca 4 years ago

    > Is any of it really specific to the JVM?

    Not for languages with go/coroutines (e.g. Go, Clojure, Crystal) as those were designed specifically to help with the thread-per-IO constraint.

WatchDog 4 years ago

If your app is fully non-blocking, doesn't it make sense to just do everything on one pool, CPU-bound tasks and IO polling, rather than passing messages between threads?

  • tadfisher 4 years ago

    "Fully non-blocking" means "does no work". Ignoring the process' spawning thread, if your app performs CPU-bound tasks on a bounded thread pool, you will be leaving I/O throughput on the table as the number of tasks increases, since I/O-bound tasks will block on waiting for a thread.

elric 4 years ago

> you're almost always going to have some sort of singleton object somewhere in your application which just has these three pools, pre-configured for use

I'm bemused by this statement, and I can't figure out whether this is an assertion rooted in supreme confidence, or just idle, wishful thinking.

That being said, giving threading advice in a virtualized and containerized world is tricky. And while these three categories seem sensible, mapping the functions of any non-trivial system onto them is going to be difficult, unless the system was specifically designed around it.

u678u 4 years ago

With Python, at first I was scared of the GIL being single-threaded; now I'm used to it and it works great. Thousands of threads used to be normal for my old Java projects, but that seems crazy to me now.
