Multi-core scaling: it’s not multi-threaded
erratasec.blogspot.com

He suggests an interesting approach.
1) Tell the kernel it only has a limited set of cores to work with.
The way to fix Snort’s jitter issues is to change the Linux boot parameters. For example, set “maxcpus=2”. This will cause Linux to use only the first two CPUs of the system. Sure, it knows other CPU cores exist; it just never schedules a thread to run on them by default.
2) Manually schedule your high priority process onto a reserved core.
Then, in your code, you call “pthread_setaffinity_np()” to put your thread on one of the inactive CPUs (Snort has a configuration option to do this per process). As long as you manually put only one thread per CPU, it will NEVER be interrupted by the Linux kernel.
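Not from the article, but a minimal sketch of what that call looks like, pinning the calling thread to an arbitrarily chosen CPU 3:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Pin the calling thread to one CPU so the scheduler never migrates it. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* Returns 0 on success, an errno value on failure. */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        int err = pin_to_cpu(3);     /* CPU 3 is a placeholder "reserved" core */
        if (err != 0) {
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
            return 1;
        }
        /* ... packet-processing loop would run here ... */
        return 0;
    }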
3) Turn off interrupts to keep things as real time as possible.
You can still get hardware interrupts, though. Interrupt handlers are really short, so they probably won’t exceed your jitter budget, but if they do, you can tweak that as well. Go into “/proc/irq/<n>/smp_affinity” and steer interrupts away from the CPUs running your Snort processing threads.
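A rough sketch of that tweak done programmatically; the IRQ number (24) and the mask are made-up examples, and the real files live under /proc/irq/<n>/smp_affinity (pick real IRQs from /proc/interrupts):

    #include <stdio.h>

    /* Keep IRQ `irq` on the CPUs in `hexmask` (a hex bitmask string, e.g. "3"
     * for CPUs 0-1) so it never fires on the reserved cores.  Needs root. */
    static int set_irq_affinity(int irq, const char *hexmask)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int ok = (fputs(hexmask, f) >= 0);
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(void)
    {
        if (set_irq_affinity(24, "3") != 0)   /* IRQ 24 is a made-up example */
            perror("set_irq_affinity");
        return 0;
    }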
4) Profit?
At this point, I’m a little hazy about what precisely happens. What I think will happen is that your thread won’t be interrupted, not even for a clock cycle.
Can anyone remove the haziness? I'm more interested in this for benchmarking than performance, and wonder how it compares to other ways of increasing priority like "chrt". Is booting with a low "maxcpus" necessary, or can the same be done at runtime?
I think he has some iffy premises. I'm not certain, but I guess pthread mutexes probably use userspace mutexes under the covers. If they don't, you could select a library that does. Grabbing uncontended mutexes should be basically free. It is definitely not a good idea to write all your own abstractions, as he suggests, unless you have a really great understanding of the issues involved. You'd probably still get it wrong sometimes even with that.
Affinity is definitely valuable, but I don't think you should need to disable interrupts for most applications. I'm not even sure if it is generally possible, since interrupts are sometimes used to swap pages or notify the kernel that a page isn't mapped or that a blocked resource is now available. The reason affinity is valuable is not because of kernel interactions. It's because of NUMA and cpu cache swapping. Affinity can prevent thread migration, which is expensive mainly because data also has to be migrated or else accessed in a less efficient manner. Likewise, make sure that if you dispatch an asynchronous call, the handler runs on the same core you sent the call from.
Finally, it's a common fallacy in these kinds of posts to act as if threads can't be used to do shared-little or shared-nothing-style multi-programming. They often aren't, but there's nothing that prevents it.
I'm not certain, but I guess pthread mutexes probably use userspace mutexes under the covers.
Correct. man 7 pthreads states:
In NPTL, thread synchronization primitives (mutexes, thread joining, and so on) are implemented using the Linux futex(2) system call.
A futex is a mutex that only does a system call if there is contention; you can google it.
"Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux": http://kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf
My blog post was quite clear that I'm talking about futexes, and that when one fails to get a lock it goes into the kernel. Seriously, that's how everyone did it from Solaris to Windows before Linux "invented" the concept AND it's been in Linux for a decade. When somebody says "mutex", you have to assume they already mean "futex".
If you fail to get the lock, then you have to wait on the lock, so you might as well go into the kernel anyway. That's the reason futexes do that, instead of just letting the thread spin.
True, there are scenarios when you might prefer to spin, but they're pretty specialized.
My blog post was quite clear that I'm talking about futexes
No, you never mention the word "futex."
When somebody says "mutex", you have to assume they already mean "futex".
I don't think that's a valid generalization at all.
The entire point of the post was talk about "lock-free" algorithms where two cores can make forward progress without either having to wait or spin.
What systems have mutexes that aren't built like futexes?
Kernels, and probably any kind of parallel user-level runtime like Erlang.
> I don't think that's a valid generalization at all.
Especially given he was complaining about syscall costs . . . which are nonexistent with futexes . . .
They are NOT non-existent; they happen a lot under contention. That's why code fails to scale: as you add CPUs, contention happens a lot more often, and the number of syscalls shoots through the roof.
But the cost is not the syscalls . . . it's all those threads getting stopped, swapping memory in and out of the cache, etc. The syscall cost is minor by comparison.
At any rate, you've definitely overstated the case against mutexes in this post. At most, the advice should have been "avoid designs that will suffer from a lot of contention", like locking with lots of shared mutable state. It really has nothing to do with threads vs. processes, as you've made it out to be.
If there is contention, I think you mean, right?
Yeah, edited to fix it. Just a typo.
Setting a thread (or process) to a higher real-time priority than everything else (i.e., with chrt) will accomplish the same thing he's describing with maxcpus (though I wasn't aware maxcpus worked the way he said, so I can't vouch for it myself).
It's also just better, because something lower priority can run whenever the thread with real-time priority suspends (e.g., waiting for disk or network I/O).
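For example, "chrt -f 50 <command>" from the shell, or the equivalent from inside the program (the priority value 50 is an arbitrary choice, and this usually needs root or CAP_SYS_NICE):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* Give this process a real-time FIFO priority; it will then preempt
         * all normal SCHED_OTHER tasks whenever it is runnable. */
        struct sched_param p = { .sched_priority = 50 };
        if (sched_setscheduler(0, SCHED_FIFO, &p) != 0) {
            perror("sched_setscheduler");
            return 1;
        }
        /* ... latency-sensitive work ... */
        return 0;
    }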
I don't know the process for making sure interrupts don't happen on a particular CPU in Linux, but if you do that, you can only be interrupted by SMIs (system management interrupts), which are done by the BIOS, not the OS. Some machines don't have SMIs, though. It's BIOS specific.
If you do have them, you can't necessarily tell by using the TSC (as the author wanted to do), because poorly-behaved BIOSs will attempt to hide that the SMI happened by fiddling with the TSC.
> will accomplish the same thing
I haven't worked with chrt this way, but I have done the maxcpus approach mentioned in the article - with the explicit affinity combined with maxcpus you can be sure that a process/thread won't jump around to another free CPU if it goes out to disk and is then replaced with another process. From my cursory reading of chrt, I don't see anything about CPU affinity, so this is still a possibility, right? Am I missing a constraint in the Linux scheduler?
I suppose they're the same in that you can use both to basically tell the kernel to "not schedule anything that would interact with a given process in any way that would affect its execution timeline".
maxcpus needs to be set at boot time. You can turn CPUs on/off logically at runtime, but that's not equivalent to "take this CPU out of the default scheduler rotation"; it's equivalent to "power it off, I may unplug the entire CPU and swap it out". The latter would seem fancier than the former, so I'm not sure why there isn't a means to set maxcpus at runtime, but there doesn't appear to be one.
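For reference, the runtime on/off switch being described is the sysfs CPU hotplug interface; a sketch, with cpu3 as an arbitrary example (needs root), and note that this is the "power it off" behaviour, not a maxcpus-style reservation:

    #include <stdio.h>

    int main(void)
    {
        /* Logically offline CPU 3; write "1" to bring it back online. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu3/online", "w");
        if (!f) {
            perror("cpu3/online");
            return 1;
        }
        fputs("0", f);
        fclose(f);
        return 0;
    }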
The author does gloss over the fact that if you have a hyperthreading-enabled processor, you want to make sure you either turn it off or avoid scheduling a pair of processes on the two sibling logical cores of the same physical core (a hyperthread pair); otherwise you will see a slowdown, because the two logical cores actually share boatloads of state.
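One way to see which logical CPUs are hyperthread siblings, so you can avoid pinning two busy threads onto the same physical core, is to read the CPU topology files; a small sketch for cpu0:

    #include <stdio.h>

    int main(void)
    {
        char buf[64];
        /* For each logical CPU this file lists the siblings sharing its physical
         * core, e.g. "0,4" means CPU 0 and CPU 4 are a hyperthread pair. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
        if (f && fgets(buf, sizeof(buf), f))
            printf("cpu0 siblings: %s", buf);
        if (f)
            fclose(f);
        return 0;
    }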
Hypothesis: we will never solve multi-core for general purpose computing (there's also http://en.wikipedia.org/wiki/Amdahl%27s_law). But we can do multi-core for the embarrassingly parallelizable - such as graphics (top GPUs have over 1000 cores) - so instead of solving this problem, our focus will shift to those tasks for which multi-core does work, because it's only these that keep improving at Moore's Law-like rates.
Arguably, this is already happening.
I am not sure what would theoretically prevent each web page and each application from running on a separate core (or two). Considering the number of people who love multiple tabs, I can't see how that wouldn't seem like a win for even thirty or more cores.
The problem is that most web pages and programs don't actually do very much when they're in the background, and much (most?) of the software that does grind away for long periods of time without user interaction (compilers, ray tracers, video encoding, etc.) is already able to make good use of many cores. But users who fit that profile are a niche market.
What the vast majority of users need in terms of speed is acute performance. A dedicated core for every webpage is not very useful because users don't load up a bunch of tabs at the same time and then flip between them many times per second, interacting with each one. Instead CPU usage tends to happen in short bursts directly following user interaction, so even if every tab has its own core, you won't usually see very much contention for CPU between different tabs. The same goes for most native programs. Users just don't multitask fast enough to make separate cores relevant for most programs.
I think the problem is that hardcore parallel programmers can't understand the needs of adhd web browsers.
Some fraction of web pages are all sorts of ridiculous JavaScript and seriously slow down my machine.
And I, personally, do flip pages at a significant rate.
>Some fraction of web pages are all sorts of ridiculous JavaScript and seriously slow down my machine.
Yes there are exceptions, but not enough to plausibly make use of more than eight cores. The returns are usually diminishing at even four cores.
>And I, personally, do flip pages at a significant rate.
More than two or three times per second? Because that's what it would take.
I thought some browsers already do this? But it still won't make use of more than a very small number of cores, one (or a couple) for the tab you're interacting with, and a small part of another one for all the other tabs that are sitting there idle.
Chrome does split pages across different processes, and this, as far as I know, makes use of multiple cores. For (I assume) memory overhead reasons, it doesn't put each tab in a separate process by default, although there is (or used to be) a flag to switch to that behavior.
I wonder if we will ever figure a way to resume improving clock cycles instead of adding more parallelism.
Parallelism has two major issues:
First, not all applications need it. In many cases you just want to perform a series of operations on a single starting value and nothing else; for example, if you only need one factorial, there is little point in parallelizing it.
Second, it is absurdly hard to code for heavily parallelised hardware. Most coders will produce crap code that doesn't work, and no matter how good we become at making helper libraries, it is a totally different way of thinking.
Yes, for some things, like servers, where you can throw a user onto each core, it is nice... But for many other uses, even simple single-core parallelism, like SIMD, is not very useful.
I wonder if we will ever figure a way to resume improving clock cycles instead of adding more parallelism.
IPC has been steadily improving generation over generation, but it is a slow march. Chip frequency does not seem like it is going to go anywhere without some serious breakthroughs; processors are thermally limited, and while you can work on saving power there don't seem to be any 10x improvements in power coming that could let you crank up the clock. Scaling voltage down is great for reducing switching power, but it depends on smaller and smaller transistors, so leakage power has been steadily growing and eating into those gains.
Chips are up against a lot of walls- power consumption, heat dissipation, and so on. Chip makers have and are working on pushing forwards, but short of a new kind of transistor there do not appear to be any improvements by orders of magnitude on the horizon for single-core performance.
This is why parallelism is important. It is hard, and not every workload can be parallelised well, but there is simply no other known way to secure a 4x, 8x, 16x, etc boost in performance than 4x, 8x, 16x, etc parallelism. (Assuming your code isn't terrible, in which case fix your code!)
For increasing frequency the problem is we used to be able to just bump the clock speeds when the circuits shrank, but we've gotten small enough that the transistors start to leak now, so we have to drive them with less voltage to prevent the chip from overheating, meaning the clock has to come back down </huge-oversimplification>.
There are some possible ways forward. If we're sticking with transistors we might be able to switch to a material with better electrical properties, but that needs lots of research before it'll be higher performance than silicon.
There are also funky non-transistor based ideas for doing computation, like using DNA or nano-scale clockwork or ballistic electrons. I have no idea how feasible that stuff is.
In the shorter term, computer engineers are finding ways to turn those extra transistors into better single-threaded performance through better prefetching, branch prediction, and reordering your instructions so that more than one executes at a time, even though you never thought about parallelism when writing them. That's why a modern computer core is much faster than an old Pentium 4, even though the clock speeds might be the same.
The problem with that is that using more transistors tends to provide at most an O(sqrt(n)) speedup, whereas adding more cores potentially provides an O(n) speedup.
> First, not all applications need it. In many cases you just want to perform a series of operations on a single starting value and nothing else; for example, if you only need one factorial, there is little point in parallelizing it.
Sorry for nitpicking, but calculating a factorial can certainly be parallelized. The easiest way is to have each of n cores multiply every n-th number, then multiply the n partial results together.
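A toy sketch of that scheme with plain 64-bit integers (so it only makes sense for small N; a real implementation would use a bignum library, where the large multiplications are what actually benefit from extra cores):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 20ULL                     /* 20! still fits in 64 bits */

    struct job { unsigned long long start, partial; };

    /* Each thread multiplies every NTHREADS-th number in 1..N. */
    static void *partial_product(void *arg)
    {
        struct job *j = arg;
        j->partial = 1;
        for (unsigned long long i = j->start; i <= N; i += NTHREADS)
            j->partial *= i;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct job jobs[NTHREADS];
        unsigned long long result = 1;

        for (int t = 0; t < NTHREADS; t++) {
            jobs[t].start = t + 1;
            pthread_create(&tid[t], NULL, partial_product, &jobs[t]);
        }
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            result *= jobs[t].partial;  /* combine the partial products */
        }
        printf("%llu! = %llu\n", N, result);
        return 0;
    }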
I doubt that you can get much of an increase in performance as cores increase unless you are calculating with numbers with a huge number of bits.
Considering that factorial of 1e6 has about 18e6 bits (and factorial of 1e3 has 8.5e3 bits)? Yes, any factorial that doesn't have a huge amount of bits will be fast enough to calculate that there's not much point to parallelizing it.
As the size of the input to the procedure increases, you will indeed be calculating with numbers with a huge number of bits.
Well, what we really need to continue to improve clock cycles is better cooling. Overclocking modern chips is really easy, and all you really need to get a pretty solid increase in speed is a decent CPU cooler. Yes, to get drastic increases you need to up the voltage to the chip, and there are concerns about the chip degrading faster at higher clock speeds, but for the most part a solid increase in speed can be achieved simply by telling it to go faster and making sure it doesn't overheat.
Sure parallel code is harder than sequential code, but it's not really all that much harder (maybe about 8 credit-hours at your local university?). The reason it's "too hard" is that most programs aren't slow enough to be worth the bother.
If you're interested in faster clock cycles, check out the specs on this beast: http://www-03.ibm.com/systems/z/hardware/zenterprise/zec12.h...
> If you're interested in faster clock cycles, check out the specs on this beast: http://www-03.ibm.com/systems/z/hardware/zenterprise/zec12.h....
You can overclock an old Core 2 higher than that, with a bit of luck and good gear. And it'll undoubtedly be cheaper than buying an IBM 'frame.
FWIW, overclocking records are currently above 8 GHz on Vishera using LN2 cooling. Granted, LN2 cooling can't actually be used for everyday operation, but it gives you the limits of the chips. You can reach 5.5 GHz on water, and 6+ on cascade or single-stage phase change (which do work as standard cooling solutions).
…from 33-MHz to 3-GHz, a thousand-fold increase…
There had to be a better way to write that. I suppose more work per clock cycle and an increased number of cores contribute the other 10x of raw performance. But then the author goes on to say they are stuck, which isn't true of performance, only of clock rate. In any event, putting an "up is down" in your sentence should generally be avoided.
Edit: The >>>proscribed<<< method for resolving this is a “lock”, where… Sigh.
The article covers a lot of ground lightly. It talks about the new Haswell transactional memory instructions, the way Linux shards network counters, and a way to make Linux not use a core so you can schedule a process on it that will never be preempted.
I think he just screwed up the math.
Tangentially related: Snort was doing research on moving to a multi-threaded architecture, but decided against it due to cache synchronization problems [1]. Their thoughts about splitting up the processing were quite different from what the OP blog post suggests, though.
It looks like Snort gave up on one way of doing multi-threading, but they could still go the way suggested in the OP's post.
[1] http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-re...
Yeah, some engineers created a ground-up rewrite of Snort called "Suricata" that was multi-threaded, and therefore supposedly faster than Snort, which is only single-threaded. Suricata then failed to show any benchmark where it exceeded Snort's speed, and they fail to mention that Snort works fine on multi-core.
It's one of those things "everyone knows multi-threaded is better than single-threaded", but everyone's wrong.
So, this post has a number of errors, and is fundamentally wrong.
(a) pthread_mutex_t and friends use futexes, which only call into the kernel when there actually is contention.
(b) it would be better to use chrt (change to real-time priority) than the maxcpus trick, because the former accomplishes the same thing, but allows the core to still be used if the high-priority thread suspends (e.g. to do disk or network I/o).
(c) Contrary to his claim about Snort, there is no reason to prefer a multiprocess design over a multithreaded design for a particular application. There is no savings in overhead or synchronization or anything like that by going with processes. In fact, using processes and then using memory mapping to share when you could use threads, is just making things harder for yourself for no reason.
(d) "What I’m trying to show you here is that “multi-core” doesn’t automatically mean “multi-threaded”." Well, in computer science terminology, a thread is a schedulable entity, and a process is a schedulable entity with memory protection. So, he's wrong. Lots of developers talk about threads and processes as orthogonal things, though, so I can see why he made that claim.
(e) "The overall theme of my talk was to impress upon the audience that in order to create scalable application, you need to move your code out of the operating system kernel. You need to code everything yourself instead of letting the kernel do the heavy lifting for you." That is horrible advice that is just going to lead to lots of bugs and wasted effort. It's premature optimization. Even most people using the Linux realtime preemption patch (PREEMPT_RT) do not have such strict requirements that they need to take this advice.
(f) "Your basic design should be one thread per core and lock-free synchronization that never causes a thread to stop and wait." Might apply to certain very specific real-time (as in, embedded systems or HFT) scenarios, but in general, no, you're just wasting the core when that one thread doesn't need to use it. Prefer real-time priorities if you really need it.
(g) "Specifically, I’ve tried to drill into you the idea that what people call “multi-threaded” coding is not the same as “multi-core”. Multi-threaded techniques, like mutexes, don’t scale on multi-core." Again, you can only use multiple cores in parallel if there are multiple threads. And multi-threaded techniques do scale. You definitely may want to use lock-free synchronization instead of mutexes in some specialized scenarios, though (see the sketch after this list).
EDIT: OK, here is one other thing I forgot in the list above.
(h) "There are multiple types of locks, like spinlocks, mutexes, critical sections, semaphores, and so on. Even among these classes there are many variations." Technically, mutexes and semaphores are both ways of protecting critical sections, and a spinlock is a way of implementing a lock (including, possibly, a mutex or semaphore lock). Again, this is to some degree the difference between developers with a shared lingo and computer scientists. But if you go by that kind of lingo, you're missing part of the picture.
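To make (g) concrete: the simplest example of lock-free synchronization is bumping a shared counter with an atomic instruction instead of a mutex (a sketch, not code from the post):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_long packets_seen = 0;   /* shared statistic, no mutex */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            /* One atomic add; no thread ever blocks in the kernel. */
            atomic_fetch_add_explicit(&packets_seen, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("%ld\n", atomic_load(&packets_seen));   /* 4000000 */
        return 0;
    }

Under heavy contention even the atomic still bounces the cache line between cores, which is why the kernel shards its network counters per-CPU, as mentioned elsewhere in the thread.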
Agreed.
I tried to benchmark the context switching of user-level threads vs kernel-level threads [0]. Specifically, libtask vs pthread. I assumed pthread_yield would be very expensive, because it has to do a syscall into the kernel: something like an order of magnitude or two slower. Interestingly, with two threads, switching back and forth was only about twice as slow as with the user-level threads. Linux futexes seem to be really clever. Use them!
On the other hand, pthread_yield scales linearly, while libtask has a constant time yield. So for concurrency without parallelism [1] and lots of threads, the libtask library can be recommended.
[0] https://bitbucket.org/qznc/switchcost/src
[1] http://existentialtype.wordpress.com/2011/03/17/parallelism-...
(a) What part of "In the Linux pthread_mutex_t, when code stops and waits, it does a system call to return back to the kernel" do you not understand? That's how "futex" works: when it fails to get a lock, it must stop and wait, and therefore does a system call. When it doesn't have to wait because there is no contention, then it doesn't make a system call.
(b) The entire point is that for really scalable network applications, you don't want the high-priority thread to suspend for things like disk IO. Really, you read all that and expected the thread to use blocking IO instead of asynchronous IO???
(c) The point wasn't that Snort's multi-process design is better, only that it's acceptable. It's a single-threaded design written in the 1990s that has lots of quirks that make it hard to convert to multi-threaded operation. The point is that you can still get it to multi-core without having to make it multi-threaded.
(d) How is "multi-core doesn't mean multi-threaded" wrong??? Snort is a multi-core app today and isn't multi-threaded.
(e) I keep repeating the claim because my code written in user mode scales better than the code of people who disagree. My code scales to 20-million connections, 20-gbps, 20-million packets/second. What does your code scale to?
(f) "Real-time" is different than the "network" apps I'm talking about. The entire idea is "control plane" vs. "data plane" separation. Control operations that need real-time guarantees are very different than high-throughput data operations.
(g) Multi-threaded techniques that try to interleave multiple threads on a single core suck for high network throughput. Just ask Apache.
A lot of what you're saying hinges on being specific to what you call "really scalable network applications," but you never said anything about that in your post.
(d) How is "multi-core doesn't mean multi-threaded" wrong??? Snort is a multi-core app today and isn't multi-threaded.
This is, again, between whether you want to be "computer science correct," or use developer lingo that is not even necessarily consistent among all developer and OS subcultures. I think saying it the way you are saying it, instead of "You could use multiple processes to get parallelism, instead of multiple threads," is misleading.
What does your code scale to?
Well, I'm a real-time guy, so my thread scales down to microsecond overheads.
Did you change the text in your post that explains pthread_mutex_t? If not, I probably made a mistake to pick on that, because what you have there is pretty clear.
After reading this I am wondering if there have been any measurements taken of the time 'wasted' in the kernel. You make it sound like there's a lot less wasted than is claimed by the author.
I'm not convinced that the article has it exactly wrong. There are obvious problems, like all the extra moving parts your code has. But what about runtimes that multiplex their threads (or what have you) onto one host thread + scheduler per core? It seems to fit the description, and it has been taken up by Go and Rust and probably others (what is Erlang's multicore story these days?). It also avoids the obvious problem, since the extra moving parts are in the runtime, not your application (much like they were in the kernel, not your application, in the standard multi-threaded case).
I still don't know if it saves you much. IIUC, the choice to multiplex is motivated by other reasons.
If you want to program using a ton of cheap threads (like in Go or Erlang or something), you pretty much have to do it in the runtime (i.e., use user-level threads), because you will waste a ton of time in the kernel on initialization and context switching, and you will waste a lot of memory.
In general, if you're using any other programming paradigm (i.e., 90+% of all software), you don't need tons of cheap threads; your application probably only needs at most one per core. In that case, you aren't constantly context switching (except to legitimately multitask with other programs), and you aren't using up that much kernel memory to hold thread state (because you only have a few threads, not a ton). So, you should just use the kernel as it was intended.
My assumption in reading the article is that the author was very much talking about the 90+% case (he didn't really say what he was talking about in the article).
There actually is another case though, which I think is what the author is really getting at. What if you have a small number of threads that need to be extremely high performance, and they have extremely short critical sections (which is not necessarily the common case across most applications)? Then, you would not want your threads to constantly suspend in the kernel every time there is a tiny bit of contention. You'd rather have them spin for a few cycles and just wait until they can actually get the lock... or just use hardware features (like the compare and exchange instructions) to abstract away the issue.
To do either of those, yeah, you pretty much need to code it yourself or (better) get a third party library. AFAIK. You would think pthreads would have a spinlock option that can never suspend (unlike futex, which can suspend), but if it does, I'm not aware of it.
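(For what it's worth, pthreads does provide exactly this: pthread_spinlock_t busy-waits and never puts the thread to sleep. A minimal sketch:)

    #include <pthread.h>

    static pthread_spinlock_t lock;
    static long shared_counter;

    void init(void)
    {
        /* PTHREAD_PROCESS_PRIVATE: only threads of this process use the lock. */
        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    }

    void bump(void)
    {
        /* Spins (busy-waits) if contended; the thread is never suspended. */
        pthread_spin_lock(&lock);
        shared_counter++;
        pthread_spin_unlock(&lock);
    }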
In fact, formally speaking, there are lock-free algorithms, which don't have a lock but can have an unbounded number of retries before they get to access the data. And then you have wait-free algorithms, which guarantee you get access to the data within a bounded number of retries.
Yeah, I get the different thread usage scenarios, but I'm still wondering about measurements. I'm less concerned about getting enough performance than I am about wasted performance, which is why I am thinking about measurements. If it's 10% that's something, if it's 0.0000000001% then who cares?
I also wonder about that 90+% case. Is it really 1:1 threads:cores at most? In my mind and experience, the architecture of the program (and of course the nature of the problem) should have a greater impact on the number of threads than leaving it at "are we using Go/Rust/Erlang vs. C/C++/Java." (Most recently I had several hundred coroutines on a single thread because the problem demanded it. In my world, the contended resource is never the CPU.) I'm wondering how much of the 90+% is really 1:1 and, of that, how much really ought to be. Few problems are "as concurrent as the number of processors in the system" in the abstract. This is more a question of nice design than anything else, but that's worth something (and the same reason we shouldn't needlessly include a scheduler in our programs).
Another point is that most of these high-concurrency languages aim to satisfy that same 90+% in addition. In doing so, I wonder how much trouble they save (if any, given the runtime overhead, though within the runtime the scheduler itself had better be better than defaulting to the OS scheduler.)
But yes, I absolutely agree that you shouldn't be writing a scheduler in your application just to eke out a little bit of CPU overhead.
Hm... (c) needs a little qualification. Certainly in a multithreaded environment it's possible to end up in a cache contention situation accidentally, simply because you or the compiler happened to put two "written often from a single thread" values in the same cache line. Think of things like global/singleton objects which don't really need to be shared but are. A multiprocess design that limits shared memory to stuff that's truly shared won't have that pitfall.
No, that's not right.
Cache conflicts happen between processes in the same way, and for the same reasons, and just as often, as they do between threads.
No, you misunderstand. A global Java object (or whatever) in one process will be shared across all threads. If it has fields that get written "often", they are likely to end up thrashing the L1/L2 caches with snoop evictions. Often those writes are essentially spurious (global statistics, "current" thingy pointers) and don't really need to be shared across all threads. This really happens, and it's very easy to do.
An essentially identical architecture that defined the same object within a single address space is, quite clearly, not going to be subject to that bug.
An essentially identical architecture that defined the same object within a single address space
Well, then it would be the same architecture, because the one you described (one process with multiple threads) is also a single address space.
Still, I think you're wrong. The cache system doesn't have any awareness of threads or processes. So, you get spurious evictions all the time from completely unrelated programs, no matter what you do. Or from unrelated data in the same program that just happens to map to the same cache line.
If you still disagree with me and you can give me a specific example, I'd really appreciate it, because if I am misunderstanding something, it's a big deal, and I want to double check myself.
The cache system isn't aware of threads or processes, but it's clearly aware of address spaces. Make a big global Foo object in a single process and spawn a hundred threads to write to it. They'll be contending on the same cache line. You're with me that far, right?
Now take the same architecture and spawn a hundred processes. Those Foo objects now live in different physical pages and thus writes to them from all the processes live happily in L1 cache.
Obviously not all architectures work like this. If Foo is "really" shared, then nothing can help the contention. But usually it's not, it's just that the code was written by someone who didn't think about cache contention. That kind of performance bug is really easy to write. And a reasonable fix for it is "don't use multithreaded architectures when you don't need them".
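A sketch of the kind of accidental contention being described: two "per-thread" counters that happen to share a cache line, versus padding each onto its own line (the 64-byte line size is an assumption about the hardware):

    #include <pthread.h>
    #include <stdio.h>

    /* Both counters land in one cache line, so writes from two cores keep
     * invalidating each other's copy (false sharing). */
    struct shared_stats { long a; long b; };

    /* The fix: pad each counter out to its own (assumed 64-byte) cache line. */
    struct padded_stats {
        long a; char pad1[64 - sizeof(long)];
        long b; char pad2[64 - sizeof(long)];
    };

    static struct shared_stats s;

    static void *bump_a(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000; i++) s.a++;
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000; i++) s.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Time this, then switch s to struct padded_stats and compare. */
        printf("%ld %ld\n", s.a, s.b);
        return 0;
    }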
The cache system isn't aware of threads or processes, but it's clearly aware of address spaces.
No, it's definitely not aware of address spaces. Caching is based on the physical address.
In fact, it's based on the low-order bits of the physical address. So, things that are in different physical pages but have the same lower-order bits can conflict.
"Multi-tasking was the problem of making a single core run multiple tasks. The core would switch quickly from one task to the next to make it appear they were all running at the same time, even though during any particular microsecond only one task was running at a time. We now call this “multi-threading”, where “threads” are lighter weight tasks."
I must have missed something.
Multi-tasking is multiple processes, which mostly have nothing to do with each other, switched in and out of the processor(s) by the OS, which do not share in-process memory or context. The programmer does nothing to make this happen, and normally has little to no say in it.
Multi-threading is a single process, where the threads carefully share context and memory, and they're all working roughly on the same thing; the programmer makes this happen explicitly, and usually fucks it up.
"Multi-threaded software goes up to about four cores, but past that point, it fails to get any benefit from additional cores."
Is there any basis for this assertion, or is it just the fact that his system has only 4 cores?
My desktop is 6 hyperthreaded cores (12 "cores" total).
It's a general principle, and it's why mobile phone makers aren't putting a lot more cores in their devices, and why they didn't go the Atom route of hyperthreading: code on cellphones fails to take advantage of the additional cores.
This article speaks in broad generalities that seem to be based on the author's particular interests.
There is no reason why multi-threaded software can't scale near linearly for the right problems.
My point was to talk about the wrong problems.
Networking is actually a "right" problem, and should be nearly embarrassingly parallel, since two cores can process two unrelated packets at the same time. But if you look at the network stacks in open-source projects, you see a lot of fail. That's the point of my post: a reference to point to for why that spinlock in your networking code is a bad idea, and why it's probably better to replace it with an atomic operation or a lock-free alternative.
> There is no reason why multi-threaded software can't scale near linearly for the right problems.
For the other problems, you're limited by Amdahl.
There are two fundamental ways of doing this: pipelining and worker-threads. In the pipeline model, each thread does a different task, then hands off the task to the next thread in the pipeline.
Why not just implement the pipeline entirely in one thread, and then replicate them (just like worker threads)?
What will happen is that the first worker thread will be executing stage 2, while the second worker thread is executing stage 1. The OS will automatically schedule them on different cores.
Am I missing something?
When there is high contention for a resource, it's better that one thread do it and access it contention-free, rather than make multiple threads contend for it.
Even so-called "lock-free" synchronization has locks; they are just very short (around 30 clock cycles). Therefore, you still want to avoid contention if you can figure out a way to do it.
I didn't really go into enough detail in my example, but pulling packets off the network is a good example. You can have one thread do it, and therefore need no contention. Then you can set up multiple single-producer/single-consumer ring buffers to forward those packets to worker threads that complete the processing of each packet. Thus, you essentially get rid of all the atomic/lock-free contention you would otherwise have.
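A minimal single-producer/single-consumer ring buffer along those lines (a sketch using C11 atomics, not Snort's or the post's actual code): the dispatcher thread calls ring_push, exactly one worker calls ring_pop, and neither side ever takes a lock.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                 /* must be a power of two */

    struct ring {
        void *slots[RING_SIZE];
        atomic_size_t head;                /* written only by the producer */
        atomic_size_t tail;                /* written only by the consumer */
    };

    /* Producer side: safe without locks because only one thread writes head. */
    static bool ring_push(struct ring *r, void *pkt)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                  /* full */
        r->slots[head & (RING_SIZE - 1)] = pkt;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer side: likewise, only one thread writes tail. */
    static void *ring_pop(struct ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                   /* empty */
        void *pkt = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return pkt;
    }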
*When there is high contention for a resource, it's better that one thread do it and access it contention-free, rather than make multiple threads contend for it.*
Right, and that's what I'm suggesting. So, if you have a pipelined architecture, keep the pipeline inside worker threads, instead of across them, except when you need to distribute work to the workers (i.e., the first stage, where you do something like take packets from the network). I think we agree on all that. I was just curious if there was ever a reason to do it the other way, i.e., having a separate thread for each stage of the entire pipeline. It seemed like you were suggesting that was useful in some cases, but perhaps I'm reading into things too much.
If a certain stage has global state, a pipeline architecture may result in lower contention, because that state is only accessed by one core and thus doesn't have to be locked. Lock-free producer-consumer communication between stages can be efficient. (AFAIK LineRate just took pipelining to the bank.) If your app has no global state, a run-to-completion (aka worker) architecture is more efficient.
I feel like this is "common knowledge," so maybe I'm just missing something, but I'm not convinced yet.
Basically, what we're saying is that, first, there is a stage that cannot be accessed concurrently, and second, multiple threads may be waiting for that stage to complete before they progress.
So the downstream, waiting threads are either going to have to wait to acquire the lock, or they're going to have to wait for the producer to produce the data.
Either way, they're waiting.
If the critical section is really short and highly contended, you may not want threads waiting on the lock to suspend; you want them to spin, or to keep retrying (which is what happens in lock-free algorithms). OK, fine. Why isn't that solution just as good as having an extra thread to be the producer, and then waiting on the producer to produce more data?
I recall similar results on nearly all applications since late in my MSc studies: mutexes are bad; pipes and sockets give better scaling. Thread sync primitives sometimes scale only up to 8-12 cores, whereas multiprocess applications usually end up much faster. In the age of GCed VMs, one also needs to consider the synchronization cost of GC.
Maybe I'm missing something, but I'm not getting the point. It seems to me there's no fundamental difference between multiprocess with shared-memory regions for anything that needs to be shared and multithreaded with mostly thread-local storage plus some shared data. The kernel is going to multiplex your single-threaded processes among the available cores just the same as it will multiplex your multiple threads among the available cores.
Multi-threaded techniques, like mutexes, don’t scale on multi-core. Conversely, as Snort demonstrates, you can split a problem across multiple processes instead of threads, and still have multi-core code.
Synchronization is synchronization. There are inter-process synchronization primitives, including mutexes. And you can use lock-free synchronization in a single-process multi-threaded scenario.
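For instance, a plain pthread mutex can live in shared memory and synchronize two processes just as it would two threads; the only difference is one attribute (a sketch, error handling mostly elided):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous shared mapping, visible to this process and its fork()ed child. */
        pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* The only real difference from the multi-threaded case: */
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);

        if (fork() == 0) {                 /* child process */
            pthread_mutex_lock(m);
            puts("child in critical section");
            pthread_mutex_unlock(m);
            return 0;
        }
        pthread_mutex_lock(m);             /* parent process */
        puts("parent in critical section");
        pthread_mutex_unlock(m);
        return 0;
    }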