Synchronous Processors (2016)
yodaiken.com
This article would make more sense if it included the results of a workload simulation showing how much time is lost to interrupt latency and how much processor time could be saved by a different technique.
The Transputer and its successors, the XMOS embedded SoCs, already implement a lot of the features mentioned in this blog post by Yodaiken.
…and the subject should have “(2016)” added…
Fixed.
And how would you do context switches if a CPU-bound task does not yield and you do not have interrupts to ... interrupt ... that?
The article says this about it:
> We could have a simple cycle timer switch on each core so that after the timer expires there is an interrupt-like jump to a function to see what to do next. That jump would be perfectly synchronous since predicting the next jump can be done with 100% accuracy (or nearly 100%).
Cycle accurate? So now you can predict how long RAM latency is down to the cycle, refreshes and DMA be damned?
Interrupt-like? So what will you do? Save the context of this thread, load another... hm... sure sounds like what we already do.
In other words, a timer interrupt - with saving of state and appropriate unwinding of pipeline state (abandoning half-done or out-of-order instructions, etc.).
Also the "do system calls by queuing requests to another CPU" is kind of at odds with "we don't need cache coherency"
Not necessarily; it's an interesting idea. Thanks to the branch predictor the CPU already has a virtual view of the instruction stream. If we tolerate a bit of latency, all we have to do is inject a "jump to ISR" magic instruction in the predicted stream. Rather like self-modifying code, except without modifying the code in memory, just at the instruction fetch point. State still has to be saved but that can be done with PUSH instructions in the ISR.
> Also the "do system calls by queuing requests to another CPU" is kind of at odds with "we don't need cache coherency"
Can be done with mailboxes/FIFOs, but yes this requires a dedicated design. And of course the CPU that does the call is then idle I think?
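Something along these lines, just to make the mailbox idea concrete: a single-producer/single-consumer ring in C that an application core could use to hand syscall requests to a dedicated "kernel" core. All names here are made up for illustration, nothing is from the article:

    /* Hypothetical sketch: an SPSC mailbox for passing syscall requests to a
     * dedicated kernel core, so only the two indices need ordered access. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define MBOX_SLOTS 16  /* power of two so indices wrap cheaply */

    struct syscall_req {
        uint32_t number;   /* which system call */
        uint64_t args[4];  /* its arguments */
    };

    struct mailbox {
        struct syscall_req slots[MBOX_SLOTS];
        _Atomic uint32_t head;  /* advanced only by the consumer (kernel core) */
        _Atomic uint32_t tail;  /* advanced only by the producer (app core) */
    };

    /* Producer side, runs on the application core. Returns false if full. */
    static bool mailbox_post(struct mailbox *m, const struct syscall_req *req)
    {
        uint32_t tail = atomic_load_explicit(&m->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&m->head, memory_order_acquire);
        if (tail - head == MBOX_SLOTS)
            return false;                    /* mailbox full, caller must wait */
        m->slots[tail % MBOX_SLOTS] = *req;  /* copy the request into the slot */
        atomic_store_explicit(&m->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side, polled by the kernel core. Returns false if empty. */
    static bool mailbox_take(struct mailbox *m, struct syscall_req *out)
    {
        uint32_t head = atomic_load_explicit(&m->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&m->tail, memory_order_acquire);
        if (head == tail)
            return false;                    /* nothing queued */
        *out = m->slots[head % MBOX_SLOTS];
        atomic_store_explicit(&m->head, head + 1, memory_order_release);
        return true;
    }

The producer would then either spin/sleep until the reply shows up in a second mailbox or go do other work, which is where the "idle CPU" question comes in.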
Good explanation of my overly terse note. The current standard interrupt architecture imposes enormous latency, and the synchronous timer could be more precise. The simplest implementation would just "fetch" a jmp every N instructions (with N programmable) - just like voluntary switching, but with the processor doing the volunteering on the program's behalf.
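Roughly, the software analogue would look like this (a sketch only; the names and the value of N are illustrative, not from the article):

    /* Rough software analogue of "fetch a jmp every N instructions": a yield
     * check inserted at a fixed, programmable interval instead of an
     * asynchronous timer interrupt. */
    #include <stdint.h>

    #define SWITCH_INTERVAL 10000       /* "N": would be programmable per core */

    static uint64_t insns_since_switch; /* stand-in for the per-core counter */

    /* Stub: in the real design this is the "see what to do next" routine the
     * fetched jump would land on. */
    static void scheduler_decide_next(void)
    {
        /* save context, pick the next runnable thread, etc. */
    }

    /* The hardware version plants the jump in the fetch stream; the software
     * analogue is a check the compiler/runtime inserts at loop back-edges. */
    static inline void preemption_point(void)
    {
        if (++insns_since_switch >= SWITCH_INTERVAL) {
            insns_since_switch = 0;
            scheduler_decide_next();    /* fully synchronous, hence predictable */
        }
    }

    /* A CPU-bound loop with the planted preemption points. */
    long sum_array(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++) {
            s += a[i];
            preemption_point();         /* free in the hardware version */
        }
        return s;
    }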
You still need to save the PC and have a place to save it ....
One approach is to use a barrel processor which switches threads after each cycle or instruction: https://en.m.wikipedia.org/wiki/Barrel_processor
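For a rough feel of the barrel idea, here is a toy model in C (purely illustrative, not how any real barrel processor is programmed): each "cycle" executes one step of the next hardware thread in strict round-robin order, so per-thread timing is fixed and nothing ever has to be asynchronously interrupted.

    /* Toy model of a barrel processor: one step of a different thread context
     * every cycle, in strict round-robin order. */
    #include <stdio.h>

    #define NUM_HW_THREADS 4

    struct hw_thread {
        int  pc;        /* program counter of this thread */
        long acc;       /* an accumulator standing in for register state */
    };

    /* One "instruction" of work for a given thread; a stand-in for decode/execute. */
    static void step(struct hw_thread *t, int tid)
    {
        t->acc += tid + 1;
        t->pc  += 1;
    }

    int main(void)
    {
        struct hw_thread threads[NUM_HW_THREADS] = {0};

        /* The barrel: rotate through the thread contexts, one per cycle. */
        for (long cycle = 0; cycle < 16; cycle++) {
            int tid = cycle % NUM_HW_THREADS;
            step(&threads[tid], tid);
        }

        for (int tid = 0; tid < NUM_HW_THREADS; tid++)
            printf("thread %d: pc=%d acc=%ld\n", tid, threads[tid].pc, threads[tid].acc);
        return 0;
    }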
That does not solve the problem at all. It just increases the number of "hyper threads"; if a new process gets started while all cores are busy, that process might never run.
It solves the problem for environments where problems like interrupt latency and timing criticality usually show up - embedded and real-time systems. In many systems, the set of running tasks is fixed - there are even some very simple real-time operating systems (such as some OSEK configurations in the automotive sector) which require the set of tasks to be statically defined at compile time. After all, you don't suddenly feel the urge to start a game of Doom on your car's ABS controller :) (though, of course, somebody will try to do this...).
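As a toy illustration of such a compile-time task set (not real OSEK syntax or APIs, just the idea; all names are invented):

    /* Statically defined task set: the scheduler only ever iterates over a
     * table fixed at compile time, so no task can appear at run time and
     * starve the others. */
    #include <stdint.h>

    typedef void (*task_fn)(void);

    struct task {
        const char *name;
        task_fn     entry;
        uint8_t     priority;   /* higher value = more urgent */
    };

    static void abs_control_task(void) { /* read wheel sensors, drive valves */ }
    static void diagnostics_task(void) { /* low-priority housekeeping */ }

    /* The complete task set, known at compile time. */
    static const struct task task_table[] = {
        { "abs_control", abs_control_task, 10 },
        { "diagnostics", diagnostics_task,  1 },
    };

    enum { NUM_TASKS = sizeof task_table / sizeof task_table[0] };

    /* A trivially simple dispatcher: run every task once per cycle, in
     * priority order (the table above is already sorted for brevity). */
    void run_one_cycle(void)
    {
        for (int i = 0; i < NUM_TASKS; i++)
            task_table[i].entry();
    }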
The (early) XMOS chips, for example, run at 500 MHz with four threads, or, if you needed more threads, you could also configure the system to run eight threads at half the speed, IIRC. If you used, e.g., three threads, some execution time remained unused in the four-thread mode; there was no arbitrary division of time by the number of threads.
For real-time-critical systems, you could then still run up to seven critical threads at guaranteed speed and reserve the remaining one for non-timing-critical tasks (which you could then schedule using cooperative multitasking).
The RAM was a fast on-chip SRAM, so there were no problems with refresh, access latencies etc. that you have with DRAM. However, you were constrained to 64 kB RAM per core (probably not enough to run Doom...).
The XMOS development toolchain even includes a real-time analyzer for the C/C++ code you throw at it. Unfortunately, most of the XMOS toolchain is closed source.
Thread state is just a bit of memory.
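Concretely, something like this (field names are arbitrary; a real implementation mirrors the target ISA's register file):

    /* Illustration of "thread state is just a bit of memory": everything the
     * switch code has to park somewhere before running the next thread. */
    #include <stdint.h>

    struct thread_state {
        uint64_t pc;        /* where to resume */
        uint64_t sp;        /* the thread's stack pointer */
        uint64_t gpr[31];   /* general-purpose registers */
        uint64_t flags;     /* condition codes / status bits */
        /* plus FP/vector registers on machines that have them */
    };

    /* One save area per hardware thread or software task. */
    #define MAX_THREADS 8
    static struct thread_state save_area[MAX_THREADS];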
What if you have multiple CPUs and you don't want the task to yield (e.g. for performance or latency reasons)?
That limits your OS to numThreads < numCpus
What if we have multiple classes of threads with varying scheduling and interruption policies?
Either you have interrupts to interrupt them, you have more cores than threads, or you risk starvation. Simple.
No opportunity for nuance wherein some of the compute resources are reserved for certain types of threads?
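On a conventional OS you can already approximate that by pinning thread classes to disjoint core sets (together with something like cgroup cpusets or isolcpus to keep other work off the reserved cores). A minimal Linux/pthreads sketch, with arbitrary core numbers chosen purely for illustration:

    /* Sketch: latency-critical threads get cores 0-1 to themselves, background
     * threads are confined to cores 2-3. Core numbers are arbitrary examples. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to the CPU range [first, last]. */
    static int pin_to_cpus(int first, int last)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = first; cpu <= last; cpu++)
            CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *critical_worker(void *arg)
    {
        (void)arg;
        pin_to_cpus(0, 1);   /* reserved cores: no other class runs here */
        /* ... latency-critical work ... */
        return 0;
    }

    static void *background_worker(void *arg)
    {
        (void)arg;
        pin_to_cpus(2, 3);   /* everything else shares the remaining cores */
        /* ... best-effort work ... */
        return 0;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, 0, critical_worker, 0);
        pthread_create(&b, 0, background_worker, 0);
        pthread_join(a, 0);
        pthread_join(b, 0);
        return 0;
    }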