12th November, 2025
Introduction
Tenstorrent’s AI accelerator chips consist of tiles arranged in a grid, connected with a network-on-chip for efficient dataflow processing. Each tile features a vector unit that executes a limited number of SIMD operations across 32 lanes.
where(condition, t, f, out) selects values from either
t or f, depending on the corresponding value
in condition, writing the result to out.
Below we implement where on the vector unit, achieving
optimal throughput of 3 cycles per row for the common in-place case and
4 cycles per row for the out-of-place case.
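As a point of reference, the selection semantics described above can be sketched in Python. This is a hypothetical lane-level model for illustration, not device code:

```python
# Hypothetical lane-level reference model of where(): lanes where
# condition is non-zero take t, the rest take f. Not device code.
LANES = 32

def where_reference(condition, t, f):
    assert len(condition) == len(t) == len(f) == LANES
    return [tv if c != 0 else fv for c, tv, fv in zip(condition, t, f)]
```

The hardware implementations below compute exactly this, one 32-lane row at a time.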
Sequential Code
Our kernel is given four offset parameters, representing
condition, t, f, and
out. Pseudocode for a relatively optimised sequential
solution on the 32-lane vector unit looks like this:
# parameters: offset0, offset1, offset2, offset3
condition = load(offset0)
result = load(offset1)
if condition == 0:
    result = load(offset2)
store(result, offset3)

This is equivalent to the following assembly code, achieving 6 cycles per row:
// Parameters: offset0, offset1, offset2, offset3
// ADDR_MOD_7: doesn't increment counters
// ADDR_MOD_6: autoincrements Dst counter
sfpload L0, 0, ADDR_MOD_7, offset0
sfpload L1, 0, ADDR_MOD_7, offset1
sfpsetcc 0, L0, L0, SFPSETCC_MOD1_LREG_EQ0
sfpload L1, 0, ADDR_MOD_7, offset2
sfpencc 0, L0, L0, 0
sfpstore L1, 0, ADDR_MOD_6, offset3

Parallel Execution via SFPLOADMACRO
The vector unit has five sub-units:
load, simple, mad, round, and store. The only way to use more than one
sub-unit at a time is via SFPLOADMACRO,
which allows us to schedule up to one instruction per sub-unit to
execute during future cycles, subject to various constraints.
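A rough mental model of this scheduling can be sketched in Python. The names and structure here are invented for illustration; the real constraints are more involved:

```python
# Hypothetical model of SFPLOADMACRO scheduling: the macro executes
# its load immediately and queues at most one instruction per
# sub-unit to fire after a delay. Names are illustrative only.
SUB_UNITS = ("load", "simple", "mad", "round", "store")

class MacroScheduler:
    def __init__(self):
        self.cycle = 0
        self.pending = []  # (fire_cycle, sub_unit, op)

    def load_macro(self, load_op, scheduled):
        # scheduled: iterable of (delay, sub_unit, op),
        # at most one entry per sub-unit.
        units = [u for _, u, _ in scheduled]
        assert len(units) == len(set(units))
        assert all(u in SUB_UNITS for u in units)
        for delay, unit, op in scheduled:
            self.pending.append((self.cycle + delay, unit, op))
        return (self.cycle, "load", load_op)

    def tick(self):
        # Advance one cycle and return the instructions firing now.
        self.cycle += 1
        firing = [p for p in self.pending if p[0] == self.cycle]
        self.pending = [p for p in self.pending if p[0] != self.cycle]
        return firing
```

Issuing one macro can thus put work on several sub-units over the following cycles, which is what the schedules below exploit.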
The following diagram shows the schedule for our sequential code, with register liveness on the right.
In-Place Output
The most common pattern is to call where(0, 1, 2, 0), so
that the result is written to condition.
This allows us to schedule the SFPSTORE
to write the result back to condition, while the next SFPLOAD
executes at the same time.
Note that one of the constraints of SFPLOADMACRO is that
a scheduled SFPSTORE has to write to the address its macro
loaded from, so this is only possible when the output is written to
condition.
We require two macros, followed by a regular SFPLOAD:
1. SFPLOADMACRO: load from condition to the L0 register. Also, schedule two additional instructions:
   - After 1 cycle: enable only the lanes where condition == 0 via SFPSETCC.
   - After 3 cycles: SFPSTORE the value in L0 back to the condition address.
2. SFPLOADMACRO: load from t to L0, and schedule:
   - After 1 cycle: re-enable all lanes via SFPENCC.
3. SFPLOAD: load from f, overwriting the value in L0 in lanes that are enabled. Also, auto-increment the address counters (which happens regardless of lane flags).
Finally, the SFPSTORE
scheduled in the first step will write the result in L0
back to memory. At this point, the next SFPLOADMACRO call
can be executed, simultaneously with the SFPSTORE.
The trick here is that it’s safe to schedule an instruction that reads from a register for the same time as an instruction that writes to that register: the read happens at the beginning of the cycle, and the write happens at the end of the cycle.
For example, SFPSETCC L0 L0 reads L0 while
SFPLOAD L0 1 writes to L0 during the same
cycle, as illustrated by the diagram below, with register liveness shown
on the right.
Note also that we require only one register, since the
condition value only needs to live for one cycle.
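The full 3-cycle sequence can be sketched cycle-by-cycle in Python. This is a hypothetical model (4 lanes for readability; the real unit has 32), not device code:

```python
# Hypothetical cycle-by-cycle sketch of the in-place schedule.
# Reads happen at the start of a cycle and writes at the end, so
# SFPSETCC can read the condition from L0 on the same cycle that
# the load of t overwrites it.
def run_row(mem, cond_addr, t_addr, f_addr):
    # Cycle 1: SFPLOADMACRO loads condition into L0.
    l0 = list(mem[cond_addr])
    # Cycle 2: SFPSETCC (scheduled by macro 1) reads L0 at the start
    # of the cycle; macro 2's load of t writes L0 at the end.
    enabled = [c == 0 for c in l0]      # enable lanes that want f
    l0 = list(mem[t_addr])
    # Cycle 3: the plain SFPLOAD of f writes only the enabled lanes;
    # SFPENCC (scheduled by macro 2) re-enables all lanes at cycle end.
    l0 = [fv if en else old
          for en, old, fv in zip(enabled, l0, mem[f_addr])]
    # Cycle 4 (overlapping the next row's first macro): the SFPSTORE
    # scheduled by macro 1 writes L0 back to the condition address.
    mem[cond_addr] = l0

mem = {0: [1, 0, 2, 0], 1: [10, 11, 12, 13], 2: [20, 21, 22, 23]}
run_row(mem, 0, 1, 2)
print(mem[0])  # -> [10, 21, 12, 23]
```

Lanes with a non-zero condition keep t, and the rest receive f, matching the sequential code.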
3 cycles is theoretically optimal for this case, since the operation requires loading from three different memory addresses.
Out-of-Place Output
If we are required to write the result to a distinct address,
e.g. where(0, 1, 2, 3), we can no longer use SFPLOADMACRO
to schedule the SFPSTORE,
as it can only write to the address it loaded from.
Instead, we add a regular SFPSTORE
instruction, achieving 4 cycles per input row:
Avoiding Stalls
A stall between instructions will disrupt the timing-sensitive sequence of disabling and enabling lanes, leading to incorrect results. This can occur if the issuing RISC-V thread suffers instruction cache starvation, e.g. when a large number of unrolled instructions are present.
A stall after the first instruction prevents
sfpload L0 1 from executing unconditionally, as it now
executes after sfpsetcc:
A stall after the second instruction executes
sfpload L0 2 unconditionally, as it now executes after
sfpencc:
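The first stall scenario can be sketched lane-by-lane in Python. This is a hypothetical model for illustration only, not device code:

```python
# Hypothetical sketch of the first stall scenario: the load of t is
# delayed by one cycle, so it executes after SFPSETCC has already
# disabled the lanes where condition != 0, and those lanes never
# receive t. Illustrative only.
def run_row_with_stall(cond, t, f):
    # Cycle 1: load condition into L0.
    l0 = list(cond)
    # Cycle 2: SFPSETCC takes effect, but the load of t stalls.
    enabled = [c == 0 for c in l0]
    # Cycle 3: the delayed load of t now writes only enabled lanes.
    l0 = [tv if en else old for en, old, tv in zip(enabled, l0, t)]
    # Cycle 4: the load of f also writes only enabled lanes,
    # clobbering the t values those lanes just received.
    l0 = [fv if en else old for en, old, fv in zip(enabled, l0, f)]
    return l0  # lanes with condition != 0 still hold condition!

print(run_row_with_stall([1, 0, 2, 0], [10, 11, 12, 13], [20, 21, 22, 23]))
# -> [1, 21, 2, 23] instead of the correct [10, 21, 12, 23]
```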
This can be avoided by first loading the instructions into a replay
buffer, and using a single REPLAY
instruction to issue the 3 (or 4) instructions in sequence without
stalls. Alternatively, a MOP expansion could be used. Both the replay
buffer and the MOP expander sit in the Tensix frontend, so the
instruction sequence generated for a given REPLAY or MOP instruction
is always issued without stalls.
Note that it’s possible to specify
DelayKind=WaitForElapsedInstructions, which only decrements
the delay counter every time a thread issues an instruction to the
vector unit, instead of every cycle. However, a scheduled instruction
executes on the cycle after the cycle where it counts down from
1 to 0; or, in the case of Delay=0, the scheduled
instruction will execute on the next cycle regardless, and so this
doesn’t help avoid the issue.
Conclusion
Leveraging SFPLOADMACRO,
we achieve a theoretically optimal 3-cycle throughput for in-place
where and a 4-cycle throughput for out-of-place
where, while ensuring stable execution by avoiding
instruction stalls.
Acknowledgements
Thanks to Tenstorrent for sponsoring this work.
Addendum
If the output is written to t, this permits a slightly
more optimised sequential version, as we only need to load
f and then make the store conditional:
condition = load(offset0)
f = load(offset2)
if condition == 0:
    store(f, offset1)

This achieves 5 cycles per row:
// Parameters: offset0, offset1, offset2, offset3==offset1
// ADDR_MOD_7: doesn't increment counters
// ADDR_MOD_6: autoincrements Dst counter
sfpload L0, 0, ADDR_MOD_7, offset0
sfpload L1, 0, ADDR_MOD_7, offset2
sfpsetcc 0, L0, L0, SFPSETCC_MOD1_LREG_EQ0
sfpstore L1, 0, ADDR_MOD_6, offset1
sfpencc 0, L0, L0, 0

Unfortunately, this doesn’t translate into a 2-cycle
SFPLOADMACRO equivalent, since there are three distinct
addresses, and a macro-scheduled SFPSTORE can only write to
the same address as the SFPLOADMACRO load address.
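The addendum's conditional-store variant can still be sketched lane-by-lane in Python (a hypothetical model, not device code):

```python
# Hypothetical lane-level model of the addendum variant: the result
# is written over t, so only f is loaded and the store itself is
# made conditional on condition == 0. Illustrative only.
def where_into_t(cond, t_row, f_row):
    # t_row plays the role of both input t and output.
    out = list(t_row)
    for i, c in enumerate(cond):
        if c == 0:
            out[i] = f_row[i]   # the conditional SFPSTORE of f
    return out

print(where_into_t([1, 0, 2, 0], [10, 11, 12, 13], [20, 21, 22, 23]))
# -> [10, 21, 12, 23]
```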