20th November, 2025
Introduction
Below are optimised implementations of 32-bit integer multiplication for Tenstorrent’s Blackhole and Wormhole AI accelerators, achieving throughputs of:
- Blackhole: 8 cycles per row of 32 values.
- Wormhole: 31 cycles per row of 32 values.
Blackhole
Unlike its Wormhole predecessor, Blackhole has a dedicated instruction for
23-bit integer multiplication: SFPMUL24, which returns either the low or the
high 23 bits of the product of two 23-bit integers.
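For the derivation below, it helps to have a host-side model of this behaviour. Here is a minimal C sketch, assuming the masking described above; mul24_lo and mul24_hi are illustrative names, not actual intrinsics:

#include <stdint.h>
#include <stdio.h>

// Model of SFPMUL24 as described above: both inputs are implicitly masked
// to 23 bits, and either the low or the high 23 bits of the (up to 46-bit)
// product are returned.
uint32_t mul24_lo(uint32_t va, uint32_t vb) {
    uint64_t p = (uint64_t)(va & 0x7fffff) * (vb & 0x7fffff);
    return (uint32_t)(p & 0x7fffff);   // low 23 bits of the product
}

uint32_t mul24_hi(uint32_t va, uint32_t vb) {
    uint64_t p = (uint64_t)(va & 0x7fffff) * (vb & 0x7fffff);
    return (uint32_t)(p >> 23);        // high 23 bits of the product
}

int main(void) {
    // 0x7fffff * 0x7fffff = 0x3fffff000001: low 23 bits 0x000001, high 23 bits 0x7ffffe.
    printf("%06x %06x\n", (unsigned)mul24_lo(0x7fffff, 0x7fffff),
                          (unsigned)mul24_hi(0x7fffff, 0x7fffff));
    return 0;
}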
Sequential Code
Given two 32-bit integer values a and b, we
split each value into two 23-bit chunks, such that
a = (a1 << 23) | a0 and
b = (b1 << 23) | b0:
a0 = a & 0x7fffff
b0 = b & 0x7fffff
a1 = a >> 23
b1 = b >> 23
result =
((a1 * b1) << 46) # discard as low 32 bits are zero
+ ((a1 * b0) << 23) # low 23 bits are zero; only need low 9 bits of product
+ ((a0 * b1) << 23) # low 23 bits are zero; only need low 9 bits of product
+ (a0 * b0) # need all 32 bits of product

Rewrite to use mul24_lo and mul24_hi; note that we no longer need a0 and
b0, since mul24_lo and mul24_hi implicitly mask inputs to 23 bits:
result =
(mul24_lo(a1, b) << 23)
+ (mul24_lo( a, b1) << 23)
+ (mul24_hi( a, b) << 23) + mul24_lo(a, b)

Collect terms and do a single shift:
result =
(mul24_lo(a1, b) + mul24_lo(a, b1) + mul24_hi(a, b)) << 23
+ mul24_lo(a, b)
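As a sanity check before turning this into assembly, the identity can be verified on a host CPU. Below is a minimal C sketch, restating the mul24_lo/mul24_hi model from earlier; the helper names and the xorshift spot check are illustrative, not part of the actual kernel:

#include <stdint.h>
#include <stdio.h>

// Illustrative model of mul24_lo/mul24_hi: inputs masked to 23 bits,
// returning the low or high 23 bits of the product.
static uint32_t mul24_lo(uint32_t va, uint32_t vb) {
    return (uint32_t)(((uint64_t)(va & 0x7fffff) * (vb & 0x7fffff)) & 0x7fffff);
}
static uint32_t mul24_hi(uint32_t va, uint32_t vb) {
    return (uint32_t)(((uint64_t)(va & 0x7fffff) * (vb & 0x7fffff)) >> 23);
}

// The collected-terms formula above; all arithmetic wraps mod 2**32.
static uint32_t mul32(uint32_t a, uint32_t b) {
    uint32_t a1 = a >> 23, b1 = b >> 23;
    return ((mul24_lo(a1, b) + mul24_lo(a, b1) + mul24_hi(a, b)) << 23)
         + mul24_lo(a, b);
}

// Pseudo-random spot check against the compiler's own 32-bit multiply.
static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *s = x;
}

int main(void) {
    uint32_t state = 0x12345678;
    for (int i = 0; i < 1000000; ++i) {
        uint32_t a = xorshift32(&state), b = xorshift32(&state);
        if (mul32(a, b) != a * b) { printf("mismatch\n"); return 1; }
    }
    printf("all match\n");
    return 0;
}

This is equivalent to the following sequence of 13 instructions: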
sfpload L0, INT32, ADDR_MOD_7, offset0 ; a = load(offset0)
sfpshft -23, L0, L2, 1|4 ; a1 = a >> 23
sfpload L1, INT32, ADDR_MOD_7, offset1 ; b = load(offset1)
sfpshft -23, L1, L3, 1|4 ; b1 = b >> 23
sfpmul24 L1, L2, L9, L2, 0 ; cross0 = mul24_lo(b, a1)
sfpmul24 L0, L3, L9, L3, 0 ; cross1 = mul24_lo(a, b1)
sfpmul24 L0, L1, L9, L4, 1 ; hi = mul24_hi(a, b)
sfpmul24 L0, L1, L9, L0, 0 ; lo = mul24_lo(a, b)
sfpiadd 0, L2, L4, 4 ; hi += cross0
sfpiadd 0, L3, L4, 4 ; hi += cross1
sfpshft 23, L4, L4, 1 ; hi = hi << 23
sfpiadd 0, L4, L0, 4 ; lo += hi
sfpstore L0, INT32, ADDR_MOD_6, offset2 ; store(lo, offset2) with autoincrement

Parallel Execution via SFPLOADMACRO
13 cycles is pretty good, but can we go faster? As always, SFPLOADMACRO
is our friend. The following uses four unique macros to schedule four
reusable instruction templates, achieving a throughput of 8 cycles per
row.
SFPLOADMACRO restricts us to a maximum of four unique
macros, all sharing up to four instruction templates (one per sub-unit).
Important: Each macro can override either
VB or VC for its use of a particular template,
and sets its VD from that macro's particular instantiation.
All other fields are fixed (except for VD
in the case of SFPSTORE, which can be overridden with
VD=16 or VD=0).
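Purely to make the restriction concrete, it can be pictured as the following data. This is a conceptual C sketch only; the field names are mine, not the ISA's:

// Conceptual sketch only: the restriction described above, expressed as data.
// Up to four macros share up to four instruction templates (one per sub-unit).
enum override_kind { OVERRIDE_NONE, OVERRIDE_VB, OVERRIDE_VC };

struct template_use {            // one macro's use of one instruction template
    int template_index;          // which of the (up to) four shared templates
    enum override_kind which;    // at most one of VB or VC may be replaced
    int override_reg;            // the register substituted for that operand
    // VD is not listed: it comes from the SFPLOADMACRO instantiation itself.
};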
We rewrite our code to maximise reuse of instruction templates:
# Variable-to-register mappings:
#
# a,b = 0
# a1 = 1
# b1 = 2
# b2 = 3
# c = 4
#
# Instruction templates ([x] is replaced by SFPLOADMACRO's VD):
#
# 0: [x] = [x] >> 23 via SFPSHFT2(Imm12=-23, MOD1=SFPSHFT2_MOD1_SHFT_IMM)
# 1: [x] = mul24_lo(a or b, [x]) via SFPMUL24(VA=0, VC=9)
# 2: [x] = b1 << 23 via SFPSHFT(Imm12=23, VC=2)
# 3: [x] = [x] + c via SFPIADD(VC=3)
b = load(offset1) # L0 = b
a1 = loadmacro(offset0) # L1 = a
b1 = loadmacro(offset1) # L2 = b
a1 = a1 >> 23 # template 0; L1 = L1 >> 23
b1 = b1 >> 23 # template 0; L2 = L2 >> 23
a1 = mul24_lo(b, a1) # template 1; L1 = mul24_lo(L0, L1)
a = load(offset0) # L0 = a
b2 = loadmacro(offset1) # L3 = b
b1 = mul24_lo(a, b1) # template 1; L2 = mul24_lo(L0, L2)
c = mul24_hi(a, b2) # L4 = mul24_hi(L0, L3)
b2 = mul24_lo(a, b2) # template 1; L3 = mul24_lo(L0, L3)
b1 = b1 + a1 # L2 = L2 + L1
b1 = b1 + c # template 3; L2 = L2 + L4
c = loadmacro(offset2) # L4 = ?
c = b1 << 23 # template 2; L4 = L2 << 23
L16 = b2 + c # template 3; L16 = L3 + L4
store(L16, offset2) # store L16 to offset2

In the scheduling diagram below, colours denote variables, and
instruction slots are coloured by their destination variable, except in
the case of the special destination register L16, where we
colour the instruction by its original SFPLOADMACRO
destination variable.
Note that Blackhole has automatic stalling for SFPMUL24
when issued normally (not via SFPLOADMACRO). Our
mul24_hi and the following instruction are issued as
regular instructions, which means care needs to be taken to avoid
triggering the automatic
stalling logic.
Wormhole
Since Wormhole doesn’t have a dedicated instruction for multiplying
integers, we have to SFPCAST
to FP32 and use SFPMAD
instead. FP32 can represent 24-bit integers exactly, so we have to split
the input values into multiple chunks such that their products can be
represented exactly in FP32.
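As a quick host-side illustration of that 24-bit limit (plain C, nothing Tenstorrent-specific):

#include <stdio.h>

// FP32 has a 24-bit significand (23 stored bits plus an implicit leading 1),
// so every integer up to 2**24 is exact, but 2**24 + 1 already rounds away.
int main(void) {
    float exact   = 16777215.0f;       // 2**24 - 1: representable exactly
    float rounded = (float)16777217;   // 2**24 + 1: rounds to 16777216.0
    printf("%.1f %.1f\n", exact, rounded);
    return 0;
}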
8-bit Chunks
One option is to split each 32-bit value into four 8-bit chunks. This would require a total of 10 multiplications, 9 additions, and 3 shifts (not including the operations required to extract the initial 8-bit chunks):
# Given:
# a = (a3 << 24) | (a2 << 16) | (a1 << 8) | a0
# b = (b3 << 24) | (b2 << 16) | (b1 << 8) | b0
a*b =
(a3*b3 << 48)
+ (a3*b2 + a2*b3) << 40
+ (a3*b1 + a2*b2 + a1*b3) << 32
+ (a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2) << 16
+ (a1*b0 + a0*b1) << 8
+ (a0*b0)
# We only need the products whose low 32 bits are non-zero:
a*b & 0xffffffff =
(a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2) << 16
+ (a1*b0 + a0*b1) << 8
+ (a0*b0)
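The truncated identity is easy to confirm on a host CPU; a small C sketch (function name and test values are illustrative), whose two printed values always agree:

#include <stdint.h>
#include <stdio.h>

// The truncated 8-bit-chunk formula above; all arithmetic wraps mod 2**32.
static uint32_t mul32_8bit_chunks(uint32_t a, uint32_t b) {
    uint32_t a0 = a & 0xff, a1 = (a >> 8) & 0xff, a2 = (a >> 16) & 0xff, a3 = a >> 24;
    uint32_t b0 = b & 0xff, b1 = (b >> 8) & 0xff, b2 = (b >> 16) & 0xff, b3 = b >> 24;
    return ((a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24)
         + ((a2*b0 + a1*b1 + a0*b2) << 16)
         + ((a1*b0 + a0*b1) << 8)
         + a0*b0;
}

int main(void) {
    uint32_t a = 0xdeadbeef, b = 0xcafebabe;
    printf("%08x %08x\n", (unsigned)mul32_8bit_chunks(a, b), (unsigned)(a * b));
    return 0;
}

11-bit Chunks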
An alternative is to use three 11-bit chunks per
32-bit value. This has the attractive property of allowing SFPMAD
to be used to sum multiple products without loss of precision, as long
as the sum can be represented with 24 bits. Now, only 6 multiplications
are needed, 2 additions (since we get 3 additions via
SFPMAD for free), and 2 shifts:
# Given:
# a = (a2 << 22) | (a1 << 11) | a0
# b = (b2 << 22) | (b1 << 11) | b0
a * b & 0xffffffff =
(a0*b2 + a1*b1 + a2*b0) << 22 # top = max 23 bits
+ (a0*b1 + a1*b0) << 11 # mid = max 23 bits
+ (a0*b0) # low = max 22 bits

The catch is that there is no dedicated instruction for converting FP32
values larger than 16 bits to an integer. At first glance, multiple
instructions are required, e.g.:
sfpexexp 0, L5, L0, 2|8 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L5, L5, 0 ; extract mantissa with implicit bit set (even for zero!)
sfpiadd 22-23, L0, L0, 1|4 ; calculate shift amount (shift right 23, then left 22)
sfpshft2 L5, L0, L5, 5 ; logical shift of mantissa by amount
sfpencc 0, 0, 0, 0 ; re-enable lanes

Adam P. Goucher suggested this marvellous trick for implementing
fp32_to_int23 with a single instruction: if our value fits
into 23 bits, we can add 2^23 (in fp32), which shifts the mantissa
so that it is precisely the 23-bit integer that we want! All that
remains is to extract the mantissa bits, for which there is a dedicated
SFPEXMAN
instruction.
Happily, all of our intermediate sums fit into 23 bits.
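Putting the pieces together, the whole recipe can be prototyped on a host CPU before writing any assembly. Here is a C sketch of the arithmetic, assuming IEEE-754 float for FP32; the helper names are illustrative and this is not the actual kernel:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// The 2^23 trick: for 0 <= v < 2**23, the FP32 value v + 2**23 has the form
// 1.m * 2**23, so its 23 stored mantissa bits are exactly the integer v.
static uint32_t fp32_to_int23(float v) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);   // reinterpret the FP32 encoding
    return bits & 0x7fffff;           // keep only the mantissa bits
}

// 32-bit multiply via three 11-bit chunks per input, as derived above.
static uint32_t mul32_11bit_chunks(uint32_t a, uint32_t b) {
    float a0 = (float)(a & 0x7ff), a1 = (float)((a >> 11) & 0x7ff), a2 = (float)(a >> 22);
    float b0 = (float)(b & 0x7ff), b1 = (float)((b >> 11) & 0x7ff), b2 = (float)(b >> 22);
    float bias = 8388608.0f;                    // 2**23, i.e. 0x4b000000
    float top = a0*b2 + a1*b1 + a2*b0 + bias;   // each partial sum fits in 23 bits, so stays exact
    float mid = a0*b1 + a1*b0 + bias;
    float low = a0*b0 + bias;
    return (fp32_to_int23(top) << 22) + (fp32_to_int23(mid) << 11) + fp32_to_int23(low);
}

int main(void) {
    uint32_t a = 0xdeadbeef, b = 0xcafebabe;
    printf("%08x %08x\n", (unsigned)mul32_11bit_chunks(a, b), (unsigned)(a * b));  // identical
    return 0;
}

The mantissa extraction in fp32_to_int23 corresponds to SFPEXMAN, and the 8388608.0f bias is the 0x4b000000 constant held in L14 in the code below.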
The full code is 31 cycles:
; assumes constant registers L12 = 0x7ff; L13 = -11; L14 = 0x4b000000 (2.0f**23)
sfpload L0, INT32, ADDR_MOD_3, offset0 ; a0 = load(offset0)
sfpshft2 L0, L13, L2, 5 ; a1 = a0 >> 11
sfpshft2 L2, L13, L4, 5 ; a2 = a1 >> 11
sfpand 0, L12, L2, 0 ; a1 &= mask
sfpcast L2, L2, 0 ; a1 = int_to_fp32(a1)
sfpcast L4, L4, 0 ; a2 = int_to_fp32(a2)
sfpand 0, L12, L0, 0 ; a0 &= mask
sfpcast L0, L0, 0 ; a0 = int_to_fp32(a0)
sfpload L1, INT32, ADDR_MOD_3, offset1 ; b0 = load(offset1)
sfpshft2 L1, L13, L3, 5 ; b1 = b0 >> 11
sfpshft2 L3, L13, L5, 5 ; b2 = b1 >> 11
sfpcast L5, L5, 0 ; b2 = int_to_fp32(b2)
sfpmad L0, L5, L14, L5, 0 ; top = a0*b2 + 2**23
sfpand 0, L12, L3, 0 ; b1 &= mask
sfpcast L3, L3, 0 ; b1 = int_to_fp32(b1)
sfpmad L2, L3, L5, L5, 0 ; top += a1*b1
sfpand 0, L12, L1, 0 ; b0 &= mask
sfpcast L1, L1, 0 ; b0 = int_to_fp32(b0)
sfpmad L4, L1, L5, L5, 0 ; top += a2*b0
sfpmad L0, L3, L14, L6, 0 ; mid = a0*b1 + 2**23
sfpmad L0, L1, L14, L0, 0 ; low = a0*b0 + 2**23
sfpmad L2, L1, L6, L6, 0 ; mid += a1*b0
sfpexman 0, L0, L0, 1 ; low &= 0x7fffff
sfpexman 0, L6, L6, 1 ; mid &= 0x7fffff
sfpexman 0, L5, L5, 1 ; top &= 0x7fffff
sfpshft 11, 0, L6, 1 ; mid <<= 11
sfpshft 22, 0, L5, 1 ; top <<= 22
sfpiadd 0, L6, L0, 4 ; low += mid
sfpiadd 0, L5, L0, 4 ; low += top
sfpstore L0, INT32, ADDR_MOD_2, offset2 ; store(low, offset2) with autoincrement

Conclusion
Leveraging SFPLOADMACRO,
we achieve 8/32c throughput for 32-bit integer multiplication on
Blackhole. On Wormhole, using SFPMAD
with 11-bit chunks achieves 31/32c throughput.
Acknowledgements
- The amazing Adam P. Goucher for suggesting the 2^23 trick.
- Tenstorrent for sponsoring this work.