32-bit Integer Multiplication on Tenstorrent


20th November, 2025

Introduction

Below are optimised implementations of 32-bit integer multiplication for Tenstorrent’s Blackhole and Wormhole AI accelerators, achieving throughputs of 8 cycles per row on Blackhole and 31 cycles per row on Wormhole.

Blackhole

Unlike its Wormhole predecessor, Blackhole has a dedicated instruction for 23-bit integer multiplication: SFPMUL24, returning the low or high 23 bits of the product of two 23-bit integers.

Sequential Code

Given two 32-bit integer values a and b, we split each value into two 23-bit chunks, such that a = (a1 << 23) | a0 and b = (b1 << 23) | b0:

a0 = a & 0x7fffff
b0 = b & 0x7fffff
a1 = a >> 23
b1 = b >> 23

result =
  ((a1 * b1) << 46) # discard as low 32 bits are zero
+ ((a1 * b0) << 23) # low 23 bits are zero; only need low 9 bits of product
+ ((a0 * b1) << 23) # low 23 bits are zero; only need low 9 bits of product
+  (a0 * b0)        # need all 32 bits of product

Rewrite to use mul24_lo and mul24_hi; note that we no longer need a0 and b0, since mul24_lo and mul24_hi implicitly mask inputs to 23 bits:

result =
  (mul24_lo(a1,  b) << 23)
+ (mul24_lo( a, b1) << 23)
+ (mul24_hi( a,  b) << 23) + mul24_lo(a, b)

Collect terms and do a single shift:

result =
  (mul24_lo(a1, b) + mul24_lo(a, b1) + mul24_hi(a, b)) << 23
+  mul24_lo(a, b)

This is equivalent to the following sequence of 13 instructions:

sfpload L0, INT32, ADDR_MOD_7, offset0  ; a = load(offset0)
sfpshft -23, L0, L2, 1|4                ; a1 = a >> 23

sfpload L1, INT32, ADDR_MOD_7, offset1  ; b = load(offset1)
sfpshft -23, L1, L3, 1|4                ; b1 = b >> 23

sfpmul24 L1, L2, L9, L2, 0              ; cross0 = mul24_lo(b, a1)
sfpmul24 L0, L3, L9, L3, 0              ; cross1 = mul24_lo(a, b1)
sfpmul24 L0, L1, L9, L4, 1              ; hi = mul24_hi(a, b)
sfpmul24 L0, L1, L9, L0, 0              ; lo = mul24_lo(a, b)

sfpiadd  0, L2, L4, 4                   ; hi += cross0
sfpiadd  0, L3, L4, 4                   ; hi += cross1
sfpshft 23, L4, L4, 1                   ; hi = hi << 23
sfpiadd  0, L4, L0, 4                   ; lo += hi

sfpstore L0, INT32, ADDR_MOD_6, offset2 ; store(lo, offset2) with autoincrement
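
The collected formula is easy to check on the host. Below is a minimal C sketch, assuming the SFPMUL24 semantics described above (inputs implicitly masked to 23 bits, result being the low or high 23 bits of the 46-bit product); mul24_lo, mul24_hi, and mul32 are hypothetical helper names, not part of any Tenstorrent API:

#include <assert.h>
#include <stdint.h>

/* Host-side models of the two SFPMUL24 variants as described above. */
static uint32_t mul24_lo(uint32_t a, uint32_t b) {
    uint64_t p = (uint64_t)(a & 0x7fffff) * (b & 0x7fffff);
    return (uint32_t)(p & 0x7fffff);   /* low 23 bits of the product */
}

static uint32_t mul24_hi(uint32_t a, uint32_t b) {
    uint64_t p = (uint64_t)(a & 0x7fffff) * (b & 0x7fffff);
    return (uint32_t)(p >> 23);        /* high 23 bits of the product */
}

/* The collected formula; all uint32_t arithmetic wraps modulo 2**32. */
static uint32_t mul32(uint32_t a, uint32_t b) {
    uint32_t a1 = a >> 23, b1 = b >> 23;
    uint32_t hi = mul24_lo(a1, b) + mul24_lo(a, b1) + mul24_hi(a, b);
    return (hi << 23) + mul24_lo(a, b);
}

int main(void) {
    assert(mul32(0xdeadbeefu, 0x12345678u) == 0xdeadbeefu * 0x12345678u);
    return 0;
}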

Parallel Execution via SFPLOADMACRO

13 cycles is pretty good, but can we go faster? As always, SFPLOADMACRO is our friend. The following uses four unique macros to schedule four reusable instruction templates, achieving a throughput of 8 cycles per row.

SFPLOADMACRO restricts us to a maximum of four unique macros, all sharing up to four instruction templates (one per sub-unit).

Important: Each macro can override either VB or VC for its use of a particular template, and sets its VD from that macro’s particular instantiation. All other fields are fixed (except for VD in the case of SFPSTORE, which can be overridden with VD=16 or VD=0).

We rewrite our code to maximise reuse of instruction templates:

# Variable-to-register mappings:
#
# a,b = 0
# a1  = 1
# b1  = 2
# b2  = 3
# c   = 4
#
# Instruction templates ([x] is replaced by SFPLOADMACRO's VD):
#
# 0: [x] = [x] >> 23 via SFPSHFT2(Imm12=-23, MOD1=SFPSHFT2_MOD1_SHFT_IMM)
# 1: [x] = mul24_lo(a or b, [x]) via SFPMUL24(VA=0, VC=9)
# 2: [x] = b1 << 23 via SFPSHFT(Imm12=23, VC=2)
# 3: [x] = [x] + c via SFPIADD(VC=3)

b  = load(offset1)      # L0 = b
a1 = loadmacro(offset0) # L1 = a
b1 = loadmacro(offset1) # L2 = b

a1 = a1 >> 23           # template 0; L1 = L1 >> 23
b1 = b1 >> 23           # template 0; L2 = L2 >> 23
a1 = mul24_lo(b, a1)    # template 1; L1 = mul24_lo(L0, L1)

a  = load(offset0)      # L0 = a
b2 = loadmacro(offset1) # L3 = b
b1 = mul24_lo(a, b1)    # template 1; L2 = mul24_lo(L0, L2)
c  = mul24_hi(a, b2)    #             L4 = mul24_hi(L0, L3)
b2 = mul24_lo(a, b2)    # template 1; L3 = mul24_lo(L0, L3)

b1 = b1 + a1            #             L2 = L2 + L1
b1 = b1 + c             # template 3; L2 = L2 + L4
c  = loadmacro(offset2) #             L4 = ?
c  = b1 << 23           # template 2; L4 = L2 << 23
L16 = b2 + c            # template 3; L16 = L3 + L4
store(L16, offset2)     # store L16 to offset2

In the scheduling diagram below, colours denote variables, and instruction slots are coloured by their destination variable, except in the case of the special destination register L16, where we colour the instruction by its original SFPLOADMACRO destination variable.

Note that Blackhole has automatic stalling for SFPMUL24 when issued normally (not via SFPLOADMACRO). Our mul24_hi and the following instruction are issued as regular instructions, which means care needs to be taken to avoid triggering the automatic stalling logic.

Wormhole

Since Wormhole doesn’t have a dedicated instruction for multiplying integers, we have to SFPCAST to FP32 and use SFPMAD instead. FP32 can represent 24-bit integers exactly, so we split the input values into chunks small enough that their products can be represented exactly in FP32.

8-bit Chunks

One option is to split each 32-bit value into four 8-bit chunks. This would require a total of 10 multiplications, 9 additions, and 3 shifts (not including the operations required to extract the initial 8-bit chunks):

# Given:
#   a = (a3 << 24) | (a2 << 16) | (a1 << 8) | a0
#   b = (b3 << 24) | (b2 << 16) | (b1 << 8) | b0

a*b =
  (a3*b3 << 48)
+ (a3*b2 + a2*b3) << 40
+ (a3*b1 + a2*b2 + a1*b3) << 32
+ (a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2) << 16
+ (a1*b0 + a0*b1) << 8
+ (a0*b0)

# We only need the products whose low 32 bits are non-zero:

a*b & 0xffffffff =
  (a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2)         << 16
+ (a1*b0 + a0*b1)                 << 8
+ (a0*b0)
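
As a quick host-side sanity check of the truncated expansion (a hedged sketch; mul32_8x8 is a hypothetical helper name, and all uint32_t arithmetic wraps modulo 2**32):

#include <assert.h>
#include <stdint.h>

static uint32_t mul32_8x8(uint32_t a, uint32_t b) {
    uint32_t a0 = a & 0xff, a1 = (a >> 8) & 0xff, a2 = (a >> 16) & 0xff, a3 = a >> 24;
    uint32_t b0 = b & 0xff, b1 = (b >> 8) & 0xff, b2 = (b >> 16) & 0xff, b3 = b >> 24;
    return ((a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24)
         + ((a2*b0 + a1*b1 + a0*b2)         << 16)
         + ((a1*b0 + a0*b1)                 <<  8)
         +  (a0*b0);
}

int main(void) {
    assert(mul32_8x8(0xdeadbeefu, 0x12345678u) == 0xdeadbeefu * 0x12345678u);
    return 0;
}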

11-bit Chunks

An alternative is to use three 11-bit chunks per 32-bit value. This has the attractive property of allowing SFPMAD to sum multiple products without loss of precision, as long as the sum can be represented in 24 bits. Now only 6 multiplications, 2 additions (since we get 3 additions via SFPMAD for free), and 2 shifts are needed:

# Given:
#   a = (a2 << 22) | (a1 << 11) | a0
#   b = (b2 << 22) | (b1 << 11) | b0

a * b & 0xffffffff =
  (a0*b2 + a1*b1 + a2*b0) << 22 # top = max 23 bits
+ (a0*b1 + a1*b0)         << 11 # mid = max 23 bits
+ (a0*b0)                       # low = max 22 bits
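
Modelled on the host with ordinary floats, the scheme looks like the hedged sketch below (mul32_11x11 is a hypothetical helper name). Each chunk is at most 11 bits, each product at most 22 bits, and each group sum at most 23 bits, so every intermediate value is exactly representable in FP32:

#include <assert.h>
#include <stdint.h>

static uint32_t mul32_11x11(uint32_t a, uint32_t b) {
    float a0 = (float)(a & 0x7ff), a1 = (float)((a >> 11) & 0x7ff), a2 = (float)(a >> 22);
    float b0 = (float)(b & 0x7ff), b1 = (float)((b >> 11) & 0x7ff), b2 = (float)(b >> 22);
    float top = a0*b2 + a1*b1 + a2*b0; /* at most 23 bits, exact in FP32 */
    float mid = a0*b1 + a1*b0;         /* at most 23 bits, exact in FP32 */
    float low = a0*b0;                 /* at most 22 bits, exact in FP32 */
    /* The float-to-integer conversions below are where the hardware needs
       the fp32_to_int23 trick described below. */
    return ((uint32_t)top << 22) + ((uint32_t)mid << 11) + (uint32_t)low;
}

int main(void) {
    assert(mul32_11x11(0xdeadbeefu, 0x12345678u) == 0xdeadbeefu * 0x12345678u);
    return 0;
}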

The catch is that there is no dedicated instruction for converting FP32 values larger than 16 bits to an integer. At first glance, multiple instructions are required, e.g.:

sfpexexp 0, L5, L0, 2|8     ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L5, L5, 0       ; extract mantissa with implicit bit set (even for zero!)
sfpiadd  22-23, L0, L0, 1|4 ; calculate shift amount (shift right 23, then left 22)
sfpshft2 L5, L0, L5, 5      ; logical shift of mantissa by amount
sfpencc  0, 0, 0, 0         ; re-enable lanes

Adam P. Goucher suggested this marvellous trick for implementing fp32_to_int23 with a single instruction: if our value fits into 23 bits, we can add 2**23 (in fp32), which shifts the mantissa so that it is precisely the 23-bit integer that we want! All that remains is to extract the mantissa bits, for which there is a dedicated SFPEXMAN instruction.

Happily, all of our intermediate sums fit into 23 bits.
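
Here is a minimal host-side sketch of the trick (fp32_to_int23 is a hypothetical helper name), assuming the input is a non-negative integer below 2**23; the memcpy-and-mask stands in for SFPEXMAN:

#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t fp32_to_int23(float x) {
    float shifted = x + 8388608.0f; /* x + 2**23: forces the exponent to 23, so the mantissa field equals x */
    uint32_t bits;
    memcpy(&bits, &shifted, sizeof bits);
    return bits & 0x7fffff;         /* the mantissa bits, as SFPEXMAN would extract them */
}

int main(void) {
    assert(fp32_to_int23(0.0f) == 0);
    assert(fp32_to_int23(1234567.0f) == 1234567);
    assert(fp32_to_int23(8388607.0f) == 0x7fffff);
    return 0;
}

In the kernel below, the +2**23 add is folded into the SFPMAD instructions via the constant register L14, so only the SFPEXMAN remains as an extra instruction.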

The full code is 31 cycles:

; assumes constant registers L12 = 0x7ff; L13 = -11; L14 = 0x4b000000 (2.0f**23)

sfpload  L0, INT32, ADDR_MOD_3, offset0 ; a0 = load(offset0)

sfpshft2 L0, L13, L2, 5                 ; a1 = a0 >> 11
sfpshft2 L2, L13, L4, 5                 ; a2 = a1 >> 11

sfpand   0, L12, L2, 0                  ; a1 &= mask
sfpcast  L2, L2, 0                      ; a1 = int_to_fp32(a1)

sfpcast  L4, L4, 0                      ; a2 = int_to_fp32(a2)

sfpand   0, L12, L0, 0                  ; a0 &= mask
sfpcast  L0, L0, 0                      ; a0 = int_to_fp32(a0)

sfpload  L1, INT32, ADDR_MOD_3, offset1 ; b0 = load(offset1)

sfpshft2 L1, L13, L3, 5                 ; b1 = b0 >> 11
sfpshft2 L3, L13, L5, 5                 ; b2 = b1 >> 11

sfpcast  L5, L5, 0                      ; b2 = int_to_fp32(b2)

sfpmad   L0, L5, L14, L5, 0             ; top = a0*b2 + 2**23

sfpand   0, L12, L3, 0                  ; b1 &= mask
sfpcast  L3, L3, 0                      ; b1 = int_to_fp32(b1)

sfpmad   L2, L3, L5, L5, 0              ; top += a1*b1

sfpand   0, L12, L1, 0                  ; b0 &= mask
sfpcast  L1, L1, 0                      ; b0 = int_to_fp32(b0)

sfpmad   L4, L1, L5, L5, 0              ; top += a2*b0
sfpmad   L0, L3, L14, L6, 0             ; mid  = a0*b1 + 2**23
sfpmad   L0, L1, L14, L0, 0             ; low  = a0*b0 + 2**23
sfpmad   L2, L1, L6, L6, 0              ; mid += a1*b0

sfpexman 0, L0, L0, 1                   ; low &= 0x7fffff
sfpexman 0, L6, L6, 1                   ; mid &= 0x7fffff
sfpexman 0, L5, L5, 1                   ; top &= 0x7fffff

sfpshft 11, 0, L6, 1                    ; mid <<= 11
sfpshft 22, 0, L5, 1                    ; top <<= 22

sfpiadd  0, L6, L0, 4                   ; low += mid
sfpiadd  0, L5, L0, 4                   ; low += top

sfpstore L0, INT32, ADDR_MOD_2, offset2 ; store(low, offset2) with autoincrement

Conclusion

Leveraging SFPLOADMACRO, we achieve 8/32c throughput for 32-bit integer multiplication on Blackhole. On Wormhole, using SFPMAD with 11-bit chunks achieves 31/32c throughput.

Acknowledgements