Parallelisation by Recurrence


tl;dr

My auto-parallelizer PoC turns a sequential loop into a parallel one with almost no extra code.

We parallelize a loop by mechanically composing its recurrence f into f⁽ᴾ⁾ without an algebraic closed form. It works with linear and non-linear recurrences (although it’s not always profitable to do so).

We do that by first generating an SSA-style IR (annotated JVM bytecode in the prototype) and placing the operations in a dependency/update graph (much like automatic differentiation systems).

In the end we get a lock-free, deterministic parallel version that is trivial to generate.

We only enable the transform when it’s profitable, and because the generated code is so simple, this is easy to decide. This way expensive cases (e.g. SHA-256) stay serial.

the key idea

Many count-controlled loops carry a recurrence in the loop variable, and this value is usually cheap to compute (e.g. i = i + 1).

My idea is to use this recurrence to parallelize by expressing the loop step as a function of the number of threads (e.g. i = i + 4 for 4 threads). The goal is to have each thread own a disjoint set of iteration indices. Note that uniqueness of values is not required as long as the loop body obeys A1/A2 (see below).

For such loops, there is a deterministic recurrence in the index i (there are some exceptions, but I don’t treat them since this is the general case). Mathematically, it’s i_{n+1} = f(i_n), and because of this recurrence we can ALWAYS express i_{n+k} = f^k(i_n). This means parallelization by recurrence is theoretically possible for every such loop: it’s not a special case but a fundamental property of loop structure. The only question is whether computing f^k is cheap enough to make parallelization worthwhile.

For instance,

#include <stdio.h>

unsigned add5(unsigned n) {
    unsigned s = 1u;
    for (unsigned i = 0; i < n; ++i) {
        s = 5 + s;
        printf("i %u ; s %u\n", i, s);
    }
    return s;
}

int main() {
    unsigned n = 10;
    unsigned result = add5(n);
    
    printf("add5 result after %u iterations: %u\n", n, result);
    
    return 0;
}

becomes for two threads:

#include <stdio.h>

unsigned thread(unsigned s0, unsigned start, unsigned n) {
    unsigned s = s0;
    /* NB: assume n is even for simplicity */
    for (unsigned i = start; i < n; i += 2) {
        s = 5 + s;
    }
    return s;
}

int main() {
    unsigned n = 10;
    unsigned S0 = thread(0, 0, n); // thread 0 (even i)
    unsigned S1 = thread(1, 1, n); // thread 1 (odd i)
    unsigned s = S0 + S1;
    printf("add5 result after %u iterations: %u\n", n, s);
}

Both return 51.

I’ll start by explaining all of the limitations, then I’ll explain my algorithms and provide examples. Then I’ll list what’s already out there (a lot) and what I believe is new here.

Caveat: I’m not a scientist, not an academic, and compilers are not my day job. Which also explains why I want a reality check.

How often does this happen?

While this trick works for any loop satisfying our assumptions (see below), it’s not always smart to run it. That is, we can always mathematically form fᵖ; whether we run the transform depends on the cost model (cheap fᵖ, enough iterations) and on proving purity/alias freedom (which is the core of what I’m doing in my PoC, but that’s a bigger explanation).

This will work with almost all loops, but for this “trick” to be profitable, we need:

  • large enough loops to offset the parallelization overhead. (We can add an if/else to launch the parallelized version when the loop runs long enough, or rely on LLVM’s cost system.)
  • computing the step function must be asymptotically cheaper than executing the body. We can reuse a lot of the work from automatic differentiation (adapted, of course, since that is continuous vs. discrete).
  • all of the assumptions below must hold. They are pretty restrictive, so until we’ve relaxed them, it’s mostly a “thought experiment”. The prototype enforces them now; relaxing them is future work (hence one of the examples relaxes them).
  • Oddly, the loop exit condition can be dynamic, but I’m not explaining that here for lack of space (and I haven’t implemented it yet).

Why it’s still useful

This is still useful because:

  • this is a lockless algorithm, so a speedup is guaranteed whenever the parallelization is profitable.
  • code-gen is so trivial (provided the dependency analysis is done) that it’s almost always worth doing.

Assumptions

Because it’s just a proof of concept, I’m limiting the scope severely.

A1: Purity
A2: Scalar state only (prototype only)
A3: No early exit / weird control flow (prototype only)
A4: Associativity required for the final merge. Commutativity too, but it can be relaxed in some cases.
A5: No overflow/type issues (prototype only).

algorithm

  • Assume A1 to A5.
  • Replace the loop by a “thread launcher” call (it’s a template I copy/paste)
  • Extract the loop into a function and:
    • update the step by the number of threads. See below for the value.
  • A few housekeeping tasks I don’t describe.

Updating step value

In the trivial loop above, i = i + 1, to parallelize on three threads we want to re-express i_3 as a function of i_0 (and, in general, i_(n+3) as a function of i_n). This is possible because of how loops are built.

  • i_3 = i_2 + 1 = … = i_0 + 3: we can jump from step 0 to step 3 without going through steps 1 and 2; “we’ll jump 3 steps at a time”
  • Using the same idea, we get i_4 = i_1 + 3
  • Because the reduction operation is commutative (it can be executed in any order), the i_1 chain can be computed by another thread and folded in during the final sum.

The reason this works is that computing “k steps ahead” is very fast. As soon as that’s false, it’s a pointless optimization (e.g. SHA-256).

To extract this from the source code, we build a dependency graph on top of pseudo-SSA instructions.

Step-by-step example

We’ll be using two threads (to keep the example small).

Sequential


#include <stdio.h>
#include <stdint.h>

uint64_t pow2(uint64_t start, unsigned N) {
    uint64_t s = start;
    for (uint64_t i = 2; i <= N; i = i * i) {
        s *= i;
    }
    return s;
}

int main() {
    unsigned N = 1000;
    printf("Running square loop with N = %u:\n", N);
    uint64_t result = pow2(2, N);
    printf("Final result: %lu\n", result);
    
    printf("Running square loop with N = %u:\n", N);
    result = pow2(1, N);
    printf("Final result: %lu\n", result);
    
    printf("Running square loop with N = %u:\n", N);
    result = pow2(3, N);
    printf("Final result: %lu\n", result);
    
    return 0;
}

Build the infrastructure

First we extract the loop body and add the “thread launcher”. I use the Python-style _ as a placeholder for values to be defined later.

uint64_t thread(uint64_t red, uint64_t start, unsigned N) {
    uint64_t s = start;
    for (uint64_t i = _; i <= N; i = _) {
        s *= i;
    }
    return s;
}

int main() {
    unsigned N = 1000;
    printf("Running square loop with N = %u:\n", N);
    uint64_t t1 = thread(_, _, N);
    uint64_t t2 = thread(_, _, N);
    uint64_t result = t1 * t2;
    printf("Final result: %lu\n", result);

    printf("Running square loop with N = %u:\n", N);
    t1 = thread(_, _, N);
    t2 = thread(_, _, N);
    result = t1 * t2;
    printf("Final result: %lu\n", result);

    printf("Running square loop with N = %u:\n", N);
    t1 = thread(_, _, N);
    t2 = thread(_, _, N);
    result = t1 * t2;
    printf("Final result: %lu\n", result);

    return 0;
}

Define the update value

We replay the recurrence relationship:

  • We build the update chain of every “recurring variable”
    • We run a dependency analysis to detect which variables are used outside of the loop and which depend on themselves.
    • And which variable controls the exit condition of the loop.
    • Here it’s s, which depends only on i, and i, which depends only on its previous version.
  • From the update chain i_n = i_(n-1) * i_(n-1), I want to re-express i_n as something we can use from different threads while still going through all the values: we know we will go through them, and we know we can compute them by term replacement.
    • i_(n+2) = i_(n+1)^2 = (i_n^2)^2 = i_n^4
    • we compute the “intermediate” value, but in most cases it’s faster to recompute it (e.g. +1) than to store it.
    • Also, we can predict whether it’s expensive because we have the update chain (and we can tie it to LLVM’s cost model).

uint64_t thread(uint64_t red, uint64_t start, unsigned N) {
    uint64_t s = start;
    for (uint64_t i = _; i <= N; i = i * i * i * i) {
        s *= i;
    }
    return s;
}

Setting the reduce/step values

Because we only handle the “easy case” of an associative and commutative reducer, the per-thread seeds are simple: one thread gets the original start value, the other gets the neutral element of the operator. Which gives us:

int main() {
    unsigned N = 1000;
    printf("Running square loop with N = %u:\n", N);
    uint64_t t1 = thread(2, 2, N);
    uint64_t t2 = thread(1, 4, N);
    uint64_t result = t1 * t2;
    printf("Final result: %lu\n", result);

    printf("Running square loop with N = %u:\n", N);
    t1 = thread(1, 2, N);
    t2 = thread(1, 4, N);
    result = t1 * t2;
    printf("Final result: %lu\n", result);

    printf("Running square loop with N = %u:\n", N);
    t1 = thread(3, 2, N);
    t2 = thread(1, 4, N);
    result = t1 * t2;
    printf("Final result: %lu\n", result);

    return 0;
}
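The final `t1 * t2` merge works because multiplication is associative and commutative, so the full product factors into the two per-thread partial products; one thread seeds its accumulator with the original start value, the other with the neutral element 1:

```latex
s \;=\; s_0 \cdot \prod_{n} i_n
  \;=\; \underbrace{\Bigl(s_0 \cdot \prod_{n\ \mathrm{even}} i_n\Bigr)}_{t_1}
        \cdot
        \underbrace{\Bigl(1 \cdot \prod_{n\ \mathrm{odd}} i_n\Bigr)}_{t_2}
```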

Adding some debug printfs, we can check that both versions produce the same results.

pow2 actual use

Because i grows really fast, in practice there are not enough iterations to create an actual speedup. It’s still interesting because it’s non-linear, and it shows how we can compute the per-thread steps without knowing a closed-form update function.

LCG Example

Now that we have covered the basics, let’s create a genuinely useful example.

A linear congruential generator (LCG) makes pseudo-random numbers by repeatedly computing:
x_(n+1) = (a·x_n + c) mod m

Because addition and multiplication respect congruence mod m:

(a + b) mod n = [(a mod n) + (b mod n)] mod n.

We can turn this code:

uint32_t lcg_serial(uint32_t seed, unsigned n) {
    uint32_t s = seed;
    for (unsigned i = 0; i < n; ++i) {
        s = 1664525u * s + 1013904223u;
    }
    return s;
}

NB: because it’s unsigned 32-bit arithmetic, there’s a hidden mod 2^32.

into:

uint32_t lcg_serial(uint32_t seed, unsigned n) {
    uint32_t s = seed;
    /* each iteration performs two LCG steps at once */
    for (unsigned i = 0; i + 1 < n; i += 2) {
        s = 1664525u * 1664525u * s + 1664525u * 1013904223u + 1013904223u;
    }
    /* if n was odd, do the leftover single step */
    if (n & 1)
        s = 1664525u * s + 1013904223u;
    return s;
}

(untested code; the full code with tests is at the bottom)

The “n is odd” test is there for obvious reasons (and is needed for the parallelized code). The rest is the same trick applied.

The speedup is 1.21× with 2 threads, which accounts for the overhead of thread creation.

In this case, I had to relax some of the assumptions (A1-A5), and you can see it still works without some of them.

And yes, this is a known trick already, but you can see how “my algorithm” nicely explains the “skip-ahead math trick”.

What I’m not explaining

  • How to check for all the assumptions
  • Code-gen (because it’s trivial here)
  • My prototype works for Java bytecode. It’s here if that interests you.

If people are interested I’m happy to explain more of course!

What (I think) is original

All the core maths behind this is already known or trivial. What’s interesting is:

  1. It’s implemented as a precompiler for the JVM. See my previous post for details.
  2. A fully static compiler pass for skip-ahead, even on bounded non-linear scalar recurrences.
  3. No closed form required. To the best of my knowledge, other approaches need a closed form or some other hard-to-compute equivalent.

Related work

| Theme | Key idea | Limitations our pass avoids |
|---|---|---|
| Skip-/jump-ahead RNG | Algebraically compute fᵖ to skip p steps at once | Requires a linear (or at least algebraic) recurrence; fails on i = i*i, nonlinear hashes, RNGs with carry |
| Polyhedral DOALL/DOACROSS (Polly, Pluto, …) | Dependence tests on affine subscripts → generate tiled/parallel schedule | Rejects most scalar non-linear recurrences |
| Speculative / optimistic passes (Doerfert ’17, Lerna, Privateer) | Run ahead in parallel, roll back on dependence violation | Runtime bookkeeping; mis-speculation risk; no deterministic proof |
| Function-reconstruction scans (Jiang et al. PACT 2018) | Symbolically turn recurrences into parallel prefix scans | Needs to reconstruct f algebraically and still requires an associative combine |
| Parallel prefix scans (Blelloch ’90) | Up-/down-sweep converts suitable recurrences into a parallel prefix sum | Needs an associative operator and two passes; not applicable to arbitrary loop bodies |
| Linear recurrence (Kogge and Stone ’73) | General parallelization by recurrence for linear recurrences | Applies only to linear functions |

Full LCG Code

This is my test code, which is why it’s a little ugly, but it works and shows the principle in action.

/* lcg_parallel_timed.c  ─  measure serial vs. 2-thread LCG
 *
 *   s_{n+1} = a s_n + c  (mod 2^32)
 *   a = 1664525,  c = 1013904223
 *
 * Build:   gcc -O2 -pthread lcg_parallel_timed.c -o lcg_parallel_timed
 * Run:     ./lcg_parallel_timed
 */

#define _POSIX_C_SOURCE 199309L     /* for clock_gettime */

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define A    1664525u
#define C    1013904223u


/* -------- utilities ---------------------------------------------------- */

static inline uint32_t lcg_step(uint32_t s)
{
    return (uint32_t)(((uint64_t)A * s + C));
}

static double ms_diff(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) * 1000.0 +
           (double)(b.tv_nsec - a.tv_nsec) / 1e6;
}

/* -------- parallel worker --------------------------------------------- */

typedef struct {
    uint32_t *dst;
    uint32_t state;
    uint32_t mul;        /* stride-2 multiplier */
    uint32_t add;        /* stride-2 increment  */
    unsigned start;
    unsigned stride;
    unsigned n_total;
} Work;

static void *produce(void *arg)
{
    Work *w = (Work *)arg;
    uint32_t x = w->state;
    for (unsigned i = w->start; i < w->n_total; i += w->stride) {
        w->dst[i] = x;
        x = (uint32_t)(((uint64_t)w->mul * x + w->add) );
    }
    return NULL;
}

/* ---------------------------------------------------------------------- */

int main(void)
{
    const unsigned N      = 1000000u;   /* sequence length */
    const uint32_t SEED   = 12345u;
    const unsigned THREADS = 2;         /* we split even / odd */

    /* stride-2 constants: a²  and  c·(a+1) */
    const uint32_t A2 = (uint32_t)(((uint64_t)A * A) );
    const uint32_t C2 = (uint32_t)((((uint64_t)A * C) + C));

    /* output buffers */
    uint32_t *seq_serial   = malloc((size_t)N * sizeof(uint32_t));
    uint32_t *seq_parallel = malloc((size_t)N * sizeof(uint32_t));
    if (!seq_serial || !seq_parallel) { perror("malloc"); return 1; }

    /* ---------- 1. serial run ---------------------------------------- */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    uint32_t s = SEED;
    for (unsigned i = 0; i < N; ++i) {
        seq_serial[i] = s;
        s = lcg_step(s);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double t_serial_ms = ms_diff(t0, t1);

    /* ---------- 2. parallel run (2 threads) -------------------------- */
    const uint32_t seed_even = SEED;
    const uint32_t seed_odd  = lcg_step(SEED);

    Work jobs[2] = {
        { seq_parallel, seed_even, A2, C2, 0, THREADS, N },
        { seq_parallel, seed_odd,  A2, C2, 1, THREADS, N }
    };
    pthread_t tid[2];

    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_create(&tid[0], NULL, produce, &jobs[0]);
    pthread_create(&tid[1], NULL, produce, &jobs[1]);
    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double t_parallel_ms = ms_diff(t0, t1);

    /* ---------- 3. correctness check --------------------------------- */
    for (unsigned i = 0; i < N; ++i) {
        if (seq_serial[i] != seq_parallel[i]) {
            printf("Mismatch at i=%u\n", i);
            return 1;
        } 
    }

    /* ---------- 4. report ------------------------------------------- */
    printf("Sequence length      : %u\n", N);
    printf("Serial   wall-time   : %.3f ms\n", t_serial_ms);
    printf("Parallel wall-time   : %.3f ms (2 threads)\n", t_parallel_ms);
    printf("Speed-up (serial/2T) : %.2f×\n",
           t_serial_ms / t_parallel_ms);
    printf("Final value s_%u     : %u (0x%08X)\n",
           N - 1, seq_serial[N - 1], seq_serial[N - 1]);

    free(seq_serial);
    free(seq_parallel);
    return 0;
}