Cactus is a single-file C runtime for parallel recursion on Linux x86-64.
It gives you fork-join parallelism with automatic load balancing via work
stealing — write recursive functions as usual, sprinkle in BEGIN,
FORK, JOIN, and END, and the runtime distributes work across all cores.
Use cases:
- Divide-and-conquer algorithms — parallel mergesort, quicksort,
  matrix multiply — anywhere you'd write
  fork(left); solve(right); join().
- Tree-structured computation — game-tree search, recursive backtracking
  (N-Queens), fractal generation — problems where the branching factor
  is high and subtrees are independent.
- Scientific computing — recursive decomposition of grids, adaptive mesh refinement, or any algorithm that naturally subdivides a domain.
Quick start
```c
#include <stdio.h>
#include "cactus.h"

cactus int pfib(int n) {
    if (n < 2) return n;
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}

int main(void) {
    cactus_start(0);                 // 0 = use all available cores
    int answer = pfib(42);
    printf("pfib(42) = %d\n", answer);
    cactus_stop();
    return 0;
}
```
Why I built this
I needed lightweight parallelism for recursive algorithms in C without
pulling in a heavyweight framework. OpenMP's #pragma omp task works
but is opaque, hard to tune, and drags in a large runtime. Cilk is
elegant but requires a custom compiler. Intel TBB is C++ only.
I wanted something that fits in two files, compiles with stock GCC or
Clang, and lets me write parallel recursion that looks almost identical
to the serial version. The cactus decorator, BEGIN/FORK/JOIN/END
macros, and three API calls (cactus_start, cactus_stop,
cactus_workers) are the entire interface.
The name comes from "cactus stack" — the execution model where each parallel branch gets its own stack, forming a tree of stacks that looks like a cactus when drawn sideways.
How it works
- Per-worker deques — each worker thread owns a ring-buffer deque. Spawned continuations are pushed to the bottom; the owner pops from the bottom (LIFO, cache-friendly).
- Random-victim work stealing — idle workers pick a random peer and steal from the top of that peer's deque (FIFO, coarse-grained).
- Continuation-passing — on FORK, the parent's continuation (return
  address + frame pointer) is posted to the deque. The current thread
  immediately executes the child. If a thief steals the continuation,
  it resumes the parent on a fresh stack slab.
- Stack slab pooling — 2 MB stack slabs are allocated from a global
  depot with an MCS spinlock, cached in per-worker stashes. Stolen
  continuations run on these slabs; when they finish, the slab returns
  to the pool.
- Direct register manipulation — x86-64 inline assembly saves and restores RSP/RBP for near-zero-overhead context switching between continuations.
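The push/pop/steal discipline above can be sketched as a textbook Chase-Lev-style ring-buffer deque. This is an illustrative sketch using C11 atomics with simplified (sequentially consistent) orderings; the names, the capacity, and the layout are assumptions, not Cactus's actual code:

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEQUE_CAP 1024

// Owner pushes/pops at the bottom (LIFO); thieves take from the top (FIFO).
typedef struct {
    atomic_long top;               // thieves advance this end
    atomic_long bottom;            // owner works at this end
    void *tasks[DEQUE_CAP];
} deque;

static void deque_push(deque *q, void *task) {  // owner only
    long b = atomic_load(&q->bottom);
    q->tasks[b % DEQUE_CAP] = task;
    atomic_store(&q->bottom, b + 1);            // publish to thieves
}

static void *deque_pop(deque *q) {              // owner only
    long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);
    long t = atomic_load(&q->top);
    void *task = NULL;
    if (t <= b) {
        task = q->tasks[b % DEQUE_CAP];
        if (t == b) {                           // last element: race thieves
            if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
                task = NULL;                    // a thief won the race
            atomic_store(&q->bottom, b + 1);
        }
    } else {
        atomic_store(&q->bottom, b + 1);        // deque was empty; restore
    }
    return task;
}

static void *deque_steal(deque *q) {            // any idle thief
    long t = atomic_load(&q->top);
    long b = atomic_load(&q->bottom);
    if (t >= b) return NULL;                    // nothing to steal
    void *task = q->tasks[t % DEQUE_CAP];
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return NULL;                            // lost the race; caller retries
    return task;
}
```

Note how the owner's pop and the thief's steal only have to synchronize when one element remains; in the common case the two ends never touch.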
| Design choice | Benefit | Cost |
|---|---|---|
| Per-worker deque (ring buffer) | O(1) push/pop, no allocation | 8 KB per worker |
| Random-victim stealing | simple, provably efficient | occasional wasted steal attempts |
| Continuation-passing (child-first) | the common no-steal path is a normal function call | parent must be resumable on any stack |
| 2 MB mmap'd stack slabs | large enough for deep recursion, huge-page eligible | virtual address space per steal |
| MCS spinlock for depot | fair, no spinlock convoy | one extra pointer chase vs. CAS |
| Compiler-specific paths (GCC/Clang) | each gets optimal codegen | two code paths to maintain |
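The MCS spinlock in the table is the textbook queue lock: each waiter spins on a flag in its own node, so handoff is FIFO and contention stays local to one cache line per waiter. A minimal sketch with assumed names, not Cactus's implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next; // successor in the wait queue
    atomic_bool locked;            // each waiter spins on its own flag
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    mcs_node *prev = atomic_exchange(&l->tail, me); // join the queue
    if (prev) {
        atomic_store(&prev->next, me);              // link behind prev
        while (atomic_load(&me->locked)) ;          // spin locally
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expect = me;
        if (atomic_compare_exchange_strong(&l->tail, &expect,
                                           (mcs_node *)NULL))
            return;                                 // no one was waiting
        while (!(succ = atomic_load(&me->next))) ;  // wait for link-in
    }
    atomic_store(&succ->locked, false);             // hand off FIFO
}
```

The "one extra pointer chase" cost in the table is the `prev->next` link step; the benefit is that releasing never broadcasts invalidations to every spinner, which is what causes the convoy on a plain test-and-set lock.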
Build
```sh
make        # builds libcactus.so + test binaries
make test   # runs all tests with correctness checks
make clean  # removes build artifacts
```
Compiler: GCC or Clang with C11 support.
```sh
# Link your program against the library
gcc -O3 -pthread -I. your_app.c -Lbuild/cc -lcactus -Wl,-rpath,'$ORIGIN' -o your_app
```
Usage
Startup and shutdown
```c
cactus_start(0);            // 0 = auto-detect cores
cactus_start(4);            // or specify worker count
int np = cactus_workers();  // query active workers
cactus_stop();              // join all workers, clean up
```
Set CACTUS_NPROCS in the environment to override the worker count
without recompiling.
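Such an override typically boils down to a getenv check at startup, with the environment taking precedence over the argument. A hypothetical sketch (resolve_nprocs is an invented helper, not part of the Cactus API):

```c
#include <stdlib.h>

// Hypothetical: resolve the worker count, letting CACTUS_NPROCS in the
// environment override the value passed to cactus_start().
static int resolve_nprocs(int requested) {
    const char *s = getenv("CACTUS_NPROCS");
    if (s) {
        int n = atoi(s);
        if (n > 0) return n;   // valid override wins
    }
    return requested;          // 0 still means "auto-detect" downstream
}
```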
Writing parallel functions
Annotate a function with the cactus decorator and use the four macros:
```c
cactus int solve(int n) {
    if (n < CUTOFF) return serial_solve(n);
    int a, b;
    BEGIN();                     // save frame for potential steal
    FORK(&a, solve, (n/2));      // spawn child, post continuation
    b = solve(n - n/2);          // execute inline (child-first)
    JOIN();                      // wait for stolen children
    END();                       // clean up frame
    return a + b;
}
```
- BEGIN() — snapshot the frame (RSP, RBP, slab) so a thief can resume it.
- FORK(&ret, fn, (args...)) — spawn fn(args...), store the result in ret.
  Use FORK(fn, (args...)) for void calls.
- JOIN() — block until all forked children complete.
- END() — no-op; marks the end of the parallel region for readability.
Void fork (no return value)
When the child doesn't return a value, omit the destination pointer:
```c
cactus void pmul(const double *a, int lda, const double *b, int ldb,
                 double *c, int ldc, int sz) {
    if (sz <= 64) { leaf(a, lda, b, ldb, c, ldc, sz); return; }
    int h = sz / 2;
    // ...
    BEGIN();
    FORK(pmul, (a00, lda, b00, ldb, c00, ldc, h));  // void — no &ret
    FORK(pmul, (a00, lda, b01, ldb, c01, ldc, h));
    FORK(pmul, (a10, lda, b00, ldb, c10, ldc, h));
    pmul(a10, lda, b01, ldb, c11, ldc, h);          // inline last child
    JOIN();
    FORK(pmul, (a01, lda, b10, ldb, c00, ldc, h));  // second wave
    FORK(pmul, (a01, lda, b11, ldb, c01, ldc, h));
    FORK(pmul, (a11, lda, b10, ldb, c10, ldc, h));
    pmul(a11, lda, b11, ldb, c11, ldc, h);
    JOIN();
    END();
}
```
Multiple FORK/JOIN waves in the same BEGIN/END block are fine —
useful when later forks depend on results from earlier ones.
Forking in a loop
Spawn a variable number of children from a loop. Each iteration
forks one task; JOIN waits for all of them:
```c
cactus int solve(const int *prev, int sz, int row) {
    if (row == sz) return 1;
    int board[MAXN], counts[MAXN];
    if (prev) memcpy(board, prev, sizeof(int) * row);

    // identify legal columns
    int any = 0;
    for (int c = 0; c < sz; c++) {
        if (!safe(board, row, c)) { counts[c] = 0; continue; }
        counts[c] = -1;
        any++;
    }
    if (!any) return 0;

    // fork one child per legal placement
    BEGIN();
    for (int c = 0; c < sz; c++) {
        if (counts[c] != -1) continue;
        board[row] = c;
        int snap[MAXN];
        memcpy(snap, board, sizeof(int) * (row + 1));
        FORK(&counts[c], solve, (snap, sz, row + 1));
    }
    JOIN();
    END();

    int total = 0;
    for (int c = 0; c < sz; c++) total += counts[c];
    return total;
}
```
Parallel quicksort
A classic pattern: partition in place, fork one half, recurse on the other:
```c
cactus void psort(int *v, size_t count) {
    if (count < 2) return;
    int piv = v[count / 2];
    int *lo = v, *hi = v + count - 1;
    while (lo <= hi) {
        if      (*lo < piv) lo++;
        else if (*hi > piv) hi--;
        else { swap(lo, hi); lo++; hi--; }
    }
    BEGIN();
    FORK(psort, (v, (size_t)(hi - v + 1)));  // left half
    psort(lo, (size_t)(v + count - lo));     // right half inline
    JOIN();
    END();
}
```
Tuning the cutoff
The granularity cutoff (switching to serial below some threshold) is the single most important tuning knob. Too low and you pay fork/join overhead on tiny tasks; too high and you lose parallelism.
```c
cactus int pfib(int n) {
    if (n < 20) return serial_fib(n);  // cutoff at 20, not 2
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}
```
A good starting point: switch to serial when the subproblem takes less than ~10 microseconds. Profile with different thresholds — the optimal cutoff depends on your hardware and problem shape.
Multiple return values
Collect results from multiple children into separate variables:
```c
cactus int search(Tree *node) {
    if (!node) return 0;
    int left, right;
    BEGIN();
    FORK(&left, search, (node->left));
    FORK(&right, search, (node->right));
    int mid = process(node);
    JOIN();
    END();
    return left + mid + right;
}
```
Parallel for-each
Apply a function to every element of an array. Recursive bisection gives you automatic load balancing without needing to know the worker count:
```c
cactus void foreach(int *v, size_t n, void (*fn)(int *)) {
    if (n <= 1024) {
        for (size_t i = 0; i < n; i++) fn(&v[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(foreach, (v, mid, fn));
    foreach(v + mid, n - mid, fn);
    JOIN();
    END();
}

// usage
void square(int *x) { *x = (*x) * (*x); }
foreach(arr, 10000000, square);
```
Parallel map
Transform one array into another. Same bisection pattern, different signature:
```c
cactus void pmap(const double *in, double *out, size_t n,
                 double (*fn)(double)) {
    if (n <= 512) {
        for (size_t i = 0; i < n; i++) out[i] = fn(in[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(pmap, (in, out, mid, fn));
    pmap(in + mid, out + mid, n - mid, fn);
    JOIN();
    END();
}
```
Parallel reduce
Sum, min, max — any associative operation. Each half returns a partial result; the parent combines them:
```c
cactus long psum(const int *v, size_t n) {
    if (n <= 4096) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += v[i];
        return s;
    }
    size_t mid = n / 2;
    long left, right;
    BEGIN();
    FORK(&left, psum, (v, mid));
    right = psum(v + mid, n - mid);
    JOIN();
    END();
    return left + right;
}
```
Works for any associative op — swap + for max, min, *, ^,
or a custom merge.
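The shape of that generalization is easiest to see serially: a reduce parameterized by a combiner and its identity. In the Cactus version the first recursive call would become a FORK and the combine would move after JOIN; this serial sketch (invented names) just shows the swap point:

```c
#include <stddef.h>
#include <limits.h>

// Generic divide-and-conquer reduce. Swapping `combine` (and its
// identity) turns the same skeleton into sum, max, product, xor, etc.
static long reduce(const long *v, size_t n,
                   long (*combine)(long, long), long identity) {
    if (n == 0) return identity;
    if (n == 1) return v[0];
    size_t mid = n / 2;
    long left  = reduce(v, mid, combine, identity);         // FORK site
    long right = reduce(v + mid, n - mid, combine, identity);
    return combine(left, right);                            // after JOIN
}

static long add2(long a, long b) { return a + b; }
static long max2(long a, long b) { return a > b ? a : b; }
```

The only requirement is associativity, since work stealing changes the grouping of the combines but never their left-to-right order.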
Parallel find
Search for the first element matching a predicate. Use a shared flag to short-circuit sibling subtrees:
```c
static volatile int found;

cactus int pfind(const int *v, size_t n, int target) {
    if (found) return -1;  // early exit
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (v[i] == target) { found = 1; return (int)i; }
        return -1;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pfind, (v, mid, target));
    right = pfind(v + mid, n - mid, target);
    JOIN();
    END();
    if (left >= 0) return left;
    if (right >= 0) return (int)mid + right;
    return -1;
}
```
Parallel merge sort
Stable sort with O(n) extra space — fork the two halves, merge on join:
```c
cactus void pmergesort(int *v, int *tmp, size_t n) {
    if (n <= 1024) { isort(v, n); return; }  // insertion sort base
    size_t mid = n / 2;
    BEGIN();
    FORK(pmergesort, (v, tmp, mid));
    pmergesort(v + mid, tmp + mid, n - mid);
    JOIN();
    END();
    merge(v, mid, v + mid, n - mid, tmp);
    memcpy(v, tmp, sizeof(int) * n);
}
```
Parallel tree operations
Walk, transform, or aggregate an entire tree in parallel — each node forks its children:
```c
// parallel tree sum
cactus long tree_sum(Node *t) {
    if (!t) return 0;
    long left, right;
    BEGIN();
    FORK(&left, tree_sum, (t->left));
    right = tree_sum(t->right);  // inline the last child
    JOIN();
    END();
    return t->val + left + right;
}

// parallel tree map — apply f to every node
cactus void tree_map(Node *t, void (*f)(Node *)) {
    if (!t) return;
    f(t);
    BEGIN();
    FORK(tree_map, (t->left, f));
    tree_map(t->right, f);
    JOIN();
    END();
}

// parallel tree height
cactus int tree_height(Node *t) {
    if (!t) return 0;
    int lh, rh;
    BEGIN();
    FORK(&lh, tree_height, (t->left));
    rh = tree_height(t->right);
    JOIN();
    END();
    return 1 + (lh > rh ? lh : rh);
}
```
Parallel image processing
Process quadrants of a 2D image independently — a natural fit for recursive subdivision:
```c
cactus void blur(float *img, float *tmp, int x, int y,
                 int w, int h, int stride) {
    if (w * h <= 64 * 64) {
        blur_serial(img, tmp, x, y, w, h, stride);
        return;
    }
    int hw = w / 2, hh = h / 2;
    BEGIN();
    FORK(blur, (img, tmp, x,      y,      hw,     hh,     stride));
    FORK(blur, (img, tmp, x + hw, y,      w - hw, hh,     stride));
    FORK(blur, (img, tmp, x,      y + hh, hw,     h - hh, stride));
    blur(img, tmp, x + hw, y + hh, w - hw, h - hh, stride);
    JOIN();
    END();
}
```
Parallel histogram
Divide the input, count locally in each half, merge after join. No shared mutable state — each subtree writes to its own buffer:
```c
cactus void phist(const uint8_t *data, size_t n, int *hist) {
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++) hist[data[i]]++;
        return;
    }
    size_t mid = n / 2;
    int left[256] = {0}, right[256] = {0};
    BEGIN();
    FORK(phist, (data, mid, left));
    phist(data + mid, n - mid, right);
    JOIN();
    END();
    for (int i = 0; i < 256; i++) hist[i] = left[i] + right[i];
}
```
Parallel any / all
Short-circuit boolean reduce — check if any element satisfies a predicate:
```c
static volatile int bail;

cactus int pany(const int *v, size_t n, int (*pred)(int)) {
    if (bail) return 1;
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (pred(v[i])) { bail = 1; return 1; }
        return 0;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pany, (v, mid, pred));
    right = pany(v + mid, n - mid, pred);
    JOIN();
    END();
    return left || right;
}
```
Flip || to && and invert the short-circuit logic to get pall.
Parallel graph BFS layers
Process each BFS layer in parallel — fork one task per discovered node in the frontier:
```c
cactus void process_frontier(Graph *g, int *frontier, int nf,
                             int *next, int *nn, int *visited) {
    if (nf <= 64) {
        for (int i = 0; i < nf; i++)
            expand(g, frontier[i], next, nn, visited);
        return;
    }
    int mid = nf / 2;
    BEGIN();
    FORK(process_frontier, (g, frontier, mid, next, nn, visited));
    process_frontier(g, frontier + mid, nf - mid, next, nn, visited);
    JOIN();
    END();
}
```
Nested parallelism
Cactus handles nested parallel regions naturally — a parallel map that calls a parallel reduce inside each element:
```c
cactus double row_norm(const double *mat, int cols, int row) {
    return sqrt(psum_sq(mat + row * cols, cols));  // psum_sq is itself parallel
}

cactus void all_norms(const double *mat, double *norms, int rows, int cols) {
    if (rows <= 4) {
        for (int i = 0; i < rows; i++) norms[i] = row_norm(mat, cols, i);
        return;
    }
    int mid = rows / 2;
    BEGIN();
    FORK(all_norms, (mat, norms, mid, cols));
    all_norms(mat + mid * cols, norms + mid, rows - mid, cols);
    JOIN();
    END();
}
```
The runtime's work stealing distributes tasks across cores regardless of nesting depth — no manual flattening needed.
Environment variable control
```sh
# limit to 4 workers (useful for benchmarking scaling)
CACTUS_NPROCS=4 ./my_app

# single-threaded (serial execution, still uses the runtime)
CACTUS_NPROCS=1 ./my_app
```
Tests
Four built-in benchmarks verify correctness and measure performance:
| Test | What it exercises | Default n |
|---|---|---|
| fib | basic fork-join, deep recursion | 30 |
| nqueens | nested parallel backtracking | 11 |
| matmul | recursive divide-and-conquer with FP | 64 |
| quicksort | parallel partition + sort, 10M elements | 10000000 |
```sh
make test                          # run all
build/cc/fib 42                    # single test with custom n
CACTUS_NPROCS=4 build/cc/nqueens   # limit to 4 workers
```
Honest limitations
- x86-64 Linux only — the inline assembly manipulates RSP/RBP directly.
  No ARM, no macOS, no Windows.
- GCC or Clang required — uses compiler-specific extensions (nested
  functions on GCC, __label__ + disable_tail_calls on Clang).
- No cancellation — once a task is forked, it runs to completion.
- Fixed deque size — 1024 entries per worker. Deep recursion without a
  cutoff can overflow the deque.
- Stack slabs are 2 MB each — a stolen continuation gets a fresh slab.
  Workloads with massive steal counts consume proportional virtual memory.
API reference
Runtime lifecycle
| Function | Description |
|---|---|
| int cactus_start(int nprocs) | Start the runtime with nprocs workers (0 = auto) |
| int cactus_stop(void) | Shut down all workers and free resources |
| int cactus_workers(void) | Return the number of active workers |
Parallel region macros
| Macro | Description |
|---|---|
| BEGIN() | Initialize a frame for the current parallel region |
| FORK(&ret, fn, (args...)) | Spawn fn(args...), store result in ret |
| FORK(fn, (args...)) | Spawn fn(args...) with no return value |
| JOIN() | Wait for all forked children to complete |
| END() | End the parallel region (no-op, for readability) |
Function decorator
| Decorator | Description |
|---|---|
| cactus | Annotate functions that participate in fork-join parallelism |