Cactus is a single-file C runtime for parallel recursion on Linux x86-64.
It gives you fork-join parallelism with automatic load balancing via work
stealing — write recursive functions as usual, sprinkle in BEGIN,
FORK, JOIN, and END, and the runtime distributes work across all cores.
Use cases:
- Divide-and-conquer algorithms — parallel mergesort, quicksort,
  matrix multiply — anywhere you'd write
  fork(left); solve(right); join().
- Tree-structured computation — game-tree search, recursive backtracking
  (N-Queens), fractal generation — problems where the branching factor
  is high and subtrees are independent.
- Scientific computing — recursive decomposition of grids, adaptive mesh refinement, or any algorithm that naturally subdivides a domain.
Quick start
```c
#include <stdio.h>
#include "cactus.h"

cactus int pfib(int n) {
    if (n < 2) return n;
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}

int main(void) {
    cactus_start(0);                 // 0 = use all available cores
    int answer = pfib(42);
    printf("pfib(42) = %d\n", answer);
    cactus_stop();
    return 0;
}
```
Why I built this
I needed lightweight parallelism for recursive algorithms in C without
pulling in a heavyweight framework. OpenMP's #pragma omp task works
but is opaque, hard to tune, and drags in a large runtime. Cilk is
elegant but requires a custom compiler. Intel TBB is C++ only.
I wanted something that fits in two files, compiles with stock GCC or
Clang, and lets me write parallel recursion that looks almost identical
to the serial version. The cactus decorator, BEGIN/FORK/JOIN/END
macros, and three API calls (cactus_start, cactus_stop,
cactus_workers) are the entire interface.
The name comes from "cactus stack" — the execution model where each parallel branch gets its own stack, forming a tree of stacks that looks like a cactus when drawn sideways.
How it works
- Per-worker deques — each worker thread owns a ring-buffer deque. Spawned continuations are pushed to the bottom; the owner pops from the bottom (LIFO, cache-friendly).
- Random-victim work stealing — idle workers pick a random peer and steal from the top of that peer's deque (FIFO, coarse-grained).
- Continuation-passing — on FORK, the parent's continuation (return
  address + frame pointer) is posted to the deque. The current thread
  immediately executes the child. If a thief steals the continuation,
  it resumes the parent on a fresh stack slab.
- Stack slab pooling — 2 MB stack slabs are allocated from a global
  depot with an MCS spinlock, cached in per-worker stashes. Stolen
  continuations run on these slabs; when they finish, the slab returns
  to the pool.
- Direct register manipulation — x86-64 inline assembly saves and restores RSP/RBP for near-zero-overhead context switching between continuations.
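The push/pop/steal discipline above can be sketched as a textbook Chase-Lev-style ring-buffer deque. This is an illustrative sketch using C11 atomics with simplified (sequentially consistent) orderings; the names, the capacity, and the layout are assumptions, not Cactus's actual code:

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEQUE_CAP 1024

// Owner pushes/pops at the bottom (LIFO); thieves take from the top (FIFO).
typedef struct {
    atomic_long top;               // thieves advance this end
    atomic_long bottom;            // owner works at this end
    void *tasks[DEQUE_CAP];
} deque;

static void deque_push(deque *q, void *task) {  // owner only
    long b = atomic_load(&q->bottom);
    q->tasks[b % DEQUE_CAP] = task;
    atomic_store(&q->bottom, b + 1);            // publish to thieves
}

static void *deque_pop(deque *q) {              // owner only
    long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);
    long t = atomic_load(&q->top);
    void *task = NULL;
    if (t <= b) {
        task = q->tasks[b % DEQUE_CAP];
        if (t == b) {                           // last element: race thieves
            if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
                task = NULL;                    // a thief won the race
            atomic_store(&q->bottom, b + 1);
        }
    } else {
        atomic_store(&q->bottom, b + 1);        // deque was empty; restore
    }
    return task;
}

static void *deque_steal(deque *q) {            // any idle thief
    long t = atomic_load(&q->top);
    long b = atomic_load(&q->bottom);
    if (t >= b) return NULL;                    // nothing to steal
    void *task = q->tasks[t % DEQUE_CAP];
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return NULL;                            // lost the race; caller retries
    return task;
}
```

Note how the owner's pop and the thief's steal only have to synchronize when one element remains; in the common case the two ends never touch.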
| Design choice | Benefit | Cost |
|---|---|---|
| Per-worker deque (ring buffer) | O(1) push/pop, no allocation | 8 KB per worker |
| Random-victim stealing | simple, provably efficient | occasional wasted steal attempts |
| Continuation-passing (child-first) | the common no-steal path is a normal function call | parent must be resumable on any stack |
| 2 MB mmap'd stack slabs | large enough for deep recursion, huge-page eligible | virtual address space per steal |
| MCS spinlock for depot | fair, no spinlock convoy | one extra pointer chase vs. CAS |
| Compiler-specific paths (GCC/Clang) | each gets optimal codegen | two code paths to maintain |
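The MCS spinlock in the table is the textbook queue lock: each waiter spins on a flag in its own node, so handoff is FIFO and contention stays local to one cache line per waiter. A minimal sketch with assumed names, not Cactus's implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next; // successor in the wait queue
    atomic_bool locked;            // each waiter spins on its own flag
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    mcs_node *prev = atomic_exchange(&l->tail, me); // join the queue
    if (prev) {
        atomic_store(&prev->next, me);              // link behind prev
        while (atomic_load(&me->locked)) ;          // spin locally
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expect = me;
        if (atomic_compare_exchange_strong(&l->tail, &expect,
                                           (mcs_node *)NULL))
            return;                                 // no one was waiting
        while (!(succ = atomic_load(&me->next))) ;  // wait for link-in
    }
    atomic_store(&succ->locked, false);             // hand off FIFO
}
```

The "one extra pointer chase" cost in the table is the `prev->next` link step; the benefit is that releasing never broadcasts invalidations to every spinner, which is what causes the convoy on a plain test-and-set lock.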
Build
```sh
make        # builds libcactus.so + test binaries
make test   # runs all tests with correctness checks
make clean  # removes build artifacts
```
Compiler: GCC or Clang with C11 support.
```sh
# Link your program against the library
gcc -O3 -pthread -I. your_app.c -Lbuild/cc -lcactus -Wl,-rpath,'$ORIGIN' -o your_app
```
Usage
Startup and shutdown
```c
cactus_start(0);            // 0 = auto-detect cores
cactus_start(4);            // or specify worker count
int np = cactus_workers();  // query active workers
cactus_stop();              // join all workers, clean up
```
Set CACTUS_NPROCS in the environment to override the worker count
without recompiling.
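Such an override typically boils down to a getenv check at startup, with the environment taking precedence over the argument. A hypothetical sketch (resolve_nprocs is an invented helper, not part of the Cactus API):

```c
#include <stdlib.h>

// Hypothetical: resolve the worker count, letting CACTUS_NPROCS in the
// environment override the value passed to cactus_start().
static int resolve_nprocs(int requested) {
    const char *s = getenv("CACTUS_NPROCS");
    if (s) {
        int n = atoi(s);
        if (n > 0) return n;   // valid override wins
    }
    return requested;          // 0 still means "auto-detect" downstream
}
```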
Writing parallel functions
Annotate a function with the cactus decorator and use the four macros:
```c
cactus int solve(int n) {
    if (n < CUTOFF) return serial_solve(n);
    int a, b;
    BEGIN();                     // save frame for potential steal
    FORK(&a, solve, (n/2));      // spawn child, post continuation
    b = solve(n - n/2);          // execute inline (child-first)
    JOIN();                      // wait for stolen children
    END();                       // clean up frame
    return a + b;
}
```
- BEGIN() — snapshot the frame (RSP, RBP, slab) so a thief can resume it.
- FORK(&ret, fn, (args...)) — spawn fn(args...), store the result in ret.
  Use FORK(fn, (args...)) for void calls.
- JOIN() — block until all forked children complete.
- END() — no-op; marks the end of the parallel region for readability.
Void fork (no return value)
When the child doesn't return a value, omit the destination pointer:
```c
cactus void pmul(const double *a, int lda, const double *b, int ldb,
                 double *c, int ldc, int sz) {
    if (sz <= 64) { leaf(a, lda, b, ldb, c, ldc, sz); return; }
    int h = sz / 2;
    // ...
    BEGIN();
    FORK(pmul, (a00, lda, b00, ldb, c00, ldc, h));  // void — no &ret
    FORK(pmul, (a00, lda, b01, ldb, c01, ldc, h));
    FORK(pmul, (a10, lda, b00, ldb, c10, ldc, h));
    pmul(a10, lda, b01, ldb, c11, ldc, h);          // inline last child
    JOIN();
    FORK(pmul, (a01, lda, b10, ldb, c00, ldc, h));  // second wave
    FORK(pmul, (a01, lda, b11, ldb, c01, ldc, h));
    FORK(pmul, (a11, lda, b10, ldb, c10, ldc, h));
    pmul(a11, lda, b11, ldb, c11, ldc, h);
    JOIN();
    END();
}
```
Multiple FORK/JOIN waves in the same BEGIN/END block are fine —
useful when later forks depend on results from earlier ones.
Forking in a loop
Spawn a variable number of children from a loop. Each iteration
forks one task; JOIN waits for all of them:
```c
cactus int solve(const int *prev, int sz, int row) {
    if (row == sz) return 1;
    int board[MAXN], counts[MAXN];
    if (prev) memcpy(board, prev, sizeof(int) * row);

    // identify legal columns
    int any = 0;
    for (int c = 0; c < sz; c++) {
        if (!safe(board, row, c)) { counts[c] = 0; continue; }
        counts[c] = -1;
        any++;
    }
    if (!any) return 0;

    // fork one child per legal placement
    BEGIN();
    for (int c = 0; c < sz; c++) {
        if (counts[c] != -1) continue;
        board[row] = c;
        int snap[MAXN];
        memcpy(snap, board, sizeof(int) * (row + 1));
        FORK(&counts[c], solve, (snap, sz, row + 1));
    }
    JOIN();
    END();

    int total = 0;
    for (int c = 0; c < sz; c++) total += counts[c];
    return total;
}
```
Parallel quicksort
A classic pattern: partition in place, fork one half, recurse on the other:
```c
cactus void psort(int *v, size_t count) {
    if (count < 2) return;
    int piv = v[count / 2];
    int *lo = v, *hi = v + count - 1;
    while (lo <= hi) {
        if      (*lo < piv) lo++;
        else if (*hi > piv) hi--;
        else { swap(lo, hi); lo++; hi--; }
    }
    BEGIN();
    FORK(psort, (v, (size_t)(hi - v + 1)));  // left half
    psort(lo, (size_t)(v + count - lo));     // right half inline
    JOIN();
    END();
}
```
Tuning the cutoff
The granularity cutoff (switching to serial below some threshold) is the single most important tuning knob. Too low and you pay fork/join overhead on tiny tasks; too high and you lose parallelism.
```c
cactus int pfib(int n) {
    if (n < 20) return serial_fib(n);  // cutoff at 20, not 2
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}
```
A good starting point: switch to serial when the subproblem takes less than ~10 microseconds. Profile with different thresholds — the optimal cutoff depends on your hardware and problem shape.
Multiple return values
Collect results from multiple children into separate variables:
```c
cactus int search(Tree *node) {
    if (!node) return 0;
    int left, right;
    BEGIN();
    FORK(&left, search, (node->left));
    FORK(&right, search, (node->right));
    int mid = process(node);
    JOIN();
    END();
    return left + mid + right;
}
```
Parallel for-each
Apply a function to every element of an array. Recursive bisection gives you automatic load balancing without needing to know the worker count:
```c
cactus void foreach(int *v, size_t n, void (*fn)(int *)) {
    if (n <= 1024) {
        for (size_t i = 0; i < n; i++) fn(&v[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(foreach, (v, mid, fn));
    foreach(v + mid, n - mid, fn);
    JOIN();
    END();
}

// usage
void square(int *x) { *x = (*x) * (*x); }
foreach(arr, 10000000, square);
```
Parallel map
Transform one array into another. Same bisection pattern, different signature:
```c
cactus void pmap(const double *in, double *out, size_t n,
                 double (*fn)(double)) {
    if (n <= 512) {
        for (size_t i = 0; i < n; i++) out[i] = fn(in[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(pmap, (in, out, mid, fn));
    pmap(in + mid, out + mid, n - mid, fn);
    JOIN();
    END();
}
```
Parallel reduce
Sum, min, max — any associative operation. Each half returns a partial result; the parent combines them:
```c
cactus long psum(const int *v, size_t n) {
    if (n <= 4096) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += v[i];
        return s;
    }
    size_t mid = n / 2;
    long left, right;
    BEGIN();
    FORK(&left, psum, (v, mid));
    right = psum(v + mid, n - mid);
    JOIN();
    END();
    return left + right;
}
```
Works for any associative op — swap + for max, min, *, ^,
or a custom merge.
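The shape of that generalization is easiest to see serially: a reduce parameterized by a combiner and its identity. In the Cactus version the first recursive call would become a FORK and the combine would move after JOIN; this serial sketch (invented names) just shows the swap point:

```c
#include <stddef.h>
#include <limits.h>

// Generic divide-and-conquer reduce. Swapping `combine` (and its
// identity) turns the same skeleton into sum, max, product, xor, etc.
static long reduce(const long *v, size_t n,
                   long (*combine)(long, long), long identity) {
    if (n == 0) return identity;
    if (n == 1) return v[0];
    size_t mid = n / 2;
    long left  = reduce(v, mid, combine, identity);         // FORK site
    long right = reduce(v + mid, n - mid, combine, identity);
    return combine(left, right);                            // after JOIN
}

static long add2(long a, long b) { return a + b; }
static long max2(long a, long b) { return a > b ? a : b; }
```

The only requirement is associativity, since work stealing changes the grouping of the combines but never their left-to-right order.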
Parallel find
Search for the first element matching a predicate. Use a shared flag to short-circuit sibling subtrees:
```c
static volatile int found;

cactus int pfind(const int *v, size_t n, int target) {
    if (found) return -1;  // early exit
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (v[i] == target) { found = 1; return (int)i; }
        return -1;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pfind, (v, mid, target));
    right = pfind(v + mid, n - mid, target);
    JOIN();
    END();
    if (left >= 0) return left;
    if (right >= 0) return (int)mid + right;
    return -1;
}
```
Parallel merge sort
Stable sort with O(n) extra space — fork the two halves, merge on join:
```c
cactus void pmergesort(int *v, int *tmp, size_t n) {
    if (n <= 1024) { isort(v, n); return; }  // insertion sort base
    size_t mid = n / 2;
    BEGIN();
    FORK(pmergesort, (v, tmp, mid));
    pmergesort(v + mid, tmp + mid, n - mid);
    JOIN();
    END();
    merge(v, mid, v + mid, n - mid, tmp);
    memcpy(v, tmp, sizeof(int) * n);
}
```
Parallel tree operations
Walk, transform, or aggregate an entire tree in parallel — each node forks its children:
```c
// parallel tree sum
cactus long tree_sum(Node *t) {
    if (!t) return 0;
    long left, right;
    BEGIN();
    FORK(&left, tree_sum, (t->left));
    right = tree_sum(t->right);  // inline the last child
    JOIN();
    END();
    return t->val + left + right;
}

// parallel tree map — apply f to every node
cactus void tree_map(Node *t, void (*f)(Node *)) {
    if (!t) return;
    f(t);
    BEGIN();
    FORK(tree_map, (t->left, f));
    tree_map(t->right, f);
    JOIN();
    END();
}

// parallel tree height
cactus int tree_height(Node *t) {
    if (!t) return 0;
    int lh, rh;
    BEGIN();
    FORK(&lh, tree_height, (t->left));
    rh = tree_height(t->right);
    JOIN();
    END();
    return 1 + (lh > rh ? lh : rh);
}
```
Parallel image processing
Process quadrants of a 2D image independently — a natural fit for recursive subdivision:
```c
cactus void blur(float *img, float *tmp, int x, int y,
                 int w, int h, int stride) {
    if (w * h <= 64 * 64) {
        blur_serial(img, tmp, x, y, w, h, stride);
        return;
    }
    int hw = w / 2, hh = h / 2;
    BEGIN();
    FORK(blur, (img, tmp, x,      y,      hw,     hh,     stride));
    FORK(blur, (img, tmp, x + hw, y,      w - hw, hh,     stride));
    FORK(blur, (img, tmp, x,      y + hh, hw,     h - hh, stride));
    blur(img, tmp, x + hw, y + hh, w - hw, h - hh, stride);
    JOIN();
    END();
}
```
Parallel histogram
Divide the input, count locally in each half, merge after join. No shared mutable state — each subtree writes to its own buffer:
```c
cactus void phist(const uint8_t *data, size_t n, int *hist) {
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++) hist[data[i]]++;
        return;
    }
    size_t mid = n / 2;
    int left[256] = {0}, right[256] = {0};
    BEGIN();
    FORK(phist, (data, mid, left));
    phist(data + mid, n - mid, right);
    JOIN();
    END();
    for (int i = 0; i < 256; i++) hist[i] = left[i] + right[i];
}
```
Parallel any / all
Short-circuit boolean reduce — check if any element satisfies a predicate:
```c
static volatile int bail;

cactus int pany(const int *v, size_t n, int (*pred)(int)) {
    if (bail) return 1;
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (pred(v[i])) { bail = 1; return 1; }
        return 0;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pany, (v, mid, pred));
    right = pany(v + mid, n - mid, pred);
    JOIN();
    END();
    return left || right;
}
```
Flip || to && and invert the short-circuit logic to get pall.
Parallel graph BFS layers
Process each BFS layer in parallel — fork one task per discovered node in the frontier:
```c
cactus void process_frontier(Graph *g, int *frontier, int nf,
                             int *next, int *nn, int *visited) {
    if (nf <= 64) {
        for (int i = 0; i < nf; i++)
            expand(g, frontier[i], next, nn, visited);
        return;
    }
    int mid = nf / 2;
    BEGIN();
    FORK(process_frontier, (g, frontier, mid, next, nn, visited));
    process_frontier(g, frontier + mid, nf - mid, next, nn, visited);
    JOIN();
    END();
}
```
Nested parallelism
Cactus handles nested parallel regions naturally — a parallel map that calls a parallel reduce inside each element:
```c
cactus double row_norm(const double *mat, int cols, int row) {
    return sqrt(psum_sq(mat + row * cols, cols));  // psum_sq is itself parallel
}

cactus void all_norms(const double *mat, double *norms, int rows, int cols) {
    if (rows <= 4) {
        for (int i = 0; i < rows; i++) norms[i] = row_norm(mat, cols, i);
        return;
    }
    int mid = rows / 2;
    BEGIN();
    FORK(all_norms, (mat, norms, mid, cols));
    all_norms(mat + mid * cols, norms + mid, rows - mid, cols);
    JOIN();
    END();
}
```
The runtime's work stealing distributes tasks across cores regardless of nesting depth — no manual flattening needed.
Environment variable control
```sh
# limit to 4 workers (useful for benchmarking scaling)
CACTUS_NPROCS=4 ./my_app

# single-threaded (serial execution, still uses the runtime)
CACTUS_NPROCS=1 ./my_app
```
Tests
Four built-in benchmarks verify correctness and measure performance:
| Test | What it exercises | Default n |
|---|---|---|
| fib | basic fork-join, deep recursion | 30 |
| nqueens | nested parallel backtracking | 11 |
| matmul | recursive divide-and-conquer with FP | 64 |
| quicksort | parallel partition + sort, 10M elements | 10000000 |
```sh
make test                          # run all
build/cc/fib 42                    # single test with custom n
CACTUS_NPROCS=4 build/cc/nqueens   # limit to 4 workers
```
Honest limitations
- x86-64 Linux only — the inline assembly manipulates RSP/RBP directly.
  No ARM, no macOS, no Windows.
- GCC or Clang required — uses compiler-specific extensions (nested
  functions on GCC, __label__ + disable_tail_calls on Clang).
- No cancellation — once a task is forked, it runs to completion.
- Fixed deque size — 1024 entries per worker. Deep recursion without a
  cutoff can overflow the deque.
- Stack slabs are 2 MB each — a stolen continuation gets a fresh slab.
  Workloads with massive steal counts consume proportional virtual memory.
API reference
Runtime lifecycle
| Function | Description |
|---|---|
| int cactus_start(int nprocs) | Start the runtime with nprocs workers (0 = auto) |
| int cactus_stop(void) | Shut down all workers and free resources |
| int cactus_workers(void) | Return the number of active workers |
Parallel region macros
| Macro | Description |
|---|---|
| BEGIN() | Initialize a frame for the current parallel region |
| FORK(&ret, fn, (args...)) | Spawn fn(args...), store result in ret |
| FORK(fn, (args...)) | Spawn fn(args...) with no return value |
| JOIN() | Wait for all forked children to complete |
| END() | End the parallel region (no-op, for readability) |
Function decorator
| Decorator | Description |
|---|---|
| cactus | Annotate functions that participate in fork-join parallelism |