UPCOMING WORKSHOPS
Online:
• AVX Vectorization Workshop: dates TBD – write to info [at] johnnysswlab [dot] com to express interest. More info…
• NEON Vectorization Workshop: dates TBD – write to info [at] johnnysswlab [dot] com to express interest. More info…
I was preparing an article about Highway – a portable vectorization library from Google – so I ported a few examples from my vectorization workshop from AVX to Highway. One of the examples was vectorized binary search.
I assume most readers are familiar with simple binary search. It looks something like this:
void binary_search(int32_t* sorted, size_t sorted_size, int32_t* data,
                   int32_t* found_idx, size_t data_size) {
    for (size_t i = 0; i < data_size; ++i) {
        int32_t key = data[i];
        found_idx[i] = -1;
        int low = 0;
        int high = sorted_size - 1;
        int mid;
        while (low <= high) {
            mid = (low + high) / 2;
            if (sorted[mid] < key) {
                low = mid + 1;
            } else if (sorted[mid] > key) {
                high = mid - 1;
            } else {
                found_idx[i] = mid;
                break;
            }
        }
    }
}
We take a lookup value (here called key), and then pick the left or right half of the array depending on whether our value is bigger or smaller than the middle element. We repeat this process, each time halving the search space, until we are left with at most one element – which either is the result or means the value wasn’t found.
We vectorize this algorithm using outer-loop vectorization: essentially we run e.g. 4 searches in parallel. Instead of having one lookup key, we have 4. Instead of 1 mid, 1 low and 1 high value, we now have 4 mids, 4 lows and 4 highs. The finished mask accumulates all lanes that are done, either because the element was found or because the search interval collapsed (low > high).
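Before diving into the intrinsics, it may help to see the same control structure in plain scalar C++. The sketch below is my own illustration (the function name and the fixed width W = 4 are my choices, not code from the workshop): four searches advance in lockstep, and a per-lane finished flag plays the role of the mask:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of outer-loop vectorization: W searches advance in lockstep.
// Illustration only: the real vector code replaces these inner loops
// with single vector instructions operating on all W lanes at once.
constexpr int W = 4;

void binary_search_lockstep(const int32_t* sorted, size_t sorted_size,
                            const int32_t* keys, int32_t* found_idx) {
    int32_t low[W] = {};
    int32_t high[W];
    bool finished[W] = {};
    for (int l = 0; l < W; ++l) {
        high[l] = static_cast<int32_t>(sorted_size) - 1;
        found_idx[l] = -1;
    }
    for (;;) {
        // Lanes whose search interval collapsed are finished.
        bool all_done = true;
        for (int l = 0; l < W; ++l) {
            finished[l] = finished[l] || (low[l] > high[l]);
            all_done = all_done && finished[l];
        }
        if (all_done) break;   // corresponds to the AllTrue check
        // One "vector" step: all unfinished lanes advance together.
        for (int l = 0; l < W; ++l) {
            if (finished[l]) continue;
            int32_t mid = (low[l] + high[l]) / 2;
            int32_t val = sorted[mid];  // W independent loads (the "gather")
            if (val == keys[l]) {
                found_idx[l] = mid;
                finished[l] = true;
            } else if (val > keys[l]) {
                high[l] = mid - 1;
            } else {
                low[l] = mid + 1;
            }
        }
    }
}
```

Each pass of the inner loops corresponds to one iteration of the vector loop, and the per-lane conditional updates map onto masked selects.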
The Highway code looks like this:
HWY_ATTR void binary_search_vectorized(int32_t* sorted, size_t sorted_size,
                                       int32_t* data, int32_t* found_idx_arr,
                                       size_t data_size) {
    const hn::ScalableTag<int32_t> i32;
    const size_t N = hn::Lanes(i32);
    assert(data_size % N == 0);
    const auto one = hn::Set(i32, 1);
    for (size_t i = 0; i < data_size; i += N) {
        auto key = hn::LoadU(i32, data + i);
        auto low = hn::Zero(i32);
        auto high = hn::Set(i32, static_cast<int32_t>(sorted_size - 1));
        auto found_idx = hn::Set(i32, -1);
        auto finished = hn::MaskFalse(i32);
        // The mask `finished` contains all the lanes for which
        // the search is finished.
        while (true) {
            // We mark as finished those lanes for which low > high
            finished = hn::Or(finished, (low > high));
            // If the search is finished for all lanes, we break
            if (hn::AllTrue(i32, finished)) break;
            // mid = (low + high) / 2
            auto mid = hn::ShiftRight<1>(low + high);
            auto val = hn::GatherIndex(i32, sorted, mid);
            auto eq_mask = val == key;
            // For those lanes where val == key, we
            // update the found_idx
            found_idx = hn::IfThenElse(eq_mask, mid, found_idx);
            // If the element was found, the search is finished for this lane
            finished = hn::Or(finished, eq_mask);
            // We don't update `high` and `low` if the search is finished
            high = hn::IfThenElse(hn::AndNot(finished, val > key), mid - one, high);
            low = hn::IfThenElse(hn::AndNot(finished, val < key), mid + one, low);
        }
        hn::StoreU(found_idx, i32, found_idx_arr + i);
    }
}
The code is not easy to understand, but compared to raw AVX intrinsics this is beautifully readable code. To implement vectorized binary search efficiently, the ISA should support vector gather instructions – here hn::GatherIndex, which on AVX2 corresponds to the intrinsic _mm256_i32gather_epi32. As vectorization lovers know, this is a very expensive instruction that heavily stresses the memory subsystem – and if memory is the bottleneck, vectorization often doesn’t pay.
But then again, scalar binary search already performs random memory accesses. So what exactly are we making worse? And was vectorization worth it?
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
Runtimes
If we run this code on an AVX-enabled machine (like my Zen 4 laptop with an AMD Ryzen 9 PRO 8945HS), we get the following numbers for a sorted array of 1M ints and 4M input keys, repeated 10 times:
| Metric | Scalar Version | Vector Version (AVX2) |
|---|---|---|
| Runtime | 2.80 s | 1.43 s |
| Instructions | 9.86 billion | 2.58 billion |
| CPI | 1.42 | 2.82 |
The vectorization was worth it: the runtime is roughly 2× faster and the instruction count decreased by a factor of 3.8. Considering that we are running 8 independent searches in parallel but the speedup is only 2×, we can assume we are hitting the memory wall. Also notice that CPI nearly doubled in the vector version. This strongly suggests that we are more backend-bound – either waiting for memory, or instructions waiting for one another to complete.
But the real question is, is vectorization the real reason why binary search is faster?
Before answering, I translated this code to NEON and ran it on my Raspberry Pi 4 (Cortex-A72). NEON doesn’t have gathers; instead, they are emulated with a small loop of scalar loads. I ran the experiments and got the following numbers:
| Metric | Scalar Version | Vector Version (NEON) |
|---|---|---|
| Runtime | 53.07 s | 33.79 s |
| Instructions | 7.31 billion | 8.48 billion |
| CPI | 10.74 | 5.91 |
Interestingly enough, the number of instructions went up but the runtime went down! Why?
Why Is the Vectorized NEON Version Faster In Spite of Having More Instructions?
The scalar binary search has a loop-carried dependency – to calculate the mid value in the current iteration, we need the low and high values from the previous iteration. For a 1M-element array, each lookup is therefore a chain of about 20 dependent loads (log2(1M) ≈ 20). Even with perfect branch prediction, this dependency chain puts a hard upper limit on the performance of binary search.
Although modern CPUs use branch prediction and speculation, each binary search still forms a single architectural dependency chain. The processor may speculate and issue several loads ahead of time, but it can only speculate along one predicted path. As the speculative chain becomes longer, the probability of misprediction increases, and when that happens, the speculative work must be discarded. So the limiting factor is not how many loads can be issued, but how many of those loads are actually useful.
With AVX vectorization, we decreased the total instruction count, but more importantly, we started executing multiple independent searches at the same time. The original code was performing 4 million lookups sequentially. By rewriting it using AVX vector instructions (which process 8 integers at once), we now execute 8 independent dependency chains in parallel. The CPI metric got worse, which suggests the CPU became more backend-bound – most likely waiting on memory and gather latency.
With NEON vectorization, the instruction count actually went up. But if we look at CPI, the vectorized version had a much better CPI. Why? If NEON had true hardware gathers, we would expect behavior similar to AVX, including higher CPI due to memory stalls. But since it doesn’t have them, gathers are emulated using scalar loads. This increases the instruction count, but those loads are independent and the CPU can execute many of them in parallel.
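A rough scalar model of such an emulated gather looks like this (my illustration, not Highway’s actual code generation – the function name is hypothetical):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of a gather emulated with scalar loads: each lane's load is
// independent of the others, so an out-of-order core can have several
// of them in flight at the same time.
void gather_emulated(const int32_t* base, const int32_t* indices,
                     int32_t* out, size_t lanes) {
    for (size_t l = 0; l < lanes; ++l) {
        out[l] = base[indices[l]];  // no lane depends on another lane's result
    }
}
```

Each iteration issues a load whose address does not depend on the previous iteration’s result.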
In other words, we increased memory-level parallelism. With the vectorized version, multiple memory accesses can be in flight at the same time, which hides memory latency far better than the strictly serial scalar version does.
It wasn’t vectorization per se that caused the speedup – it was restructuring the code to expose more instruction-level and memory-level parallelism. Each individual binary search is still serial. We did not make one search faster. We made many serial searches overlap.
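This can even be demonstrated without SIMD at all. The sketch below (again my own illustration; the batch width B = 8 and the function name are arbitrary choices) interleaves B scalar searches, advancing each by one step per pass, so up to B independent loads can be in flight at once:

```cpp
#include <cstddef>
#include <cstdint>

// Purely scalar, but restructured: B searches advance one step per pass,
// so up to B independent loads overlap instead of one serial chain.
constexpr size_t B = 8;

void binary_search_interleaved(const int32_t* sorted, size_t sorted_size,
                               const int32_t* keys, int32_t* found_idx,
                               size_t n) {
    for (size_t base = 0; base < n; base += B) {
        size_t batch = (n - base < B) ? (n - base) : B;
        int32_t low[B], high[B];
        size_t active = batch;
        for (size_t j = 0; j < batch; ++j) {
            low[j] = 0;
            high[j] = static_cast<int32_t>(sorted_size) - 1;
            found_idx[base + j] = -1;
        }
        while (active > 0) {
            // One step of every still-active search in the batch.
            for (size_t j = 0; j < batch; ++j) {
                if (low[j] > high[j]) continue;   // this search is done
                int32_t mid = (low[j] + high[j]) / 2;
                int32_t val = sorted[mid];        // independent loads across j
                if (val == keys[base + j]) {
                    found_idx[base + j] = mid;
                    low[j] = high[j] + 1;         // retire this search
                    --active;
                } else if (val > keys[base + j]) {
                    high[j] = mid - 1;
                    if (low[j] > high[j]) --active;
                } else {
                    low[j] = mid + 1;
                    if (low[j] > high[j]) --active;
                }
            }
        }
    }
}
```

There is not a single vector instruction here, yet the restructuring exposes the same memory-level parallelism that made the vectorized versions faster.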