Binary Vector Search at 350 GB/s Using ARM Neon
topk.io
re: optimization for 1024-bit vectors, do you pad shorter ones or fall back to a more general kernel?
We project the original vectors so that their width matches one of our optimized kernels. This generally gives better recall than simple padding, since every bit of the kernel width carries information instead of sitting at a constant pad value.
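To make the idea concrete, here is a minimal sketch of one way such a projection could work. It is not the actual implementation: the reply does not say which projection is used, so this assumes sign random projection (random hyperplanes), a 1024-bit target width taken from the question, and hypothetical helper names (`make_hyperplanes`, `project_to_bits`, `hamming`). It uses pure Python for clarity, not speed.

```python
import random

TARGET_BITS = 1024  # assumed kernel width, taken from the question above


def make_hyperplanes(dim, bits=TARGET_BITS, seed=0):
    """Draw `bits` random Gaussian hyperplanes of dimension `dim` (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def project_to_bits(vec, planes):
    """Sign random projection: one output bit per hyperplane, packed into an int."""
    code = 0
    for i, plane in enumerate(planes):
        dot = sum(p * x for p, x in zip(plane, vec))
        if dot >= 0.0:
            code |= 1 << i  # bit i records which side of hyperplane i the vector is on
    return code


def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


# Usage: a 384-dimensional float vector becomes a full 1024-bit code,
# so every bit of the fixed-width kernel input is meaningful.
rng = random.Random(1)
planes = make_hyperplanes(384)
v = [rng.gauss(0.0, 1.0) for _ in range(384)]
near = [x + 0.01 * rng.gauss(0.0, 1.0) for x in v]  # slightly perturbed copy of v
far = [rng.gauss(0.0, 1.0) for _ in range(384)]     # unrelated vector

code_v = project_to_bits(v, planes)
d_near = hamming(code_v, project_to_bits(near, planes))
d_far = hamming(code_v, project_to_bits(far, planes))
```

With zero padding, a vector shorter than the kernel width leaves the padded bits identical across all vectors, so they contribute nothing to the Hamming distance. In the sketch, all 1024 bits depend on the input, which is the property the reply credits for the better recall.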