Querying 3B Vectors
vickiboykis.comPick an embedding model that supports binary quantization and then use a SIMD-optimized Hamming Distance function. I'm doing this for Scour and doing about 1.6 billion comparisons per second.
Depending on how 'one-off' the query is, sequential read is the right answer. The alternative is indexing the data for ANN, which will generally require doing the equivalent of many queries across the dataset.
On the bright side, smart folks have already thought pretty hard about this. In my work, I ended up picking usearch for large-scale vector storage and ANN search. It's plenty fast and is happy working with vectors on disk - solutions which are /purely/ concerned with latency often don't include support for vectors on disk, which forces you into using a hell of a lot of RAM.
As it always happens, people realize there is something new is Redis in 2 years or more. With Streams it tragically took like 4 years and then everybody started to use it for this use case, with a sharp acceleration in the latest few years. I believe this is what is happening for vector sets as well. For a reduced size problem like that you just git clone Redis, add the vectors into a key with VADD, and query with VSIM. It's a 10 lines Python script that will deliver 20k/50k queries per second more or less, out of the box with zero optimizations.
But here the problem is: the scale. Billions of vectors. And I wonder if Redis should add on-disk vector sets, which I started to sketch months ago and never implemented. So my question is, the "3B" in Vicky's blog post is theoretical or is a practical need many folks have? I'm asking because at such a scale, the main problem is to generate the embeddings for your items, whatever they are.
https://gist.github.com/antirez/b3cc9af4db69b04756606ad91cab...
EDIT: I wonder if it is possible to use in memory vector sets to index discrete on disk dense blobs of nearby vectors to query with an approach like the one described in the post. It's like a H-HNSW, and resembles to certain on-disk approaches for vector similarity indeed.
> at such a scale, the main problem is to generate the embeddings for your items
Generation is often decoupled from querying, though. Consider LLMs, where training is a very expensive, slow, hardware intensive process, whereas inference is much faster and much less intensive.
But the performance of inference is in many ways more important than the performance of training, because inference is what users interact with directly.
I’m confused: I thought the original question was about doing a search for nearest neighbours. You don’t need to store the results of each comparison, just track the closest found so far?
This post is amusing to me because after solving the problem in ~2 seconds the author boils the ocean to get that down further, then finally ends with questioning what the problem statement even is?
Classic software engineer pitfall. First gather the requirements!
Second, if their initial interpretation was correct, and it's a one-shot operation, then the initial solution solves it. Done! Why go any further?
I get that it's fun to muse over solutions to these types of problems but the absurdity of it all made me laugh. Jeff's answer was the best, because it describes a solution which makes the assumptions crystal clear while outlining a straightforward implementation. If you wanted something else, it's obvious you need to clarify.
They don't actually solve the problem in 2 seconds - at that point, they are running on a sample of only 3,000 vectors! Then they get it down further, but still find it will take a loooooong time to get through all 3B:
"With these small improvements, we’ve already sped up inference to ~13 seconds for 3 million vectors, which means for 3 billion, it would take 1000x longer, or ~3216 minutes." ...which is about two days.
Vicky’s writeup is interesting, but to me the most interesting thing is Jeff Dean’s advice that sometimes doing a linear scan is the fastest approach (over any kind of indexing). This is basic advice, but modern developers might be pre-disposed to use index-based tools or data stores because the tech is now so good and ubiquitous.
Modern developers are predisposed to reach for off the shelf solutions, full stop. They're afraid of, or perhaps allergic to, just reading and writing files.
If you can learn to get past this you can unlock a whole universe of problem solving.
Or just use a vector store like LanceDB or Turbopuffer and be done with it.
Using something like duckdb could help in scenarios like this one?
There is no reason to do it on a CPU instead of a GPU. Anyway, a vector database would be a better solution, although not without downsides.