40 Million Documents. One CPU. No GPU.
How binary vector search plus int8 rescoring quietly rewrites the rules of large-scale semantic search.
There’s a very specific kind of frustration you hit when you work with embeddings long enough.
You finally get retrieval working. It’s accurate. It feels smart. You’re proud of it.
And then you look at the bill.
Or worse, the RAM usage.
Or worse than that — the moment you realize your “simple” semantic search needs 180GB of memory just to exist.
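(That number isn't mysterious. It's just arithmetic. The 1024-dim float32 embeddings below are my assumption for illustration, not a detail from anyone's bill:)

```python
# Back-of-envelope memory math for dense float32 retrieval.
# Assumed: 40M documents, 1024-dim embeddings (typical of modern models).
n_docs = 40_000_000
dim = 1024
bytes_per_float32 = 4

raw_gb = n_docs * dim * bytes_per_float32 / 1e9
print(f"{raw_gb:.0f} GB")  # ~164 GB for the raw vectors alone;
# add index structures and overhead and you're in 180GB territory.
```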
That’s usually where the quiet assumption sneaks in:
“Well… I guess this is just the cost of doing vector search properly.”
But maybe it isn’t.
Because here’s the thing that stopped me mid-scroll:
You can search 40 million texts in ~200ms,
on CPU only,
with 8GB RAM,
and 45GB of disk.
No GPU.
No exotic hardware.
No magic.
Just a very deliberate inference strategy.
And once you see it, you can’t unsee it.
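If you want the shape of it, here's a minimal NumPy sketch of the two-stage idea: a cheap Hamming-distance scan over binary-quantized vectors, then int8 rescoring of the survivors. The toy corpus size, the sign-threshold binarization, and the naive int8 scaling are simplifying assumptions for illustration, not the exact production recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the corpus: 10k docs x 1024 dims instead of 40M,
# so the sketch runs in a blink. (Sizes are assumptions for illustration.)
n_docs, dim = 10_000, 1024
corpus_f32 = rng.standard_normal((n_docs, dim)).astype(np.float32)

# --- Offline quantization, done once ---
# Binary: 1 bit per dimension (32x smaller than float32) -> lives in RAM.
corpus_bin = np.packbits(corpus_f32 > 0, axis=1)      # (n_docs, dim // 8)
# int8: 1 byte per dimension (4x smaller) -> lives on disk, memory-mapped.
scale = np.abs(corpus_f32).max()
corpus_i8 = np.round(corpus_f32 / scale * 127).astype(np.int8)

def search(query_f32, top_k=10, rescore_factor=4):
    """Stage 1: Hamming scan over binary codes. Stage 2: int8 rescore."""
    # Hamming distance = popcount(query XOR doc) over the packed bits.
    query_bin = np.packbits(query_f32 > 0)
    xor = np.bitwise_xor(corpus_bin, query_bin)       # broadcasts over docs
    hamming = np.unpackbits(xor, axis=1).sum(axis=1)

    # Keep a shortlist a few times larger than top_k.
    n_cand = top_k * rescore_factor
    candidates = np.argpartition(hamming, n_cand)[:n_cand]

    # Rescore only the shortlist with the higher-fidelity int8 vectors
    # against the full-precision query (a float x int8 dot product).
    scores = corpus_i8[candidates].astype(np.float32) @ query_f32
    return candidates[np.argsort(-scores)][:top_k]

query = rng.standard_normal(dim).astype(np.float32)
print(search(query))
```

The whole trick is in what each stage touches: stage 1 scans ~128 bytes per document, and stage 2 pulls the heavier int8 vectors for a few dozen candidates instead of 40 million.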