Search 40M documents in under 200ms on a CPU using binary embeddings and int8 rescoring.


Civil Learning

40 Million Documents. One CPU. No GPU.

How binary embedding search plus int8 rescoring quietly rewrites the rules of large-scale semantic search.

There’s a very specific kind of frustration you hit when you work with embeddings long enough.

You finally get retrieval working. It’s accurate. It feels smart. You’re proud of it.
And then you look at the bill.

Or worse, the RAM usage.

Or worse than that — the moment you realize your “simple” semantic search needs 180GB of memory just to exist.
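For a sense of where numbers like that come from, here is a quick back-of-envelope, assuming 1024-dimensional embeddings (a common size; the article does not state the dimension):

```python
# Illustrative memory arithmetic for 40M documents, 1024-dim embeddings.
docs, dim = 40_000_000, 1024

float32_gb = docs * dim * 4 / 1e9   # full precision: 4 bytes per dimension
int8_gb = docs * dim * 1 / 1e9      # scalar-quantized: 1 byte per dimension
binary_gb = docs * dim / 8 / 1e9    # binary: 1 bit per dimension

print(float32_gb, int8_gb, binary_gb)  # ~164 GB vs ~41 GB vs ~5 GB
```

Full-precision vectors alone land in the 160GB+ range before any index overhead, while the binary form fits comfortably in 8GB of RAM and the int8 form in 45GB of disk.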

That’s usually where the quiet assumption sneaks in:
“Well… I guess this is just the cost of doing vector search properly.”

But maybe it isn’t.

Because here’s the thing that stopped me mid-scroll:

You can search 40 million texts in ~200ms, on CPU only, with 8GB RAM and 45GB disk.

No GPU.
No exotic hardware.
No magic.

Just a very deliberate inference strategy.

And once you see it, you can’t unsee it.
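A minimal sketch of what "binary embeddings plus int8 rescoring" typically means in practice: a cheap Hamming-distance scan over 1-bit codes to pull candidates, then an int8 dot-product rescore for accuracy. The dimensions, function names, and candidate multiplier below are illustrative assumptions, not details from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024  # assumed embedding dimension

def to_binary(vecs: np.ndarray) -> np.ndarray:
    """Pack the sign bits of float vectors into uint8: DIM/8 bytes per vector."""
    return np.packbits(vecs > 0, axis=-1)

def to_int8(vecs: np.ndarray) -> np.ndarray:
    """Simple symmetric scalar quantization of float vectors to int8."""
    scale = np.abs(vecs).max(axis=-1, keepdims=True) / 127.0
    return np.round(vecs / scale).astype(np.int8)

# Float32 embeddings are only needed transiently; keep the compact forms.
corpus = rng.standard_normal((100_000, DIM)).astype(np.float32)
corpus_bin = to_binary(corpus)  # 128 bytes/doc -> fits in RAM
corpus_i8 = to_int8(corpus)     # 1024 bytes/doc -> can live on disk (mmap)

def search(query: np.ndarray, k: int = 10, rescore_factor: int = 40) -> np.ndarray:
    # Stage 1: Hamming distance over binary codes (XOR, then count set bits).
    q_bin = to_binary(query)
    dists = np.unpackbits(corpus_bin ^ q_bin, axis=-1).sum(axis=-1)
    cand = np.argpartition(dists, k * rescore_factor)[: k * rescore_factor]
    # Stage 2: rescore only the candidates with int8 dot products.
    scores = corpus_i8[cand].astype(np.float32) @ query.astype(np.float32)
    return cand[np.argsort(-scores)[:k]]

top = search(rng.standard_normal(DIM).astype(np.float32))
```

The key design point is that the expensive full scan runs on the tiny binary representation, and the higher-precision int8 vectors are touched only for a few hundred candidates per query, which is what keeps the whole thing CPU- and RAM-friendly.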
