70M vectors searched in 48 ms on a single consumer GPU – results you won't believe
I built a prototype GPU-based vector search system that runs locally on a consumer PC.
Hardware:
RTX 3090, consumer CPU, NVMe SSD
Dataset:
~70 million vectors (384 dimensions)
Performance:
~48 ms search latency for top-k results.
This corresponds to roughly 1.45 billion vector comparisons per second on a single GPU.
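As a sanity check, the throughput figure follows directly from the latency and dataset size, assuming a brute-force scan of every vector per query (my assumption, not stated above):

```python
# Back-of-envelope: brute-force scan of the whole dataset per query.
vectors = 70_000_000
latency_s = 0.048
comparisons_per_s = vectors / latency_s
print(f"{comparisons_per_s / 1e9:.2f} B comparisons/s")  # prints 1.46
```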
The system uses a custom GPU kernel and a two-stage search pipeline (binary filtering + floating-point reranking).
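The two-stage idea can be sketched in NumPy. This is a CPU illustration only: the function names, the candidate count, and the choice of L2 distance for reranking are my assumptions, and the actual system runs this as a custom CUDA kernel.

```python
import numpy as np

def binarize(vectors):
    # Sign-bit quantization: pack 384 floats into 48 bytes per vector.
    return np.packbits(vectors > 0, axis=1)

def two_stage_search(db_f32, db_bits, query, k=10, n_candidates=1000):
    # Stage 1 (coarse filter): Hamming distance on packed bits via XOR + bit count.
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(db_bits ^ q_bits, axis=1).sum(axis=1)
    cand = np.argpartition(hamming, n_candidates)[:n_candidates]
    # Stage 2 (rerank): exact float L2 distance on the surviving candidates only.
    dists = np.linalg.norm(db_f32[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]
```

On a GPU, the XOR-plus-popcount of stage 1 would map naturally onto intrinsics like `__popc`; the `unpackbits` detour here is purely for readability.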
My goal was to explore whether large-scale vector search could run efficiently on consumer hardware instead of large datacenter clusters.
After thousands of hours of work and many failed attempts, the results finally became stable enough to benchmark.
I'm currently exploring how far this approach can scale.
I'd be very interested to hear how others approach large-scale vector search on consumer hardware.
Happy to answer questions.

Quick update: I've been iterating on the approach and managed to push the coarse search further. Currently seeing ~100M vectors scanned in ~10 ms on a single RTX 3090 (binary stage only). Still experimenting with trade-offs between speed and recall, but it's interesting how far this can go on consumer hardware. Curious what kind of numbers others are seeing for large-scale vector search on GPUs.

> Is it available somewhere?

Not yet — it's still a personal prototype and I'm actively experimenting with different approaches and optimizations. I'm trying to better understand the limits of what's possible on consumer hardware before deciding how to package or share it. Happy to share more high-level insights, though.

One thing I'm still working out is where the real limits are. At this point the bottleneck feels less like raw compute and more like how efficiently data is represented and accessed on the GPU. Curious if others have seen similar behavior when pushing large-scale vector search on consumer hardware.
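A rough way to test the "memory access, not compute" intuition is to compute the effective bandwidth of the binary stage. The numbers below are my assumptions, not measurements from the post: vectors packed to 1 bit per dimension (384 dims = 48 bytes each), and ~936 GB/s peak memory bandwidth for the RTX 3090.

```python
# Back-of-envelope: is the binary stage memory-bandwidth-bound?
# Assumptions (mine): 1 bit/dim packing -> 48 bytes per 384-d vector,
# RTX 3090 peak memory bandwidth ~936 GB/s.
vectors = 100_000_000
bytes_per_vector = 384 // 8        # 48 bytes
scan_time_s = 0.010
effective_gbps = vectors * bytes_per_vector / scan_time_s / 1e9
print(f"effective bandwidth: {effective_gbps:.0f} GB/s")  # prints 480
```

Under those assumptions the scan already moves about half of the card's peak bandwidth, which is consistent with data layout and access patterns mattering more than raw FLOPs at this stage.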