70M vectors searched in 48 ms on a single consumer GPU – results you won't believe
I built a prototype GPU-based vector search system that runs locally on a consumer PC.
Hardware:
RTX 3090, consumer CPU, NVMe SSD
Dataset:
~70 million vectors (384 dimensions)
Performance:
~48 ms search latency for top-k results.
This corresponds to roughly 1.45 billion vector comparisons per second on a single GPU.
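As a sanity check, the throughput figure follows directly from the latency and dataset size, assuming a brute-force scan of every vector per query (my assumption, not stated above):

```python
# Back-of-envelope: brute-force scan of the whole dataset per query.
vectors = 70_000_000
latency_s = 0.048
comparisons_per_s = vectors / latency_s
print(f"{comparisons_per_s / 1e9:.2f} B comparisons/s")  # prints 1.46
```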
The system uses a custom GPU kernel and a two-stage search pipeline (binary filtering + floating-point reranking).
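The two-stage idea can be sketched in NumPy. This is a CPU illustration only: the function names, the candidate count, and the choice of L2 distance for reranking are my assumptions, and the actual system runs this as a custom CUDA kernel.

```python
import numpy as np

def binarize(vectors):
    # Sign-bit quantization: pack 384 floats into 48 bytes per vector.
    return np.packbits(vectors > 0, axis=1)

def two_stage_search(db_f32, db_bits, query, k=10, n_candidates=1000):
    # Stage 1 (coarse filter): Hamming distance on packed bits via XOR + bit count.
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(db_bits ^ q_bits, axis=1).sum(axis=1)
    cand = np.argpartition(hamming, n_candidates)[:n_candidates]
    # Stage 2 (rerank): exact float L2 distance on the surviving candidates only.
    dists = np.linalg.norm(db_f32[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]
```

On a GPU, the XOR-plus-popcount of stage 1 would map naturally onto intrinsics like `__popc`; the `unpackbits` detour here is purely for readability.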
My goal was to explore whether large-scale vector search could run efficiently on consumer hardware instead of large datacenter clusters.
After thousands of hours of work and many failed attempts, the results finally became stable enough to benchmark.
I'm currently exploring how far this approach can scale.
I'd be very interested to hear how others approach large-scale vector search on consumer hardware.
Happy to answer questions.

Quick update: I've been iterating on the approach and managed to push the coarse search further. Currently seeing ~100M vectors scanned in ~10 ms on a single RTX 3090 (binary stage only). Still experimenting with trade-offs between speed and recall, but it's interesting how far this can go on consumer hardware. Curious what kind of numbers others are seeing for large-scale vector search on GPUs.

> Is it available somewhere?

Not yet — it's still a personal prototype and I'm actively experimenting with different approaches and optimizations. I'm trying to better understand the limits of what's possible on consumer hardware before deciding how to package or share it. Happy to share more high-level insights, though.

One thing I'm still working out is where the real limits are. At this point the bottleneck feels less like raw compute and more like how efficiently data is represented and accessed on the GPU. Curious if others have seen similar behavior when pushing large-scale vector search on consumer hardware.
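A rough way to test the "memory access, not compute" intuition is to compute the effective bandwidth of the binary stage. The numbers below are my assumptions, not measurements from the post: vectors packed to 1 bit per dimension (384 dims = 48 bytes each), and ~936 GB/s peak memory bandwidth for the RTX 3090.

```python
# Back-of-envelope: is the binary stage memory-bandwidth-bound?
# Assumptions (mine): 1 bit/dim packing -> 48 bytes per 384-d vector,
# RTX 3090 peak memory bandwidth ~936 GB/s.
vectors = 100_000_000
bytes_per_vector = 384 // 8        # 48 bytes
scan_time_s = 0.010
effective_gbps = vectors * bytes_per_vector / scan_time_s / 1e9
print(f"effective bandwidth: {effective_gbps:.0f} GB/s")  # prints 480
```

Under those assumptions the scan already moves about half of the card's peak bandwidth, which is consistent with data layout and access patterns mattering more than raw FLOPs at this stage.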