OpenData Vector: MIT-Licensed Vector Search on Object Storage

opendata.dev

44 points by apurvamehta 21 hours ago · 6 comments

oliverio 18 hours ago

Very interesting, thanks for sharing. This has a lot of nods to Turbopuffer's architecture [0]. My impression is they've spent a lot of time optimizing at the hardware/firmware layer to achieve extremely fast query results.

To put it roughly: how close is OpenData Vector to Turbopuffer in terms of performance today, and where are the major gaps and mountains to scale?

Really excited to keep an eye on the repos, great read!

[0] https://turbopuffer.com/blog/turbopuffer

  • rohanpdes 18 hours ago

    Yep! Vector provides a lot of the same benefits, just as an OSS project. They were definitely a major inspiration. Vector's performance is similar to their published benchmarks. The biggest gap is (unsurprisingly) for larger (e.g. 100s of M - 1B+) datasets. We talk about it in the post, but the main improvement there is adding quantization to reduce the overhead of loading large posting lists. There's also a bunch of storage and caching layer work to be done. That's on our roadmap along with some cool features like full-text search and better support for multi-tenancy.

  • apurvamehtaOP 18 hours ago

    Thanks! opendata contributor here.

    We're heavily inspired by Turbopuffer. I'd say we are comparable to them when they launched in terms of perf and scale. But they've obviously invested heavily since then, so we're not going to match them on raw perf at scale right now. Our goal is to be a pretty competitive OSS offering over the long term though.

    The next biggest lift for us to get much closer is quantization. If we squeeze more signal into fewer bits, we will improve performance end to end.
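The quantization idea described above can be sketched in a few lines. This is an illustrative example of scalar quantization (float32 → int8), not OpenData Vector's actual implementation; the function names and the choice of per-vector scaling are assumptions for the sake of the sketch:

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Scalar-quantize float32 vectors to int8 codes plus a
    per-vector scale needed to approximately reconstruct them."""
    scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero vectors
    codes = np.round(vectors / scales).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
vecs = rng.standard_normal((4, 8)).astype(np.float32)
codes, scales = quantize_int8(vecs)
approx = dequantize(codes, scales)

# int8 codes are 4x smaller than float32, so a posting list of
# quantized vectors needs 4x fewer bytes fetched from object storage.
assert codes.nbytes * 4 == vecs.nbytes
```

The point for an object-storage index is that bytes read per query drop by the compression factor, which directly cuts both latency and GET cost when posting lists are large.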

Reubend 10 hours ago

Stupid question: I was under the impression that object storage was super expensive compared to "normal" SSDs if the QPS numbers got high.

Is that not the case for DBs based on object storage because they cache data before sending it to the object storage? Or because they do some other processing on the DB server before it hits storage?

  • rohanpdes 3 hours ago

    Not stupid at all. API cost (especially for writes) is indeed one of the main challenges of building for object storage. DBs mitigate this by aggressively caching reads and batching writes. There's a fundamental latency/cost tradeoff here - you accept higher latency to get enough batching to amortize PUT costs. This is a very reasonable tradeoff for search systems which typically aren't as latency sensitive.
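    The write-batching tradeoff described above can be sketched as a small buffer that flushes when it accumulates enough data or enough time has passed. This is an illustrative sketch, not any particular database's code; `put_object` stands in for a hypothetical storage-client call:

```python
import time

class BatchedWriter:
    """Buffer small writes and flush them as a single object PUT.

    Trades up to max_delay_s of extra latency for amortizing the
    per-request API cost of object storage across many records.
    """

    def __init__(self, put_object, max_bytes=8 * 2**20, max_delay_s=0.25):
        self.put_object = put_object      # e.g. a client's PUT call
        self.max_bytes = max_bytes
        self.max_delay_s = max_delay_s
        self.buffer = []
        self.buffered_bytes = 0
        self.first_write_at = None

    def write(self, record: bytes):
        if self.first_write_at is None:
            self.first_write_at = time.monotonic()
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if (self.buffered_bytes >= self.max_bytes
                or time.monotonic() - self.first_write_at >= self.max_delay_s):
            self.flush()

    def flush(self):
        if self.buffer:
            # One PUT for N buffered records: API cost drops from
            # N requests to 1, at the price of the buffering delay.
            self.put_object(b"".join(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0
            self.first_write_at = None

# Demo with an in-memory "bucket": 50 tiny writes become 5 PUTs.
puts = []
writer = BatchedWriter(puts.append, max_bytes=100, max_delay_s=60.0)
for _ in range(50):
    writer.write(b"x" * 10)
```

    With S3-style pricing, where each PUT is billed per request, collapsing N small writes into one object is what makes a write-heavy index economical on object storage; the read side relies on caching for the same reason.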
