Ask HN: Vector-space database (as a service)?

5 points by kleebeesh 8 years ago · 5 comments · 2 min read


Methods like collaborative filtering via matrix factorization, Word2vec, Doc2vec, etc. map large, sparse matrices into a low-dimensional vector space while enforcing similarity constraints. There are extensions for vectorizing various modalities (users, items, documents, audio, images, etc.) into one vector space for similarity search and recommendation ([1], [2], [3]), and there is extensive research on approximate nearest-neighbor search ([4]).

For example: it's possible to map users, songs, and artists into a common vector space ([1]). Two users who listen to similar songs have high similarity. Songs are recommended based on vector similarity to users. This pattern extends to many domains as long as there is a way to enforce similarity (likes, co-occurrences, etc.) to "train" the vectors.
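A minimal sketch of this idea, with made-up toy vectors standing in for ones a real model (e.g. matrix factorization) would produce: once users and songs live in the same space, recommendation is just ranking by cosine similarity.

```python
import numpy as np

# Hypothetical toy embeddings; in practice these come from a trained model.
user_vecs = {"alice": np.array([0.9, 0.1, 0.0])}
song_vecs = {"song_a": np.array([1.0, 0.0, 0.0]),
             "song_b": np.array([0.0, 1.0, 0.0])}

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(user):
    # Rank all songs by similarity to the user's vector, most similar first.
    u = user_vecs[user]
    return sorted(song_vecs, key=lambda s: cosine(u, song_vecs[s]), reverse=True)

recs = recommend("alice")  # "song_a" ranks first: it is closest to alice's vector
```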

In my experience, training the vectors is simpler than the engineering needed to query them efficiently (e.g. "select the 10 nearest neighbors to the vector with ID 123"). This becomes expensive for large datasets, and using the approximate nearest-neighbor libraries correctly is non-trivial.
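A sketch of that query done the naive, exhaustive way, over random placeholder data (the IDs and dimensions are made up). The full O(n·d) scan per query is exactly what becomes expensive at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
ids = np.arange(1000)               # vector IDs 0..999
vecs = rng.normal(size=(1000, 64))  # one 64-d vector per ID

def nearest(query_id, k=10):
    # "Select the k nearest neighbors to the vector with ID query_id."
    q = vecs[query_id]
    d = np.linalg.norm(vecs - q, axis=1)     # Euclidean distance to every vector
    d[query_id] = np.inf                     # exclude the query vector itself
    top_k = np.argpartition(d, k)[:k]        # O(n) partial selection, no full sort
    return ids[top_k[np.argsort(d[top_k])]]  # sorted by increasing distance

neighbors = nearest(123, k=10)
```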

I can't find any database that lets me insert vectors as they're computed and then run queries against them. It seems companies often build a custom API on top of one of the approximate nearest-neighbor libraries, even though the interesting queries seem pretty homogeneous.

Any ideas as to why none of the big DB players have an offering for this use case? Something like Algolia, but for vectors instead of text? Any recommendations for such a product?

[1] IHeartRadio queries various modalities of data from the same vector space: https://youtu.be/jjO1gOH-BW4?t=5m39s
[2] Using a convnet to map new (cold-start) songs into an existing vector space: http://benanne.github.io/2014/08/05/spotify-cnns.html
[3] Flickr similarity search: http://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/
[4] Benchmarks for approximate nearest neighbor libraries: https://github.com/erikbern/ann-benchmarks

PaulHoule 8 years ago

Hyperdimensional nearest-neighbor search is a tough problem; there are index algorithms such as ball trees that work, but they don't deliver the big wins that b-trees give in 1-d space, quadtrees in 2-d space, etc.

In many "as a service" offerings computational costs are not a big deal. For this one it would be, thus making the pricing work right for everybody would be a toughie.

billconan 8 years ago

I thought about word2vec as a service. I gave up because I think customers could easily cache (pirate) my data.

  • kleebeeshOP 8 years ago

    That's not really a problem if the user is frequently ingesting and vectorizing new data. They need a place to store it and efficiently query it. They can cache the results for an old vector, but they still need to run new queries every time they have a new piece of content. Also, old vectors may gain relationships to new vectors inserted since caching, so you don't want to serve stale results.

    • billconan 8 years ago

      If we forget about making the service for a moment, how would you store and process high-dimensional vectors locally? What ready-made library/software would you use? What data structure?

      • kleebeeshOP 8 years ago

        Some storage options are:
        - Store many vectors in a single HDF5 or LMDB file.
        - Store single vectors in many small binary files (e.g. using numpy's save() function).
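A sketch of the second option (one small binary file per vector) using numpy's save()/load(); the directory layout and ID-to-filename scheme here are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical vector store: one .npy file per vector ID.
store = tempfile.mkdtemp()

def put(vec_id, vec):
    # Persist one vector as a small binary file named after its ID.
    np.save(os.path.join(store, f"{vec_id}.npy"), vec)

def get(vec_id):
    # Load one vector back from disk by ID.
    return np.load(os.path.join(store, f"{vec_id}.npy"))

put(123, np.array([0.1, 0.2, 0.3]))
v = get(123)
```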

        To look up neighbors you might:
        - Compute neighbors exhaustively (e.g. using SciPy's distance functions).
        - Use an approximate nearest-neighbor library like the ones benchmarked here: https://github.com/erikbern/ann-benchmarks. These can be much faster than exhaustive computation, at the cost of some accuracy and having to periodically rebuild an index.
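The exhaustive option with SciPy's distance functions might look like this; the stored vectors and query here are random placeholder data.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
db = rng.normal(size=(500, 32))   # stored vectors
query = rng.normal(size=(1, 32))  # one query vector

# cdist computes the distance from the query to every stored vector.
d = cdist(query, db, metric="cosine")[0]
top10 = np.argsort(d)[:10]  # indices of the 10 nearest stored vectors
```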
