ANN Vector Search with SQL-Powered LSH and Random Projections
clickhouse.comIs anyone successfully running LSH in production? The technique seemed promising ten years ago but I'm not sure if reality matched that...
Hey author, here, so yes its not the most recent technique, but when coupled with rescoring like we propose it can be a simple speed up for linear scans. It also a pure SQL solution and requires no indices. It benefits from being easy to update and not being mem bound. We recognise all the issues with it and pros in the final section. We're defn not claiming this is ground breaking - more a useful technique which is easy to replicate in SQL.
Pinecone https://www.pinecone.io/learn/series/faiss/locality-sensitiv... seem to promote as a viable approach.
We defn don't consider this to be the final solution for ANN and hence are investing in other graph based techniques - https://clickhouse.com/docs/en/engines/table-engines/mergetr...
Understood - thanks!
The article does not use LSH as it is typically used.
Typically, LSH is for low dimensions - you split the space to cells by the hash value, use the hash value as an index, and look up one to a few values for every key (so you have some overlap).
In contrast, in the article, we use the value not as a hash but to construct a metric, approximating the original metric space. Then, we do a full scan with filtering by the approximate distance. It speeds up the search because a low amount of data is scanned, but it is not like using the hash for direct lookups.
You can name it differently, e.g., "quantization into bit-vectors".
Ah, I get it now. Thank you!