Dense Vector and Sparse Vector and Fulltext and Tensor Reranker = Best for RAG?

7 points by vissidarte_choi a year ago · 12 comments

Many vector database vendors claim that sparse vectors are enough for precise retrieval and that BM25 is not necessary.

yingfeng a year ago

Hi, I'm one of the creators of Infinity. The article discusses sparse vectors vs. BM25. While sparse vectors perform well in some evaluations, they are produced by a trained model, which means they can't fully represent all of the user's keywords/tokens: those that don't appear in the training set are truncated. This has a very big impact in many enterprise vertical scenarios. BM25 has no such limitation.
- philippemnoel a year ago
  
  BM25 is indeed way more important than these vector DBs will claim. At ParadeDB, we've observed significant use cases where customers need both.

What are the advantages and potential challenges of combining dense vectors, sparse vectors, and full-text search in a hybrid retrieval method, as implemented in Infinity v0.2, and how does this approach compare to traditional vector search or other retrieval methods?

yingfeng a year ago

A notable work demonstrating the effectiveness of hybrid search is Blended RAG by IBM Research (https://arxiv.org/abs/2404.07220), which showed that 3-way hybrid search can achieve SOTA on multiple evaluation datasets. We've also reproduced the Blended RAG results, as shown in this article. Additionally, Blended RAG + a ColBERT-based reranker can give much better results.
The major challenge is how to implement and manage so many indices within a single database. That's why we built this database from scratch. Infinity is actually a kind of "indexing" database, based on a columnar store. The executor also requires careful design to fuse these hybrid search approaches effectively.
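
To make "fusing" concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge ranked lists from several retrievers. This is an illustration only: the function name, the document IDs, and the constant k=60 are assumptions, and it is not taken from Infinity's executor.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list (best first).

    Each document receives 1 / (k + rank) from every list it appears in.
    k=60 is a commonly used default, chosen here as an assumption.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from three retrievers.
dense = ["d1", "d2", "d3"]
sparse = ["d2", "d1", "d4"]
fulltext = ["d2", "d4", "d1"]

fused = reciprocal_rank_fusion([dense, sparse, fulltext])
print(fused)
```

A document that ranks well in several lists ends up ahead of one that ranks first in only a single list, which is the basic reason a 3-way recall can help.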

small-turtle a year ago

How do you compare with other vector databases? Some of them have already implemented both dense vector and sparse vector search.

yingfeng a year ago

There are some vector databases that already include both dense and sparse vector search, such as Qdrant. A hybrid search of these two does not solve many problems well, such as exact queries. Moreover, according to our experiments, as seen in the article, dense vector + sparse vector improves results only a little. In addition to this two-way recall, Infinity offers BM25 as well as a ColBERT reranker, which can make the ranking quality of hybrid search much better.
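
As a rough illustration of why BM25 helps with exact queries, here is a small self-contained Okapi BM25 scorer over whitespace tokens. The documents, the part number, and the parameters k1=1.2 and b=0.75 are made up for the example; it is a sketch of the textbook formula, not Infinity's full-text implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document adds nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical catalogue entries that differ only in a part number.
docs = [
    "replacement filter for pump model zx-4471b",
    "replacement filter for pump model zx-4471c",
    "general pump maintenance guide",
]
scores = bm25_scores("zx-4471b", docs)
print(scores)
```

Only the first document gets a non-zero score, because the scorer matches the literal token and needs no trained vocabulary.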

fm100 a year ago

Why do you guys implement a tensor data type instead of integrating ColBERT directly?

yingfeng a year ago

Because ColBERT is not an end-to-end solution. Take RAGatouille: it has integrated ColBERTv2 into its repo, but it's not a database. We implement tensors within Infinity, aiming to provide an end-to-end solution for late-interaction-based ranking models.
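
For readers unfamiliar with late interaction, here is a minimal sketch of the MaxSim scoring used by ColBERT-style models: each document is a list of token vectors (a "tensor"), and each query token is matched against its best document token. The vectors below are toy values, and the function name is an assumption for illustration, not Infinity's API.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best
    dot product against any document token."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

candidates = {"doc_a": doc_a, "doc_b": doc_b}
reranked = sorted(
    candidates,
    key=lambda name: maxsim_score(query, candidates[name]),
    reverse=True,
)
print(reranked)
```

Because the score needs every token vector of every candidate, storing those vectors next to the other indices avoids shipping them to an external reranker.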

newpeak a year ago

What are your advantages over ParadeDB? It also has dense+sparse+bm25.

yingfeng a year ago

ParadeDB can also deliver three-way hybrid search through pg_vector, pg_sparse and pg_search. Compared with ParadeDB, Infinity has the following advantages:
1. Performance
pg_vector is far slower than Infinity's vector search due to the vector index design. pg_sparse is also slower than Infinity's sparse vector search. pg_search is much slower than Infinity's full-text search: it is based on Tantivy, which is much slower than Infinity's inverted index.
Detailed benchmarks can be seen in this article: https://infiniflow.org/blog/fastest-hybrid-search or in the GitHub repo.
2. Infinity has built-in implementations of all three search approaches above. These indices work smoothly together with Infinity's executor. Users can use any combination of the search approaches, together with the fused ranking algorithms, very efficiently.
3. Infinity also has built-in support for tensors, which makes it possible to deliver an in-database ColBERT reranker, as opposed to a cross-encoder-based reranker outside the database. The ColBERT reranker can bring significant benefits for search quality.
4. Infinity is much easier to use: it can be deployed either as a standalone server or as an embedded Python library just through pip install.
5. Infinity is designed from scratch, so it does not carry the burden of PostgreSQL, and it is evolving fast. It will run on cloud in the very near future, which could save a lot of cost.
- philippemnoel a year ago
  
  Hey folks, ParadeDB co-founder here. Cool project! Just thought I'd chime in and clarify a few things:
  1. pg_sparse is deprecated. pgvector released native sparse vector support with the `sparsevec` datatype, and ParadeDB no longer maintains pg_sparse. It has been this way for several months already.
  I'd love to see a benchmark re: Tantivy. You claim that pg_search is much slower, but Tantivy is state-of-the-art for full-text search performance and the ParadeDB performance is robust. You can see our benchmarks in our repository README, where we compare ourselves to Elastic.
  4/5. ParadeDB is Postgres by design. If you are adopting Postgres, which many are, then ParadeDB can be installed directly as an extension via logical replication on a read replica. This removes the need for ETL to a non-Postgres system, which drastically reduces operational burden.
  Of course, if you're not using Postgres, ParadeDB is not designed for you and a tool like Infinity seems like a viable option alongside other standalone search engines.
