Show HN: Retake – Open-Source Hybrid Search for Postgres

github.com

88 points by philippemnoel 2 years ago · 23 comments · 2 min read

Hey HN! We're Phil and Ming, co-founders of Retake (https://github.com/getretake/retake). Retake is an open source tool that adds keyword and semantic (i.e., hybrid) search to databases. We’ve started by extending the capabilities of Postgres with an SDK for lightning-fast queries.

We built Retake to fix two issues: keeping vectors in sync with Postgres in real time is difficult, and most vector databases aren’t built for hybrid search.

A quick refresher: “keyword search” refers to a technique where results are scored based on the appearance of exact words or terms. “Semantic search” uses vector embeddings to understand the meaning behind those words. Hybrid search combines these two approaches to enhance the precision and relevance of results.
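As a rough illustration of how hybrid search can combine the two kinds of score, here is a minimal Python sketch. It assumes min-max normalization and a weighted sum; the function names and the `alpha` weight are hypothetical and are not taken from Retake's codebase.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every document as a full match.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(keyword_scores, semantic_scores, alpha=0.5):
    """Blend normalized keyword and semantic scores.

    alpha is the (assumed) weight of the keyword score; a document
    missing from one result list contributes 0 for that list.
    """
    kw = min_max(keyword_scores)
    sem = min_max(semantic_scores)
    combined = {
        doc: alpha * kw.get(doc, 0.0) + (1 - alpha) * sem.get(doc, 0.0)
        for doc in set(kw) | set(sem)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores: BM25-style values and cosine-style similarities.
keyword_scores = {"a": 12.0, "b": 4.0, "c": 8.0}
semantic_scores = {"b": 0.9, "c": 0.8, "d": 0.5}
ranking = hybrid_rank(keyword_scores, semantic_scores)
```

Normalizing first matters because keyword scores are unbounded while similarity scores usually sit in a fixed range, so a raw sum would let one side dominate.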

To implement semantic or hybrid search today, most organizations run batch jobs that update their search engine or vector database using ETL tools or custom data pipelines. We’ve seen from firsthand experience how time-consuming and costly this can be, as moving vectors often requires re-embedding the entire data source.

We’ve also seen how many vector databases lack crucial features of “traditional” search: keyword-based (BM25) search, faceting/aggregations, highlighting, efficient filtering, etc.

Here’s how Retake works: our core is built on top of OpenSearch, which acts as a search engine and vector database. We leverage logical-replication-based Change Data Capture (CDC) to stay in sync with Postgres, so documents and vectors are updated incrementally and in real time. Finally, Python and TypeScript SDKs make it easy to integrate Retake into your application. There’s no need to manage separate vector databases and search engines, upload and embed documents, or run expensive reindexing jobs. All you need to think about is writing search queries.
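To make the incremental part concrete, here is a minimal Python sketch of applying change events to a search index one row at a time. The event shape (`op`, `id`, `row`), the in-memory `index` dict, and the `embed` stand-in are all assumptions made for illustration; they do not reflect Retake's or pgsync's actual formats.

```python
def embed(text):
    # Hypothetical stand-in for a real embedding model call.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def apply_change(index, event):
    """Apply one change event to the index, touching only the affected row."""
    op, key = event["op"], event["id"]
    if op in ("insert", "update"):
        row = event["row"]
        # Only the changed row is re-embedded, not the whole table.
        index[key] = {"row": row, "vector": embed(row["body"])}
    elif op == "delete":
        index.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


# Assumed stream of changes decoded from the database's replication log.
events = [
    {"op": "insert", "id": 1, "row": {"body": "hello"}},
    {"op": "insert", "id": 2, "row": {"body": "world"}},
    {"op": "update", "id": 2, "row": {"body": "hybrid search"}},
    {"op": "delete", "id": 1},
]

index = {}
for event in events:
    apply_change(index, event)
```

The contrast with a batch job is that each event costs one embedding call and one index write, so the work scales with the number of changes rather than the size of the table.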

The easiest way to get started with Retake is by running our Docker Compose stack:

  git clone https://github.com/getretake/retake.git
  cd retake/docker && docker compose up

Retake is Apache licensed and our repo is here: https://github.com/getretake/retake. For next steps, see our quick start guide: https://docs.getretake.com/quickstart

We’d love your feedback on our solution to hybrid search. Our focus right now is on nailing the basics, but we’d also love to hear what you think we should focus on next.

pk19238 2 years ago

Does the sync handle deletes? That is, if we delete data from our Postgres database, will it be deleted from your database as well? I can see this integrating well with our pipeline, since we're currently syncing data from Postgres to our own vector database.

jph 2 years ago

Clever idea, good work!

You asked for feedback: I see opportunities for you to nail the basics, by focusing on the value proposition so business-oriented people understand why/how to buy, and on the middle-tier architecture so technical people understand that you're akin to OpenSearch with Faiss & vectors that auto-update.

My understanding (and please clarify as you wish) of what I've read on your site is this: you're selling the hosted version for enterprises at a price to be discussed with your sales team, and the architecture is something like this...

  ┌──────────────┐    ┌───────────────┐    ┌──────────────┐
  │Search SDK    │    │Search Engine  │    │Data Source   │
  │• Typescript  │    │• OpenSearch   │    │• Postgres    │
  │• Python      │    │• Faiss, KNN   │    │• MySQL (?)   │
  │• Java (soon) │◀──▶│• Keyword, BM25│◀──▶│• Oracle (?)  │
  │• Go (soon)   │    │• Auto-update  │    │• Mongo (?)   │
  │• Etc.        │    │• Etc.         │    │• Etc.        │
  └──────────────┘    └───────────────┘    └──────────────┘
  • pnoel 2 years ago

    That's a nice diagram! Yeah that's roughly it. We'll be adding support for more sources of truth in the future to expand coverage, like the ones you mention but also NoSQL like MongoDB

    • isaacfung 2 years ago

      So are you guys using faiss instead of the vector search of postgres?

      I think Vespa also supports hybrid search (it can also use late-interaction models like ColBERT). How does Retake compare to Vespa?

      Will Retake support sparse vector models like SPLADE? (I heard they solve the vocab mismatch problems of keyword search.)

      How do you guys implement filtering?

      • retakeming 2 years ago

        1. Correct - we don't rely on pgvector. As a result, we're compatible with more existing managed Postgres services.

        2. Probably the biggest differentiator between Vespa and Retake is the core architecture - Retake is built on top of OpenSearch. There's been quite a bit of debate regarding different search engines since Yahoo released Vespa - we leaned into OpenSearch because we saw that Open/ElasticSearch and its query language was much more familiar to more developers. Something that's coming soon to Retake is the ability to control how keyword/semantic scores are normalized and combined, which should give developers more fine-tuned control over their results.

        3. In the short term, our support for models like SPLADE is constrained by OpenSearch, which uses BM25. In the medium to long term we would definitely consider modifying OpenSearch to do stuff like this.

        4. We support both post-filtering and efficient kNN filtering, which takes place during the kNN search and guarantees that k results are returned. More details on the faiss kNN filter implementation can be found on the OpenSearch docs: https://opensearch.org/docs/latest/search-plugins/knn/filter...
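The difference between post-filtering and filtering during the kNN search can be sketched in a few lines of Python. This is a brute-force illustration of the semantics only, using made-up documents and hypothetical function names; it is not how the OpenSearch faiss filter is implemented.

```python
import math


def knn(query, docs, k, predicate=None):
    """Return the k nearest docs, optionally restricted to those matching predicate."""
    candidates = [d for d in docs if predicate is None or predicate(d)]
    candidates.sort(key=lambda d: math.dist(query, d["vec"]))
    return candidates[:k]


def post_filter(query, docs, k, predicate):
    """Take the k nearest docs first, then drop the ones that fail the filter."""
    return [d for d in knn(query, docs, k) if predicate(d)]


docs = [
    {"id": 1, "vec": (0.1, 0.0), "color": "red"},
    {"id": 2, "vec": (0.2, 0.0), "color": "blue"},
    {"id": 3, "vec": (0.3, 0.0), "color": "red"},
    {"id": 4, "vec": (0.4, 0.0), "color": "blue"},
]
query = (0.0, 0.0)


def is_blue(doc):
    return doc["color"] == "blue"


after = post_filter(query, docs, k=2, predicate=is_blue)   # may return fewer than k
during = knn(query, docs, k=2, predicate=is_blue)          # k results when enough docs match
```

Here post-filtering returns a single document because one of the two nearest neighbors is discarded, while filtering during the search still returns two.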

  • noodlesUK 2 years ago

    Unrelated to OP, but how did you create that diagram?

Palmik 2 years ago

I would like something that can keep postgres (or other source of truth) in sync with existing search database (like Elastic, Meili, or Qdrant).

But the catch is that it's rare that there's 1:1 mapping between the source of truth and what is indexed. The simplest example would be: You have a document table, but you actually index document chunks.

Therefore I would like something that accepts a preprocessing function and keeps the search data in sync when the source changes. Ideally, it should not reinvent full-text / vector based search and plug in with existing solutions.
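One way to picture such a preprocessing hook is the Python sketch below: a function maps one source row to many index records, and the sync step replaces all records derived from that row whenever it changes. Every name here (`chunk_document`, `sync_document`, the record fields) is hypothetical and not part of any existing tool's API.

```python
def chunk_document(doc, size=20):
    """Split one document row into fixed-size chunk records."""
    text = doc["body"]
    return [
        {"id": f'{doc["id"]}:{i}', "doc_id": doc["id"], "text": text[start:start + size]}
        for i, start in enumerate(range(0, len(text), size))
    ]


def sync_document(index, doc, preprocess):
    """Replace every record derived from doc with freshly preprocessed ones."""
    stale = [key for key, rec in index.items() if rec["doc_id"] == doc["id"]]
    for key in stale:
        del index[key]
    for rec in preprocess(doc):
        index[rec["id"]] = rec


index = {}
sync_document(index, {"id": 7, "body": "x" * 45}, chunk_document)
chunks_before = len(index)   # 45 characters at 20 per chunk

# The source row shrinks; stale chunks must not linger in the index.
sync_document(index, {"id": 7, "body": "y" * 10}, chunk_document)
chunks_after = len(index)
```

Deleting by `doc_id` before re-inserting is the simplest way to handle a 1:N mapping, at the cost of rewriting chunks that did not change.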

mdaniel 2 years ago

https://github.com/getretake/retake/pull/198 is a refreshing change given the recent rug pulls, so thank you for that

  • retakeming 2 years ago

    Thanks! We debated what the right decision was in the beginning but are glad to have settled on Apache.

seemaze 2 years ago

From the product landing page: "By connecting to your sources of truth, Retake unlocks real-time keyword and semantic search over siloed data"

I misread 'siloed data' as 'soiled data' and was like, this product gets me!

dev1l 2 years ago

Awesome project! It would be interesting to see performance tests. I know that a scientific experiment is very difficult, so approximate numbers are enough. Anyway, thanks for your work :)

benjaminsanborn 2 years ago

Thanks for sharing; this project looks very promising!

What precipitated your fork of pgsync and how do you foresee maintaining compatibility with that project?

  • retakeming 2 years ago

    Thanks, appreciate it!

    We forked pgsync for the silly reason that they hadn't published to PyPI in months, and some of their dependencies were out of date. We haven't made any modifications to pgsync, so maintaining compatibility shouldn't be an issue, and we'll likely revert to the main library once their dependencies are brought up to speed.

jaequery 2 years ago

Can’t you just use OpenSearch? What is the point of going through Postgres when you already have OpenSearch?

spleen7777 2 years ago

How does it handle JSONB fields? Do I need to define all keys in a JSONB field for them to be indexed?

  • retakeming 2 years ago

    You don't need to define all keys in a JSON object - by default, new keys will automatically be added to the index mapping when a JSON document containing that key is added to the index.

    Details on how to query JSON objects can be found in our docs: https://docs.getretake.com/search/object

nravic 2 years ago

how often are these batch jobs run? I'm curious to know what the absolute maximum sync frequency can be.

  • retakeming 2 years ago

    We don't run any batch jobs - Retake streams changes in real time via CDC (change data capture). The only batch job you would need to run is to populate an index when it's first created.

ccleve 2 years ago

How does it differ from ZomboDB?

  • pnoel 2 years ago

    Good question -- the primary difference is the method of integration with Postgres. ZomboDB is a Postgres extension, which limits its compatibility with Postgres services like AWS RDS, while Retake is compatible with any service where you can enable logical replication.
