Pg_bm25: Elastic-Quality Full Text Search Inside Postgres

docs.paradedb.com

206 points by billwashere 2 years ago · 71 comments

Reader

I checked the benchmarks and was surprised to see that native search is (a) so slow (seconds), and (b) demonstrating O(N) behavior – with indexing, it should not happen at all.

Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks index for the native case, leading to a false statement the that native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.

https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e... – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);

This mistake could be avoided if bencharks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that for the "native" case, we're dealing with SeqScan, not IndexScan.

GINs are very fast. They are designed to be very fast for search – but they have a problem with slower UPDATEs in some cases.

Another point, fuzzy search also exists, via pg_trgm. Of course, dealing with these things require understanding, tuning, and usually a "lego game" to be played – building products out of the existing (or new) "bricks" totally makes sense to me.

philippemnoel 2 years ago

One of the ParadeDB authors here, hey! Thanks for pointing this out, you're completely right. That's an oversight on our end. We'll update the benchmarks and re-run them to correct this :)
- gvkhna 2 years ago
  
  Great to hear, a benchmark against trigram searching with gin index would also be great. There are multiple ways to do full text search with postgres and they’re all insanely fast and memory efficient. Benchmarking various methods for comparison would be helpful.
  https://www.crunchydata.com/blog/postgres-full-text-search-a...
  - philippemnoel 2 years ago
    
    Thanks for sharing, will look to add a benchmark for that as well
- some_developer 2 years ago
  
  I learned the hard way that Gin updates are too slow, and in my case it was not even 100 updates per seconds on average, but could peak to 1000.
  How does Pg_bm25 compare here with maintaining the index & performance?
  - unblough 2 years ago
    
    If I am understanding your experience correctly the colloquial wisdom here is to use GIN on static data and GIST on dynamic data.
    > In choosing which index type to use, GiST or GIN, consider these performance differences:
    > GIN index lookups are about three times faster than GiST
    > GIN indexes take about three times longer to build than GiST
    > GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 54.3.1 for details)
    > GIN indexes are two-to-three times larger than GiST indexes
    > As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.
    https://www.postgresql.org/docs/9.1/textsearch-indexes.html
arecurrence 2 years ago

This sort of thing is more common with postgres than you'd think. I interviewed a candidate once whose company completely replaced querying in their postgres with elasticsearch because they could not figure out how to speed up certain text search queries. Nothing they tried would use the index.
- westurner 2 years ago
  
  "Preferred Index Types for Text Search" https://www.postgresql.org/docs/current/textsearch-indexes.h... :
  > There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
gvkhna 2 years ago

I had same thought as soon as I read the article, with a gin index the benchmarks would be wildly different and not sure why they didn’t compare against that. Of course a non indexed search is going to be slow.
I was looking for comparison against a gin index specifically, without it pros/cons unclear.
dekimir 2 years ago

I still can't figure out how pg_trgm is supposed to work for multi-term searches and how to ensure the dictionary table it needs stays up-to-date. Is there a good writeup somewhere?

retakeming 2 years ago

Blog post author and one of the pg_bm25 contributors here. Super excited to see the interest in pg_bm25!

pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.

To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (https://www.paradedb.com/).

ralusek 2 years ago

For what it's worth, the single biggest selling point to a better search, for me, would be not having to deal with additional infrastructure and all the hassle that comes with keeping data in sync. I would be very reluctant to move off of RDS/Aurora, and therefore have my principal motivation to use something like this is greatly negated.
I understand that it becomes very hard to monetize if you're not able to offer your own hosted service, and I don't have a solution for that, but not supporting RDS is going to really diminish the product for many people.
- philippemnoel 2 years ago
  
  Our goal is for one day ParadeDB to be a viable alternative to AWS RDS/Aurora, so that like you say, you don't need to keep data in-sync and can just use one system (ParadeDB). Soon it will be possible for you to have ParadeDB running on your AWS (utilizing your cloud credits+all security/privacy guarantees) but be managed via the ParadeDB dashboard, similar to how Aurora works from a developer UX.
  Of course if you are 100% attached to AWS RDS itself (rather than the convenience of AWS RDS, which is replicable by ParadeDB), then there's not much we can do here, as we also need to eat :')
  - olivermuty 2 years ago
    
    Will you be providing this for bring-your-own-compute in general? There is a gaping hole in the market for this. All the big vendors that provide postgres as a service require you to be on very specific types of hosting like aws fargate, google gke etc (looking at you Crunchydata).
    We are using Scaleway (french cloud) which is heaven when it comes to GDPR and Schrems compliance, but once we grow out of their managed db offerings or if we want something their managed db offering does not provide we are out of luck.
    Been looking for a year more or less now and I am simply unable to find something that doesnt amount to us just paying a fraction of a consulting FTE to be our lightweight DBA. There are only so many ways you can set up postgres HA, it is amazing that no one has made a product out of doing it for someone else yet.
    
    philippemnoel 2 years ago
    
    Hey! Absolutely, we would love to offer as many cloud providers as possible for our compute backend. We're starting with AWS, and will be adding other clouds based on demand. I've added Scaleway to our list, and if you'd like to help us bring ParadeDB to Scaleway we would love to work together to make it happen faster.
    In the meantime, you can self-host ParadeDB on Scaleway directly by running the Docker container. Hope this helps!
    
    ddorian43 2 years ago
    
    Only one i know is elest.io
- wdb 2 years ago
  
  Yes, I have a similar feeling towards Cloud SQL for Postgres. Would be great if Azure/GCP would be supported in some manner
- paulddraper 2 years ago
  
  What are the features of RDS/Aurora that you need?
  Also, it would be possible to set up a logical PG replica.
  - ralusek 2 years ago
    
    Being in my VPC, having the support and track record of AWS, scaling to 128TB without me having to think about it, easy snapshots/backups.
raybb 2 years ago

Could this also work as an alternative to Apache Solr? If so might be worth while to market it that way a bit.
I don't really know much about Solr but just started using it while helping with a project for openlibrary.org and it seems pretty alright but I'm still not totally sure I understand what makes it popular.
- rdhatt 2 years ago
  
  Solr and Elasticsearch are both Java servers built on top of the Java search library Lucene. There are plenty of articles on the internet describing how they differ. However since they share the same core, so they are very similar as well. For the context of this discussion, you can consider Solr & Elasticsearch as interchangeable - a potayto, potahto situation.

phamilton 2 years ago

With an AGPL license, does that make it unlikely to be included in hosted environments like RDS?

My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?

allan_s 2 years ago

Related question, could it be possible that at some point postgresql natively implements that algorithm ? Or as there is already an extension doing it , regardless of the licence , it is unlikely that patches in that direction will be accepted ?
- j45 2 years ago
  
  Running it for your own purposes as part of a solution that includes search should be fine under AGPL.
  If your product is elastic search built into Postgres as a repackaged and direct competitor to this search plug-in, that’s where my understanding is over the line.
  - allan_s 2 years ago
    
    yes I understand I can do that, and I also understand why the authors chose to do that, I would have done the same.
    My point of view is more from a small saas company perspective (i.e 100% pragmatic):
    1. I want as less vendor as possible, especially on something as mission critical as my database 2. I already use AWS RDS and it comes with a LOT of nice things (managed, multi-az, easy backup/restore story, etc.)
    In that situation:
    1. hosting myself is not an option because I will loose all the niceties that I will have to reimplement 2. buying from a 3rd party is not an option either because: 1. What if they go bankrupt ? 2. We are ISO 27001 and they may be not ISO 27001 themselves or forever. 3. If I choose a vendor because it's "postgres + feature A" then if there's an other vendor selling "postgres + feature B" (timescaledb etc.) what do I do ?
    That's why I was more interested in knowing if that specific could one day be implemented in postgres directly (as there's already tsvector).
    Once again I'm 100% behind them to have chosen a restrictive license if they plan on selling it, but in that case their interested and mine are not aligned, and that's fine.
    
    j45 2 years ago
    
    That's a really fair use case acknlowedging personal preference to interpret how you like it.
    I find some of the built in services on clouds are just open source libraries that are packaged up to increase tie in to that platform.
    I like cloud, but cloud agnostically, and hybrid/private clouds in the mix with that seem like a good skill to at least be able to consider thinking through.
  - philippemnoel 2 years ago
    
    ParadeDB author here -- correct! We plan to offer a hosted version soon and the idea behind picking AGPL is to be as permissive as possible so that people can use the product for free, but also protect ourselves from abuse in case a large company, say AWS, were to want to ship it in their own environment.
    In fact, we went through much questioning wondering to go with ELv2, Apache, AGPL, etc. before settling on AGPL
    
    j45 2 years ago
    
    Appreciate the response! This would be a great blog post btw.
adobrawy 2 years ago

See who made pg_bm25 - vendor of database based on PostgreSQL. Most likely they would like offer that as hosted solution itself, so they attempt avoid Elasticsearch / Terraform-like drama using AGPL license from beginning.
klysm 2 years ago

I forget, does AWS let you use custom extensions from pgrx?
- adobrawy 2 years ago
  
  No, they allow use Rust for custom functions (alternatively to PL/SQL) only.

hardwaresofton 2 years ago

pgrx is one of the greatest enabling innovations in the PG ecosystem in a long time.

Awesome to see so many high quality extensions come out of it.

https://github.com/pgcentralfoundation/pgrx

philippemnoel 2 years ago

pgrx is awesome and making pg_bm25 would've been infinitely more challenging without it. Check them out if you want to make a Postgres extension, we can't recommend them enough
zombodb 2 years ago

Thank you. I’ll pass this on to the team.

wkoszek 2 years ago

Hey guys. Congratulations - this is an exciting development. Can you show some benchmarks around showing the count of matches -- `select count() from table where text match is there`?

This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require a very exact count of matches for medical conditions that are present in 20M reports. And doing COUNT() in PostgreSQL was crazy, crazy slow. If your extension could do simple len(invertedindex[word]) that would already be a great improvement.

ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.

benpacker 2 years ago

I’m not sure if Postgres could support that type of operation directly via count() since I don’t know if the fact that no other filters are present is available to the Index Access Method API.
It might be possible to do a separate function though, like:
select pg_bm25_direct_count(‘term’)*
- dekimir 2 years ago
  
  If you do that, I can update postgres-searchbox [1] to use it for better frontend experience.
  [1] https://www.npmjs.com/package/postgres-searchbox
- wkoszek 2 years ago
  
  That would be fine--basically any way of achieving it would be fine. As of now, in PostgreSQL's FTS, I don't think there's any way to do this fast enough to give it back to the user.
retakeming 2 years ago

Thanks!
We released support for metrics aggregations a few days ago, including count: https://docs.paradedb.com/aggregations/metrics#count.
We haven't gotten around to benchmarking aggregations - that's the focus for next week and we'll publish them once they're done. I would suspect that it's a lot faster than Postgres aggregates since it leverages Tantivy Columnar.
- francoismassot 2 years ago
  
  Nice! I would be very interested by your benchmark, don't hesitate to jump in the quickwit discord server to talk about the results. https://discord.quickwit.io/

machty 2 years ago

What kind of "consistency" do bm25 indexes offer? e.g. I think ElasticSearch is eventually consistent and is constantly indexing in the background and classic Postgres GIN indexes have configuration like `gin_pending_list_limit` and `fastupdate` functionality to avoid slowdowns on insertions (and then you get slowdowns when an insert hits the threshold and triggers the catch-up indexing).

philippemnoel 2 years ago

ParadeDB and pg_bm25 offer weak consistency. pg_bm25 doesn't slow down transactions for indexing, and like ElasticSearch it becomes become eventually consistent shortly after (typically at most a few seconds, altough your mileage may vary based on the amount of data modified in the transaction(s)).

mkleczek 2 years ago

This is really exciting and I hope to try it out at my company ASAP.

iamdanieljohns 2 years ago

Seems really really cool. Is this a full DB, as in they have to take PG source, put in tantivy and their sauce, compile, and distribute? Or is this an extension? If it's the latter, what's the point of putting DB at the end of the name?

iamdanieljohns 2 years ago

Ok, all caught up now. Great work and best of luck!
When it comes to the business model: it seems an acqui-hire by Supabase/Neon/etc would be the best bet. It insures the team's focus is on the core product instead of the litany of things to figure out when creating a pg hosting service (payments, downtime, upgrades, customer support, ...) in this highly competitive and demanding market.

ckok 2 years ago

Does this also cover some kind of facetted search? (Counting the different colored and sized t-shirt) in an efficient way? As that is also a large part that elastic can do but PostgreSQL isn't very good at.

antman 2 years ago

An important step, could be a good combination with pg_vector if they are fast enough

pritambaral 2 years ago

I believe the parent project — paradedb — already does that, for their support of HNSW indexes.
- philippemnoel 2 years ago
  
  That's right, we do support pgvector (it is pre-installed on ParadeDB) and support full HNSW. In fact, we even have another extension, called pg_search, which is the combination of searching on pgvector and pg_bm25 for better results! Topic of another blog post to come sometime soon :)

tristan957 2 years ago

Interesting that you guys are the same people behind Whist. I once interviewed there at your behest, and never heard back. It seems like that venture fizzled out?

rawsh 2 years ago

Is it possible to use this for hybrid search in combination with pg_embedding? My understanding is that hybrid search currently requires syncing with Postgres

philippemnoel 2 years ago

Yes! We have another extension, pg_search, which is specifically for hybrid search using pg_bm25+pgvector. You can find it here: https://github.com/paradedb/paradedb/tree/dev/pg_search

anon373839 2 years ago

This is very exciting. BM25 in Postgres will enable really nice search experiences to be built in projects where Elasticsearch is just too much complexity.

est 2 years ago

looks like a cool project https://github.com/paradedb/paradedb

aiunboxed 2 years ago

I wonder how do legacy search players like elastic / solr compete against the new age startups combining semantic and regular search ?

binarymax 2 years ago

Lots of reasons:
1) switching search engines is hard when you’ve built your information needs around one. I’ve led lots of search engine migrations and they’re not fun. I even gave a talk on the problems companies face when doing so. https://haystackconf.com/us2020/search-migration-circus/
2) lots of the new search startups don’t offer full feature coverage. So just because a company is the new hotness it doesn’t mean it can fill the need of someone entrenched in Solr/elastic
3) why risk going to a startup when they haven’t proven they’ll be around in 3 to 5 years?
4) incumbent search engines eventually catch up at the speed of the enterprise market. Why spend a year migrating when the engine your using will implement the feature for you within that timeframe?
ntonozzi 2 years ago

By adding the features that those new age startups launch: https://www.elastic.co/guide/en/elasticsearch/reference/curr...
Building a classic text search engine is way harder than building a KNN engine, and bolting a KNN engine into a term search engine is easier than the other way around.
vb-8448 2 years ago

Reading "legacy" near "elastic" make me feel a little bit old :D :D
BTW, if you are one of the leaders of the market, you don't need to continuously improve, just wait and let your competitors do the research job and implement only when the feature is mature.
- aiunboxed 2 years ago
  
  :D :D
  Sorry my question was on the basis of the quality of the results, simply put .. how does players who have good semantic search turn out against "legacy" players who had good text search
jillesvangurp 2 years ago

They are part of the hype. Lucene has vector search capabilities. Elasticsearch and Opensearch have support for that (slightly different implementations). I assume solr has similar capabilities. The combination of traditional search and vector search makes a lot of sense from a cost control point of view. Vector search at scale is expensive. The smaller the result set, the cheaper it is to do vector search over it. So using a cheap traditional search to limit the results before you run vector search makes a lot of sense.
Also, bm25 holds up well against vector search. A well tuned model can outperform it but many off the shelf models struggle to do that. Vector search is a useful tool but so far it's not a one size fits all solution that "just works". It's something that can work really well if you know what you are doing and with a lot of tuning. With things like Elasticsearch you can try both approaches.
philippemnoel 2 years ago

pg_bm25/ParadeDB author here. What we're doing is building an opinionated alternative within PostgreSQL. If you are not using Postgres, or want your system to be separate, Elastic is still the best choice and will likely remain so.
Other people have brought up great points for why or why not to switch. Our vision for this is that ParadeDB is not merely "better" than Elastic, but rather different. Elastic will never be a PostgreSQL database, and we'll never be a NoSQL search engine. If you want one or the other, you'll pick either ParadeDB or Elastic.
kriz9 2 years ago

Who is the competition besides Algolia? Last I checked most of the competition is either very expensive or very feature limited compared to Elastic/Solr.
- pg_bot 2 years ago
  
  Meilisearch seems like it is the best open source option.
  https://www.meilisearch.com/
- aiunboxed 2 years ago
  
  I think pretty much all the companies who provide vector search are indirect competitors

canadiantim 2 years ago

ParadeDB and the work they’re doing with this extension is incredibly exciting. Love to see it.

eclectic29 2 years ago

Is BM25 still used by "modern" search engines? I wasn't aware.

mugivarra69 2 years ago

is this better than lucene

benpacker 2 years ago

The underlying engine, Tantivy, has better performance characteristics than Lucene.
You can compare Lucene to Tantivy and can compare Elasticsearch to pg_bm25 or ParadeDB
- fiedzia 2 years ago
  
  It's faster, but misses tons of features, starting with geosearch. Hopefully they will come with wider use.
  - PSeitz 2 years ago
    
    The issue for geo search is here: https://github.com/quickwit-oss/tantivy/issues/44

stopman 2 years ago

Excited to give this a try.

Settings

Pg_bm25: Elastic-Quality Full Text Search Inside Postgres

Keyboard Shortcuts