Improving Postgres text search speed

charityapi.org

131 points by kuzee 3 years ago · 58 comments

zasdffaa 3 years ago

> A key gotcha that tripped us up: when querying with a list of primary keys, Postgres will not return the records in the same order as the IDs provided in a “where … in” clause.

That's frightening they don't know that. So, burn this into your minds: no ordering is guaranteed in SQL unless an ORDER BY is applied to the outermost query (and any ordering in a subquery is lost). Even if it seems to work today, it will eventually fail. No guarantees on order unless... scratch that onto your eyeballs so you never forget it.
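
As an aside, if you do need rows back in the order of the IDs you passed in, one option is to join the ID list in with its position and sort on that. A sketch, with made-up table and column names:

   SELECT t.*
   FROM things t
   JOIN unnest(ARRAY[42, 7, 19]) WITH ORDINALITY AS ids(id, ord)
     ON t.id = ids.id
   ORDER BY ids.ord;  -- rows come back in the order 42, 7, 19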

Also, will people please stop posting rainbow-on-black screenshots, especially with the screenshotted text size smaller than the main text.

  • silvestrov 3 years ago

    Sometimes I wish for a 'debug' version of PG which adds an "order by random()" to all queries (or appends a "random()" clause to an existing "order by" clause).

    Or even better: have it as a connection/session parameter.
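
    Nothing like that exists as a built-in setting as far as I know, but you can fake it in a test helper by wrapping the query under test. A sketch, names invented:

       SELECT *
       FROM (SELECT id, name FROM things WHERE active) t  -- the query under test
       ORDER BY random();  -- shuffle to expose any missing ORDER BY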

    • zasdffaa 3 years ago

      > Sometimes I wish for a 'debug' version of PG which adds an "order by random()" to all queries

      interesting! I like it.

      > or appends a "random()" clause to an existing "order by" clause

      That one I don't understand. If the Order By is there, why randomise it? I guess

         order by p, q, r, random()
      
      then it sort of allows you to find out if it's been ordered sufficiently for your needs - is that it?

      • rubyist5eva 3 years ago

        Sufficient, yes, because if, for example, you're simply sorting by a value that's shared/duplicated, like enum values, the ordering within the sorted groups is basically up to the query planner and can change (one query may use an index while another does a sequential scan, for example). The most common case where I've seen this matter is offset-based pagination, where the query plan actually changes depending on which page you're on. You'll either miss records or get a few rows back from the previous page because of the different ordering chosen by the planner/optimizer.
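
        The usual fix is to make the sort deterministic by appending a unique column as a tiebreaker. A sketch, with invented names:

           SELECT *
           FROM events
           ORDER BY status, created_at, id  -- id breaks ties deterministically
           OFFSET 40 LIMIT 20;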

        • zasdffaa 3 years ago

          That pagination issue isn't one I realised. That's very valuable, thanks.

  • jonatron 3 years ago

    I know this now, but back when I didn't, Postgres returned the same order thousands of times and then changed it for no apparent reason, which made fixing flaky tests very tricky.

  • irjustin 3 years ago

    > That's frightening they don't know that

    It's a common problem when moving from MySQL. We personally ran into it shortly after switching to Postgres, but it seems like something they should've discovered a lot earlier, for sure.

  • kuzeeOP 3 years ago

    We shared the mistake so others don't accidentally make the same assumption; I'm sure a few people either learned it for the first time or appreciated the reminder. Rest easy this never made it to prod.

    • zasdffaa 3 years ago

      Thanks. I have a lot of time for those who are willing to admit their mistakes.

  • icedchai 3 years ago

    This does seem fundamental.

jrochkind1 3 years ago

How can querying for PKs and then doing a second query on the FK be faster than a semantically equivalent join? Postgres optimizer gone terribly wrong or something? Or am I misunderstanding what's going on?

It would never have occurred to me to even try this; I would have assumed Postgres would do as well as could be done at what looks like a very standard join. So I guess that's a lesson not to make assumptions?

I'm not sure. Without even a theory of why pg would behave this way, I'd be worried that as the data changes it might regress again. Or if the phase of the moon changes or something -- since it seems like magic!

  • Olreich 3 years ago

    It’s almost never faster to do two queries. But DBs are complicated query engines with sophisticated compilers, so you sometimes have to go in and convince the compiler to optimize the way you want, which is what makes the query fast. EXPLAIN ANALYZE is your best friend for understanding what the compiler thinks is best. Then you can adjust the query to convince it that it should do something faster.

    In this specific case, I’d bet that having the FROM and JOIN tables reversed would be enough to get even better performance than 2 queries: `SELECT * FROM os JOIN o …`.
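
    For reference, getting the plan is just a prefix on the query. A sketch against invented table/column names, since the article doesn't show its generated SQL:

       EXPLAIN ANALYZE
       SELECT o.*
       FROM organization_searches os
       JOIN organizations o ON o.id = os.organization_id
       WHERE os.tsv @@ websearch_to_tsquery('english', 'food bank');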

  • zasdffaa 3 years ago

    It's a failure of the optimiser. As a DB guy (MSSQL, not Postgres), I can say there's always a risk of this; you always have to be aware of it, and shouldn't assume. MSSQL's optimiser is pretty good, but I've also seen slowdowns on plain queries after adding an index. Optimisers are complex, and individual optimisation rules interact in odd and sometimes counterproductive ways. Also, if your stats are not up to date, trouble is guaranteed.

  • bastawhiz 3 years ago

    You're right. I had the same thought.

    Frankly, it's such a simple case that I'm hesitant to pin the problem on Postgres. I'm now inclined to believe that the ORM/query builder they're using (which I'm not familiar with) is generating a query that does something unusual. I'm also having a hard time understanding the code, so it may just be a bad query in the first place. If the generated SQL was shown, it would be much easier to look into.

    Generally speaking, if you're just joining some data onto the results of an expensive operation, it's exactly the same as doing a second query, just without the round trip and overhead.

    The lesson here is to always look at EXPLAIN ANALYZE any time you're doing Postgres optimization work. There's a wealth of useful information that will often point you to your mistake. Databases are quite often extremely good at doing common-sense operations well.

  • kuzeeOP 3 years ago

    We secretly agree and aren't sure why it's faster to do two queries, but it measurably is. We're going to try some of the suggestions littered in this conversation and will report back, this time with some EXPLAIN output. We appreciate the suggestions and theories.

  • Groxx 3 years ago

    While I generally wonder the same thing... I've seen basically every database follow this pattern. There are fairly frequent scenarios where doing a couple of separate indexed queries performs better than an equivalent join, particularly once queries get kinda large.

    I can certainly see it when running into memory limits per table, because doing the full join might require more memory at once, but it frequently happens far below where I'd expect that to occur (in the low thousands of rows). Dunno why. Maybe it uses caches more efficiently? Many simple operations are often more cache-able than a few complex ones.

    • masklinn 3 years ago

      Maybe disk spill, because some joins require a superlinear amount of space and the intermediate result doesn’t fit in memory buffers?

  • rubyist5eva 3 years ago

    I’ve seen it happen when people forget to index the foreign-key columns. It may also be faster on one-to-many joins if you’d otherwise need DISTINCT to remove duplicate rows (instead of joining on a specific subquery that eliminates the duplication implicitly).
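
    Worth remembering that Postgres indexes the referenced primary key for you, but never the referencing column; that one you have to create yourself. A sketch, with invented names:

       -- without this, joining orders to customers can force a seq scan
       CREATE INDEX idx_orders_customer_id ON orders (customer_id);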

kuzeeOP 3 years ago

I wrote up my recent experience optimizing Postgres text search on a database with a few million records, without relying on a new service like Elasticsearch.

  • canadiantim 3 years ago

    Awesome, thanks for the write-up. Definitely cool to see that breaking the query up into 2 queries had such a marked improvement! I know I can often get too obsessed with making just a single query. Cheers.

norwalkbear 3 years ago

Has anyone overcome the 16383 position limit of tsvector?

That, and the automatic stemming and lemmatizing of search words even in phrase searches, makes Postgres awful for any software where accurate search is critical.
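
To illustrate the phrase-search point: stemming applies inside phrase queries too, so this self-contained snippet returns true even though neither word matches exactly:

   SELECT to_tsvector('english', 'running faster')
          @@ phraseto_tsquery('english', 'run fast');  -- true: both stem to run <-> fast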

  • hombre_fatal 3 years ago

    I remember when I ran into that first issue on my forum, I felt like I was the first person to ever use Postgres full-text search. (8?) years ago, googling the exact error brought up nothing except some dev email/listserv chatter. Nobody else was indexing longer text documents? Wasn't very encouraging.

    Oh well. The vast majority of forum posts don't hit that limit so I just excluded the exceptionally long posts from the index. I never revisited it again.

  • jmull 3 years ago

    I don’t know about the positional limits, but the built-in stemming and other transformations are a default, which you can change. I think this is the relevant mechanism: https://www.postgresql.org/docs/current/textsearch-dictionar...
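
    For instance, here's a sketch of a configuration that skips stemming by mapping word tokens to the simple dictionary ("exactish" is an invented name):

       CREATE TEXT SEARCH CONFIGURATION exactish (COPY = english);
       ALTER TEXT SEARCH CONFIGURATION exactish
         ALTER MAPPING FOR asciiword, word WITH simple;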

  • iav 3 years ago

    I used a trigger function to detect long text and trim the source before it got indexed. That meant that any text over 500kb just got dropped from the index. I also used one index per long text field rather than combining with other fields.
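
    Roughly along these lines, as a sketch; the names and the 500kb threshold are illustrative:

       CREATE FUNCTION docs_tsv_update() RETURNS trigger AS $$
       BEGIN
         IF length(NEW.body) > 500000 THEN
           NEW.body_tsv := NULL;  -- drop very long docs from the index
         ELSE
           NEW.body_tsv := to_tsvector('english', NEW.body);
         END IF;
         RETURN NEW;
       END $$ LANGUAGE plpgsql;

       CREATE TRIGGER docs_tsv_trg BEFORE INSERT OR UPDATE ON docs
         FOR EACH ROW EXECUTE FUNCTION docs_tsv_update();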

  • manigandham 3 years ago

    The lack of BM25 relevance scoring is the bigger problem. Postgres FTS is fine as a better "LIKE" filter with very little overhead, but it's a poor choice for serious search applications or scale.
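
    The closest built-in is ts_rank/ts_rank_cd, which are frequency/proximity based rather than BM25. A sketch, with invented names:

       SELECT id, ts_rank(tsv, q) AS rank
       FROM docs, websearch_to_tsquery('english', 'food bank') q
       WHERE tsv @@ q
       ORDER BY rank DESC
       LIMIT 20;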

eric4smith 3 years ago

PostgreSQL text search is awesome - for English and other Latin-script languages.

But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.

PostgreSQL is weird about joins. Joining on certain columns can be super fast while others are dog slow, and this can flip depending on the size of the table and the index cardinality.

That’s why, on databases that grow quickly, it’s important to check the performance of even simple queries, as those can balloon in execution time as the profile of the data changes.

  • asah 3 years ago

    • eric4smith 3 years ago

      Yeah we use it.

      It’s however a database within a database. PostgreSQL does not natively text search on these kinds of languages.

  • jxf 3 years ago

    > But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.

    I understand that there are likely to be many significant differences in graphemes in these languages from Roman text but I'm not familiar enough with any Asian language to construct an example. Can you give an illustrative example that explains why the search doesn't work as well (or at all) in those cases?

    • devonkim 3 years ago

      I'm not intimately familiar with how text search indexing works, but in most Romanized / Latin-script text, determiners, articles, etc. are space-separated from the nouns; without that separation, queries get confusing and stateful, because you have to split within character sequences. This isn't the same thing as finding the roots of words / stemming for fuzzy-search purposes, either. "짬뽕이 맛있습니다" has a plain noun 짬뽕 with case marking via -이, and the ending copula is parseable as a run-on phrase; but Finnish has case marking without space separation too and doesn't seem to be cited as a parse/representation problem, last I saw. In English it's "the 짬뽕 is delicious", where the noun is obvious, and if you split by spaces you can quickly throw away "the" and "is", while in the Korean it's not clear until you check for the case marker and prior glyphs for a parse. Where I think there can also be issues is in Unicode glyph representations, where multiple code sequences can wind up as the same symbol.

    • wahnfrieden 3 years ago

      no spaces in japanese, it's tricky to determine word boundaries and conjugations and multiple "spellings" of the same word. the libs that tokenize and de-inflect languages are usually highly specialized technologies for east asian languages (maybe others but i'm particularly familiar with japanese, korean, chinese)

    • eric4smith 3 years ago

      No spaces in Thai also. Difficult to discern word boundaries.

  • SnowHill9902 3 years ago

    Can you give more information about your JOINs performance? It’s mostly dependent on the presence and optimization of INDEXes and cardinality as you mention. JOIN ON indexed integers is usually fastest.

    • eric4smith 3 years ago

      It’s not such an easy thing to describe.

      Yes we join on indexed integers too.

      The thing is, we believe that as the amount of data changes, the planner will give different instructions, even after lots of analyzing and vacuuming.

      It’s happened so many times, and so randomly, that as we grow we really have to re-examine joins that were very fast 6 months ago and are now really slow.

rdoherty 3 years ago

I'd love to see some details on the why, using EXPLAIN ANALYZE on each query, plus the schema. It seems like the changes were done with a hunch as to why they were slow?

  • hinkley 3 years ago

    > It seems like the changes were done with a hunch as to why they were slow?

    You say that like it’s a bad thing. If the hunch works, does that cheapen the effect? The why can come after. It often does.

    Mastery isn’t pondering things faster, it’s using system 1 thinking for most of the process with some system 2 sprinkled on top. Or to use different terminology, intuition with a bit of high level executive function guiding things.

    Or a lot of hunches and a little thinking.

    • dealforager 3 years ago

      If you're discussing the performance of SQL queries, showing the output of EXPLAIN ANALYZE is the bare minimum. There are too many variables that can affect performance, and if you can't see what's happening under the hood, the discussion isn't very useful.

      • jerryjerryjerry 3 years ago

        Same here; I'd also like to see what EXPLAIN ANALYZE shows about the plan and execution details. Some notes on the system setup would help too, e.g., memory size, and checking whether spill kicks in.

    • nl 3 years ago

      It's a good point though. EXPLAIN should point out exactly what was going wrong in their first attempt (where it was slower).

      Postgres should be able to do that join in a single query faster than two queries with the latency in between.

      So why isn't it?

      • tpetry 3 years ago

        The point is the query plan will most probably look totally different. A fulltext search or GIS search is calculated differently than a btree lookup, and PostgreSQL chooses a completely different join strategy. This can also happen when combining such a condition with another btree index lookup. It doesn't have to be a fault of PostgreSQL; sometimes it's not even possible to make it faster. E.g. all fulltext search results need to be calculated and then joined.

        • nl 3 years ago

          > The point is the query plan will most probably look totally different.

          Different to what?

          > PostgreSQL chooses a completely different join strategy. This can also happen when combining such a condition with another btree index lookup. It doesn't have to be a fault of PostgreSQL

          Yes, agreed, all of that is possible. EXPLAIN will tell us!

      • hinkley 3 years ago

        EXPLAIN shows gross problems. There are plenty of things well under an order of magnitude that won’t necessarily show up. It’s a very coarse grained yardstick. If you think it’s the only yardstick that matters, then you and I probably disagree about a great number of other things too.

        • nl 3 years ago

          EXPLAIN shows you how the query is being executed.

          It won't always give you a recipe for a solution but it's always useful for an experienced Postgres developer to look at to understand why it isn't performing like you expect.

          I still don't understand why the two queries run faster than a single query. There might be good reasons for that, and it might not be fixable, but at the same time there might be ways to fix it. EXPLAIN gives you clues.

    • stickfigure 3 years ago

      Sure, you could spend all day guessing what will improve the query, or you can just add two words to your query and Postgres will tell you. EXPLAIN ANALYZE is the bare minimum competent effort required to tune queries.

      I use Postgres FTS heavily on large datasets and have done a lot of performance tuning, so I was excited by the title. Unfortunately, the article failed to deliver any useful information.

      • kuzeeOP 3 years ago

        I'll gladly read anything you've written on the topic, sounds like you're pretty knowledgeable.

vosper 3 years ago

PGSync might be useful for those who don't mind also running Elasticsearch

https://github.com/toluaina/pgsync

maverwa 3 years ago

I remember achieving quite good results when implementing a simple „fuzzyesque“ search for a B2B e-commerce system a few years back, but what hit us hard was the German dictionary used for stemming in the tsvector index. The original one that ships with Postgres did not handle German compound words correctly, and the „fix“ was to import a 3rd-party one. I learned the hard way that this comes at the cost of Postgres loading the dictionary every time a new connection uses the index. Running a simple, small PHP app that could not reuse/pool connections, every search query came in on a clean connection and paid a >1s fee for loading the dict.

iirc pgbouncer saved the day there.

Otherwise it worked fine.

anshul 3 years ago

You might want to test using the first query as a sub-query or CTE in the second one. That would likely give you the same or better perf: it would avoid the join and save a round trip.
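
Something along these lines, as a sketch; table/column names invented, since the article doesn't show its schema:

   SELECT *
   FROM organizations
   WHERE id IN (
     SELECT id
     FROM organizations
     WHERE tsv @@ websearch_to_tsquery('english', 'food bank')
     LIMIT 50
   );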

  • hobs 3 years ago

    If you don't need the output of the first query, you'll almost always get the best performance in SQL using EXISTS syntax, e.g.

    select * from query1 as q where exists ( select * from query2 as q2 where q.col = q2.col )

    • nfcampos 3 years ago

      Yes, although in this specific case you’d lose the order of the results with EXISTS.
