Optimize full text search with BM25

16 min read Original article ↗

Install and tune pg_textsearch (v1.1.0) for BM25 full-text search on Tiger Cloud and self-hosted deployments

pg_textsearch supports PostgreSQL 17 and 18. For self-hosted installs, confirm versions against the upstream compatibility table.

PostgreSQL full-text search at scale consistently hits a wall where performance degrades catastrophically. Tiger Data's pg_textsearch brings modern BM25-based full-text search directly into PostgreSQL, with a memtable architecture for efficient indexing and ranking. pg_textsearch integrates seamlessly with SQL and provides better search quality and performance than the PostgreSQL built-in full-text search. With Block-Max WAND optimization, pg_textsearch delivers up to 4x faster top-k queries compared to native BM25 implementations. Parallel index builds reduce indexing times by 4x or more for large tables. Advanced compression using delta encoding and bitpacking reduces index sizes by 41% while improving query performance by 10-20% for shorter queries.

BM25 scores in pg_textsearch are returned as negative values, where lower (more negative) numbers indicate better matches. pg_textsearch implements the following:

  • Corpus-aware ranking: BM25 uses inverse document frequency to weight rare terms higher
  • Term frequency saturation: prevents documents with excessive term repetition from dominating results
  • Length normalization: adjusts scores based on document length relative to corpus average
  • Relative ranking: focuses on rank order rather than absolute score values

This page shows you how to install pg_textsearch, configure BM25 indexes, and optimize your search capabilities using the following best practices:

  • Parallel indexing: enable parallel workers for faster index creation on large tables
  • Language configuration: choose appropriate text search configurations for your data language
  • Hybrid search: combine with pgvector or pgvectorscale for applications requiring both semantic and keyword search
  • Query optimization: use score thresholds to filter low-relevance results
  • Index monitoring: regularly check index usage and memory consumption

Install pg_textsearch

To install this PostgreSQL extension:

  1. Connect to your Tiger Cloud service

    In Tiger Console open an SQL editor. You can also connect to your service using psql.

  2. Enable the extension on your Tiger Cloud service

    • For new services, simply enable the extension:

      CREATE EXTENSION pg_textsearch;

    • For existing services, update your instance, then enable the extension:

      The extension may not be available until after your next scheduled maintenance window. To pick up the update immediately, manually pause and restart your service.

  3. Verify the installation

    SELECT * FROM pg_extension WHERE extname = 'pg_textsearch';

You have installed pg_textsearch on Tiger Cloud.

Self-hosted: load pg_textsearch before CREATE EXTENSION

On self-managed PostgreSQL, the extension must be preloaded so the server loads it at startup:

  1. Set shared_preload_libraries in postgresql.conf (append pg_textsearch to any existing list):

    shared_preload_libraries = 'pg_textsearch'

  2. Restart PostgreSQL.

  3. In each database where you need search, run CREATE EXTENSION pg_textsearch;.

Tiger Cloud manages preloading for you; you typically only run CREATE EXTENSION in the SQL editor or client.

Create BM25 indexes on your data

BM25 indexes provide modern relevance ranking that outperforms PostgreSQL's built-in ts_rank functions by using corpus statistics and better algorithmic design.

To create a BM25 index with pg_textsearch:

  1. Create a table with text content

    CREATE TABLE products (

    id serial PRIMARY KEY,

    name text,

    description text,

    category text,

    price numeric

    );

  2. Insert sample data

    INSERT INTO products (name, description, category, price) VALUES

    ('Mechanical Keyboard', 'Durable mechanical switches with RGB backlighting for gaming and productivity', 'Electronics', 149.99),

    ('Ergonomic Mouse', 'Wireless mouse with ergonomic design to reduce wrist strain during long work sessions', 'Electronics', 79.99),

    ('Standing Desk', 'Adjustable height desk for better posture and productivity throughout the workday', 'Furniture', 599.99);

  3. Create a BM25 index

    CREATE INDEX products_search_idx ON products

    USING bm25(description)

    WITH (text_config='english');

    Each BM25 index covers a single text expression. That expression can be a plain column, or a computed expression that concatenates columns, extracts a JSONB field, or normalizes text. See Index expressions, subsets of rows, and arrays. For optimal performance, load your data first, then create the index.

You have created a BM25 index for full-text search.

Index expressions, subsets of rows, and arrays

Since v1.1.0

A BM25 index can cover more than a plain column. You can index a computed expression, restrict the index to a subset of rows, or index a text[] array.

Index an expression

Index a computed expression to search JSONB fields, combine multiple columns, or normalize text. The expression must return text and use only IMMUTABLE functions, and your queries must repeat the same expression in ORDER BY:

-- Extract a JSONB field

CREATE INDEX events_msg_idx ON events

USING bm25 ((data->>'description'))

WITH (text_config='english');

-- Combine title and body into one searchable index

CREATE INDEX articles_idx ON articles

USING bm25 ((coalesce(title, '') || ' ' || coalesce(body, '')))

WITH (text_config='english');

-- Normalize case before indexing

CREATE INDEX docs_lower_idx ON docs

USING bm25 ((lower(content)))

WITH (text_config='simple');

Index a subset of rows

Add a WHERE clause to index only the rows you search. Partial indexes are smaller and faster when queries always target the same subset:

CREATE INDEX docs_published_idx ON docs

USING bm25 (content)

WITH (text_config='english')

WHERE status = 'published';

Partial indexes require explicit index naming with to_bm25query(). The implicit content <@> 'query' syntax skips them:

SELECT * FROM docs

WHERE status = 'published'

ORDER BY content <@> to_bm25query('search terms', 'docs_published_idx')

LIMIT 10;

You can combine expression and partial indexes:

CREATE INDEX events_error_idx ON events

USING bm25 ((data->>'message'))

WITH (text_config='english')

WHERE (data->>'severity') = 'error';

Search multilingual tables

For tables that hold documents in several languages, create one partial index per language, each with the matching text search configuration. Each index then applies language-appropriate stemming and stop words:

CREATE INDEX docs_en_idx ON docs USING bm25 (content)

WITH (text_config='english') WHERE lang = 'en';

CREATE INDEX docs_de_idx ON docs USING bm25 (content)

WITH (text_config='german') WHERE lang = 'de';

CREATE INDEX docs_fr_idx ON docs USING bm25 (content)

WITH (text_config='french') WHERE lang = 'fr';

Query with the matching predicate and index name:

SELECT * FROM docs

WHERE lang = 'en'

ORDER BY content <@> to_bm25query('databases', 'docs_en_idx')

LIMIT 10;

Index a text array

pg_textsearch indexes text[] columns by scoring the array elements as a single flattened document. This is useful when you store a document as pre-split chunks:

CREATE TABLE chunked_docs (id bigserial PRIMARY KEY, content text[]);

CREATE INDEX chunked_docs_idx ON chunked_docs

USING bm25(content)

WITH (text_config='english');

BM25 scores match what you would get by concatenating the elements into a single text value.

Enable parallel indexing for faster index creation

Since v0.5.0

pg_textsearch supports parallel index builds that can significantly reduce indexing times for large tables. PostgreSQL automatically uses parallel workers based on table size and available resources.

  1. Configure parallel workers (optional)

    PostgreSQL uses server defaults, but you can adjust settings for your workload:

    -- Set number of parallel workers (uses CPU count by default)

    SET max_parallel_maintenance_workers = 4;

    -- Set memory for index builds (must be at least 64MB for parallel builds)

    SET maintenance_work_mem = '256MB';

    The planner requires maintenance_work_mem >= 64MB to enable parallel index builds. With insufficient memory, builds fall back to serial mode silently.

  2. Create index (parallel workers used automatically for large tables)

    CREATE INDEX products_search_idx ON products

    USING bm25(description)

    WITH (text_config='english');

    When parallel build is used, you see a notice:

    NOTICE: parallel index build: launched 4 of 4 requested workers

  3. Verify parallel execution in partitioned tables

    For partitioned tables, each partition builds its index independently with parallel workers if the partition is large enough. This allows efficient indexing of very large partitioned datasets.

You have configured parallel index builds for faster indexing.

Merge BM25 segments after bulk loads

After large batch inserts or sustained incremental writes, multiple on-disk segments can accumulate. Merging them into a single segment improves query speed by reducing how many segments a query must scan. This is analogous to Lucene’s forceMerge(1).

Use bm25_force_merge() after bulk loads, not during steady high write traffic:

SELECT bm25_force_merge('products_search_idx');

Optimize search queries for performance

The <@> operator returns BM25-based scores as negative values: lower (more negative) means a better match.

Ranked search with implicit query syntax (primary pattern)

For typical top-k search, pass the query as a string literal. In ORDER BY, PostgreSQL can automatically detect the BM25 index from the indexed column:

SELECT name, description, description <@> 'ergonomic work' AS score

FROM products

ORDER BY score

LIMIT 3;

You see something like:

name | description | score

----------------------------+-----------------------------------------------------------------------------------+---------------------

Ergonomic Mouse | Wireless mouse with ergonomic design to reduce wrist strain during long work sessions | -1.8132977485656738

Mechanical Keyboard | Durable mechanical switches with RGB backlighting for gaming and productivity | 0

Standing Desk | Adjustable height desk for better posture and productivity throughout the workday | 0

The bm25query type and to_bm25query()

Use an explicit bm25query value when you need the index name in the expression (for example in WHERE, or when the planner cannot infer the index from context):

FormUse
to_bm25query('query text')Query text only; use in ORDER BY with an index scan (implicit index detection from the column).
to_bm25query('query text', 'index_name')Query text and index name; required for WHERE filters and whenever you must name the index.
'index_name:query text'::bm25queryCast form (embedded index name). In PostgreSQL 18, the single-colon form helps the planner when expressions are evaluated early.
text <@> bm25querydouble precisionBM25 score (negative is better).
bm25query = bm25queryEquality comparison.

Filter results by score threshold

For WHERE clauses, use to_bm25query() with an explicit index name:

SELECT name, description <@> to_bm25query('wireless', 'products_search_idx') AS score

FROM products

WHERE description <@> to_bm25query('wireless', 'products_search_idx') < -0.5;

You see something like:

name | score

----------------+---------------------

Ergonomic Mouse | -0.9066488742828369

Combine with standard SQL operations

SELECT category, name, description <@> to_bm25query('ergonomic', 'products_search_idx') AS score

FROM products

WHERE price < 500

AND description <@> to_bm25query('ergonomic', 'products_search_idx') < -0.5

ORDER BY score

LIMIT 5;

You see something like:

category | name | score

-------------+-----------------+---------------------

Electronics | Ergonomic Mouse | -0.9066488742828369

Pre-filtering vs post-filtering

How you combine BM25 with other predicates affects performance:

  • Pre-filtering: use another index (for example B-tree on category_id) so PostgreSQL restricts rows before BM25 scoring. Best when the filter is selective (for example matches a small fraction of rows); then ORDER BY ... LIMIT scores a smaller candidate set.
  • Post-filtering: the BM25 index scan produces top-k matches first; other WHERE conditions apply after. If the filter removes most of those rows, you can get fewer rows than LIMIT. Increase the inner LIMIT and re-limit in application code if needed.

This mirrors tradeoffs familiar from approximate vector indexes: know whether your filter is shrinking the universe of rows before or after ranking.

Verify plans with EXPLAIN (and sequential scans on small tables)

EXPLAIN SELECT * FROM products

ORDER BY description <@> 'ergonomic'

LIMIT 5;

On small tables, the planner may choose a sequential scan instead of the BM25 index. To force index use while testing, you can run:

SET enable_seqscan = off;

Even when EXPLAIN shows a sequential scan, <@> and to_bm25query() still use the BM25 index for corpus statistics (document counts, average length, and related metadata) required to score results.

You have optimized your search queries for BM25 ranking.

Combine pg_textsearch with pgvector or pgvectorscale to build powerful hybrid search systems that use both semantic vector search and keyword BM25 search.

  1. Enable the vectorscale extension on your Tiger Cloud service

    CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

  2. Create a table with both text content and vector embeddings

    CREATE TABLE articles (

    id serial PRIMARY KEY,

    title text,

    content text,

    embedding vector(3) -- Using 3 dimensions for this example; use 1536 for OpenAI ada-002

    );

  3. Insert sample data

    INSERT INTO articles (title, content, embedding) VALUES

    ('Database Query Optimization', 'Learn how to optimize database query performance using indexes and query planning', '[0.1, 0.15, 0.2]'),

    ('Performance Tuning Guide', 'A comprehensive guide to performance tuning in distributed systems and databases', '[0.12, 0.18, 0.25]'),

    ('Introduction to Indexing', 'Understanding how database indexes improve query performance and data retrieval', '[0.09, 0.14, 0.19]'),

    ('Advanced SQL Techniques', 'Master advanced SQL techniques for complex data analysis and reporting', '[0.5, 0.6, 0.7]'),

    ('Data Warehousing Basics', 'Getting started with data warehousing and analytical query processing', '[0.8, 0.9, 0.85]');

  4. Create indexes for both search types

    -- Vector index for semantic search

    CREATE INDEX articles_embedding_idx ON articles

    USING hnsw (embedding vector_cosine_ops);

    -- Keyword index for BM25 search

    CREATE INDEX articles_content_idx ON articles

    USING bm25(content)

    WITH (text_config='english');

  5. Perform hybrid search using reciprocal rank fusion

    WITH vector_search AS (

    SELECT id,

    ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector) AS rank

    FROM articles

    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector

    LIMIT 20

    ),

    keyword_search AS (

    SELECT id,

    ROW_NUMBER() OVER (ORDER BY content <@> 'query performance') AS rank

    FROM articles

    ORDER BY content <@> 'query performance'

    LIMIT 20

    )

    SELECT a.id,

    a.title,

    COALESCE(1.0 / (60 + v.rank), 0.0) + COALESCE(1.0 / (60 + k.rank), 0.0) AS combined_score

    FROM articles a

    LEFT JOIN vector_search v ON a.id = v.id

    LEFT JOIN keyword_search k ON a.id = k.id

    WHERE v.id IS NOT NULL OR k.id IS NOT NULL

    ORDER BY combined_score DESC

    LIMIT 10;

    You see something like:

    id | title | combined_score

    ----+----------------------------+--------------------

    3 | Introduction to Indexing | 0.0325224748810153

    1 | Database Query Optimization| 0.0322664584959667

    2 | Performance Tuning Guide | 0.0320020481310804

    5 | Data Warehousing Basics | 0.0310096153846154

    4 | Advanced SQL Techniques | 0.0310096153846154

  6. Adjust relative weights for different search types

    WITH vector_search AS (

    SELECT id,

    ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector) AS rank

    FROM articles

    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector

    LIMIT 20

    ),

    keyword_search AS (

    SELECT id,

    ROW_NUMBER() OVER (ORDER BY content <@> 'query performance') AS rank

    FROM articles

    ORDER BY content <@> 'query performance'

    LIMIT 20

    )

    SELECT

    a.id,

    a.title,

    0.7 * COALESCE(1.0 / (60 + v.rank), 0.0) + -- 70% weight to vectors

    0.3 * COALESCE(1.0 / (60 + k.rank), 0.0) -- 30% weight to keywords

    AS combined_score

    FROM articles a

    LEFT JOIN vector_search v ON a.id = v.id

    LEFT JOIN keyword_search k ON a.id = k.id

    WHERE v.id IS NOT NULL OR k.id IS NOT NULL

    ORDER BY combined_score DESC

    LIMIT 10;

    You see something like:

    id | title | combined_score

    ----+----------------------------+--------------------

    3 | Introduction to Indexing | 0.0163141195134849

    2 | Performance Tuning Guide | 0.0160522273425499

    1 | Database Query Optimization| 0.0160291438979964

    4 | Advanced SQL Techniques | 0.0155528846153846

    5 | Data Warehousing Basics | 0.0154567307692308

You have implemented hybrid search combining semantic and keyword search.

Configuration options

Customize pg_textsearch behavior for your specific use case and data characteristics.

  1. Configure memory and performance settings

    pg_textsearch keeps in-memory inverted indexes (memtables) in PostgreSQL dynamic shared memory. memory_limit bounds their growth: as usage approaches the cap, memtables spill to on-disk segments automatically, and if usage still exceeds the cap, inserts fail with an error rather than risking an out-of-memory kill.

    Crash recovery: the memtable is rebuilt from the heap on startup, so you do not lose indexed data if PostgreSQL crashes before a spill to disk completes.

    Two settings can be changed in any session, by any role:

    -- Default query limit when no LIMIT clause is present (default 1000)

    SET pg_textsearch.default_limit = 5000;

    -- Segment compression with delta encoding and bitpacking (enabled by default).

    -- Reduces index size by ~41% with a 10-20% query performance improvement for shorter queries.

    SET pg_textsearch.compress_segments = on;

    Since v0.4.0

    The remaining settings are server-level or superuser-only, so a plain SET in a regular session does not apply them:

    SettingDefaultNotes
    pg_textsearch.memory_limit2GBServer-level; takes effect on a configuration reload. A value of 0 disables the limit.
    pg_textsearch.bulk_load_threshold100000Superuser. Terms per transaction before the memtable spills to disk.
    pg_textsearch.segments_per_level8Superuser. Segments per level before automatic compaction (range 2–64).
    pg_textsearch.log_bmw_statsoffSuperuser. Logs Block-Max WAND block-skip statistics.
    pg_textsearch.log_scoresoffSuperuser. Logs BM25 scores during index scans.
    • On Tiger Cloud, change these from the Tiger Console: open your service, then Operations > Database parameters. See Configure database parameters. The tsdbadmin role cannot apply them with SET or ALTER SYSTEM.
    • On self-hosted PostgreSQL, set memory_limit in postgresql.conf, or run ALTER SYSTEM SET pg_textsearch.memory_limit = '2GB'; followed by SELECT pg_reload_conf();. As a superuser, you can change the superuser-only settings with SET in your session.

    memory_limit is applied as three tiers derived from the single value: a per-index soft limit (memory_limit / 8) spills that index's memtable to disk, a global soft limit (memory_limit / 2) evicts the largest memtable across all indexes, and the hard limit (memory_limit) refuses further cache growth so an insert fails rather than risking the out-of-memory killer. You only set memory_limit; the ratios are derived automatically.

    Since v1.1.0

  2. Configure language-specific text processing

    You can create multiple BM25 indexes on the same column with different language configurations:

    -- Create an additional index with simple tokenization (no stemming)

    CREATE INDEX products_simple_idx ON products

    USING bm25(description)

    WITH (text_config='simple');

    -- Example: French language configuration for a French products table

    -- CREATE INDEX products_fr_idx ON products_fr

    -- USING bm25(description)

    -- WITH (text_config='french');

  3. Tune BM25 parameters

    -- Adjust term frequency saturation (k1) and length normalization (b)

    CREATE INDEX products_custom_idx ON products

    USING bm25(description)

    WITH (text_config='english', k1=1.5, b=0.8);

  4. Monitor index usage and memory consumption

    • Check index usage statistics for every BM25 index

      SELECT s.schemaname, s.relname, s.indexrelname, s.idx_scan, s.idx_tup_read

      FROM pg_stat_user_indexes s

      JOIN pg_class c ON c.oid = s.indexrelid

      JOIN pg_am a ON a.oid = c.relam

      WHERE a.amname = 'bm25';

    • Check shared memory used by memtables across all indexes

      SELECT * FROM bm25_memory_usage();

    • View index summary with corpus statistics and memory usage (superuser)

      SELECT bm25_summarize_index('products_search_idx');

    • View detailed index structure (output is truncated for display) (superuser)

      SELECT bm25_dump_index('products_search_idx');

    bm25_summarize_index and bm25_dump_index require superuser. On Tiger Cloud the tsdbadmin role is not a superuser, so these two functions are unavailable there; use bm25_memory_usage() and pg_stat_user_indexes to monitor instead. The two-argument form bm25_dump_index('index_name', '/path/to/file.txt') exists only in debug builds compiled with -DDEBUG_DUMP_INDEX. Standard packages expose the single-argument form only.

    • Force memtable spill to disk (useful for testing or memory management)

      SELECT bm25_spill_index('products_search_idx');

You have configured pg_textsearch for optimal performance. For production applications, consider implementing result caching and pagination to improve user experience with large result sets.

Current limitations

  • No phrase search: you cannot search for exact multi-word phrases (no positional index; emulate with BM25 plus a post-filter if needed).
  • No compressed data support: pg_textsearch does not work with compressed chunk or table data (for example Hypercore-compressed data where the extension applies).
  • One expression per index: each BM25 index covers a single text expression. To search several columns at once, index a computed expression that concatenates them (see Index expressions, subsets of rows, and arrays).
  • No built-in faceted search: there is no dedicated faceting operator; use ordinary PostgreSQL patterns (WHERE, GROUP BY, counts) alongside BM25 filters.
  • Insert and update throughput: the memtable design supports incremental writes, but sustained write-heavy workloads are still a focus of ongoing performance work. Prefer bulk load, then index (or bm25_force_merge after large batches) when possible.
  • No background compaction: segment compaction runs synchronously during memtable spill; write-heavy workloads may see latency spikes during spills.
  • Partitioned table statistics: BM25 statistics are per partition. Scores are accurate within a partition but are not directly comparable across partitions when a query spans many partitions (different IDF and length stats). Prefer queries that target a single partition when rank comparability matters.
  • Word length limit: tokenization inherits PostgreSQL’s tsvector 2047 character word limit; very long tokens (for example base64 blobs) may be ignored with an INFO message.
  • PL/pgSQL and stored procedures: the implicit form text <@> 'query' relies on planner hooks that do not run inside PL/pgSQL DO blocks, functions, or procedures. Use to_bm25query('query', 'index_name') explicitly in those contexts.

For more detail and workarounds, see the limitations section in the upstream README.