Why don't multi-column indices help queries on the second column?

8 min read Original article ↗

SQL databases support multi-column indices, where a single index can incorporate data from multiple columns. These are powerful, but the order of the columns matters, with differences observable in practice: there can be orders of magnitude between two similar & simple queries, depending on which uses a prefix of the index columns. Why does this limitation exist? Can we have a mental model for the indices that explains the behaviour?

Yes, we can: thinking about indices as a sorted table with multiple columns provides that intuition:

  • For a query involving the first columns, successive lookups on the relevant subsets of the index always see ordered data, making an efficient binary search possible,
  • For a query involving only a later column, that column’s data is arranged haphazardly in the index, meaning there’s no hope of a binary search.

It’s the sorted nature of a database’s typical B-tree index that drives this trade-off.

Context

Imagine we’re building an app for silverware enthusiasts, so we set our table like so:

1
2
3
4
5
6
7
CREATE TABLE cutlery (
  id BIGINT PRIMARY KEY,
  -- fork, spoon, knife, straw, strork, etc.
  kind TEXT,
  -- silver, red, puce, etc.
  colour TEXT
);

It turns out some people have a lot of forks and spoons, and want to find the item with the exact right shape and colour. Thus, we’ve discovered queries like this are common0 and very slow:

1
2
SELECT * FROM cutlery
WHERE kind = 'knork' AND colour = 'mahogany';

Changes

Slow query? We need an index. There are a few options worth considering:

1
2
3
4
5
6
7
8
9
10
11
-- 1. index just kind:
CREATE INDEX idx_cutlery_kind
ON cutlery (kind);

-- 2. index just colour:
CREATE INDEX idx_cutlery_colour
ON cutlery (colour);

-- 3. index both, together
CREATE INDEX idx_cutlery_kind_colour
ON cutlery (kind, colour);

Options 1 & 2 are the simplest, and doing one (or both) may give sufficient performance. But maybe not. Option 3 is a multi-column index, that combines both kind and colour into one data structure. Let’s benchmark1 the options:

Indexing Time (ms) Speedup vs ‘None’
None 420
Just kind (1) 4.78 88×
Just colour (2) 4.60 91×
Both (1 + 2) 1.28 328×
Multi-column (3) 0.009 46700×

Looks like we can get two orders of magnitude faster using one or both simple indices, and another two orders faster with the multi-column one. Optimising something to be 50000× faster is a good day’s work!

Confusion

Now, we were hyper-focused on just one particularly slow query, but most apps query their database in more than one way. Here are two very similar queries, with each dropping one of the two filtering conditions from the WHERE clause:

1
2
3
4
-- kind-only, no colour
SELECT * FROM cutlery WHERE kind = 'knork';
-- colour-only, no kind
SELECT * FROM cutlery WHERE colour = 'mahogany';

After adding the multi-column idx_cutlery_kind_colour index, the benchmarks1 prove the kind-only query gets much faster, but the colour-only is still slow2.

Query No index (ms) With index (ms) Speedup
Both columns 420 0.009 46700×
Kind-only 403 4.79 84×
Colour-only 480 492

Both of the “only” queries just drop one and keep one column compared to our original, but there’s a gulf between them.

Why does the multi-column index improve one a lot, and the other not at all? Why is kind-only two orders of magnitude faster than the colour-only with this index?

Callouts

The database docs answer the immediate question. For instance, PostgreSQL (emphasis mine):

the index is most efficient when there are constraints on the leading (leftmost) columns

And, similarly, MySQL (emphasis still mine):

If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

Let’s bring the index and the single-column queries together:

1
2
3
4
5
6
CREATE INDEX idx_cutlery_kind_colour
ON cutlery (kind, colour);
-- kind-only, no colour
SELECT * FROM cutlery WHERE kind = 'knork';
-- colour-only, no kind
SELECT * FROM cutlery WHERE colour = 'mahogany';

Comparing the queries and the index shows that, yes, indeed:

  • The now-faster kind-only query uses the first column kind of the index, and thus the index is applicable and efficient.
  • The still-slow colour-only query uses only the second column colour of the index. This is not using a leftmost prefix of the index’s columns (kind is missing), and thus the index isn’t applicable.

But, why? Why does an index like this have a constraint like that?

Concepts

The default index in most SQL databases is ordered, relying on binary search for efficiency3, getting that O(log n) tastiness.

We can visualise this in a flat ordered table, with the index columns plus a ‘locations’ column4, ordered by the index columns, using each column of the multi-column index as a tie-breaker if the earlier ones are equal. We’re working with text columns, so everything is alphabetical.

Here’s a version for 7 rows, with a few different kind and colour values:

kind colour locations
chopsticks mahogany 5
chopsticks zebra 3
knork mahogany 2, 7
knork salmon 4
tongs aqua 1
tongs periwinkle 6

Column-lookup: both

This index works well for filtering WHERE kind = 'knork' AND colour = 'mahogany', with two conceptual steps:

  1. Do a binary search on the kind column to find the two knork entries, efficiently skipping over the chopsticks and tongs. The column is ordered, so binary search can work.

    kind colour locations
    knork mahogany 2, 7
    knork salmon 4
  2. Do a second binary search on the reduced colour column to find the knork rows that are also mahogany. Now that we’ve restricted to only kind = 'knork' rows, the remaining two values are ordered too: 'mahogany' < 'salmon' (at least, they are alphabetically, I know nothing of your colour preferences).

    kind colour locations
    knork mahogany 2, 7

Thus, we’ve done two efficient binary searches to find the rows of interest, at locations 2 & 7.

Column-lookup: kind

Similarly, the index works well for filtering WHERE kind = 'knork'. The process for finding locations 2, 4, & 7 looks remarkably familiar, just one step shorter:

  1. Binary search the kind column to find these matching entries.

    kind colour locations
    knork mahogany 2, 7
    knork salmon 4

The index is applicable for searching by kind by just doing the first step of the search process, and ignoring that the colour column even exists.

Column-lookup: colour

However, it doesn’t work for WHERE colour = 'mahogany': the table is not globally ordered by colour. We can see this viscerally:

colour locations
mahogany 5
zebra 3
mahogany 2, 7
salmon 4
aqua 1
periwinkle 6

That’s the original ‘index’ table, ignoring the irrelevant (for this query) kind column. The colour column is in total disarray, with mahogany appearing both after zebra and before aqua.

Without sorted data, binary search can’t help us. Without binary search, this index can do nothing!

This flat table analogy thus gives some intuition for why our colour-only query remains slow, even with the multi-column index: there’s no sorted data for look-ups to use.

Conceit

I’ve tailored the numbers here: the benchmarks above were run with PostgreSQL (PG) 17. The slow colour-only comparison looks quite different with PG 181, but the other two barely change:

Query PG No index (ms) With index (ms) Speedup
Both 17.7 420 0.009 46700×
  18.1 399 0.008 49800×
Kind-only 17.7 403 4.79 84×
  18.1 386 3.76 103×
Colour-only 17.7 480 492
  18.1 466 15.6 30×

The last row shows our colour-only query gets faster with the multi-column index in the newer PostgreSQL.

It is still 4× slower than the superficially-similar kind-only query, and its speedup with the index is much smaller (“only” 30× instead of 103×)… but it is clear that the multi-column index is now helpful, despite our query being on the second column only.

This seemingly violates what we just learned: we can’t binary search on the colour column with our kind-then-colour index? Has the index structure changed? Can we no longer use a sorted table analogy to guide our intuition?

No, the index data structure is the same, but Postgres 18 has learned a ‘skip scan’ method for using the index. It’s a nifty optimisation that relies on the later columns not actually being random: those columns are partially sorted, with runs of sorted data. The flat table analogy is still relevant for building an intuition for this optimisation…

… But this article is already long enough, and it is easier to build a mental model starting from the stark difference visible in PostgreSQL 17, before moving on to the fancier PostgreSQL 18 behaviour.

Conclusion

Like any index in a database, a multi-column index makes queries faster. It’s an index on multiple columns, but it can even make queries only involving some of those columns faster, but with differing behaviour depending on which subset of columns are involved.

Thinking about the index as an ordered table of the indexed columns’ data gives some intuition for why a query with only the first indexed column is much faster than a query with the second (or later) index column.

This article is from a human: I used no AI to write it.