Index Merges vs. Composite Indexes in Postgres and MySQL

sirupsen.com

175 points by Sirupsen 3 years ago · 41 comments

magicalhippo 3 years ago

A composite index can also be used for a partial index scan[1], so if you're frequently looking for just int100 as well as the int100,int1000 combination (as per the article's example), then a composite index int100,int1000 can be used for queries just filtering on int100.

The order of the columns in the composite index might matter then. We got some nice savings by reordering columns in indexes and changing the join order in queries (our DB wasn't that smart), or by adding "useless" filters to the where clause, allowing us to consolidate multiple indexes into one composite one (see the sketch after the footnotes).

[1]: might be using the wrong terminology here; I'm not talking about a partial index[2], but about using only the first N dimensions of an M-dimensional index.

[2]: https://www.postgresql.org/docs/current/indexes-partial.html
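
A minimal sketch of the consolidation with the article's schema (assuming its test_table / int100 / int1000 names):

    create index test_table_int100_int1000 on test_table (int100, int1000);

    -- served by the (int100) prefix of the composite B-tree
    select count(*) from test_table where int100 = 1;

    -- served by the full composite key
    select count(*) from test_table where int100 = 1 and int1000 = 1;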

  • winrid 3 years ago

    There are some DBs where the order of fields doesn't matter in the compound index if you're just searching by one field (though of course it's less efficient). I think Oracle supports this.

    • dspillett 3 years ago

      All can search an index by just one field if it is the first field covered by the index; few can use an index if your predicate concerns the second field and not the first.

      IIRC Oracle does support this and calls it a skip-scan. If the first column in the index is not very selective this can be very efficient, though if the first column is highly selective then a skip-scan will be very inefficient and the DB might as well scan the whole index. A good choice of index usually has a fairly selective column first, and unless storage space is very cramped you are likely better off just defining an extra index without the column you want to skip. This means a skip-scan isn't helpful as often as one might think, which is one of the reasons DB engine maintainers give for not spending time implementing and maintaining the feature.

      There are some use cases where having the option to skip-scan is definitely a significant bonus, but they are rare enough that I understand why most engine maintainers have not implemented them.

      • winrid 3 years ago

        Yes, I'm talking about skip-scan. I know pretty much all DBs can use compound index predicates for querying :)

        I can think of several cases where it would have been nice to have, sometimes "good enough" for a particular read pattern is better than additional write overhead etc.

        • dspillett 3 years ago

          One argument against implementing nice-to-have features in a SQL or similar engine is that they add complexity to the query planner: it has to be made bright enough to know when to use the feature without using it when it is significantly detrimental, without ballooning the time needed to make a good-enough¹ choice, and without making a less good choice often enough that the overall efficiency of the planner's decisions is lowered.

          Of course an obvious counter is to implement it as an optional feature, only used when explicitly requested by a query hint, with the counter-counter being "is it worth the implementation effort given how rarely it will get used in that case?"

          Oracle's developers were swayed by the benefits; other database developers have not been.

          There are ways to "emulate" the behaviour using CTEs to gain the expected benefit in at least some cases, but I've not experimented with this myself, so I can't speak for how well it works in practice, or whether the extra legwork makes queries too convoluted to be easily maintained (or understood at all by future maintainers).
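
          Something like this recursive-CTE "loose index scan" pattern is the usual sketch (assuming a hypothetical table t(a, b) with a composite index on (a, b) and an equality filter on b; untested, per the above):

              with recursive seen as (
                  select min(a) as a from t
                  union all
                  select (select min(t.a) from t where t.a > seen.a)
                  from seen
                  where seen.a is not null
              )
              -- each recursion step is one index seek to the next distinct
              -- value of a; the join then probes the full (a, b) key
              select t.*
              from seen
              join t on t.a = seen.a
              where t.b = 42;  -- placeholder constant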

          ----

          ¹ noting that query planners do not strive to produce the best query plan (finding the optimum is computationally intractable in the general case); instead they try to make a good-enough choice in a reasonable amount of time

          • winrid 3 years ago

            Yep :) although it's much more expensive to get a query plan wrong... and increasing plan time, even just from 1ms to 2ms, will also significantly hurt throughput!

            I maintain a couple of fairly large DB deployments totalling around 100 machines, and it's always fun to see how query planner jitter affects things at a larger scale.

    • thom 3 years ago

      Postgres is capable of doing this with some index types, but generally more slowly than a B-tree index being asked about its leftmost columns:

      https://www.postgresql.org/docs/current/indexes-multicolumn....

      Also worth noting that for very complex workloads that need to support arbitrary subsets of equality matches over columns, a Bloom filter might work best:

      https://www.postgresql.org/docs/current/bloom.html
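
      For illustration, a minimal bloom-index sketch (hypothetical table t(a, b, c, d); note the stock bloom extension only ships operator classes for int4 and text):

          create extension if not exists bloom;

          -- equality predicates on any subset of the indexed columns can
          -- use this index, at the cost of rechecking false positives
          create index t_bloom on t using bloom (a, b, c, d);

          select count(*) from t where b = 1 and d = 3;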

    • mattashii 3 years ago

      This depends on the index type, but PostgreSQL also supports queries based on arbitrary attributes of multi-attribute indexes: both GIN and BRIN can be used for queries on any of the indexed expressions.

      Maybe eventually PG will also get index skip scans for btrees, which could allow arbitrary-column searches in the btree; but we're not there yet by a large margin.
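
      A minimal sketch of the BRIN case, reusing the article's table name (how effective it is depends on how well values correlate with physical row order, which random data defeats):

          -- unlike a multi-column B-tree, this index is usable for a
          -- predicate on either column alone
          create index test_table_brin on test_table using brin (int1000, int100);

          select count(*) from test_table where int100 = 1;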

    • SirupsenOP 3 years ago
      • winrid 3 years ago

        Nice, I did not know MySQL supported this. Someday I'll work with MySQL again, good to know.

    • magicalhippo 3 years ago

      Good point, I reworded it to be more general. Can't recall seeing our DB being that clever.

    • chrsig 3 years ago

      > (but of course is less efficient)

      This means the order does matter

      • winrid 3 years ago

        Yes...?

        Sorry my comment is not semantically accurate enough for you.

arynda 3 years ago

Comparison on ClickHouse: this also runs in about 30-40 ms, even though no indexing is being used and it is a full-table scan.

    create table if not exists test_table
    (
        id UInt64,
        text1 String,
        text2 String,
        int1000 UInt64,
        int100 UInt64,
        int10 UInt64,
        int10_2 UInt64
    )
    engine = MergeTree()
    order by (id)
    ;
    
    insert into test_table
    with
      repeat('b', 1024) as one_kib,
      repeat('b', 255) as bytes_255
    
      select
        number as id,
        one_kib,
        bytes_255,
        rand() % 1000 as int1000,
        rand() % 100 as int100,
        rand() % 10 as int10,
        rand() % 10 as int10_2
      from numbers(10e6)
    ;
  
  
  > select count(*) from test_table where int1000 = 1 and int100 = 1;
  
  ┌─count()─┐
  │    9949 │
  └─────────┘
  
  1 row in set. Elapsed: 0.034 sec. Processed 10.00 million rows, 160.00 MB (290.93 million rows/s., 4.65 GB/s.)

The same table with 1B rows instead runs in ~1800 ms:

  > select count(*) from test_table where int1000 = 1 and int100 = 1;

  ┌─count()─┐
  │  999831 │
  └─────────┘

  1 row in set. Elapsed: 1.804 sec. Processed 1.00 billion rows, 16.00 GB (554.24 million rows/s., 8.87 GB/s.)

[1] Converted the table create and insert logic from here: https://github.com/sirupsen/napkin-math/blob/master/newslett...
  • hodgesrm 3 years ago

    > however there's no indexing being used and this is a full-table scan.

    That first statement about "no indexing being used" is not quite correct if the query is run exactly as you show in your nice example.

    ClickHouse performs what is known as PREWHERE processing, which will effectively use the int1000 and int100 columns as indexes. It scans those columns and knocks out any blocks (technically granules, containing by default 8192 rows) that do not contain values matching the filter conditions. It then performs a scan on the remaining blocks to get the actual counts.

    PREWHERE is effective because columns are compressed and scans are fast. If there's any pattern to the filter columns (for example monotonically increasing counters), or their values have high cardinality, PREWHERE processing will remove a large number of blocks. This will make the rest of the scan far faster.

    In your dataset it may not be especially efficient because you use random values, which don't necessarily compress well, and the values will appear in many blocks. It works much better in real datasets where data are more correlated.

    EDIT: PREWHERE is much faster in cases where you are doing more complex aggregation on many columns. Counts of course don't need to scan any extra values so it's not helpful in this case.
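
    For what it's worth, the predicate can also be written into the PREWHERE clause by hand (a sketch against the table above; ClickHouse normally makes this move on its own):

        -- the filter columns are scanned first; the wide text1 column is
        -- only decompressed for granules that contain matching rows
        select sum(length(text1))
        from test_table
        prewhere int1000 = 1 and int100 = 1;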

    p.s. Scans are ridiculously fast.

    • paulmd 3 years ago

      > p.s. Scans are ridiculously fast.

      this is really the lesson of SOLR. full-scan all the things, aggregate as you go, broadcast disk IO to multiple listeners.

      why do a bunch of 4K random IO when you could full-scan at bus speed? yeah you can make the 4K random IO super fast but that's not where hardware is going, and it's also scalable/clusterable/shardable where RDBMS caps out at one machine and clustering is kinda ugly.

      • yxhuvud 3 years ago

        Huh? That is exactly where hardware is going. What has been missing is the parts in between, the ability to emit enough random IO in parallel to saturate the interfaces.

        • paulmd 3 years ago

          "huh? high clockrates is exactly where hardware is going, we just haven't figured out how to get the silicon to work"

          no, that's the opposite of where hardware is going. when was the last time flash latency or DRAM latency significantly improved? literally 15 years ago. Optane is the only improvement that has been made on that front and optane is effectively dead at this point. so actually latency and random IO are going backwards right now.

          hardware is going in the direction of sustained block transfers - that is where NVMe is still improving today.

          issuing random requests still basically sucks just as much as it did 15 years ago, and doing a lot of them in parallel is just a shitty bandaid patch on the problem. some workloads are irreducibly single-threaded; you simply can't proceed on the logic until you know the last bit of data. being able to do lots of those in parallel is nice, but it's the consolation prize for latency no longer scaling.

          so, stop doing random IO and just stream your working set in large blocks and have listeners pick off their bit (retrieve individual records, or perform their aggregations) as it streams by. Effectively, do your disk IO in big sustained transfers and then do the random IO in memory.

          Not that memory has improved over the last 15 years either but it's better than doing it on disk.

          SOLR is behind most of the big webscale search and commerce systems nowadays. Nobody is doing Amazon on RDBMS-style random IO systems, not even on database-style document systems like mongodb (which is really more pseudo-RDBMS than a true document search system).

          Or on the flipside: if you want to do random IO on your SSDs, do random IO on your SSDs directly: forget the whole filesystem layer, and use a key-value SSD (which does exist). That's what your database provides right now, after all. But RDBMS (or, again, quasi-RDBMS random-IO document stores like Mongo) doesn't play to the strength of SSDs anymore, that's not where they're getting better, and if you want to treat an SSD as a block store then you might as well stream big blocks and not do random IO at the device layer.

          https://www.mydistributed.systems/2020/07/towards-building-h...

    • arynda 3 years ago

      > ClickHouse performs what is known as PREWHERE processing

      > p.s. Scans are ridiculously fast.

      Good point, I should have mentioned this was basically a worst-case scenario for ClickHouse: the data layout is totally random (the same approach as OP used in their benchmark), so it can't utilize any granule pruning, sorting, or skip indexes, yet it still achieves such remarkable speeds.

      • hodgesrm 3 years ago

        What's cool is that even in this case ClickHouse is still stupid fast compared to most other databases. ;)

    • stingraycharles 3 years ago

      Out of curiosity:

      > It scans those columns and knocks out any blocks (technically granules, containing by default 8192 rows) that do not contain values matching the filter conditions

      How is that not just a sequential scan? Of course it pre-emptively filters away entire blocks that do not contain the data, but indexes typically work differently: they're calculated at write time, so that they can be queried really fast.

      Is there a detail missing here, e.g. bloom filters being used, or something else that makes it different from a regular sequential scan?

  • Nican 3 years ago

    I know nothing about ClickHouse, but how many cores are being used for these queries? And what is the core processing frequency?

    I would think ClickHouse is tuned for analytics workloads, so it will throw plenty of cores at the problem and not care much about the overhead. Meanwhile, I believe PostgreSQL is more tuned to transactional workloads, where it will not pay the query-parallelism overhead but instead optimizes for many concurrent queries.

  • SirupsenOP 3 years ago

    Are you aware of a good write-up on how ClickHouse and other columnar databases do the intersection?

pdhborges 3 years ago

  For these 64-bit index entries we'd expect to have to scan roughly:
  index_row_size · rows = 2 · 64 bit · 10^5 = 2 · 8 B · 10^5 = 1.6 MB ≈ 1.5 MiB

Where do the 10^5 rows come from? With a composite index and a point query, doesn't the database scan just the 100 returned rows?
  • SirupsenOP 3 years ago

    You're absolutely right!

    I forgot to move this around when I updated the article's structure. This is only relevant when doing the index merge. The article has been updated.

Lukas1994 3 years ago

Good stuff! What's the size difference between the composite index vs the two separate indices?

  • SirupsenOP 3 years ago

    I've added this to the article, thanks!

    Composite index (int64, int64): ~70 MiB in Postgres, ~350 MiB in MySQL

    Single index (int64): ~70 MiB in Postgres, ~240 MiB in MySQL

    If you assume the majority of an index is index entries of (int64, int64, int64), where the third number is some sort of identifier for the record on the heap, you'd expect this to be ~230 MiB (3 · 8 B · 10^7 rows ≈ 229 MiB). So Postgres does some compression very well here, and MySQL seems to have a bit more overhead for its indexes.

    • fabian2k 3 years ago

      Could be the deduplication newer Postgres versions have for B-Tree indexes:

      https://www.postgresql.org/docs/current/btree-implementation...

      > 67.4.3. Deduplication

      > A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.

      > Deduplication works by periodically merging groups of duplicate tuples together, forming a single posting list tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of TIDs that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. The latency of queries can be reduced significantly. Overall query throughput may increase significantly. The overhead of routine index vacuuming may also be reduced significantly.

      • SirupsenOP 3 years ago

        Excellent, thank you! I'll add that to the article.

        • fabian2k 3 years ago

          This is a guess on my part, though I think a plausible one. To verify you'd probably have to compare an index with unique values to one with many identical ones.
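
          A rough way to run that comparison (a sketch with hypothetical table names, not verified here):

              create table uniq_vals as select generate_series(1, 10000000) as v;
              create table dup_vals  as select generate_series(1, 10000000) % 100 as v;
              create index on uniq_vals (v);
              create index on dup_vals (v);

              -- the duplicate-heavy index should come out much smaller
              -- if deduplication is doing the work
              select relname, pg_size_pretty(pg_relation_size(oid))
              from pg_class
              where relname in ('uniq_vals_v_idx', 'dup_vals_v_idx');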

    • masklinn 3 years ago

      Another thing which might be of interest: what if you use covering indexes for the two single-column indexes? Is Postgres able to do an index-only scan then, thanks to the coverage?

      Or does it decide to use only one index and filter based on the covered values?
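
      In Postgres 11+ that could look like the following sketch (reusing the article's column names; which plan the planner actually picks is exactly the open question):

          create index on test_table (int100)  include (int1000);
          create index on test_table (int1000) include (int100);

          -- with either covering index the count needs no heap fetches,
          -- provided the visibility map is mostly set
          explain analyze
          select count(*) from test_table
          where int100 = 1 and int1000 = 1;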

    • libraryofbabel 3 years ago

      As you mention in the article, this is small by modern hardware standards and means the indexes can be entirely held in memory. Curious how things look for much larger tables where the index has to be read from disk?

bawolff 3 years ago

Honestly I'm kind of surprised that they are even that close. I wonder if this changes at scale, when the intersection is larger.

  • SirupsenOP 3 years ago

    Ideally, I would add three graphs to the post:

    (1) Table size on the x-axis, and time on the y-axis for index merge vs composite index

    (2) Number of columns on the x-axis, and time on the y-axis for both

    (3) Number of final matches on the x-axis, and time on the y-axis for both

    But I ran out of time and decided to test with a table size of 10M rows and a ~100-row result set. In my experience that's a decent representation of what you might be doing with a relational database.

scotty79 3 years ago

Isn't the cause of the difference between MySQL and PostgreSQL in this particular case that COUNT is (was?) implemented weirdly in PostgreSQL and was always slower than in MySQL?

Can PostgreSQL respond to any query straight from the index, without touching the rows, if the index contains all the necessary fields?
