Implement table partitioning (git.postgresql.org)
Any support for "rolling" partitions? E.g. a partition for data updated less than a day ago, another for data from 2-7 days ago, etc.
I miss this from Oracle; it allows nice index optimizations as the query patterns are different for recent data vs. historical data.
I think it could be set up with a mess of triggers and a cron job... but it would be nice to have a canonical way to do this.
The fundamental issue here is that you'd actually have to move the rows between relations given that Postgres maintains separate storage etc. for each. There's no good way to do that.
Living with the cron jobs for a big mysql db, and wishing the DB understood this seemingly common use-case :(
Honestly I wouldn't call it "common". It's useful, and if it existed I could see it changing how I design a database, but it's not something I can say I've ever thought about needing before.
But then again, maybe I'm the outlier here.
It's very common to partition by a function of a date, e.g. `PARTITION BY RANGE( DAY(event_timestamp) )` etc. The docs talk a lot about partitioning by dates http://dev.mysql.com/doc/refman/5.7/en/partitioning-range.ht... but, as said, you have to have a cron job to keep adding new partitions and archiving/dropping old partitions etc. It's a shame that this couldn't be automated by the DB itself.
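For illustration, a rough MySQL-flavoured sketch of that pattern (table and partition names are invented); the cron job's whole purpose is to keep splitting the catch-all partition and dropping the oldest one:

    CREATE TABLE events (
        id              BIGINT NOT NULL,
        event_timestamp DATETIME NOT NULL
    )
    PARTITION BY RANGE (TO_DAYS(event_timestamp)) (
        PARTITION p2016_11 VALUES LESS THAN (TO_DAYS('2016-12-01')),
        PARTITION p2016_12 VALUES LESS THAN (TO_DAYS('2017-01-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- run from cron around each month boundary:
    ALTER TABLE events REORGANIZE PARTITION pmax INTO (
        PARTITION p2017_01 VALUES LESS THAN (TO_DAYS('2017-02-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );
    ALTER TABLE events DROP PARTITION p2016_11;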
How does this work in Oracle? Seeing as the partitioning constraint would be time-dependent, wouldn't it need to re-evaluate it at regular intervals in order to shuffle data around? Is the feature explicitly time-oriented?
I don't think Oracle can do this exactly, but the query planner does understand time based partitions, so if you do something like:

    SELECT * FROM partitioned_table WHERE partition_date_key > SYSDATE - 1;

the query planner will only use the most recent partition. Combine this with Oracle's ability to merge partitions and you get "daily" partitions that become "weekly" partitions when the new week starts. Alternately you could wait a month and combine all the days of last month into a single partition, and then even combine months into years. The partition intervals are based on specific dates/times, not on the relative time from query execution.
Oracle also supports row movement which is the biggest missing feature here I believe.
The conversation on the patches are really interesting: https://www.postgresql.org/message-id/flat/55D3093C.5010800@... https://www.postgresql.org/message-id/flat/ad16e2f5-fc7c-cc2...
I'm always amazed by the PG community - it seems like such a constructive place.
Those patches are absolutely insane. Makes you remember how much hard work goes into building the software you use on a day to day basis.
https://www.postgresql.org/message-id/attachment/45478/0001-...
I've been professionally focused on PostgreSQL-based work for the last 5 years. At the height of the BigData hype I sometimes felt a little bit off-track, because I never got the time to investigate NoSQL solutions...
Only recently did I realize that being focused on actual data and how to process it inside PostgreSQL was maybe the best way I could spend my working time. I really can't say what the best part of PostgreSQL is: the hyperactive community, the rock solid and clear documentation, or the constant roll-out of efficient, non-disruptive, user-focused features...
I can see a good amount of quality engineering there, kudos.
If you also didn't know what exactly partitioned tables are, here's a nice introduction from Microsoft:
https://technet.microsoft.com/en-us/library/ms190787(v=sql.1...
It is for SQL Server, but I assume it would be mostly relevant. Please correct me if I'm wrong.
So this is all about partitioning data into different storage files on the same server? What is the main benefit of that?
If you combine the partitions with tablespaces, you can put tables on multiple disks. Let's say you keep a record of all orders you have processed. During the day-to-day operation, you need, say, the last 2 months of data all the time, but the older data you only need for reporting every now and then.
By partitioning, you can keep the recent data on a fast disk and the older data on slower disks while still being able to run reports over the whole dataset.
And once you really don't need the old data any more, you can just bulk-remove partitions which will get rid of everything in that partition without touching anything else.
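As a rough Postgres sketch of that setup (table and tablespace names are invented; with pre-10 inheritance partitioning these child tables are ones you created yourself):

    -- keep the hot partition on fast storage and the archive on slow storage
    ALTER TABLE orders_2016_12 SET TABLESPACE fast_ssd;
    ALTER TABLE orders_2015_01 SET TABLESPACE slow_archive;

    -- retiring old data is a metadata operation rather than a row-by-row DELETE
    DROP TABLE orders_2014_12;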
Even if you don't split over tablespaces: by keeping the data that changes often separate from the data that's static and only read, you gain some advantages in index management and in disk load when vacuum runs, as it mostly wouldn't have to touch the archive partitions.
> For example, if a current month of data is primarily used for INSERT, UPDATE, DELETE, and MERGE operations while previous months are used primarily for SELECT queries, managing this table may be easier if it is partitioned by month. This benefit can be especially true if regular maintenance operations on the table only have to target a subset of the data. If the table is not partitioned, these operations can consume lots of resources on an entire data set. With partitioning, maintenance operations, such as index rebuilds and defragmentations, can be performed on a single month of write-only data, for example, while the read-only data is still available for online access.
The "General Ledger Entry" table in most accounting systems ends up being millions to billions of rows. Except for rare circumstances, prior periods are read-only due to business rules.
Metadata operations on partitions can be very fast. One simple example is date-based housekeeping. Deleting a month of data will be quite intensive on most databases, whereas dropping a partition from the table is effectively instant.
Partition switching is also fast. Say you have a summary table that is rolled up by month, but you want to recalculate the summaries every so often. You can build a month into a new table and then switch the new table for a partition in the summary table.
When the circumstances allow it ('cause there are some limitations on where it can be used), this pattern is HUGE. We've got a few places in our system that do this, and the optimization achieved an improvement of a couple orders of magnitude.
First, if your query uses the partition key in its WHERE clause, the database knows (can calculate) which partitions can contain a result and which cannot. This means smaller indexes/less data to scan to find the result.
In the MSSQL case - not sure about others, this is where I had to use it - you can also switch data segments between tables partitioned over the same partition function and with the same DDL. So you recreate the existing table a second time, create all the required indexes on it (which is fast because the table is empty), and then you switch partitions between them, basically via pointer manipulation. The empty partition is now in the normal table, the data partition in the recreated one. Then you run DROP TABLE on the recreated table. This is much more IO efficient than a DELETE FROM statement.
This switching of course allows for a lot of other fun stuff as well, where you switch out a partition with a couple million rows, then work on it in isolation, switch the partitions back and then only have to "replay" the few rows that hit that partition while they were switched. Which is easy because they are now in the shadow table which is not updated further.
It is of course data and application dependent if you can use these things without affecting your application; but if it is suitable, the gains can be immense.
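A minimal T-SQL sketch of that switch-out/switch-back dance (table names and partition numbers are made up; the staging table has to match the source's structure, indexes and filegroup placement):

    -- move partition 3 of the live table into an empty, identically defined staging table
    ALTER TABLE dbo.orders SWITCH PARTITION 3 TO dbo.orders_staging;

    -- ... rework dbo.orders_staging in isolation (rebuild indexes, recalculate, etc.) ...

    -- switch the reworked data back into the live table
    ALTER TABLE dbo.orders_staging SWITCH TO dbo.orders PARTITION 3;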
So this won't help when you have something like a deleted flag and need to join on data that could be in either partition?
Yes and no. It won't help in the sense that the query planner can not identify the affected partitions. But it can help if the database supports parallel index scans, since instead of one big index, it has to scan n smaller indexes which can be done concurrently.
I have a scenario where we want to keep 12 months of data online. When you go to delete the 13th month of data the traditional way:
- Postgres has to scan the whole table to find the old data
- Postgres marks it as free, but doesn't give it back to the OS
Handling this the naive way winds up being both slow and unproductive. With table partitioning, I just go in and DROP TABLE data_2015_11 and get on with life. It's fast and returns space to the OS.
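To make the contrast concrete (a sketch with made-up column names; data_2015_11 being the monthly child table):

    -- the traditional way: scans the table, leaves dead tuples behind, and the
    -- freed space is reusable but not returned to the OS without a VACUUM FULL
    DELETE FROM data WHERE created_at < now() - interval '12 months';

    -- with monthly partitions: effectively instant, and the partition's files are
    -- removed outright, so the space goes straight back to the OS
    DROP TABLE data_2015_11;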
I don't get it? Table partitioning is already supported in PostgreSQL and has been for a long time (at least since 8.1); where I work we utilize table partitioning with PostgreSQL 9.4 on the product we're developing.
https://www.postgresql.org/docs/current/static/ddl-partition...
As far as I understand, this is about declarative partitioning. So you don't have to implement all the details yourself anymore; you just declare how a table should be partitioned instead of defining tables, triggers, ...
Note that there is no shorthand syntax (yet), where you define a partitioned schema in just one line of DDL.
As of now, you still need to create the root partitioned table with one command specifying the partitioning method (list or range) and the partitioning columns (aka PARTITION BY LIST | RANGE (<columns>)), and then a command for every partition specifying that partition's bounds. No triggers or CHECK constraints anymore, though. Why that way? Because we then don't have to assume any particular use case for which to provide a shorthand syntax -- like fixed-width/interval range partitions, etc.
That said, having only the syntax described in the previous paragraph in the initial version does not preclude offering a shorthand syntax in later releases, as and when we figure out that offering some such syntax for more common use cases is useful after all.
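For the curious, a minimal sketch of what the new DDL looks like (hypothetical table and partition names):

    CREATE TABLE measurement (
        logdate  date NOT NULL,
        peaktemp int
    ) PARTITION BY RANGE (logdate);

    CREATE TABLE measurement_y2017m01 PARTITION OF measurement
        FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');

    CREATE TABLE measurement_y2017m02 PARTITION OF measurement
        FOR VALUES FROM ('2017-02-01') TO ('2017-03-01');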
The linked patch notes specifically mention the difference between this and table inheritance based partitioning.
> Because table partitioning is less general than table inheritance, it is hoped that it will be easier to reason about properties of partitions, and therefore that this will serve as a better foundation for a variety of possible optimizations, including query planner optimizations.

"Supported" in so far as you could basically roll your own implementation; having it managed by the engine is massively more useful and easier to support and set up. A lot of things are "supported" if you're willing to bodge them together like that.
It sounds like this is column-level partitioning. Each column or group of columns (based on the partitioning expression) is stored as a different subtable (or something similar) on disk. If only a few columns are frequently accessed, they can be put on cache/a faster disk, or other neat optimizations for join processing.
Maybe I don't get you, but I don't think so; PostgreSQL is not a columnar database.
If I got this patch right, each partition of a partitioned table will have the same data structure and store whole rows (it's even more restrictive than the previous inheritance mechanism, which allowed extending a child table by adding additional columns).
The column or expression only defines which table an inserted row is supposed to be stored in. A single row will never be torn apart. Still, it looks like a foundation that facilitates sharding a BigData(Set) between multiple servers when used in conjunction with foreign data wrappers. However, a lot of performance improvements will still be needed to compete against solid NoSQL projects (in cases where you really have a BigData use case).
But looking a bit forward, developing performance improvements on top of an ACID-compliant distributed database seems less difficult than developing a NoSQL project to the point where it becomes ACID.
Generally PostgreSQL is a row-oriented database, but Citus has released columnar storage extensions that make it perform similarly to other columnar databases.
But this patch is about horizontal partitions.
It is not. The partitions are by row. The novelty is in that it is truly native support, rather than a couple optimizations to make roll-your-own palatable.
Sorry, I am wrong. I saw the PARTITION BY {column name} and, lazy me, assumed things.
No, it's row-level partitioning — splits a single logical table into multiple physical tables.
I was wrong. It isn't column partitioning.
About donations: I believe PostgreSQL now deserves more advertising and marketing to develop its adoption in major companies and, hence, get more funding. If I donate on the website, it says it will help conferences. Where should I donate?
Supposing all partitions are on the same disk, and that you manage to index your data well enough for your usage patterns that Postgres does not need to do full table scans, are there any additional performance benefits to partitioning?
Well, anything that reduces the size of a search space helps performance.
Partitioning can drastically improve query times because the planner can use statistics only from a single partition (assuming the query works on a single partition). Postgres uses (among other things) a range histogram of cardinalities to determine the "selectivity" — how many rows a query is likely going to match. If you have a 1B-row table and you're looking for a value that only occurs once (low cardinality), the statistics won't help all that much. But if you partition it so that you're only looking at 1M rows instead of 1B, the planner can be a lot more precise.
Another point is cache efficiency. You want the cache to contain only "hot" data that's actively being queried for. If most of your queries are against new data, then without partitioning, any single page would likely contain tuples from all time periods, randomly intermixed, and so a cached page will contain a lot of data that's not used by any query. (Note that with table clustering, which requires regularly running the "CLUSTER" command, you can achieve the same effect at the expense of having to rewrite the entire table.) If you partition by time, you'd ensure that the cache was being used more optimally.
Write access is also helped by partitioning by cold/hot data: B-tree management is cheaper and more cache-efficient if it doesn't need to reorganize cold data along with the hot. And smaller, frequently changed partitions can be vacuumed/analyzed more frequently, while unchanging partitions can be left alone.
1. Flexibility/freedom to distribute partitions in the future if needed.
2. Indexing doesn't work well in all cases. You can be better off scanning entire small partition tables lacking an index on a given column than with a single very large table whether that column has an index or not. (Indexes take up space and need to be read from disk if they don't fit in a memory cache, indexes don't work well for low-cardinality columns, etc.)
3. There are operations you can parallelize over a large number of small/medium tables and perform faster or more conveniently than on a single very large table. One of my favorite techniques:
#!/bin/bash
# alter-tablespace.sh: move one hex-suffixed partition to a new tablespace
# usage: seq 0 255 | parallel -j16 ./alter-tablespace.sh {}
hexn=$(printf "%02x" "$1")
psql -Atc "ALTER TABLE tablename_${hexn} SET TABLESPACE new_tblspace" -hxxx -Uxxx xxx
4. A nice side effect of properly/evenly partitioned data you get for free is that you can do certain types of analysis on a single partition (or a few) very quickly and have it represent a sampled version of the data set. You can think of it as another index you get for free.
To add to the responses you already got, there's also a nice use case that partitioning helps with.
Say you have a table into which you are constantly inserting large amounts of data, and you are removing old data at the same rate (i.e. you only care about the last month of data).
If you partition, for example, per day, it's way faster to drop the old tables than to perform a delete.
Yes. Less latch contention for nodes of a single btree index, for instance.
I didn't know what latch is, so I googled it and found a nice explanation:
https://oracle2amar.wordpress.com/2010/07/09/what-are-latche...
"A latch is a type of a lock that can be very quickly acquired and freed."
That brings me a couple more questions:
1. May I infer then that the only benefit from partitioning the table (fully located on the same disk) that can not be achieved by indexes is that queries will wait less time for this kind of lock to be released?
2. May I assume while a table is only being read and not changed, there's no performance gain from partitioning a table (fully located on the same disk) that can not be achieved by indexes?
There are other possibilities as well. For example, if your partitioning strategy is such that it improves the selectivity of an index, it could improve query plans for queries that were on an index that was less selective. As an example, I once had a table with over a billion rows distributed among ~100 tenants on which queries were typically run by tenant and date range. Partitioning that table by tenant dramatically improved the performance of those queries because those queries no longer had to scan through rows of which only ~1% were for the tenant of interest.
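A rough sketch of that layout using the new declarative syntax (names invented; the same shape can be built today with inheritance and CHECK constraints):

    CREATE TABLE events (
        tenant_id  int  NOT NULL,
        created_at date NOT NULL,
        payload    text
    ) PARTITION BY LIST (tenant_id);

    CREATE TABLE events_tenant_42 PARTITION OF events FOR VALUES IN (42);
    CREATE INDEX ON events_tenant_42 (created_at);

    -- a tenant + date-range query only touches that tenant's partition, and the
    -- per-partition date index is far more selective than a global one would be
    SELECT count(*) FROM events
    WHERE tenant_id = 42
      AND created_at BETWEEN '2016-01-01' AND '2016-01-31';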
If you had built a composite index instead, wouldn't it work just the same towards improving the performance? I can see that a composite index would occupy more space while partitioning wouldn't, and that's something to take into consideration. But performance-wise, wouldn't it be the same?
No, for example, a composite index on (tenant, date) would be highly non-selective whereas an index on date in each individual tenant partition is highly selective and therefore higher-performance (in my case, much higher performance).
What if you made an index on (date, tenant) instead?
The issue he or she is referring to is that the matching rows are scattered across heap pages (so you might need to fetch 40 pages for 40 index entries, even if they would all fit on one page in a partitioned table). Fiddling with the index order isn't going to change that (the current index order is already optimal for that type of query). There's another solution, which is to expand the index to include any covered columns so you can use index-only scans and not hit the main rows at all, but that's trading away a large amount of index size, which you usually want to avoid unless you have no other choice.
> 1. May I infer then that the only benefit from partitioning the table (fully located on the same disk) that can not be achieved by indexes is that queries will wait less time for this kind of lock to be released?
See below for many examples. (I originally tried to split up the read and write benefits, but that usually isn't very meaningful, since speeding up writes often speeds up reads indirectly: by enabling more indices to be maintained, releasing resources faster, reducing index size, or improving statistics.)
> 2. May I assume while a table is only being read and not updated, there's no performance gain from partitioning a table (fully located on the same disk) that can not be achieved by indexes?
No. First, latch contention occurs even on fetch--latches are much lower level than the kinds of logical locks that get taken on updates and are generally required in order to enforce database invariants (in Postgres btree indices they live at the page level). Second, there are many other benefits.
An important one is separate maintenance of planner statistics. When a table has enough entries, Postgres's "most frequent values" list ends up not being able to fit all the rows with large frequencies (this can be tweaked up to storing 10,000 such buckets, but eventually even this is not enough if your table is very large). It then has to resort to its histogram, which normally only works with columns with a natural sort order. Since Postgres bases many important decisions during query planning on these statistics, you will often see order-of-magnitude improvements in query performance by improving table statistics. Note that composite indices do not keep separate statistics "per prefix" the way partitioning does (that is, an index on (a,b) does not track statistics of column b "within" column a); they just keep statistics on the individual covered columns or expressions. Not tracking cross-column correlations has a large impact in practice if your columns are not uniformly distributed throughout your table (and you have enough rows).
Another potential benefit (depending on your partition key) can be reduction in index size. For example, if you partition a table into a different segment for each tenant in your organization, you can use selective indices that don't include the tenant id. Note that you can already do this with partial indices, but Postgres does not use their statistics, which makes them much less useful, and large numbers of them can degrade planner performance (moreover, until recently, it didn't even use them for index-only scans, but I believe this was fixed in 9.6). Reduction in index size allows more indices (or more nodes of the index) to fit into memory (or even sometimes cache, for internal nodes), which can lead to dramatic speedups, especially if you can use an index-only scan. Smaller indices are also much faster to build, maintain, and scan.
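For illustration, the partial-index alternative mentioned above looks something like this (made-up names); it shrinks the index in the same way, but, as noted, Postgres doesn't maintain separate statistics for the rows it covers:

    -- one small per-tenant index that doesn't need to store tenant_id at all
    CREATE INDEX events_tenant_42_created_at ON events (created_at)
        WHERE tenant_id = 42;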
The combined index size of indices in a partition is also slightly smaller (by a factor of around log(n) / log(n/p)) but that isn't really that significant (for 2^30 (over 1 billion) tuples, and 100 partitions, it is still just a 1.3x improvement, and can easily be overwhelmed by the size of the extra metadata for smaller numbers of tuples). For much the same reason, updating 100 100-entry histograms / most-frequent-value sets can be (slightly) faster than a single 10,000 entry one.
Another one is that if you are fetching more than one row at a time from a partition, keeping them (more) clustered on disk and in RAM (remember that fetching a page of RAM that's not in cache is actually quite slow!) means substantial savings in I/O and greatly increased likelihood of page buffer hits, which leads to order of magnitude savings. This is probably the biggest improvement from partitioning, and it applies to more complicated queries than point lookups (such as range queries and bitmap index scans). There are also some types of index (notably BRIN) that are most effective if your data distribution is such that you know you tend to have monotonically increasing keys over the entire table. In multitenant situations, these become much more useful if you can restrict them to a single partition, and allows you to again save index space.
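As a concrete example of the BRIN point (hypothetical names): a BRIN index is tiny and works best when a column's values are physically correlated with row order on disk, which a single time-scoped partition makes far more likely:

    CREATE INDEX ON events_2016_12 USING brin (created_at);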
There are other benefits to overall database performance (in the same vein as btree latching) from being able to address different tables in parallel, such as vacuuming being able to be performed separately on different tables (there are also some drawbacks, like operations that perform metadata scans [also VACUUM!] taking longer, but maybe this is improved in Postgres's new scheme--I'm not sure).
One that I don't see people talk about a lot, but is critically important for our usecases where I work, is that Postgres's implementation of SSI (serializable snapshot isolation), though it tries to use gap and row-level locking, will conservatively upgrade to page-level locks, and sometimes even table-level ones. If other updates are going on in the database at the same time, this can cause false conflicts, leading to many unnecessary aborts and retries. If data are partitioned and updates tend to be specialized to a partition, the chances of this happening are greatly minimized.
Finally, even though it contradicts your original premise: requiring an index to be used for every operation is asking for trouble. Making scans fast pays off in spades (especially if you have arbitrary custom filters in queries) and can free up your programmers to focus on things other than database performance :P A combination of intelligent partitioning (where appropriate), some understanding of Postgres's internals, and very little else can allow servicing surprisingly large numbers of users with varied query requirements, without creating many indices that aren't directly implied by the data model.
Hopefully, this gives you some sense of why people are excited for this feature :)
This message was confusing to me because I've been using/abusing Postgres inheritance for partitioning for so long that I forgot Postgres didn't technically have a feature called "partitioning".
What I'm looking forward to finding out is if I can take an arbitrary expression on a column and have it derive all the same benefits of range partitioning like constraint exclusion.
I really like this addition. We store a lot of data for different customers, and most of our queries are only about data from a single customer. If I understand it correctly, if we would partition by customer_id, once the query planner is able to take advantage of this new feature, it will be much faster to do such queries as it won't have to wade through rows of data from other customers.
Another common use case is that we want to know an average number for all/some customers. To do this, we run a subquery grouped by customer, and then calculate the average in a surrounding query. I hope that the query planner will eventually become smart enough to use the GROUP BY clause to distribute this subquery across the different partitions.
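Something like this two-step pattern, for reference (hypothetical schema):

    SELECT avg(order_count)
    FROM (
        SELECT customer_id, count(*) AS order_count
        FROM orders
        GROUP BY customer_id
    ) AS per_customer;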
I just tried to implement table partitioning in PostgreSQL 9.6 this week. With some triggers and check constraints this seems to work quite nicely, but I was a bit disappointed that hash-based partitioning is currently not possible (at least not without extensions).
Will hash based partitioning be included in PostgreSQL 10? The post notes
> A partitioning "column" can be an expression.
so I can assume it will be supported?

As long as the expression being hashed doesn't change, then yes, you could make the expression a hashing function call. If the expression being hashed is mutable there would be issues, since the feature doesn't currently support updates that result in rows moving between partitions.
Not natively, as in there is no PARTITION BY HASH (<list-of-columns>). What limitations do you face when trying to roll your own hash partitioning using check constraints (in 9.6)?
I wanted to partition a table by the foreign key, as the table receives a few hundred rows per foreign key per hour (it is a timeseries db).
So I figured partitioning the table by foreign key would group all data together in a way that allows for faster access (typical access pattern would be select * where foreign_key = x). However, as the number of keys in the foreign table is unbounded and can be quite large, I wanted to partition the data into a limited number of tables with mod(foreign_key, number_of_partitions). If I understood correctly, check constraints can't operate on a calculated value.

Yes, it is not possible to optimize (i.e., prune useless partitions for quicker access) the query select * from tab where key = x. You'd need actual hash partitioning for that. The mechanism Postgres uses to perform partition pruning (constraint exclusion) does not work for the hashing case.
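To illustrate the limitation with a 9.6-style inheritance sketch (made-up names): the CHECK constraints themselves are perfectly legal, but constraint exclusion can't use them to prune for a plain equality predicate:

    CREATE TABLE readings (foreign_key int NOT NULL, value numeric);
    CREATE TABLE readings_p0 (CHECK (mod(foreign_key, 4) = 0)) INHERITS (readings);
    CREATE TABLE readings_p1 (CHECK (mod(foreign_key, 4) = 1)) INHERITS (readings);
    -- ... p2, p3 ...

    -- the planner can't prove which child satisfies mod(foreign_key, 4) = ?
    -- for this constant, so every child table gets scanned:
    SELECT * FROM readings WHERE foreign_key = 12345;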
...this is the beginning, not the end... https://www.postgresql.org/message-id/CA%2BTgmobTxn2%2B0x96h...
While seemingly extensive, I don't quite like the commit message.
It doesn't say what TP is, or what its use cases would be. That's the first thing you should say, else how am I going to understand / keep interest in the rest of the text?
The commit is written by postgres developers for postgres developers. I would say that 90% of the intended audience of that commit message doesn't need an explanation what table partitioning does.
For them this would be needless clutter that's not at all relevant to the commit.
Once we're nearing the 10.0 release, human-friendly release notes, additional manual chapters and sample code will be written for users to understand (in fact, the commit linked by this submission already contains quite a bit of additional documentation to be added to the manual).