PostgreSQL views and materialized views and how they influenced TimescaleDB

timescale.com

181 points by od0 3 years ago · 31 comments

gfody 3 years ago

> If I’d written out the query, I might have seen that I didn’t need the JOIN (or never written it in the first place). Whereas the view hides that complexity. So they can make things easier, but that can lead to performance pitfalls if we’re not careful.

You can avoid this particular pitfall by using LEFT JOINs in views like this (ones that join in data for convenience that you might not select). Postgres will eliminate a LEFT JOIN, but not an INNER JOIN, since the inner join could filter rows (e.g. symbols in stocks_real_time that don't exist in company). Commercial engines will use the presence of FK constraints to know that's impossible and go ahead and eliminate inner joins as well.
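A rough sketch of the pattern being described, using the table names from the post (the view and column names here are hypothetical, and company.symbol must have a unique or primary-key constraint for the planner to prove the LEFT JOIN can't change the row count):

```sql
-- A view that joins in company info "for convenience".
CREATE VIEW stock_view AS
SELECT s.time, s.symbol, s.price, c.name
FROM stocks_real_time s
LEFT JOIN company c ON c.symbol = s.symbol;  -- company.symbol is unique

-- If no columns from company are referenced, the planner can drop the
-- LEFT JOIN entirely; EXPLAIN should show only a scan of stocks_real_time.
EXPLAIN SELECT time, symbol, price FROM stock_view;

-- Had the view used an INNER JOIN, Postgres would still execute it,
-- since rows without a matching symbol would be filtered out.
```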

  • gopalv 3 years ago

    > commercial engines will use the presence of fk constraints

    What, postgres doesn't do FK join removals?

    Like I tried it right now and it didn't remove the hash-join

    http://sqlfiddle.com/#!17/073747/2

    Should've been just a count off the PK Employees table.

    Apache Calcite in Hive had to bake in this specific optimization because there were a ton of deep join views (coming off Teradata) where someone would run something like a "select count(*) + where" or just project 2-3 columns out of a 60+ join view from a couple of tables.

    And those just ran forever without the PK-FK removals.

    • michael1999 3 years ago

      How would that work? Your schema allows an employee to be assigned to multiple departments, and the query must count them.

      • atwebb 3 years ago

        It is also called Table Elimination if you want a google-able term.

      • jmalicki 3 years ago

        Because of the foreign key, the query should reduce to "explain select count(*) from Department d;"

      • doctor_eval 3 years ago

        The schema only allows one employee per department, so there is no need to look up the employee table.

    • taspeotis 3 years ago

      I would try running that fiddle on something more recent than Postgres 9.6

      • mdavidn 3 years ago

        I ran into this recently on a newer version. Postgres will not use the presence of a foreign key (referencing a unique index) to skip unnecessary inner joins.

        • taspeotis 3 years ago

          I have done my own (simple) testing on Postgres 13 and was unable to make Postgres elide the join.
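A minimal repro of what's being tested above (assumed schema, not the sqlfiddle one): even with a NOT NULL foreign key guaranteeing exactly one parent per child, Postgres still plans the inner join.

```sql
CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child  (id int PRIMARY KEY,
                     parent_id int NOT NULL REFERENCES parent (id));

-- The FK + NOT NULL make the join provably row-preserving, yet
-- EXPLAIN still shows a join node; Postgres's join-removal logic
-- only applies to LEFT JOINs against a unique key.
EXPLAIN SELECT count(*) FROM child c JOIN parent p ON p.id = c.parent_id;
```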

  • djk447 3 years ago

    NB: post author here!

Thanks, yes! Totally true. I was thinking about including some of that, but it felt like it opened a can of worms about join types and why certain things would be included and others not (i.e. the inner join needs to see that a match exists on both sides, whereas the left join doesn't), and the post was already getting pretty long.

garyclarke27 3 years ago

Postgres is an amazing database. Its only significant weakness now is in materialized views, with their lack of incremental refresh. It was disappointing to see no progress towards this in v15.

WXLCKNO 3 years ago

Enjoyed this post! djk447, I'd love to ask a question as well.

We've started working with Timescale to process historical time series data. However there is so much of it that we chose which parts we process.

It's possible that in the future we may need to go back and reprocess the dataset to add something we decided we want after all.

In your post it seems like this is handled automatically on a smaller time scale, meaning you could insert into the past without doing anything special.

What happens if you need to insert data at numerous points across two years worth of data instead? Do you have to use backfilling as described in your documentation? Or is it better to maybe rebuild the entire hypertable?

  • djk447 3 years ago

    NB: Post author

Yes, this is generally handled automatically. There may be times, though, where you want to essentially pause refreshing the view while you do some backfilling and then eventually let it catch up, especially if you're overwriting the same time period multiple times in a row. If you can insert in time order, then it just breaks the re-calculation into smaller segments, which can be quite useful rather than having to process the whole data set again.
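The "pause refreshing, backfill, then catch up" workflow described above can be done with TimescaleDB's job API (a sketch; 'daily_agg' and the literal job id are placeholders you'd look up yourself):

```sql
-- Find the refresh policy's job id for the continuous aggregate.
SELECT job_id FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate'
  AND hypertable_name = 'daily_agg';

SELECT alter_job(1001, scheduled => false);  -- pause the refresh policy

-- ... run the large out-of-order backfill here ...

SELECT alter_job(1001, scheduled => true);   -- resume; the policy
-- processes the accumulated invalidations and catches the view up
```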

    This can be a little bit different if you're doing compression, but with continuous aggregates I think it should work fine. I'm not 100% sure that was what you were looking for, let me know if it's not.

jbylund 3 years ago

Did you consider a type of continuous aggregate that works over pre-aggregated partitions for time buckets without modifications, swapping those buckets in the continuous aggregate view with a live view when data is modified? I guess performance would sort of continually degrade as inserts/updates/deletes happened in more and more time buckets, but as soon as the aggregation event fired, performance would recover. It seems like one could provide a more strongly consistent view of aggregates with this sort of approach.

  • djk447 3 years ago

    NB: Post author

    I'm not 100% sure I understand what you're asking, but essentially something that would look for data modifications and at query time run the query over the older regions as well?

    If that's what you're asking the answer is yes, we did consider it, but basically decided that it was something that relatively few people needed and the complexity and performance tradeoffs were unlikely to be worth it for most folks.

    Essentially, we could do something like this now by looking at our invalidation log and running a join against it to get to a more strongly consistent state (I haven't thought through the full implications and whether it's truly strong consistency, I think it might be, but it'd require a proof / some thinking through of all of our locking logic to really get there). It's interesting to consider though.

cribwi 3 years ago

Great approach and good write-up! I’ve implemented a similar technique before on PostgreSQL, but with the materialisation in the application backend. Still works like a charm.

>So instead, we created a special kind of trigger that tracks the minimum and maximum times modified across all the rows in a statement and writes out the range of times that were modified to a log table. We call that an invalidation log.

Does this invalidation log also take into account cases where the view has an aggregate that is based on data from a bucket other than itself? For example, a lag() or lead() might be used to calculate a delta compared to the previous bucket. Then, if a data point inside bucket 1 is added to the realtime table and bucket 1 is invalidated and re-materialised, for integrity reasons bucket 2 also needs to be re-materialised?
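The trigger described in the quote could be sketched roughly like this (an illustrative approximation, not TimescaleDB's actual implementation; the table and function names are hypothetical). It uses a statement-level trigger with a transition table so that one log row covers all rows modified by a statement:

```sql
CREATE TABLE invalidation_log (
    lowest_modified   timestamptz,
    greatest_modified timestamptz
);

CREATE FUNCTION log_invalidation() RETURNS trigger AS $$
BEGIN
    -- One log entry for the whole statement: the min/max times touched.
    INSERT INTO invalidation_log
    SELECT min(time), max(time) FROM new_rows;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER track_modified_range
AFTER INSERT ON stocks_real_time
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT EXECUTE FUNCTION log_invalidation();
```

On your lag()/lead() question: a range-based log like this records only the times that were directly modified, so a dependency on a neighbouring bucket would indeed require widening the invalidated range by one bucket at refresh time.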

hodgesrm 3 years ago

This is an excellent article. I like the way the author builds up in steps to eventual consistency between source tables and their materialized views. It was fun to guess the next step.

I do have one question: how does the algorithm described in the article work when the source table is spread across multiple servers, say in multiple shards? Can TimescaleDB maintain materialized views on each shard and then run a query that reconciles them?

Edited: clarification

  • djk447 3 years ago

    NB: Post author

So we thought about doing something like that with multinode, where each of the nodes would maintain their own materialization, but abandoned it for that very reason: it's very, very difficult to maintain any sort of consistency guarantees in that case, or even to reason about it.

Instead we use the access nodes as coordinators to do the materialization. Right now the materialization only exists on the access node, but there's no reason we couldn't send it back out to the data nodes; you just need a coordination point to start a distributed transaction to have some semblance of a guarantee.

Inviz 3 years ago

It’s sad that Timescale aggregates don’t work on top of other aggregates. The abstraction is leaking. The ticket has been left unaddressed for a while.

  • LoriP 3 years ago

Hello, I'd like to check in on this with you. You can find my email address in my profile; I'm Timescale's community manager. Thanks!

gigatexal 3 years ago

I wonder how well these perform compared to the aggregate tables of Clickhouse fame.
