Running Redshift at Scale

blog.artie.so

32 points by tang8330 3 years ago · 7 comments

Reader

gregw2 3 years ago

I run Redshift at scale.

On using ra3, I agree generally, but for cost would recommend Refshift Serverless in Dev/QA for cost reasons unless you have steady workloads there.

On CPU, even 1 query will cause CPU to hit 100% so I don’t consider it that helpful a metric on a well used cluster. A subtler helpful metric is leader node CPU which you don’t really want to be over 50% regularly, as that is where your query planning, GROUP BYs consolidations and ORDER BY sorting of results occurs.

Agree you should use WLM. The article doesn’t discuss whether to use regular/classic WLM or newer AutoWLM. I’ve done both. I don’t have a strong recommendation. Classic WLM lets you guarantee each queue gets some queries run but wastes away cluster RAM for that queue if idle. AutoWLM gives higher queues priority but you can starve out lower queues quite easily if a high queue runs a lot of jobs and you don’t have concurrency scaling turned on. In practice I use AutoWLM these days hoping for cost usage optimization but not sure I get it. Oh and serverless doesn’t have WLM.

Turning on concurrency scaling and limiting it to 1 hour per day which is free is a no brainer. Agree there. But if you plan to turn it on without that limit, I’d recommend you have some way to track or limit who can/is driving the now pay-as-you-go variable expense or your bills will rise and rise as more queries from your users just for free from their standpoint but at cost for you… much likea la pay-as-you-go snowflake/bigquery. You don’t have that cost dynamic with a fixed size cluster.

Cold storage is an odd recommendation. Perhaps I am missing something but redshift storage costs with ra3 are so near s3 regular tier costs, I’m not sure it’s worth optimizing. I am unclear how/if the agreed much lower cost glacier/etc retrieval works when using redshift s3 tables; does it auto fetch them out of glacier and with what delay?

More could be said. I’m just responding to the points in the article.

Corrections welcome.

kderbe 3 years ago

The article disappointed me and did not live up to its "at scale" title. At best it repeats the happy-path official documentation, and where it deviates it's for the worse.
I have spent some time administering Redshift. I agree with all gregw2's points. Except for leader node CPU utilization which I've never seen go above 20% even momentarily.
A tip: prefer a larger RA3 node size to having more nodes, e.g. two ra3.16xlarge instead of eight ra3.4xlarge. (This goes against the advice I received from AWS support.) This will give your cluster a beefier leader node for free. Larger worker nodes can also help with a skewed query that would be starved for RAM on smaller nodes--some of my biggest analytical queries tripled in execution time when I tried them on a same-cost cluster with a smaller node size.
- gregw2 3 years ago
  
  Agree with your tip about larger RA3 node sizes. An advisor at AWS during a resizing POC clued me into it and having switched one big cluster to ra3.16xlarge, the leader node CPU utilization has definitely dropped for our peak Monday morning workloads.
  (Yes, with a very large ra3.4xlarge cluster with substantial Redshift Concurrency Scaling, our leader node was in the peak-most hour nudging above 50%.)
  I'm still not clear whether switching to SNAPSHOT_ISOLATION over SERIALIZABLE actually helped or hurt our huge workload. Anecdotally we sort of traded fewer deadlock-killed queries for longer-locked queries which may have grown our overall expenses.
  One minor thing I've anecdotally noticed recently that seems to have changed at some point, perhaps around the time of our transition to ra3.16xlarge, not sure: in the old days, Redshift Concurrency Scaling seemed to kick in the minute that one query was queued up. Nowdays it seems to kick in about 5 minutes later.
  I'd prefer it if AWS added some ability to only activate concurrency scaling when a queue has X queries queued or queries queued for Y minutes or something.

simonpantzare 3 years ago

I've used Redshift for many years and really like it, especially as a user. I'm following the development of DuckDB and one aspect that's nice about it is how predictable prices are when you run it on dedicated servers: https://fet.dev/posts/costs-of-analytics-on-dedicated-hardwa... .

If you don't have huge amounts of data I would take a look at it. Also, Cloudflare R2 might be a good alternative to S3 because there are no egress costs.

hodgesrm 3 years ago

> I'm following the development of DuckDB and one aspect that's nice about it is how predictable prices are when you run it on dedicated servers:
Are the predictable prices worth the trade-off of more labor to operate the servers? I work for a database vendor and find the economic trade-offs quite fascinating.

ryadh 3 years ago

We did a deep-dive comparison between Redshift and ClickHouse Cloud for analytics workloads at scale: https://clickhouse.com/blog/redshift-vs-clickhouse-compariso...

It's fairly technical and goes into details but overall it displayed how to achieve increased compression and query performance

bagels 3 years ago

Redshift is so crazy expensive

Settings

Running Redshift at Scale

Keyboard Shortcuts