Blog
2026-01-01
HNY from Kyiv
Happy New Year, everyone.
As an aside, I’m back in Kyiv.
Air raid sirens did go off in the late evening, but it was a quiet night last night, at least in the little bit of town I can see or hear. Surprised me - I was expecting a busy night.
Power was on overnight, which was nice - I was able to send midnight emails and messages to everyone.
2026-01-08
Okay. So. A white paper for AutoWLM came out in mid-2023. I’ve read it, and tried to understand it. I have understood most of the individual mechanisms in AutoWLM - there’s one or two where I get the idea but I don’t quite see how it works in practice - but I have not developed in my mind a full, unified grasp of how AutoWLM as a whole really actually works, from top to bottom, start to finish. The white paper gives a higher-level overview, then talks mainly about the individual mechanisms, and there seems to me to be information missing in the middle layers, where you’re joining up the bits.
As such, what I have is my best current theory of how AutoWLM works, given what I’ve read, having spent many hours trying to figure it out, and having discarded one or two earlier theories, because I found as I worked through them (while writing them up for this blog post - this is the third attempt at writing this post) that they didn’t add up.
Finally, bear in mind here AutoWLM has been around for a long time and has been through a number of major rewrites. The first versions were complete failures - there are no docs for them, but my guess is the first version was trying to ensure all queries ran with the same mean run time - and this white paper is from 2023, so there’s likely been change since then, which we know nothing of.
This post is purely about how AutoWLM works.
There is no commentary - that comes in later posts (I’m going to post about SQA as well, because quite a bit is written about that in the AutoWLM paper).
So let’s get started.
AutoWLM has a concept of query priority, there are five normal levels, from highest to lowest, plus one super-duper level called “critical”.
In effect, each priority level is its own FIFO queue, because AutoWLM selects which priority to take a query from using weighted round-robin, with higher weight the higher the priority level - i.e. they act just as queues. The white paper says “proportional weights”, which on the face of it means a 1x weight for lowest priority, and a 5x weight for highest.
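To make the weighting concrete, here’s a little Python sketch of how I imagine the selection working - five FIFO queues, weighted 1x to 5x. The weights and the weighted-random pick are my reading of “proportional weights”, not anything confirmed by the white paper.

```python
import random
from collections import deque

# My assumption: "proportional weights" means lowest = 1x through highest = 5x.
PRIORITIES = ["lowest", "low", "normal", "high", "highest"]
WEIGHTS    = [1, 2, 3, 4, 5]

queues = {p: deque() for p in PRIORITIES}

def submit(query, priority):
    queues[priority].append(query)

def next_query():
    """Pick the next query by weighted selection over the non-empty priority queues."""
    candidates = [(p, w) for p, w in zip(PRIORITIES, WEIGHTS) if queues[p]]
    if not candidates:
        return None
    # A weighted random pick stands in for whatever weighted round-robin
    # scheme AutoWLM actually uses internally.
    total = sum(w for _, w in candidates)
    pick = random.uniform(0, total)
    for p, w in candidates:
        pick -= w
        if pick <= 0:
            return queues[p].popleft()
    return queues[candidates[-1][0]].popleft()
```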
The critical queue is special (not mentioned in the white paper, but it can be found in the official docs - this is actually one of the very very very very very very very rare occasions where I have found out something material and useful and meaningful from the docs): you have to be a superuser to use this priority level, and only one critical query runs at a time.
Priority levels can be set on a per-queue basis, per-user basis, per-session basis, or on a per-query basis.
I think, but I do not know - it’s not in the white paper - that the source of a query is irrelevant. All that matters is its priority.
(Queues still matter in other ways, because for example you set CSC on a per-queue basis, and CSC is a mechanism separate to AutoWLM. Redshift has become too complicated.)
Additionally, priority levels are used by AutoWLM to decide if it should cancel an existing, running query, so that a higher priority query can execute immediately.
It’s not clear to me if this always happens (i.e. we have the set of running queries, a new query comes in, it has a higher priority than at least one of the running queries, and so will evict one of them, and if so, which one?), or only if certain conditions are met.
I do think (from another source) that critical always runs, and will always evict an existing query if it has to do so, in order to run.
AutoWLM keeps track of how often any given query has been evicted, and there’s a period of time during the next run of that query where the query cannot be evicted - a guard-rail - which increases exponentially with the number of evictions. The white paper does not indicate what the initial non-eviction period is.
Also, AutoWLM keeps track of the total time wasted by evictions (using the estimated query times again - which could be baloney) and the total time of completed work (not clear if both are from the estimates, or if the latter is from the actual time the query spent running), and if the ratio of wasted to completed exceeds an unspecified, hard-coded limit, eviction stops occurring at all.
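As a sketch of how I read the guard-rail and the wasted-work cut-off - the initial guard time and the ratio limit are not given in the white paper, so the numbers below are placeholders, not real values:

```python
INITIAL_GUARD_S = 10.0      # placeholder - the white paper gives no initial value
MAX_WASTE_RATIO = 0.25      # placeholder - the real limit is unspecified and hard-coded

eviction_counts = {}        # query identity -> number of times it has been evicted
wasted_seconds = 0.0        # estimated time thrown away by evictions
completed_seconds = 0.0     # time of completed work

def guard_period(query_id):
    """Non-evictable window at the start of the next run of this query,
    doubling with each previous eviction (an exponential increase)."""
    return INITIAL_GUARD_S * (2 ** eviction_counts.get(query_id, 0))

def may_evict(query_id, seconds_running):
    # If too much work has already been wasted, eviction stops entirely.
    if completed_seconds and wasted_seconds / completed_seconds > MAX_WASTE_RATIO:
        return False
    return seconds_running > guard_period(query_id)
```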
Next, the white paper doesn’t directly speak of it, but there’s a system table view which absolutely indicates that AutoWLM has two types of slot, “light” and “heavy”, and uses the memory resource estimate to decide which slot type to place a query in.
So; queries come into Redshift (probably saying something like “oh Gods, no, not Redshift, why? why?? couldn’t you have sent me to Postgres??! ahhhhhh!!!!! I’m sorry for everything!!!!!”) and for each query, before the query is run, AutoWLM produces estimates of memory usage, processor usage, and wall clock run time.
To make estimates, AutoWLM takes the query plan (i.e. from EXPLAIN or something much the same as that), and iterates over it summing the costs of each of the different types of operation - those sums being the information used to describe a query.
AutoWLM keeps this information for the last N queries - white paper doesn’t say how many. I heard a rumour some years ago that could be interpreted to mean N might have been 200.
Once a query has completed, AutoWLM then also knows how much memory and processor time was used, and the wall clock execution time.
Every now and then - the white paper doesn’t say how often (maybe every 200 queries?) - AutoWLM produces an XGBoost model from the current set of N queries, using the summed query plan costs and the known resource usages. That’s the machine learning part.
This naturally leads to the question “XGBoostawahooooie?”
I did a bit of digging. Basically, it’s a method to map a set of input parameters (in this case, the sums from the query plan) to a set of result parameters (in this case, memory used, processor time used, wall clock time), which gives you a generalized ability to then take any values for the input parameters (such as those from a new query), fire them into the model, and get estimates for the result parameters.
XGBoost apparently was a big thing in the mid-2010s, and its claim to fame is being lightweight when it comes to the amount of processor time needed to run the thing (which is important because if AutoWLM spends ages and a lot of processor time figuring out what to do with a query, you will have a system-wide performance problem).
Now, there’s an important detail here.
The AutoWLM devs write they found that the very large majority of queries were short, and so the XGBoost model they ended up with always predicted queries would be short, because all or almost all of the queries used for the model were short queries. There was often no knowledge at all of long queries, so no predictions of long query behaviour could occur.
What was done is that queries are broken up into groups by their actual duration (not clear if processor time duration or wall clock duration), two of which are explicitly mentioned - 0-10 seconds and 10-30 seconds. When a new query goes into the data set, it replaces the oldest query in its group.
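Putting the last few paragraphs together, here’s a sketch of how I imagine the training side works - duration-bucketed replacement of the history window, then an XGBoost model mapping summed plan costs to the three resource figures. Bucket boundaries beyond 0-10s and 10-30s, the window size, the choice of wall clock time for bucketing, and the model parameters are all made up for illustration; the real values are not in the white paper.

```python
import numpy as np
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor

# Only the 0-10s and 10-30s buckets are mentioned in the paper; the rest
# are invented here.  I'm also arbitrarily bucketing on wall clock time.
BUCKETS = [(0, 10), (10, 30), (30, 300), (300, float("inf"))]
WINDOW_PER_BUCKET = 50            # N is unknown (rumoured total ~200)

# Each record: (plan_cost_sums, (memory_mb, cpu_s, wall_s)) for a completed query.
history = {b: [] for b in BUCKETS}

def bucket_for(wall_s):
    for lo, hi in BUCKETS:
        if lo <= wall_s < hi:
            return (lo, hi)
    return BUCKETS[-1]

def record_completed(plan_cost_sums, memory_mb, cpu_s, wall_s):
    bucket = history[bucket_for(wall_s)]
    if len(bucket) >= WINDOW_PER_BUCKET:
        bucket.pop(0)                        # replace the oldest query in its group
    bucket.append((plan_cost_sums, (memory_mb, cpu_s, wall_s)))

def train_model():
    rows = [r for b in history.values() for r in b]
    X = np.array([features for features, _ in rows])
    y = np.array([targets for _, targets in rows])
    # One XGBoost regressor per target (memory, processor time, wall clock time).
    model = MultiOutputRegressor(xgb.XGBRegressor(n_estimators=50, max_depth=4))
    return model.fit(X, y)

def estimate(model, plan_cost_sums):
    memory_mb, cpu_s, wall_s = model.predict(np.array([plan_cost_sums]))[0]
    return memory_mb, cpu_s, wall_s
```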
So once we have the XGBoost model, when a query comes in, the information used to describe it (the sums of query plan operation costs) is produced, and this is fed into the XGBoost model to produce the three estimates: how much memory will be used, how much processor time will be used, and what the wall clock execution time will be.
Taking these estimates, AutoWLM then quantizes the query into one of five groups, the first group being for queries which are estimated to be fast and low on resource, the other four being based simply on how long the query will run (not clear if processor or wall clock time).
The first group is for queries which will be sent to SQA. Presumably if SQA is disabled, this group does not exist, only the other four groups exist. The other groups proceed to normal AutoWLM handling (which I’m about to write about).
The definition of “fast and low on resource” is adjusted once per week and is the 70th percentile of execution times of all queries seen that week (which seems heroic to me - if you go wrong, you go wrong for a week at a time) and presumably is for light slot queries only.
The white paper says adjustment is weekly; thinking about it, I might guess this happens at cluster maintenance, and if it does, then by deferring maintenance you would be deferring this AutoWLM adjustment.
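For what it’s worth, here’s the weekly cut-off and the routing decision as I understand them - note the assumption that “low resource” means “fits a light slot” is mine, and whether the times are processor or wall clock is not stated.

```python
import numpy as np

def weekly_short_cutoff(exec_times_s):
    """Recompute the definition of 'fast' as the 70th percentile of all
    execution times seen that week (per the white paper)."""
    return float(np.percentile(exec_times_s, 70))

def route(estimated_wall_s, fits_light_slot, cutoff_s, sqa_enabled=True):
    # Group one: estimated fast and low on resource -> SQA.
    if sqa_enabled and fits_light_slot and estimated_wall_s <= cutoff_s:
        return "SQA"
    return "normal AutoWLM handling"
```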
Anyway - I’m going to make a separate post about SQA. Suffice to say SQA seems to need AutoWLM to be running, so my guess is either AutoWLM stuff is running all the time even if you have it disabled in the console, or, if you have manual WLM but you also have SQA, then AutoWLM is running, because it’s needed for SQA to work.
So, to recap : we’ve got priorities (queues), we’ve got estimates.
AutoWLM can see which query should run next, based on the priorities/queues and AutoWLM now has to decide whether or not that query should be run.
The basic way in which AutoWLM works is that it attempts to maximize query throughput, in terms of and only of executed query seconds per second.
So when we have queued queries, taking one of them and running it obviously increases the amount of work being done - an extra query is being processed - but by running an extra query, we necessarily slow down all existing queries, increasing their run time.
If the time taken to run the new query is less than the total slow down imposed upon all other queries, then throughput is increased. AutoWLM assumes a perfectly linear slowdown in existing queries, based purely on the number of running queries - no other factors are considered.
If throughput will rise, the query is run.
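As a sketch of the admission decision as I read it - the linear slowdown factor and the direction of the comparison below follow my description above, not any maths given in the paper:

```python
def admit_query(est_new_wall_s, est_remaining_s_of_running):
    """Decide whether to run a queued query now.  Assumes a perfectly
    linear slowdown based only on the number of running queries."""
    n = len(est_remaining_s_of_running)
    if n == 0:
        return True
    # Going from n to n+1 concurrent queries stretches each remaining run
    # time by a factor of (n + 1) / n under the linear assumption.
    total_slowdown = sum(t * ((n + 1) / n - 1.0) for t in est_remaining_s_of_running)
    # Run the query if its own estimated run time is less than the total
    # slowdown it imposes on everyone else - the rule as I've stated it above.
    return est_new_wall_s < total_slowdown
```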
I think, but I’m not sure, that AutoWLM will create a new slot, if it has to, to run the query.
I suspect creating a new slot is difficult. We’d expect all memory to already be allocated - why would you have it lying around, doing nothing? So you have to reduce the amount of memory allocated to existing slots - which are busy running queries. Awkward.
I am wondering if this is why per-stream compilation was introduced a few years back - consider, imagine you have compilation as it was, all up-front when the query starts. You know how much memory is available in the slot you will run in, so you decide to run certain steps in memory, because you know it will be okay. Then, while you’re running, AutoWLM comes along and takes half your memory away. This is a problem - you have compiled your binaries already, so they will run in memory, only now they will spill to disk and hammer the cluster. But if you have compilation on a per-stream basis, AutoWLM can take memory away when the stream begins, and compilation then occurs, and you get segments which are appropriate to the memory situation at that point.
I think no accounting is made of the slow-down that can be caused by reducing memory in existing slots, and I also think, having forensically examined a large cluster with AutoWLM, there is a hard-coded limit on the maximum number of slots (which on the cluster I examined was 28).
AutoWLM also has a method in play to reduce the number of slots.
What’s done, sayeth the white paper, is the same throughput calculation as for whether or not to make a new slot, but now on the basis of whether throughput would be improved by deleting a slot; and AutoWLM here is not using the estimated times for the queries, but their actual times, since the queries have actually completed.
So I kinda have the impression of a window of completed queries, and you’re working back from those to figure out how many slots you should have had to run those queries optimally?
The white paper says that since the math to decide whether or not to make a new slot uses estimates, and the math to remove a slot uses actual times, you can find the method using estimates says throughput will increase by running the query, while the method using completed times says throughput will increase by not running the query (by deleting the slot).
When this happens, AutoWLM always deletes the slot.
So I think/guess that when a slot becomes free, both methods run, and it is when and only when both running the query and keeping the slot increase throughput that the query gets run - but I am really not clear about this.
In any event, we do have what is really central, which is the basic operating concept of AutoWLM : estimates are made for each query, and these form the basis of computing throughput, and decisions about whether to run a query or not are based on maximizing throughput.
So that’s what’s in the white paper about AutoWLM.
AutoWLM (part 2 - CSC)
There are two very interesting numbers published in the paper.
First, a queue must have queries queuing for 60 seconds (hard coded value) for CSC to be invoked.
Remember here that we can have multiple queues, and still turn on AutoWLM. CSC enable/disable is on a per-queue basis, but I think CSC is treating all queues equally - it consumes based on AutoWLM priorities.
One question is whether or not AutoWLM will route a query from a CSC-disabled queue to the CSC cluster of a CSC-enabled queue. You’d think not, but this is definitely not something to assume.
Second, a CSC cluster remains in being for four minutes (hard coded value) after it goes idle - but the paper says the user is being charged for this linger time. The official docs say you are charged only for query run time.
(I am led to understand both numbers, 60s and 4 minutes, can be modified on a per-cluster basis by Support.)
CSC looks to be independent of AutoWLM. It will be invoked according to the 60 second rule above, and once it is invoked, then AutoWLM has another cluster it can send queries to for execution; but it’s the 60 second rule which invokes the cluster, not AutoWLM.
The way the white paper writes of CSC seems to me to say that AutoWLM on the main cluster controls query placement into slots on the CSC clusters; which is to say, we do not have an independent AutoWLM on each CSC, doing its thing with whatever queries come its way.
Query assignment to slots is done main cluster first, then CSC clusters in order oldest to newest, so we can try to retire the newest clusters ASAP, to reduce costs (but remember here CSC is an expensive band-aid - if you need to use it, your use case is fundamentally inappropriate for Redshift; if you’re using Redshift correctly, you’ll get so much performance from sorting you’ll never need CSC).
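A little sketch of the CSC behaviour as described - the 60 second invocation rule, oldest-first placement, and the four minute linger. The slot-availability inputs are simplifications of mine.

```python
import time

QUEUE_WAIT_THRESHOLD_S = 60      # hard-coded, per the white paper
IDLE_LINGER_S = 4 * 60           # CSC cluster lingers four minutes after going idle

csc_clusters = []                # ordered oldest to newest

def maybe_spin_up_csc(oldest_queue_wait_s):
    """A CSC cluster is invoked once queries have queued for 60 seconds."""
    if oldest_queue_wait_s >= QUEUE_WAIT_THRESHOLD_S:
        csc_clusters.append({"idle_since": None})

def place_query(main_has_free_slot, csc_free_slots):
    """Main cluster first, then CSC clusters oldest to newest, so the
    newest clusters can be retired as soon as possible."""
    if main_has_free_slot:
        return "main"
    for i, has_free_slot in enumerate(csc_free_slots):
        if has_free_slot:
            return f"csc-{i}"
    return "keep queuing"

def retire_idle_clusters():
    now = time.time()
    csc_clusters[:] = [c for c in csc_clusters
                       if c["idle_since"] is None or now - c["idle_since"] < IDLE_LINGER_S]
```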
AutoWLM (part 3 - SQA)
Again, no commentary here, just how it works. Commentary later.
So I necessarily wrote something about SQA in the main AutoWLM post, because SQA is tied into AutoWLM; when AutoWLM thinks a query is fast and low resource, the query is routed to SQA.
I think then that if you have AutoWLM off and SQA on, in fact you do have AutoWLM on - it’s just that AutoWLM is being used only to decide which queries go to SQA.
I will for convenience re-iterate here the key point about how SQA works; a query must be considered “fast and low resource” to go to SQA, and the definition of “fast” is adjusted once per week and is the 70th percentile of execution times of all queries seen that week (which seems heroic to me - if you go wrong, you go wrong for a week at a time).
I suspect “weekly” actually means “on cluster reboot”.
The definition of “low resource” I think means “will run a light slot” (remembering there seems to be two slot types with AutoWLM, “light” and “heavy”).
So, on to the next stuff about SQA.
When I look at a cluster, SQA normally is a single queue with one slot, using 4% of memory. The white paper says if there are a lot of short queries queued, and a lot of long queries running, AutoWLM takes resources (memory, presumably) away from long running queries (which I touched upon in the main AutoWLM post - I guess memory can be taken from a query in a slot only at the beginning of a stream) and puts them into SQA, to clear the short query backlog. I guess that means more slots. I have seen SQA on busy systems with multiple slots, which by the sounds of it was this.
The white paper indicates there is a limit (unspecified) on the number of queries which can use SQA at any one time. The concern the paper articulates is that AutoWLM incorrectly classifies a bunch of queries as suitable for SQA, runs them in SQA, and it turns out they then all spill to disk - bingo, cluster grinds to a halt.
If the definition of short for the current week is over 20 seconds (hard-coded value) then SQA is split into two, a super-short SQA (0-5 seconds) and a normal SQA (5-20 seconds). I’m guessing the 20 seconds time is hard-coded as the trigger for splitting SQA, but the upper bound of the “normal” SQA is actually the weekly “short” time. On the face of it, this mode lasts until the next adjustment of the short interval.
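Here is how I read the split rule - the 20 second trigger and the 5 second super-short bound are the hard-coded values, and the guess that the normal queue’s upper bound tracks the weekly cut-off is mine:

```python
SPLIT_TRIGGER_S = 20      # hard-coded trigger for splitting SQA
SUPER_SHORT_MAX_S = 5     # super-short SQA upper bound

def sqa_queues(weekly_short_cutoff_s):
    """Return the SQA duration bands for this week."""
    if weekly_short_cutoff_s > SPLIT_TRIGGER_S:
        return [(0, SUPER_SHORT_MAX_S), (SUPER_SHORT_MAX_S, weekly_short_cutoff_s)]
    return [(0, weekly_short_cutoff_s)]
```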
AutoWLM and SQA are too complicated to be reliable or predictable, and so cannot be used in production systems, and this is wholly aside from the design being to my eye fundamentally questionable in the first place.
So my basic problem with all black boxes, regardless of anything, is this : you cannot knowingly correctly design a system when it contains a black box.
Something going wrong? who knows - is it you, is it the black box?
You can’t tell.
You can skip doing design work by using black boxes (auto-sort, auto-dist, auto-compression, AutoWLM), but then you don’t know what system you’re running, and then you wake up one morning and it’s going wrong.
Where do you even begin to figure out what to do?
You cannot run business critical production systems in this way.
You also have no idea of the inherent behaviours of the black box - for example, AutoWLM being hostile to COPY because it has no way to estimate the cost of the query, and so assuming every COPY command will take the 95th percentile duration of all COPY commands; or that SQA decides what “short” means once a week (or maybe per maintenance reboot).
Remember this is with AWS, whose idea of documentation is “Redshift is amazing and can do everything!”, and who do not tell you critical facts on the scale of importance of “cross-database queries make a copy of the remote table into the local database and rewrite the query, so they’re MUCH MUCH MUCH more expensive than normal queries, what, wait, you weren’t expecting that? we should have mentioned it? you’re surprised?? but why?”
You also have no idea when the black boxes change, and when they do change, that is potentially existential risk to your system.
Consider here that I think the devs lack real-world experience; they’re academics, and they make complex, interesting systems, but systems which are wholly unfit for actual use in the complexities of the real world.
I think there are a lot of problems in Redshift, and you hear about almost none of them. Major issues and fixes do not turn up in the release notes. Even the release notes are basically marketing material.
Now on to AutoWLM itself.
I fundamentally question the basic premise of forming estimates from query plans. Query plans on Redshift are no good. There are two reasons; firstly, query plans elide too much information. You can go from what really happens to a query plan, but you can’t go from a query plan to what really happened. Secondly, query plan costs on Redshift are a total nonsense. On Postgres, query plan costs are best effort estimates of what something really costs. On Redshift, query plan costs are arbitrary values. For example, reading one row from a table has a cost of 1, but a sort has a cost of 100 million, plus one per row. Costs on Redshift are completely bonkers - and it is these costs which are being added up to estimate query resource use.
The authors state they know the costs are bonkers (and they basically confirm my suspicion, that this was done to brutally force query planning behaviour) and they state they also understand that as the number of joins increases, query plan estimates anyway become meaningless - but they state that since the query plans are consistent, the ML model will be able to make sound correlations between query plans and resource usage.
I don’t buy this.
In Redshift, there is no meaningful correlation between query plan costs and actual costs.
The ML model works by taking query plan costs and associating them with actual costs, so that when we come to the ML model with a novel set of query plan costs, we get an estimate of actual costs.
If the query plan costs we have are not correlated with the actual costs, how can any estimate we make be sound?
So I think the estimate model is unsound.
There’s no way to check, because any stats AutoWLM produces are not available in the system tables to you or I.
In the white paper, the authors describe they found producing ML models over many or even a very large number of clusters worked substantially less well than producing an ML model for a single cluster only.
What this says to me, and to the authors, is that the workloads between clusters vary sufficiently that training models over these disparate workloads leads to poor estimates.
This then naturally leads to the thought that on a single cluster the workloads can vary enough that the same problem occurs.
But you cannot reduce AutoWLM’s ML model approach to something below that of a single cluster because doing so would require distinguishing between the different workloads and in an automated fashion (this is AutoWLM after all - humans are not involved) and I don’t think that can be done.
So either you go “well a single cluster is the least worst we can do, ship it” or you say “this doesn’t work”.
I think given the impact of Snowflake on Redshift, management/marketing required something to be shipped. In the short-term they can wave that round and say “Redshift can now do this”, and that’s what they do, but in the longer term, this is harmful, because AutoWLM in fact leads to poor user experiences, and so to poor reputation, and so to people leaving Redshift.
AutoWLM is complicated. Weighted priority levels, estimates, light and heavy slots, query eviction, throughput maximization based on (what I think are spurious) estimates, slot reduction based on real costs, interaction with CSC, interaction with SQA.
I think there are going to be all sorts of unexpected failure modes.
The white paper talks about one particular failure mode: that if AutoWLM is running “medium duration light queries” and the query flow rapidly switches to “medium duration heavy queries”, the system goes into “live-lock” (which I take to mean it stops running queries - AutoWLM can’t get out of the situation with the methods it currently has).
The paper does not exactly explain why - whether it’s a query duration issue (so the throughput math stops new queries running?) or a resource issue (light vs heavy, wrong slot arrangement?), or some combination of both.
The paper goes on to say there is a set of three heuristics to detect this (let’s hope they work, and also that there are no other unexpected ways in which to get into a problematic state) and when they all trigger, AutoWLM then goes into “emergency mode” (their words), with a hardcoded, fixed number of slots, and runs purely shortest-job-first, until one or more of the heuristics stop being triggered.
Real life is complicated. AutoWLM is complicated. This is not going to lead to a reliable system.
You need to minimize complexity, not add complexity and put it in a black box where you can’t see it.
There seems to me to be a fundamental design omission in AutoWLM, in that there is no way to express to AutoWLM that a query has no priority at all. No way to say “run these only if you actually have spare capacity - best effort only”.
Example - client had three queues, CSC on two, AutoWLM active. Flood of queries came into the queue with no CSC. AutoWLM tries to maximize throughput - it ends up running a ton of queries from that queue. What we then see happening is the other queues now are very slow in getting their queries to run, and so CSC is kicking in, and now we have the two CSC-enabled queues running CSC clusters (at megabucks prices) so AutoWLM can do maximum throughput on a queue which has absolutely zero priority because those queries can take as long as they want.
There’s no way to say “service this query only if you have idle capacity”.
Now I know what priorities actually do, I could have set the non-CSC queue to lowest priority. That would give the other queues (on normal) 3x the weighting. When the non-CSC queue gets 2k queries, would it have solved the problem? I’m not sure. Maybe? But the thought crosses my mind that if one of the CSC queues did get to the point where there was a 60 second delay, CSC would have come up, and then AutoWLM would keep CSC busy by using it - AutoWLM would see there were queries in the CSC queue, they have normal priority, so they would get serviced immediately by putting them on CSC (given the main cluster at any given moment is full with low priority queries, as there are so many of them).
It seems hard to reason about.
With Manual WLM I can make a queue for queries like this, and I don’t care how full it becomes. The queue has one or two slots, say, so I know it’ll never hammer the system. I can queue hundreds, thousands of queries into this queue, you get a huge spike from some user abusing an API or something. This, I can reason about.
Query eviction based on priorities is a thing, and it can be deactivated. So your system can suddenly start behaving quite differently, for no apparent reason.
Overall, my sense of AutoWLM is that it’s coming out with whatever, and the reason it all seems to work is simply because Redshift is sitting there grinding through queries with a ton of hardware.
This works until it doesn’t, and when it doesn’t, you’re in trouble.
In any data system, there is data. When the data is small, the power of the hardware overwhelms the data and you can do anything you like - get it all completely wrong - and it just doesn’t matter. It all works in a second or two.
As data increases, you then discover performance is a problem - this is because you’re operating Redshift incorrectly, using all the auto stuff, and so Redshift is inefficient.
The only thing you can do is buy more hardware. You can’t make Redshift work correctly because you can’t control the black boxes; the only thing you can do is turn them off, and if you do that, you then need suddenly to do all the design work that was not done because the black boxes were used.
You now have to knowingly correctly design all your sorting orders and distributions and queues and so on, and at that point you discover that in fact your very use cases do not permit correct operation of Redshift and so it was the wrong database choice; or, you are one of the very few who do have valid use cases, and then you discover you need to re-engineer your system from the ground up, because sort and dist and compression choices and data loading strategy and VACUUM strategy cannot be put into database table design after the fact. They are the basis for database table design.
Redshift applies a lot of pressure into the data system, and into the systems around Redshift.
So you buy more hardware, because what else can you do?
So now you’ve masked the inefficiency, but your data is growing.
Buying more hardware is expensive, and it also does not scale everything (there’s still only one leader node, commits are expensive, Redshift is complex and has a number of novel and unexpected failure modes, I’m of the view all the engineering work for many years is not Big Data capable, so you can’t use it - but you won’t know that and will use it anyway, and so that will give you problems as well, and so on).
Eventually in any event you reach the point where the data is large enough that buying more hardware does not work. At this point, the data has overwhelmed the hardware and now the only way to have timely SQL is to correctly operate Redshift, which is to say, to correctly utilize sorting, as correctly operated sorting provides mind-bendingly enormous performance (and “for free”, in the sense that no hardware is required).
But you are now even deeper in a hole than before; you have this large scale, vital business system, which is totally wrong for correct operation of Redshift, running on Redshift, and you cannot now buy more hardware to get out of this problem.
During this process you will have been working with AWS Support, who present all users with the same list of about ten things to do, nine of which I tell users never to do.
You’ll be told to turn on AutoWLM, CSC, use data sharing, and so on.
It’s like advising a man with a broken ankle to buy new trainers. It does not address the fundamental problem. AWS never seem in my experience to actually learn about the client system, and you have to do that to resolve problems, and the problem normally is Redshift is the wrong choice and AWS will never, ever say that to you.
So eventually you hit the wall, and have to migrate to another platform, and at great cost and expense head off to BQ or Snowflake.
The other outcome is that your data is not big, it is small and remains small, but then the question is why are you on Redshift? the opportunity cost is high - for small data there are far, far better databases out there.
It was interesting to get the numbers and the explanation for what causes CSC to spin up, and how long the clusters stick around for.
AutoWLM does not control CSC, although the white paper doesn’t seem to me to be clear about this.
It seems to me though AutoWLM priorities can very much lead CSC to be invoked.
Imagine you have a flow of queries with lowest priority - it’s the best you can do to indicate “best effort only, no rush at all”.
You also have a regular flow of normal queries.
Normal queries are selected with a 3x weight, so lowest priority queries begin to queue up… and they queue for more than 60 seconds, and if you have CSC on that queue, you now have a CSC cluster.
You really don’t want a per-second priced CSC cluster spun up to run your zero priority queries.
I bet you were not expecting that what’s going on under the hood is that your lowest priority queries are invoking a CSC cluster as a result of the fact that they are lowest priority. After all, you were trying to tell AutoWLM those queries do not matter.
(Which you can of course fix by disabling CSC on that queue - but my point here is unintended consequences.)
Complex system, many moving parts, unintended consequences.
I disapprove of CSC in general, as I regard it as an expensive band-aid for an incorrectly operated Redshift cluster. If you are operating Redshift correctly, you do not need CSC. If you are operating Redshift incorrectly, you need to operate it correctly. If you cannot do this, for whatever reason, you need to stop using Redshift.
My starting position on SQA is that it is not necessary - you should have arranged your WLM queues correctly for your workload and by that be handling short queries already.
I observe also that SQA, where it starts and then can cancel a query, makes a mess in the system tables which log queries. It’s hard enough to figure out what’s going on without needing this in there as well.
The fact the SQA cut-off is 70th percentile reset per week (or maintenance reboot) is just too arbitrary. The authors themselves say “won’t work for everyone”. Yes. Sucks to be that person, perhaps this information should be - you know - given to users.
The 70th percentile is also a problem in the way a mean can be a problem; say you have two distinct groups of queries, one fast, one slow. The mean is in the middle, and is incorrect for both groups.
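A toy numeric illustration of the point - two distinct groups of queries, and the 70th percentile describes neither of them (the numbers are obviously made up):

```python
import numpy as np

rng = np.random.default_rng(0)
fast = rng.normal(2, 0.5, 600).clip(min=0.1)     # a fast group around 2 seconds
slow = rng.normal(60, 10, 400).clip(min=1)       # a slow group around 60 seconds
times = np.concatenate([fast, slow])

# The cut-off lands inside the slow group, so a big chunk of genuinely
# slow queries count as "short" - wrong for both groups.
print(np.percentile(times, 70))
```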
The splitting of SQA into super-short SQA and normal SQA, and the hardcoded values which cause this to happen - it’s so arbitrary, and impossible to reason about in system design.
We also have the case where AutoWLM thinks we need a lot of SQA, and so takes memory from long-running slots to make more SQA slots. We can then end up with a lot of SQA slots, and now AutoWLM can mis-classify a bunch of queries, we run them in SQA, and now we’re hammering the system.
I think it entirely improper that this risk exists, and that users have no idea of the risks they are incurring when they turn on SQA.
My sense of SQA is like AutoWLM. I don’t think it works well, and the reason it seems to people to work is simply because Redshift is sitting there grinding through queries no matter what SQA/AutoWLM do wrong.
Users have no idea how inefficiently or improperly Redshift is being operated by these black boxes. This information I think is in the system tables, but in tables we are not allowed to access.
2026-01-14
Bit of Kyiv Traffic
Defensive gunfire just kicked off now, and some bangs, you get bangs when incoming are blown up in the sky. No booms, booms are when the incoming hits the ground.
Lasted for less than a minute.
I didn’t notice the air raid siren, but I’m pretty sure it will have happened. Maybe hours ago, when I was asleep still.
All the cars and stuff outside just rolling on. I mean, it seems to me - what could you do anyway if you’re in a car? pull over? then find a shelter using an app? and by the time you’ve done anything it’s all over anyway.
2026-01-16
Heartfelt
Snow night silence
Black figures, black shadows
Slow cars rolling
Distant underground thunder
Curfew in two
Action in four
Siren
German autocannon
Bangs in the sky
Booms on the ground
Did someone die?
To wish heartfelt the safety of unknown strangers
This is an experience of war
Life in Kyiv
I’ve just been told by my brother that the media are painting a dim and dark view of life in Kyiv right now.
I thought I’d write a bit about it, to give a non-media view on things.
So normally we have scheduled power outages, which when it’s really good means nothing at all; when it’s good, four hours out per day; normal is two lots of four hours; bad is two lots of eight hours without.
Right now, emergency power cuts are in play. As such, we basically have power overnight - from about 9pm ish to about 9am ish. DTEK is busy rebuilding whatever it is the Russians hit.
So you have time for some cooking in the evening and morning, and plenty of time to charge everything up.
I currently have enough power in the laptop and power bank to last me until about 6pm, and then I head out to cafe. I could buy another power bank, but I like the cafe. Meet people, chat, drink hot chocolate, charge everything up.
Temperature is about -14C normally. If it’s clear overnight, which is pretty rare - cloudy place in winter - it’ll go down to -20C. The temperature isn’t a thing. It’s background. We’re all used to it, have the right clothes, etc. If you want to stay warm indoors and there’s no power, it’s fine, your place is basically at about 10C normally. It’s not hard to keep warm when you have clothes for -20C.
All the shops, supermarkets, etc, are fine. Everyone’s had generators for years now.
So that’s about it. It’s nice to have some power during the day, it’s convenient, but that’s about it. Temperature isn’t a thing.
2026-01-19
Autonomics
AWS have released a new feature for Redshift.
Cue everyone looking worried, and cue wondering what BS their marketing will come up with for this one.
It’s called “autonomics”.
Here’s the root doc page; https://docs.aws.amazon.com/redshift/latest/dg/c_autonomics.html
Long and short of it is that it seems to be an amped-up version of the existing auto-systems - auto-dist, auto-sort, auto-analyze, auto-vacuum. It also seems to amp up MV refresh, which is a scary thought (MV refresh is non-trivial, and if you have MVs on top of MVs, you often get a cascading refresh).
You can avoid this by not using the auto-systems. Don’t use auto dist or sort, turn off auto-analyze, and don’t use MVs (you should never use MVs anyways). You can’t turn off auto-vacuum, very unfortunately.
One odd thing in the docs is this;
If extra compute resources for autonomics are disabled, Amazon Redshift temporarily suspends autonomics operations during periods of high system load in order to minimize impact on concurrent workloads, until there are enough resources to run them without negatively impacting user queries, potentially impacting performances.
So if I disable it… …it’s still active?
I always find the docs are very loosely written. I think what this might mean is that autonomics are always on (but you can avoid them by not using the auto-stuff), but if you disable them, then they will not use additional (and billable) resources, which seems to mean and only mean CSC clusters.
(We then run into a bunch of AWS double-speak, where they refer to CSC for Provisioned using different code-words/terms than CSC on Serverless.)
By default, Amazon Redshift will not bill you for resources used for autonomics. If you choose to allocate extra compute resources for more consistent autonomics, Amazon Redshift will bill you only for autonomics operations that actually use additional resources, such as concurrency-scaling clusters (or additional RPUs), which are allocated only when main cluster or base RPUs are fully utilized running user workload.
1 RPU is 1 slice; when the main cluster is full (for both Provisioned and Serverless, since it’s a normal cluster in both cases) you’ll get a CSC cluster (which is called “additional RPUs” for Serverless, as if the main cluster were scaling, which it is not).
2026-01-22
Signal
The Signal messaging app is going downhill.
Latest problem - they are now asking for feedback after every damn call.
I get this from Teams, and it annoys the hell out of me.
I want to just make a call.
That’s it. Simple. I call, we talk, done.
I do NOT want another task, every time I make or take a call.
This seems to me to be an example of “popupitus”.
You all know it - visit a web-site, and the very first thing you get are multiple popups, which you have to deal with, to get to the site.
Actually, relating to this, I was called on Signal today, which almost never happens.
I found I could not accept the call.
The phone was ringing, but there was nothing to press to accept.
Eventually the other end hung up.
After some experimenting, what you have to do is this : go to the events screen (hit power button to do this), then on that page there is a tile for Signal, which has two entries - incoming call, and that Signal has background connection.
Tap that, and it will open, and it opens out into two buttons, ‘accept’ / ‘hangup’.
Signal has developed a few other UI problems over the last year or so, as well, in particular with playback of audio messages.
Signal was great until about two or three years ago, when they got a bunch of investment.
Since then, downhill.
Sooner or later, Signal will become unusable.
I’ve made Briar and Element accounts, to be ready for when it happens.
pg_last_query_id()
There’s a function in Redshift, pg_last_query_id(),
which returns query ID of most recent query in current session.
So, two problems.
If the query you issued is leader-node, there’s no query ID. You’ll get a query ID - the most recent worker node query - but that’s not what you expected. Not documented of course.
For some queries, such as create temp table as, Redshift automatically runs another query after your query. In the case of create temp table as, there’s an automatic analyze on the table after its creation. When you use pg_last_query_id() after create temp table as, you do not get the query ID of your query, you get the query ID of the analyze. Not documented, of course.
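As a quick demo of the pitfall - assuming a psycopg2 connection to hand (the connection string is just a placeholder):

```python
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=... port=5439")
conn.autocommit = True
cur = conn.cursor()

# create temp table ... as triggers an automatic analyze afterwards, so
# the "most recent query" is the analyze, not the CTAS itself.
cur.execute("create temp table t1 as select 1 as c1;")
cur.execute("select pg_last_query_id();")
print(cur.fetchone()[0])   # expect the analyze's query ID, not the CTAS's
```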
2026-01-23
Logging Queries in Transactions
Looks like worker node queries issued in a transaction are not logged in STL_WLM_QUERY until the transaction commits.
No other observations are implied - I’ve not looked at leader node queries, for example.
2026-01-28
VACUUM FULL
VACUUM FULL on its own vacuums all tables - but with the default 95% threshold.
VACUUM FULL TO 100 PERCENT gives syntax error.
I am on a cluster brought up from snapshot for investigation.
I now want to vacuum all tables, and to 100 percent.
I must now write a Python script, to do so.
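Something like this is what I mean - a minimal sketch (psycopg2, placeholder connection details), getting the table list from svv_table_info and vacuuming each table to 100 percent; note VACUUM cannot run inside a transaction block, hence autocommit.

```python
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=... port=5439")
conn.autocommit = True    # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# "schema" and "table" are reserved words, hence the quoting.
cur.execute('select "schema", "table" from svv_table_info order by 1, 2;')
tables = cur.fetchall()

for schema, table in tables:
    sql = f'vacuum full "{schema}"."{table}" to 100 percent;'
    print(sql)
    cur.execute(sql)
```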