AWS Redshift — Best Practices


Marko Zarkovic


"Redshift is slow today, I am going to use a different service!"
DBAs have heard this sentence many times.
The main reason is usually not Redshift itself, but that we as users are not as proficient as we would hope. We are used to writing DDL/DML for traditional row-oriented relational databases, while Redshift is a columnar database better suited for analytics, and therefore a more appropriate platform for a data warehouse.
This article aims to help you understand the full capability of the compute nodes and to optimise your tables to leverage the highly parallel nature of Amazon Redshift. The guidelines below can help you use the full power of AWS Redshift.

What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

In the Redshift cluster, we have a single leader node and the rest are compute nodes. The leader node distributes SQL to the compute nodes when a query references user-created tables or system tables (tables with an STL or STV prefix and system views with an SVL or SVV prefix). A query that references only catalog tables (tables with a PG prefix, such as PG_TABLE_DEF) or that does not reference any tables, runs exclusively on the leader node.

AWS Redshift Cluster example

Query performance guidelines:

  • Avoid using select *. Include only the columns you specifically need.
  • Use a CASE Expression to perform complex aggregations instead of selecting from the same table multiple times.
  • Don’t use cross-joins unless absolutely necessary. These joins without a join condition result in the Cartesian product of two tables. Cross-joins are typically executed as nested-loop joins, which are the slowest of the possible join types.
  • Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).
  • Use predicates to restrict the dataset as much as possible.
  • In the predicate, use the least expensive operators that you can. Comparison Condition operators are preferable to LIKE operators. LIKE operators are still preferable to SIMILAR TO or POSIX Operators.
  • Avoid using functions in query predicates. Using them can drive up the cost of the query by requiring large numbers of rows to resolve the intermediate steps of the query.
  • If possible, use a WHERE clause to restrict the dataset. The query planner can then use row order to help determine which records match the criteria, so it can skip scanning large numbers of disk blocks. Without this, the query execution engine must scan participating columns entirely.
  • Add predicates to filter tables that participate in joins, even if the predicates apply the same filters. The query returns the same result set, but Amazon Redshift is able to filter the join tables before the scan step and can then efficiently skip scanning blocks from those tables. Redundant filters aren’t needed if you filter on a column that’s used in the join condition.
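The CASE-expression advice above can be sketched with a hypothetical sales table: instead of scanning the table once per condition, a single pass computes all the aggregates, and a restrictive predicate limits the scan.

```sql
-- Hypothetical "sales" table, used only for illustration.
-- Instead of three separate scans:
--   SELECT SUM(amount) FROM sales WHERE region = 'EU';
--   SELECT SUM(amount) FROM sales WHERE region = 'US';
-- one scan with CASE expressions does the same work:
SELECT
    SUM(CASE WHEN region = 'EU' THEN amount ELSE 0 END) AS eu_total,
    SUM(CASE WHEN region = 'US' THEN amount ELSE 0 END) AS us_total,
    SUM(amount) AS grand_total
FROM sales
WHERE sale_date >= '2023-01-01';  -- predicate restricts the dataset
```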

Redshift Distribution Key

Redshift Distribution Keys (DIST Keys) determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node.

When a query joins or aggregates data, rows must be brought together on the compute nodes at execution time. This redistribution of data can involve shuffling entire tables across all the nodes.

Uneven distribution of data across compute nodes leads to skew in the work each node has to do, and you don't want under-utilised compute nodes, so the distribution of the data should be uniform. Distribution is defined per table, so you can select a different distribution style for each of the tables in your database.

Types of Distribution Styles

Amazon Redshift supports four kinds of table distribution styles.

  • Auto (The default distribution style. Redshift starts the table with ALL, then switches to EVEN as the table grows larger)
  • All (A copy of the table is maintained on every compute node, at the cost of more space utilisation. Since all the nodes have a local copy of the data, queries do not need to copy data across the network, which results in faster query operations. The downside is that the copy on every node takes up considerable space and increases the time the COPY command takes to load data into Redshift)
  • Even (The leader node distributes the rows of the table evenly across all slices, using a round-robin approach)
  • Key (Data is distributed across slices according to the values of a designated column, so all rows with the same value in that column end up on the same slice)
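A table's distribution style is declared in its DDL. The tables below are hypothetical, but the syntax is standard Redshift:

```sql
-- Small, rarely-changing dimension table: replicate to every node.
CREATE TABLE dim_country (
    country_id   INTEGER NOT NULL,
    country_name VARCHAR(100)
) DISTSTYLE ALL;

-- Large fact table: distribute on the join column.
CREATE TABLE fact_orders (
    order_id    BIGINT NOT NULL,
    customer_id INTEGER NOT NULL,
    amount      DECIMAL(12,2)
) DISTSTYLE KEY DISTKEY (customer_id);

-- No obvious join column: spread rows round-robin.
CREATE TABLE staging_events (
    event_id BIGINT,
    payload  VARCHAR(1000)
) DISTSTYLE EVEN;
```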

Choosing the right Distribution Styles

The motive in selecting a table distribution style is to place the data where it needs to be before the query executes, minimizing redistribution at query time. Choosing the right KEY is not as straightforward as it may seem. In fact, setting the wrong DISTKEY can even worsen query performance.

As the DISTKEY, choose a column used in your queries that leads to the least skew. A good choice is a column with many distinct values, such as a timestamp. Avoid columns with few distinct values.

  • If the table is highly de-normalised and no JOIN is required, choose the EVEN style.
  • Choose ALL style for small tables that do not often change.
  • It is beneficial to select a KEY distribution if a table is used in JOINS. Also, consider the other joining tables and their distribution style.
  • If one particular node contains the skew data, the processing on this node will be slower. This results in a much longer total query processing time. This query under skewed configuration may take even longer than the query made against the table without a DISTKEY.
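The KEY-for-joins point can be sketched as follows (table and column names are hypothetical): when both tables are distributed on the join column, matching rows live on the same slice and the join needs no redistribution.

```sql
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INTEGER
) DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE customers (
    customer_id INTEGER,
    name        VARCHAR(100)
) DISTSTYLE KEY DISTKEY (customer_id);

-- Matching customer_id values sit on the same slice, so this
-- join should show DS_DIST_NONE in its EXPLAIN plan.
SELECT c.name, COUNT(*) AS order_count
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.name;
```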

Data Distribution Tips:

  • Consider using the ALL distribution style for all infrequently-modified small tables (~3 million rows or less)
  • Distribution keys should have high cardinality to avoid data skew and “hot” nodes.
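One way to spot skew and "hot" nodes is the SVV_TABLE_INFO system view, whose skew_rows column is the ratio of rows on the fullest slice to the emptiest:

```sql
-- Tables with skew_rows well above 1.0 have hot slices;
-- consider a higher-cardinality DISTKEY or DISTSTYLE EVEN.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC
LIMIT 20;
```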

Redshift Sort Key

A Redshift Sort Key determines the order in which rows in a table are stored. When sort keys are used properly, query performance improves because the query optimizer can skip large chunks of data during query processing, filtering out the majority of it. Less data to scan means a shorter processing time, thereby improving the query's performance.

Multiple columns can be defined as sort keys. Data stored in the table is sorted by these columns, and the query optimizer uses this sort order when determining optimal query plans.

Amazon Redshift supports two kinds of Sort Keys. Compound Sort Keys and Interleaved Sort Keys.

Compound Sort Key

These are made up of all the columns listed in the sort key definition when the table is created, in the order they are listed. Therefore, it is advisable to put the most frequently used column first in the list. COMPOUND is the default sort type. Compound sort keys can speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY.

For example, let’s create a table with two compound sort keys.

CREATE TABLE customer ( c_customer_id INTEGER NOT NULL, c_country_id INTEGER NOT NULL, c_name VARCHAR(100) NOT NULL) COMPOUND SORTKEY(c_customer_id, c_country_id);


Figure 1. Table compound sorted by columns — c_customer_id and c_country_id

You can see how data is stored in the table, it is sorted by the columns c_customer_id and c_country_id. Since the column c_customer_id is first in the list, the table is first sorted by c_customer_id and then by c_country_id.

Interleaved Sort Key

Interleaved sort gives equal weight to each column in the Redshift sort keys. As a result, it can significantly improve query performance where the query uses restrictive predicates (equality operator in WHERE clause) on secondary sort columns.

Adding rows to a sorted table that already contains data affects performance significantly. VACUUM and ANALYZE operations should be run regularly to re-sort the data and update the statistical metadata for the query planner. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.
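The maintenance just described is a pair of plain commands (the table name is hypothetical):

```sql
-- Re-sort rows and reclaim space after loads into a sorted table.
VACUUM FULL customer;

-- For interleaved sort keys, REINDEX also re-analyzes the
-- distribution of values in the sort key columns:
VACUUM REINDEX customer;

-- Refresh statistics so the planner sees the new layout.
ANALYZE customer;
```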


For example, let’s create a table with Interleaved Sort Keys.

CREATE TABLE customer (c_customer_id INTEGER NOT NULL, c_country_id INTEGER NOT NULL) INTERLEAVED SORTKEY (c_customer_id, c_country_id);


Figure 2. Data stored in an interleaved manner

As you can see, the first block stores the first two customer IDs along with the first two country IDs. Therefore, you only scan 2 blocks to return data for a given customer or a given country.

Query performance is much better for large tables using interleaved sorting. If the table contains 1M blocks (1 TB per column, at 1 MB per block) with an interleaved sort key on both customer ID and country ID, you scan roughly 1K blocks when you filter on a specific customer or country, a speedup of about 1000x compared to the unsorted case.

Choosing Sorting Keys

Selecting the right kind of sort key requires knowledge of your query patterns.

  • Use an Interleaved Sort Key when different queries filter on different sort columns with no single dominant column, or when WHERE clauses apply highly selective restrictive predicates to secondary sort columns. Interleaved sorting pays off mainly on huge tables; you can check table size by querying the STV_BLOCKLIST system table and looking for tables with a high number of 1 MB blocks per slice, distributed over all slices.
  • Use a Compound Sort Key when your queries consistently filter on a leading column, or when they include JOINs, GROUP BY, ORDER BY, or PARTITION BY, and for smaller tables. COMPOUND is the default.
  • Do not use an interleaved sort key on columns with monotonically increasing attributes, like an identity column, dates, or timestamps.
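The STV_BLOCKLIST check mentioned above might look like this (the LIMIT is arbitrary):

```sql
-- Count 1 MB blocks per table and slice; tables with many blocks
-- spread over all slices are candidates for interleaved sorting.
SELECT tbl, slice, COUNT(*) AS block_count
FROM stv_blocklist
GROUP BY tbl, slice
ORDER BY block_count DESC
LIMIT 20;
```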

Redshift Sorting: Zone Maps

  • Zone Maps are minimum-and-maximum values for each block of data
  • Zone Maps are stored in memory and automatically generated
  • Zone Maps allow Redshift to effectively prune blocks that cannot contain data needed for a given query
  • Minimises unnecessary I/O
  • Along with sort keys, zone maps play a crucial role in enabling range-restricted scans to prune blocks and reduce I/O

Compression Encodings

Advice → You have control over which compression encoding to use for each column when defining your DDL. A good rule of thumb is to pick ZSTD compression for most columns, though the leading sort key column is often best left RAW so that range-restricted scans stay cheap.

A compression encoding specifies the type of compression that is applied to a column of data values as rows are added to a table. If no compression is specified in a CREATE TABLE or ALTER TABLE statement, Amazon Redshift automatically assigns compression encoding as follows:

  • Columns that are defined as sort keys are assigned RAW compression.
  • Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
  • All other columns are assigned LZO compression.
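Encodings are set per column in the DDL, and the ANALYZE COMPRESSION command will suggest encodings based on a sample of existing data (the table below is hypothetical):

```sql
CREATE TABLE events (
    event_time TIMESTAMP NOT NULL ENCODE RAW,  -- leading sort key: leave RAW
    user_id    INTEGER   NOT NULL ENCODE ZSTD,
    event_type VARCHAR(50)        ENCODE ZSTD,
    payload    VARCHAR(1000)      ENCODE ZSTD
) SORTKEY (event_time);

-- Ask Redshift to recommend encodings from a data sample:
ANALYZE COMPRESSION events;
```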

Read about Compression encodings: https://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html

EXPLAIN command

Displays the execution plan for a query statement without running the query. Read explain-plans from the bottom-up.

EXPLAIN performance is sometimes influenced by the time it takes to create temporary tables. For example, a query that uses the common subexpression optimization requires temporary tables to be created and analyzed in order to return the EXPLAIN output. The query plan depends on the schema and statistics of the temporary tables. Therefore, the EXPLAIN command for this type of query might take longer to run than expected.

Syntax

EXPLAIN [ VERBOSE ] query

Parameters:

VERBOSE → Displays the full query plan instead of just a summary.

query → Query statement to explain. The query can be a SELECT, INSERT, CREATE TABLE AS, UPDATE, or DELETE statement.

The EXPLAIN command will fail if you use it for other SQL commands, such as data definition language (DDL) or database operations.

Results of the command can be:

DS_DIST_NONE — No redistribution is required because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.

DS_DIST_ALL_NONE — No redistribution is required because the inner join table used DISTSTYLE ALL. The entire table is located on every node.

DS_DIST_INNER — The inner table is redistributed.

DS_DIST_OUTER — The outer table is redistributed.

DS_BCAST_INNER — A copy of the entire inner table is broadcast to all the compute nodes.

DS_DIST_ALL_INNER — The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.

DS_DIST_BOTH — Both tables are redistributed.

DS_DIST_NONE and DS_DIST_ALL_NONE are good results. They indicate that no distribution was required for that step because all of the joins are collocated.
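Reading a plan might look like the sketch below; the query is hypothetical and the commented plan fragment is illustrative, not real EXPLAIN output.

```sql
EXPLAIN
SELECT c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.name;

-- An illustrative plan fragment, read from the bottom up:
--   XN HashAggregate
--     -> XN Hash Join DS_DIST_NONE      <- collocated join: good
--          -> XN Seq Scan on orders o
--          -> XN Hash
--               -> XN Seq Scan on customers c
-- DS_BCAST_INNER or DS_DIST_BOTH here would signal costly
-- redistribution and a candidate for a better DISTKEY.
```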


Read more about Explain command: https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html

Summary

  • Follow the query writing guidelines for a significant increase in performance.
  • Distribution keys and sort keys are your best friends when joining large tables.
  • Understand how Zone Maps work in Redshift to minimise I/O.
  • Use compression encodings every time you define your DDL.
  • The EXPLAIN command is very useful for checking join types and working toward merge joins.

I know that it takes some time and effort to go through all of these guidelines and implement them, but that time is well spent. Your Redshift cluster will be far more efficient, and you will notice the difference right away in query I/O.