Applying Encodings, SORT, & DIST Keys in Amazon Redshift

Important: The process we outline in this tutorial - which includes dropping tables - can lead to data corruption and other issues if done incorrectly. Proceed with caution or reach out to Stitch support if you have questions.

Want to improve your query performance? In this guide, we’ll walk you through how to use encoding, SORT, and DIST (distribution) keys to streamline query processing.

Note: Redshift supports automatic table optimization to improve querying performance by using ALTER statements and then defining automation for either DIST or SORT keys. This feature is not supported for encodings. For more information on how to set this up in Redshift, refer to the AWS documentation. If you prefer to apply SORT and DIST keys manually, continue using this guide.

Before we dive into their application, here’s a quick overview of each of these performance enhancing tools:

Encodings, or compression types, are used to reduce the amount of required storage space and the size of data that’s read from storage. This in turn can lead to a reduction in processing time for queries.
SORT keys determine the order in which rows in a table are stored. When properly applied, SORT Keys allow large chunks of data to be skipped during query processing. Less data to scan means a shorter processing time, thus improving the query’s performance.
Distribution, or DIST keys determine where data is stored in Redshift. When data is replicated into your data warehouse, it’s stored across the compute nodes that make up the cluster. If data is heavily skewed - meaning a large amount is placed on a single node - query performance will suffer. Even distribution prevents these bottlenecks by ensuring that nodes equally share the processing load.

Considerations

Optimizing for every single query isn’t possible. We suggest selecting the most important queries and selecting SORT/DIST keys that will improve the performance of those queries.
Columns with few unique values aren’t good SORT keys. Because SORT Keys store records together based on similar values, selecting a column with few unique values as the SORT key will heavily skew the data. This will lead to an increase in query processing time.
Tables using Full Table Replication aren’t good candidates for this process Due to the nature of Full Table Replication, encodings, SORT, and DIST keys in these tables may be overwritten during the replication attempts that follow application.

Step 1: Retrieve the table's schema

Note: The process outlined here can be used across the board to apply encodings and Keys.

We’ll use a table called orders, which is contained in the rep_sales schema.

To get started, log into your Redshift database using psql.

Use this command to retrieve the table schema, replacing rep_sales and orders with the names of your schema and table, respectively:

\d+ rep_sales.orders

For the rep_sales.orders table, the result looks like this:

| Column              | Data Type                  |
| --------------------+----------------------------|
| id [pk]             | BIGINT                     |
| rep_name            | VARCHAR(128)               |
| order_amount        | BIGINT                     |
| order_confirmed     | BOOLEAN                    |
| created_at          | TIMESTAMP                  |
| _sdc_sequence       | NUMERIC                    |
| _sdc_received_at    | TIMESTAMP WITHOUT TIMEZONE |
| _sdc_batched_at     | TIMESTAMP WITHOUT TIMEZONE |
| _sdc_table_version  | BIGINT                     |
| _sdc_replication_id | VARCHAR(128)               |

In this example, we’ll perform the following:

Apply SORT and DIST keys to the id column
Apply a bytedict encoding to the rep_name column

Step 2: Create a table copy and redefine the schema

In this step, you’ll create a copy of the table, redefine its structure to include the DIST and SORT Keys, insert/rename the table, and then drop the “old” table.

Step 2.1: Retrieve the table's Primary Key comment

In this step, you’ll retrieve the table’s Primary Key comment. This will be used in the next step to indicate which column(s) are the table’s Primary Keys.

Retrieve the table’s Primary Key using the following query:

SELECT description
  FROM pg_catalog.pg_description
 WHERE objoid = 'old_orders'::regclass;

The result will look like the following, where primary_keys is an array of strings referencing the columns used as the table’s Primary Key:

{"primary_keys":["id"]}

Important: Primary Key comments

Redshift doesn’t enforce the use of Primary Keys, but Stitch requires them to replicate data. In the following example, you’ll see COMMENT being used to note the table’s Primary Key. Make sure you include the Primary Key comment in the next step, as missing or incorrectly defined Primary Key comments will cause issues with data replication.

Step 2.2: Copy, redefine, insert, and drop the table

In this step, you’ll execute a transaction that will perform the following:

Renames the table
Creates a new table with a structure that includes the SORT and DIST keys
Copies the data from the old table and inserts it into the new, redefined table
Renames the new table
Drops the old table

For the rep_sales.orders example table, this is the transaction that will perform the actions listed above:

SET search_path to rep_sales;
BEGIN;
ALTER TABLE orders RENAME TO old_orders;
CREATE TABLE new_orders (
    id bigint,
    rep_name character varying(128) encode bytedict,     /* Sets the encoding */
    order_amount bigint,
    order_confirmed boolean,
    created_at timestamp without time zone,
    _sdc_sequence numeric(18,0),
    _sdc_received_at timestamp without time zone,
    _sdc_batched_at timestamp without time zone,
    _sdc_table_version bigint,
    _sdc_replication_id character varying(128)
)  distkey (id)     // Sets the DIST Key
   sortkey (id);     // Sets the SORT Key
INSERT INTO new_orders (SELECT * FROM old_orders);

COMMENT ON table new_orders IS '{"primary_keys":["XXXXX"]}';
                                                        /* Sets Primary Key comment */

ALTER TABLE new_orders RENAME TO orders;
ALTER TABLE orders OWNER TO <stitch_user>;      /* Grants table ownership to Stitch */
DROP TABLE old_orders;                        /* Drops the "old" table */
END;

Step 3: Verify the table owner

Stitch requires ownership of all integration tables to successfully load data. If Stitch isn’t the table owner, issues with data replication will occur.

To verify the table’s owner, run the following query and replace rep_sales and orders with the names of the schema and table, respectively:

SELECT schemaname,
       tablename,
       tableowner
  FROM pg_catalog.pg_tables
 WHERE schemaname = 'rep_sales'
   AND tablename = 'order'

If Stitch is not the owner of the table, run the following command:

ALTER TABLE <schema_name>.<table_name> OWNER TO <stitch_user>;

Step 4: Verify the encoding and key application

To verify that the changes were applied correctly, retrieve the table’s schema again using this command, replacing rep_sales and orders with the names of your schema and table, respectively:

\d+ rep_sales.orders

In this example, if the Keys and encodings were applied correctly, the response would look something like this:

| Column              | Data type                  | Encoding | Distkey | Sortkey |
|---------------------+----------------------------+----------+---------+---------|
| id                  | BIGINT                     | none     | true    | true    |  
| rep_name            | VARCHAR(128)               | bytedict | false   | false   |  
| order_amount        | BIGINT                     | none     | false   | false   |
| order_confirmed     | BOOLEAN                    | none     | false   | false   |
| created_at          | TIMESTAMP                  | none     | false   | false   |
| _sdc_sequence       | NUMERIC                    | none     | false   | false   |
| _sdc_received_at    | TIMESTAMP WITHOUT TIMEZONE | none     | false   | false   |
| _sdc_batched_at     | TIMESTAMP WITHOUT TIMEZONE | none     | false   | false   |
| _sdc_table_version  | BIGINT                     | none     | false   | false   |
| _sdc_replication_id | VARCHAR(128)               | none     | false   | false   |

For the id column, the Distkey and Sortkey is set to true, meaning that the keys were properly applied.

For rep_name, the Encoding is set to bytedict, indicating that the encoding was also properly applied.

Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.

Related	Troubleshooting
Choosing a Destination Destination & Integration Compatibility Loading Data into Your Destination Switching Destinations Nested Data Structures & Row Count Impact	AWS & Redshift I/O Connection Errors Redshift View Dependency Errors Redshift VACUUM Errors Destination Connection Errors