Apply table partitioning and clustering in BigQuery

Important: The process outlined in this tutorial - which includes dropping tables - can lead to data corruption and other issues if done incorrectly. Please proceed with caution or reach out to Stitch support if you have questions.

Want to improve your BigQuery performance and query costs? When Stitch loads data into BigQuery, tables are created without partitioning or clustering. However, you can apply these performance enhancement tools to your table to streamline query processing, which Stitch will respect on subsequent loads.

In this guide, we’ll walk you through how to add partitioning and clustering to a BigQuery table created by Stitch.

Overview

BigQuery’s table partitioning and clustering features can improve query performance and cost by structuring data to match common query patterns.

Learn more in BigQuery’s table partitioning documentation and clustering documentation.

Considerations

Before diving in, keep in mind that optimizing for every single query isn’t possible. Tables can only be partitioned by one field, which must be a timestamp or date column, and clustered by a single set of columns.

The ideal choice of partitioning and clustering column(s) depends on the nature of your data and queries.

Prerequisites

The required user permissions. The user performing this process must have the permissions outlined in BigQuery’s documentation.

Step 1: Sign into Stitch and the BigQuery Web UI

Sign into Stitch and the BigQuery Web UI to get started.

As an example, we’ll use a table called orders, which is contained in the rep_sales dataset.

Step 2: Pause Stitch loading

From Stitch, pause all source integrations that contain tables you plan to modify.

You will also need to ensure that Stitch doesn’t load any data while you are modifying the tables. To do this, monitor the Integration Details page for each paused integration until:

No Extractions are in progress, and
There are zero rows in Preparing

When the integrations meet these criteria, you can move onto the next step.

Step 3: Create a temporary table with partitioning and clustering

Next, you’ll create a temporary copy of the table with partitioning and clustering added on the created_at column.

Note: Your choice of partitioning and clustering column should be based on your data and the queries you want to optimize. You can even use different columns for partitioning and clustering. This is just an example.

Run the following from the BigQuery Web UI Query Editor:

CREATE TABLE rep_sales.orders_tmp
PARTITION BY DATE(created_at)
  CLUSTER BY created_at 
  AS 
    SELECT *
      FROM rep_sales.orders

Step 4: Drop the original table and rename the temporary table

To drop the original table, run the following from the BigQuery Web UI:
```
 DROP TABLE rep_sales.orders
```
You can also rename it and keep it as a backup.
To rename the temporary table, run the following from the BigQuery Web UI:
```
 ALTER TABLE rep_sales.orders_tmp
 RENAME TO orders
```

Step 5: Unpause Stitch integrations

Return to Stitch and unpause any integrations that you paused in Step 2.

Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.

Related	Troubleshooting
Choosing a Destination Destination & Integration Compatibility Loading Data into Your Destination Switching Destinations Understanding Loading Behavior Stitch's Impact on BigQuery Costs	Destination Connection Errors