Want to improve your BigQuery performance and query costs? When Stitch loads data into BigQuery, tables are created without partitioning or clustering. However, you can apply these performance enhancement tools to your table to streamline query processing, which Stitch will respect on subsequent loads.
In this guide, we’ll walk you through how to add partitioning and clustering to a BigQuery table created by Stitch.
Overview
BigQuery’s table partitioning and clustering features can improve query performance and cost by structuring data to match common query patterns.
Learn more in BigQuery’s table partitioning documentation and clustering documentation.
Considerations
Before diving in, keep in mind that optimizing for every single query isn’t possible. Tables can only be partitioned by one field, which must be a timestamp or date column, and clustered by a single set of columns.
The ideal choice of partitioning and clustering column(s) depends on the nature of your data and queries.
Prerequisites
The required user permissions. The user performing this process must have the permissions outlined in BigQuery’s documentation.
Step 1: Sign into Stitch and the BigQuery Web UI
Sign into Stitch and the BigQuery Web UI to get started.
As an example, we’ll use a table called orders
, which is contained in the rep_sales
dataset.
Step 2: Pause Stitch loading
From Stitch, pause all source integrations that contain tables you plan to modify.
You will also need to ensure that Stitch doesn’t load any data while you are modifying the tables. To do this, monitor the Integration Details page for each paused integration until:
- No Extractions are in progress, and
- There are zero rows in Preparing
When the integrations meet these criteria, you can move onto the next step.
Step 3: Create a temporary table with partitioning and clustering
Next, you’ll create a temporary copy of the table with partitioning and clustering added on the created_at
column.
Run the following from the BigQuery Web UI Query Editor:
CREATE TABLE rep_sales.orders_tmp
PARTITION BY DATE(created_at)
CLUSTER BY created_at
AS
SELECT *
FROM rep_sales.orders
Step 4: Drop the original table and rename the temporary table
-
To drop the original table, run the following from the BigQuery Web UI:
DROP TABLE rep_sales.orders
You can also rename it and keep it as a backup.
-
To rename the temporary table, run the following from the BigQuery Web UI:
ALTER TABLE rep_sales.orders_tmp RENAME TO orders
Step 5: Unpause Stitch integrations
Return to Stitch and unpause any integrations that you paused in Step 2.
Related | Troubleshooting |
Questions? Feedback?
Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.