Key-based Incremental Replication is a method of replication that replicates new or updated data from a data source. In this guide, we’ll cover:
- How it works (with examples),
- When it should be used,
- Limitations of this Replication Method, and
- How to enable it for your integration
How Key-based Incremental Replication works
Key-based Incremental Replication basics
When using Key-based Incremental Replication, Stitch uses a column called a Replication Key - a column in a source table - to identify new and updated data in a table for replication.
When Stitch replicates a table using Key-based Incremental Replication, a few things will happen:
- During a replication job, Stitch stores the maximum value of a table’s Replication Key column.
- During the next replication job, Stitch will compare saved value from the previous job to Replication Key column values in the source.
- Any rows in the table with a Replication Key greater than or equal to the stored value are replicated.
- Stitch stores the new maximum value from the table’s Replication Key column.
- Repeat.
Let’s use a SQL query as an example:
SELECT replication_key_column,
column_you_selected_1,
column_you_selected_2,
[...]
FROM schema.table
WHERE replication_key_column >= 'last_saved_maximum_value'
Supported Replication Key data types
Replication Key columns must be one of the following data types:
Data type | Available for | Notes |
DATETIME | All database integrations | |
INTEGER | All database integrations |
Includes |
TIMESTAMP | All database integrations | |
FLOAT | MongoDB v1+ database integrations | |
INT64 | MongoDB v1+ database integrations | |
NUMBER | Oracle v1+ database integrations | |
OBJECTID | MongoDB v1+ database integrations | |
UUID | MongoDB v1+ database integrations |
Replication Key type impact on data extraction
Below are examples of how different Replication Key types impact the data extracted using Key-based Incremental Replication.
Note: These examples only demonstrate how data is extracted from a data source, not how it will be loaded into your destination.
In this example, we’ll demonstrate how Key-based Incremental Replication works with a TIMESTAMP
(or DATE-TIME
) Replication Key column for initial and ongoing replication jobs.
Let’s assume the Replication Key column for the table in this example is a column named updated_at
.
Initial replication job
During the initial job:
-
In the source: A table in the source contains three records:
1
,2
,3
. Every time a record is created or modified, itsupdated_at
value is updated. -
During extraction: Depending on the integration type, the initial extraction for a table using Key-based Incremental Replication will vary slightly:
- For a database integration:
- Stitch replicates the entire table in full.
- Stitch saves
updated_at: 2018-07-27 12:07:00
as the maximum Replication Key value.
- For a SaaS integration:
- Stitch uses the Start Date defined in the Integration Settings as the initial Replication Key (
updated_at
) value. - Records with an
updated_at
value greater than or equal to the Start Date are replicated. - Stitch saves
updated_at: 2018-07-27 12:07:00
as the maximum Replication Key value.
- Stitch uses the Start Date defined in the Integration Settings as the initial Replication Key (
- For a database integration:
Source records
|
Extracted records
|
Second replication job
During the second job:
-
In the source: Before the next replication job, record
2
is updated and a new record,4
, is added to the table in the source. Theupdated_at
values for these records are updated. -
During extraction: Stitch replicates records
3
,2
, and4
, as the records’updated_at
values are either greater than or equal to the last saved maximum Replication Key value ofupdated_at: 2018-07-27 12:07:00
from the previous replication job.Record
3
is selected because itsupdated_at
value is equal to the maximum Replication Key value from the previous replication job.If this were a SQL query, it would look like this:
SELECT * FROM sample_table WHERE updated_at >= '2018-07-27 12:07:00'
Source records
|
Extracted records
|
In this example, we’ll demonstrate how Key-based Incremental Replication works with an auto-incrementing INTEGER
Replication Key column.
For this table, the Replication Key is a column named customer_id
. The value of this column increments every time a new record is added to the table, but it is never updated after the record is created.
Initial replication job
During the initial job:
-
In the source: A table in the source contains three records:
1
,2
,3
.The
customer_id
value is auto-incremented when a new record is added to the table. This column is not updated, even if changes are made to the record. -
During extraction:
- Stitch replicates the entire table in full.
- Stitch saves
customer_id: 3
as the maximum Replication Key value.
Source records
|
Extracted records
|
Second replication job
During the second job:
-
In the source: Before the next replication job, record
2
is updated and a new record,4
, is added to the table in the source. -
During extraction:
- Stitch replicates records
3
and4
, as the records’customer_id
values are either greater than or equal to the last saved maximum Replication Key value ofcustomer_id: 3
from the previous replication job. Note that:- Record
3
is selected because itscustomer_id
value is equal to the maximum Replication Key value from the previous replication job. - Record
2
is not selected because itscustomer_id
value wasn’t updated when the record was modified, and was therefore less than the last saved maximum Replication Key value.
- Record
- Stitch saves
customer_id: 4
as the new maximum Replication Key value.
If this were a SQL query, it would look like this:
SELECT * FROM sample_table WHERE customer_id >= 3
- Stitch replicates records
Source records
|
Extracted records
|
When Key-based Incremental Replication should be used
Aside from Log-based Replication where it’s supported, Key-based Incremental Replication is the most efficient method for replicating your data. If Log-based Replication is unavailable for your source, Key-based Incremental Replication may be a good fit if:
- A table contains a modification timestamp column, which is updated when the record changes
- Records aren’t hard deleted from the source table. Refer to the Limitations section below for more info.
Note: In the case of SaaS integrations, Stitch will use Key-based Incremental Replication whenever possible. Refer to the Schema section of any integration’s documentation for the Replication Method and Replication Key(s) used by specific tables.
Limitations of Key-based Incremental Replication
Before you select Key-based Incremental Replication as the Replication Method for a table, you should be aware of the limitations this method can have. Being aware of these limitations can help prevent data discrepancies and ensure your data is replicated in the most efficient manner possible.
The limitations of Key-based Incremental Replication are:
Limitation 1: Works best with a modification timestamp column
While an integer
column can be used as a Replication Key, Key-based Incremental Replication functions best with a modification timestamp Replication Key. Unlike an auto-incrementing integer, a modification timestamp allows Stitch to identify both new and updated records for replication.
Limitation 2: Hard deletes aren't captured
Hard deletes aren’t able to be replicated with Key-based Incremental Replication. This is due to the usage of Replication Keys to identify data for replication.
When a record is hard deleted, or entirely removed from a source, its Replication Key value is also removed. Without a Replication Key value to check, Stitch can’t identify the change and update the record in the destination. This means that the record will remain in the destination.
Refer to the Deleted Record Handling guide for a more detailed explanation and examples.
Limitation 3: Duplication in replication
Due to the inclusive nature of Replication Keys, there will be some duplication during the extraction process. This is because Stitch checks for values that are greater than or equal to the last saved maximum Replication Key value.
Because of this approach, the record or records with Replication Key values equal to the maximum value will be selected for extraction during subsequent jobs. Most of the time, the number of re-replicated rows will be small. If, however, a bulk update occurs and a large number of records all have the same Replication Key value, you could see a high amount of rows being replicated during every replication job until a greater Replication Key value is detected.
Example
In this example, we’ll use a customers
table with a Replication Key column named updated_at
.
- In a database, you run a process that updates 100 records in the
customers
table. - These records’
updated_at
values are updated to2018-11-01 00:00:00
. - During the next replication job - or Job 1 - Stitch extracts 101 records from the source
customers
table:- The 100 updated records with
updated_at
values of2018-11-01 00:00:00
, and - The one record with an
updated_at
value equal to the last saved maximum value from the previous replication job.
- The 100 updated records with
- Stitch saves the new maximum Replication Key value as
updated_at: 2018-11-01 00:00:00
. - No records are updated between Job 1 and the next job.
- When Job 2 begins, Stitch again extracts the 100 records with
updated_at: 2018-11-01 00:00:00
because their Replication Key values are equal to the last saved maximum value.
Until a record with a greater updated_at
value is added to the customers
table, Stitch will continue to extract all records with updated_at: 2018-11-01 00:00:00
values.
Bulk update handling
To avoid the above scenario, add a single record with a greater Replication Key value at the end of a bulk update. This will ensure that the maximum Replication Key value Stitch saves will only be equal to one record instead of many.
For example: If one of the 100 updated records in the customers
table had had an updated_at
value of 2018-11-01 00:00:01
, Stitch would have saved this as the maximum Replication Key value. Then, during Job 2, only one record - instead of 100 - would have been re-replicated.
Enable Key-based Incremental Replication
Key-based Incremental Replication is available for use with the majority of Stitch integrations. Depending on the type of integration, enabling this Replication Method will vary:
-
Database integrations: To use Key-based Incremental Replication, a table must contain a column suitable for use as a Replication Key. Note: For MongoDB integrations, there are additional considerations for Replication Keys. Refer to the MongoDB Replication Keys guide for more info.
-
SaaS integrations: With the exception of Salesforce, no configuration is required on your part. Replication Methods are pre-defined for every table set to replicate. Stitch will use Key-based Incremental Replication whenever possible to ensure your data is replicated accurately and efficiently.
-
Webhook integrations: No configuration is required on your part. As webhook data is sent to Stitch in real-time, only new records are ever replicated from a webhook source. This can be thought of as using Key-based Incremental Replication with a Replication Key of
created_at
.
Related | Troubleshooting |
Questions? Feedback?
Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.