Replication Keys are columns that Stitch uses to identify new and updated data for replication. When you set a Mongo collection to replicate using Key-based Incremental Replication, you’ll also need to define a Replication Key.
Incorrectly setting a Replication Key can cause data discrepancies, latency, and high row counts. As Mongo Replication Keys have their own set of quirks, it’s important to understand how they work and what makes a good key.
Replication Keys and Primary Keys
When it comes to replicating your data, there are a lot of ‘keys’ involved. It can be difficult to keep them all straight, but aside from Replication Keys, there’s one more you should keep in mind: Primary Keys.
In Stitch, Replication Keys and Primary Keys serve two different purposes:
-
Replication Keys are used during the Extraction phase of the replication process - or when Stitch is querying your data source - to identify new and updated data for replication.
In the Stitch app, Replication Keys have a next to the column name.
-
Primary Keys are used during the last step of the replication process, which is when Stitch loads replicated data into your destination. Primary Keys identify unique rows within a table and ensure that only the most recently updated version of that record appears in your destination.
In the Stitch app, Primary Keys have a next to the column name.
While a column can sometimes be used as both a Replication Key and a Primary Key, these are not necessarily always the same column.
Replication Key requirements
Stitch uses a field you define - called a Replication Key - to identify new and updated data for replication. For MongoDB integrations, Replication Key fields:
-
Must be a top-level field
-
Must exist in the root of the document
-
Must be single-field indexed
-
Must be one of the data types in the table below:
Data type | Available for | Notes |
DATETIME | All integrations | |
INTEGER | All integrations |
Includes |
TIMESTAMP | All integrations | |
FLOAT | MongoDB v1+ integrations | |
INT64 | MongoDB v1+ integrations | |
OBJECTID | MongoDB v1+ integrations | |
UUID | MongoDB v1+ integrations |
Recommendations & Gotchas
While a field need only be indexed using a single field index and exist in the root of the document to be a MongoDB Replication Key, we have some recommendations (and things you should keep in mind) when selecting a field to be a Replication Key.
Replication Key recommendations
-
Replication Key fields should contain only one data type. While MongoDB allows you to have multiple data types in a single field, we strongly recommend keeping Replication Key fields to just one. This is because of the way MongoDB compares and sorts data types and how this can impact replication. Refer to the Replication Key gotchas section for more info and an example
-
Date and timestamp fields are great Replication Key candidates. We’re big fans of using
updatedAt
ormodifiedAt
. This is the best way to ensure that both new records and updates to existing records are captured.In some cases - for example, if a table is Append-Only -
createdAt
may also be suitable. -
If date or timestamp fields can’t be used, Replication Keys should update incrementally. Fields with
Integer
andObjectId
data types can be used as Replication Keys if they update incrementally, which allows Stitch to identify aMAX
value and detect new records for replication.This is suitable for Append-Only tables only, meaning that the table is only updated with new data. If existing records are ever modified, a field like
updatedAt
ormodifiedAt
should be used instead.
Replication Key gotchas
Before selecting a Replication Key for a collection, there are a few things you should keep in mind:
-
Changing a collection’s Replication Key requires a full re-replication of the collection. To change the Replication Key for a MongoDB collection, Stitch must perform a full re-replication of the collection.
-
Stitch will not capture hard deletes.
-
NULL is a defined BSON data type in MongoDB. Unlike SQL,
NULLs
can actually compare to other data types and replicate without issue. -
MongoDB fields may contain more than one data type. These data types also have a hierarchy. Fields in MongoDB (even
_id
) can contain more than one data type. In addition, MongoDB “ranks” data types, meaning that some are considered greater than others. This can lead to problems detecting new data.While Stitch doesn’t require single data types for Replication Keys, we strongly recommend it. We’ve seen a lot of data discrepancies arise from this.
Here’s an example that demonstrates why this could be a problem in Stitch:
- You set a table to replicate, using a field called
_id
as the Replication Key. This field contains bothObjectId
andString
data types. - A historical replication job of the table completes.
- Because MongoDB considers
ObjectId
data types to be greater thanStrings
, Stitch will record theMAX
value as the last replicated record containing anObjectId
data type in the Replication Key field. - New records are added to the table.
- During the next replication job, Stitch uses the last recorded
MAX
value - in this case, anObjectId
- to identify new/updated data. Remember: only records with Replication Key values greater than or equal to this value will be selected for replication. - Because
ObjectIds > Strings
, all records withStrings
are considered to be less than the last recordedMAX
value. This means Stitch won’t be able to detect these records and replicate them.
Because Stitch may be unable to correctly identify new and updated data due to how data types are sorted, it’s best to keep Replication Key fields to a single data type. See below in the Checking for multiple data types section for guidance on verifying a field’s data types.
- You set a table to replicate, using a field called
-
Some data types may not auto-increment. Before using the field, you should verify the field’s data type and that it auto-increments. Otherwise, Stitch may have issues detecting new data.
Checking for multiple data types
To determine if a field has more than one data type and is thus suitable to be used as a Replication Key, we recommend running the following queries. We’re using Mongo version 3.0+, so keep in mind that these queries may be different for your version.
Run this query to count how many of a single data type there are in the table’s Replication Key field, replacing:
nameOfCollection
with the collection name,replicationKeyField
with the field name, andknownDataTypeId
with the Mongo BSON data type ID. You can find the data type IDs here in Mongo’s docs.
Next, run this query to get a count of all records in the table:
Compare the results. If the counts are equal, then the Replication Key field contains only one data type.
Additionally, the following query will return the MAX
value for the table’s Replication Key field. This can be helpful when comparing your source database to what’s in your data warehouse:
Defining Replication Keys
After you set a collection to replicate, you’ll be prompted to select a Replication Key.
After you select a field from the drop-down, click the Update Settings button.
Changing existing Replication Keys
Changing an existing Replication Key for a collection is simple - just open up the Collection Settings page for the collection and select the new Replication Key field from the drop-down menu.
Note: When you change a collection’s Replication Key, Stitch will queue a full re-replication of the collection’s data. We do this to ensure that there aren’t any gaps because of the Replication Key switch.
Resetting Replication Keys
There may be times when you need to fully replicate a collection (or collections) that usually update incrementally. If, for example, you add a new field and want to backfill data for already-replicated rows, forcing a full re-replication of the collection will populate the field for existing rows and replicate new records. You can do this by resetting Replication Keys.
Important: Before resetting Replication Keys:
This process:
- Will delete and re-create your destination tables with a full re-replication of your source data.
- Will lead to increased row counts which will count towards your limit.
- Cannot be interrupted or reversed once confirmed.
If you have questions or concerns about resetting Replication Keys, reach out to support before proceeding.
Replication Keys in database integrations can be reset at the integration or the collection level.
- At the integration level, the reset will clear the replication key value for ALL collections AND queue a full re-replication for all collections in the integration.
- At the collection level, the reset will clear the replication key value AND queue a full re-replication for that collection only.
To reset Replication Keys, do the following:
- Click into the integration from the Stitch Dashboard page.
- Next:
- To reset the entire integration: Click the Settings link and skip to step 3.
- To reset a collection: Locate the collection you want and click into it. Click the Collection Settings link, located near the top right corner, and proceed to step 3.
- Scroll down to the Reset Replication Keys section.
- Click the Reset Keys button.
- When prompted, click OK to confirm.
- A Success! message will display at the top of the page.
At this point, a full re-replication of the integration or collection will be queued. Note: If there is a large volume of data to be replicated, it may take some time before you see the changes in your data warehouse.
Related | Troubleshooting |
Questions? Feedback?
Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.