Missing some Mongo data in your destination? Because of how Mongo sorts data based on data type, Stitch may be unable to correctly identify new and updated data. If data you expect to see is missing, the root cause may be multiple data types in the collection’s Replication Key or Primary Key (_id) fields.
Symptoms
Missing or stale data in the destination for Mongo-backed database integrations.
Cause
The cause of this problem is two-fold:
- Fields in Mongo may contain more than one BSON data type
- Mongo ranks data types, which affects how Mongo determines the current maximum value for a field
Replication Methods and value sorting
For Mongo-backed database integrations, Stitch uses a field’s maximum value to identify new and updated data during replication.
The field itself and how its values are used depend on the Replication Method the collection uses:
Replication Method | Field used | Description |
Key-based Incremental | Replication Key | Documents with a Replication Key value greater than or equal to the last saved maximum value for the Replication Key field are replicated. |
Full Table | Primary Key (_id) | Documents with a Primary Key (_id) value less than or equal to the saved maximum _id value are replicated. This ensures that replication can resume if the replication job is interrupted. |
Log-based Incremental | Primary Key (_id) | Applicable only to the historical replication of a collection. This is not applicable when Stitch reads updates from the database’s logs. Documents with a Primary Key (_id) value less than or equal to the saved maximum _id value are replicated. This ensures that historical replication for a collection can resume if the replication job is interrupted. |
Mongo’s data type ranking determines what the current maximum value of a field is. This, in turn, can affect how Stitch identifies and replicates data from a Mongo database.
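To make the effect of type ranking concrete, here is a minimal sketch in plain JavaScript. It does not talk to a database. The tagged `{ type, value }` shape, the `TYPE_RANK` table, and the sample values are assumptions made for illustration, and only a few of Mongo's types are listed. The relative order shown (numbers, then strings, then binary data, then ObjectIds, then dates) follows Mongo's documented cross-type sort order.

```javascript
// Simplified model of Mongo's cross-type sort order (lowest to highest).
// Only a handful of types are listed; the { type, value } shape is a
// hypothetical stand-in for real BSON values.
const TYPE_RANK = { number: 1, string: 2, binData: 3, objectId: 4, date: 5 };

function compareBson(a, b) {
  // Different types: the type rank decides, regardless of the values.
  if (a.type !== b.type) return TYPE_RANK[a.type] - TYPE_RANK[b.type];
  // Same type: compare the values themselves.
  if (a.value < b.value) return -1;
  if (a.value > b.value) return 1;
  return 0;
}

function maxValue(values) {
  return values.reduce((max, v) => (compareBson(v, max) > 0 ? v : max));
}

// A field that mixes ObjectId and String values.
const tableIds = [
  { type: "objectId", value: "507f1f77bcf86cd799439011" },
  { type: "string", value: "zzz-newest-record" },
  { type: "objectId", value: "507f191e810c19729de860ea" },
];

const max = maxValue(tableIds);
console.log(max.type);  // objectId
console.log(max.value); // 507f1f77bcf86cd799439011
```

In this model the String value can never become the maximum, no matter what it contains, because its type ranks below ObjectId.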
Examples
Consider these examples, which demonstrate how multiple data types in either the Replication Key or Primary Key (_id) field can cause data discrepancies.
Example: Replication Key
This example demonstrates how multiple data types in a Replication Key field can cause data discrepancies.
- A collection is set to replicate, using a field named table_id as the Replication Key. The table_id field contains both ObjectId and String data.
- A historical replication of the collection completes.
- Stitch saves the maximum value of table_id. Because Mongo ranks ObjectId data types as greater than Strings, the maximum value Stitch saves is an ObjectId value.
- New documents are added to the collection.
- During the next replication job, Stitch uses the last recorded maximum value, an ObjectId value, to identify new and updated data.
- Because ObjectIds > Strings, all documents with Strings are considered to be less than the last recorded maximum value. This means Stitch won’t be able to detect these documents and replicate them.
Example: Primary Key
This example demonstrates how multiple data types in a Primary Key (_id) field can cause data discrepancies.
- A collection is set to replicate. Stitch automatically uses its _id field as the Primary Key. The _id field contains both ObjectId and UUID data.
- During the replication job, Stitch identifies and saves the maximum value of _id. In this example, it’s an ObjectId value.
- Stitch queries for all documents with an _id value less than or equal to the saved maximum _id value.
- Because Mongo considers ObjectIds and UUID values to be neither greater than nor less than each other, UUID records may be excluded from the results of Stitch’s query. This means Stitch won’t be able to detect these documents and replicate them.
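A rough way to picture the last step is a filter that only ever compares values of the same type. The sketch below is plain JavaScript and models that behavior under assumptions: the `matchesLte` helper, the `{ type, value }` shape, and the sample `_id` values are hypothetical and are not taken from Stitch or from a real collection.

```javascript
// Hypothetical model of a "less than or equal to" filter in which a value
// whose type differs from the bound's type never matches.
function matchesLte(value, bound) {
  return value.type === bound.type && value.value <= bound.value;
}

// The saved maximum _id is an ObjectId.
const savedMaxId = { type: "objectId", value: "64b0000000000000000000ff" };

const documents = [
  { name: "doc-a", _id: { type: "objectId", value: "64b000000000000000000001" } },
  { name: "doc-b", _id: { type: "uuid", value: "3b241101-e2bb-4255-8caf-4136c566a962" } },
];

const returned = documents
  .filter((d) => matchesLte(d._id, savedMaxId))
  .map((d) => d.name);

console.log(returned.join(", ")); // doc-a
```

The UUID-keyed document is left out of the result even though it exists in the collection, which is the discrepancy this example describes.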
Diagnose the issue
To determine if a field contains multiple data types, you’ll run queries and compare the count of specific data type values in the Replication Key or Primary Key (_id) field to the total number of documents in the collection.
Step 1: Get a count of data types for the field
First, you’ll need to get a count of how many instances of a single data type there are in a given field in the collection.
Run the query below, replacing the following:
- nameOfCollection: The name of the collection
- keyField: This is dependent on the Replication Method the collection uses:
  - For Key-based Incremental Replication: The name of the field used as the collection’s Replication Key
  - For Log-based Incremental or Full Table Replication: This value should be _id
- knownDataTypeId: The ID of the known BSON data type used by the keyField. Refer to Mongo’s documentation for a list of BSON data type IDs.
db.<nameOfCollection>.count({<keyField>: {$type: <knownDataTypeId>}});
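If the data types in the field aren't known up front, a per-type breakdown can complement the count above. The sketch below builds an aggregation pipeline that groups documents by the BSON type of the key field, using Mongo's `$type` aggregation operator. The helper name `typeBreakdownPipeline` and the collection name `orders` are hypothetical, and this is offered as one possible approach rather than part of Stitch's documented procedure.

```javascript
// Builds an aggregation pipeline that counts documents per BSON type
// of the given field. The $type aggregation operator returns the type
// name as a string, for example "objectId" or "string".
function typeBreakdownPipeline(keyField) {
  return [
    { $group: { _id: { $type: "$" + keyField }, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
  ];
}

// In the mongo shell, the pipeline could be run like this
// ("orders" is a hypothetical collection name):
//   db.getCollection("orders").aggregate(typeBreakdownPipeline("_id"));

console.log(JSON.stringify(typeBreakdownPipeline("table_id"), null, 2));
```

More than one group in the result would indicate that the field holds more than one data type.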
Step 2: Count all records in the collection
Next, run this query to get a count of all records in the collection:
db.<nameOfCollection>.count();
Step 3: Retrieve the field's current maximum value
Next, run the following query to return the maximum value for the specified Replication or Primary Key field in the collection. This can be helpful when comparing your source database to what’s in your destination:
db.<nameOfCollection>.find().sort({<keyField>:-1}).limit(1);
Step 4: Compare the query results
Compare the results between the queries from Step 1 and Step 2.
If the results are equal, then the Replication or Primary Key field contains only one data type. The root cause may require additional investigation.
If the results aren’t equal, multiple data types in the Replication or Primary Key field may be interfering with Stitch’s replication process. Refer to the Solution section for next steps.
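The comparison itself is simple arithmetic, sketched below in plain JavaScript. The `compareCounts` helper and the counts passed to it are made up for illustration.

```javascript
// Hypothetical helper: compares the Step 1 count (documents whose key field
// has the known data type) with the Step 2 count (all documents).
function compareCounts(typedCount, totalCount) {
  const otherTypeCount = totalCount - typedCount;
  return {
    singleType: otherTypeCount === 0,
    otherTypeCount,
  };
}

// Made-up counts for illustration.
const result = compareCounts(9800, 10000);
console.log(result.singleType);     // false
console.log(result.otherTypeCount); // 200
```

Here 200 documents don't match the known data type, which points toward the Solution section.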
Solution
If you’ve determined the field contains multiple data types, you have a few options:
- To continue using the collection’s current Replication Method:
  - Modify the field to only contain a single data type.
  - After this is completed in the source, reset the collection to queue a historical replication.
- To use a different Replication Method:
  - Verify that the Replication Key field (if switching to Key-based Incremental Replication) or the _id field (if switching to Full Table or Log-based Incremental Replication) only contains a single data type. Make any modifications before proceeding.
  - Configure the new Replication Method for the collection in Stitch. Changing a Replication Method automatically queues a historical replication.
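As one possible way to bring a Replication Key field to a single data type, the sketch below builds the arguments for an `updateMany` call that converts ObjectId values to Strings with the `$toString` operator. It assumes MongoDB 4.2 or later, where updates can take an aggregation pipeline. The helper name, the collection name `orders`, and the choice of String as the target type are assumptions. This approach doesn't apply to `_id`, which can't be changed on an existing document. Test on a copy of the data before running anything like this against a production collection.

```javascript
// Hypothetical helper: returns the filter and update pipeline for converting
// ObjectId values in keyField to their String form.
function objectIdToStringUpdate(keyField) {
  return {
    // Only touch documents where the field currently holds an ObjectId.
    filter: { [keyField]: { $type: "objectId" } },
    // Pipeline-style update: overwrite the field with its string form.
    update: [{ $set: { [keyField]: { $toString: "$" + keyField } } }],
  };
}

// In the mongo shell ("orders" is a hypothetical collection name):
//   const { filter, update } = objectIdToStringUpdate("table_id");
//   db.getCollection("orders").updateMany(filter, update);

const args = objectIdToStringUpdate("table_id");
console.log(JSON.stringify(args, null, 2));
```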
If you’ve determined multiple data types aren’t causing the discrepancy, we recommend working through the Data discrepancy troubleshooting guide before contacting support.
Additionally, providing support with the info from the queries in this guide can help us investigate more quickly.