Log-based Incremental Replication is a replication method in which Stitch identifies modifications to records - including inserts, updates, and deletes - using a database’s binary log files. A log file is a record of events that occur within a database.
In this guide, we’ll cover:
- Some useful terminology,
- How it works (with examples),
- When it should be used,
- Limitations of this Replication Method, and
- How to enable it for your database integration
Log-based Incremental Replication terminology
- Log file - A file in a database containing a list of changes made to the database. Log files are made up of log messages.
- Log message - A single change made to a database. For example: An
UPDATE
to a record. -
Log position ID - A unique identifier corresponding to the position of a log message in a log file. These values are incremental, increasing as log messages are generated.
- In Amazon DynamoDB, this is called a Shard iterator. Shard iterators are used to identify a specific position within a unique shard.
- In Microsoft SQL Server, this is called the Change Tracking Version. As Microsoft’s documentation notes, this concept is similar to
rowversion
. - In MongoDB, this is a field in the database log named
ts
. Thets
field is a combination of a timestamp and an ordinal (integer counter) value. For example:"ts": Timestamp(1412180887, 1)
The timestamp is in seconds since the Unix epoch, and the ordinal is used to differentiate between entries that occured during the same second. - In MySQL and PostgreSQL, this is called a Log Sequence Number (LSN).
- In Oracle, this is called a System Change Number (SCN).
- Replication job - The three-step process by which Stitch replicates data. A replication job includes three distinct steps: Extraction, preparation, and loading.
- Historical replication job - A Stitch replication job that replicates historical data.
In addition to these general terms, each database refers to its log replication feature by a different name. Stitch uses the following database features to perform Log-based Incremental Replication:
-
Amazon DynamoDB: DynamoDB streams
-
Microsoft SQL Server: Change Tracking
-
MongoDB: OpLog
-
MySQL: Binary log file position based replication, or binlog
-
Oracle: LogMiner
-
PostgreSQL: Logical replication
How Log-based Incremental Replication works
There are two types of log replication: Statement and row-based. Stitch uses a row-based approach, which means that when rows are modified via a supported event type, the entire row is written to the log file as a log message.
During a replication job, Stitch iterates over each log message in sequence, or in the same order that the log messages were written to the log. Data is extracted for any log messages that correspond to tables that have been tracked for replication. This means that all tables using Log-based Incremental Replication are extracted from the source database simultaneously as one seamless process.
To ensure only new and updated data is selected for replication, Stitch will use a log’s log position ID to ‘bookmark’ its place in a log file. This value is used to resume replicating data where a previous replication job ended.
Historical replication jobs
During the historical replication job for a table using Log-based Incremental Replication, two things will happen:
- Stitch will retrieve and store the maximum log position from the logs.
-
Stitch will use a
SELECT
-based approach to replicate the table in full. If this were a SQL query, it would look like this:SELECT id, column_you_selected_1, column_you_selected_2, [...] FROM schema.table WHERE id < [max primary key value] ORDER BY id
Note: Stitch will automatically omit the
WHERE
clause in this statement if the table doesn’t have a Primary Key or has a Primary Key that isn’t sortable.
Ongoing replication jobs
After the historical replication of a table is complete, Stitch reads updates for the table from the database’s logs. During ongoing replication jobs using Log-based Incremental Replication, a few things will happen:
- Using the maximum log position ID from the previous job - in this case, the historical replication job - Stitch begins reading log messages in the binary file. Data for tables set to replicate is extracted.
- At the end of the replication job, Stitch bookmarks its place in the log file by storing its current log position ID.
- During the next replication job, Stitch will resume reading data with a greater log position ID than the log position ID from the previous job.
- At the end of the replication job, Stitch bookmarks its place in the log file again.
- Repeat.
When Log-based Incremental Replication should be used
Log-based Incremental Replication may be a good fit if:
- The database is an Amazon DynamoDB, Microsoft SQL Server, MongoDB, MySQL, Oracle, and PostgreSQL-backed database that supports Log-based Incremental Replication.
- Data is contained in a table, not a view.
- Modifications to records are made only using supported event types.
- The structure of the table changes infrequently, if at all. Refer to the Limitations section below for more info.
- For PostgreSQL instances: You’re aware that only master instances support Log-based Incremental Replication and that retaining binary log files will increase the database’s disk space usage.
- For MongoDB instances: The
_id
field in a collection only contains a single data type. Multiple data types in the_id
field can cause discrepancies during replication. Refer to the Limitations section for more info and examples.
If Log-based Incremental Replication isn’t appropriate, Key-based Incremental Replication may be a suitable alternative.
Limitations of Log-based Incremental Replication
Before you select Log-based Incremental Replication as the Replication Method for a table, you should be aware of the limitations this method can have. Being aware of these limitations can help prevent data discrepancies, replication issues, and ensure your data is replicated in the most efficient manner possible.
The following tabs contain the limitations that apply to the databases that support Log-based Incremental Replication. Click the tab for your database type to view its specific limitations for this replication method.
The following limitations are applicable to all database integrations:
- Only available for certain databases
- Only works with specific database event types
- Cannot be used with views
Refer to the other tabs for limitations that are applicable to specific database types.
The following limitations are applicable to Amazon DynamoDB-backed database integrations:
The following limitations are applicable to Microsoft SQL Server-backed database integrations:
- Only available for certain databases
- Only works with specific database event types
- Structural changes require manual intervention (Microsoft SQL Server, MySQL, Oracle, PostgreSQL (v1))
- Cannot be used with views
- Logs can age out and impact replication (Microsoft SQL Server, MongoDB, MySQL, and Oracle)
- Duplication in replication (Microsoft SQL Server)
The following limitations are applicable to MongoDB-backed database integrations:
- Only available for certain databases
- Only works with specific database event types
- Cannot be used with views
- Logs can age out and impact replication (Microsoft SQL Server, MongoDB, MySQL, and Oracle)
- Multiple data types in the _id field can cause discrepancies during historical replication (MongoDB)
The following limitations are applicable to MySQL-backed database integrations:
- Only available for certain databases
- Only works with specific database event types
- Structural changes require manual intervention (Microsoft SQL Server, MySQL, Oracle, PostgreSQL (v1))
- Cannot be used with views
- Logs can age out and impact replication (Microsoft SQL Server, MongoDB, MySQL, and Oracle)
The following limitations are applicable to Oracle-backed database integrations:
- Only available for certain databases
- Only works with specific database event types
- Structural changes require manual intervention (Microsoft SQL Server, MySQL, Oracle, PostgreSQL (v1))
- Cannot be used with views
- Logs can age out and impact replication (Microsoft SQL Server, MongoDB, MySQL, and Oracle)
The following limitations are applicable to PostgreSQL-backed database integrations:
- Only available for certain databases
- Only works with specific database event types
- Structural changes require manual intervention (Microsoft SQL Server, MySQL, Oracle, PostgreSQL (v1))
- Cannot be used with views
- Will increase source disk space usage (PostgreSQL)
- Can only be used with a master instance (PostgreSQL)
- Multiple connections to a replication slot can cause data loss in Stitch (PostgreSQL)
Limitation 1: Only available for certain databases
Log-based Incremental Replication is available only for certain Amazon DynamoDB, Microsoft SQL Server, MongoDB, MySQL, Oracle, and PostgreSQL-backed databases. While the original implementations of these databases support Log-based Incremental Replication some cloud versions may not.
In the table below are the databases Stitch supports and whether Log-based Incremental Replication can be used in Stitch for each one.
-
indicates that if the database/instance type meets the Minimum version requirement, Log-based Incremental Replication can be used in Stitch.
-
indicates that the database/instance type cannot be used in Stitch, even if the Minimum version requirement is met. This may be due to:
- The provider not having support for binary logging (MySQL) or logical replication (PostgreSQL), which is what Stitch uses to perform Log-based Incremental Replication.
- The provider not allowing server settings to be configured in the manner Stitch requires. Refer to the documentation for the database for configuration requirements.
- The provider not allowing binary logging on read replicas.
Note: If public-facing information about the lack of support is available, a link to it will display next to the icon.
Support for Amazon DynamoDB databases
This table is for v1 of Stitch’s Amazon DynamoDB or Amazon DynamoDB-based database integrations.
Database integration | Master instance | Read replica | Supported version(s) |
Amazon DynamoDB |
Support for Microsoft SQL Server databases
This table is for v1 of Stitch’s Microsoft SQL Server or Microsoft SQL Server-based database integrations.
Database integration | Master instance | Read replica | Supported version(s) |
Amazon Microsoft SQL Server RDS | 2012+ | ||
Microsoft Azure SQL Database | |||
Microsoft SQL Server | 2012+ |
Support for MongoDB databases
This table is for v3 of Stitch’s MongoDB or MongoDB-based database integrations.
Database integration | Master instance | Read replica | Supported version(s) |
MongoDB | |||
MongoDB Atlas |
Support for MySQL databases
This table is for v2 of Stitch’s MySQL or MySQL-based database integrations.
Database integration | Master instance | Read replica | Supported version(s) |
Amazon Aurora MySQL RDS | 5.6.2+ | ||
Google CloudSQL MySQL | 5.6.2+ | ||
MariaDB | 5.6.2+ | ||
MySQL | 5.6.2+ |
Support for Oracle databases
This table is for v1 of Stitch’s Oracle or Oracle-based database integrations.
Database integration | Master instance | Read replica | Supported version(s) |
Amazon Oracle RDS | 8.0 - 18c | ||
Oracle | 8.0 - 18c |
Support for PostgreSQL databases
This table is for v2 of Stitch’s PostgreSQL or PostgreSQL-based database integrations.
Database integration | Master instance | Read replica | Supported version(s) |
Amazon Aurora PostgreSQL RDS | 10.6+ | ||
Amazon PostgreSQL RDS | (link) | 9.4+ | |
Google CloudSQL PostgreSQL | (link) | 9.4+ | |
PostgreSQL | (link) | 9.4+ |
Limitation 2: Only works with specific database event types
Log-based Incremental Replication reads data from a database’s log and then replicates the changes. In order to replicate data, the event that caused a change to the data must be written to the log.
Stitch will read the following event types from logs:
DELETE
INSERT
UPDATE
This means that if data is modified using an event type not listed here, it won’t be written to the database’s log or subsequently detected by Log-based Incremental Replication.
For example: If data in a table is modified using ALTER
, the changes won’t be written to the log or identified by Stitch.
Limitation 3: Structural changes require manual intervention (Microsoft SQL Server, MySQL, Oracle, PostgreSQL (v1))
Any time the structure of a source table changes, you’ll need to reset the table from the Table Settings page. This will queue a full re-replication of the table and ensure that structural changes are correctly captured.
Structural changes can include adding new columns, removing columns, changing a data type, etc. Resetting the table is required due to how messages in logs are structured and how Stitch’s integrations validate table schemas when extracting data. When a structural change occurs without a table being reset, an extraction error similar to the following will surface in the Extraction Logs:
Error persisting data for table "[table_name]": Record 0 did not conform to schema
For this reason, Stitch recommends using Log-based Incremental Replication with tables that have structures that don’t change frequently.
Schema violation errors, explained
Messages in database logs are ordinal and don’t contain field information. This means that column values in log messages are in the same order as the columns in the source table, but the messages don’t contain data about which values belong to each column.
Because of this and how extraction is performed using Log-based Incremental Replication, changes in a table’s structure will cause extraction errors.
For example, consider this table:
id | name | type |
1 | Finn | human |
2 | Jake | dog |
3 | Bubblegum | princess |
The values for these fields might look like this in a binary log:
1,Finn,human,2,Jake,dog,3,Bubblegum,princess
Stitch’s Microsoft SQL Server, MySQL, Oracle, and PostgreSQL (v1) integrations use JSON schema validation to ensure that values in log messages are attributed to the correct fields when data is loaded into your destination. For this reason, schema changes in a source - whether it’s changing a column’s data type or re-ordering columns - will cause an extraction error to occur.
If the column order or data types of a source table change in any capacity, the integration will not persist new or updated records that use this updated schema, as it does not have a means of attributing values to their proper columns based on the ordinal set when compared to the expected schema that was previously detected.
As Log-based Incremental Replication is a seamless process, an extraction error for any one table in a replication job will disrupt replication from any other tables using Log-based Incremental Replication.
Let’s look at an example. Here’s the same table from before, now with a new age
column:
id | name | age | type |
1 | Finn | 15 | human |
2 | Jake | 9 | dog |
3 | Bubblegum | 19 | princess |
And the values in the log:
1,Finn,15,human,2,Jake,9,dog,3,Bubblegum,19,princess
Where Stitch previously detected three columns, the log messages now contain data for four columns. Because the log messages don’t contain field information and are read in order, Stitch would be unable to determine what column the 15
, 9
, and 19
values are for.
Limitation 4: Cannot be used with views
Log-based Incremental Replication can’t be used with database views, as modifications to views are not written to log files.
Stitch recommends using Key-based Incremental Replication instead, where possible.
Limitation 5: Logs can age out and impact replication (Microsoft SQL Server, MongoDB, MySQL, and Oracle)
Log files, by default, aren’t stored indefinitely on a database server. The amount of time a log file is stored depends on the database’s log retention settings.
Log retention settings when a log file is automatically removed from the database server. This can either be a set amount of time, or the maximum size of all the database’s log files. When a log file is removed from the server before Stitch can read from it, one of two things will happen depending on the database type:
This can be caused by a few things:
- The log file is purged before historical replication completes. This is because the maximum log position ID is saved at the start of historical replication jobs, so Stitch knows where to begin reading from the database logs after historical data is replicated.
-
For log retention settings that define a time period, the time period is too short. Stitch recommends a minimum of 3 days, but 7 days is preferred to account for resolving potential issues without losing logs.
- For Microsoft SQL Server databases, this is the
CHANGE_RETENTION
setting. - For MysQL databases, these are the
expire_logs_days
orbinlog_expire_logs_seconds
settings. - For Oracle databases:
- For self-hosted Oracle databases, this is the RMAN retention policy setting.
- For Oracle-RDS databases, these are the AWS automated backup and
archivelog retention hours
settings.
- For Microsoft SQL Server databases, this is the
-
For log retention settings that define a maximum size, the size is insufficient. This is applicable to MongoDB integrations. When creating a replica set, this is defined using the replication
oplogSizeMB
configuration option. It can also be defined for an existing replica set using the replSetResizeOplog command. - Any critical error that prevents Stitch from replicating data, such as a connection issue that prevents Stitch from connecting to the database or a schema violation. If the error persists past the log retention period, the log will be purged before Stitch can read it.
Microsoft SQL Sever and MongoDB: Affected tables will be re-replicated in full
When logs age out for Microsoft SQL Server and MongoDB database integration, the affected tables will be re-replicated in full and the following will surface in the Extraction Logs:
-
For Microsoft SQL Server databases:
CHANGE_TRACKING_MIN_VALID_VERSION has reported a value greater than current-log-version. Executing a full table sync.
-
For MongoDB databases:
Clearing state because Oplog has aged out Must complete full table sync before starting oplog replication for [COLLECTION_NAME]
To prevent table re-replication, increase the log retention settings for the database:
- For Microsoft SQL Server databases, this is accomplished via the CHANGE_RETENTION setting. Stitch recommends a value of at least 3 days, but 7 days is preferable.
- For MongoDB databases, this is accomplished via the replSetResizeOplog command. Note: As the maximum size you need depends on your database, it may take some experimentation to identify the best setting. Mongo doesn’t currently recommend an OpLog size.
MySQL and Oracle: Replication will stop
When logs age out for MySQL and Oracle database integrations, an extraction error similar to the following will surface in the Extraction Logs:
-
For MySQL databases:
Unable to replicate binlog stream because the following binary log(s) no longer exist: [binary_log_file_name]
-
For Oracle databases:
ORA-01291: missing logfile
To resolve the error, you’ll need to reset the integration from the Integration Settings page. Note: This is different than resetting an individual table.
Limitation 6: Will increase source disk space usage (PostgreSQL)
PostgreSQL Log-based Incremental Replication uses PostgreSQL’s logical replication feature. Logical replication uses a replication slot, which represents a stream of changes made to a given database.
According to PostgreSQL’s documentation, replication slots will prevent removal of required resources even if no connection is using them:
Replication slots persist across crashes and know nothing about the state of their consumer(s). They will prevent removal of required resources even when there is no connection using them. This consumes storage because neither required WAL nor required rows from the system catalogs can be removed by VACUUM as long as they are required by a replication slot.
This means that log files (Write Ahead Log (WAL), for PostgreSQL) aren’t removed from the replication slot until they’re consumed. In this case, until Stitch reads them during a replication job.
While Stitch will issue a flush_lsn
command after messages have been read, an increase in disk space usage is to be expected when using Log-based Incremental Replication due to how PostgreSQL replication slots function. The amount of disk space usage depends on the number of updates made to the database, how quickly Stitch proceeds with replication, and whether any errors that prevent replication arise.
The greatest increase in disk space usage typically occurs during the switch from historical replication (SELECT
-based replication) to consuming the database’s logs. Disk space usage may spike, but it typically levels off over time.
Note: If you decide to permanently disable Log-based Incremental Replication for your PostgreSQL database, remove the replication slot to prevent further unnecessary disk space consumption.
Limitation 7: Can only be used with a master instance (PostgreSQL)
For PostgreSQL-backed databases, Log-based replication will only work on master instances due to a feature gap in PostgreSQL 10. Based on their forums, PostgreSQL is working on adding support for using logical replication on a read replica to a future version.
If you’re concerned about the increase in disk space usage and the impact this may have, consider connecting a read replica and using Key-based Incremental Replication instead.
Otherwise, we recommend monitoring the instance’s disk space usage during the first few replication jobs to minimize any negative impact on your database’s performance.
Limitation 8: Multiple connections to a replication slot can cause data loss in Stitch (PostgreSQL)
As previously mentioned, Log-based Incremental Replication for PostgreSQL requires a replication slot which Stitch can read log files from. When changes are made to a database, they are written to the log file in the replication slot.
Each change to the database is written to the database’s replication slot exactly once.
As log files are removed from the replication slot once they’re consumed, this means that once the change is read, the record of it is purged.
If multiple connections - whether it’s multiple integrations in Stitch, or connections elsewhere - are using the same replication slot, data loss can occur as each connection will only receive some of the updates made to the database.
According to PostgreSQL’s documentation:
A logical slot will emit each change just once in normal operation… Multiple independent slots may exist for a single database. Each slot has its own state, allowing different consumers to receive changes from different points in the database change stream. For most applications, a separate slot will be required for each consumer. A logical replication slot knows nothing about the state of the receiver(s). It’s even possible to have multiple different receivers using the same slot at different times; they’ll just get the changes following on from when the last receiver stopped consuming them. Only one receiver may consume changes from a slot at any given time.
This means that if one connection reads the changes from the replication slot, Stitch will only be able to extract the changes from when the other connection stopped consuming them.
To avoid data loss caused by this scenario, Stitch recommends creating a dedicated replication slot for PostgreSQL database you want to connect.
Limitation 9: Multiple data types in the _id field can cause discrepancies during historical replication (MongoDB)
Historical replication jobs works a little differently for MongoDB-based database integrations. For MongoDB, Stitch uses the _id
field as the collection’s Primary Key.
During the historical replication job, Stitch first determines the current maximum value of the _id
column. Next, Stitch queries for documents with _id
values that are less than or equal to the current maximum _id
value.
For example: A query for a historical replication job for a MongoDB collection would look like this:
SELECT field_you_selected_1,
field_you_selected_2,
[...]
FROM schema.collection
WHERE _id <= [max _id value]
ORDER BY _id
The MongoDB integration functions this way to ensure that replication for the table can resume if a replication job is interrupted or doesn’t finish before the extraction job time limit.
Because MongoDB ranks BSON data types, this affects how the maximum _id
value is determined. As a result, if multiple data types are present in the _id
column, discrepancies may occur. Refer to the Missing Mongo data due to multiple data types guide for more info and examples.
Limitation 10: Duplication in replication (Microsoft SQL Server)
For Microsoft SQL Server-based integrations, Stitch uses log position IDs inclusively. This means that during a replication job, Stitch will resume reading data with a greater than or equal to log position ID than the log position ID from the previous job.
If this were a SQL query, it would look like this:
SELECT id,
column_you_selected_1,
column_you_selected_2,
[...]
FROM schema.table
WHERE log_position_id >= [last_saved_maximum_log_position_id]
As a result, there will be some duplication during the extraction process. This is because Stitch checks for log position IDs that are greater than or equal to the last saved maximum log position ID. Because of this approach, the record with a log position ID equal to the last saved maximum log position ID will be selected for extraction during the next job.
Enable Log-based Incremental Replication
Using Log-based Incremental Replication requires a specific database configuration. Instructions for configuring the required settings varies from database to database.
For setup instructions, refer to the documentation for your database:
Related | Troubleshooting |
Questions? Feedback?
Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.