Databricks Delta Lake on AWS (v1) Data Loading Reference

Learn how Stitch will load data from your integrations into Stitch’s Databricks Delta Lake on AWS destination.

In this guide, we’ll cover data loading scenarios involving:

Primary key columns
Replication keys and how data is loaded as a result
Object identifiers in the destination, including naming limitations and transformations
Table creation and modification in the destination
Various data types, including how data is typed and structured in the destination
Schema changes in the source or structural changes in the destination
Modifications made to the destination, such as the application of workload/performance management features or user privilege changes

Primary Key scenarios

Scenarios involving Primary Key columns.

A table without a Primary Key is replicated.

IF	A table without a Primary Key is replicated.
THEN	Initial job: Table is created. Subsequent jobs: If using Key-based Incremental or Log-based Incremental Replication, records will be added to the table in an Append-Only fashion. If using Full Table Replication, the table will be overwritten in its entirety during each job.

A table with a single Primary Key is replicated.

IF	A table with a single Primary Key is replicated.
THEN	Initial job: Table is created with a table property comment containing the Primary Key information for the table. Subsequent jobs: If using Key-based Incremental or Log-based Incremental Replication, data will be de-duped based on the Primary Key column and upserted into the table. If using Full Table Replication, the table will be overwritten in its entirety during each job.

A table with multiple Primary Keys is replicated.

IF	A table with multiple Primary Keys is replicated.
THEN	Initial job: Table is created with a table property comment containing the Primary Key information for the table. Subsequent jobs: If using Key-based Incremental or Log-based Incremental Replication, data will be de-duped based on the Primary Key column and upserted into the table. If using Full Table Replication, the table will be overwritten in its entirety during each job.

The table's Primary Key(s) is/are changed.

IF	The table’s Primary Key(s) is/are changed.
THEN	Changing a table’s Primary Key(s) is only permitted if the table is using Full Table Replication. If Primary Key columns are changed for a table using Key-based or Log-based Incremental Replication, Stitch will stop processing data for the table.
AND	The following error will display in the Notifications tab in Stitch: `incoming pks([new_pk_column(s)]) do not match pks([original_pk_column(s)]) in existing table([table_name])`
FIX IT	For tables using Key-based or Log-based Incremental Replication: Re-instate the table’s Primary Key(s) to allow Stitch to continue processing data for the table, or Drop the table in Databricks Delta and then reset the table in Stitch. This will re-create the table using the new Primary Keys and allow loading to continue. Note: Resetting the table will queue a full re-replication of the table’s data.

A Primary Key column in a source contains multiple data types.

THEN

To accommodate data of varying types, Stitch will create multiple columns to ensure data is loaded with the correct type. In the destination, this will look like the column has been “split”. To ensure data is loaded correctly, a Primary Key column may only a single data type.

For example: Stitch initially detected INTEGER values for an id column, but during a subsequent replication job, detected STRING values. All INTEGER values will be placed in the original id column, and a new id__st column will be created to accommodate the STRING values. In Databricks Delta, the table might look like this:

id	id__st	name
1		Finn
	“2”	Jake

Column splits will result in NULL values in the id column. As Primary Key columns can’t contain NULL values, this will result in the error below.

AND

The following error will display in the Notifications tab in Stitch:

missing pk([pk_column]) from updated record

FIX IT

Verify the data type(s) for the Primary Key column in the source. If it contains multiple data types, you’ll need to ensure that the column only contains values of one data type.

Note: If the table is using Key-based or Log-based Incremental Replication, you’ll also need to do the following:

Drop the table in Databricks Delta
Reset the table in Stitch. This will queue a full re-replication of the table’s data, and re-create the table in Databricks Delta with the correct data type for the Primary Key column.

You remove the Primary Key column(s) for a table in Databricks Delta.

IF	You remove the Primary Key column(s) for a table in Databricks Delta.
THEN	Changing a table’s Primary Key(s) is not permitted in Databricks Delta. If Primary Key columns are changed, Stitch will stop processing data for the table.
AND	The following error will display in the Notifications tab in Stitch: `incoming pks([new_pk_column(s)]) do not match pks([original_pk_column(s)]) in existing table([table_name])`
FIX IT	For tables using Key-based or Log-based Incremental Replication: Re-instate the table’s Primary Key(s) to allow Stitch to continue processing data for the table, or Drop the table in Databricks Delta and then reset the table in Stitch. This will re-create the table using the new Primary Keys and allow loading to continue. Note: Resetting the table will queue a full re-replication of the table’s data.

Replication Key scenarios

Scenarios involving Replication Keys and how data is loaded as a result.

A table using Key-based Incremental Replication is replicated where the Replication Key column contains NULL values.

IF	A table using Key-based Incremental Replication is replicated where the Replication Key column contains `NULL` values.
THEN	During the initial job, the table will be created and all rows will be replicated. During subsequent jobs, only rows with populated Replication Keys will be replicated and persisted to Databricks Delta.

Object naming scenarios

Scenarios involving object identifiers in the destination, including naming limitations and transformations.

A table name contains more characters than allowed by Databricks Delta.

IF	A table name contains more characters than allowed by Databricks Delta.
THEN	Databricks Delta will reject all data for the table.
AND	The following error will display in the Notifications tab in Stitch: `table name ([table_name]) is too long ([table_name_length]) for databricks delta max identifier length of 78 characters` Rejected records will be logged in the `_sdc_rejected` table of the integration's schema. Learn more.
FIX IT	If possible, change the table name in the source to be less than Databricks Delta’s character limit of 78 characters. Use the `_sdc_rejected` table to identify the root of the issue.

A column name contains more characters than allowed by Databricks Delta.

IF	A column name contains more characters than allowed by Databricks Delta.
THEN	Databricks Delta will reject columns with names that exceed the column character limit. Other columns in the table will persist to Databricks Delta.
AND	The following error will display in the Notifications tab in Stitch: `column name ([column_name]) is too long ([column_name_limit]` Rejected records will be logged in the `_sdc_rejected` table of the integration's schema. Learn more.
FIX IT	If possible, change the column name in the source to be less than Databricks Delta’s character limit of 122 characters. Use the `_sdc_rejected` table to identify the root of the issue.

Two columns are replicated that canonicalize to the same name.

IF	Two columns are replicated that canonicalize to the same name.
THEN	For example: A table containing both `CustomerId` and `customerid` columns. Databricks Delta will reject the records and create a log for the rejected records in the `_sdc_rejected` table in that integration’s schema.
AND	The following error will display in the Notifications tab in Stitch: `Field collision on [COLUMN_NAME]` Rejected records will be logged in the `_sdc_rejected` table of the integration's schema. Learn more.
FIX IT	If possible, re-name one of the columns in the source so that both column names will be unique when replicated to Databricks Delta. Use the `_sdc_rejected` table to identify the root of the issue.

A column is replicated that has a mixed-case name.

THEN

Databricks Delta will convert letters to lowercase. For example:

Columns in source	Columns in Databricks Delta
`CuStOmErId`	`customerid`
`customerID`	`customerid`
`customerid`	`customerid`

A column is replicated that has a name with spaces.

THEN

Databricks Delta will convert spaces to underscores. For example:

Columns in source	Columns in Databricks Delta
`customer id`	`customer_id`
`CUSTOMER ID`	`customer_id`

A column is replicated with a name that contains unsupported special characters.

THEN

Databricks Delta will convert special characters to underscores. For example:

Columns in source		Columns in Databricks Delta
`!CUSTOMERID`		`_customerid`
`CUSTOMER!ID`		`customer_id`

A column is replicated with a name that begins with a non-letter.

THEN

Databricks Delta will conserve the non-letter characters and prefix the name with an underscore. For example:

Columns in source		Columns in Databricks Delta
`123customerid`		`_123customerid`
`_987CUSTOMERID`		`_987customerid`

Table scenarios

Scenarios involving table creation and modification in the destination.

A table contains entirely NULL columns.

IF	A table contains entirely `NULL` columns.
THEN	No table is created in Databricks Delta. At least one column must have a non-`NULL` value for Stitch to create a table in Databricks Delta.

A table arrives with more columns than Databricks Delta allows.

IF	A table arrives with more columns than Databricks Delta allows.
THEN	Databricks Delta doesn’t have a column limit for tables. Data will continue to load.

Data typing scenarios

Scenarios involving various data types, including how data is typed and structured in the destination.

Stitch detects multiple data types for a single column.

THEN

To accommodate data of varying types, Stitch will create multiple columns to ensure data is loaded with the correct type. In the destination, this will look like the column has been “split”.

For example: Stitch first detected that order_confirmed contained BOOLEAN data, but during a subsequent job, detected STRING values. To accommodate data of varying types, Stitch will:

Store data for the original data type in the original column. In this example, only BOOLEAN values will be stored in order_confirmed. The name of the original column will not change.
Create additional columns to store the other data types - one for each data type detected - and append the data type to the column name. In this example, a order_confirmed__st column will be created to store STRING values.

Note: If the column is used as a Primary Key for the table, this scenario will result in a loading error. Refer to the Primary Key scenarios section for more info and examples.

Data is replicated to Databricks Delta that is nested, containing many top-level properties and potentially nested sub-properties.

IF	Data is replicated to Databricks Delta that is nested, containing many top-level properties and potentially nested sub-properties.
THEN	Nested data structures (JSON arrays and objects) will be loaded intact into a `STRING` column with a comment specifying that the column contains JSON.

A VARCHAR column is replicated to Databricks Delta.

IF	A `VARCHAR` column is replicated to Databricks Delta.
THEN	Databricks Delta will store `VARCHAR` data as `STRING`.

VARCHAR data is loaded that exceeds the current maximum size for the column.

IF	`VARCHAR` data is loaded that exceeds the current maximum size for the column.
THEN	No widening will occur.

A column containing date data with timezone info is replicated to Databricks Delta.

IF	A column containing date data with timezone info is replicated to Databricks Delta.
THEN	Databricks Delta will store the value as `TIMESTAMP WITH TIMEZONE`. In Databricks Delta, this data is stored with timezone information and expressed as UTC.

A column contains timestamp data that is outside Databricks Delta's supported range.

IF	A column contains timestamp data that is outside Databricks Delta’s supported range.
THEN	Databricks Delta will reject the records that fall outside the supported range.
AND	The following error will display in the Notifications tab in Stitch: `timestamp ([timestamp]) out of range for column [column_name]` Rejected records will be logged in the `_sdc_rejected` table of the integration's schema. Learn more.
FIX IT	To resolve the error, offending values in the source must be changed to be within Databricks Delta’s timestamp range. Use the `_sdc_rejected` table to identify the root of the issue.

A column contains integer data.

IF	A column contains integer data.
THEN	Databricks Delta will store integer data as `BIGINT`.

A column contains integer data that is outside Databricks Delta's supported range.

IF	A column contains integer data that is outside Databricks Delta’s supported range.
THEN	Databricks Delta will reject the records that fall outside the supported range.
AND	The following error will display in the Notifications tab in Stitch: `long ([long]) out of range for column [column_name]` Rejected records will be logged in the `_sdc_rejected` table of the integration's schema. Learn more.
FIX IT	To resolve the error, offending values in the source must be changed to be within Databricks Delta’s limit for integers. Use the `_sdc_rejected` table to identify the root of the issue.

A column contains decimal data.

IF	A column contains decimal data.
THEN	Databricks Delta will store decimal data as `DECIMAL(38,6)`.

A column contains decimal data that is outside Databricks Delta's supported range.

IF	A column contains decimal data that is outside Databricks Delta’s supported range.
THEN	Databricks Delta will reject the records that fall outside the supported range.
AND	The following error will display in the Notifications tab in Stitch: `decimal ([decimal]) out of range for column [column_name]` Rejected records will be logged in the `_sdc_rejected` table of the integration's schema. Learn more.
FIX IT	To resolve the error, offending values in the source must be changed to be within Databricks Delta’s limit for decimals. Use the `_sdc_rejected` table to identify the root of the issue.

Schema change scenarios

Scenarios involving schema changes in the source or structural changes in the destination.

A new column is added in table already set to replicate.

THEN

If the column has at least one non-NULL value in the source, the column will be created and appended to the end of the table in Databricks Delta.

Note: If the table using either Key- or Log-based Incremental Replication, backfilled values for the column will only be replicated if:

The records’ Replication Key values are greater than or equal to the last saved maximum Replication Key value for the table, or
The table is reset and a historical re-replication is queued.

Refer to Tracking new columns in an already replicating table guide for more info and examples.

A new column is added by you to a Stitch-generated table in Databricks Delta.

IF	A new column is added by you to a Stitch-generated table in Databricks Delta.
THEN	Columns may be added to tables created by Stitch as long as they are nullable, meaning columns don’t have `NOT NULL` constraints.

A column is deleted at the source.

THEN

How a deleted column is reflected in Databricks Delta depends on the Replication Method used by the table:

Key-based Incremental: The column will remain in the destination, and default NULL values will be placed in it going forward.
Log-based Incremental: Changes to a source table - including adding or removing columns, changing data types, etc. - require manual intervention before replication can continue. Refer to the Log-based Incremental Replication documentation for more info.
Full Table: The column will remain in the destination, and default NULL values will be placed in it going forward.

You remove a column from a Stitch-replicated table in your destination.

THEN

The result of deleting a column from a Stitch-generated table depends on the type of column being removed:

Primary Key columns: Changing a table’s Primary Key(s) is not permitted in Databricks Delta. If Primary Key columns are changed, Stitch will stop processing data for the table.
General columns: If new data is detected for the removed column, Stitch will re-create it in Databricks Delta. This refers to all columns that are not prepended by _sdc or suffixed by a data type. For example: customer_zip, but not customer_zip__st.

Note: An integration must support selecting columns AND you must deselect the column in Stitch for the column removal to be permanent.
_sdc columns: Removing a Stitch replication column will prevent Stitch from loading replicated data into Databricks Delta.
Columns with data type suffixes: Removing a column created as result of accommodating multiple data types will prevent Stitch from loading replicated data into the table. This applies to columns with names such as: customer_zip__st, customer_zip__int, etc.

Destination changes

Scenarios involving modifications made to the destination, such as the application of workload/performance management features or user privilege changes.

Partitioning is applied to Stitch-generated tables in the destination.

IF	Partitioning is applied to Stitch-generated tables in the destination.
THEN	Stitch will respect the partitioning application.

Clustering is applied to Stitch-generated tables in the destination.

IF	Clustering is applied to Stitch-generated tables in the destination.
THEN	Stitch will respect the cluster application.

Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.

Related	Troubleshooting
Syncing Data Syncing New Columns Replication Methods Replication Keys Resetting Replication Keys	Data Discrepancy Guide Missing Objects Missing Columns & NULL Values Non-Replicating Data & Unsupported Data Types Destination & Integration Compatibility