Databricks Delta Lake (AWS) is an open source storage layer that sits on top of your existing data lake file storage. Stitch’s Databricks Delta Lake (AWS) destination is compatible with Amazon S3 data lakes.
This guide serves as a reference for version 1 of Stitch’s Databricks Delta Lake (AWS) destination.
Details and features
Stitch features
High-level details about Stitch’s implementation of Databricks Delta Lake (AWS), such as supported connection methods, availability on Stitch plans, etc.
Release status | Released
Stitch plan availability | All Stitch plans
Stitch supported regions | Operating regions determine the location of the resources Stitch uses to process your data. Learn more.
Supported versions | Databricks Runtime Version 6.3+
Connect API availability | Supported. This version of the Databricks Delta Lake (AWS) destination can be created and managed using Stitch’s Connect API. Learn more.
SSH connections | Supported. Stitch supports using SSH tunnels to connect to Databricks Delta Lake (AWS) destinations.
SSL connections | Supported. Stitch will attempt to use SSL to connect by default. No additional configuration is needed.
VPN connections | Unsupported. Virtual Private Network (VPN) connections may be implemented as part of a Premium plan. Contact Stitch Sales for more info.
Static IP addresses | Supported. This version of the Databricks Delta Lake (AWS) destination has static IP addresses that can be whitelisted.
Default loading behavior | Upsert
Nested structure support | Supported
Destination details
Details about the destination, including object names, table and column limits, reserved keywords, etc.
Note: Exceeding the limits noted below will result in loading errors or rejected data.
Maximum record size | 20MB
Table name length | 78 characters
Column name length | 122 characters
Maximum columns per table | None
Maximum table size | None
Maximum tables per database | None
Case sensitivity | Insensitive
Reserved keywords | Refer to the Reserved keywords documentation.
Replication
Replication process overview
A Stitch replication job consists of five stages:
Step 1: Data extraction
Stitch requests and extracts data from a data source. Refer to the System overview guide for a more detailed explanation of the Extraction phase.
Step 2: Stitch's internal pipeline
The data extracted from sources is processed by Stitch. Stitch’s internal pipeline includes the Prepare and Load phases of the replication process:
- Prepare: During this phase, the extracted data is buffered in Stitch’s durable, highly available internal data pipeline and readied for loading.
- Load: During this phase, the prepared data is transformed to be compatible with the destination, and then loaded. Refer to the Transformations section for more info about the transformations Stitch performs for Databricks Delta Lake (AWS) destinations.
Refer to the System overview guide for a more detailed explanation of these phases.
Step 3: Amazon S3 bucket
Data is loaded into files in the Amazon S3 bucket you provide during destination setup.
Step 4: Staging data
Data is copied from the Amazon S3 bucket and placed into staging tables in Databricks Delta Lake (AWS).
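Stitch performs this staging step internally, but conceptually it resembles a Databricks COPY INTO from the S3 bucket into a Delta staging table. The schema, table, path, and file format below are hypothetical placeholders for illustration only, not the objects Stitch actually creates:

```sql
-- Illustrative sketch only: Stitch manages staging itself.
-- Assumes a pre-existing Delta staging table and a hypothetical S3 path.
COPY INTO example_schema.orders_staging
FROM 's3://your-stitch-bucket/stitch-output/orders/'
FILEFORMAT = JSON;
```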
Step 5: Data merge
Data is merged from the staging tables into real tables in Databricks Delta Lake (AWS).
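Stitch handles this merge internally as well. As a rough sketch of the idea, a Delta MERGE keyed on the table’s Primary Key is one way staged rows can be folded into the final table; the table and column names here are hypothetical:

```sql
-- Illustrative sketch only: upsert staged rows into the final table,
-- matching on a hypothetical `id` Primary Key column.
MERGE INTO example_schema.orders AS target
USING example_schema.orders_staging AS staging
  ON target.id = staging.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```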
Loading behavior
By default, Stitch will use Upsert loading when loading data into Databricks Delta Lake (AWS).
If the conditions for Upsert loading aren’t met, data will be loaded using Append-Only loading.
Refer to the Understanding loading behavior guide for more info and examples.
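When data is loaded Append-Only, each version of a record appears as its own row. A common pattern for querying only the latest version of each record uses Stitch’s _sdc system columns; the table name and the `id` key column below are hypothetical, and this assumes the standard _sdc_sequence column is present:

```sql
-- Keep only the most recent version of each record in an Append-Only table.
SELECT * FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY _sdc_sequence DESC) AS row_num
  FROM example_schema.orders
) latest
WHERE row_num = 1;
```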
Primary Keys
Stitch requires Primary Keys to de-dupe incrementally replicated data. To ensure Primary Key data is available, Stitch creates a stitch.pks table property comment when the table is initially created in Databricks Delta Lake (AWS). The table property comment is an array of strings containing the names of the Primary Key columns for the table.
For example: A table property comment for a table with a single Primary Key:
(stitch.pks="id")
And a table property comment for a table with a composite Primary Key:
(stitch.pks="id,created_at")
Note: Removing or incorrectly altering Primary Key table property comments can lead to replication issues.
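On a Delta table, this annotation lives in the table’s properties. A minimal sketch of what it looks like on a hypothetical table, and how to inspect it afterwards:

```sql
-- Hypothetical table illustrating the stitch.pks table property.
CREATE TABLE example_schema.orders (
  id INT,
  created_at TIMESTAMP,
  amount DECIMAL(10, 2)
) USING DELTA
TBLPROPERTIES ('stitch.pks' = 'id,created_at');

-- Inspect the table properties to confirm the Primary Key columns.
SHOW TBLPROPERTIES example_schema.orders;
```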
Incompatible sources
No compatibility issues have been discovered between Databricks Delta Lake (AWS) and Stitch's integration offerings.
See all destination and integration incompatibilities.
Transformations
System tables and columns
Stitch will create system tables in each integration’s dataset.
Additionally, Stitch will insert system columns (prepended with _sdc) into each table.
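The _sdc columns record replication metadata that you can query alongside your own columns. A small example, assuming a hypothetical orders table and the commonly seen _sdc_received_at and _sdc_batched_at columns (confirm the exact set in your own tables):

```sql
-- Inspect replication metadata alongside regular columns.
SELECT
  id,
  _sdc_received_at,   -- when Stitch received the record from the source
  _sdc_batched_at     -- when the record's batch was loaded
FROM example_schema.orders
ORDER BY _sdc_batched_at DESC
LIMIT 10;
```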
Data typing
Stitch converts data types only where needed to ensure the data is accepted by Databricks Delta Lake (AWS). In the table below are the data types Stitch supports for Databricks Delta Lake (AWS) destinations, and the Stitch types they map to.
- Stitch type: The Stitch data type the source type was mapped to. During the Extraction and Preparing phases, Stitch identifies the data type in the source and then maps it to a common Stitch data type.
- Destination type: The destination-compatible data type the Stitch type maps to. This is the data type Stitch will use to store data in Databricks Delta Lake (AWS).
- Notes: Details about the data type and/or its allowed values in the destination, if available. If a range is available, values that exceed the noted range will be rejected by Databricks Delta Lake (AWS).
Stitch type | Destination type | Notes
BIGINT | UNSUPPORTED |
BOOLEAN | BOOLEAN |
DATE | TIMESTAMP |
DOUBLE | DECIMAL |
FLOAT | FLOAT |
INTEGER | INT |
JSON ARRAY | STRING |
JSON OBJECT | STRING |
NUMBER | DECIMAL |
STRING | STRING |
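To see which destination types were applied to a loaded table, you can describe it in Databricks; the table name here is a hypothetical placeholder:

```sql
-- Show the destination data type chosen for each column.
DESCRIBE TABLE example_schema.orders;
```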
JSON structures
Databricks Delta Lake (AWS) supports nested records within tables. When JSON objects and arrays are replicated, Stitch will load the JSON intact into a STRING column and add a comment ("json") specifying that the column contains JSON data.
Refer to Databricks’ documentation for examples and instructions on working with complex data structures.
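Because nested data arrives as JSON text in a STRING column, you can parse it at query time with Spark SQL’s JSON functions. A sketch assuming a hypothetical orders table with a shipping_address column that contains JSON:

```sql
-- Extract a single field from a JSON-typed STRING column.
SELECT
  id,
  get_json_object(shipping_address, '$.city') AS shipping_city
FROM example_schema.orders;

-- Or parse the whole object into a struct with an explicit schema.
SELECT
  id,
  from_json(shipping_address, 'city STRING, postal_code STRING') AS address
FROM example_schema.orders;
```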
Column names
Column names in Databricks Delta Lake (AWS):
- Must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_)
- Must begin with a letter or an underscore
- Must not exceed the maximum length of 122 characters. Columns that exceed this limit will be rejected by Databricks Delta Lake (AWS).
- Must not be prefixed or suffixed with any of Stitch’s reserved keyword prefixes or suffixes
Stitch will perform the following transformations to ensure column names adhere to the rules imposed by Databricks Delta Lake (AWS):
Transformation | Source column | Destination column
Convert uppercase and mixed case to lowercase | CUSTOMERID or cUsTomErId | customerid
Convert spaces to underscores | customer id | customer_id
Convert special characters to underscores | customer#id or !customerid | customer_id and _customerid
Prepend an underscore to names with leading numeric characters | 4customerid | _4customerid
Timezones
Databricks Delta Lake (AWS) will store timestamp values as TIMESTAMP WITH TIMEZONE. In Databricks Delta Lake (AWS), this data is stored with timezone information and expressed as UTC.
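In practice, this means source values with a timezone offset are normalized to UTC. The query below isn’t what Stitch runs; it is just a quick Spark SQL illustration of the offset math, using to_utc_timestamp:

```sql
-- A noon value in America/New_York (UTC-4 in June) resolves to 16:00 UTC.
SELECT to_utc_timestamp('2023-06-01 12:00:00', 'America/New_York') AS stored_utc;
```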
Compare destinations
Not sure if Databricks Delta Lake (AWS) is the destination for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.
Questions? Feedback?
Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.