Understanding Loading and Row Usage for Nested Data Structures

Important: Not applicable to all destinations and data types

Destinations: This article is applicable only to the following destinations, as they do not natively support nested data structures:
- Amazon Redshift
- Amazon S3 (CSV)
- Microsoft Azure Synapse Analytics
- Panoply
- PostgreSQL
PostgreSQL JSON & JSONB datatypes: The info in this article is not applicable to PostgreSQL JSON and JSONB data types from the PostgreSQL (v1) integration. These data types will be stored as strings in your destination. In the PostgreSQL (HP) (v2) integration, these data types are stored as JSON objects.

To understand how Stitch loads the data it receives, you need to know a little bit about JSON.

MongoDB and many SaaS integrations use nested structures, which means each attribute (or column) in a table could have its own set of attributes. Stitch is designed to deconstruct these nested structures into separate tables to easily query the data.

In this article, we’ll cover:

Understanding JSON data structures
Deconstruction of nested arrays
Connecting subtables to top level records
Impact on total row count
Tips for reducing your row count

Understanding JSON data structures

When Stitch pulls data from an integration, it’s pulling a series of JSON records. JSON records can contain structures called objects and arrays.

Objects

An object is an unordered set of name and value pairs; each set is called a property. Objects begin with a left curly bracket ( { ) and end with a right curly bracket ( } ).

{  
   "product_id":"5008798",
   "name":"Awesome Dino Shirt",
   "price":"5.99",
   "attributes":{               // object begins
      "color":"blue",
      "size":"large",
      "type":"clothing",
      "ounces":"5"
   }                           // object ends
}

When Stitch receives an object, the properties in the object are “flattened” into the table and columns are created. Columns created from object properties follow this naming convention: [object_name]__[property_name]

If a table were created for the object in the example above, the schema would look like this:

product_id	name	price	attributes__color	attributes__size	attributes__type	attributes__ounces
5008798	Awesome Dino Shirt	5.99	blue	large	clothing	5

Arrays

An array is an ordered collection of values. Values are separated by commas and can be a string (contained in double quotes), numbers, boolean, an object, or another array. Arrays begin with a left square bracket ( [ ) and end with a right square bracket ( ] ).

Here’s an example:

{
   "order_id":"1234",
   "customer_id":"100",
   "line_items":[            // array begins
      {
         "product_id":"5008798",
         "price":"5.99"
      }
   ]                        // array ends
}

When Stitch receives a nested array - or an array that’s inside a JSON record - like the one above, it will “denest” it from the parent structure and create a subtable.

Deconstruction of nested arrays

To give you a better understanding of how Stitch denests arrays, we’ll walk you through an example using a Shopify order record. In this example, the order record is composed of three parts:

Core order data
Line items
Tax lines

Here’s what the JSON for the Shopify order looks like:

{
   "order_id":"1234",
   "created_at":"2015-01-01 00:00:00",
   "customer_id":"100",
   "line_items":[                         // line item record begins
      {
         "product_id":"5008798",
         "price":"5.99",
         "quantity":"1",
         "tax_lines":[                    // tax line record begins
            {
               "price":"5.99",
               "rate":"0.06",
               "title":"State Tax"
            }
         ]                               // tax line record ends
      }
   ]                                    // line item record ends
}

This record contains three levels of data due to the nested arrays. Stitch will denest the arrays from the top level record - in this case, the core order info - and create subtables.

From this one order record, three tables will be created:

Table name	Description
orders	This table contains the core order data: order ID, created timestamp, and customer ID.
orders__line_items	This table contains the line item info: product ID, price, and quantity.
orders__line_items__tax_lines	This table contains the tax line info: price, rate, and title.

Connecting subtables to top level records

When subtables are created, Stitch will append a few columns to be used as composite keys that enable you to connect subrecords back to their parent. Let’s take a look at the schemas for each of the Shopify tables to get a better idea of how this works.

In this section:

Top level: Core order data
Second level: Line items
Third level: Tax lines

Top level: Core order data

This table contains the order record’s Primary Key, order_id.

order_id [pk]	created_at	customer
1234	2015-01-01 00:00:00	100

Second level: Line items

In addition to the attributes in the nested record - in this case, product ID, price, and quantity for line items - Stitch will add these columns to second level tables:

Column name Description

_sdc_source_key_[pk_column_name]

This contains the top level record’s Primary Key.

In your destination, [pk_column_name] will be replaced with the name of the Primary Key column for the top level record. In this example, the column would be _sdc_source_key_order_id.

_sdc_level_0_id

This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parent. This will auto-increment for each unique record in the table, beginning with 0. When used with the _sdc_source_key column, it creates a unique identifier for the row.

For the Shopify example, the value for the first line item record would be 0, the second 1, the third 2, and so on.

We recommend always joining the top level table to the nested table - this will allow you to avoid queries that may have outdated data.

Here’s what the orders__line_items table would look like if another line item were added to the order record:

_sdc_source_key_order_id	_sdc_level_0_id	product_id	price	quantity
1234	0	5008798	5.99	1
1234	1	3445689	10.99	1

If you wanted to return all line items for order number 1234, you’d run the following query:

SELECT *
  FROM orders__line_items li
 WHERE _sdc_source_key_order_id = 1234

Third level: Tax lines

In addition to the attributes in the nested record - in this case, price, rate, and title for tax lines - Stitch will add these columns to third level tables:

Column name	Description
_sdc_source_key_[pk_column_name]	This contains the top level record’s Primary Key. In your destination, `[pk_column_name]` will be replaced with the name of the Primary Key column for the top level record. In this example, the column would be `_sdc_source_key_order_id`.
_sdc_level_0_id	This is the foreign key for the second level (`orders__line_items`) table. Combined with the source key (`_sdc_source_key_order_id`), it can be used to find the parent.
_sdc_level_1_id	This forms part of a composite primary key for this row and can be used to associate further down the line nested rows to this parent. For the Shopify example, the first tax line record would be 0, the second 1, the third 2, and so on.

Here’s what the orders__line_items__tax_lines table would look like if another tax line were added to the order record:

_sdc_source_key_order_id	_sdc_level_0_id	_sdc_level_1_id	price	rate	title
1234	0	0	5.99	.06	State Tax
1234	1	0	10.99	.06	State Tax

If you wanted to return all line items and tax lines for order number 1234, you’d run the following query:

SELECT *
      FROM orders__line_items li
INNER JOIN orders__line_items__tax_lines tl
        ON tl.__sdc_level_0_id = li.__sdc_level_0_id
       AND tl.__sdc_source_key_order_id = li.__sdc_source_key_order_id
     WHERE _sdc_source_key_order_id = 1234

Impact on total row count

Because Stitch is built to denest nested arrays into separate tables, you can expect to see more rows in Stitch and in your destination than what’s in the source itself.

To sum it up: row count in original data source ≠ the row count reflected in the Stitch app or your destination.

Consider the Shopify example. Order 1234 isn’t just a single row in the destination. Because Stitch had to denest subrecords and create tables to accommodate these records, you can expect to see more than just one row for order 1234 moving through Stitch.

From the top level record, there’s the row in the orders table:

order_id [pk]	created_at	customer
1234	2015-01-01 00:00:00	100

From the second level record, we have the rows in the orders__line_items table:

_sdc_source_key_order_id	_sdc_level_0_id	product_id	price	quantity
1234	0	5008798	5.99	1
1234	1	3445689	10.99	1

From the third level record, we have the rows in the orders__line_items__tax_lines table:

_sdc_source_key_order_id	_sdc_level_0_id	_sdc_level_1_id	price	rate	title
1234	0	0	5.99	.06	State Tax
1234	1	0	10.99	.06	State Tax

In total, Stitch will count each of these rows (a total of 5) towards your row count. So while there is only one record in the Shopify data source, inside of Stitch you’ll see 5 replicated rows and there’ll be 5 rows in your destination.

Tips for reducing your row count

Understanding how Stitch handles nested data structures will in turn give you a deeper understanding of how your data is structured once it gets to your destination. While this knowledge will help comprehending the data’s structure, what about applying it to how many rows you’re using? How can you plan ahead?

For starters, check out the Reducing Your Row Count guide for detailed tips and gotchas on how to keep your row count in line.
Take some time to learn about how the data for your integrations is structured and how it’s replicated.

Check out the SaaS Schemas section for detailed info on Replication Methods and the tables you can expect to see for each integration. Every article links to that service’s API documentation, which is the best way to learn about the data for that integration is structured.
Set your integrations to replicate less frequently, if feasible.

Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.