Delta Sharing: How to save on egress costs using cached data and incremental Change Data Feeds?

Amr Ali
6 min read · Apr 19, 2023



Data sharing is becoming increasingly vital to the modern data stack, enabling teams to collaborate, analyze, and derive insights from shared data. One of the best data-sharing options is Delta Sharing, an open-source protocol that lets you share data seamlessly between different parties. However, data transfer costs, particularly egress costs from cloud storage, can quickly add up and become a significant expense. To address this challenge, we will explore three approaches you can use with Delta Sharing and discuss how each helps manage data-sharing costs.

Approach 1: Remotely Linked Tables

This approach creates local table definitions on the recipient side that point to the shared tables, while the table data itself stays on the provider side (hence the name remotely linked tables). Because queries read directly from the provider tables, recipients always see the latest available data. This is particularly useful when working with frequently changing data or when multiple users need simultaneous access to the same data.

The benefit of using remotely linked tables is that data recipients can work with the shared data directly from their Databricks workspace without worrying about outdated or stale data. The downside is that working directly with the provider data might incur egress costs on the provider side each time the recipient performs a query.

This approach works well when the data size is small (less than 100 GB).


To implement this approach, the recipient is required to perform the following steps:

  1. Download the share profile file from the data provider, save it to DBFS, and run the code below:
CREATE TABLE <mycatalog.mydatabase.mytable> 
USING deltaSharing
LOCATION '<profile-file-path-on-dbfs>#<share-name>.<schema-name>.<table-name>';

If you are sharing data using the Databricks-to-Databricks approach, you can create the catalog using the statement below (note that you need to request the provider sharing identifier and the share name from the provider):

CREATE CATALOG <mycatalog>
USING SHARE `<provider-sharing-identifier>`.<share-name>;

2. Once you run this code, <mycatalog> will contain all tables in the share, each linked to the corresponding table on the provider side.
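
Once created, a remotely linked table is queried like any local table; just remember that each query reads from the provider's storage. A minimal sketch, using the placeholder names from the statement above:

-- Reads the latest provider data at query time (and may incur provider-side egress)
SELECT * FROM mycatalog.mydatabase.mytable LIMIT 10;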

Approach 2: Managed Full Replica Tables

Managed full replica tables are exact copies of the shared tables, created on the recipient side and fully refreshed periodically. Any changes made to the provider tables, such as inserts, updates, and deletes, are reflected on the recipient side when the next full refresh runs. Managed full replica tables are handy when data recipients want the best query performance, since the data now sits close to the compute, and working on cached data is acceptable.

Another advantage of using managed full replica tables is that data recipients can work with the data locally in their workspace without the provider incurring egress costs every time the recipient queries the data. The replicated data is readily available for analysis and processing, enabling efficient data operations and eliminating the need for repeated data transfers.

This approach works well when the data size is medium (between 100 GB and 500 GB).


To implement this approach, the recipient is required to perform the following steps:

  1. Like the first approach, recipients will mount the tables as remotely linked tables. However, instead of allowing end users to query these tables directly, the admin will create a Databricks job that runs at specific intervals and copies the data from these tables into tables in a local database. Copying the data can be done with a statement like this (CREATE OR REPLACE lets the job rerun the full refresh):
CREATE OR REPLACE TABLE catalog1.database1.local_table
AS SELECT * FROM catalog2.remote_database.remote_table;

2. The recipient will use Unity Catalog permissions to restrict access to the remotely linked tables and open access to the managed local tables (see the sketch after this list).

3. Create a Databricks job and run this code at an interval that works best for the business use case.
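
As a minimal sketch of the permissions in step 2, assuming a hypothetical group named `analysts` and the table names used above:

-- End users read the local replica; direct access to the remotely linked table is revoked
GRANT SELECT ON TABLE catalog1.database1.local_table TO `analysts`;
REVOKE ALL PRIVILEGES ON TABLE catalog2.remote_database.remote_table FROM `analysts`;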

Approach 3: Managed Incremental Replica Tables

Managed incremental replica tables work like managed full replica tables, except that only the changes from source to target are applied when a periodic refresh is due. The changes are read from the source table using Delta Change Data Feed (CDF) and, using primary keys on the target table, MERGEd into the cached tables. In that scenario, all inserts, updates, and deletes are incrementally replicated to the recipient tables. Managed incremental replica tables are handy when data recipients are, for performance reasons, OK with working on cached data for a while.

Another advantage of managed incremental replica tables is that only the changed data needs to be transferred, significantly reducing egress costs at the provider's end. This allows efficient synchronization of just the changed data, minimizing data transfer costs and reducing the time needed to move data from provider to recipient.

This approach works well when the data size is large (more than 500 GB).

To implement this approach, some steps must be performed by both the provider and the recipients.

First, the data provider needs to share the tables with Change Data Feed enabled and history sharing turned on:

ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
ALTER SHARE share ADD TABLE table1 WITH HISTORY;
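
Optionally, the provider can verify that the property is set before sharing:

-- Optional sanity check on the provider side
SHOW TBLPROPERTIES table1 ('delta.enableChangeDataFeed');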

Second, the recipient is required to complete the following steps:

  1. Like the first approach, recipients will mount the tables as remotely linked tables. However, the recipient will copy the data from these tables into tables in a local database only once, as an initial load, like the second approach.
  2. The recipient will then create a Databricks job that runs at specific intervals, reads table changes from the remotely linked table, and applies them to the target table using the MERGE INTO command (a sketch follows this list). Follow the Databricks documentation on applying Change Data Feed changes with MERGE INTO to use it correctly.
  3. Recipients will use Unity Catalog permissions to restrict access to the remotely linked tables and open access to the incrementally cached tables.
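
The following is a minimal sketch of the refresh logic in step 2. All names are hypothetical: catalog2.remote_database.remote_table is the remotely linked source, catalog1.database1.local_table is the cached target, key1 is the primary key, col1 stands in for the data columns, and 11 is the first table version not yet applied (the job would track this between runs):

-- Keep only the latest change per key, ignoring pre-update images
CREATE OR REPLACE TEMP VIEW latest_changes AS
SELECT key1, col1, _change_type
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY _commit_version DESC) AS rn
  FROM table_changes('catalog2.remote_database.remote_table', 11)
  WHERE _change_type != 'update_preimage'
) c
WHERE rn = 1;

-- Apply inserts, updates, and deletes to the cached table
MERGE INTO catalog1.database1.local_table AS t
USING latest_changes AS s
ON t.key1 = s.key1
WHEN MATCHED AND s._change_type = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.col1 = s.col1
WHEN NOT MATCHED AND s._change_type != 'delete' THEN INSERT (key1, col1) VALUES (s.key1, s.col1);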

Summary

In this post, I presented three approaches to working with Delta Sharing. In summary:

- Remotely Linked Tables: best for small data (less than 100 GB); recipients always query the latest provider data, but the provider may incur egress costs on every query.
- Managed Full Replica Tables: best for medium data (100 GB to 500 GB); a periodic job fully refreshes a local copy, so egress is incurred only on each refresh.
- Managed Incremental Replica Tables: best for large data (more than 500 GB); periodic refreshes transfer only the changes via Change Data Feed, minimizing egress costs.

Implementation

The three approaches have been implemented as an open-source project (Databricks Delta Sharing Utils) on GitHub that you can try. Examples of these approaches are below.

Remotely Linked Tables

# create a DeltaShareRecipient instance and point it to the share file location
dsr = DeltaShareRecipient(
    share_profile_file_loc='https://databricks-datasets-oregon.s3-us-west-2.amazonaws.com/delta-sharing/share/open-datasets.share'
)
# this will display a list of all shares, and what tables are shared
dsr.discover()
# this will mount the table in the local catalog; however, the data stays at the provider side
dsr.create_remotly_linked_tables("my_share")

Managed Full Replica

# create a DeltaShareRecipient instance and point it to the share file location
dsr = DeltaShareRecipient(
    share_profile_file_loc='https://databricks-datasets-oregon.s3-us-west-2.amazonaws.com/delta-sharing/share/open-datasets.share'
)
# this will display a list of all shares, and what tables are shared
dsr.discover()
# this will cache all the data at the recipient side
dsr.create_fully_cached_tables("my_share")

Managed Incremental Replica

# create a DeltaShareRecipient instance and point it to the share file location
dsr = DeltaShareRecipient(
    share_profile_file_loc='https://databricks-datasets-oregon.s3-us-west-2.amazonaws.com/delta-sharing/share/open-datasets.share'
)
# this will display a list of all shares, and what tables are shared
dsr.discover()
# this will start syncing the data from the provider to the recipient incrementally
dsr.create_incrementally_cached_tables("my_share", primary_keys={'table1':'key1, key2'})

For the last two examples, create a Databricks job with the code above, and run on the interval that suits the business case. You do not need to run the first example in a job, as tables are linked at creation time.

Try out Databricks Delta Sharing Utils on Databricks today and streamline your data-sharing process while saving on egress costs!

Disclaimer

Delta Sharing Utils is provided as-is and is not officially supported by Databricks through customer technical support channels. Support, questions, and feature requests can be communicated through the Issues page of the provided git repo. Issues with this code will not be answered or investigated by Databricks Support.


Amr Ali

Cloud Solutions Architect with an interest in Big Data Analytics, Machine Learning and AI.