Merge vs. Apply Changes in Databricks

Merge vs. Apply Changes in Databricks

How to effectively use Apply Changes for Change Data Capture

In SQL, the MERGE statement is a familiar tool in the toolkit of any data specialist, frequently employed for managing Change Data Capture (CDC). Unsurprisingly, the power of MERGE INTO extends into the Databricks environment. However, the use of MERGE for CDC data presents its own set of challenges.

Imagine this scenario: since our last check of the CDC feed in the original database, a single record may have undergone multiple changes. Managing these changes in the correct order becomes a crucial consideration. Furthermore, crafting the logic to handle distinct operation types, such as INSERT, UPDATE, or DELETE, falls squarely on our shoulders.

This is where APPLY CHANGES steps in to simplify the process!

MERGE redefined: APPLY CHANGES

Luckily, within the context of Delta Live Tables (DLT) there’s a much simpler way of doing what’s essentially a complicated MERGE. The folks at Databricks gave it a different name, APPLY CHANGES, to distinguish it from a regular MERGE.

In a nutshell, APPLY CHANGES is a declarative way in Databricks Delta Live Tables that simplifies handling CDC by applying changes from a source table to a target table. It’s worth noting, however, that there are subtle differentiators setting APPLY CHANGES apart from the conventional MERGE statement.

Comparing MERGE with APPLY CHANGES

While MERGE serves as a versatile tool for merging or updating data across tables, APPLY CHANGES is specifically tailored for use within Delta Live Table (DLT) pipelines. It’s important to note that the source table can be a regular Delta table, but the target table must be a Delta Live table.

What sets APPLY CHANGES apart is its specialized capability to handle CDC data seamlessly. This means not only accommodating updates but also gracefully managing multiple updates to the same row, even when they occur out of order—a truly noteworthy feat. We’ll delve deeper into this impressive functionality shortly.

APPLY CHANGES syntax

Upon browsing through the documentation of APPLY CHANGES, we notice the syntax is somewhat familiar to MERGE, yet still different enough to warrent further exploration. In the above docs, the following example command is given. Let’s go over each line to get a better understanding of what’s happening.

APPLY CHANGES INTO
  live.target
FROM
  stream(cdc_data.users)
KEYS
  (userId)
APPLY AS DELETE WHEN
  operation = "DELETE"
APPLY AS TRUNCATE WHEN
  operation = "TRUNCATE"
SEQUENCE BY
  sequenceNum
COLUMNS * EXCEPT
  (operation, sequenceNum)
STORED AS
  SCD TYPE 2

Naming

Pretty obvious, but the naming is different. The folks at Databricks deliberately named it different to distinguish between a regular MERGE and APPLY CHANGES.

Matching using KEYS

The KEYS keyword is used to define the columns for matching. The expectation is a list of at least one column that must exist in both the source and target tables. By comparison, in a MERGE command, a straightforward JOIN-like condition can be stated, where the key(s) to join on are specified.

Interestingly, APPLY CHANGES introduces a separation of the keys and an optional filter by having a distinct WHERE clause, separate from the KEYS. This extra WHERE caues the update only to be applied when the specified condition is satisfied.

Deleting records using APPLY AS DELETE

When working with CDC data, it’s common to encounter a column in the incoming data that signifies the type of operation performed in the source database — commonly INSERT, UPDATE, or DELETE. Here we specify the conditions under which we want to apply a DELETE.

Note that in the APPLY CHANGES statement, the default behavior is an UPSERT. This eliminates the need for explicit specification of INSERT or UPDATE statements, improving readability.

Ordering mutations using SEQUENCE BY

The SEQUENCE BY keyword is a powerful tool in APPLY CHANGES that handles scenarios where multiple edits, such as updates or deletes, are associated with the same primary key or a combination of specified columns in the target table. SEQUENCE BY only expects a column that determines the order in which to handle multiple updates (usually a timestamp of the mutation). APPLY CHANGES then handles the rest under the hood.

Selecting which COLUMNS to INSERT or UPDATE

Rather than explicitly listing all columns in separate UPDATE or INSERT statements, users can efficiently specify the target columns with the COLUMNS keyword. Moreover, this approach allows for a reverse scenario where users articulate the columns they prefer not to update using COLUMNS * EXCEPT.

Creating a target table

A pivotal step involves the creation of a target table. This is accomplished through the straightforward command CREATE STREAMING TABLE <table-name>. This command operates without the need for a predefined schema. While its precise functionality may not be explicitly outlined in the documentation, it’s reasonable to assume that it verifies the existence of a Delta Live table first, before creating one. Note that failing to include this command in your workflow results in a pipeline failure.


© 2024. All rights reserved.

Powered by Hydejack v9.1.6