In this post I want to cover some things that happen internally when you do updates and deletes with Azure Synapse Link for SQL Server 2022 whilst it is running.
Because recently somebody asked if Azure Synapse Link for SQL Server 2022 captures updates and deletes after they had read a previous post. Where I covered my initial tests for Azure Synapse Link for SQL Server 2022.
Anyway, short answer is that Azure Synapse Link for SQL Server 2022 captures updates and deletes. In this post I will go into more detail about some of the things that appear to happen along the way.
Because if you read the official documentation Azure Synapse Link for SQL Server 2022 uses Azure Synapse Link for SQL change feed. However, I still had some questions about it.
For example, what happens to the data after it leaves SQL Server? How does it handle the files that contain the data in the Azure Data Lake Storage Gen2 storage account? Plus, how are the file contents used for the dedicated SQL Pool?
This post aims to answer some of those questions. So that you get a better idea of the internals behind the scenes.
To get a recap about Azure Synapse Link for SQL Server 2022 itself I recommend reading my other post. Which covers my initial tests for Azure Synapse Link for SQL Server 2022.
Azure Synapse Link for SQL Server 2022 recently came out of preview and is now generally available.
Updates with Azure Synapse Link for SQL Server 2022
For this post I used a copy of the WideWorldImporters sample database. With an Azure Synapse link configured for the Sales.OrderLines and Sales.Orders tables.
I focus on the Sales.Orders table in this post. Which contains 73,595 rows and a column called OrderID as its primary key.
First of all, I update the InternalComments column in every single row the Sales.Orders table. Using the below code.
update [Sales].[Orders]
set InternalComments = 'Before Kevin took over'
I then went into Synapse Studio and ran the following code in my dedicated SQL Pool.
select count(*)
from Sales.Orders
where InternalComments = 'Before Kevin took over'
It returned 73,595 rows. Which was great, because it meant that it worked as expected. Plus, the update happened fast. But what happened in-between?
Azure Synapse Link for SQL Server 2022 landing zone
I browsed the container within the Azure Data Lake storage account that had been configured as the landing zone. Which contains the files relating to the changes done in the SQL Server database.
I decided to go through the below subfolders in the container to get a better idea of what was going on.
ChangeData subfolder
I noticed in the ChangeData subfolder that a new comma separated value (csv) file had appeared.
I downloaded the csv file and opened it. I must admit that I was surprised to see that it contained all the columns for the table. Instead of just the primary key and the column that I had updated.
Manifests subfolder
However, I was curious about what else happened in the storage account for this update. So, I went back to the Manifests subfolder and opened up the Manifest_1.json file that was in there.
Within that file I could see that it had logged various events. What interested me the most was the last entry for the new update as shown below.
{"operation":"dml_only","commitTime":"2022-08-29T17:53:22.883","fromLsn":"0000027c00012e9800ab","toLsn":"0000027c00012e9800ab","format":"csv","lastTransactionCommitTime":"","tables":{"6b14fb1f-5846-4485-861f-4623f2e986fb":{"version":"0000027c000065d00002","batch":"c2c6113e.csv"}}}
Those with a SQL Server background might have noticed the references to the LSN entries. Which I suspect refers to the transaction log for the SQL Server database. In addition, the name of the new csv file in the ChangeData folder was there.
indexWorkingDir subfolder
I then navigated back to the indexWorkingDir subfolder. From there I navigated to a subfolder called insert.data which contained a csv file with a longer name. Which appeared to be slightly smaller than the csv file in the ChangeData subfolder.
I downloaded the file to view its contents and that also appeared to contain every single column in the table.
Out of curiosity I then checked the delete.index subfolder and spotted something interesting. Because it appears to contain a csv file with all the primary key values in it.
It got me thinking. Did this mean that for updates Azure Synapse link deletes the existing rows and then replaces them?
SQL requests for updates
To find out the answer I went back into Synapse Studio and selected the Monitor hub. From there, I then went to Activities and selected SQL requests.
It would seem that is exactly how the data is updated in the dedicated SQL Pool. All the primary key values are inserted into a temporary table which is then used as a join for a delete statement. Like in the below example.
DELETE FROM [Sales].[#source_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_delete];COPY INTO [Sales].[#source_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_delete]
FROM 'https://{STORAGEACCOUNT}.dfs.core.windows.net/landingzone/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/indexWorkingDir/00000xxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/delete.index'
WITH (FILE_TYPE = 'CSV', FIELDQUOTE = '0X22', FIELDTERMINATOR=',', FILE_SPLIT = 'OFF', CREDENTIAL = (IDENTITY='Managed Identity')
);DELETE [Sales].[Orders]
FROM [Sales].[Orders]
INNER JOIN [Sales].[#source_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_delete]
ON ([Sales].[Orders].[OrderID] = [Sales].[#source_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Afterwards, all the updated rows are inserted again from the csv file that contains all the updated rows.
I would suggest bearing this in mind if you are performing a large number of updates. However, the load time from the files to the dedicated SQL Pool does appear to be fast. I suspect this is due to the architecture of the dedicated SQL Pool.
Updating ten percent of rows
I wondered what would happen if I only updated ten percent of the tables. So, I ran the below T-SQL in the SQL Server database.
update top (10) percent [Sales].[Orders]
set InternalComments = 'Top 10 percent'
I then checked the dedicated SQL Pool and confirmed that the same number of rows had been updated.
Afterwards, I checked the ChangeData folder in the storage account. Where a new file had appeared.
I checked the Manifest_1.json file and a new row had been added to reflect the new update.
Afterwards, I went back into the indexWorkingDir folder. Where there was a new subfolder.
So, I went into the new subfolder and then into its insert.data subfolder. Where there was another new csv file. It only contained updated rows.
So, when updates are performed it appears that the three main things below occur in the storage container:
- The Manifest_1.json file gets updated. Which contains the LSN details and the name of the new file.
- A new csv file gets created in the relevant ChangeData subfolder for the table. Containing the updated rows.
- A new subfolder gets created in the indexWorkingDir folder. Containing two subfolders.
One of which is a delete.index folder that contains a csv containing the primary key values of the rows that had changed.
The other one created is an insert.data folder. Which contains a csv file that has all the updated rows inside it.
In addition, both subfolders contain a _SUCCESS file, which appears to be empty.
However, it is still recommended to only update the essentials in your SQL Server database.
Deletes with Azure Synapse Link for SQL Server 2022
To test what happens with deletes I thought I would delete the top ten percent that I had updated previously first. However, I must admit that I had to remove a couple foreign keys in the database before doing this.
After removing the foreign keys I deleted only those rows and then checked to make sure that the rows were no longer in the dedicated SQL Pool.
I then checked the Manifest_1.json file and noticed that a new row had appeared as below.
{"operation":"dml_only","commitTime":"2022-08-29T20:29:47.440","fromLsn":"0000027c000166b80142","toLsn":"0000027c000166b80142","format":"csv","lastTransactionCommitTime":"","tables":{"6b14fb1f-5846-4485-861f-4623f2e986fb":{"version":"0000027c000065d00002","batch":"de0782d9.csv"}}}
I then checked the ChangeData folder and saw that a new file had been created. Which of course I had to open. It contained the primary key values for all the rows that had to be deleted.
I then checked the insert.data folder expecting an empty folder. However, it contained an empty csv file.
I checked the SQL requests in Synapse Studio again to see what was happening inside the dedicated SQL Pool.
I noticed that the primary key values are inserted into a temporary table which is then used as a join in a delete statement. Like in the below example.
DELETE [Sales].[Orders]
FROM [Sales].[Orders]
INNER JOIN [Sales].[#source_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_delete]
ON ([Sales].[Orders].[OrderID] = [Sales].[#source_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_delete].[OrderID])
I stopped the testing of the deletes at that point. Due to the fact it is appears fairly obvious how the mechanism operates at this moment in time.
Final words
I hope my post about updates and deletes with Azure Synapse Link for SQL Server 2022 has been an interesting insight.
Because I had some questions about what happened internally and wanted to share what I had discovered.
Of course, if you have any comments or queries about this post feel free to reach out to me.
[…] Kevin Chant changes some data: […]