In this post I cover my initial testing of Mirroring Azure Cosmos DB Databases in Microsoft Fabric.
I wanted to write this post for various reasons, including the fact that it was announced during the Microsoft Fabric Community Conference that Mirroring is now in Public Preview. This means that you can now mirror data from Azure SQL Database, Azure Cosmos DB and Snowflake into your own Microsoft Fabric tenant, even trial tenants.
Of course, these features do come with some limitations. For example, you currently cannot mirror Azure SQL Databases that are in another tenant.
Anyway, you can read more about the Public Preview announcement in a post by Charles Webb.
By the end of this post, you will see the results of my initial tests. Plus, I share some additional tips towards the end and plenty of links along the way.
Initial testing of Mirroring an Azure Cosmos DB Database in Microsoft Fabric with a sample database
First I went through the prerequisites and enabled mirroring in my Microsoft Fabric tenant.
Afterwards, I created an Azure Cosmos DB account and made sure that I applied the recommended mirroring settings to it.
After my Cosmos DB account had been created I went to the ‘Data Explorer’ section. From there, I selected ‘Launch quick start’. I then followed the wizard to create the sample database.
Once the sample database had been created, I went to Microsoft Fabric to configure Mirroring. First I went to add a new Mirrored Azure Cosmos DB Database in the workspace, as below.
From there, I entered the connectivity details for the Azure Cosmos DB account.
On the screen that followed, I selected my sample database and then clicked the ‘Connect’ button, which took me to the mirroring item. I then clicked on the ‘Monitor replication’ button to check that the replication had completed.
When you deploy a mirrored database, three objects are created, as below:
- Mirrored database
- Semantic Model
- SQL analytics endpoint
In reality, I could have queried the SQL analytics endpoint once the replication had completed.
However, to show what else can be done, I wanted to check that I could add the table as a shortcut in a Lakehouse, as per the guide on how to access mirrored Azure Cosmos DB data in Lakehouse and notebooks from Microsoft Fabric.
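As a side note, the SQL analytics endpoint is a regular TDS endpoint, so you can also query it from code rather than the Fabric portal. Below is a minimal C# sketch using Microsoft.Data.SqlClient; the server, database and table names are placeholders, so replace them with the details from the SQL connection string shown in your own workspace.

```csharp
using Microsoft.Data.SqlClient; // NuGet package: Microsoft.Data.SqlClient

// Placeholder connection details: copy the real SQL connection string
// from the SQL analytics endpoint in your Microsoft Fabric workspace.
const string connectionString =
    "Server=<your-endpoint>.datawarehouse.fabric.microsoft.com;" +
    "Database=<your-mirrored-database>;" +
    "Authentication=Active Directory Interactive;Encrypt=True;";

using var connection = new SqlConnection(connectionString);
connection.Open();

// Plain T-SQL works against the mirrored table (table name is a placeholder).
using var command = new SqlCommand("SELECT COUNT(*) FROM [dbo].[SampleData]", connection);
var rowCount = (int)command.ExecuteScalar();
Console.WriteLine($"Rows in mirrored table: {rowCount}");
```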
Bulk import testing for Mirroring an Azure Cosmos DB Database in Microsoft Fabric
Once I had tested the initial sample database I decided to perform bulk insert tests, utilizing the publicly available cosmos-dotnet-bulk-import-throughput-optimizer bulk import program.
With this in mind, I cloned the program locally, added the connection information to ‘Program.cs’ and ran it once before correcting the throughput limit in Azure Cosmos DB. I did this so that it would create the database for me first, allowing me to configure the mirroring in Microsoft Fabric before running the bulk insert.
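For reference, the connection information in ‘Program.cs’ is just a set of constants near the top of the file. At the time of writing they looked roughly like the below; the endpoint and key values shown here are placeholders.

```csharp
// Excerpt from Program.cs in the cosmos-dotnet-bulk-import-throughput-optimizer
// sample (the values below are placeholders, not real credentials).
private const string EndpointUrl = "https://<your-account>.documents.azure.com:443/";
private const string AuthorizationKey = "<your-account-key>";
private const string DatabaseName = "db";
private const string ContainerName = "items";
private const int AmountToInsert = 300000;
```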
Afterwards, I went into ‘Cost Management’ for Azure Cosmos DB and amended my throughput limit so that the program would work.
I then went into Microsoft Fabric to configure the replication for the new database. However, this time the setup in Microsoft Fabric was slightly different, because I was able to select the existing connection and then choose the new database.
Anyway, once configured I ran the bulk insert program from my laptop by running the ‘dotnet run’ command. I kept an eye on the progress of the replication whilst the inserts were happening, and I was impressed with the latency.
Once completed, I did similar checks to those above to ensure all was well. I then decided to take it one step further by testing the insert of one hundred million records.
Testing with one hundred million new records
First of all, I deployed a new virtual machine in Azure with a better specification than my laptop, to handle the larger amount of data I was generating. I opted for a DS series virtual machine with 64 GB of memory.
After the virtual machine had been deployed, I installed everything that was required to host the bulk import program.
I changed the ‘AmountToInsert’ value in the ‘Program.cs’ file to one hundred million. In addition, I increased the ‘CreateAsync’ value to a higher throughput value to cater for the additional data, and changed my throughput limits in Azure Cosmos DB again accordingly.
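To illustrate, the two edits in ‘Program.cs’ were along these lines. The throughput value passed to ‘CreateAsync’ below is only the sort of figure I experimented with, not a recommendation.

```csharp
// Excerpt from Program.cs: insert one hundred million documents...
private const int AmountToInsert = 100000000;

// ...and provision the container with a higher throughput (RU/s) by
// increasing the value passed to CreateAsync() in the container definition.
Container container = await database.DefineContainer(ContainerName, "/pk")
        .WithIndexingPolicy()
            .WithIndexingMode(IndexingMode.Consistent)
            .WithIncludedPaths()
                .Attach()
            .WithExcludedPaths()
                .Path("/*")
                .Attach()
        .Attach()
    .CreateAsync(100000); // illustrative RU/s value only
```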
Once that was done, I started the program. It completed fine. However, I did notice that the replication delay was more noticeable this time.
I was still impressed, because it was a lot faster than the file tests for Azure Synapse Link that I performed.
Once completed I checked that all the records had been inserted into Azure Cosmos DB.
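One quick way to verify this is the ‘SELECT VALUE COUNT(1) FROM c’ query mentioned in the tips later in this post, either in Data Explorer or from code. A minimal sketch with the .NET SDK might look like the below; the account, key, database and container names are placeholders.

```csharp
using Microsoft.Azure.Cosmos; // NuGet package: Microsoft.Azure.Cosmos

// Placeholder endpoint and key: use your own account values.
using CosmosClient client = new(
    "https://<your-account>.documents.azure.com:443/",
    "<your-account-key>");

// The bulk import sample uses a 'db' database with an 'items' container.
Container container = client.GetContainer("db", "items");

// VALUE COUNT(1) returns a single number rather than a document.
using FeedIterator<long> iterator = container.GetItemQueryIterator<long>(
    new QueryDefinition("SELECT VALUE COUNT(1) FROM c"));

while (iterator.HasMoreResults)
{
    foreach (long count in await iterator.ReadNextAsync())
    {
        Console.WriteLine($"Items in container: {count}");
    }
}
```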
I then went to add the table into the Lakehouse as a shortcut as per the guide on how to access mirrored Azure Cosmos DB data in Lakehouse and notebooks from Microsoft Fabric.
Unidentified table
Things got interesting at this point, because the table could not be identified, even though the files were clearly there and the column names were fine. Instead, the Lakehouse Explorer recommended that I move the shortcut to the Files section.
To make things more interesting, when I selected the SQL analytics endpoint the shortcut was correctly recognized as a table. When I queried the number of rows for the table it returned one hundred million rows.
In reality, I could probably still query the table within the Lakehouse with code. Either that, or move the shortcut over to the Files location in the Lakehouse and attempt to resolve the issue there.
However, I suspect that since the table appears in the SQL analytics endpoint there must be an issue with a mechanism somewhere within the Lakehouse. For now, I will leave it where it is.
Additional tips about Mirroring Azure Cosmos DB Databases in Microsoft Fabric
Before I conclude this post, I want to provide some additional tips about Mirroring Azure Cosmos DB Databases in Microsoft Fabric.
- If you want to quickly view the number of items in a container within Azure Cosmos DB you can run the below query:

  SELECT VALUE COUNT(1) FROM c

- You can quickly remove all the existing data in the database used by the bulk insert program within the Azure Portal. To do this, navigate to your Azure Cosmos DB account, go into ‘Data Explorer’, right-click ‘items’ and select delete. When prompted for the id value, simply enter ‘items’.
- You can delete the items container within Azure Cosmos DB whilst the connection for mirroring is still active. It will get recreated when you restart the bulk insert program.
- It appears that currently the ‘Total rows replicated’ value shown in the ‘Monitor mirrored database’ window is cumulative, which means that if you delete data in Azure Cosmos DB and then add new data whilst the connection is in place this value only ever increases.
- Just to make it clear, you can easily query the mirrored database with T-SQL by using the SQL analytics endpoint that gets deployed with it.
- You may have already realized this, but because the mirrored data is stored as a Delta table in Microsoft Fabric, you can create reports based on the mirrored data in multiple places, making it a powerful way to report on the original data sources.
- My final tip is that if you delete the Azure Cosmos DB account the data still exists in your Microsoft Fabric workspace, allowing you to utilize this as a migration strategy. I suspect the same applies for data mirrored from both Azure SQL Database and Snowflake.
Final words
I hope that sharing my thoughts about my initial testing of Mirroring Azure Cosmos DB Databases in Microsoft Fabric inspires some of you to look into the mirroring functionality yourselves.
Personally, I am impressed with this offering and I really like the experience.
I am looking forward to when Mirroring Azure SQL Databases works across tenants though.
Of course, if you have any comments or queries about this post feel free to reach out to me.