In this post I want to cover spreading your SQL Server wings with Azure Synapse Analytics, as part of a series of posts about spreading your SQL Server wings with the Microsoft Intelligent Data Platform.
You can find out more about this series of posts in an introductory post I published about the series.
In reality, there are a lot of features within Azure Synapse Analytics where your SQL Server background can prove to be useful.
By the end of this post, you will have a good overview of Azure Synapse Analytics and where your SQL Server background can prove to be useful. Plus, there are plenty of links included along the way.
About Azure Synapse Analytics
Azure Synapse Analytics is an integrated service that has been developed by Microsoft.
Its ambition is to provide data ingestion, big data analytics and data warehousing all under one roof, saving you from having to use multiple services.
For example, traditionally you could use Azure Data Factory to ingest data via pipelines. From there, you could use Azure Databricks to perform powerful analytics on that data using Spark clusters. Finally, you would keep a historic copy of the data in Azure SQL Data Warehouse.
Now, you can do all three within Azure Synapse Analytics.
Azure Synapse Analytics can ingest data by using its own pipeline functionality or by using Azure Data Factory. In fact, the two are so similar that I wrote a post about how you can copy an Azure Data Factory pipeline to Synapse Studio.
Synapse Studio
When you deploy Azure Synapse Analytics in Azure you create an environment known as a workspace. You can manage your workspace by using a front-end called Synapse Studio, which you can access through the Azure Portal.
You can use Synapse Studio to manage a lot of different aspects of your workspace, including your data and development needs. For example, you can manage your notebooks and SQL scripts there.
In the past I published a six-minute crash course about Synapse Studio, which you can use to find your way around it quickly.
You can also ingest data from a few other services in near-real time by using something called Azure Synapse Link.
As far as working with Big Data analytics is concerned, Azure Synapse Analytics comes with its own way to work with Apache Spark pools.
Finally, as far as long-term storage in a SQL Server-related way is concerned, Azure Synapse Analytics has a couple of interesting options called SQL Pools.
SQL Pools
I want to cover SQL Pools first, since I consider them relatively easy for those with a SQL Server background to adopt compared to other features.
Currently, there are two types of SQL Pools in Azure Synapse Analytics. To help with some jargon here, you can think of these two different pools as different types of compute.
As the name suggests, they can both support T-SQL syntax to a degree, which makes it easier for SQL Server professionals to adopt them.
As you will see below there are some interesting similarities and differences between the two.
One key point I want to highlight before I go any further is that you can connect to both using a variety of applications that are associated with SQL Server, including SQL Server Management Studio and Azure Data Studio.
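To give you an idea, the server names you connect to follow a predictable pattern. Below is a sketch assuming a hypothetical workspace called mysynapseworkspace:

```
Serverless SQL Pool (built-in):  mysynapseworkspace-ondemand.sql.azuresynapse.net
Dedicated SQL Pool:              mysynapseworkspace.sql.azuresynapse.net
```

You can paste either of these into the server name box in SQL Server Management Studio or Azure Data Studio and authenticate as you would for any other SQL endpoint.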
Spreading your SQL Server wings with Dedicated SQL Pools
The first type is known as dedicated SQL Pools, which used to be a separate service called Azure SQL Data Warehouse and is now integrated into Azure Synapse Analytics.
People tend to use this SQL pool for long-term storage of large volumes of data. It is popular with those with a SQL Server background because it can be managed in a SQL Server-related way.
It uses what is known as Massively Parallel Processing (MPP) to work with data. Basically, behind the scenes your data can be split across multiple nodes so that it can be processed faster.
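To illustrate, below is a sketch of how you can control that distribution yourself when creating a table in a dedicated SQL Pool. The table and column names are hypothetical:

```sql
-- Hypothetical fact table; rows are spread across distributions
-- by hashing the CustomerKey column
CREATE TABLE dbo.FactSales
(
    SaleKey     bigint NOT NULL,
    CustomerKey int NOT NULL,
    SaleAmount  decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```

Choosing a good distribution column matters here, because it decides how evenly the work is spread across the nodes.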
In reality, dedicated SQL Pools support a lot more T-SQL syntax than serverless SQL Pools, due to their infrastructure and the fact that they are a long-term storage solution.
Microsoft provides a detailed guide about Transact-SQL features supported in Azure Synapse SQL.
In reality, there are some tell-tale signs that dedicated SQL Pools are based on earlier incarnations of Microsoft’s data warehouse solutions.
For example, multiple DMVs appear to reference Microsoft’s Parallel Data Warehouse (PDW), as you can see in the below example.
-- Find the top 10 longest-running queries
-- Note the DMV name is slightly different from SQL Server
SELECT TOP 10 *
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;
One key point I want to highlight here is that you can only use Azure Synapse Link for SQL to load data into dedicated SQL Pools. It is something I have covered in multiple posts, and it happens to relate to another solution listed within the Microsoft Intelligent Data Platform.
Spreading your SQL Server wings with Serverless SQL Pools
The second type is known as serverless SQL Pools, which you can think of as serverless compute that can be spun up on demand to run T-SQL syntax.
Serverless supports less T-SQL syntax than dedicated SQL Pools, due to its requirements and infrastructure.
Personally, I would like more T-SQL syntax to be supported. For example, dynamic data masking.
Anyway, below is an example of some popular T-SQL syntax used within serverless SQL Pools, which creates an external table that reads one or more files:
-- First state the file format
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseParquetFormat')
CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat]
WITH ( FORMAT_TYPE = PARQUET)
GO
-- Then the data source location
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = '{REDACTED}_dfs_core_windows_net')
CREATE EXTERNAL DATA SOURCE [{REDACTED}_dfs_core_windows_net]
WITH (
LOCATION = 'abfss://{REDACTED}.dfs.core.windows.net'
)
GO
-- Finally create the external table
-- Which should look familiar to those who have used Polybase in SQL Server
CREATE EXTERNAL TABLE dbo.DemoDeltaTable (
[Name] nvarchar(4000),
[Present] nvarchar(4000)
)
WITH (
LOCATION = 'UnitedKingdom/Delta/Nice/**',
DATA_SOURCE = [{REDACTED}_dfs_core_windows_net],
FILE_FORMAT = [SynapseParquetFormat]
)
GO
-- Afterwards you can then query the table directly
SELECT TOP 100 * FROM dbo.DemoDeltaTable
GO
Serverless SQL pools can be used to analyse data stored in files within storage services or to create a logical data warehouse with the files that are stored.
Typically this is done by adopting something known as a Data Lakehouse architecture, a practice that is also encouraged in Azure Databricks.
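For example, you can query files ad hoc with OPENROWSET before deciding whether to build external tables on top of them. Below is a sketch of this; the storage account name and container path are hypothetical:

```sql
-- Hypothetical storage account and path; adjust to your own files
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://yourstorageaccount.dfs.core.windows.net/raw/UnitedKingdom/csv/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',  -- required for the HEADER_ROW option
    HEADER_ROW = TRUE
) AS [result];
```

This pay-per-query style of exploration is one of the big draws of serverless SQL Pools.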
Not many people realise this, but you can also work with statistics in serverless SQL Pools. Whenever people ask me for more details, I always point them to the Synapse Espresso video about the importance of statistics in serverless SQL Pools.
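As a sketch, you can create single-column statistics manually on the external table shown earlier. Serverless SQL Pools expect the FULLSCAN and NORECOMPUTE options here, and the statistics name below is hypothetical:

```sql
-- Manually create single-column statistics on the external table
-- created earlier in this post
CREATE STATISTICS demo_name_stats
ON dbo.DemoDeltaTable ([Name])
WITH FULLSCAN, NORECOMPUTE;
```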
Spark clusters
I could easily carry on covering all the great things SQL Server professionals can do with SQL Pools alone. However, I really should now cover using Apache Spark in Azure Synapse Analytics to work with Spark clusters.
To help with some jargon for those with a SQL Server background, you can think of Spark clusters as In-Memory OLTP spread across multiple machines. In reality, it is a bit more complex than that, but it gives you a good idea.
Within Synapse Studio you can develop code in notebooks to run on these clusters, similar to how you work with notebooks in Azure Data Studio.
When a lot of people think of working with Spark clusters and notebooks together they tend to think of Python.
However, there are other language options in Azure Synapse Analytics, including a version of SQL called Spark SQL, which you can see below.
If you have a SQL Server background, you can ease into working with Spark clusters by first writing Spark SQL code in notebooks, which can be very useful for basic tasks.
For example, you can use the below code to create a table based on the contents of multiple CSV files.
CREATE TABLE IF NOT EXISTS UnitedKingdomCSV (Name STRING, City STRING, Present STRING)
USING csv OPTIONS(
path = "abfss://raw@{REDACTED}.dfs.core.windows.net/UnitedKingdom/csv/*.csv",
header = "true"
)
You can even use it to create a table based on the contents of a Delta Lake.
CREATE TABLE IF NOT EXISTS UKPresentsExternal
(Name STRING, City STRING, Present STRING) USING DELTA
PARTITIONED BY (City)
LOCATION 'abfss://cleansed@{REDACTED}.dfs.core.windows.net/UnitedKingdom/Delta/Nice';
It might surprise some of you to know that a large range of other types of SQL statements work as well, including merge and delete statements.
You can view the supported SQL statements in the SQL reference for Apache Spark 3.3.0. For those wondering, I state that reference rather than the latest one because 3.3.0 is the latest version of Apache Spark currently supported in Azure Synapse Analytics.
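For example, below is a sketch of a merge into the Delta table created earlier. The source table UKPresentsUpdates is hypothetical:

```sql
-- Hypothetical source table containing updated rows
MERGE INTO UKPresentsExternal AS target
USING UKPresentsUpdates AS source
ON target.Name = source.Name AND target.City = source.City
WHEN MATCHED THEN
  UPDATE SET target.Present = source.Present
WHEN NOT MATCHED THEN
  INSERT (Name, City, Present)
  VALUES (source.Name, source.City, source.Present);
```

If you have written MERGE statements in SQL Server before, this should look very familiar.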
Pipelines
Next, I want to briefly cover pipelines in Azure Synapse Analytics. This is where you can create pipelines to perform various tasks, such as loading data into a dedicated SQL Pool.
In reality, working with these pipelines is similar to working with pipelines in Azure Data Factory. The main difference is that the items you work with are in different places.
Anyway, if you have worked with SQL Server Integration Services (SSIS) your experience can come in handy with pipelines, because they are a similar concept.
To highlight this, below is an example of a mapping data flow that you can work with in Azure Synapse Analytics.
Power BI
You might have noticed earlier in this post that there is a Power BI option when you first enter Synapse Studio. That is because you can connect Synapse Studio to a Power BI workspace and work directly on reports within Synapse Studio.
Here your SQL Server background can come in handy in a number of ways, including only working with the data that you need.
If you want to connect to a Power BI workspace in another tenant, I wrote a post a while back which is a nifty Power BI fix for Synapse Studio.
Final words about spreading your SQL Server wings with Azure Synapse Analytics
I hope this post about spreading your SQL Server wings with Azure Synapse Analytics has proved to be useful, because there is a lot you can do in Azure Synapse Analytics.
If you are keen to learn more, I wrote a post a while back which covers recommended Azure Synapse Analytics certifications.
Of course, if you have any comments or queries about this post feel free to reach out to me.