This post accompanies a video called ‘A fast and festive introduction to Azure Data Engineering services’, which is available to watch from the day this post is published.
I made the video for this year’s Festive Tech Calendar. So, I thought I would write this post to accompany it, so that you have an overview of what is in the video.
One key point about this post is that there is material in the video which is not included in this post, such as demos of the services. In addition, as an added bonus, there is some material in this post which does not exist in the video.
Introduction
I will admit that I was going to use the below image as an introduction to the video. Instead, I went for another option. Feel free to let me know which one you would have preferred. You can also watch the video and check out my festive jumper by clicking on the image below instead of the link above.
Since people have their own ideas about which Azure services are Azure Data Engineering services, I made a judgement call. I decided to focus on the services included for the current Microsoft Certified Azure Data Engineering Associate certification, which includes services such as Azure Stream Analytics, Azure Databricks and Azure Synapse Analytics. You can view all the services covered by viewing the details for the DP-203 exam.
One key point is that this year’s Festive Tech Calendar is for a good cause. This year it is raising money for Girls Who Code. On the Festive Tech Calendar website you can find the link to the JustGiving page.
Please support this if you have enjoyed any of the content for this year’s Festive Tech Calendar, because a lot of us professionals have put a lot of effort into it. Especially the organizers.
Because it was helping raise money for a charity, I did something a bit different for this video. I decided to record two bios for it: a professional bio at the start and a festive-themed one towards the end. You can find a teaser for the festive one towards the end of this post.
Order of services
I decided to show the services in a similar order to how you would see them in some of Microsoft’s solution ideas online, which tend to show the flow of data in various stages, like the example below.
Ingest Azure Data Engineering services
First of all, there are Azure Event Hubs and Azure IoT Hub, which sound similar and yet are useful in different ways.
Both can be used to ingest streaming data. However, Azure Event Hubs is more commonly used to ingest data from applications, for example those that use Apache Kafka.
Azure IoT Hub, on the other hand, tends to be used to ingest data from IoT (Internet of Things) devices. For example, sensors in devices.
Both can be set up as inputs for Azure Stream Analytics, which is one of the more popular services that can process data from these hubs.
It can process the data from these hubs to detect both temporary and long-term anomalies. In addition, it can prepare the data to be stored in other services, such as Azure SQL Database. Plus, it can send data directly to Power BI to use as a source for reports.
Something that might interest those with a SQL background is that you decide how to process the data in Azure Stream Analytics using a SQL-like language.
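To give you an idea, below is a minimal sketch of a Stream Analytics query. The input and output names are hypothetical aliases that you would define yourself in the job.

```sql
-- Read from a (hypothetical) IoT Hub input and average readings
-- per device over five-minute windows into a Power BI output
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature
INTO
    [powerbi-output]
FROM
    [iothub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    DeviceId,
    TumblingWindow(minute, 5)
```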
The final service that I want to cover in the ingest section is Azure Data Factory, which is one of the more popular Azure Data Engineering services. It is a proven solution in Azure.
A lot of people use it to perform Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) tasks. In other words, it helps move and transform data.
Those of you with a SQL Server background can conceptually think of Azure Data Factory as a cloud-based version of SQL Server Integration Services. In reality though, it is very different.
The latest version of it is V2. However, I can tell you from experience that there is a chance you might encounter somewhere that still has version 1 of Azure Data Factory in use.
Store Azure Data Engineering service
Currently, there is only one service relating to storage mentioned in the criteria for the DP-203 exam, which is Azure Data Lake Storage. Now, there are two versions of Azure Data Lake Storage.
Azure Data Lake Storage Gen1 is still used in a lot of places. However, Azure Data Lake Storage Gen2 is becoming more popular. Azure Data Lake Storage Gen2 supports hierarchical namespaces. In other words, it supports folders.
Prep and train Azure Data Engineering services
Now here’s where things get a bit interesting, because there are blurred lines, to a degree, between the two main services you can use in both this stage and the Model and Serve stage.
Multi-tasking Azure Data Engineering services
With this in mind, I renamed the Prep and train stage to multi-tasking services in the video, mainly because both Azure Databricks and Azure Synapse Analytics offer some similar features.
Azure Databricks
Azure Databricks is the Azure-hosted version of the popular Databricks service. It is mostly used to work with high volumes of data in Spark clusters, unless you decide to copy the demo that I did in the video.
Those of you with a SQL Server background can think of Spark clusters as In-Memory OLTP spread across multiple machines. However, it is a bit more complicated than that.
It allows people to share code and compute so that they can work together on things like Big Data analytics, which they tend to do against large volumes of data stored in a storage service like Azure Data Lake Storage.
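As a rough sketch, assuming a hypothetical storage account and container that the cluster is already authorised to access, you can query files in a Data Lake directly with Spark SQL:

```sql
-- Query Parquet files in a Data Lake directly from a Spark cluster
-- (the storage account, container and path are hypothetical)
SELECT *
FROM parquet.`abfss://mycontainer@mystorageaccount.dfs.core.windows.net/sales/`
LIMIT 10
```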
When you deploy Azure Databricks it creates a management environment called a Workspace. It is within this Workspace that others can share code via the use of notebooks.
These notebooks tend to be based on either Python, Scala, R or SQL.
However, you can create a notebook in one of these languages and then run a command in a cell based on another language, thanks to something known as magic.
Not the kind of magic that we tell children Santa uses, but a special command instead. At the top of the cell you enter a percentage sign (%) followed by the language you want to change it to. I show some magic in the video.
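For example, here is a minimal sketch of a cell in a Python notebook that has been switched to SQL with a magic command. The table name is hypothetical.

```sql
%sql
-- This cell runs as SQL inside a Python notebook thanks to the magic command
SELECT DeviceId, COUNT(*) AS Readings
FROM sensor_readings  -- hypothetical table
GROUP BY DeviceId
```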
Lakehouse
Databricks is promoting a new paradigm called a Lakehouse, which you can think of as a combination of a Data Lake and a Data Warehouse.
To help with this they recommend using something called a Delta Lake, which helps keep the files in your Data Lake reliable by introducing ACID properties. For those who do not know, ACID stands for Atomicity, Consistency, Isolation and Durability. It is a term that is probably familiar to those of you with a SQL background.
Another thing that might sound familiar to those of you with a SQL Server background is the fact that these Delta Lakes use a transaction log. One popular feature in these Delta Lakes is Time Travel, which allows you to view the state of your data at a point in time.
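As a quick sketch of Time Travel, using Delta Lake’s SQL syntax with a hypothetical table name:

```sql
-- View the change history of a Delta table (hypothetical table name)
DESCRIBE HISTORY sensor_readings;

-- Query the table as it looked at an earlier version
SELECT * FROM sensor_readings VERSION AS OF 3;

-- Or as it looked at a particular point in time
SELECT * FROM sensor_readings TIMESTAMP AS OF '2021-12-01';
```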
One last thing that might pique the interest of those of you with a SQL background is that Databricks have now introduced Databricks SQL. Just be aware that you currently need to be using the Premium tier of Databricks to use it.
Anyway, it would take a lot longer to go through all the features of Databricks. So, onto the next service.
Azure Synapse Analytics
Azure Synapse Analytics is a fully integrated service that has been developed by Microsoft.
Its ambition is to provide Data ingestion, Big Data analytics and Data Warehousing all under one roof, to save you having to use multiple services.
Take the following example.
Traditionally, you would use Azure Data Factory to ingest data via pipelines and Azure Databricks to perform powerful analytics on that data using Spark clusters. Finally, you would keep a historic copy of the data in Azure SQL Data Warehouse.
Now, you can do all three within Azure Synapse Analytics.
It can ingest data by using its own pipeline functionality or by using Azure Data Factory. In fact, the two are so similar that I wrote a post about how you can copy an Azure Data Factory pipeline to Synapse Studio.
One big difference is that you cannot use SSIS Integration Runtimes in Azure Synapse Analytics.
You can also ingest data from a couple of other services in near-real time, using something called Azure Synapse Link. I covered in a previous post that you can now use Azure Synapse Link for Dataverse. In addition, Azure Synapse Link will be available for SQL Server 2022.
As far as Big Data analytics is concerned, Azure Synapse Analytics comes with its own way to work with Apache Spark pools.
Finally, as far as long-term storage is concerned, Azure Synapse has a couple of interesting options, called SQL Pools.
SQL Pools
Currently, there are two types of SQL Pools. As the name suggests, they both support T-SQL syntax to a degree, which makes them easier for SQL Server professionals to use.
The first type is known as dedicated SQL Pools, which used to be a separate service called Azure SQL Data Warehouse and is now integrated into Azure Synapse Analytics.
It uses what is known as massively parallel processing (MPP) to work with data. Basically, when you work with data, behind the scenes that data can be split across multiple nodes so that it can be processed faster.
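To give you a feel for this, below is a minimal sketch of creating a hash-distributed table in a dedicated SQL Pool. The table and column names are hypothetical.

```sql
-- Spread the rows across the nodes based on a hash of CustomerId
CREATE TABLE dbo.FactSales
(
    CustomerId INT NOT NULL,
    SaleDate   DATE NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);
```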
The second type is known as serverless SQL Pools, which you can loosely think of as Spark pools that run T-SQL syntax.
They can be used to analyse data stored in files within storage services, or to create a logical Data Warehouse on top of the files that are stored, which is a practice that is also encouraged in Azure Databricks.
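As a minimal sketch, assuming a hypothetical storage account and container, a serverless SQL Pool can query Parquet files in a Data Lake like this:

```sql
-- Query Parquet files directly in the Data Lake (hypothetical storage path)
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [sales];
```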
When you deploy Azure Synapse Analytics in Azure you create an environment known as a workspace, which you then manage using Synapse Studio.
In the past I did a post which was a five-minute crash course to Synapse Studio. So, in the video I decided to see how long a recorded version of a crash course would take.
Blurred lines between a couple of Azure Data Engineering services
Anyway, now that I have gone through the last two services you can see that there are some blurred lines between what they can do. In addition, both can implement the new Lakehouse paradigm.
Honourable mentions
I will give a couple of honourable mentions to some other services below. Feel free to follow the links for more information about them.
- Azure HDInsight
- Azure Data Explorer
- Apache Kafka
- Azure Purview (known as Microsoft Purview since April 2022)
- Azure Cosmos DB
- Power BI
Alternative bio
At the end of the video, I do my alternative bio. You can see a teaser for it below.
You can watch the full version at the end of the ‘A fast and festive introduction to Azure Data Engineering services’ video.
Final words
I hope you have enjoyed reading this fast and festive introduction to Azure Data Engineering services. You can see more in the video, including some demos.
Of course, if you have any comments or queries about this post feel free to reach out to me.