• Home
        • Database Development

          Database development is designing, creating a database or data model, and analyzing requirements and their intents as raw data.

          Learn More
        • Architecture & Design

          Software architecture refers to the fundamental structures of a software system and the discipline of creating such structures and systems.

          Learn More
        • Programming

          Computer programming is the process of performing a particular computation or more generally, accomplishing a specific result.

          Learn More
        • Cloud Computing

          Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power.

          Learn More
        • ETL Development

          ETL provides the foundation for data analytics and machine learning workstreams. Through a series of business rules, ETL cleanses and organizes data.

          Learn More
        • Data Visualization & Reports

          Data and information visualization is an interdisciplinary field that deals with the graphic representation of data and information.

          Learn More
  • Blog
  • Contact

How to Load Slowly Changing Dimension in Azure Using MERGE

Premier-League-2013_2014-Table-Standings-Football-Rankings-_-Eurosport

A data warehouse is typically made up of fact and dimension tables that are loaded using different load patterns such as incremental vs. full; and batched vs real-time. The different types of fact and dimension tables can further influence how the data warehouse gets loaded. Slowly Changing Dimensions (SCD) is one type of dimension table found in a data warehouse. It simply works by storing, tracking and managing current and historical data over a period of time within a data warehouse. 

In this article, we look at one example of a data warehouse load, specifically an incremental load of a type 2 Slowly Changing Dimension using a SQL Merge statement. SQL Merge is an ANSI standard (SQL:2003) supported by database systems such as Oracle, PostgreSQL and SQL Server. To demonstrate a load of Slowly Changing Dimension using a SQL Merge, we will use the following:

  1. SQL Server – specifically T-SQL within a Microsoft’s Azure SQL DB
  2. Premier League dataset

Premier League Use Case

The Premier League (also known as EPL) is England’s top football league system and one of the most watched football league around the globe. It is organized by seasons which run between August until May of the following year. A total of 38 games are played by each of the 20 teams participating in a season with the Premier League trophy awarded to a team that accumulated the highest points.

Historical and current season data is available for download from various sports websites. One such sports website is Eurosport.com which keeps track of league standings per season per team along with goals scored, games won, draws, losses, goals conceded etc. An example of Premier League standings data from Eurosport.com is shown in Figure 1.

Premier League 2013-2014 Table Standings
Figure 1: Premier League 2013-2014 Table Standings

The Premier League use case is a perfect candidate to simulate data warehouse load as it has various data sets that could be used to apply different data warehouse load patterns. Teams in the Premier League refer to the different football clubs that play in the Premier League. Teams is therefore one data set that we could store in a dimension to simulate a data warehouse load. Possible attributes of a Team dimension are:

  1. Name of team
  2. Team description
  3. The season that the team plays in (although season can also be a standalone dimension). 

Stage Data into Azure SQL DB

To get started, we extract the Premier League data shown in Figure 1 into a table in Azure SQL DB. The definition of our staging table is shown in Script 1.

CREATE TABLE [dbo].[PL_Season_2010_2011](
	[NO] [tinyint] NOT NULL,
	[TEAMS] [nvarchar](50) NOT NULL,
	[P] [tinyint] NOT NULL,
	[W] [tinyint] NOT NULL,
	[D] [tinyint] NOT NULL,
	[L] [tinyint] NOT NULL,
	[GF] [tinyint] NOT NULL,
	[GA] [tinyint] NOT NULL,
	[GD] [smallint] NOT NULL,
	[PTS] [tinyint] NOT NULL
) ON [PRIMARY]
GO
Script 1: CREATE TABLE Statement for the staging table

The simplest way to get the data staged would be to copy the data from Eurosport.com into Excel Spreadsheet and then import it using SSMS as shown in Figure 2.

Azure, Data Warehouse, Eurosport.com, MERGE, Premier League, SCD
Figure 2: Stage season data using SSMS

Figure 3 shows Premier League 2020-2011 season data that has been imported and staged into SQL table dbo.PL_Season_2010_2011.

Azure, Data Warehouse, Eurosport.com, MERGE, Premier League, SCD
Figure 3: Premier League 2010-2011 season data

The rest of the different Premier League seasons’ data was successfully staged into separate tables as shown in Figure 4.

Premier League-Staging-Tables
Figure 4: Successfully Staged Premier League Season

Create Dimension Table

Data staging helps prepare the definition of the dimension table that will be used to store the final data. Script 2 provides a CREATE TABLE statement for our Team dimension. You will notice in Line 1 under Script 2 that the Team dimension is created under the DIM schema this is to make it easier to identify data warehouse dimension tables.

CREATE TABLE [DIM].[Team](
	[TeamID] [int] IDENTITY(1,1) NOT NULL,
	[Team] [varchar](255) NOT NULL,
	[Season] [varchar](25) NOT NULL,
	[IsActiveRecord] [bit] NULL,
	[IsDeletedRecord] [bit] NULL  DEFAULT ((0)) ,
	[ActiveRecord_Start] [datetime2](7) NOT NULL  DEFAULT (GETDATE()),
	[ActiveRecord_End] [datetime2](7) NOT NULL  DEFAULT ('2099-12-31')
) ON [PRIMARY]
GO
Script 2: CREATE TABLE Statement for Team dimension

The definition of our Team dimension consists of business-related fields such as Team (for team name) and Season. However, data warehouse designs typically recommend creation of metadata columns in addition to business key columns. Thus lines 2, 5-8 represent such metadata columns in our dimension. The complete name and description of the columns used in Script 2 are provided in Table 1.

Column Name

Column Description

TeamIDis a surrogate key used to uniquely identify a record in the dimension
Teamis used to store names of the Teams participating in Premier League Seasons. The team is also the unique business key identifier
Seasonis used to store the Premier League Seasons i.e. 2010-2011, 2011-2012
IsActiveRecordis a Boolean value used to indicate current and historical values. Current values are indicated by one (1) whereas a zero (0) indicates historical record.
IsDeletedRecordis used to indicate a record that no longer exists at source.
ActiveRecord_Startis used to track the effective start date for when the record is current/active. All records are assigned a default value of the current date until they are closed off and replaced by the latest versions of the records
ActiveRecord_Endis used to track effective end dates for when the record is current/active. If the record remains active then the default value is set to a future date of 2099-12-31
Table 1: Definition of Team dimension

Build SQL Merge Statement

Now that we have prepared our staging dataset and created our Team dimension, we move next to build a SQL Merge statement that will be used to load the Team dimension. Generally, SQL Merge statements work by comparing source data (incoming records) against target data (existing records in a dimension table). If data at the source doesn’t exist at the target, then you can choose to insert the new data into the dimension.

Script 3 is a statement for preparing our source data using one of the Premier League season’s data that has previously been staged. The script simply returns a list of unique Teams from the 2010-2011 season’s data.

SELECT DISTINCT
	[TEAMS] as [Team]
	,'2010-2011' as Season
	, GETDATE() AS ActiveRecord_Start
FROM [dbo].[PL_Season_2010_2011]
Script 3: CREATE TABLE Statement for Team dimension

The complete SQL Merge statement for our Team dimension is shown in Script 4. Some highlights of the SQL Merge statement are as follows:

  1. The join between source and target is based on the Team name. This comparison is only done against records that are currently active in the dimension.
  2. for WHEN MATCHED we run an update of the target provided that the team or season is not the same.
  3. The load into the data warehouse is from lines 29-46.
MERGE [DIM].[Team] AS TARGET
USING (
		SELECT DISTINCT
			 [TEAMS] as [Team]
			,'2010-2011' as Season
			, GETDATE() AS ActiveRecord_Start
		FROM [dbo].[PL_Season_2010_2011]
) AS SOURCE
ON (
		TARGET.[Team] = SOURCE.[Team]
	AND TARGET.IsActiveRecord = 1
)
WHEN MATCHED 
	AND (
		   TARGET.[Team] <> SOURCE.[Team]
		OR TARGET.[Season] <> SOURCE.[Season]
		)
	THEN
    UPDATE SET
		
		 TARGET.[Season] = SOURCE.[Season]
		,TARGET.ActiveRecord_Start = SOURCE.ActiveRecord_Start

WHEN NOT MATCHED BY SOURCE 
		AND TARGET.IsActiveRecord = 1

	THEN DELETE

WHEN NOT MATCHED BY TARGET THEN
  --Add new records into the dimension
  INSERT (
       [Team]
      ,[Season]
      ,[IsActiveRecord]
	  ,[IsDeletedRecord]	
      ,[ActiveRecord_Start]
      ,[ActiveRecord_End]
)
  VALUES(
	SOURCE.[Team]
	,SOURCE.[Season]
	,1
	,0
	,SOURCE.ActiveRecord_Start
	,'2099-12-31 23:59'
)
;
Script 4: CREATE TABLE Statement for Team dimension

Figure 5 shows the data loaded into the Team dimension following the execution of Script 4 – by default all the new records will have an IsActiveRecord flag of one (1), IsDeletedRecord to zero (0), and ActiveRecord_Start set to the current date.

Azure, Data Warehouse, Eurosport.com, MERGE, Premier League, SCD
Figure 5: Team Dimension load with 2010-2011 Season Data

Implement Slowly Changing Dimension

At this point, we have successfully demonstrated the following:

  1. Prepare and Stage data into the data warehouse (hosted in SQL Azure DB)
  2. Create a target Team dimension under a seperate schema (DIM)
  3. Build source query to identify Premier League team records
  4. Build a SQL Merge statement to load Team dimension

What would happen if we were to load next season’s data from our staging like the 2011-12 season?

Well, the SQL Merge statement will do the following:

  1. Update the existing records
  2. Insert new Teams
  3. Delete teams no longer exist at the source

However, this will not version control our existing records – it would simply implement a type 1 SCD, which is replacing old records with new ones. As mentioned earlier, we aim to implement a type 2 SCD – which means we need to find a mechanism to close out existing records and flag new versions of the records as active.

The mechanism to do this in the SQL Merge is via the OUTPUT clause. OUTPUT variables are used to track the version of records that have been deleted, updated or inserted. We can extend our current type 1 SCD functionality by tracking all records that are deleted or updated by storing them in a temporary table and then writing those records back into the Team dimension.

Script 5 shows the complete SQL Merge stored procedure that implements type 2 SCD.

  1. Line 3 declares a temporary table variable that will be used to store records tracked by the OUTPUT clause.
  2. Lines 61-70 are used to write any updated/deleted/inserted records into the temporary table variable
  3. Lines 73-110 are then used to write the updated and deleted records back into our Team dimension.
CREATE PROC [DIM].[prcTeam] as

DECLARE @DIM_Team TABLE
(
   ActionType [varchar](25) NULL,
    [Team] [varchar](255)  NULL,
	[Season] [varchar](25)  NULL,
	[IsActiveRecord] [bit] NULL,
	[IsDeletedRecord] [bit] NULL DEFAULT (0),
	[ActiveRecord_Start] [datetime2](7)  NULL,
	[ActiveRecord_End] [datetime2](7)  NULL
);

MERGE [DIM].[Team] AS TARGET
USING (
		SELECT DISTINCT
			 [TEAMS] as [Team]
			,'2011-2012' as Season
			, GETDATE() AS ActiveRecord_Start
		FROM [dbo].[PL_Season_2011_2012]
) AS SOURCE
ON (
		TARGET.[Team] = SOURCE.[Team]
	AND TARGET.IsActiveRecord = 1
)
WHEN MATCHED 
	AND (
		   TARGET.[Team] <> SOURCE.[Team]
		OR TARGET.[Season] <> SOURCE.[Season]
		)
	THEN
    UPDATE SET
		
		 TARGET.[Season] = SOURCE.[Season]
		,TARGET.ActiveRecord_Start = SOURCE.ActiveRecord_Start

WHEN NOT MATCHED BY SOURCE 
		AND TARGET.IsActiveRecord = 1

	THEN DELETE

WHEN NOT MATCHED BY TARGET THEN
  --Add new records into the dimension
  INSERT (
       [Team]
      ,[Season]
      ,[IsActiveRecord]
	  ,[IsDeletedRecord]	
      ,[ActiveRecord_Start]
      ,[ActiveRecord_End]
)
  VALUES(
	SOURCE.[Team]
	,SOURCE.[Season]
	,1
	,0
	,SOURCE.ActiveRecord_Start
	,'2099-12-31 23:59'
)

OUTPUT
	$action AS ActionType
	,DELETED.[Team]
	,DELETED.[Season]
	,DELETED.IsActiveRecord
	,DELETED.[IsDeletedRecord]
	,DELETED.[ActiveRecord_Start]
	,GETDATE()
	
INTO @DIM_Team;

----expire previously active record
INSERT INTO [DIM].[Team]
(
       [Team]
      ,[Season]
      ,[IsActiveRecord]
	  ,IsDeletedRecord
      ,[ActiveRecord_Start]
      ,[ActiveRecord_End]
)
SELECT  
	   [Team]
      ,[Season]
      ,0 as IsActiveRecord
	  ,0 IsDeletedRecord
      ,[ActiveRecord_Start]
	  ,[ActiveRecord_End]	
FROM  @DIM_Team
WHERE ActionType in ('UPDATE')

----expire previously active record
INSERT INTO [DIM].[Team]
(
       [Team]
      ,[Season]
      ,[IsActiveRecord]
	  ,IsDeletedRecord
      ,[ActiveRecord_Start]
      ,[ActiveRecord_End]
)
SELECT  
	   [Team]
      ,[Season]
      ,0 as IsActiveRecord
	  ,1 IsDeletedRecord
      ,[ActiveRecord_Start]
	  ,[ActiveRecord_End]	
FROM  @DIM_Team
WHERE ActionType in ('DELETE')
;
GO
Script 5: SQL Merge Proc with OUTPUT Clause

Figure 6 shows a preview of our type 2 SCD Team dimension with 2-seasons’ worth of data (2010-2011 & 2011-2012 season data). You will notice for instance that the Arsenal team appears in both seasons as a result the latest season data (2011-2012) has an active record flag. Birmingham City team only appeared in the 2010-2011 season – as a result, it was then flagged as deleted during the processing of the 2011-2012 season data; the same is true of the Blackpool team at line 8 – these teams were relegated at the end of 2010-2011 season.

Azure, Data Warehouse, Eurosport.com, MERGE, Premier League, SCD
Figure 6: Results of the Team Dimension into a Type 2 SCD using a Merge

Summary

This article has demonstrated a data warehouse load using a SQL Merge statement. We prepared and staged the Premier League season’s data from the Eurosport.com website into a SQL Azure DB. We then created a target Team dimension and populated it using the SQL Merge statement.

Loading

Sifiso

November 20, 2023
Sifiso is Data Architect and Technical Lead at SELECT SIFISO – a technology consulting firm focusing on cloud migrations, data ingestion, DevOps, reporting and analytics. Sifiso has over 15 years of across private and public business sectors, helping businesses implement Microsoft, AWS and open-source technology solutions. He is the member of the Johannesburg SQL User Group and also hold a Master’s Degree in MCom IT Management from the University of Johannesburg.

Meet Our Experts

We are proud to have a team of experts who are passionate about delivering the best possible solutions to you. Our team members are highly skilled and experienced in a diverse range of IT technologies, and are committed to staying up-to-date with the latest industry trends and best practices to deliver you the best results.

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Join Our Newsletter

Subscribe to get our latest and best thinking on the most definitive workforce topics affecting HR leaders and organizations today.