Sunday, March 18, 2012

Slowly Changing Dimensions

In a prior post, before we took a break to practice a bit, we looked at the difference between natural and surrogate keys.  At first glance, you may wonder why a surrogate key cannot also serve as a natural key.  The answer to this excellent question leads us into a discussion of something called slowly changing dimensions (SCD). 
As time goes on, we would be foolish to expect that the descriptive data of our facts will not change.  It most definitely will change.  As people get promoted, then the job title fields in the employee dimensions will be affected.  As business grows and additional lines of business are created, then the dimensions that contain that data will need to be updated.  There are a few ways of dealing with this depending on the reporting needs of the organization.
Type 1 SCD: I typically describe a type 1 SCD as a dimension that does not keep history.  Once a piece of data changes (i.e., an employee is promoted from a Junior Analyst to a Senior Analyst) the old data is overwritten with the new.  Those analyzing the data will never know, from this data set, that this person was ever a Junior Analyst.
Type 2 SCD: I typically describe a type 2 SCD as a dimension that does keep history.  Once a piece of data changes, a new row, containing the updated data, is added to the dimension table.  The original row is still there.  All future fact records that pertain to that descriptive piece of data will have a foreign key pointing to the new record.  The older records in the fact table will still point to the original dimension record.  So, using the example above, if an employee is promoted from a Junior Analyst to a Senior Analyst, two rows will exist for that employee in the dimension table.  One will show this employee as being a Junior Analyst and the other a Senior Analyst.  All of the records in the fact table that describe facts that occurred after the promotion will point to the Senior Analyst row.  Those prior to the promotion will point to the Junior Analyst row.  Those analyzing this data will be able to see that this person was promoted.
Type 3 SCD: A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD.  In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD.  Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD.  Each row also contains a column with the current value.  So, continuing with the example above, while you will be able to see that the older fact rows pertain to a time in which the employee was a Junior Analyst, you will also be able to see (while looking at this same historical dataset) that the employee is now a Senior Analyst.
So, let’s return to our original question.  Why can’t a natural key be used as the primary key of a dimension table?  Technically it can, but that’s a bad idea.  Consider a data warehouse that was built so that the natural keys were used as the primary key of the dimension table.  If a fact table has employee details, the foreign key pointing to the dimension row in the employee dimension will be the organization’s employee number.  This is fine if the employee dimension is a type 1 SCD.  However, if the need arises for this to become a type 2 SCD then this becomes a problem.  Now you run into a situation where the new design calls for there to be, possibly, multiple rows in the dimension table with the same primary (natural) key since the natural key has changed.  Also, what happens when the organization expands and decides to change all of the employee numbers from five digits to ten digits?  Now all of your keys related to this dimension in the data warehouse (foreign key values in the fact table and primary key values in the dimension table) must change.  However, if surrogate keys are used, then implementing this change will be a much smoother process. 

No comments:

Post a Comment