Friday, June 22, 2012

Snowflake Schema

Going back to the piggy bank example from this blog's first post, we know that the value of a data warehouse lies with allowing a user to analyze data easily.  This is mainly achieved through denormalization.  This often differs from the value of a source system which lies with getting data into the system quickly.  This is mainly achieved through normalization.

So, if we were to run a select statement against a source system that is designed to return (among other things) an employee name as well as his department, those two things may be stored in two tables.  The employee table may simply store a foreign key to the department table, which stores the names of the departments.  The query may find 'John Smith' in the employee table and then the number 6 in the department field.  The database will have to go to the department table, look up the row with the number 6 to see that the value is 'Human Resources' and then return those two values, along with any other values that were specified in the select statement, in the query results.  Doing this for several rows and for several different kinds of values will take some time.

In a typical data warehouse design, however, all of these values may exist in the same table and be repeated.  This way, the query can go to one table and find 'John Smith' and 'Human Resources' without having to translate a foreign key.  This keeps things quick.

However, there are some instances in which some normalization is appropriate in a star schema.  This normalization is not to the extent that it exists in the source system, which is often third normal form, but it is normalization none the less.  Consider the ERD below (click on it to expand)...



This is yet another addition to the star schema that we have been using for the past several weeks.  Notice the DIM_EXECUTIVE table at the bottom.  It is a dimension table that is joined directly to the DIM_EMPLOYEE dimension table.  It is not joined to the fact table.  Why would we not add the four non-key values directly to the DIM_EMPLOYEE table?  We could and that would not be incorrect, but what if there is not much executive information compared to the number of employees?

So, if the DIM_EXECUTIVE table contains 10 rows and the DIM_EMPLOYEE table contains 4,000,000 rows, then placing this executive information into the DIM_EMPLOYEE table will result in 3,000,990 rows of empty space.  In this case, it might make sense.

When a dimension table joins to another dimension table the star schema is now referred to as a snowflake schema.  The "second layer" of dimension tables causes the tables, when they are depicted in an ERD, to resemble a snowflake.

I will add that when in doubt, it is probably best not to snowflake.  Snowflaking causes some additional complexity on the part of the reporting tools when it comes to interpreting the data.

To experiment with analyzing this kind of data, you can run the script found here in an Oracle environment.  Rather than continue adding to the script little by little, this script will create ALL of the tables and data depicted in the ERD.  As with last week's post on junk dimensions, you have some options with regards to snowflaking or not.  Remember that your number one goal is to turn your organization's data into something that is truly valuable, thus enabling your leaders to make excellent decisions.  Make your decision with that in mind.

More on snowflaking can be found in the books listed in the additional reading section to the right as well as by visiting http://www.brianciampa.com/ and, under the Concepts menu, clicking Terms or Advanced.

Update: The Kimball training that I received taught me that I was incorrect in classifying this type of design as a snowflake schema.  A snowflake schema contains a completely normalized version of the dimension tables.  This post describes an outrigger table.  The example above should still be considered a star schema.  Sorry for my mistake.

Image(s): FreeDigitalPhotos.net

2 comments: