Saturday, February 11, 2012

Natural Keys vs. Surrogate Keys

A prior post examined fact and dimension tables.  The concept is fairly easy to understand but there is some more complexity to dimension tables that I want to touch on.  That complexity surrounds the ways in which records are uniquely identified.  Two terms called a natural key and a surrogate key come into play. 
Natural Key: A natural key is a key that uniquely identifies a piece of descriptive data (dimension) from a business perspective.  Giving a business expert this value will mean something to him.
Surrogate Key: A surrogate key is a key that is internal to the data warehouse.  It most often takes the form of an auto generated number that serves as the primary key of a dimension table.  Giving a business expert this value will mean nothing to him.
Consider the following table (also used in the other post)…
This table is a dimension table that contains data about products.  From the business’ point of view, the way that you uniquely identify a product is by the product number.  A user of the data warehouse can compare the product_number field to the source system and find it in there.  The product_number field, therefore, is known as the natural key. 
However, giving the business users the key_product value will be meaningless to them.  No source system will have that value.  As new products are added to the dimension table, the ETL auto generates a number and places it into this field.  The only reason for this is so that it can be used for the primary key/foreign key relationship between this dimension table and the fact table.  Therefore, the key_product field is the surrogate key.
Why can’t the natural key serve as the surrogate key?  Good question.  It’s a bit much to answer in this post but I’ll basically say that there may be instances in which more than one record will exist in the dimension table for a given product.

No comments:

Post a Comment