A prior post examined fact and dimension tables. The concept is fairly easy to understand but there is some more complexity to dimension tables that I want to touch on. That complexity surrounds the ways in which records are uniquely identified. Two terms called a natural key and a surrogate key come into play.
Natural Key: A natural key is a key that uniquely identifies a piece of descriptive data (dimension) from a business perspective. Giving a business expert this value will mean something to him.
Surrogate Key: A surrogate key is a key that is internal to the data warehouse. It most often takes the form of an auto generated number that serves as the primary key of a dimension table. Giving a business expert this value will mean nothing to him.
Consider the following table (also used in the other post)…
This table is a dimension table that contains data about products. From the business’ point of view, the way that you uniquely identify a product is by the product number. A user of the data warehouse can compare the product_number field to the source system and find it in there. The product_number field, therefore, is known as the natural key.
However, giving the business users the key_product value will be meaningless to them. No source system will have that value. As new products are added to the dimension table, the ETL auto generates a number and places it into this field. The only reason for this is so that it can be used for the primary key/foreign key relationship between this dimension table and the fact table. Therefore, the key_product field is the surrogate key.Why can’t the natural key serve as the surrogate key? Good question. It’s a bit much to answer in this post but I’ll basically say that there may be instances in which more than one record will exist in the dimension table for a given product.