Friday, September 21, 2012

To Constrain or Not Constrain

One of the advantages of using a relational database is that you can mandate that certain relationships MUST exist within your data.  If somebody tries to enter data into the database that does not relate to other data correctly, the database can be configured to reject that "bad data" until it is corrected.  Since many data warehouses reside in relational databases, using this feature is an option for the data warehousing team as well. 

Consider this example...suppose that a star contains a DIM_PRODUCT table that contains the products that are used by that business process.  If the Men's Raincoat product has a KEY_PRODUCT value of 27 (surrogate key) then the records in the corresponding fact tables that pertain to this product will have a KEY_PRODUCT value of 27 (foreign key).  That's a lesson from Database 101.  If somebody removes the Men's Raincoat product from the DIM_PRODUCT table while records that point to it exist in the fact table, then those fact table records will point to nothing...and become meaningless.

The advantage of enforcing this constraint within the database itself is that if somebody tries to remove the Men's Raincoat product from the DIM_PRODUCT table the database will not allow it until the "child records" from the fact table have been deleted or repointed.  Using this feature sounds like a no-brainer (and it may be) but a data warehouse provides an additional twist to this kind of decision since so much data is being inserted.

Option 1: Enforce Constraints - This will ensure that the relationship between the fact and dimension tables are always valid from a technical perspective.  However, each time that the data is loaded via the ETL job, the fact table must look to make sure that a parent record exists in the dimension.  If (and only if) it exists, it will load that record into the fact.  Doing that for each and every record will ensure good data intregity but it can also slow a job down.

Option 2: Do Not Enforce Constraints - This will probably result in a faster and more efficient ETL job.  However, the possiblity of the fact table containing some orphan records exists.

It is up to each data warehousing team to decide which is best for their particular situation.  In some cases it is appropriate to physically enforce constraints (option 1).  In other cases, it may be appropriate to logically enforce constraints (option 2), meaning that the data warehousing team will need to periodically run SQL statements that specifically look for orphaned records. 

If you are looking for more information on data warehousing and/or data that can be used to practice ETL and architecture skills, visit and click on Career in Data Warehousing.  Also, if you are looking for a fresh way to market your data warehousing skills, consider The Data Warehouse Portfolio.


No comments:

Post a Comment