Friday, August 10, 2012

ETL - Practice Loading a Dimension - Solution

In the previous post we looked at the process used to write a basic ETL job to populate a dimension table.  While no two developers will write one in exactly the same way, any ETL job that populates a dimension table will need to accomplish the following...

1.) Extract all of the dimension attributes from the source.
2.) Transform the data according to the requirements (e.g., concatenate first name and last name to create a full name column).
3.) Identify the records that are new to the table, as opposed to those that already exist and may need to be updated because of changes in the source system.
4.) Generate new surrogate key values and add them to the new records.
5.) Load the records into the dimension table.

Consider this example of an Oracle PL/SQL procedure that populates the DIM_PRODUCT table (a sketch follows the notes below).  Before running the procedure, keep these things in mind...

1.) I made an error in the SQL file from the previous post, so you may want to re-download that and run it again.  The DIM_PRODUCT.PRODUCT_EFFECTIVE_FLAG should be named DIM_PRODUCT.PRODUCT_ACTIVE_FLAG. 
2.) This job depends on an Oracle sequence to create the surrogate keys.  Before running the job, run this statement in your Oracle environment...

CREATE SEQUENCE SEQ_DIM_PRODUCT
  MINVALUE 0
  INCREMENT BY 1
  START WITH 1;

3.) I encourage you not to get too lost in the Oracle syntax.  The point is to examine the logical flow of an ETL job.  If you have a better way of structuring the Oracle code (or want to use something other than Oracle), then by all means do that.
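Here is a minimal sketch of what such a procedure might look like.  The staging tables (STAGE_PRODUCT_EXTRACT, STAGE_PRODUCT_TRANSFORM, STAGE_PRODUCT_LOAD), the DIM_PRODUCT table, the PRODUCT_ACTIVE_FLAG column, and the SEQ_DIM_PRODUCT sequence all appear in this post; the source table name (SOURCE_PRODUCT), the DISCONTINUED_DATE column used to derive the flag, and the PRODUCT_KEY surrogate key column are assumptions, so check them against the sample data from the previous post.

CREATE OR REPLACE PROCEDURE LOAD_DIM_PRODUCT AS
BEGIN
   -- Clear the staging tables so the job can be re-run safely.
   EXECUTE IMMEDIATE 'TRUNCATE TABLE STAGE_PRODUCT_EXTRACT';
   EXECUTE IMMEDIATE 'TRUNCATE TABLE STAGE_PRODUCT_TRANSFORM';
   EXECUTE IMMEDIATE 'TRUNCATE TABLE STAGE_PRODUCT_LOAD';

   -- 1.) Extract all of the dimension attributes from the source.
   INSERT INTO STAGE_PRODUCT_EXTRACT (PRODUCT_ID, PRODUCT_NAME, DISCONTINUED_DATE)
   SELECT PRODUCT_ID, PRODUCT_NAME, DISCONTINUED_DATE
   FROM SOURCE_PRODUCT;

   -- 2.) Transform: derive the PRODUCT_ACTIVE_FLAG.
   INSERT INTO STAGE_PRODUCT_TRANSFORM (PRODUCT_ID, PRODUCT_NAME, PRODUCT_ACTIVE_FLAG)
   SELECT PRODUCT_ID,
          PRODUCT_NAME,
          CASE WHEN DISCONTINUED_DATE IS NULL THEN 'Y' ELSE 'N' END
   FROM STAGE_PRODUCT_EXTRACT;

   -- 3.) Identify new records via an outer join against the dimension.
   INSERT INTO STAGE_PRODUCT_LOAD (PRODUCT_ID, PRODUCT_NAME, PRODUCT_ACTIVE_FLAG)
   SELECT t.PRODUCT_ID, t.PRODUCT_NAME, t.PRODUCT_ACTIVE_FLAG
   FROM STAGE_PRODUCT_TRANSFORM t
   LEFT OUTER JOIN DIM_PRODUCT d
     ON t.PRODUCT_ID = d.PRODUCT_ID
   WHERE d.PRODUCT_ID IS NULL;

   -- 4.) Generate surrogate key values for the new records.
   UPDATE STAGE_PRODUCT_LOAD
   SET PRODUCT_KEY = SEQ_DIM_PRODUCT.NEXTVAL;

   -- 5.) Load: update existing records, then insert the new ones.
   UPDATE DIM_PRODUCT d
   SET (PRODUCT_NAME, PRODUCT_ACTIVE_FLAG) =
       (SELECT t.PRODUCT_NAME, t.PRODUCT_ACTIVE_FLAG
        FROM STAGE_PRODUCT_TRANSFORM t
        WHERE t.PRODUCT_ID = d.PRODUCT_ID)
   WHERE EXISTS
       (SELECT 1
        FROM STAGE_PRODUCT_TRANSFORM t
        WHERE t.PRODUCT_ID = d.PRODUCT_ID);

   INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_ID, PRODUCT_NAME, PRODUCT_ACTIVE_FLAG)
   SELECT PRODUCT_KEY, PRODUCT_ID, PRODUCT_NAME, PRODUCT_ACTIVE_FLAG
   FROM STAGE_PRODUCT_LOAD;

   COMMIT;
END LOAD_DIM_PRODUCT;
/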

Run the procedure (after you have created the sample data provided in the previous post, of course) to populate the DIM_PRODUCT dimension.  Notice what is happening...

1.) Extract all of the dimension attributes from the source - Everything is pulled from the source system and placed into the STAGE_PRODUCT_EXTRACT table.
2.) Transform the data according to the requirements - The PRODUCT_ACTIVE_FLAG is derived, and that data is placed into the STAGE_PRODUCT_TRANSFORM table.
3.) Identify the records that are new to the table, as opposed to those that already exist and may need to be updated - New records are identified (via an outer join) and placed into the STAGE_PRODUCT_LOAD table.
4.) Generate new surrogate key values and add them to the new records - Surrogate keys are created in the STAGE_PRODUCT_LOAD table using the Oracle sequence mentioned earlier.
5.) Load the records into the dimension table - Existing records are updated in the DIM_PRODUCT table, and the new records from the STAGE_PRODUCT_LOAD table are inserted.

This job is designed to be run as many times as necessary.  Whether you run it once or several times, the result should be the same 25 records in the DIM_PRODUCT table.  This is a simple example for a few reasons, one of which is that we are working with a very small amount of data.  A more complex ETL job might examine the source system's data (using a last-updated timestamp or similar change-tracking information) to determine which records are new or have been updated before pulling them into the staging area, as sketched below.
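To sketch that incremental idea (the LAST_UPDATED_DATE column on the source and the ETL_JOB_RUN control table are hypothetical, not part of the sample data), the extract step might pull only the rows changed since the last run...

INSERT INTO STAGE_PRODUCT_EXTRACT (PRODUCT_ID, PRODUCT_NAME, DISCONTINUED_DATE)
SELECT PRODUCT_ID, PRODUCT_NAME, DISCONTINUED_DATE
FROM SOURCE_PRODUCT
WHERE LAST_UPDATED_DATE >
      (SELECT MAX(RUN_DATE)   -- timestamp recorded by the last successful run
       FROM ETL_JOB_RUN
       WHERE JOB_NAME = 'LOAD_DIM_PRODUCT');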

Also, a more complex ETL job may not accomplish the five things listed above in five simple steps, as this one does.  Due to the complexity of the data, each of those five things may take several steps.

If you wish, change some of the source data to see the updates occur, or tweak the ETL job to your liking.
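For example, assuming the SOURCE_PRODUCT table and column names from the sketch above, you could rename a product and re-run the job to watch the update flow through to the dimension...

UPDATE SOURCE_PRODUCT
SET PRODUCT_NAME = 'Renamed Test Product'   -- any new value will do
WHERE PRODUCT_ID = 1;                       -- pick any product in the sample data
COMMIT;

-- Re-run the job, then confirm the change landed in DIM_PRODUCT.
EXEC LOAD_DIM_PRODUCT;

SELECT PRODUCT_ID, PRODUCT_NAME, PRODUCT_ACTIVE_FLAG
FROM DIM_PRODUCT
WHERE PRODUCT_ID = 1;

By running this ETL job, you have just turned ordinary data into something that is truly valuable.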

