Valuable Data: August 2012

Saturday, August 25, 2012

Dimensional Modeling Video

Dimensional modeling is a very powerful technique in helping to enable excellent decision making. This is a video that explains the business value that can be derived from using this technique. For more information on dimensional modeling, including free data with which you can practice, look at the Learn, Grow, and Succeed links at http://www.brianciampa.com/careerindatawarehousing.html.

Also, for those looking for a fresh way to market your skillset consider The Data Warehouse Portfolio at http://www.brianciampa.com.

Friday, August 17, 2012

The Data Warehouse Portfolio

You've spent weeks in data warehousing classes, days in training sessions, hours reading books on the subject, and perhaps years warehousing data for other organizations. However, right now you want to work for this one. You know that you have the experience as well as the ambition. You have a list of ideas to take to this new potential employer that you know will provide immense value. So, what's the next step? All of that is placed onto a very professionally-looking resume and submitted. The only problem is that everybody else has done that same thing. You are not known as "Bob, who will add huge value to the business intelligence team," you are known as applicant 27...just another resume in the stack.

Perhaps you are on the other end of the spectrum. You are a novice looking for your first data warehousing position with no prior experience outside of class. You need a way to actually show that despite your lack of work experience you can do this...and do it well.

Do you need a fresh approach to marketing your skillset in hopes of standing out in the crowd? It is interesting that those in the event industry are expected to provide samples of their work for potential clients. Cake decorators, wedding photographers, and other similar vendors would probably be passed over if they only provided a resume that explained what they had done before or what their education claims that they are capable of doing.

What if you placed a sample of your data warehousing skillset onto the web specifically for a potential employer to browse? In other words, you would have a resume to provide the details of your experience and education and a portfolio to provide an example of what you can actually do. The Data Warehouse Portfolio is a book that suggests a way to accomplish this. More information is available at www.brianciampa.com. You may also want to revisit this prior blog post for a bit more information as well.

Your job is to turn data into valuable data. However, showing the value in yourself is the first step in getting that job.

Friday, August 10, 2012

ETL - Practice Loading a Dimension - Solution

In the previous post we looked at the process used to write a basic ETL job to populate a dimension table. As discussed previously, while no two developers will write one exactly the same way, an ETL job that populates a dimension table will need to accomplish the following...

1.) Extract all of the dimension attributes from the source.
2.) Transform the data according to the requirements (i.e., concatenate first name and last name to create a full name column, etc.)
3.) Identify those records that are new to the table as opposed to those records that already exist in the table and may need to be updated due to updates in the source system
4.) Generate new surrogate key values and add them to the new records
5.) Load the records into the dimension table

Consider this example of an Oracle PL/SQL procedure that will populate the DIM_PRODUCT table. Before running the procedure keep these things in mind...

1.) I made an error in the SQL file from the previous post, so you may want to re-download that and run it again. The DIM_PRODUCT.PRODUCT_EFFECTIVE_FLAG should be named DIM_PRODUCT.PRODUCT_ACTIVE_FLAG.
2.) This job depends on an Oracle sequence to create the surrogate keys. Before running the job, run this statement in your Oracle environment...

CREATE SEQUENCE
SEQ_DIM_PRODUCT
MINVALUE 0
INCREMENT BY 1
START WITH 1;

3.) I'll encourage you not to get too lost in the Oracle syntax. The point is to examine the logical flow of an ETL job. If you have a better way of structuring the Oracle code (or want to use something other than Oracle), then by all means do that.

Run the procedure (after you have created the sample data provided in the previous post, of course) to populate the DIM_PRODUCT dimension. Notice what is happening...

1.) Extract all of the dimension attributes from the source - Everything is pulled from the source system and placed into the STAGE_PRODUCT_EXTRACT table.
2.) Transform the data according to the requirements (i.e., concatenate first name and last name to create a full name column, etc.) - The PRODUCT_ACTIVE_FLAG is derived and that data is placed into the STAGE_PRODUCT_TRANSFORM table.
3.) Identify those records that are new to the table as opposed to those records that already exist in the table and may need to be updated due to updates in the source system - New records are identified (via an outer join) and placed into the STAGE_PRODUCT_LOAD table.
4.) Generate new surrogate key values and add them to the new records - Surrogate keys are created in the STAGE_PRODUCT_LOAD table using the Oracle sequence mentioned earlier.
5.) Load the records into the dimension table - Existing records are updated in the DIM_PRODUCT table and the new records from the STAGE_PRODUCT_LOAD table are loaded.

This job is designed to be run as many times as necessary. Running one time or multiple times should still result in 25 records being placed into the DIM_PRODUCT table. This is a simple example for a few reasons, one of which is that we are working with a very small amount of data. A more complex ETL job may examine the source system's data and somehow determine which records are new and/or have been updated before pulling them into the staging area.

Also, more complex ETL jobs may not have five simple steps, as this one does, to accomplish the five things listed above. It may take several steps to accomplish those five things due to the complexity of the data.

If you wish, change some of the source data to see the updates occur or tweak the ETL job to your liking. By running this ETL job, you have just turned ordinary data into something that is truly valuable.

Image: FreeDigitalPhotos.net

Friday, August 3, 2012

ETL - Practice Loading a Dimension

We've been looking recently at the concept of extract, transform, and load (ETL) jobs. This post will begin to examine some of the mechanics. When loading a dimension table the ETL job should accomplish the following things...

1.) Extract all of the dimension attributes from the source.
2.) Transform the data according to the requirements (i.e., concatenate first name and last name to create a full name column, etc.)
3.) Identify those records that are new to the table as opposed to those records that already exist in the table and may need to be updated due to updates in the source system
4.) Generate new surrogate key values and add them to the new records
5.) Load the records into the dimension table

Each ETL job that populates a dimension table will need to accomplish these five things, although no two developers will develop a job exactly the same way. To get a feel for this type of job, consider this very simple example. A source system contains these two tables among others (click the picture to enlarge)...

...and this dimension needs to be populated...

If you have an Oracle environment available to you, run this sql file to create the two source system tables, populate them with data, and create the DIM_PRODUCT table. Each product, as you can see, is grouped under a single product group. The PRODUCT_EFFECTIVE_FLAG indicates whether or not a product is currently effective (current at the time of the ETL job). If you wish, try to write an ETL job that will populate the DIM_PRODUCT table in PL/SQL or another language of your choosing. We will examine a possible solution in a future post.