Friday, September 28, 2012

Data Extraction Techniques

One of the challenges of writing ETL involves deciding which records to pull from the source system.  If a salary star schema is built to contain payroll data, then the ETL job will refresh on a recurring basis to add new data to that star.  If some data related to salary dollars already exists in the star and some new data makes it into the source system, how can the ETL job know to load the new data into the star and avoid reloading the records that were previously loaded (since they already exist in the data warehouse)?  There is no one way of doing this, but consider these options...

1.) Use the source system's datetime stamps - An ETL developer should consider himself very blessed if his source system contains an insert and update datetime stamp for each record in the system.  If this is the case, then he can simply store the date and time of the most recent successful extract and then pull everything that has been inserted or updated since then.  It will be important for the ETL developer to understand how those audit columns are populated in each table to make sure that necessary data will not be left out and unnecessary data will not be included due to odd occurances (i.e., a source system table is reloaded for some reason and records that pertain to data that was initially added three years ago have yesterday's timestamp, etc.).

2.) Compare to the prior extract - In his book The Data Warehouse ETL Toolkit : Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, Ralph Kimball suggests always saving the prior extract in the staging area.  A new extract can be compared to this prior extract and the differences will be uploaded.  Finally the new extract will overwrite the prior extract and be used as the baseline for the next update.

3.) Always pull as of a certain date - In this approach, the salary dollars may always be extracted for the current calendar year (for example).  So, as the year goes on each extraction will be slightly larger than the last.  In April, all salary dollars that date back to January will be extracted from the source.  In May, all salary dollars that date back to January will be extracted from the source as well.  When using this option in conjunction with loading a fact table, the ETL developer will need to delete all of the applicable year's records from the fact table and then load the extract.  Although several records will be deleted and simply reloaded (with no change), this is one way of grabbing what is needed from the source when a simpler solution is not possible.  However, any updates made to the data that pertain to a different year will be missed.  This will need to be taken into consideration.  When using this option in conjunction with loading a dimension table, the ETL developer will need to compare to the dimension table to see if any of the dimensions have changed and only load or update the changes.  Deleting records from the dimension table and reloading is not a good strategy since the fact table has foreign keys that reference the dimension table.

4.) Refresh the fact table - In this case all of the data from the source system will always be extracted and the fact table will be truncated and reloaded each time.  This will only be an acceptable solution if the fact table can be loaded in a timely fashion.  Also, this will only be acceptable with a fact table most of the time.  Even if an ETL developer plans to refresh an entire star (fact and dimension tables) she must consider the possibility that some of these dimensions are conformed dimensions, meaning that other fact tables reference them.

There are obviously more options than these four.  With whatever option you choose, it is important to build the job so that it can be run multiple times with no negative impact.  In other words, running the job twice should not result in duplicate records showing up.  Even though the initial extract may place more records than are needed into the staging area, the job should be "smart enough" to know exactly what to load into the data warehouse and what to exclude.

For more information on data warehousing techniques as well as data with which you can practice writing your own ETL, visit www.brianciampa.com/careerindatawarehousing/grow.html.  Also, if you need a fresh approach to marketing your skillset, consider The Data Warehouse Portfolio.

Image courtesy of digitalart / FreeDigitalPhotos.net

2 comments:

  1. hey nice information for me,thanks for sharing the nice blog with us,I read the full article of data extraction techniques in easy way.I recommend this article to my friends i am sure they will benefited from this blog.I will definitely bookmark this blog.

    data extraction service

    ReplyDelete