Friday, September 28, 2012
Data Extraction Techniques
1.) Use the source system's datetime stamps - An ETL developer should consider himself very blessed if his source system contains an insert and update datetime stamp for each record in the system. If this is the case, then he can simply store the date and time of the most recent successful extract and then pull everything that has been inserted or updated since then. It will be important for the ETL developer to understand how those audit columns are populated in each table, to make sure that necessary data is not left out and unnecessary data is not included due to odd occurrences (e.g., a source system table is reloaded for some reason, and records pertaining to data that was initially added three years ago suddenly carry yesterday's timestamp).
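The timestamp-based pull can be sketched with a few lines of Python. This is a minimal illustration, not a production job; the table, column names, and sample data are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical source table with audit columns (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    inserted_at TEXT,   -- audit column populated on insert
    updated_at TEXT     -- audit column populated on every update
)""")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Ann",   "2012-09-01 08:00:00", "2012-09-01 08:00:00"),
        (2, "Bob",   "2012-09-10 09:15:00", "2012-09-25 14:30:00"),
        (3, "Carol", "2012-09-26 10:00:00", "2012-09-26 10:00:00"),
    ],
)

# Date and time of the most recent successful extract, stored by the ETL job.
last_successful_extract = "2012-09-20 00:00:00"

# Pull everything inserted or updated since then.
changed_rows = conn.execute(
    """SELECT id, name FROM employee
       WHERE inserted_at > ? OR updated_at > ?
       ORDER BY id""",
    (last_successful_extract, last_successful_extract),
).fetchall()

print(changed_rows)  # only Bob (updated) and Carol (inserted) qualify
```

After a successful run, the job would overwrite `last_successful_extract` with the current run's start time so the next run picks up where this one left off.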
2.) Compare to the prior extract - In his book The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, Ralph Kimball suggests always saving the prior extract in the staging area. A new extract can be compared to this prior extract and only the differences loaded. Finally, the new extract overwrites the prior extract and becomes the baseline for the next update.
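The compare-to-prior-extract idea can be sketched as a simple diff between two snapshots keyed on a business identifier. The data here is invented for illustration.

```python
# Prior extract, saved in the staging area from the last run.
prior_extract = {
    1: ("Ann", "Accounting"),
    2: ("Bob", "Sales"),
}

# New extract, freshly pulled from the source.
new_extract = {
    1: ("Ann", "Accounting"),   # unchanged -> skipped
    2: ("Bob", "Marketing"),    # changed   -> update
    3: ("Carol", "IT"),         # new       -> insert
}

# Rows present only in the new extract are inserts.
inserts = {k: v for k, v in new_extract.items() if k not in prior_extract}

# Rows present in both but with different values are updates.
updates = {k: v for k, v in new_extract.items()
           if k in prior_extract and prior_extract[k] != v}

# The new extract now overwrites the prior one as the next baseline.
prior_extract = new_extract
```

Only `inserts` and `updates` move forward into the warehouse; unchanged rows are discarded.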
3.) Always pull as of a certain date - In this approach, the salary dollars (for example) may always be extracted for the current calendar year. So, as the year goes on, each extraction will be slightly larger than the last. In April, all salary dollars dating back to January will be extracted from the source; in May, the same window will be extracted again. When using this option in conjunction with loading a fact table, the ETL developer will need to delete all of the applicable year's records from the fact table and then load the extract. Although several records will be deleted and simply reloaded (with no change), this is one way of grabbing what is needed from the source when a simpler solution is not possible. However, any updates made to data that pertains to a different year will be missed, and this will need to be taken into consideration. When using this option in conjunction with loading a dimension table, the ETL developer will need to compare the extract to the dimension table to see if any of the dimension values have changed, and only load or update the changes. Deleting records from the dimension table and reloading is not a good strategy, since the fact table has foreign keys that reference the dimension table.
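The delete-then-reload step for the fact table can be sketched as below. The `salary_fact` table and its rows are hypothetical; note how the prior-year row is left untouched while the current year's rows are wiped and replaced by the extract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salary_fact (employee_id INTEGER, pay_date TEXT, amount REAL)"
)
# Rows loaded by earlier runs, including the current (2012) year.
conn.executemany("INSERT INTO salary_fact VALUES (?, ?, ?)", [
    (1, "2011-12-15", 900.0),   # prior year: must survive the reload
    (1, "2012-01-15", 1000.0),
    (2, "2012-02-15", 1100.0),
])

# This month's extract: everything dating back to January of the current year.
extract = [
    (1, "2012-01-15", 1000.0),  # unchanged, reloaded anyway
    (2, "2012-02-15", 1150.0),  # amount corrected in the source
    (1, "2012-03-15", 1000.0),  # new row
]

# Delete the applicable year's records, then load the extract.
conn.execute("DELETE FROM salary_fact WHERE pay_date >= '2012-01-01'")
conn.executemany("INSERT INTO salary_fact VALUES (?, ?, ?)", extract)

rows = conn.execute(
    "SELECT COUNT(*), ROUND(SUM(amount), 2) FROM salary_fact"
).fetchone()
print(rows)  # four rows total: one from 2011 plus the three reloaded
```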
4.) Refresh the fact table - In this case, all of the data from the source system will always be extracted, and the fact table will be truncated and reloaded each time. This will only be an acceptable solution if the fact table can be loaded in a timely fashion. Also, this is typically only acceptable for a fact table. Even if an ETL developer plans to refresh an entire star (fact and dimension tables), she must consider the possibility that some of those dimensions are conformed dimensions, meaning that other fact tables reference them.
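A full refresh is the simplest pattern of the four: clear the fact table and reload it from a complete extract. The sketch below uses an invented `sales_fact` table; SQLite has no TRUNCATE statement, so an unqualified DELETE stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# A complete extract from the source (hypothetical data); note the
# corrected amount on order 2 and the brand-new order 3.
full_extract = [(1, 10.0), (2, 25.0), (3, 30.0)]

# Truncate and reload: only the fact table is cleared, never the
# dimension tables, which may be conformed and shared with other facts.
conn.execute("DELETE FROM sales_fact")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", full_extract)

count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(count)  # the table now holds exactly the rows in the extract
```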
There are obviously more options than these four. Whichever option you choose, it is important to build the job so that it can be run multiple times with no negative impact. In other words, running the job twice should not result in duplicate records showing up. Even though the initial extract may place more records than are needed into the staging area, the job should be "smart enough" to know exactly what to load into the data warehouse and what to exclude.
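The "run it twice with no negative impact" property can be demonstrated with a load keyed on a primary key. This is one way to get that behavior, here using SQLite's INSERT OR REPLACE (other databases offer MERGE or upsert syntax); the table and rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT)"
)

# Rows sitting in the staging area, waiting to be loaded.
staged = [(1, "Ann"), (2, "Bob")]

def load(rows):
    # INSERT OR REPLACE keyed on the primary key makes the load idempotent:
    # re-running replaces rows in place instead of duplicating them.
    conn.executemany("INSERT OR REPLACE INTO customer_dim VALUES (?, ?)", rows)

load(staged)
load(staged)  # running the job a second time adds no duplicates

count = conn.execute("SELECT COUNT(*) FROM customer_dim").fetchone()[0]
print(count)  # still only two customers
```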
For more information on data warehousing techniques as well as data with which you can practice writing your own ETL, visit www.brianciampa.com/careerindatawarehousing/grow.html. Also, if you need a fresh approach to marketing your skillset, consider The Data Warehouse Portfolio.
Image courtesy of digitalart / FreeDigitalPhotos.net