Sunday, January 15, 2012

Valuable Data

What is data warehousing?  Why is it important?  If I'm an executive of an organization and we just spent a few million dollars implementing a large enterprise system for our finance or human resources data, why are you asking me for MORE money to implement something called a data warehouse?  Why do I need two systems?  What's it going to do for me?
This is a valid question and below I'll try to explain why a decision maker can benefit from such a system.  First, allow me to give you some background.  I've always enjoyed working with data.  Relational databases and their capabilities fascinate me.  A few years ago, I took that one step further and dove into the world of data warehousing/business intelligence.  In an attempt to share information a bit I thought I would blog about this area.  All organizations have data but only those who invest in their ability to analyze that data turn it into something of true value.  I would love any feedback, suggestions, constructive criticisms, etc. that you have.  This area is constantly changing, so knowledge sharing is essential and welcome.
An operational system involves a database that is built in something called third-normal form.  In a nutshell, this means that the system is optimized for getting data into it but not ideal for analyzing that data (at a high level).  A data warehouse is a database that is denormalized (the opposite of third-normal form).  This type of system is optimized for analyzing the data but is not optimized for putting data into it.  Consider this example...
When you were young you may have had a piggy bank into which you placed coins.  Putting coins into that bank is not hard.  You simply drop them into the slit in the top.  However, those banks are not ideal for understanding how much money they contain.  First, they are designed to encourage saving, so getting the money out is not the easiest thing for a kid to do.  Second, even if the bank is opened and you are able to peer inside, you will just see a pile of coins.  It's anybody's guess as to how much money it represents.
Now, consider if you changed the environment of those coins a bit.  You remove the coins from the bank and put them into money rolls.  That's a more tedious process than simply dropping coins through a slit at the top of the piggy bank.  However, once that is done, it will be much easier to see how much money the bank contained.  Looking at coins that are rolled up (structured) in this way makes them optimized for counting (analysis).
In much the same way, when you go to a retail store and purchase a shirt you don't want to have to stand at the register for a long time waiting for the computer to process the transaction.  As a result, those systems are built in third-normal form so that the data can be placed into the system quickly and you can be on your way.  The transaction may look something like this...

Date
Item Number
Item Description
Department
Amount Paid
11/14/2011
876FTR
Men’s Polo
Men’s Dept
$12.99


Now, what if 100,000 items were sold by several stores in one day?  There would be 100,000 records like this in the system (with different data of course).  Multiply that by 365 (days in a year) and you see the amount of data that we're talking about.  Now, suppose an executive is wondering when Men's Polo shirts are most often sold.  Summer?  Spring?  October?  July?  Examining all of those thousands of records is not practical, much like trying to determine how much money is contained within the pile of coins in the piggy bank.  However, if a program ran each night (usually called an ETL job) that copied this data into a data warehouse and structured it in such a way that you can easily see into which year, quarter, season, and month this transaction fell, it would be much easier to answer those questions for upper management.  When data is made available to decision makers in this way it becomes truly valuable.  Data that is just sitting in an operational system is just data.  Data that is structured in such a way that it not only tells a story but you can easily see what that story is is a pathway to making better decisions. 
The piggy bank example might not be a perfect comparison.  In a data warehousing environment the data exists in third-normal form in the source system and is then copied from the source (not removed from the source) and placed into the data warehouse.  You can't "copy" the coins so that they remain in the piggy bank and exist as rolled coins too.  However, I'll trust that the example provides some benefit. 
In the next post I plan to explain data warehousing in a more technical sense.  Until then, enjoy your data!

No comments:

Post a Comment