Saturday, January 28, 2012

Reflecting Data

Each morning I wake up and do some quick analysis.  I’m not referring to analysis of data in a relational database.  I’m referring to analysis of myself.  I stand in front of the mirror to see how the night treated me.  My eyes are groggy, my hair is a mess, and I have stubble on my cheeks.  As a result, I need to fix a few things.
This process is not unlike the analysis done by executives of any organization.  An executive seeks information and then acts on it.  The mirror in this illustration represents the role of the business intelligence experts.  It provides the data to be analyzed.
This illustration is interesting when placed into the context of data warehousing/business intelligence.  While standing in front of the mirror, I represent the source system.  The mirror is strictly a reflection of the source system (me in this case).  When I notice something that needs to be fixed (i.e., my hair needs to be combed) it would be odd and incorrect to “fix” the mirror in some way.  What if I spent the next few minutes going around the house and somehow manipulating all of the mirrors so that my messy hair is no longer shown in the reflection?  In that case, a few problems exist…
1.)    The ultimate problem has not been solved; I’m just not looking at it anymore.
2.)    The mirror is no longer a true reflection of my appearance; it is a reflection of how I want to appear.
In the same way, a data warehouse should be a reflection of its source systems.  Data is not created specifically for the data warehouse; it is created for a source system and then restructured by the data warehouse to make analysis easier.  If a source system contains some incorrect data, and that data flows to the data warehouse, the tendency is to want to “fix it on the report” (leaving the incorrect data in the source system).  This would be equivalent to “fixing” the mirror in the example above.
Taking the “let’s fix it in the warehouse and worry about it later” approach can be a bit dangerous.  If a quick SQL statement is run to change the data now, what happens when the ETL job runs again?  It may replace the updated data with the original (incorrect) data from the source system, resulting in some reports with the updated data floating around and other reports (run after the ETL) with the source system’s data.  Also, anybody who goes back to the source system to verify something will see the discrepancy.  Questions may be asked, and the perception of the warehouse’s data integrity may begin to be compromised.  In other words, this is just not a good plan of action.
As a side note, I do want to address the possibility that an ETL job was written incorrectly.  If the source system shows A and the data warehouse shows B as a result of the ETL, then there is an issue with the warehouse that must be addressed by that team immediately.
When the data warehouse is used as a true reflection of the source system, it truly becomes a stimulus for action.  When something looks odd in the data, it may refer to a data problem in the source system or a reflection of a business problem (i.e., sales really dropped 10% last month?).  Each of these should move the organization to some sort of action (correcting data in the source system or re-evaluating a business strategy).  In this case, the data warehouse has not only helped data to become truly valuable but it has also helped to increase the intelligence of the organization.

Saturday, January 21, 2012

Technical Value Add

In the last post we looked a bit at the theory of data warehousing and the value that it adds.  Now I would like to examine the value add from a more technical perspective.  Suppose you have the following tables in an operational system.
Now, suppose that an executive asks the following question: Which employees sold products in the summer or fall from the “Apparel” product group? 
From that conversation you are tasked with providing a report that shows…
1.)    Employee Name
2.)    Employee Status
3.)    Product Name
4.)    Product Status
5.)    Season
6.)    Actual Sales Price
Since this data has not been warehoused, a SQL statement similar to the following will probably be required (assuming Oracle syntax).
SELECT                    a.last_name,
                                decode(a.termination_date, null, ‘Active’,’Inactive’) as “Employee Status”,
                                decode(e.discontinued_date, null, ‘Active’,’Inactive’) as “Product Status”,
                                case        when extract(month from in (11,12,1,2) then ‘Winter’
                                                when extract(month from in (3,4,5) then ‘Spring’
                                                when extract(month from in (6,7,8) then ‘Summer’
                                                when extract(month from in (9,10) then ‘Fall’
                                end as “Season”,
                                e.price – c.discount_applied as “Actual Sales Price”                         
FROM                     employee a,
                                transaction_header b,
                                transaction_detail c,
                                product_group d,
                                product e
WHERE          = b.sales_person_id
                                AND = c.header_id
                                AND = c.product_id
                                AND = e.product_group_id
                                AND case                when extract(month from in (11,12,1,2) then ‘Winter’
                                                                when extract(month from in (3,4,5) then ‘Spring’
                                                                when extract(month from in (6,7,8) then ‘Summer’
                                                                when extract(month from in (9,10) then ‘Fall’
                                                end in (‘Summer’,’Fall’)
                                AND = ‘Apparel’

While this is not the longest SQL statement ever written, it is pretty verbose.  The employee status, product status, actual sales price, and season fields need to be derived or calculated.  That derivation or calculation must happen within the SQL statement for each row that is returned, which will eat up resources.  If the business rules for the definition of a season change, then they will need to be changed in both the select clause and the where clause. 
Suppose that beginning in 2011, management decided that November would be considered a “Fall” month but wanted to keep history.  In other words, November transactions that occurred in 2010 or earlier would fall into the “Winter” bucket but those in 2011 and later will fall into the “Fall” bucket.  Imagine the SQL statement that would need to be written for that!
In reality, this example may be a bit simplistic.  In fact, if this requirement came through and it was the only requirement of its kind, it may be worth it to just write the SQL.  However, once decision makers start to see data in a slightly different way they tend to want more and more and more…so the SQL statements get more complex.  Once the analytical requirements reach a certain point, the data needs to be restructured.
Now, consider if the data from the tables is denormalized and placed into a data warehouse (see below).
In this case, the SQL statement to produce the same report will be something like this…
SELECT                    b.employee_last_name,
FROM                     fact_transaction a,
                                dim_employee b,
                                dim_product c,
                                dim_date d
WHERE                   a.key_date = d.key_date
                                AND a.key_product = c.key_product
                                AND a.key_sales_person = b.key_employee
                                AND d.season in (‘Summer’,’Fall’)
                                AND c.product_group_name = ‘Apparel’
GROUP BY              b.employee_last_name,

That is much simpler.  Also, the SQL statement does not contain any of the business rules.  Each night, an ETL job (ETL is an acronym for extract, transform, and load) essentially assigns dates to the appropriate quarter, year, season, etc., assigns products to the appropriate product group, derives statuses, and performs other tasks that would be tedious within a select statement.  If the business rules that define a season, for example, change then that can be taken care of in the ETL job.  The SQL statement can remain the same.  This makes pulling data for the purposes of analysis much simpler.  If you are used to thinking about data in third-normal form then this will require a slightly different perspective.  However, the value add can be great in helping executives analyze their data quickly.

Sunday, January 15, 2012

Valuable Data

What is data warehousing?  Why is it important?  If I'm an executive of an organization and we just spent a few million dollars implementing a large enterprise system for our finance or human resources data, why are you asking me for MORE money to implement something called a data warehouse?  Why do I need two systems?  What's it going to do for me?
This is a valid question and below I'll try to explain why a decision maker can benefit from such a system.  First, allow me to give you some background.  I've always enjoyed working with data.  Relational databases and their capabilities fascinate me.  A few years ago, I took that one step further and dove into the world of data warehousing/business intelligence.  In an attempt to share information a bit I thought I would blog about this area.  All organizations have data but only those who invest in their ability to analyze that data turn it into something of true value.  I would love any feedback, suggestions, constructive criticisms, etc. that you have.  This area is constantly changing, so knowledge sharing is essential and welcome.
An operational system involves a database that is built in something called third-normal form.  In a nutshell, this means that the system is optimized for getting data into it but not ideal for analyzing that data (at a high level).  A data warehouse is a database that is denormalized (the opposite of third-normal form).  This type of system is optimized for analyzing the data but is not optimized for putting data into it.  Consider this example...
When you were young you may have had a piggy bank into which you placed coins.  Putting coins into that bank is not hard.  You simply drop them into the slit in the top.  However, those banks are not ideal for understanding how much money they contain.  First, they are designed to encourage saving, so getting the money out is not the easiest thing for a kid to do.  Second, even if the bank is opened and you are able to peer inside, you will just see a pile of coins.  It's anybody's guess as to how much money it represents.
Now, consider if you changed the environment of those coins a bit.  You remove the coins from the bank and put them into money rolls.  That's a more tedious process than simply dropping coins through a slit at the top of the piggy bank.  However, once that is done, it will be much easier to see how much money the bank contained.  Looking at coins that are rolled up (structured) in this way makes them optimized for counting (analysis).
In much the same way, when you go to a retail store and purchase a shirt you don't want to have to stand at the register for a long time waiting for the computer to process the transaction.  As a result, those systems are built in third-normal form so that the data can be placed into the system quickly and you can be on your way.  The transaction may look something like this...

Item Number
Item Description
Amount Paid
Men’s Polo
Men’s Dept

Now, what if 100,000 items were sold by several stores in one day?  There would be 100,000 records like this in the system (with different data of course).  Multiply that by 365 (days in a year) and you see the amount of data that we're talking about.  Now, suppose an executive is wondering when Men's Polo shirts are most often sold.  Summer?  Spring?  October?  July?  Examining all of those thousands of records is not practical, much like trying to determine how much money is contained within the pile of coins in the piggy bank.  However, if a program ran each night (usually called an ETL job) that copied this data into a data warehouse and structured it in such a way that you can easily see into which year, quarter, season, and month this transaction fell, it would be much easier to answer those questions for upper management.  When data is made available to decision makers in this way it becomes truly valuable.  Data that is just sitting in an operational system is just data.  Data that is structured in such a way that it not only tells a story but you can easily see what that story is is a pathway to making better decisions. 
The piggy bank example might not be a perfect comparison.  In a data warehousing environment the data exists in third-normal form in the source system and is then copied from the source (not removed from the source) and placed into the data warehouse.  You can't "copy" the coins so that they remain in the piggy bank and exist as rolled coins too.  However, I'll trust that the example provides some benefit. 
In the next post I plan to explain data warehousing in a more technical sense.  Until then, enjoy your data!