Saturday, January 21, 2012

Technical Value Add

In the last post we looked a bit at the theory of data warehousing and the value that it adds.  Now I would like to examine the value add from a more technical perspective.  Suppose you have the following tables in an operational system.
Now, suppose that an executive asks the following question: Which employees sold products in the summer or fall from the “Apparel” product group? 
From that conversation you are tasked with providing a report that shows…
1.)    Employee Name
2.)    Employee Status
3.)    Product Name
4.)    Product Status
5.)    Season
6.)    Actual Sales Price
Since this data has not been warehoused, a SQL statement similar to the following will probably be required (assuming Oracle syntax).
SELECT                    a.last_name,
                                decode(a.termination_date, null, ‘Active’,’Inactive’) as “Employee Status”,
                                decode(e.discontinued_date, null, ‘Active’,’Inactive’) as “Product Status”,
                                case        when extract(month from in (11,12,1,2) then ‘Winter’
                                                when extract(month from in (3,4,5) then ‘Spring’
                                                when extract(month from in (6,7,8) then ‘Summer’
                                                when extract(month from in (9,10) then ‘Fall’
                                end as “Season”,
                                e.price – c.discount_applied as “Actual Sales Price”                         
FROM                     employee a,
                                transaction_header b,
                                transaction_detail c,
                                product_group d,
                                product e
WHERE          = b.sales_person_id
                                AND = c.header_id
                                AND = c.product_id
                                AND = e.product_group_id
                                AND case                when extract(month from in (11,12,1,2) then ‘Winter’
                                                                when extract(month from in (3,4,5) then ‘Spring’
                                                                when extract(month from in (6,7,8) then ‘Summer’
                                                                when extract(month from in (9,10) then ‘Fall’
                                                end in (‘Summer’,’Fall’)
                                AND = ‘Apparel’

While this is not the longest SQL statement ever written, it is pretty verbose.  The employee status, product status, actual sales price, and season fields need to be derived or calculated.  That derivation or calculation must happen within the SQL statement for each row that is returned, which will eat up resources.  If the business rules for the definition of a season change, then they will need to be changed in both the select clause and the where clause. 
Suppose that beginning in 2011, management decided that November would be considered a “Fall” month but wanted to keep history.  In other words, November transactions that occurred in 2010 or earlier would fall into the “Winter” bucket but those in 2011 and later will fall into the “Fall” bucket.  Imagine the SQL statement that would need to be written for that!
In reality, this example may be a bit simplistic.  In fact, if this requirement came through and it was the only requirement of its kind, it may be worth it to just write the SQL.  However, once decision makers start to see data in a slightly different way they tend to want more and more and more…so the SQL statements get more complex.  Once the analytical requirements reach a certain point, the data needs to be restructured.
Now, consider if the data from the tables is denormalized and placed into a data warehouse (see below).
In this case, the SQL statement to produce the same report will be something like this…
SELECT                    b.employee_last_name,
FROM                     fact_transaction a,
                                dim_employee b,
                                dim_product c,
                                dim_date d
WHERE                   a.key_date = d.key_date
                                AND a.key_product = c.key_product
                                AND a.key_sales_person = b.key_employee
                                AND d.season in (‘Summer’,’Fall’)
                                AND c.product_group_name = ‘Apparel’
GROUP BY              b.employee_last_name,

That is much simpler.  Also, the SQL statement does not contain any of the business rules.  Each night, an ETL job (ETL is an acronym for extract, transform, and load) essentially assigns dates to the appropriate quarter, year, season, etc., assigns products to the appropriate product group, derives statuses, and performs other tasks that would be tedious within a select statement.  If the business rules that define a season, for example, change then that can be taken care of in the ETL job.  The SQL statement can remain the same.  This makes pulling data for the purposes of analysis much simpler.  If you are used to thinking about data in third-normal form then this will require a slightly different perspective.  However, the value add can be great in helping executives analyze their data quickly.

No comments:

Post a Comment