Saturday, March 24, 2012

Conformed Dimensions

In a prior post we looked at the structure associated with a star schema, which includes both fact and dimension tables.  The beauty of a star schema is that dimension tables can be (and are intended to be) shared by different stars. 
Let’s look further.  We’ve already used the following star as an example in an earlier post.  The Fact_Transaction.Key_Sales_Person field can be joined to the Dim_Employee.Key_Employee field to report on certain pieces of employee data.  If a data warehouse had 20 stars, it would be very confusing if each star had its own definition of an employee.  It would also be impossible to compare data between stars. 

If an employee dimension such as the one below exists in one star then the fact table in a different star can also use the key_employee values as its foreign key.  In such a scenario, the ERD can be expanded to…

Notice that the organization decided to warehouse its salary data and create a new star which uses the fact_salary table.  Also notice how this table references the dim_date dimension as well as the dim_employee dimension, also used by the fact_transaction table.  From my perspective, at least two advantages exist…
1.)    As already discussed, the time and effort involved in creating and maintaining multiple copies of the same dimension table is avoided.
2.)    Facts that share dimensions can easily be compared side by side.  Using the ERD above, let’s say that a report was requested to compare the dollars sold to the wages paid to each employee per month.  The dollars sold (fact_transaction.actual_sales_price) and the wages paid (fact_salary.actual_wages_paid) are not in the same fact table.  However, since we have conformed dimensions, this can be accomplished using a SQL statement similar to the following (using Oracle syntax)…
SELECT                    decode(paid.year, null, sales.year, paid.year),
                                decode (paid.month_number, null, sales.month_number, paid.month_number),
                                decode(paid.employee_last_name, null, sales. employee_last_name, paid.employee_last_name),
                                decode(paid.employee_first_name, null, sales. employee_first_name, paid.employee_first_name),
                                paid.amtpaid,
                                sales.salesamt
FROM
(SELECT                   d.year,
                                d.month_number,
                                c.employee_last_name,
                                c.employee_first_name
                                sum(b.actual_wages_paid) amtpaid
FROM                      fact_salary b,
                                dim_employee c,
                                dim_date d
WHERE                    b.key_date_paid = d.key_date
                                AND b.key_employee = c.key_employee
GROUP BY              d.year,
                                d.month_number,
c.employee_last_name,
                                c.employee_first_name) paid
FULL OUTER JOIN
(SELECT                   d.year,
                                d.month_number,
                                c.employee_last_name,
                                c.employee_first_name
                                sum(a.actual_sales_price) salesamt
FROM                      fact_transaction a,
                                dim_employee c,
                                dim_date d
WHERE                    a.key_date = d.key_date
                                AND a.key_sales_person = c.key_employee
GROUP BY              d.year,
                                d.month_number,
                                c.employee_last_name,
                                c.employee_first_name) sales
                ON                           paid.year = sales.year
                                                AND paid.month_number = sales.month_number
                                                AND paid.employee_last_name = sales.employee_last_name
                                                AND paid.employee_first_name = sales.employee_first_name
               
As long as the dimensions included in the query from each fact table are at the same grain, this analysis can be done.  For example, trying to compare monthly pay to daily sales (which doesn’t even make sense) will not be possible.  Also note that this kind of thing will probably be done behind the scenes with a piece of reporting software, as opposed to report writers having to write this type of query manually.
From a business perspective, because the definition of an individual employee is the same definition in multiple business processes (sales person in transactions and recipient of pay in salary) and the dates mean the same thing in multiple business processes (date of sale and date paid) then these are conformed dimensions.  From a technical perspective, even if the dimension tables are copied and placed into different stars they would still technically be considered conformed as long as they are copies of each other or subsets of each other.  For example, if we broke up our dim_employee table into dim_employee_usa, dim_employee_europe, dim_employee_canada, etc. and used the one that was the most appropriate for the country being analyzed, that might make for a faster response from the database.  In Oracle-speak, this can be accomplished using materialized views as well (another way to essentially accomplish the same thing).  It is important to note that if this approach is taken, the key_employee value must be the same for each instance of an employee that exists in multiple tables.  So, if Brian Ciampa exists in the overall dim_employee table and also in the dim_employee_usa table (because he is located in the US), then he should have the same key_employee value in each one.  That way, either table can be joined to any fact table that contains that foreign key.  The commonality with regards to structure and business rule definition is key, even if different tables are used.

No comments:

Post a Comment