Valuable Data: March 2012

Saturday, March 31, 2012

Practice Using Conformed Dimensions

In the last post we looked at the benefits of having conformed dimensions. If you wish, feel free to play with this yourself. I have two scripts available that will essentially add a fact_salary table to the star that we used for practice a few weeks ago.

If you created that star in your Oracle instance already and want to simply add the fact_salary table, use this script.

If you want to create the entire star in your Oracle instance, use this script.

The updated ERD is below (if the image is too small to read, click on it to have it enlarged).

See if you can write some SQL that will allow you to compare the measures between the two fact tables. If you are really into this, try adding a third or fourth fact table of your own (and more dimensions if needed) and doing some additional comparisons. A powerful use of a data warehouse involves comparing measures between multiple business processes to answer questions like…

How does his productivity growth compare to his salary growth?

How does this year’s increase in revenue (either in dollars or as a percentage) compare to last year’s increase in dollars allocated to new marketing initiatives?

How are the number of new work from home opportunities affecting employee turnover?

Using conformed dimensions can help to easily answer questions like these. What was once just a pile of data is now providing valuable information to decision makers.

Saturday, March 24, 2012

Conformed Dimensions

In a prior post we looked at the structure associated with a star schema, which includes both fact and dimension tables. The beauty of a star schema is that dimension tables can be (and are intended to be) shared by different stars.

Let’s look further. We’ve already used the following star as an example in an earlier post. The Fact_Transaction.Key_Sales_Person field can be joined to the Dim_Employee.Key_Employee field to report on certain pieces of employee data. If a data warehouse had 20 stars, it would be very confusing if each star had its own definition of an employee. It would also be impossible to compare data between stars.

If an employee dimension such as the one below exists in one star then the fact table in a different star can also use the key_employee values as its foreign key. In such a scenario, the ERD can be expanded to…

Notice that the organization decided to warehouse its salary data and create a new star which uses the fact_salary table. Also notice how this table references the dim_date dimension as well as the dim_employee dimension, also used by the fact_transaction table. From my perspective, at least two advantages exist…

1.) As already discussed, the time and effort involved in creating and maintaining multiple copies of the same dimension table is avoided.

2.) Facts that share dimensions can easily be compared side by side. Using the ERD above, let’s say that a report was requested to compare the dollars sold to the wages paid to each employee per month. The dollars sold (fact_transaction.actual_sales_price) and the wages paid (fact_salary.actual_wages_paid) are not in the same fact table. However, since we have conformed dimensions, this can be accomplished using a SQL statement similar to the following (using Oracle syntax)…

SELECT decode(paid.year, null, sales.year, paid.year),

decode (paid.month_number, null, sales.month_number, paid.month_number),

decode(paid.employee_last_name, null, sales. employee_last_name, paid.employee_last_name),

decode(paid.employee_first_name, null, sales. employee_first_name, paid.employee_first_name),

paid.amtpaid,

sales.salesamt

FROM

(SELECT d.year,

d.month_number,

c.employee_last_name,

c.employee_first_name

sum(b.actual_wages_paid) amtpaid

FROM fact_salary b,

dim_employee c,

dim_date d

WHERE b.key_date_paid = d.key_date

AND b.key_employee = c.key_employee

GROUP BY d.year,

d.month_number,

c.employee_last_name,

c.employee_first_name) paid

FULL OUTER JOIN

(SELECT d.year,

d.month_number,

c.employee_last_name,

c.employee_first_name

sum(a.actual_sales_price) salesamt

FROM fact_transaction a,

dim_employee c,

dim_date d

WHERE a.key_date = d.key_date

AND a.key_sales_person = c.key_employee

GROUP BY d.year,

d.month_number,

c.employee_last_name,

c.employee_first_name) sales

ON paid.year = sales.year

AND paid.month_number = sales.month_number

AND paid.employee_last_name = sales.employee_last_name

AND paid.employee_first_name = sales.employee_first_name

As long as the dimensions included in the query from each fact table are at the same grain, this analysis can be done. For example, trying to compare monthly pay to daily sales (which doesn’t even make sense) will not be possible. Also note that this kind of thing will probably be done behind the scenes with a piece of reporting software, as opposed to report writers having to write this type of query manually.

From a business perspective, because the definition of an individual employee is the same definition in multiple business processes (sales person in transactions and recipient of pay in salary) and the dates mean the same thing in multiple business processes (date of sale and date paid) then these are conformed dimensions. From a technical perspective, even if the dimension tables are copied and placed into different stars they would still technically be considered conformed as long as they are copies of each other or subsets of each other. For example, if we broke up our dim_employee table into dim_employee_usa, dim_employee_europe, dim_employee_canada, etc. and used the one that was the most appropriate for the country being analyzed, that might make for a faster response from the database. In Oracle-speak, this can be accomplished using materialized views as well (another way to essentially accomplish the same thing). It is important to note that if this approach is taken, the key_employee value must be the same for each instance of an employee that exists in multiple tables. So, if Brian Ciampa exists in the overall dim_employee table and also in the dim_employee_usa table (because he is located in the US), then he should have the same key_employee value in each one. That way, either table can be joined to any fact table that contains that foreign key. The commonality with regards to structure and business rule definition is key, even if different tables are used.

Sunday, March 18, 2012

Slowly Changing Dimensions

In a prior post, before we took a break to practice a bit, we looked at the difference between natural and surrogate keys. At first glance, you may wonder why a surrogate key cannot also serve as a natural key. The answer to this excellent question leads us into a discussion of something called slowly changing dimensions (SCD).

As time goes on, we would be foolish to expect that the descriptive data of our facts will not change. It most definitely will change. As people get promoted, then the job title fields in the employee dimensions will be affected. As business grows and additional lines of business are created, then the dimensions that contain that data will need to be updated. There are a few ways of dealing with this depending on the reporting needs of the organization.

Type 1 SCD: I typically describe a type 1 SCD as a dimension that does not keep history. Once a piece of data changes (i.e., an employee is promoted from a Junior Analyst to a Senior Analyst) the old data is overwritten with the new. Those analyzing the data will never know, from this data set, that this person was ever a Junior Analyst.

Type 2 SCD: I typically describe a type 2 SCD as a dimension that does keep history. Once a piece of data changes, a new row, containing the updated data, is added to the dimension table. The original row is still there. All future fact records that pertain to that descriptive piece of data will have a foreign key pointing to the new record. The older records in the fact table will still point to the original dimension record. So, using the example above, if an employee is promoted from a Junior Analyst to a Senior Analyst, two rows will exist for that employee in the dimension table. One will show this employee as being a Junior Analyst and the other a Senior Analyst. All of the records in the fact table that describe facts that occurred after the promotion will point to the Senior Analyst row. Those prior to the promotion will point to the Junior Analyst row. Those analyzing this data will be able to see that this person was promoted.

Type 3 SCD: A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD. In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD. Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD. Each row also contains a column with the current value. So, continuing with the example above, while you will be able to see that the older fact rows pertain to a time in which the employee was a Junior Analyst, you will also be able to see (while looking at this same historical dataset) that the employee is now a Senior Analyst.

So, let’s return to our original question. Why can’t a natural key be used as the primary key of a dimension table? Technically it can, but that’s a bad idea. Consider a data warehouse that was built so that the natural keys were used as the primary key of the dimension table. If a fact table has employee details, the foreign key pointing to the dimension row in the employee dimension will be the organization’s employee number. This is fine if the employee dimension is a type 1 SCD. However, if the need arises for this to become a type 2 SCD then this becomes a problem. Now you run into a situation where the new design calls for there to be, possibly, multiple rows in the dimension table with the same primary (natural) key since the natural key has changed. Also, what happens when the organization expands and decides to change all of the employee numbers from five digits to ten digits? Now all of your keys related to this dimension in the data warehouse (foreign key values in the fact table and primary key values in the dimension table) must change. However, if surrogate keys are used, then implementing this change will be a much smoother process.

Image: photostock / FreeDigitalPhotos.net

Saturday, March 10, 2012

Gaining Business Intelligence

In the last post we looked at a hands-on example of analyzing data using a star schema. The questions were…

1.) Who was the highest performer in terms of dollars in the first quarter?

2.) Who was the highest performer in terms of volume in the first quarter?

3.) The business experts speculate that a product may have been applied to a line item with the incorrect dollar amount. Using this star, see if you can identify where that may have happened.

Keeping in mind that there are several ways to analyze the data so that these questions can be answered, here are a few ways that will work.

1.) The highest performer in terms of dollars was James Couch with $3,529.98. This can be answered using this SQL statement…

select c.first_name, c.last_name, sum(a.amount), count(a.transaction_line)

from fact_transaction a,

dim_day b,

dim_employee c,

dim_product d

where a.key_day = b.key_day

and a.key_sales_person = c.key_employee

and a.key_product = d.key_product

and b.calendar_quarter = 1

group by c.first_name, c.last_name

order by 3 desc

2.) The highest performer in terms of volume turned out to be several people. Gabriel Palmer, Norma Manlow, and Arnold Tab all fit that bill with 7. This can also be answered using the SQL statement above by changing the last line to “order by 4 desc”.

3.) This request from the business offered an interesting challenge. It would be possible to export all of the transaction data to a spreadsheet and then ask an analyst to examine it line-by-line to find the item that was misapplied. That option would be very tedious and time consuming. Consider this SQL statement…

select d.product_name, b.calendar_quarter, sum(a.amount)/count(a.transaction_line)

from fact_transaction a,

dim_day b,

dim_employee c,

dim_product d

where a.key_day = b.key_day

and a.key_sales_person = c.key_employee

and a.key_product = d.key_product

group by d.product_name, b.calendar_quarter

order by 1,2

This is examining the average price per product per quarter. Scrolling down the list will show an anomaly with the Men’s Athletic Shoes product in calendar quarter 2. Writing the select statement so that it shows the distinct quarter, product, and amount will essentially lead you to the same conclusion. As I mentioned above, there may be multiple ways to answer each question. Anyway, at this point, independent of which method you choose, more detail can be added to the query so that more information can be gathered. Now that we have identified the portion of the “forest” that appears to be problematic (Men’s Athletic Shoes in quarter 2), we can limit our analysis to the “trees” that make up that portion (the individual transactions). Consider the following SQL statement…

select d.product_name, a.transaction_number, a.transaction_line, sum(a.amount)

from fact_transaction a,

dim_day b,

dim_employee c,

dim_product d

where a.key_day = b.key_day

and a.key_sales_person = c.key_employee

and a.key_product = d.key_product

and b.calendar_quarter = 2

and d.product_name = 'Men''s Athletic Shoes'

group by d.product_name, a.transaction_number, a.transaction_line

Scrolling down shows us that line one of transaction 1345641 has an odd amount associated with it. The source system owners can be alerted that this transaction may need to be investigated (assuming that there is nothing wrong with the ETL job).

While this is a simplistic example (the highest performer only sold 7 items) I hope the point is obvious. Warehousing data allows users to easily find answers to business questions like these. If this is the first time that you have done something like this, welcome to Business Intelligence. If you enjoy this kind of thing, continue to play with the data to see what other conclusions you can draw.

Image: jannoon028 / FreeDigitalPhotos.net

Sunday, March 4, 2012

Seeking Business Intelligence

In a prior post, we looked at a hands-on example of examining data in a star schema. Suppose an analyst on the business side does some manual analysis each quarter to determine who deserves the top performer award. You are interested in showing her how the data warehouse can be used to easily answer this question without the need for a lot of manual analysis. Using this star, see if you can answer these questions…

1.) Who was the highest performer in terms of dollars in the first quarter?

2.) Who was the highest performer in terms of volume in the first quarter?

The value of the data warehouse is rising in the eyes of the business community to the point that it is being used more and more for information. The business experts, after looking at some high-level data, speculate that a product may have been applied to a line item with the incorrect dollar amount. Using this star, see if you can identify where that may have happened. By asking these questions (or others), you are using the data warehouse to seek intelligence about your business or organization.

Spend some time answering these questions and in a future post we will examine the solutions. If you find yourself enjoying this exercise that’s probably a tell-tale sign that you enjoy business intelligence.

Image: digitalart / FreeDigitalPhotos.net