Valuable Data: April 2012

Friday, April 27, 2012

Big Data

One item that comes up often in discussions regarding Business Intelligence is Big Data. Big Data refers to a data set that is so large that using traditional relational database technology to interact with it is very difficult and perhaps impractical. So, let's say that a star schema is designed in hopes of helping to increase business intelligence in some area and an ETL job is written to populate that star. If that ETL job takes days to run due to the size and/or complexity of the data (not due to the inefficiency of the code within the ETL job, that can be corrected by rewriting it) then that data set may be referred to as Big Data...and it can be quite frustrating.

There are new technologies being developed that help to deal with this "thorn in the side" of data warehousing professionals. Some of these technologies allow an analyst to view data in a source system in the same way that she would in a data warehouse without needing a data warehouse. One of these is an SAP product called HANA. Remember from this blog's first post that the point of a data warehouse is to create a new environment for data (apart from the source system), in which it is restructured so that it is optimal for analysis. A product like HANA can process records much faster than other relational database products on the market.

So, revisiting the piggy bank example from this blog's first post, suppose that Rain Man (Dustin Hoffman's character in the movie of the same name) peered inside the piggy bank. Someone with his talent would be able to quickly determine the amount of money inside the piggy bank without the need to place the coins into money rolls. In this case, the same business intelligence can be gained even though the coins remain in the source system (the piggy bank). So, what does this mean for skills in dimensional modeling and ETL development? I would imagine that there is some debate on that. I'll leave that for a future post.

I'll say parenthetically that I'm not trying to endorse or not endorse HANA. I'm simply mentioning it as an example of a piece of technology that is designed to deal with Big Data.

Image: digitalart / FreeDigitalPhotos.net

Friday, April 20, 2012

Slowly Changing Dimensions Solutions To Practice

In the last post I provided a script that allowed you to practice interacting with slowly changing dimensions. The answers to the three questions that I proposed are...

1.) John Farmer earned $62,000 in salary (wages, benefits, and bonus) from Information Systems and $12,200 from Human Resources. A SQL statement like this (although variations of it will answer the question as well) will answer this question...

select a.last_name,
a.first_name,
a.employee_number,
a.department,
a.hire_date,
a.title,
a.appointment_begin_date,
a.appointment_end_date,
sum(b.actual_wages_paid),
sum(b.actual_benefits_paid),
sum(b.actual_bonus_paid)
from dim_employee_scd a,
fact_salary_scd b
where a.key_employee_scd = b.key_employee_scd
and a.employee_number = 2546
group by a.last_name, a.first_name, a.employee_number, a.department, a.hire_date, a.title, a.appointment_begin_date, a.appointment_end_date
order by 3,7

2.) James Couch's salary dropped 36.6% from the first to the second position. It dropped 36.4% from the second to the third position. From a business perspective this seems odd, but the point is to see there there was a change of some sort. A SQL statement like this will answer this question...

select a.last_name,
a.first_name,
a.employee_number,
a.department,
a.hire_date,
a.title,
a.appointment_begin_date,
a.appointment_end_date,
sum(b.actual_wages_paid) + sum(b.actual_benefits_paid) + sum(b.actual_bonus_paid)
from dim_employee_scd a,
fact_salary_scd b
where a.key_employee_scd = b.key_employee_scd
and a.employee_number = 2547
group by a.last_name, a.first_name, a.employee_number, a.department, a.hire_date, a.title, a.appointment_begin_date, a.appointment_end_date
order by 3,7

3.) Assuming that a current appointment is defined as a record with a null value for the appointment_end_date (in reality, a developer would need to verify this with a subject matter expert) the following SQL statement will display salary dollars for active positions only...

select a.last_name,
a.first_name,
a.employee_number,
a.department,
a.hire_date,
a.title,
a.appointment_begin_date,
a.appointment_end_date,
sum(b.actual_wages_paid),
sum(b.actual_benefits_paid),
sum(b.actual_bonus_paid)
from dim_employee_scd a,
fact_salary_scd b
where a.key_employee_scd = b.key_employee_scd
and a.appointment_end_date is null
group by a.last_name, a.first_name, a.employee_number, a.department, a.hire_date, a.title, a.appointment_begin_date, a.appointment_end_date
order by 3,7

As the dimensions (descriptors of measures) move through time, a type 2 slowly changing dimension allows the measures to be grouped by each of those changes. For example, the salary dollars can not only be displayed per person. As the person progresses through the organization by moving to new positions, the salary dollars can be grouped by each stop along the way.

From a technical perspective this amounts to the dimension's natural key expanding from the employee number to the employee number, position title, department, and appointment begin date (or whatever defines an appointment in your organization). From a business perspective this example allows an analyst to view an individual's effect on the organization as she progresses through that organization.

Image: digitalart / FreeDigitalPhotos.net

Saturday, April 14, 2012

Practice Using Slowly Changing Dimensions

A few weeks ago we looked at slowly changing dimensions. I've provided a script that will create a table called dim_employee_scd and fact_salary_scd. These are two slightly altered versions of the same tables in the star that we have been using (if you are just joining us, don't worry, they can stand alone too).

In this case, the dim_employee_scd acts as a slowly changing dimension. As always, don't read too much into the dates and/or salary amounts. This data is entirely fictitious. Run the script (here are instructions if needed), play with the data and see if you can answer these questions...

1.) How much did John Farmer receive in salary from Information Systems as opposed to Human Resources?
2.) As a percentage, how many more (or fewer) dollars did James Couch receive in salary when he moved from the first to the second and then from the second to the third position? Remember, even if the numbers decrease although it appears that he was promoted, that's ok. This is fictitious data and the point is to see there there was a change of some sort.
3.) Try to display salary dollars for active only positions, which may require that you make an assumption about the business rule.

We'll look at some solutions in the future.

Sunday, April 8, 2012

Analysis Using Conformed Dimensions

In the last post we practiced examining data from different fact tables via conformed dimensions. While the syntax may be different from solution to solution, something like the select statement below will allow you to compare salary dollars from the fact_salary table to sales dollars from the fact_transaction table per month for each person.

SELECT decode(paid.calendar_year, null, sales.calendar_year, paid.calendar_year) Year,

decode (paid.calendar_month_number, null, sales.calendar_month_number, paid.calendar_month_number) Month,

decode(paid.last_name, null, sales.last_name, paid.last_name) LastName,

decode(paid.first_name, null, sales.first_name, paid.first_name) FirstName,

paid.amtpaid,

sales.salesamt

FROM

(SELECT d.calendar_year,

d.calendar_month_number,

c.last_name,

c.first_name,

sum(b.actual_wages_paid) amtpaid

FROM fact_salary b,

dim_employee c,

dim_day d

WHERE b.key_date_paid = d.key_day

AND b.key_employee = c.key_employee

GROUP BY d.calendar_year, d.calendar_month_number, c.last_name, c.first_name) paid

FULL OUTER JOIN

(SELECT d.calendar_year,

d.calendar_month_number,

c.last_name,

c.first_name,

sum(a.amount) salesamt

FROM fact_transaction a,

dim_employee c,

dim_day d

WHERE a.key_day = d.key_day

AND a.key_sales_person = c.key_employee

GROUP BY d.calendar_year, d.calendar_month_number, c.last_name, c.first_name) sales

ON paid.calendar_year = sales.calendar_year

AND paid.calendar_month_number = sales.calendar_month_number

AND paid.last_name = sales.last_name

AND paid.first_name = sales.first_name

In order to take advantage of this approach, an analyst must view the data from each fact table at the same grain. In the select statement above, notice that each of the inner selects group by the same dimension attributes (calendar_year, calendar_month_number, last_name, and first_name) and then the outer select statement joins on those attributes. Since there may be some instances in which an individual has sales dollars but no salary dollars and vice versa, an Oracle full outer join is used. If you are confident that one of the select statements will return all of the possible individuals for all possible time periods for that person, then a left or right outer join may be used. In reality a reporting tool will probably take care of this on the back-end, so that a data analyst will not need to be familiar with this exact syntax (although it may be beneficial to conceptually understand how the data is being pulled).

From a business perspective, this kind of analysis will allow a decision maker to view productivity vs. cost for each instance of a business entity (i.e., individual, department, region, line of business, etc.). This is yet another way in which data warehousing can help data become truly valuable.

Image: vichie81 / FreeDigitalPhotos.net