Friday, April 25, 2014

Data Helping To Fight Crime

In her role as the District Attorney of New Jersey, Anne Milgram found the crime fighting arena to be horribly inefficient.  For example, she recalls a time when an individual was arrested and held on a bail amount of $3,500 dollars and was unable to pay that amount.  As a result, he stayed in jail until his case was heard eight months later, which cost the public over $9,000 dollars.  Believing that data could help, Ms. Milgram developed a tool to help correct this, making crime fighting a data driven activity.  This is a great example of somebody who is taking data and turning it into something that is truly valuable.  She describes this experience in this TED talk.

Friday, April 18, 2014

Creating Date Dimensions

One of the tasks of an architect who is implementing a data warehouse is to create the date dimensions.  Because these dimensions have no source system, they are often created in something like Excel and then inserted directly into the dimension tables themselves.  Searching the internet will probably reveal some code that will automatically create these dimensions.  I've created one such PL/SQL block here that will create three date dimensions, a daily, monthly, and yearly grain.  Because this will probably only be run once, there is no error handling nor is there a way to add to the date dimension (it will only create from scratch).  Feel free and add those features if you so desire.  Here is an explanation of the parameters:

1.) p_begin_year - This is the first year of the range of dates that you wish to create
2.) p_end_year - This is the last year of the range of dates that you wish to create
3.) p_create_tables - This is intended to have a value of YES or NO.  Set it to YES if you want the tables to be created and populated.  Set it to no if you only want the tables to be populated (e.g., they have already been created).
4.) p_include_fiscal_year - If you would like some data related to your fiscal year in the dim_day and dim_month tables, then set this value to YES.  Otherwise, set it to NO.
5.) p_fiscal_year_begin_month_no - If the answer to #4 is YES, then set this to the calendar month number that is the first month in your fiscal year.  In other words, if your fiscal year begins in July, set this to 7.  If it begins in October, set it to 10, etc.

If nothing else, this will be a starting point for you with regards to creating a date dimension.  Examine the code, tweak it, etc.

Note: This has been written in PL/SQL, meaning that it will only run in an Oracle database.  If you would like to do this with a system from another database vendor, similar scripts can be written in that vendor's native language.

declare
p_begin_year int := 2001;
p_end_year int := 2014;
p_create_tables varchar2(3) := 'YES';
p_include_fiscal_year varchar2(3) := 'YES';
p_fiscal_year_begin_month_no int := 7;

d_first_day date;
d_last_day date;
d_loop_day date;

i_counter int;

begin
if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_DAY '
|| '( '
|| 'KEY_DAY INT, '
|| 'CALENDAR_DAY DATE, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'MONTH_NAME VARCHAR2(50), '
|| 'CALENDAR_QUARTER INT, '
|| 'CALENDAR_YEAR INT '
|| ')';
end if;

select to_date('01-JAN-' || p_begin_year, 'DD-MON-YYYY') into d_first_day from dual;
select to_date('31-DEC-' || p_end_year, 'DD-MON-YYYY') into d_last_day from dual;
d_loop_day := d_first_day;
i_counter := 1;

while d_loop_day <= d_last_day
loop  
EXECUTE IMMEDIATE 'insert into dim_day (key_day, calendar_day, calendar_month_number, month_name, calendar_quarter, calendar_year) '
|| 'select ' || i_counter || ', '
|| '''' || to_char(d_loop_day, 'DD-MON-YYYY') || ''', '
|| 'extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YYYY'')),'
|| 'to_char(to_date(''' || d_loop_day || ''', ''DD-MON-YY''), ''MONTH''), '
|| 'case when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 1 and 3 then 1  '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 4 and 6 then 2 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 7 and 9 then 3 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 10 and 12 then 4 '
|| 'else -1 '
|| 'end, '
|| 'extract(year from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) '
|| 'from dual ';

    i_counter := i_counter + 1;
    d_loop_day := trunc(d_loop_day) + 1;
end loop;

commit;

if p_include_fiscal_year = 'YES' then
    if p_create_tables = 'YES' then
      EXECUTE IMMEDIATE 'alter table dim_day '
      || 'add '
      || '( '
        || 'FISCAL_MONTH_NUMBER INT, '
        || 'FISCAL_QUARTER INT, '
        || 'FISCAL_YEAR INT '
      || ') ';
    end if;

EXECUTE IMMEDIATE 'update dim_day o '
|| 'set (o.fiscal_month_number, o.fiscal_year) = '
|| '( '
|| 'select case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| '(i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'else '
|| '12 + (i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'end, '
|| 'case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| 'i.calendar_year '
|| 'else '
|| 'i.calendar_year + 1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';

EXECUTE IMMEDIATE 'update dim_day o '
|| 'set o.fiscal_quarter = '
|| '( '
|| 'select case when i.fiscal_month_number in (1,2,3) then 1 '
|| 'when i.fiscal_month_number in (4,5,6) then 2 '
|| 'when i.fiscal_month_number in (7,8,9) then 3 '
|| 'when i.fiscal_month_number in (10,11,12) then 4 '
|| 'else -1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';

commit;
end if;

if upper(p_create_tables) = 'YES' then
if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FISCAL_MONTH_NUMBER INT, '
|| 'FISCAL_QUARTER_NUMBER INT, '
|| 'FISCAL_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
else
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;
end if;

if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_month_number,  '
|| 'month_name,  '
|| 'calendar_quarter,  '
|| 'calendar_year,  '
|| 'fiscal_month_number,  '
|| 'fiscal_quarter,  '
|| 'fiscal_year,  '
|| 'trunc(calendar_day,''month''), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
      || 'order by calendar_year,  '
 || 'calendar_month_number '
|| ') a ';

commit;
else
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
 || 'select distinct calendar_month_number, '
|| 'month_name, '
|| 'calendar_quarter,  '
|| 'calendar_year,  '
|| 'first_day(calendar_day), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
|| ') '
|| 'order by a.calendar_year, '
|| 'a.calendar_month_number ';

commit;
end if;

if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_CALENDAR_YEAR '
|| '( '
|| 'KEY_CALENDAR_YEAR NUMBER, '
|| 'CALENDAR_YEAR NUMBER, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;

  EXECUTE IMMEDIATE 'insert into dim_calendar_year '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_year, '
|| '''01-JAN-'' || calendar_year || '''', '
|| '''31-DEC-'' || calendar_year || '''' '
|| 'from dim_month '
    || 'order by calendar_year '
|| ') a ';

 commit;
end;

Image courtesy of Anusorn P nachol / FreeDigitalPhotos.net

Friday, April 11, 2014

Slowly Changing Dimensions - Correction

About two years ago, I wrote this post on slowly changing dimensions.  Several months later I learned in a Kimball class that my understanding of a Type 3 Slowly Changing Dimension (SCD) was incorrect.  Allow me to use this post to right the ship.

My original explanation (incorrect):

A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD.  In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD.  Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD.  Each row also contains a column with the current value.  So, continuing with the example above (from the other post), while you will be able to see that the older fact rows pertain to a time in which the employee was a Junior Analyst, you will also be able to see (while looking at this same historical dataset) that the employee is now a Senior Analyst.

My updated explanation (correct):

A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD.  In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD.  Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD.  Each row also contains a column with the prior value.  So, continuing with the example above, while you will be able to see that the newer fact rows describe this employee as a Senior Analyst, you will be able to see that his/her prior position was Junior Analyst.

So, a type 3 SCD will look something like this:

key_employee
employee_first_name
employee_last_name
position
prior_position
1 John Smith Junior Analyst (null)
2 John Smith Senior Analyst Junior Analyst

My original explanation above describes a type 6 slowly changing dimension, which will look something like this:

key_employee
employee_first_name
employee_last_name
position
current_position
1 John Smith Junior Analyst Senior Analyst
2 John Smith Senior Analyst Senior Analyst

These links to Ralph Kimball's website provide some great information on slowly changing dimensions:

1.) Type 1
2.) Types 2 and 3
3.) Types 0, 4, 5, 6, and 7

Image courtesy of adamr / FreeDigitalPhotos.net

Friday, April 4, 2014

Factless Fact Tables

Over the past three weeks, we've looked at three ways to design fact tables.

1.) Transaction Fact Table
2.) Period Snapshot Fact Table
3.) Accumulating Snapshot Fact Table

As the term suggests, all of these tables contain facts (measures) that pertain to the business process.  There are instances in which a fact table can be designed to contain no measures but only events.  This is referred to as a factless fact table.  Consider a user who wishes to see the org chart on a particular day.  In this case, a factless fact table may be designed to contain one row per day (if that is the grain) and the keys to the Human Resources descriptors that existed on that day.  Every column in this table will contain a foreign key to a dimension table or a degenerate dimension value.  Consider this plain-english version of this table:

Date
Name
Department
Supervisor
Job Title
1/15/2013 Jason Smith Human Resources Allen Wells HR Associate I
5/31/2013 Jason Smith Human Resources Allen Wells HR Benefits Coordinator
3/15/2014 Jason Smith Information Technology Mike Williams HR Systems Analyst

This tells us that Jason Smith was hired as an HR Associate I on 1/15/2013.  He was promoted to a Benefits Coordinator on 5/31/2013 and then moved to Information Technology on 3/15/2014.  FYI, although the table above only shows three rows there will probably be several more.  This table will probably contain one row per day with the same data until it changes, indicated by the ellipses in the table above.

In the future we will look at writing some ETL to populate a factless fact table.

Image courtesy of watcharakun / FreeDigitalPhotos.net

Friday, March 28, 2014

Accumulating Snapshot Fact Tables

We've been looking at the various fact table design options in data warehousing.  Two weeks ago we examined the transaction fact table and last week we examined the periodic snapshot fact table.  The third and final option is called an accumulating snapshot.  This type of fact table is different from the other two in one big way.  Each row is often revisited.  Consider our banking example from the other posts.  When a deposit is added to a transaction fact table, that row is added and then left alone.  All of the data needed to add and complete that row is known.  The same is true of periodic snapshots.  An accumulating snapshot fact table begins each row and then accumulates data until that row is complete.  Let's consider an example in a different context.  Suppose that a star is built for the purposes of analyzing a help desk ticketing business process.  The following descriptors (in addition to any measures) are a part of the business process:

Date Ticket Opened
Date Ticket Assigned
Date Solution Provided To Customer
Date Customer Accepted Solution
Date Ticket Closed
Ticket Number

Suppose that ticket number 10012 is opened on 3/1/14.  At that point, this row will exist in the fact table:


Date Ticket Opened

Date Ticket Assigned

Date Solution Provided To Customer

Date Customer Accepted Solution

Date Ticket Closed

Ticket Number
3/1/2014 10012

Now, suppose that the ticket is assigned to a technician on 3/3/14.  That same row will be updated to look like this:


Date Ticket Opened

Date Ticket Assigned

Date Solution Provided To Customer

Date Customer Accepted Solution

Date Ticket Closed

Ticket Number
3/1/2014 3/3/2014 10012

If the solution is provided to the customer one day later, the row will be updated to look like this:


Date Ticket Opened

Date Ticket Assigned

Date Solution Provided To Customer

Date Customer Accepted Solution

Date Ticket Closed

Ticket Number
3/1/2014 3/3/2014 3/4/2014 10012

Assuming that the customer accepts the solution on 3/5 and the ticket is closed 3/6, the row will be updated to look like this:


Date Ticket Opened

Date Ticket Assigned

Date Solution Provided To Customer

Date Customer Accepted Solution

Date Ticket Closed

Ticket Number
3/1/2014 3/3/2014 3/4/2014 3/5/2014 10012

and then this:


Date Ticket Opened

Date Ticket Assigned

Date Solution Provided To Customer

Date Customer Accepted Solution

Date Ticket Closed

Ticket Number
3/1/2014 3/3/2014 3/4/2014 3/5/2014 3/6/2014 10012

At this point, the row is left alone.  If you so desire, an accumulating snapshot allows you to store some lags between dates to help with analysis.  This can help ease the burden of using the database to calculate the various lags.

In the future we will look at writing some ETL to populate an accumulating snapshot fact table.

Image courtesy of Vichaya Kiatying-Angsulee / FreeDigitalPhotos.net

Friday, March 21, 2014

Periodic Snapshot Fact Tables

In our last post we looked at one of three ways to design a fact table, called a transaction fact table.  Today, let's look at a second design, called the periodic snapshot.  Remember from Ralph Kimball's teaching (and last week's post) that a transaction fact table gains a row each time that something happens.  Using our banking example, from last week, each deposit or withdrawal will result in a record being inserted.  Looking at only one record will allow us to see that one event.  Adding these records will allow us to see the balance.

A periodic snapshot contains snapshots of the data as it existed at various points in time.  Unlike a transaction fact table, selecting one row (or perhaps a subset of rows if it is semi-additive) will display the current value at that point.  Our example from last week involved opening a checking account at Acme Bank on 2/1/14 and making an initial deposit of $3,000.  Three days later you withdrew $200.  Five days after that, you deposited $1,000.  If a periodic snapshot were written to show the balance at a daily level, a plain-english version may look something like this:

Date
Amount
2/1/2014 $3,000
2/4/2014 $2,800
2/9/2014 $3,800

Now, let's use this table to answer the same questions that we examined last week:

1.) What was the account balance on 2/4?
Unlike the transaction fact table, in order to find the balance on 2/4, we only need to look at the 2/4 row.  The 2/4 row contains a snapshot of the current balance on 2/4, as opposed to only the event that occured on 2/4.  By looking at the 2/4 row, we will see that the balance was $2,800.

2.) What was the account balance on 2/9?
Using the same logic that was explained in #1 above, look only at the 2/9 rows.  This will give you a value of $3,800.

3.) How much was deposited on 2/9?
This type of question cannot be answered using a periodic snapshot.  A periodic snapshot will store the current state of the business process as of the applicable period, but will not store the events leading to the current state.

In the future we will look at writing some ETL to populate a periodic snapshot fact table.  We will also take a look at an additional fact table design.

Image courtesy of cooldesign / FreeDigitalPhotos.net

Friday, March 14, 2014

Transaction Fact Tables

In his book The Data Warehouse Toolkit, Ralph Kimball explains that there are three ways to design a fact table.  The first and probably most typical (my opinion) is called a transaction fact table.  A transaction fact table is a fact table that contains measures, keys to dimension tables, and degenerate dimensions, if applicable.  When using this fact table to examine the current state of something going through the business process, all rows should be summed through the latest time period that is being examined.  This is due to the fact that a row is added to this fact table as an event in its respective business process occurs.  The most common example involves the banking industry.  Suppose you open a checking account at Acme Bank on 2/1/14 and make an initial deposit of $3,000.  Three days later you withdraw $200.  Five days after that, you deposit $1,000.  A "plain-english version" of this fact table (without the descriptors) will look something like this:

Date
Amount
2/1/2014  $3,000
2/4/2014  ($200)
2/9/2014  $1,000

Now, let's use this table to answer these very simple questions:

1.) What was the account balance on 2/4?
Notice how a row was added each time that a deposit or withdrawal was made.  In order to find the balance on 2/4, we must look at everything that happened through 2/4.  If we only look at the 2/4 row, we will only see the $200 withdrawal.  However, it is important to know that $3,000 existed in the account before that withdrawal.  So, if we sum the 2/1 row and the 2/4 row, we will see that the balance on 2/4 (after that transaction posted) was $2,800.

2.) What was the account balance on 2/9?
Using the same logic that was explained in #1 above, sum the 2/1, the 2/4, and the 2/9 rows.  This will give you a value of $3,800.

3.) How much was deposited on 2/9?
When looking at only one of the events that occured as opposed to the sum of everything that occurred, only that one row should be taken into consideration.  The 2/9 row by itself will tell us that $1,000 was deposited into the account.  Unlike #2 above, we do not need to consider the $2,800 that was in the account prior to 2/9.

In the future we will look at writing some ETL to populate a transaction fact table.  We will also take a look at some additional fact table designs.

Image courtesy of twobee / FreeDigitalPhotos.net