Friday, April 25, 2014

Data Helping To Fight Crime

In her role as the District Attorney of New Jersey, Anne Milgram found the crime fighting arena to be horribly inefficient.  For example, she recalls a time when an individual was arrested and held on a bail amount of $3,500 dollars and was unable to pay that amount.  As a result, he stayed in jail until his case was heard eight months later, which cost the public over $9,000 dollars.  Believing that data could help, Ms. Milgram developed a tool to help correct this, making crime fighting a data driven activity.  This is a great example of somebody who is taking data and turning it into something that is truly valuable.  She describes this experience in this TED talk.

Friday, April 18, 2014

Creating Date Dimensions

One of the tasks of an architect who is implementing a data warehouse is to create the date dimensions.  Because these dimensions have no source system, they are often created in something like Excel and then inserted directly into the dimension tables themselves.  Searching the internet will probably reveal some code that will automatically create these dimensions.  I've created one such PL/SQL block here that will create three date dimensions, a daily, monthly, and yearly grain.  Because this will probably only be run once, there is no error handling nor is there a way to add to the date dimension (it will only create from scratch).  Feel free and add those features if you so desire.  Here is an explanation of the parameters:

1.) p_begin_year - This is the first year of the range of dates that you wish to create
2.) p_end_year - This is the last year of the range of dates that you wish to create
3.) p_create_tables - This is intended to have a value of YES or NO.  Set it to YES if you want the tables to be created and populated.  Set it to no if you only want the tables to be populated (e.g., they have already been created).
4.) p_include_fiscal_year - If you would like some data related to your fiscal year in the dim_day and dim_month tables, then set this value to YES.  Otherwise, set it to NO.
5.) p_fiscal_year_begin_month_no - If the answer to #4 is YES, then set this to the calendar month number that is the first month in your fiscal year.  In other words, if your fiscal year begins in July, set this to 7.  If it begins in October, set it to 10, etc.

If nothing else, this will be a starting point for you with regards to creating a date dimension.  Examine the code, tweak it, etc.

Note: This has been written in PL/SQL, meaning that it will only run in an Oracle database.  If you would like to do this with a system from another database vendor, similar scripts can be written in that vendor's native language.

declare
p_begin_year int := 2001;
p_end_year int := 2014;
p_create_tables varchar2(3) := 'YES';
p_include_fiscal_year varchar2(3) := 'YES';
p_fiscal_year_begin_month_no int := 7;

d_first_day date;
d_last_day date;
d_loop_day date;

i_counter int;

begin
if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_DAY '
|| '( '
|| 'KEY_DAY INT, '
|| 'CALENDAR_DAY DATE, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'MONTH_NAME VARCHAR2(50), '
|| 'CALENDAR_QUARTER INT, '
|| 'CALENDAR_YEAR INT '
|| ')';
end if;

select to_date('01-JAN-' || p_begin_year, 'DD-MON-YYYY') into d_first_day from dual;
select to_date('31-DEC-' || p_end_year, 'DD-MON-YYYY') into d_last_day from dual;
d_loop_day := d_first_day;
i_counter := 1;

while d_loop_day <= d_last_day
loop  
EXECUTE IMMEDIATE 'insert into dim_day (key_day, calendar_day, calendar_month_number, month_name, calendar_quarter, calendar_year) '
|| 'select ' || i_counter || ', '
|| '''' || to_char(d_loop_day, 'DD-MON-YYYY') || ''', '
|| 'extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YYYY'')),'
|| 'to_char(to_date(''' || d_loop_day || ''', ''DD-MON-YY''), ''MONTH''), '
|| 'case when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 1 and 3 then 1  '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 4 and 6 then 2 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 7 and 9 then 3 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 10 and 12 then 4 '
|| 'else -1 '
|| 'end, '
|| 'extract(year from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) '
|| 'from dual ';

    i_counter := i_counter + 1;
    d_loop_day := trunc(d_loop_day) + 1;
end loop;

commit;

if p_include_fiscal_year = 'YES' then
    if p_create_tables = 'YES' then
      EXECUTE IMMEDIATE 'alter table dim_day '
      || 'add '
      || '( '
        || 'FISCAL_MONTH_NUMBER INT, '
        || 'FISCAL_QUARTER INT, '
        || 'FISCAL_YEAR INT '
      || ') ';
    end if;

EXECUTE IMMEDIATE 'update dim_day o '
|| 'set (o.fiscal_month_number, o.fiscal_year) = '
|| '( '
|| 'select case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| '(i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'else '
|| '12 + (i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'end, '
|| 'case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| 'i.calendar_year '
|| 'else '
|| 'i.calendar_year + 1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';

EXECUTE IMMEDIATE 'update dim_day o '
|| 'set o.fiscal_quarter = '
|| '( '
|| 'select case when i.fiscal_month_number in (1,2,3) then 1 '
|| 'when i.fiscal_month_number in (4,5,6) then 2 '
|| 'when i.fiscal_month_number in (7,8,9) then 3 '
|| 'when i.fiscal_month_number in (10,11,12) then 4 '
|| 'else -1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';

commit;
end if;

if upper(p_create_tables) = 'YES' then
if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FISCAL_MONTH_NUMBER INT, '
|| 'FISCAL_QUARTER_NUMBER INT, '
|| 'FISCAL_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
else
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;
end if;

if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_month_number,  '
|| 'month_name,  '
|| 'calendar_quarter,  '
|| 'calendar_year,  '
|| 'fiscal_month_number,  '
|| 'fiscal_quarter,  '
|| 'fiscal_year,  '
|| 'trunc(calendar_day,''month''), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
      || 'order by calendar_year,  '
 || 'calendar_month_number '
|| ') a ';

commit;
else
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
 || 'select distinct calendar_month_number, '
|| 'month_name, '
|| 'calendar_quarter,  '
|| 'calendar_year,  '
|| 'first_day(calendar_day), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
|| ') '
|| 'order by a.calendar_year, '
|| 'a.calendar_month_number ';

commit;
end if;

if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_CALENDAR_YEAR '
|| '( '
|| 'KEY_CALENDAR_YEAR NUMBER, '
|| 'CALENDAR_YEAR NUMBER, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;

  EXECUTE IMMEDIATE 'insert into dim_calendar_year '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_year, '
|| '''01-JAN-'' || calendar_year || '''', '
|| '''31-DEC-'' || calendar_year || '''' '
|| 'from dim_month '
    || 'order by calendar_year '
|| ') a ';

 commit;
end;

Image courtesy of Anusorn P nachol / FreeDigitalPhotos.net

Friday, April 11, 2014

Slowly Changing Dimensions - Correction

About two years ago, I wrote this post on slowly changing dimensions.  Several months later I learned in a Kimball class that my understanding of a Type 3 Slowly Changing Dimension (SCD) was incorrect.  Allow me to use this post to right the ship.

My original explanation (incorrect):

A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD.  In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD.  Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD.  Each row also contains a column with the current value.  So, continuing with the example above (from the other post), while you will be able to see that the older fact rows pertain to a time in which the employee was a Junior Analyst, you will also be able to see (while looking at this same historical dataset) that the employee is now a Senior Analyst.

My updated explanation (correct):

A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD.  In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD.  Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD.  Each row also contains a column with the prior value.  So, continuing with the example above, while you will be able to see that the newer fact rows describe this employee as a Senior Analyst, you will be able to see that his/her prior position was Junior Analyst.

So, a type 3 SCD will look something like this:

key_employee
employee_first_name
employee_last_name
position
prior_position
1 John Smith Junior Analyst (null)
2 John Smith Senior Analyst Junior Analyst

My original explanation above describes a type 6 slowly changing dimension, which will look something like this:

key_employee
employee_first_name
employee_last_name
position
current_position
1 John Smith Junior Analyst Senior Analyst
2 John Smith Senior Analyst Senior Analyst

These links to Ralph Kimball's website provide some great information on slowly changing dimensions:

1.) Type 1
2.) Types 2 and 3
3.) Types 0, 4, 5, 6, and 7

Image courtesy of adamr / FreeDigitalPhotos.net

Friday, April 4, 2014

Factless Fact Tables

Over the past three weeks, we've looked at three ways to design fact tables.

1.) Transaction Fact Table
2.) Period Snapshot Fact Table
3.) Accumulating Snapshot Fact Table

As the term suggests, all of these tables contain facts (measures) that pertain to the business process.  There are instances in which a fact table can be designed to contain no measures but only events.  This is referred to as a factless fact table.  Consider a user who wishes to see the org chart on a particular day.  In this case, a factless fact table may be designed to contain one row per day (if that is the grain) and the keys to the Human Resources descriptors that existed on that day.  Every column in this table will contain a foreign key to a dimension table or a degenerate dimension value.  Consider this plain-english version of this table:

Date
Name
Department
Supervisor
Job Title
1/15/2013 Jason Smith Human Resources Allen Wells HR Associate I
5/31/2013 Jason Smith Human Resources Allen Wells HR Benefits Coordinator
3/15/2014 Jason Smith Information Technology Mike Williams HR Systems Analyst

This tells us that Jason Smith was hired as an HR Associate I on 1/15/2013.  He was promoted to a Benefits Coordinator on 5/31/2013 and then moved to Information Technology on 3/15/2014.  FYI, although the table above only shows three rows there will probably be several more.  This table will probably contain one row per day with the same data until it changes, indicated by the ellipses in the table above.

In the future we will look at writing some ETL to populate a factless fact table.

Image courtesy of watcharakun / FreeDigitalPhotos.net