Friday, April 25, 2014
Data Helping To Fight Crime
In her role as the District Attorney of New Jersey, Anne Milgram found the crime fighting arena to be horribly inefficient. For example, she recalls a time when an individual was arrested and held on a bail amount of $3,500 dollars and was unable to pay that amount. As a result, he stayed in jail until his case was heard eight months later, which cost the public over $9,000 dollars. Believing that data could help, Ms. Milgram developed a tool to help correct this, making crime fighting a data driven activity. This is a great example of somebody who is taking data and turning it into something that is truly valuable. She describes this experience in this TED talk.
Friday, April 18, 2014
Creating Date Dimensions
One of the tasks of an architect who is implementing a data warehouse is to create the date dimensions. Because these dimensions have no source system, they are often created in something like Excel and then inserted directly into the dimension tables themselves. Searching the internet will probably reveal some code that will automatically create these dimensions. I've created one such PL/SQL block here that will create three date dimensions, a daily, monthly, and yearly grain. Because this will probably only be run once, there is no error handling nor is there a way to add to the date dimension (it will only create from scratch). Feel free and add those features if you so desire. Here is an explanation of the parameters:
1.) p_begin_year - This is the first year of the range of dates that you wish to create
2.) p_end_year - This is the last year of the range of dates that you wish to create
3.) p_create_tables - This is intended to have a value of YES or NO. Set it to YES if you want the tables to be created and populated. Set it to no if you only want the tables to be populated (e.g., they have already been created).
4.) p_include_fiscal_year - If you would like some data related to your fiscal year in the dim_day and dim_month tables, then set this value to YES. Otherwise, set it to NO.
5.) p_fiscal_year_begin_month_no - If the answer to #4 is YES, then set this to the calendar month number that is the first month in your fiscal year. In other words, if your fiscal year begins in July, set this to 7. If it begins in October, set it to 10, etc.
If nothing else, this will be a starting point for you with regards to creating a date dimension. Examine the code, tweak it, etc.
Note: This has been written in PL/SQL, meaning that it will only run in an Oracle database. If you would like to do this with a system from another database vendor, similar scripts can be written in that vendor's native language.
declare
p_begin_year int := 2001;
p_end_year int := 2014;
p_create_tables varchar2(3) := 'YES';
p_include_fiscal_year varchar2(3) := 'YES';
p_fiscal_year_begin_month_no int := 7;
d_first_day date;
d_last_day date;
d_loop_day date;
i_counter int;
begin
if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_DAY '
|| '( '
|| 'KEY_DAY INT, '
|| 'CALENDAR_DAY DATE, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'MONTH_NAME VARCHAR2(50), '
|| 'CALENDAR_QUARTER INT, '
|| 'CALENDAR_YEAR INT '
|| ')';
end if;
select to_date('01-JAN-' || p_begin_year, 'DD-MON-YYYY') into d_first_day from dual;
select to_date('31-DEC-' || p_end_year, 'DD-MON-YYYY') into d_last_day from dual;
d_loop_day := d_first_day;
i_counter := 1;
while d_loop_day <= d_last_day
loop
EXECUTE IMMEDIATE 'insert into dim_day (key_day, calendar_day, calendar_month_number, month_name, calendar_quarter, calendar_year) '
|| 'select ' || i_counter || ', '
|| '''' || to_char(d_loop_day, 'DD-MON-YYYY') || ''', '
|| 'extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YYYY'')),'
|| 'to_char(to_date(''' || d_loop_day || ''', ''DD-MON-YY''), ''MONTH''), '
|| 'case when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 1 and 3 then 1 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 4 and 6 then 2 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 7 and 9 then 3 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 10 and 12 then 4 '
|| 'else -1 '
|| 'end, '
|| 'extract(year from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) '
|| 'from dual ';
i_counter := i_counter + 1;
d_loop_day := trunc(d_loop_day) + 1;
end loop;
commit;
if p_include_fiscal_year = 'YES' then
if p_create_tables = 'YES' then
EXECUTE IMMEDIATE 'alter table dim_day '
|| 'add '
|| '( '
|| 'FISCAL_MONTH_NUMBER INT, '
|| 'FISCAL_QUARTER INT, '
|| 'FISCAL_YEAR INT '
|| ') ';
end if;
EXECUTE IMMEDIATE 'update dim_day o '
|| 'set (o.fiscal_month_number, o.fiscal_year) = '
|| '( '
|| 'select case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| '(i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'else '
|| '12 + (i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'end, '
|| 'case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| 'i.calendar_year '
|| 'else '
|| 'i.calendar_year + 1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';
EXECUTE IMMEDIATE 'update dim_day o '
|| 'set o.fiscal_quarter = '
|| '( '
|| 'select case when i.fiscal_month_number in (1,2,3) then 1 '
|| 'when i.fiscal_month_number in (4,5,6) then 2 '
|| 'when i.fiscal_month_number in (7,8,9) then 3 '
|| 'when i.fiscal_month_number in (10,11,12) then 4 '
|| 'else -1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';
commit;
end if;
if upper(p_create_tables) = 'YES' then
if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FISCAL_MONTH_NUMBER INT, '
|| 'FISCAL_QUARTER_NUMBER INT, '
|| 'FISCAL_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
else
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;
end if;
if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_month_number, '
|| 'month_name, '
|| 'calendar_quarter, '
|| 'calendar_year, '
|| 'fiscal_month_number, '
|| 'fiscal_quarter, '
|| 'fiscal_year, '
|| 'trunc(calendar_day,''month''), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
|| 'order by calendar_year, '
|| 'calendar_month_number '
|| ') a ';
commit;
else
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_month_number, '
|| 'month_name, '
|| 'calendar_quarter, '
|| 'calendar_year, '
|| 'first_day(calendar_day), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
|| ') '
|| 'order by a.calendar_year, '
|| 'a.calendar_month_number ';
commit;
end if;
if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_CALENDAR_YEAR '
|| '( '
|| 'KEY_CALENDAR_YEAR NUMBER, '
|| 'CALENDAR_YEAR NUMBER, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;
EXECUTE IMMEDIATE 'insert into dim_calendar_year '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_year, '
|| '''01-JAN-'' || calendar_year || '''', '
|| '''31-DEC-'' || calendar_year || '''' '
|| 'from dim_month '
|| 'order by calendar_year '
|| ') a ';
commit;
end;
Image courtesy of Anusorn P nachol / FreeDigitalPhotos.net
1.) p_begin_year - This is the first year of the range of dates that you wish to create
2.) p_end_year - This is the last year of the range of dates that you wish to create
3.) p_create_tables - This is intended to have a value of YES or NO. Set it to YES if you want the tables to be created and populated. Set it to no if you only want the tables to be populated (e.g., they have already been created).
4.) p_include_fiscal_year - If you would like some data related to your fiscal year in the dim_day and dim_month tables, then set this value to YES. Otherwise, set it to NO.
5.) p_fiscal_year_begin_month_no - If the answer to #4 is YES, then set this to the calendar month number that is the first month in your fiscal year. In other words, if your fiscal year begins in July, set this to 7. If it begins in October, set it to 10, etc.
If nothing else, this will be a starting point for you with regards to creating a date dimension. Examine the code, tweak it, etc.
Note: This has been written in PL/SQL, meaning that it will only run in an Oracle database. If you would like to do this with a system from another database vendor, similar scripts can be written in that vendor's native language.
declare
p_begin_year int := 2001;
p_end_year int := 2014;
p_create_tables varchar2(3) := 'YES';
p_include_fiscal_year varchar2(3) := 'YES';
p_fiscal_year_begin_month_no int := 7;
d_first_day date;
d_last_day date;
d_loop_day date;
i_counter int;
begin
if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_DAY '
|| '( '
|| 'KEY_DAY INT, '
|| 'CALENDAR_DAY DATE, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'MONTH_NAME VARCHAR2(50), '
|| 'CALENDAR_QUARTER INT, '
|| 'CALENDAR_YEAR INT '
|| ')';
end if;
select to_date('01-JAN-' || p_begin_year, 'DD-MON-YYYY') into d_first_day from dual;
select to_date('31-DEC-' || p_end_year, 'DD-MON-YYYY') into d_last_day from dual;
d_loop_day := d_first_day;
i_counter := 1;
while d_loop_day <= d_last_day
loop
EXECUTE IMMEDIATE 'insert into dim_day (key_day, calendar_day, calendar_month_number, month_name, calendar_quarter, calendar_year) '
|| 'select ' || i_counter || ', '
|| '''' || to_char(d_loop_day, 'DD-MON-YYYY') || ''', '
|| 'extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YYYY'')),'
|| 'to_char(to_date(''' || d_loop_day || ''', ''DD-MON-YY''), ''MONTH''), '
|| 'case when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 1 and 3 then 1 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 4 and 6 then 2 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 7 and 9 then 3 '
|| 'when extract(month from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) between 10 and 12 then 4 '
|| 'else -1 '
|| 'end, '
|| 'extract(year from to_date(''' || d_loop_day || ''', ''DD-MON-YY'')) '
|| 'from dual ';
i_counter := i_counter + 1;
d_loop_day := trunc(d_loop_day) + 1;
end loop;
commit;
if p_include_fiscal_year = 'YES' then
if p_create_tables = 'YES' then
EXECUTE IMMEDIATE 'alter table dim_day '
|| 'add '
|| '( '
|| 'FISCAL_MONTH_NUMBER INT, '
|| 'FISCAL_QUARTER INT, '
|| 'FISCAL_YEAR INT '
|| ') ';
end if;
EXECUTE IMMEDIATE 'update dim_day o '
|| 'set (o.fiscal_month_number, o.fiscal_year) = '
|| '( '
|| 'select case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| '(i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'else '
|| '12 + (i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ') + 1 '
|| 'end, '
|| 'case when i.calendar_month_number - ' || p_fiscal_year_begin_month_no || ' >= 0 then '
|| 'i.calendar_year '
|| 'else '
|| 'i.calendar_year + 1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';
EXECUTE IMMEDIATE 'update dim_day o '
|| 'set o.fiscal_quarter = '
|| '( '
|| 'select case when i.fiscal_month_number in (1,2,3) then 1 '
|| 'when i.fiscal_month_number in (4,5,6) then 2 '
|| 'when i.fiscal_month_number in (7,8,9) then 3 '
|| 'when i.fiscal_month_number in (10,11,12) then 4 '
|| 'else -1 '
|| 'end '
|| 'from dim_day i '
|| 'where i.calendar_day = o.calendar_day '
|| ') ';
commit;
end if;
if upper(p_create_tables) = 'YES' then
if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FISCAL_MONTH_NUMBER INT, '
|| 'FISCAL_QUARTER_NUMBER INT, '
|| 'FISCAL_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
else
EXECUTE IMMEDIATE 'CREATE TABLE DIM_MONTH '
|| '( '
|| 'KEY_MONTH INT, '
|| 'CALENDAR_MONTH_NUMBER INT, '
|| 'CALENDAR_MONTH_NAME VARCHAR2(20), '
|| 'CALENDAR_QUARTER_NUMBER INT, '
|| 'CALENDAR_YEAR INT, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;
end if;
if p_include_fiscal_year = 'YES' then
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_month_number, '
|| 'month_name, '
|| 'calendar_quarter, '
|| 'calendar_year, '
|| 'fiscal_month_number, '
|| 'fiscal_quarter, '
|| 'fiscal_year, '
|| 'trunc(calendar_day,''month''), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
|| 'order by calendar_year, '
|| 'calendar_month_number '
|| ') a ';
commit;
else
EXECUTE IMMEDIATE 'insert into dim_month '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_month_number, '
|| 'month_name, '
|| 'calendar_quarter, '
|| 'calendar_year, '
|| 'first_day(calendar_day), '
|| 'last_day(calendar_day) '
|| 'from dim_day '
|| ') '
|| 'order by a.calendar_year, '
|| 'a.calendar_month_number ';
commit;
end if;
if upper(p_create_tables) = 'YES' then
EXECUTE IMMEDIATE 'CREATE TABLE DIM_CALENDAR_YEAR '
|| '( '
|| 'KEY_CALENDAR_YEAR NUMBER, '
|| 'CALENDAR_YEAR NUMBER, '
|| 'FIRST_DAY DATE, '
|| 'LAST_DAY DATE '
|| ') ';
end if;
EXECUTE IMMEDIATE 'insert into dim_calendar_year '
|| 'select rownum, a.* '
|| 'from '
|| '( '
|| 'select distinct calendar_year, '
|| '''01-JAN-'' || calendar_year || '''', '
|| '''31-DEC-'' || calendar_year || '''' '
|| 'from dim_month '
|| 'order by calendar_year '
|| ') a ';
commit;
end;
Image courtesy of Anusorn P nachol / FreeDigitalPhotos.net
Friday, April 11, 2014
Slowly Changing Dimensions - Correction
About two years ago, I wrote this post on slowly changing dimensions. Several months later I learned in a Kimball class that my understanding of a Type 3 Slowly Changing Dimension (SCD) was incorrect. Allow me to use this post to right the ship.
My original explanation (incorrect):
A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD. In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD. Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD. Each row also contains a column with the current value. So, continuing with the example above (from the other post), while you will be able to see that the older fact rows pertain to a time in which the employee was a Junior Analyst, you will also be able to see (while looking at this same historical dataset) that the employee is now a Senior Analyst.
My updated explanation (correct):
A type 3 SCD is essentially a hybrid of a type 1 and type 2 SCD. In this scenario as history changes, rows are added into the dimension table, consistent with a type 2 SCD. Each row contains a column with the value that pertains to that moment in time, as does a type 2 SCD. Each row also contains a column with the prior value. So, continuing with the example above, while you will be able to see that the newer fact rows describe this employee as a Senior Analyst, you will be able to see that his/her prior position was Junior Analyst.
So, a type 3 SCD will look something like this:
key_employee
|
employee_first_name
|
employee_last_name
|
position
|
prior_position
|
1 | John | Smith | Junior Analyst | (null) |
2 | John | Smith | Senior Analyst | Junior Analyst |
My original explanation above describes a type 6 slowly changing dimension, which will look something like this:
key_employee
|
employee_first_name
|
employee_last_name
|
position
|
current_position
|
1 | John | Smith | Junior Analyst | Senior Analyst |
2 | John | Smith | Senior Analyst | Senior Analyst |
These links to Ralph Kimball's website provide some great information on slowly changing dimensions:
1.) Type 1
2.) Types 2 and 3
3.) Types 0, 4, 5, 6, and 7
Image courtesy of adamr / FreeDigitalPhotos.net
Friday, April 4, 2014
Factless Fact Tables
Over the past three weeks, we've looked at three ways to design fact tables.
1.) Transaction Fact Table
2.) Period Snapshot Fact Table
3.) Accumulating Snapshot Fact Table
As the term suggests, all of these tables contain facts (measures) that pertain to the business process. There are instances in which a fact table can be designed to contain no measures but only events. This is referred to as a factless fact table. Consider a user who wishes to see the org chart on a particular day. In this case, a factless fact table may be designed to contain one row per day (if that is the grain) and the keys to the Human Resources descriptors that existed on that day. Every column in this table will contain a foreign key to a dimension table or a degenerate dimension value. Consider this plain-english version of this table:
Date
|
Name
|
Department
|
Supervisor
|
Job Title
|
1/15/2013 | Jason Smith | Human Resources | Allen Wells | HR Associate I |
… | ||||
5/31/2013 | Jason Smith | Human Resources | Allen Wells | HR Benefits Coordinator |
… | ||||
3/15/2014 | Jason Smith | Information Technology | Mike Williams | HR Systems Analyst |
This tells us that Jason Smith was hired as an HR Associate I on 1/15/2013. He was promoted to a Benefits Coordinator on 5/31/2013 and then moved to Information Technology on 3/15/2014. FYI, although the table above only shows three rows there will probably be several more. This table will probably contain one row per day with the same data until it changes, indicated by the ellipses in the table above.
In the future we will look at writing some ETL to populate a factless fact table.
Friday, March 28, 2014
Accumulating Snapshot Fact Tables
We've been looking at the various fact table design options in data warehousing. Two weeks ago we examined the transaction fact table and last week we examined the periodic snapshot fact table. The third and final option is called an accumulating snapshot. This type of fact table is different from the other two in one big way. Each row is often revisited. Consider our banking example from the other posts. When a deposit is added to a transaction fact table, that row is added and then left alone. All of the data needed to add and complete that row is known. The same is true of periodic snapshots. An accumulating snapshot fact table begins each row and then accumulates data until that row is complete. Let's consider an example in a different context. Suppose that a star is built for the purposes of analyzing a help desk ticketing business process. The following descriptors (in addition to any measures) are a part of the business process:
Date Ticket Opened
Date Ticket Assigned
Date Solution Provided To Customer
Date Customer Accepted Solution
Date Ticket Closed
Ticket Number
Suppose that ticket number 10012 is opened on 3/1/14. At that point, this row will exist in the fact table:
3/1/2014 | 10012 |
Now, suppose that the ticket is assigned to a technician on 3/3/14. That same row will be updated to look like this:
3/1/2014 | 3/3/2014 | 10012 |
If the solution is provided to the customer one day later, the row will be updated to look like this:
3/1/2014 | 3/3/2014 | 3/4/2014 | 10012 |
Assuming that the customer accepts the solution on 3/5 and the ticket is closed 3/6, the row will be updated to look like this:
3/1/2014 | 3/3/2014 | 3/4/2014 | 3/5/2014 | 10012 |
and then this:
3/1/2014 | 3/3/2014 | 3/4/2014 | 3/5/2014 | 3/6/2014 | 10012 |
At this point, the row is left alone. If you so desire, an accumulating snapshot allows you to store some lags between dates to help with analysis. This can help ease the burden of using the database to calculate the various lags.
In the future we will look at writing some ETL to populate an accumulating snapshot fact table.
Image courtesy of Vichaya Kiatying-Angsulee / FreeDigitalPhotos.net
Friday, March 21, 2014
Periodic Snapshot Fact Tables
In our last post we looked at one of three ways to design a fact table, called a transaction fact table. Today, let's look at a second design, called the periodic snapshot. Remember from Ralph Kimball's teaching (and last week's post) that a transaction fact table gains a row each time that something happens. Using our banking example, from last week, each deposit or withdrawal will result in a record being inserted. Looking at only one record will allow us to see that one event. Adding these records will allow us to see the balance.
A periodic snapshot contains snapshots of the data as it existed at various points in time. Unlike a transaction fact table, selecting one row (or perhaps a subset of rows if it is semi-additive) will display the current value at that point. Our example from last week involved opening a checking account at Acme Bank on 2/1/14 and making an initial deposit of $3,000. Three days later you withdrew $200. Five days after that, you deposited $1,000. If a periodic snapshot were written to show the balance at a daily level, a plain-english version may look something like this:
Date
|
Amount
|
2/1/2014 | $3,000 |
2/4/2014 | $2,800 |
2/9/2014 | $3,800 |
Now, let's use this table to answer the same questions that we examined last week:
1.) What was the account balance on 2/4?
Unlike the transaction fact table, in order to find the balance on 2/4, we only need to look at the 2/4 row. The 2/4 row contains a snapshot of the current balance on 2/4, as opposed to only the event that occured on 2/4. By looking at the 2/4 row, we will see that the balance was $2,800.
2.) What was the account balance on 2/9?
Using the same logic that was explained in #1 above, look only at the 2/9 rows. This will give you a value of $3,800.
3.) How much was deposited on 2/9?
This type of question cannot be answered using a periodic snapshot. A periodic snapshot will store the current state of the business process as of the applicable period, but will not store the events leading to the current state.
In the future we will look at writing some ETL to populate a periodic snapshot fact table. We will also take a look at an additional fact table design.
Image courtesy of cooldesign / FreeDigitalPhotos.net
Friday, March 14, 2014
Transaction Fact Tables
In his book The Data Warehouse Toolkit, Ralph Kimball explains that there are three ways to design a fact table. The first and probably most typical (my opinion) is called a transaction fact table. A transaction fact table is a fact table that contains measures, keys to dimension tables, and degenerate dimensions, if applicable. When using this fact table to examine the current state of something going through the business process, all rows should be summed through the latest time period that is being examined. This is due to the fact that a row is added to this fact table as an event in its respective business process occurs. The most common example involves the banking industry. Suppose you open a checking account at Acme Bank on 2/1/14 and make an initial deposit of $3,000. Three days later you withdraw $200. Five days after that, you deposit $1,000. A "plain-english version" of this fact table (without the descriptors) will look something like this:
Now, let's use this table to answer these very simple questions:
1.) What was the account balance on 2/4?
Notice how a row was added each time that a deposit or withdrawal was made. In order to find the balance on 2/4, we must look at everything that happened through 2/4. If we only look at the 2/4 row, we will only see the $200 withdrawal. However, it is important to know that $3,000 existed in the account before that withdrawal. So, if we sum the 2/1 row and the 2/4 row, we will see that the balance on 2/4 (after that transaction posted) was $2,800.
2.) What was the account balance on 2/9?
Using the same logic that was explained in #1 above, sum the 2/1, the 2/4, and the 2/9 rows. This will give you a value of $3,800.
3.) How much was deposited on 2/9?
When looking at only one of the events that occured as opposed to the sum of everything that occurred, only that one row should be taken into consideration. The 2/9 row by itself will tell us that $1,000 was deposited into the account. Unlike #2 above, we do not need to consider the $2,800 that was in the account prior to 2/9.
In the future we will look at writing some ETL to populate a transaction fact table. We will also take a look at some additional fact table designs.
Image courtesy of twobee / FreeDigitalPhotos.net
Date
|
Amount
|
2/1/2014 | $3,000 |
2/4/2014 | ($200) |
2/9/2014 | $1,000 |
Now, let's use this table to answer these very simple questions:
1.) What was the account balance on 2/4?
Notice how a row was added each time that a deposit or withdrawal was made. In order to find the balance on 2/4, we must look at everything that happened through 2/4. If we only look at the 2/4 row, we will only see the $200 withdrawal. However, it is important to know that $3,000 existed in the account before that withdrawal. So, if we sum the 2/1 row and the 2/4 row, we will see that the balance on 2/4 (after that transaction posted) was $2,800.
2.) What was the account balance on 2/9?
Using the same logic that was explained in #1 above, sum the 2/1, the 2/4, and the 2/9 rows. This will give you a value of $3,800.
3.) How much was deposited on 2/9?
When looking at only one of the events that occured as opposed to the sum of everything that occurred, only that one row should be taken into consideration. The 2/9 row by itself will tell us that $1,000 was deposited into the account. Unlike #2 above, we do not need to consider the $2,800 that was in the account prior to 2/9.
In the future we will look at writing some ETL to populate a transaction fact table. We will also take a look at some additional fact table designs.
Image courtesy of twobee / FreeDigitalPhotos.net
Friday, March 7, 2014
Business Intelligence Yesterday, Today, and Tomorrow
They say that the only thing in life that is guaranteed not to change is change itself. The past few decades have proven this to be the case with technology. Old dial-in modems have given way to broadband access. Land line phones have practically been replaced by cell phones. Original flip phones are old news compared to today's smart phones...you get the idea. The key to navigating these waters is to remember the business that you are in. Both old modems and new modems connect users to an online experience. Land line phones allowed people to remain in touch with each other as do today's smartphones. In each case, the mechanics may have changed, but the goal remains. Companies that have lost themselves in the mechanics of the technology have a hard time moving to a new one. However, companies that are attached to the overall goal and merely see the technology as today's way of achieving that goal are often quicker to embrace change...and are able to survive.
These principles apply to individuals as well. The BI industry is in the midst of some huge change and it is important to remember the goal of BI. The traditional model of ETL jobs running in the middle of the night so that the data in the data warehouse will be available the next day is becoming less and less acceptable. Waiting several minutes for queries to run is becoming less acceptable as well. This has given way to the creation of in-memory database solutions that allow data scientists to analyze large datasets very quickly. The technologies are changing...but the goal is not (check out this Ralph Kimball white paper). When considering business intelligence solutions 10 years ago, today, or 10 years from now, one commonality exists. That commonality is the logical architecture. A business process must be understood in terms of its measures and descriptors so that it can be analyzed. A traditional data warehouse will create a place in which the data can physically reside on disk, based on that architecture. Solutions like SAP's HANA implement that architecture not on disk but in memory. In another decade, or so, another solution may exist.
Those individuals that understand that BI involves presenting the measures and descriptors of the business processes of an organization to its leaders will not only survive but will enjoy these changes. While we BI professionals must learn the mechanics of the best solution of the day in order to practically reach that goal, we also must expect that those mechanics will change.
What's the main objective of a BI professional? Not to write ETL. Not to display data using a certain tool. The main objective is to enable the leaders of the organization to make great decisions by providing good data.
Image courtesy of cooldesign / FreeDigitalPhotos.net
Friday, February 28, 2014
The Bus Matrix
One of the invaluable tools that Ralph Kimball describes in his book The Data Warehouse Toolkit is the bus matrix. The bus matrix is basically a grid that will ultimately allow you to see the relationships between fact tables and their conformed dimensions.
Recall from prior posts that a fact table contains measures (or events for factless fact tables) that pertain to a business process. A dimension table contains the descriptors of those measures. Dimension tables should be reused by multiple fact tables if more than one business process uses that dimension. For example, the salary fact table may describe the payee using the dim_employee dimension table. The help desk ticketing fact table may describe the person to whom a ticket is/was assigned using that same dim_employee dimension table. That concept is explained in this post.
A bus matrix is a two-dimensional grid that lists the business processes (which will become fact tables) along the left and the descriptors (which will become dimension tables) across the top. In the middle, an X or a check mark is placed at the intersection of a fact and dimension that belong in the same star schema. Consider this example:
When an analyst is gathering requirements in an effort to understand what needs to be warehoused, he can easily list the business processes that come from the conversation along the left of a white board. He can also list the descriptors (i.e., day, person, department, product, etc.) along the top. Later, these items can be translated into table names, resulting in a bus matrix.
Creating a bus matrix is a great idea (thanks to Mr. Kimball for that) for the following reasons:
1.) You can easily see the facts and dimensions that reside in your data warehouse. Entity Relationship Diagrams provide some great information, although they can get pretty large for a large data warehouse. If seeing the relationships at a high level is necessary, a bus matrix will allow that to be done very easily.
2.) As you add to your data warehouse you can revisit this document and add to it. Revisiting the bus matrix will help to ensure that you use the conformed dimensions as opposed to inadvertently re-creating one.
3.) We have been treating the bus matrix as a document that can be used to communicate some of the technical relationships of the data warehouse. That is not a bad use, but consider a version of the bus matrix that simply lists the business processes and descriptors (not their respective tables). Such a document will essentially describe the organization. The business processes and the entities that somehow touch those processes are all displayed visually, giving the executives a high level view of the makeup of their organization.
All of the cool BI that provides flashy new toys begins with working through these fundamentals first. The bus matrix helps the leaders of an organization think through the beginnings of their data management strategy.
Business Intelligence is a great industry with a very bright future. Have fun! Are you interested in entering this industry or do you know somebody who is? Consider this.
Image courtesy of ddpavumba / FreeDigitalPhotos.net
Recall from prior posts that a fact table contains measures (or events for factless fact tables) that pertain to a business process. A dimension table contains the descriptors of those measures. Dimension tables should be reused by multiple fact tables if more than one business process uses that dimension. For example, the salary fact table may describe the payee using the dim_employee dimension table. The help desk ticketing fact table may describe the person to whom a ticket is/was assigned using that same dim_employee dimension table. That concept is explained in this post.
A bus matrix is a two-dimensional grid that lists the business processes (which will become fact tables) along the left and the descriptors (which will become dimension tables) across the top. In the middle, an X or a check mark is placed at the intersection of a fact and dimension that belong in the same star schema. Consider this example:
Descriptors
|
|||||
Time
|
Employee
|
Vendor
|
Department
|
||
Business
Processes |
Payroll |
x
|
x
|
x
|
|
Shipping |
x
|
x
|
x
|
||
Accounts Receivable |
x
|
x
|
|||
Sales |
x
|
x
|
x
|
When an analyst is gathering requirements in an effort to understand what needs to be warehoused, he can easily list the business processes that come from the conversation along the left of a white board. He can also list the descriptors (i.e., day, person, department, product, etc.) along the top. Later, these items can be translated into table names, resulting in a bus matrix.
Creating a bus matrix is a great idea (thanks to Mr. Kimball for that) for the following reasons:
1.) You can easily see the facts and dimensions that reside in your data warehouse. Entity Relationship Diagrams provide some great information, although they can get pretty large for a large data warehouse. If seeing the relationships at a high level is necessary, a bus matrix will allow that to be done very easily.
2.) As you add to your data warehouse you can revisit this document and add to it. Revisiting the bus matrix will help to ensure that you use the conformed dimensions as opposed to inadvertently re-creating one.
3.) We have been treating the bus matrix as a document that can be used to communicate some of the technical relationships of the data warehouse. That is not a bad use, but consider a version of the bus matrix that simply lists the business processes and descriptors (not their respective tables). Such a document will essentially describe the organization. The business processes and the entities that somehow touch those processes are all displayed visually, giving the executives a high level view of the makeup of their organization.
All of the cool BI that provides flashy new toys begins with working through these fundamentals first. The bus matrix helps the leaders of an organization think through the beginnings of their data management strategy.
Business Intelligence is a great industry with a very bright future. Have fun! Are you interested in entering this industry or do you know somebody who is? Consider this.
Image courtesy of ddpavumba / FreeDigitalPhotos.net
Subscribe to:
Posts (Atom)