Valuable Data: March 2014

Friday, March 28, 2014

Accumulating Snapshot Fact Tables

We've been looking at the various fact table design options in data warehousing. Two weeks ago we examined the transaction fact table and last week we examined the periodic snapshot fact table. The third and final option is called an accumulating snapshot. This type of fact table is different from the other two in one big way. Each row is often revisited. Consider our banking example from the other posts. When a deposit is added to a transaction fact table, that row is added and then left alone. All of the data needed to add and complete that row is known. The same is true of periodic snapshots. An accumulating snapshot fact table begins each row and then accumulates data until that row is complete. Let's consider an example in a different context. Suppose that a star is built for the purposes of analyzing a help desk ticketing business process. The following descriptors (in addition to any measures) are a part of the business process:

Date Ticket Opened
Date Ticket Assigned
Date Solution Provided To Customer
Date Customer Accepted Solution
Date Ticket Closed
Ticket Number

Suppose that ticket number 10012 is opened on 3/1/14. At that point, this row will exist in the fact table:

Date Ticket Opened	Date Ticket Assigned	Date Solution Provided To Customer	Date Customer Accepted Solution	Date Ticket Closed	Ticket Number
3/1/2014					10012

Now, suppose that the ticket is assigned to a technician on 3/3/14. That same row will be updated to look like this:

Date Ticket Opened	Date Ticket Assigned	Date Solution Provided To Customer	Date Customer Accepted Solution	Date Ticket Closed	Ticket Number
3/1/2014	3/3/2014				10012

If the solution is provided to the customer one day later, the row will be updated to look like this:

Date Ticket Opened	Date Ticket Assigned	Date Solution Provided To Customer	Date Customer Accepted Solution	Date Ticket Closed	Ticket Number
3/1/2014	3/3/2014	3/4/2014			10012

Assuming that the customer accepts the solution on 3/5 and the ticket is closed 3/6, the row will be updated to look like this:

Date Ticket Opened	Date Ticket Assigned	Date Solution Provided To Customer	Date Customer Accepted Solution	Date Ticket Closed	Ticket Number
3/1/2014	3/3/2014	3/4/2014	3/5/2014		10012

and then this:

Date Ticket Opened	Date Ticket Assigned	Date Solution Provided To Customer	Date Customer Accepted Solution	Date Ticket Closed	Ticket Number
3/1/2014	3/3/2014	3/4/2014	3/5/2014	3/6/2014	10012

At this point, the row is left alone. If you so desire, an accumulating snapshot allows you to store some lags between dates to help with analysis. This can help ease the burden of using the database to calculate the various lags.

In the future we will look at writing some ETL to populate an accumulating snapshot fact table.

Image courtesy of Vichaya Kiatying-Angsulee / FreeDigitalPhotos.net

Friday, March 21, 2014

Periodic Snapshot Fact Tables

In our last post we looked at one of three ways to design a fact table, called a transaction fact table. Today, let's look at a second design, called the periodic snapshot. Remember from Ralph Kimball's teaching (and last week's post) that a transaction fact table gains a row each time that something happens. Using our banking example, from last week, each deposit or withdrawal will result in a record being inserted. Looking at only one record will allow us to see that one event. Adding these records will allow us to see the balance.

A periodic snapshot contains snapshots of the data as it existed at various points in time. Unlike a transaction fact table, selecting one row (or perhaps a subset of rows if it is semi-additive) will display the current value at that point. Our example from last week involved opening a checking account at Acme Bank on 2/1/14 and making an initial deposit of $3,000. Three days later you withdrew $200. Five days after that, you deposited $1,000. If a periodic snapshot were written to show the balance at a daily level, a plain-english version may look something like this:

Date	Amount
2/1/2014	$3,000
2/4/2014	$2,800
2/9/2014	$3,800

Now, let's use this table to answer the same questions that we examined last week:

1.) What was the account balance on 2/4?

Unlike the transaction fact table, in order to find the balance on 2/4, we only need to look at the 2/4 row. The 2/4 row contains a snapshot of the current balance on 2/4, as opposed to only the event that occured on 2/4. By looking at the 2/4 row, we will see that the balance was $2,800.

2.) What was the account balance on 2/9?

Using the same logic that was explained in #1 above, look only at the 2/9 rows. This will give you a value of $3,800.

3.) How much was deposited on 2/9?

This type of question cannot be answered using a periodic snapshot. A periodic snapshot will store the current state of the business process as of the applicable period, but will not store the events leading to the current state.

In the future we will look at writing some ETL to populate a periodic snapshot fact table. We will also take a look at an additional fact table design.

Image courtesy of cooldesign / FreeDigitalPhotos.net

Friday, March 14, 2014

Transaction Fact Tables

In his book The Data Warehouse Toolkit, Ralph Kimball explains that there are three ways to design a fact table. The first and probably most typical (my opinion) is called a transaction fact table. A transaction fact table is a fact table that contains measures, keys to dimension tables, and degenerate dimensions, if applicable. When using this fact table to examine the current state of something going through the business process, all rows should be summed through the latest time period that is being examined. This is due to the fact that a row is added to this fact table as an event in its respective business process occurs. The most common example involves the banking industry. Suppose you open a checking account at Acme Bank on 2/1/14 and make an initial deposit of $3,000. Three days later you withdraw $200. Five days after that, you deposit $1,000. A "plain-english version" of this fact table (without the descriptors) will look something like this:

Date	Amount
2/1/2014	$3,000
2/4/2014	($200)
2/9/2014	$1,000

Now, let's use this table to answer these very simple questions:

1.) What was the account balance on 2/4?
Notice how a row was added each time that a deposit or withdrawal was made. In order to find the balance on 2/4, we must look at everything that happened through 2/4. If we only look at the 2/4 row, we will only see the $200 withdrawal. However, it is important to know that $3,000 existed in the account before that withdrawal. So, if we sum the 2/1 row and the 2/4 row, we will see that the balance on 2/4 (after that transaction posted) was $2,800.

2.) What was the account balance on 2/9?
Using the same logic that was explained in #1 above, sum the 2/1, the 2/4, and the 2/9 rows. This will give you a value of $3,800.

3.) How much was deposited on 2/9?
When looking at only one of the events that occured as opposed to the sum of everything that occurred, only that one row should be taken into consideration. The 2/9 row by itself will tell us that $1,000 was deposited into the account. Unlike #2 above, we do not need to consider the $2,800 that was in the account prior to 2/9.

In the future we will look at writing some ETL to populate a transaction fact table. We will also take a look at some additional fact table designs.

Image courtesy of twobee / FreeDigitalPhotos.net

Friday, March 7, 2014

Business Intelligence Yesterday, Today, and Tomorrow

They say that the only thing in life that is guaranteed not to change is change itself. The past few decades have proven this to be the case with technology. Old dial-in modems have given way to broadband access. Land line phones have practically been replaced by cell phones. Original flip phones are old news compared to today's smart phones...you get the idea. The key to navigating these waters is to remember the business that you are in. Both old modems and new modems connect users to an online experience. Land line phones allowed people to remain in touch with each other as do today's smartphones. In each case, the mechanics may have changed, but the goal remains. Companies that have lost themselves in the mechanics of the technology have a hard time moving to a new one. However, companies that are attached to the overall goal and merely see the technology as today's way of achieving that goal are often quicker to embrace change...and are able to survive.

These principles apply to individuals as well. The BI industry is in the midst of some huge change and it is important to remember the goal of BI. The traditional model of ETL jobs running in the middle of the night so that the data in the data warehouse will be available the next day is becoming less and less acceptable. Waiting several minutes for queries to run is becoming less acceptable as well. This has given way to the creation of in-memory database solutions that allow data scientists to analyze large datasets very quickly. The technologies are changing...but the goal is not (check out this Ralph Kimball white paper). When considering business intelligence solutions 10 years ago, today, or 10 years from now, one commonality exists. That commonality is the logical architecture. A business process must be understood in terms of its measures and descriptors so that it can be analyzed. A traditional data warehouse will create a place in which the data can physically reside on disk, based on that architecture. Solutions like SAP's HANA implement that architecture not on disk but in memory. In another decade, or so, another solution may exist.

Those individuals that understand that BI involves presenting the measures and descriptors of the business processes of an organization to its leaders will not only survive but will enjoy these changes. While we BI professionals must learn the mechanics of the best solution of the day in order to practically reach that goal, we also must expect that those mechanics will change.

What's the main objective of a BI professional? Not to write ETL. Not to display data using a certain tool. The main objective is to enable the leaders of the organization to make great decisions by providing good data.

Image courtesy of cooldesign / FreeDigitalPhotos.net