Friday, May 25, 2012

Hadoop MapReduce

As mentioned before, one of the challenges associated with turning data into something truly valuable is analyzing Big Data.  One of the tools the business intelligence community is using to meet this challenge is a software framework called Hadoop.  Hadoop takes advantage of the MapReduce concept to quickly process large amounts of data.  To explain how this works, consider an example.

If mom goes to the grocery store alone, getting the groceries on her list will take a certain amount of time.  If mom, dad, and the two kids all go and each takes responsibility for 25% of the items, the trip will take less time.  However, the items in the four separate grocery carts will have to be consolidated into one before being presented at the cash register.

In the same way, Hadoop places data onto different servers in a cluster.  When Hadoop is asked to execute a job, the servers each work on a piece of that job in parallel.  In other words, the job is divided and mapped to the various servers.  When the map step completes, each server holds one piece of the answer.  Those pieces are then combined (the reduce step) into the one complete solution.  This video does a great job of explaining the concept.
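The map and reduce steps can be sketched in a few lines of Python.  This is only a toy, single-machine illustration (the data slices and item names are made up, and nothing here is Hadoop itself), but it shows how independent partial results from the map step are combined into one answer in the reduce step:

```python
from collections import defaultdict

# Pretend these four lists live on four different servers in a cluster.
data_slices = [
    ["milk bread", "milk eggs"],
    ["bread butter"],
    ["eggs eggs milk"],
    ["butter"],
]

def map_phase(lines):
    """Map step: each 'server' turns its own slice into (item, 1) pairs."""
    pairs = []
    for line in lines:
        for item in line.split():
            pairs.append((item, 1))
    return pairs

def reduce_phase(all_pairs):
    """Reduce step: the partial results are consolidated into one total."""
    totals = defaultdict(int)
    for item, count in all_pairs:
        totals[item] += count
    return dict(totals)

# Each slice is mapped independently (in real Hadoop, in parallel)...
mapped = [pair for s in data_slices for pair in map_phase(s)]
# ...and the pieces are reduced to one complete solution.
result = reduce_phase(mapped)
print(result)  # → {'milk': 3, 'bread': 2, 'eggs': 3, 'butter': 2}
```

In real Hadoop the slices live on different servers, the map calls run in parallel, and the framework shuffles the (item, count) pairs to the reducers for you.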


Saturday, May 19, 2012

Practice Using Degenerate Dimensions

In the previous post, we looked at degenerate dimensions.  Rather than create a new SQL script for practice, we can revisit the transaction data star that we used in the February 25th post.  The script is here and the ERD is below.

Notice that the TRANSACTION_NUMBER and TRANSACTION_LINE columns are in the FACT_TRANSACTION table.  If we created a new dimension table, called DIM_TRANSACTION, with this structure (in Oracle)...
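A sketch of such a structure might look like the following (the surrogate key column name and the data types are assumptions; adjust them to match your own script):

```sql
CREATE TABLE DIM_TRANSACTION
(
  TRANSACTION_KEY     NUMBER(9)  NOT NULL,  -- surrogate key
  TRANSACTION_NUMBER  NUMBER(9)  NOT NULL,
  TRANSACTION_LINE    NUMBER(4)  NOT NULL,
  CONSTRAINT PK_DIM_TRANSACTION PRIMARY KEY (TRANSACTION_KEY)
);
```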

Such a table would contain as many rows as the FACT_TRANSACTION table.  This is not wrong, but the effort involved in creating surrogate keys that will each be used by only one fact row may not be necessary.  If you wish, run the create table statement above, add some data, and create some surrogate keys so that it can be joined to the fact table.  You will see that the reporting capabilities of the two designs are the same, with a bit more effort required to maintain the DIM_TRANSACTION table.  This is why I opted to place these columns directly in the fact table as degenerate dimensions.  When faced with this design question, consider which approach is best for your organization and move forward with turning data into valuable data.

Saturday, May 12, 2012

Degenerate Dimensions

One of the purposes of dimensional modeling is to identify what needs to be measured (facts) and identify the context needed to make those measures meaningful (dimensions).  So, for example, if an organization wishes to measure revenue from direct sales, the fact in this model is the revenue amount.  The context (or dimensions) may include things like sales person, date of sale, item sold, quantity, and/or a host of other things. 

As discussed previously, a typical design is to place the facts of a business process that are measured at the same grain into one table (fact table) and to group the contextual pieces into other tables (dimension tables).  Occasionally, a piece of context will exist at the same grain as the fact table.  For example, using the same scenario as we used above, suppose that the sales revenue needs to be stored at the sales transaction level.  If a transaction number exists for each individual transaction, then a dimension table that contains this value will contain as many rows as the fact table.  While there is nothing wrong with this approach, it may seem like extra work to create an additional table with a surrogate key that will only be referenced by one row of the associated fact table. 

As a result, some designers may opt to make this value a degenerate dimension (mentioned briefly at the end of a previous post).  A degenerate dimension is a dimension value stored directly in the fact table, rather than in a dimension table that the fact table references through a foreign key.  The database will not need to allocate space for surrogate keys in this case, since they would add little value anyway.  Making a dimension degenerate is not a must in this situation; the standard approach of creating a separate table is fine too.
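To make the contrast concrete, here is a sketch of the two designs for the direct-sales scenario above (all table names, column names, and data types are hypothetical, in Oracle syntax):

```sql
-- Standard approach: a separate dimension table, joined by surrogate key
CREATE TABLE DIM_SALES_TRANSACTION
(
  TRANSACTION_KEY     NUMBER(9)  NOT NULL PRIMARY KEY,
  TRANSACTION_NUMBER  NUMBER(9)  NOT NULL
);

CREATE TABLE FACT_SALES
(
  TRANSACTION_KEY  NUMBER(9)     NOT NULL REFERENCES DIM_SALES_TRANSACTION,
  DATE_KEY         NUMBER(9)     NOT NULL,
  REVENUE_AMOUNT   NUMBER(12,2)  NOT NULL
);

-- Degenerate dimension: the transaction number sits in the fact table,
-- so there is no one-row-per-fact dimension table to build or maintain
CREATE TABLE FACT_SALES_DD
(
  TRANSACTION_NUMBER  NUMBER(9)     NOT NULL,
  DATE_KEY            NUMBER(9)     NOT NULL,
  REVENUE_AMOUNT      NUMBER(12,2)  NOT NULL
);
```

Either design can answer the same reporting questions; the degenerate version simply skips the join and the surrogate-key maintenance.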

As with everything that we do as Business Intelligence experts, when faced with this design question it is best to consider the value that each option brings to the table.  Asking some of the following questions may be a good start...

1.) Do I have too many dimension tables included in this star already? 
2.) How will the reporting tools react to this design? 
3.) Will the report writers become confused if I design it this way?

If needed, spend some time absorbing the concept of degenerate dimensions.  We will look at some examples later.


Friday, May 4, 2012

What Does Big Data Mean For Us?

In the previous post we looked at the problem of Big Data and the industry's move toward more efficient relational (or even non-relational) database products that may remove the need for a data warehouse.  Some may wonder what this means for those with skills in data modeling and ETL development.  To answer this question, I'm going to point to another blog that dealt with it recently: the Star Schema Central blog (on my blog list on the right side of the screen), maintained by Chris Adamson.  The link to that particular post is here.  He does a great job of explaining why dimensional modeling (the logical design of a data warehouse) will still be important in the world of Big Data.  This is an area that is still being defined, so the possibilities are still being explored.

While so much is up in the air regarding the ways data will be modeled or ETL'd in the world of Big Data, consider that as you move forward in your career your overall mission is not to model data a certain way or even to write ETL.  Those are just techniques that work today, and they may become outdated.  Techniques change over time, but the main goal remains the same: to provide meaningful data to decision makers quickly and efficiently.  By focusing on this, no matter what the technique or technology of the day is, you will be turning your data into something of true value.
