Friday, June 29, 2012

Big Data Defined

A typical question asked of business intelligence professionals is - what is big data?  In a previous post I provided some explanation of the big data challenge.  As I've learned more about this very interesting piece of business intelligence, I believe that a more complete explanation is available.  Consider this...

Suppose you own a day care, meaning that you are in the business of caring for children each day.  You have determined that in order for you to have a truly challenging day, three things have to occur.  First, you must have a large number of kids to care for on that given day.  How large is large?  There are no hard and fast rules...large enough that your current resources are strained at best and inadequate at worst.  Second, let's assume that you have no help, meaning that you need to know what each child is doing right now.  Not knowing what one of them was doing for an hour and then finding out later that he was painting the refrigerator is not acceptable.  Third, you have a very diverse group of kids.  This does not refer to diversity with regards to race or ethnicity but diversity with regards to personality.  Some of the kids love to play outside and some inside.  Some are into puzzles and others are into riding bicycles.  As a result, keeping the kids engaged in activities that they enjoy and in which they are gifted can be challenging. 

If two of the three challenges exist, the day is still challenging but not to the same extent.  You can know what each child is doing in a diverse group if you have a small number of them.  You can deal with a large number of kids if they are ALL sitting in the same room, doing the same thing.  It's the combination of the three problems that presents a challenge.  Now, consider this in terms of big data.  Big data is characterized as having three V's. 

1.) Volume - In order for data to be considered big data, it must be large.  How large is large?  I'm not so sure that there is a hard and fast boundary between "normal" data and big data in terms of size.  However, I would assume that if managing the data for the purposes of business intelligence presents problems because of its size, then this V applies.

2.) Velocity - Part of the big data problem involves dealing with the speed at which the data comes in.  Decision makers want to know what is happening in their business and in the marketplace now, as opposed to experiencing a lag.  For example, if an announcement was just made with regards to a new line of business, what are people saying about that on Twitter now?  If there is a need to stream data in some way so that it can be analyzed in real time, then this V applies.

3.) Variety - Part of the big data problem involves dealing with various types of data.  Relational databases are good at storing structured data.  Structured data is data in which each element is placed into a fixed area (such as columns and rows in a database or an XML schema) that was created for that specific element's characteristics.  Dates belong here, integers belong there, etc.  Unstructured data, such as a tweet or the body of an email message, is more free-form.  In other words, there is nothing governing the type of data that is stored in those environments.  If decision makers want to analyze various types of data (such as both structured and unstructured data), then this V applies.
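As a rough illustration of the difference (the table and column names below are made up for this example, not taken from any particular system), structured data fits into predefined, typed columns, while unstructured data such as a tweet is simply stored as free-form text...

-- Structured: each element has a fixed, typed place created for it
CREATE TABLE customer_order (
  order_id     NUMBER(10),
  customer_id  NUMBER(10),
  order_date   DATE,
  order_total  NUMBER(12,2)
);

-- Unstructured: nothing governs what ends up in the text column
CREATE TABLE tweet_archive (
  tweet_id    NUMBER(18),
  created_at  DATE,
  tweet_text  VARCHAR2(4000)
);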

While challenges may exist when experiencing only one or two of the above mentioned V's, the industry generally agrees that if all three characteristics apply to your data, then you have big data.

Also, remember to visit http://www.brianciampa.com/ for more information on data warehousing and business intelligence.

Image: FreeDigitalPhotos.net

Friday, June 22, 2012

Snowflake Schema

Going back to the piggy bank example from this blog's first post, we know that the value of a data warehouse lies in allowing a user to analyze data easily.  This is mainly achieved through denormalization.  This often differs from the value of a source system, which lies in getting data into the system quickly.  That is mainly achieved through normalization.

So, if we were to run a select statement against a source system that is designed to return (among other things) an employee name as well as his department, those two things may be stored in two tables.  The employee table may simply store a foreign key to the department table, which stores the names of the departments.  The query may find 'John Smith' in the employee table and then the number 6 in the department field.  The database will have to go to the department table, look up the row with the number 6 to see that the value is 'Human Resources' and then return those two values, along with any other values that were specified in the select statement, in the query results.  Doing this for several rows and for several different kinds of values will take some time.

In a typical data warehouse design, however, all of these values may exist in the same table and be repeated.  This way, the query can go to one table and find 'John Smith' and 'Human Resources' without having to translate a foreign key.  This keeps things quick.
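To make that concrete, here is a minimal sketch of the two queries (the table and column names are illustrative only, not from an actual system)...

-- Normalized source system: the department name requires a join
SELECT e.employee_name,
       d.department_name
FROM   employee e
JOIN   department d
       ON d.department_id = e.department_id;

-- Denormalized warehouse dimension: both values live on the same row
SELECT employee_name,
       department_name
FROM   dim_employee;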

However, there are some instances in which some normalization is appropriate in a star schema.  This normalization is not to the extent that it exists in the source system, which is often in third normal form, but it is normalization nonetheless.  Consider the ERD below (click on it to expand)...



This is yet another addition to the star schema that we have been using for the past several weeks.  Notice the DIM_EXECUTIVE table at the bottom.  It is a dimension table that is joined directly to the DIM_EMPLOYEE dimension table.  It is not joined to the fact table.  Why would we not add the four non-key values directly to the DIM_EMPLOYEE table?  We could and that would not be incorrect, but what if there is not much executive information compared to the number of employees?

So, if the DIM_EXECUTIVE table contains 10 rows and the DIM_EMPLOYEE table contains 4,000,000 rows, then placing this executive information into the DIM_EMPLOYEE table will result in 3,999,990 rows with empty executive columns.  In this case, keeping the executive information in its own dimension table might make sense.

When a dimension table joins to another dimension table, the star schema is referred to as a snowflake schema.  The "second layer" of dimension tables causes the tables, when they are depicted in an ERD, to resemble a snowflake.
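A rough sketch of how a query against this design might look (the fact table and column names here are my assumptions and may differ from the actual script mentioned below)...

-- The fact table joins to DIM_EMPLOYEE, which in turn joins to DIM_EXECUTIVE
SELECT f.sale_amount,
       e.employee_name,
       x.executive_name
FROM   fact_sales f
JOIN   dim_employee e
       ON e.key_employee = f.key_employee
LEFT JOIN dim_executive x
       ON x.key_executive = e.key_executive;  -- NULL for non-executives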

I will add that when in doubt, it is probably best not to snowflake.  Snowflaking causes some additional complexity on the part of the reporting tools when it comes to interpreting the data.

To experiment with analyzing this kind of data, you can run the script found here in an Oracle environment.  Rather than continuing to add to the previous script little by little, this script will create ALL of the tables and data depicted in the ERD.  As with last week's post on junk dimensions, you have some options with regards to snowflaking or not.  Remember that your number one goal is to turn your organization's data into something that is truly valuable, thus enabling your leaders to make excellent decisions.  Make your decision with that in mind.

More on snowflaking can be found in the books listed in the additional reading section to the right as well as by visiting http://www.brianciampa.com/ and, under the Concepts menu, clicking Terms or Advanced.

Update: The Kimball training that I received taught me that I was incorrect in classifying this type of design as a snowflake schema.  A snowflake schema contains a completely normalized version of the dimension tables.  This post describes an outrigger table.  The example above should still be considered a star schema.  Sorry for my mistake.

Image(s): FreeDigitalPhotos.net

Friday, June 15, 2012

Junk Dimensions

When the time comes to move out of a house or apartment, one of the more tedious tasks involves packing your belongings into boxes.  While these boxes are typically organized based on their destination in the new house or their contents, there always seem to be a few items that just don't fit into any of these categories.  So, that last box contains these miscellaneous items with no good description.

This challenge is sometimes encountered in the area of dimensional modeling as well.  Remember from this post that there are instances in which a dimension value may not fit into any of the other dimension tables.  In this case, it is appropriate to add the value to the fact table as a degenerate dimension. 

Now, consider the scenario in which there may be a number of low cardinality dimension values that do not fit into an existing dimension table.  While adding each of these values to the fact table is not incorrect, another option exists.  Suppose that the source system that feeds the transaction star that we've typically used contains the following contextual elements (dimensions) for each transaction...

1.) Was this item paid for using cash, debit, or credit?
2.) Was this sale originated via a phone call, an online order, or a walk-in?
3.) Was this a promotional item (Y/N)?

Because these are all low cardinality values, consider this approach.  Each of the possible combinations is placed into a dimension table, assigned a surrogate key, and then joined to the fact table for analysis.  This is referred to as a junk dimension.  The contents of this junk dimension will be something like...

KEY_TRANSACTION_JUNK   PAYMENT_METHOD   ORIGINATION   PROMOTIONAL_ITEM
1                      Cash             Phone Call    Yes
2                      Cash             Online        Yes
3                      Cash             Walk-In       Yes
4                      Cash             Phone Call    No
5                      Cash             Online        No
6                      Cash             Walk-In       No
7                      Credit           Phone Call    Yes
8                      Credit           Online        Yes
9                      Credit           Walk-In       Yes
10                     Credit           Phone Call    No
11                     Credit           Online        No
12                     Credit           Walk-In       No
13                     Debit            Phone Call    Yes
14                     Debit            Online        Yes
15                     Debit            Walk-In       Yes
16                     Debit            Phone Call    No
17                     Debit            Online        No
18                     Debit            Walk-In       No

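One way to generate those 18 rows is to cross join the three lists of values and assign a surrogate key to each combination.  The sketch below shows the general idea (the table and column names are illustrative and assume an Oracle environment)...

-- Build every combination of the three low cardinality attributes
CREATE TABLE dim_transaction_junk AS
SELECT ROW_NUMBER() OVER (ORDER BY p.payment_method,
                                   o.origination,
                                   i.promotional_item) AS key_transaction_junk,
       p.payment_method,
       o.origination,
       i.promotional_item
FROM  (SELECT 'Cash'   AS payment_method FROM dual UNION ALL
       SELECT 'Credit' FROM dual UNION ALL
       SELECT 'Debit'  FROM dual) p
CROSS JOIN
      (SELECT 'Phone Call' AS origination FROM dual UNION ALL
       SELECT 'Online'     FROM dual UNION ALL
       SELECT 'Walk-In'    FROM dual) o
CROSS JOIN
      (SELECT 'Yes' AS promotional_item FROM dual UNION ALL
       SELECT 'No'  FROM dual) i;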
Creating a junk dimension with items that are not low cardinality is probably not a good design technique.  If you were to place the transaction line number, something that is unique to each fact row, into this dimension, the need to create all possible combinations of the junk dimension values would cause this table to contain more rows than the fact table.  In this case, the high cardinality items would best be placed directly into the fact table as degenerate dimensions, leaving the junk dimension to contain the low cardinality items.

If you wish, you may use one of the following methods to take a look at this data in Oracle...

1.) If you have already created the transaction star from the post on conformed dimensions, run this script to add a junk dimension.
2.) If you have not created the above mentioned transaction star, then run this script to create it first, and then run this one to add a junk dimension.

This ERD displays this star schema with the new junk dimension in the lower left-hand corner (click the image to enlarge).


As I mentioned earlier, placing these low cardinality items into the fact table as degenerate dimensions is not a bad approach.  However, in that case, each value will have to be stored on every fact row, which will require more space than the junk dimension approach.  That storage cost will have to be weighed against the cost of having the ETL look up the surrogate key in the junk dimension when loading the fact table.  You have some options in this case.
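For example, that lookup during the fact load might look something like this (a sketch only...the staging table and column names are assumptions)...

-- Translate the three source values into the junk dimension's surrogate key
-- before inserting each fact row
SELECT s.transaction_id,
       s.sale_amount,
       j.key_transaction_junk
FROM   stg_transaction s
JOIN   dim_transaction_junk j
       ON  j.payment_method   = s.payment_method
       AND j.origination      = s.origination
       AND j.promotional_item = s.promotional_item;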

Remember that your job is to enable excellent decision making for the leaders of your organization.  Consider the options and select the one that will best lead you down that path.

Update: In this post I mentioned that I made a slight error in my explanation of junk dimensions.  The examples of junk dimensions that I have provided included the word "junk" in the name.  Margy Ross suggests not naming it as such, which makes a lot of sense.  Encountering a table with junk in the name may cause some confusion (perhaps even concern) for an analyst who is not well-versed in dimensional modeling.

Image courtesy of FreeDigitalPhotos.net

Friday, June 8, 2012

Slowly Changing Dimensions Explained...Using Twitter

I recently changed my Twitter profile picture and, after seeing the results, I realized that this would be a great way to explain slowly changing dimensions.  When I uploaded the new picture and then viewed my timeline of tweets, there was no trace of the old picture.  I could not tell that it had ever been used.  Even the tweets that had been associated with the old picture at one point now displayed the new one.  This is the behavior of a type 1 slowly changing dimension.  History is not kept.  Only the dimension values that are current are displayed for ALL facts.

Now, imagine that the old picture remained for those tweets that were sent when it was active and that the tweets sent AFTER the new picture was uploaded contained the new picture.  Viewing my timeline of tweets would then show me when I uploaded new pictures.  Twitter does not work this way, but if it did, this would mimic the behavior of a type 2 slowly changing dimension.  As history changes, new rows are added to the dimension table.  Any fact rows that occur AFTER that change will point to the new row.  The fact rows that occurred before that change will continue to point to the older row.

Now, imagine that the timeline on Twitter showed two pictures for each tweet.  One picture represented the picture that was current at the time of the tweet.  The other represented the picture that is current now.  The latest tweets would probably show the same picture (the most current one) twice.  If this were the case, we could easily see that when I tweeted about something two months ago, my profile used one picture but now it uses another.  This would mimic the behavior of a type 3 slowly changing dimension, which is a hybrid between type 1 and type 2.  A type 3 slowly changing dimension will add a new row as dimension objects change (as in a type 2).  Each dimension row in that history, however, will contain a column (or columns) that represents the current value.  So, if a new row is added, the current value columns for the historical rows will be updated with the new current information.  An analyst can easily see that at the time of a certain fact, the dimension values were one thing but now they are something else.
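In SQL terms, the differences might look something like the sketch below.  This assumes a DIM_EMPLOYEE table with made-up columns, a sequence for the surrogate key, and effective dates with a current-row flag (one common way to implement type 2)...

-- Type 1: overwrite history...every fact now sees the new value
UPDATE dim_employee
SET    profile_picture = 'new_photo.jpg'
WHERE  employee_id = 101;

-- Type 2: keep history...expire the old row and insert a new one.
-- Facts loaded after the change point to the new surrogate key.
UPDATE dim_employee
SET    current_flag = 'N',
       effective_end_date = SYSDATE
WHERE  employee_id = 101
AND    current_flag = 'Y';

INSERT INTO dim_employee
       (key_employee, employee_id, profile_picture,
        effective_start_date, effective_end_date, current_flag)
VALUES (dim_employee_seq.NEXTVAL, 101, 'new_photo.jpg',
        SYSDATE, NULL, 'Y');

-- Type 3 (as described above): the historical rows also carry a column
-- holding the current value, which is updated whenever it changes
UPDATE dim_employee
SET    current_profile_picture = 'new_photo.jpg'
WHERE  employee_id = 101;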

Of course, writing ETL so that it populates a type 2 or type 3 slowly changing dimension is more complex than writing it for a type 1.  Creating a dimension to include history when the source system itself does not include history may cause difficulty if a dimension needs to be reloaded (although there are ways around this as well).  Consider all of the available options and make a decision that will turn your users' data into something that is truly valuable.

For some hands-on examples of slowly changing dimensions take a look at this post.  Also, remember that you can read this blog and access other data warehousing information by going to http://www.brianciampa.com/.

Saturday, June 2, 2012

Display Your Value

When I think about the traditional methods of job seeking in the IT industry I have to stop and reflect.  My experience with job seeking in this arena has usually consisted of looking for an open position, submitting a resume, attending the interview, answering the interview questions, and hoping that something in that entire process impressed the employer enough to make me a good offer.  However, somebody in an artistic area, such as a photographer, is expected to show concrete examples of what he can do before jobs (freelance or otherwise) are offered.  In my opinion, the difference is this: I am asking for a job...the artist is putting his value on display.

One of the challenges here is that, aside from some visual areas like web design, IT professionals typically build or support infrastructure on the back end.  So, to make a comparison to home building, the IT professionals would be analogous to those installing the plumbing as opposed to those laying the carpet.  The nature of that work is not very visual.

However, what if you could market that skill in a visual way, which would allow you to put your value on display?  A potential employer could actually see (not just discuss) what you have to offer.  As someone who has conducted interviews before, I know that it can be a bit difficult to determine solely from an interview, even if a test is involved, whether or not somebody can move through the process of warehousing data well.  Consider this as a method to display your ability to move through that process well...

1.) In Oracle, MSSQL, MySQL, or another system with which you are familiar, build a fictitious source system.
2.) Envision how users may want to analyze that data.
3.) Design and build some stars that will allow for this type of analysis.  Be sure to show your ability to use type 1, 2, and 3 slowly changing dimensions, conformed dimensions, degenerate dimensions, factless fact tables, and other techniques that you want to market.
4.) Write some ETL jobs to populate those stars.
5.) Create a website that displays this process for a potential employer to examine.

This does not have to cost any money.  If a database environment is not available to you, consider using the free versions of Oracle or MSSQL that run on a local machine.  Also, Google Sites is a free option for web design if you don't want to rent space on a server.

The intent here is not to market yourself as an expert in one specific piece of technology (although your knowledge about one of them will be apparent) but to display your approach to warehousing data.  Consider my portfolio as an example.  One of the best parts is that I incurred no monetary expenses when building this.  It only "cost" me some time and creativity...and I learned a lot along the way.

Also, you can now access this blog via the home page at http://www.brianciampa.com/.

Free images from FreeDigitalPhotos.net