Valuable Data: February 2012

Saturday, February 25, 2012

Practice Makes Perfect

As with any skill, only so much can be learned from a book (or a blog post). In this post, I’m hoping to offer some hands-on training that will allow you to play with some data a bit. If you have an Oracle environment available to you (check the previous post if you do not), you can run the script found here. If you are unfamiliar with running a script like this, some suggestions are in the previous post as well. The following tables with data will be created…

This is a very basic star that holds data related to sales transactions. Here is a quick breakdown of the tables…

Dim_Product (22 rows): This is a dimension table that contains a product’s overall group (larger grouping) and then the individual product numbers and names within that group (smaller grouping).

Dim_Employee (19 rows): This is a dimension table that contains information regarding the employees that sold the products.

Dim_Day (365 rows): This is a date dimension table that contains one row for each day as well as higher-level date groupings. In this example, this table only holds data in calendar year 2011 (1/1/2011 – 12/31/2011).

Fact_Transaction (300 rows): This is a fact table that contains one row for each transaction line item. So, if an individual purchased three items on the same transaction (i.e., brought three items to the cash register) that transaction will produce three rows in this fact table.

Data Warehousing is ideal for top-down analysis. Once the script has been run in Oracle, and the tables have been created, consider the following example…

select b.calendar_year, sum(a.amount)

from fact_transaction a,

dim_day b,

dim_employee c,

dim_product d

where a.key_day = b.key_day

and a.key_sales_person = c.key_employee

and a.key_product = d.key_product

group by b.calendar_year

This select statement shows the dollars sold per fiscal year, although there is only one year in this case. If you want to throw in the product_group to see the dollars sold per year, per product_group, that will give you an additional level of detail…

select b.calendar_year, d.product_group, sum(a.amount)

from fact_transaction a,

dim_day b,

dim_employee c,

dim_product d

where a.key_day = b.key_day

and a.key_sales_person = c.key_employee

and a.key_product = d.key_product

group by b.calendar_year, d.product_group

Do you see what is happening? You can continue drilling down until you get to a specific transaction_line if you wish.

An analyst may examine the data at a high-level and notice that one quarter had an enormous number of sales dollars. She can keep adding descriptors to the select clause (and group by them) to eventually find where those dollars lie. Did one sales rep have a stellar quarter? Did one product group do exceptionally well? Did sales go through the roof on one particular day out of the quarter? Did one product alone do exceptionally well? Once you have the information you can turn that into action. Does somebody deserve a raise for their extraordinary work? Did one marketing strategy work remarkably well, meaning that we need to consider implementing it across the entire organization?

This type of structure can turn data into valuable data. Valuable data can enable great decisions. Have fun playing with the data. See how many ways you can slice and dice it. We will do some more with this star in future posts.

Sunday, February 19, 2012

Hands-On Training

As with anything else, the way to become familiar with data management is to constantly work with data. I would like to offer some of that through this blog, in addition to continuing to examine some of the concepts of data warehousing. I plan to provide scripts that will create data in your own Oracle instance so that you can practice working with it. If you do not have access to (and permission to use) an Oracle instance through your employer, your school, or some other entity, one option is to use a couple of free products explained below.

Oracle offers a free product called Oracle Database Express Edition that, among other things, allows users to become familiar with data management. This is a smaller version of the Oracle database that can be loaded onto a local machine running Windows or Linux. Oracle Database Express Edition can be downloaded here.

Oracle also offers a free integrated development environment (IDE) called Oracle SQL Developer that allows you to easily query data and build database objects. Oracle SQL Developer can be downloaded here.

As a test, download the very simple sql script here. This will create one basic table and insert three records. Here are two ways of running it…

From an IDE - Open the file from an IDE (such as Oracle SQL Developer although others may be used as well) and run the script. In Oracle SQL Developer, this involves clicking the Run Script button on the tool bar or pressing F5.

Using SQLPlus (in Windows)

1.) On the machine on which Oracle is installed go to the windows command line (from the Run box type CMD).

2.) Navigate to the folder that contains the script using the cd syntax.

3.) Type the following: sqlplus {schema name}/{password} @{filename} So, if you are loading this into a schema called “myschema” with a password being “mypassword”, the command would look like this: sqlplus myschema/mypassword @test_tbl Including the “.sql” extension with the filename is not necessary.

4.) Press Enter.

You should see that one table was created, three rows were created, and the transactions were committed. You can simply drop the test_tbl table at this point. We will still be looking at some of the theory behind data warehousing, but along with that I’ll provide some paths for hands-on training. I don’t want to get lost in the mechanics of getting the data into the database, so if you have a better way of loading a script like this, by all means use it (and share with us if you would like).

Find an Oracle environment that you works well for you, or create one using the products above (or something else if you like). In the next post or two, I’ll provide a star schema that we can look at.

As an aside, I’m not trying to endorse Oracle as being a better product for building a data warehouse than something else. My experience is with this product so it is the easiest and quickest way for me to share data in this way. If you would like to take my scripts and update them so that they are appropriate for another product, feel absolutely free to do that.

Image: Ambro / FreeDigitalPhotos.net

Saturday, February 11, 2012

Natural Keys vs. Surrogate Keys

A prior post examined fact and dimension tables. The concept is fairly easy to understand but there is some more complexity to dimension tables that I want to touch on. That complexity surrounds the ways in which records are uniquely identified. Two terms called a natural key and a surrogate key come into play.

Natural Key: A natural key is a key that uniquely identifies a piece of descriptive data (dimension) from a business perspective. Giving a business expert this value will mean something to him.

Surrogate Key: A surrogate key is a key that is internal to the data warehouse. It most often takes the form of an auto generated number that serves as the primary key of a dimension table. Giving a business expert this value will mean nothing to him.

Consider the following table (also used in the other post)…

This table is a dimension table that contains data about products. From the business’ point of view, the way that you uniquely identify a product is by the product number. A user of the data warehouse can compare the product_number field to the source system and find it in there. The product_number field, therefore, is known as the natural key.

However, giving the business users the key_product value will be meaningless to them. No source system will have that value. As new products are added to the dimension table, the ETL auto generates a number and places it into this field. The only reason for this is so that it can be used for the primary key/foreign key relationship between this dimension table and the fact table. Therefore, the key_product field is the surrogate key.

Why can’t the natural key serve as the surrogate key? Good question. It’s a bit much to answer in this post but I’ll basically say that there may be instances in which more than one record will exist in the dimension table for a given product.

Saturday, February 4, 2012

Star Schema

In a prior post we looked at the value-add of a data warehouse using the following example. These relationships may seem a bit unusual to somebody who is not familiar with data warehousing, so I wanted to touch on this just a bit. When tables relate to each other in this way, they are a part of a star schema. Each star schema has two types of tables – Fact tables and Dimension tables.

Fact Tables: A Fact table contains the data that is to be measured. In the example above (a simple example) the fact_transaction table is the fact table. The actual_sales_price is the item that is measured. In this case, only one measureable item is included. If we wanted to add another measureable item (i.e., sales_price_before_discount) we could do that. This table can contain any number of additive items that fit the grain. What is grain? A fact table’s grain is basically the answer to this question: What does each row in the fact table represent? In this case, the grain is one row for each transaction line item. So, if a customer purchases three items, then that will result in three rows (among all of the others that pertain to other transactions) in this fact table. Generally speaking (of course, there are a few exceptions) the fact table itself does not contain a description of this measurable data. Another kind of table fulfills this role.

Dimension Tables: A Dimension table describes the data that is contained in the fact tables. The idea is that you aggregate the measureable objects (i.e., sum the actual sales price) and then group by the dimension objects. You can analyze the actual sales prices for each product, for each year, for each month, and so on. Notice that the dim_product table, for example, contains the lower-level product data and the higher-level product group data. A developer who is creating tables in third-normal form would probably place this lower-level data and higher-level data into two different tables. In the data warehousing world, this may not happen. Looking at the raw data in this table may show something like…

Product Department Name	Product Group Name	Product Name
Men’s Dept	Shoes	Brand A Sneakers
Men’s Dept	Shoes	Brand B Dress Shoes
Women’s Dept	Accessories	Wrist Watch
Women’s Dept	Accessories	Beaded Necklace
Children’s Dept	Athletic Wear	Sweat Pants

Notice that some of the data repeats. This allows you to easily see the actual sales price at a high level (i.e., price per department) and then as you add additional descriptive data to the select statement and group by it you can drill down to a more granular level (i.e., price per department per product group). In some cases, a descriptive element belongs by itself. Notice the dd_transaction_no field in the fact_transaction table. This describes the fact in terms of the transaction. In other words, which transaction did this refer to? This doesn’t belong in any of the existing dimension tables. Rather than just create a dimension table with a key and this transaction number, we can place it directly into the fact table. It is known as a degenerate dimension in this case.

Fact tables and dimension tables are the two main types of tables used in data warehousing. Others exist which we may examine in the future. For now, these are the fundamentals of this type of system.