Saturday, February 4, 2012

Star Schema

In a prior post we looked at the value-add of a data warehouse using the following example.  These relationships may seem a bit unusual to somebody who is not familiar with data warehousing, so I wanted to touch on this just a bit.  When tables relate to each other in this way, they are a part of a star schema.  Each star schema has two types of tables – Fact tables and Dimension tables.

Fact Tables: A Fact table contains the data that is to be measured.  In the example above (a simple example) the fact_transaction table is the fact table.  The actual_sales_price is the item that is measured.  In this case, only one measureable item is included.  If we wanted to add another measureable item (i.e., sales_price_before_discount) we could do that.  This table can contain any number of additive items that fit the grain.  What is grain?  A fact table’s grain is basically the answer to this question: What does each row in the fact table represent?  In this case, the grain is one row for each transaction line item.  So, if a customer purchases three items, then that will result in three rows (among all of the others that pertain to other transactions) in this fact table.  Generally speaking (of course, there are a few exceptions) the fact table itself does not contain a description of this measurable data.  Another kind of table fulfills this role.
Dimension Tables: A Dimension table describes the data that is contained in the fact tables.  The idea is that you aggregate the measureable objects (i.e., sum the actual sales price) and then group by the dimension objects.  You can analyze the actual sales prices for each product, for each year, for each month, and so on.  Notice that the dim_product table, for example, contains the lower-level product data and the higher-level product group data.  A developer who is creating tables in third-normal form would probably place this lower-level data and higher-level data into two different tables.  In the data warehousing world, this may not happen.  Looking at the raw data in this table may show something like…

Product Department Name
Product Group Name
Product Name
Men’s Dept
Shoes
Brand A Sneakers
Men’s Dept
Shoes
Brand B Dress Shoes
Women’s Dept
Accessories
Wrist Watch
Women’s Dept
Accessories
Beaded Necklace
Children’s Dept
Athletic Wear
Sweat Pants


Notice that some of the data repeats.  This allows you to easily see the actual sales price at a high level (i.e., price per department) and then as you add additional descriptive data to the select statement and group by it you can drill down to a more granular level (i.e., price per department per product group).  In some cases, a descriptive element belongs by itself.  Notice the dd_transaction_no field in the fact_transaction table.  This describes the fact in terms of the transaction.  In other words, which transaction did this refer to?  This doesn’t belong in any of the existing dimension tables.  Rather than just create a dimension table with a key and this transaction number, we can place it directly into the fact table.  It is known as a degenerate dimension in this case.
Fact tables and dimension tables are the two main types of tables used in data warehousing.  Others exist which we may examine in the future.  For now, these are the fundamentals of this type of system.

No comments:

Post a Comment