Friday, February 28, 2014

The Bus Matrix

One of the invaluable tools that Ralph Kimball describes in his book The Data Warehouse Toolkit is the bus matrix.  The bus matrix is a grid that ultimately lets you see, at a glance, the relationships between fact tables and their conformed dimensions.

Recall from prior posts that a fact table contains measures (or events for factless fact tables) that pertain to a business process.  A dimension table contains the descriptors of those measures.  Dimension tables should be reused by multiple fact tables if more than one business process uses that dimension.  For example, the salary fact table may describe the payee using the dim_employee dimension table.  The help desk ticketing fact table may describe the person to whom a ticket is/was assigned using that same dim_employee dimension table.  That concept is explained in this post.

A bus matrix is a two-dimensional grid that lists the business processes (which will become fact tables) along the left and the descriptors (which will become dimension tables) across the top.  In the middle, an X or a check mark is placed at the intersection of a fact and dimension that belong in the same star schema.   Consider this example:

                        Descriptors
Business Processes      | Time | Employee | Vendor | Department
------------------------|------|----------|--------|-----------
Payroll                 |  x   |    x     |        |     x
Shipping                |  x   |          |   x    |     x
Accounts Receivable     |  x   |          |        |     x
Sales                   |  x   |    x     |        |     x

When an analyst is gathering requirements in an effort to understand what needs to be warehoused, he can easily list the business processes that come out of the conversation along the left side of a white board.  He can also list the descriptors (e.g., day, person, department, product) along the top.  Later, these items can be translated into table names, resulting in a bus matrix.

Creating a bus matrix is a great idea (thanks to Mr. Kimball for that) for the following reasons:

1.) You can easily see the facts and dimensions that reside in your data warehouse.  Entity Relationship Diagrams provide great detail, but they can become unwieldy for a large data warehouse.  When the relationships need to be seen at a high level, a bus matrix allows that to be done very easily.

2.) As you add to your data warehouse you can revisit this document and add to it.  Revisiting the bus matrix will help to ensure that you reuse existing conformed dimensions rather than inadvertently re-creating one.

3.) We have been treating the bus matrix as a document that can be used to communicate some of the technical relationships of the data warehouse.  That is not a bad use, but consider a version of the bus matrix that simply lists the business processes and descriptors (not their respective tables).  Such a document will essentially describe the organization.  The business processes and the entities that somehow touch those processes are all displayed visually, giving the executives a high level view of the makeup of their organization.

All of the cool BI that provides flashy new toys begins with working through these fundamentals first.  The bus matrix helps the leaders of an organization think through the beginnings of their data management strategy.

Business Intelligence is a great industry with a very bright future.  Have fun!  Are you interested in entering this industry or do you know somebody who is?  Consider this.


Friday, October 12, 2012

Kimball Conference Lessons Learned

I recently had the distinct privilege of attending Ralph Kimball's Dimensional Modeling In Depth class, as described in this post. Learning directly from icons such as Ralph Kimball and Margy Ross has been a huge blessing and a very enjoyable experience.  One of the most eye-opening (and valuable) experiences has involved tweaking my understanding regarding concepts that I thought I understood...but found that I didn't. Some of these misunderstandings have even come out in this blog, so I'll use this post to correct some of those...

1.) Junk Dimensions - The examples of junk dimensions that I have provided included the word "junk" in the name.  Margy Ross suggests not naming it as such, which makes a lot of sense.  Encountering a table with junk in the name may cause some confusion (perhaps even concern) for an analyst who is not well-versed in dimensional modeling.

2.) Snowflake Schema - The terms portion of my website provides the following definition for a snowflake schema:

Occasionally there are reasons to join one dimension table to another dimension table. A schema in which this occurs is referred to as a snowflake schema. The ERD, in this case, will show this “second layer” of dimension tables as being similar in appearance to a snowflake.

This dimensional modeling class proved to me that this definition is a bit misleading.  Joining one dimension table to another, such as the example on the terms portion of my website, is referred to as an outrigger.  A snowflake schema involves an attempt to completely normalize a dimension into a chain of related tables.
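As a rough illustration of the distinction (using hypothetical table and column names, not anything from the class materials): an outrigger is a single dimension-to-dimension reference, while a snowflake normalizes a dimension into a chain of related tables.

-- Outrigger: one dimension table refers directly to another dimension table.
create table dim_employee (
  key_employee   number primary key,
  employee_name  varchar2(100),
  key_hire_date  number    -- joins to a date dimension; a single extra hop
);

-- Snowflake: the product dimension is normalized into separate brand and category tables.
create table dim_category (
  key_category   number primary key,
  category_name  varchar2(100)
);

create table dim_brand (
  key_brand      number primary key,
  brand_name     varchar2(100),
  key_category   number    -- joins to dim_category
);

create table dim_product (
  key_product    number primary key,
  product_name   varchar2(100),
  key_brand      number    -- joins to dim_brand
);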

3.) Type 3 Slowly Changing Dimension - In this post, I described Type 3 slowly changing dimensions as being a hybrid between type 1 and type 2.  In reality, this hybrid is actually referred to as a type 6 (I need to update the other post).  So, what is a type 3 slowly changing dimension?  I'll save that explanation for a future post; however, the type 3 is not the hybrid that I thought it was.

One of the advantages of attending a course like this is that you get to bounce your knowledge against some of the most brilliant minds in the industry.  In some cases they help to affirm what you already know.  In other cases they correct what you already "know"...which turns you into a stronger asset for your organization and for the industry. 

For more information on data warehousing concepts visit www.brianciampa.com.  For data that can be used to practice modeling and/or ETL, click on Career in Data Warehousing and then click Grow.  Also, if you need a fresh approach to marketing your data warehousing skillset, consider The Data Warehouse Portfolio.

Saturday, October 6, 2012

Kimball University

Just as a young quarterback would be thrilled to meet Peyton Manning or a young cook would jump at the chance to meet Paula Deen, I enjoyed that experience this week in the context of my vocation.  I sat at the feet of Margy Ross and Ralph Kimball.  When you mention these names among non-data warehousing professionals, you are often met with confused looks.  However, those in the data warehousing arena know these individuals as some of the most brilliant minds when it comes to modeling data.

Margy Ross spent the first two days polishing our skills on some of the more basic pieces of dimensional modeling.  She is a very professional yet light-hearted lady with a true talent for teaching these concepts.  I was blessed to learn that I had a few things wrong regarding some concepts that I thought I understood.  Some of those have even come out in this blog; I'll save the corrections for a future post.

Ralph Kimball spent days three and four going over some advanced concepts with regards to dimensional modeling (and a bit of ETL).  He is just as light-hearted, having the ability to explain very complex data warehousing concepts with humor inserted where appropriate.  I spent a fair portion of the class laughing, and I still smile as I remember their humor.  This was not a dry class, as some would expect (for those who enjoy data warehousing, that is).  Both Margy and Ralph are brilliant minds who have the humility to (1) impart some of their knowledge to serious data warehousing students in an interesting way and (2) answer concise questions that apply to your specific organization in a one-on-one setting (assuming you can catch them after class).  Here were some of my personal highlights...

1.) Ralph signed my copy of The Data Warehouse Toolkit by writing "Brian, keep to the grain.  Ralph Kimball".

2.) I've had a design conundrum at work regarding a many-to-many problem related to this post.  I asked him about it after class and he affirmed my suggested solution.

3.) Ralph began his first class on day three by explaining the way in which a data warehouse developer will begin asking a user what needs to exist in that user's data warehouse (which does not involve asking the question in that exact way).  As a part of that conversation, Ralph made an example out of yours truly, as though I were a successful account manager looking to implement a data warehouse. Of course he was painting a fictitious scenario to make his point but it was still a cool moment.

I would highly recommend the Dimensional Modeling In Depth course to anybody interested in the data warehousing arena.  I'm not affiliated with the Kimball Group in any way (other than being a fan), so I will not profit from anybody taking the course.  I recommend it because it is simply that good.


Friday, September 28, 2012

Data Extraction Techniques

One of the challenges of writing ETL involves deciding which records to pull from the source system.  If a salary star schema is built to contain payroll data, then the ETL job will refresh on a recurring basis to add new data to that star.  If some data related to salary dollars already exists in the star and some new data makes it into the source system, how can the ETL job know to load the new data into the star and avoid reloading the records that were previously loaded (since they already exist in the data warehouse)?  There is no one way of doing this, but consider these options...

1.) Use the source system's datetime stamps - An ETL developer should consider himself very blessed if his source system contains an insert and update datetime stamp for each record in the system.  If this is the case, then he can simply store the date and time of the most recent successful extract and then pull everything that has been inserted or updated since then.  It will be important for the ETL developer to understand how those audit columns are populated in each table, to make sure that necessary data will not be left out and unnecessary data will not be included due to odd occurrences (e.g., a source system table is reloaded for some reason and records that were initially added three years ago end up with yesterday's timestamp).  A sketch of this approach appears below.
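Here is a minimal sketch of that idea, assuming a hypothetical ETL_CONTROL table that stores the datetime of the last successful extract and a hypothetical SRC_SALARY source table with a LAST_UPDATE_DT audit column:

-- Pull only the rows inserted or updated since the last successful extract.
insert into stage_salary_extract
select s.employee_id,
       s.pay_date,
       s.wage_amount,
       s.last_update_dt
from   src_salary s
where  s.last_update_dt > (select c.last_extract_dt
                           from   etl_control c
                           where  c.job_name = 'SALARY_LOAD');

-- After a successful load, move the watermark forward to the newest timestamp extracted.
update etl_control
set    last_extract_dt = (select max(e.last_update_dt) from stage_salary_extract e)
where  job_name = 'SALARY_LOAD'
and    exists (select 1 from stage_salary_extract);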

2.) Compare to the prior extract - In his book The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, Ralph Kimball suggests always saving the prior extract in the staging area.  A new extract can be compared to this prior extract and only the differences will be loaded.  Finally, the new extract will overwrite the prior extract and be used as the baseline for the next update.  One way of making that comparison is sketched below.
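One way of making that comparison in SQL, assuming hypothetical STAGE_SALARY_NEW and STAGE_SALARY_PRIOR staging tables with identical layouts:

-- Rows that are new or changed since the prior extract.
-- (Rows deleted from the source would need a separate comparison in the other direction.)
insert into stage_salary_delta
select * from stage_salary_new
minus
select * from stage_salary_prior;

-- Once the delta has been processed, the new extract becomes the baseline for next time.
truncate table stage_salary_prior;
insert into stage_salary_prior
select * from stage_salary_new;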

3.) Always pull as of a certain date - In this approach, the salary dollars may always be extracted for the current calendar year (for example).  So, as the year goes on, each extraction will be slightly larger than the last.  In April, all salary dollars that date back to January will be extracted from the source.  In May, all salary dollars that date back to January will be extracted from the source as well.  When using this option to load a fact table, the ETL developer will need to delete the applicable year's records from the fact table and then load the extract; a sketch of that appears below.  Although several records will be deleted and simply reloaded (with no change), this is one way of grabbing what is needed from the source when a simpler solution is not possible.  However, any updates made to data that pertains to a different year will be missed, which must be taken into consideration.  When using this option to load a dimension table, the ETL developer will need to compare the extract to the dimension table to see if any of the dimension rows have changed and only load or update the changes.  Deleting records from the dimension table and reloading is not a good strategy, since the fact table has foreign keys that reference the dimension table.
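A minimal sketch of the fact table portion of this approach, using hypothetical table and column names:

-- Remove the current year's rows from the fact, then reload them from the extract.
delete from fact_salary
where  key_date in (select key_date
                    from   dim_date
                    where  calendar_year = extract(year from sysdate));

insert into fact_salary (key_date, key_employee, wage_amount)
select d.key_date,
       e.key_employee,
       s.wage_amount
from   stage_salary_extract s,
       dim_date d,
       dim_employee e
where  d.calendar_date = s.pay_date
and    e.employee_id   = s.employee_id;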

4.) Refresh the fact table - In this case all of the data from the source system will always be extracted, and the fact table will be truncated and reloaded each time.  This will only be an acceptable solution if the fact table can be loaded in a timely fashion.  Also, this approach is usually only acceptable for a fact table.  Even if an ETL developer plans to refresh an entire star (fact and dimension tables), she must consider the possibility that some of those dimensions are conformed dimensions, meaning that other fact tables reference them.

There are obviously more options than these four.  Whichever option you choose, it is important to build the job so that it can be run multiple times with no negative impact.  In other words, running the job twice should not result in duplicate records showing up.  Even though the initial extract may place more records than are needed into the staging area, the job should be "smart enough" to know exactly what to load into the data warehouse and what to exclude.  One re-runnable pattern is sketched below.
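For example, a dimension load written as a MERGE is naturally re-runnable.  This sketch assumes a hypothetical STAGE_EMPLOYEE_EXTRACT table, a DIM_EMPLOYEE dimension, and a SEQ_DIM_EMPLOYEE sequence, and shows a simple type 1 overwrite:

-- Existing employees are updated in place and new employees are inserted.
-- Running this twice against the same extract leaves the dimension unchanged the second time.
merge into dim_employee d
using stage_employee_extract s
on (d.employee_id = s.employee_id)
when matched then
  update set d.employee_name = s.employee_name,
             d.department_no = s.department_no
when not matched then
  insert (key_employee, employee_id, employee_name, department_no)
  values (seq_dim_employee.nextval, s.employee_id, s.employee_name, s.department_no);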

For more information on data warehousing techniques as well as data with which you can practice writing your own ETL, visit www.brianciampa.com/careerindatawarehousing/grow.html.  Also, if you need a fresh approach to marketing your skillset, consider The Data Warehouse Portfolio.


Friday, September 21, 2012

To Constrain or Not Constrain

One of the advantages of using a relational database is that you can mandate that certain relationships MUST exist within your data.  If somebody tries to enter data into the database that does not relate to other data correctly, the database can be configured to reject that "bad data" until it is corrected.  Since many data warehouses reside in relational databases, using this feature is an option for the data warehousing team as well. 

Consider this example...suppose that a star contains a DIM_PRODUCT table holding the products that are used by that business process.  If the Men's Raincoat product has a KEY_PRODUCT value of 27 (surrogate key), then the records in the corresponding fact tables that pertain to this product will have a KEY_PRODUCT value of 27 (foreign key).  That's a lesson from Database 101.  If somebody removes the Men's Raincoat product from the DIM_PRODUCT table while records that point to it exist in the fact table, then those fact table records will point to nothing...and become meaningless.

The advantage of enforcing this constraint within the database itself is that if somebody tries to remove the Men's Raincoat product from the DIM_PRODUCT table the database will not allow it until the "child records" from the fact table have been deleted or repointed.  Using this feature sounds like a no-brainer (and it may be) but a data warehouse provides an additional twist to this kind of decision since so much data is being inserted.
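This kind of enforcement is typically declared as a foreign key constraint.  A minimal sketch, assuming a hypothetical FACT_SALES table alongside the DIM_PRODUCT table from this example:

-- The database will now reject fact rows whose KEY_PRODUCT has no parent in DIM_PRODUCT,
-- and will block deletes of DIM_PRODUCT rows that fact rows still reference.
alter table fact_sales
  add constraint fk_fact_sales_product
  foreign key (key_product)
  references dim_product (key_product);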

Option 1: Enforce Constraints - This will ensure that the relationships between the fact and dimension tables are always valid from a technical perspective.  However, each time that the data is loaded via the ETL job, the fact table must check that a parent record exists in the dimension.  If (and only if) it exists, it will load that record into the fact.  Doing that for each and every record will ensure good data integrity, but it can also slow a job down.

Option 2: Do Not Enforce Constraints - This will probably result in a faster and more efficient ETL job.  However, the possibility exists that the fact table will contain some orphan records.

It is up to each data warehousing team to decide which is best for their particular situation.  In some cases it is appropriate to physically enforce constraints (option 1).  In other cases, it may be appropriate to logically enforce constraints (option 2), meaning that the data warehousing team will need to periodically run SQL statements that specifically look for orphaned records, such as the one sketched below.
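Such a check might look something like this sketch, again assuming a hypothetical FACT_SALES table:

-- Fact rows whose product key no longer exists in the dimension are orphans
-- that need to be repointed or removed.
select f.key_product,
       count(*) as orphan_rows
from   fact_sales f
where  not exists (select 1
                   from   dim_product d
                   where  d.key_product = f.key_product)
group by f.key_product;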

If you are looking for more information on data warehousing and/or data that can be used to practice ETL and architecture skills, visit www.brianciampa.com and click on Career in Data Warehousing.  Also, if you are looking for a fresh way to market your data warehousing skills, consider The Data Warehouse Portfolio.


Friday, September 14, 2012

Bridge Tables

Each of the examples that we've used thus far in this blog has involved situations in which each fact row was associated with only one dimension row, per dimension table.  In other words, there has always been a many-to-one relationship between the fact table and its associated dimensions.  In the real world, this will not always be the case.

Suppose that a fact table contains employee salary data and each employee can be associated with multiple departments at once.  Conceptually, the design would be something like this...



One possible way of resolving this is to use what some people call a bridge table.  This is essentially a crosswalk between the fact and dimension designed to resolve the many-to-many problem.  The design for such a table is below.
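Since the original diagram is not reproduced here, the following is a rough sketch of what such a design might look like; the column names are illustrative rather than taken from the diagram.

-- The bridge sits between the salary fact and the department dimension,
-- carrying one row per employee/department combination.
create table employee_dept_bridge (
  key_employee    number not null,   -- matches fact_salary.key_employee
  key_department  number not null,   -- matches dim_department.key_department
  constraint pk_employee_dept_bridge primary key (key_employee, key_department)
);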



The EMPLOYEE_DEPT_BRIDGE table will not contain any data that will be seen in query results.  It will only be used as a link between the FACT_SALARY and the DIM_DEPARTMENT tables.  However, there is a rule that must be well understood by any and all analysts that select from this table: You must always group by the department in your query results if your select statement joins to the DIM_DEPARTMENT table (via the bridge, of course) in any way.  For example, suppose that John Smith (KEY_EMPLOYEE value is 123) has $1,000 in wages and is associated with both accounting and human resources (part time in each, perhaps).  Consider this SQL statement...

select a.key_employee,
sum(a.wage_amount)
from fact_salary a,
employee_dept_bridge b,
dim_department c
where a.key_employee = b.key_employee
and b.key_department = c.key_department
and a.key_employee = 123
group by a.key_employee

Because the EMPLOYEE_DEPT_BRIDGE table will contain two rows for John Smith (one for accounting and one for human resources), these results will show a salary of $2,000 which will appear incorrect to the user. 

KEY_EMPLOYEE    SUM(A.WAGE_AMOUNT)
123             2,000


Now, consider this SQL statement...

select a.key_employee,
c.department_name,
sum(a.wage_amount)
from fact_salary a,
employee_dept_bridge b,
dim_department c
where a.key_employee = b.key_employee
and b.key_department = c.key_department
and a.key_employee = 123
group by a.key_employee,
c.department_name

The results will show his $1,000 salary associated with each department.  This will appear to be more correct to the user, assuming they understand that he is associated with both departments. 

KEY_EMPLOYEE    DEPARTMENT_NAME     SUM(A.WAGE_AMOUNT)
123             ACCOUNTING          1,000
123             HUMAN RESOURCES     1,000


In this case, educating your analysts and report writers to use this table correctly is a key component to the success of this design.

Friday, September 7, 2012

ETL - Practice Loading A Fact - Solution

In a prior post we looked at the process by which a fact table is loaded.  I need to apologize, since the ERD of the FACT_TRANSACTION table that I posted in that prior post was incorrect.  It has since been updated.  The four things that an ETL job needs to accomplish are...

1.) Select all necessary data from the source
 a. Measures
 b. Natural Keys of the dimension objects
 c. Degenerate Dimension objects
2.) Transform the data as needed
 a. Aggregate measures and group by the appropriate objects
 b. Add foreign keys to dimension tables
3.) Delete any existing rows from the fact table that will be replaced by the new data
4.) Load the fact table

Consider this PL/SQL procedure, which will accomplish these things.

Extract

The STAGE_TRANSACTION_DATA table pulls all of the necessary elements from the source system into the staging area.  I am treating one schema as both the staging area and the presentation area.  This is probably unrealistic in a production environment but it keeps things simple for these purposes.

Transform

1.) The STAGE_TRANS_DATA_PROD_KEY table is used to grab the KEY_PRODUCT from the DIM_PRODUCT dimension.  Notice that this is done by joining the product's natural key, which is the PRODUCT_NO (product number), to the DIM_PRODUCT table.  An outer join is used so that if a record comes into the staging area for a product that does not exist in the DIM_PRODUCT table, that record will not be excluded from the fact table.  It will exist with a value of -1 and will need to be addressed after the ETL has run.  A sketch of this kind of lookup appears below, after item 2.

2.) The STAGE_TRANS_DATA_DATE_KEY table is used to transform the TRANSACTION_DATE into a key that will join to a date dimension.  I created the date dimension on my own, so I'll let you try and tackle that one.  Those dimensions typically have no source and are created as needed by the ETL developer.  Again, an outer join is used so that if a date does not exist in that dimension a value of -1 is placed into the fact table.
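The key lookups described in items 1 and 2 might look something like this sketch (the staging table layouts here are hypothetical, not the exact ones from the linked procedure):

-- Outer join to DIM_PRODUCT on the natural key; unmatched products fall back to -1.
insert into stage_trans_data_prod_key
select nvl(p.key_product, -1) as key_product,
       s.transaction_date,
       s.transaction_no,
       s.sale_amount
from   stage_transaction_data s,
       dim_product p
where  s.product_no = p.product_no (+);

-- The same pattern applied to the date dimension.
insert into stage_trans_data_date_key
select k.key_product,
       nvl(d.key_date, -1) as key_date,
       k.transaction_no,
       k.sale_amount
from   stage_trans_data_prod_key k,
       dim_date d
where  k.transaction_date = d.calendar_date (+);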

Load

1.) This fact table, because it is so small, is truncated and then repopulated each time that the ETL runs.  Date logic can be written so that only a certain time period will be refreshed with each run, but for simplicity's sake I did it this way.

2.) Finally, the data is loaded into the FACT_TRANSACTION table.

Prior posts have talked about turning data into valuable data.  This post shows you how to accomplish that in a hands-on way.

Remember to take a look at www.brianciampa.com (specifically, click on Career In Data Warehousing and then Grow) for larger datasets that can be used for this kind of practice. 
