Valuable Data: Big Data


Friday, April 25, 2014

Data Helping To Fight Crime

In her role as the Attorney General of New Jersey, Anne Milgram found the crime fighting arena to be horribly inefficient. For example, she recalls a time when an individual was arrested and held on bail of $3,500 that he was unable to pay. As a result, he stayed in jail until his case was heard eight months later, which cost the public over $9,000. Believing that data could help, Ms. Milgram developed a tool to help correct this, making crime fighting a data-driven activity. This is a great example of somebody who is taking data and turning it into something that is truly valuable. She describes this experience in her TED talk.

Friday, June 29, 2012

Big Data Defined

A typical question asked of business intelligence professionals is: what is big data? In a previous post I provided some explanation of the big data challenge. As I've learned more about this very interesting piece of business intelligence, I've come to believe that a more complete explanation is available. Consider this...

Suppose you own a day care, meaning that you are in the business of caring for children each day. You have determined that in order for you to have a truly challenging day, three things have to occur. First, you must have a large number of kids to care for on that given day. How large is large? There are no hard and fast rules...large enough that your current resources are strained at best and inadequate at worst. Second, let's assume that you have no help, meaning that you need to know what each child is doing right now. Not knowing what one of them was doing for an hour and then finding out later that he was painting the refrigerator is not acceptable. Third, you have a very diverse group of kids. This does not refer to diversity of race or ethnicity but to diversity of personality. Some of the kids love to play outside and some inside. Some are into puzzles and others are into riding bicycles. As a result, keeping the kids engaged in activities that they enjoy and in which they are gifted can be challenging.

If two of the three challenges exist, the day is still challenging but not to the same extent. You can know what each child is doing in a diverse group if you have a small number of them. You can deal with a large number of kids if they are ALL sitting in the same room, doing the same thing. It's the combination of the three problems that presents a challenge. Now, consider this in terms of big data. Big data is characterized as having three V's.

1.) Volume - In order for data to be considered big data it must be large. How large is large? I'm not so sure that there is a hard and fast boundary between "normal" data and big data in terms of size. However, I would assume that if managing the data for the purposes of business intelligence presents problems because of its size, then this V applies.

2.) Velocity - Part of the big data problem involves dealing with the speed at which the data comes in. Decision makers want to know what is happening in their business and in the marketplace now, as opposed to experiencing a lag. For example, if an announcement was just made about a new line of business, what are people saying about it on Twitter now? If there is a need to stream data in some way so that it can be analyzed in real time, then this V applies.

3.) Variety - Part of the big data problem involves dealing with various types of data. Relational databases are good at storing structured data. Structured data is data in which each element is placed into a fixed area (such as columns and rows in a database or an XML schema) that was created for that specific element's characteristics. Dates belong here, integers belong there, etc. Unstructured data, such as a tweet or the body of an email message, is more free-form. In other words, there is nothing governing the type of data that is stored in those environments. If decision makers want to analyze various types of data (such as both structured and unstructured data) then this V applies.

While challenges may exist when only one or two of the above-mentioned V's apply, the industry generally agrees that if all three characteristics apply to your data, then you have big data.
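To make the Variety point a bit more concrete, here is a minimal Python sketch of the difference between structured and unstructured data. Everything in it is made up for illustration: the `Sale` record, its fields, the sample row, and the sample tweet are assumptions, not data from a real system.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical structured record: every field has a fixed place and type,
# much like the columns of a relational table.
@dataclass
class Sale:
    sale_date: date
    store_id: int
    amount: float


def parse_sale(row: str) -> Sale:
    """Parse a comma-separated row; raises ValueError if a value is in the wrong place."""
    sale_date, store_id, amount = row.split(",")
    return Sale(date.fromisoformat(sale_date), int(store_id), float(amount))


# Structured: the layout tells us exactly where the date and the integer live.
sale = parse_sale("2012-06-29,42,19.99")

# Unstructured: a made-up tweet. Nothing governs what the text contains,
# so even a simple question ("is the new line mentioned?") needs text processing.
tweet = "Just heard about the new product line... not sure how I feel about it yet"
mentions_product = "product line" in tweet.lower()

print(sale)
print(mentions_product)
```

The structured row can be rejected the moment a date shows up where an integer belongs, while the tweet can hold anything at all. That gap is what the Variety V is getting at.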

Also, remember to visit http://www.brianciampa.com/ for more information on data warehousing and business intelligence.


Friday, May 25, 2012

Hadoop MapReduce

As mentioned before, one of the challenges associated with turning data into something that is truly valuable is the challenge of analyzing Big Data. One of the tools being used by the business intelligence community to meet this challenge is a software framework called Hadoop. Hadoop takes advantage of the MapReduce concept to quickly process large amounts of data. To explain how this works, consider this example.

If mom goes to the grocery store alone, getting the groceries on her list will take a certain amount of time. If mom, dad, and the two kids all go and each takes responsibility for getting 25% of the items, the trip will take less time. However, this requires that the items in the four separate grocery carts be consolidated into one before being presented at the cash register.

In the same way, Hadoop places data onto different servers in a cluster. When Hadoop is asked to execute a job, the servers each work on a piece of that job in parallel. In other words, the job is divided and is mapped to the various servers. When complete, each server has one piece of the solution. Those pieces are then combined (the reduce step) so that the many parts of the solution are reduced to the one complete solution. This video does a great job of explaining this concept.
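For readers who like to see the idea in code, here is a small Python sketch of the map and reduce steps described above. It is not Hadoop's API. It is a simplified stand-in that counts words, with threads playing the role of servers. The function names `map_step` and `reduce_step` and the grocery-themed data are made up for illustration, and a real MapReduce job also shuffles key/value pairs between the two steps, which this sketch leaves out.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_step(chunk):
    """Work done by one 'server': count the words in its own piece of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_step(partial_counts):
    """Combine the per-server results into one complete answer."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Made-up data, already split across three hypothetical servers.
chunks = [
    ["milk eggs bread", "eggs butter"],
    ["bread milk"],
    ["butter eggs milk"],
]

# Each chunk is processed independently; threads stand in for cluster nodes.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    partial_counts = list(pool.map(map_step, chunks))

# The reduce step is the single cart at the cash register.
word_counts = reduce_step(partial_counts)
print(word_counts)
```

Each call to `map_step` only ever sees its own chunk, just as each family member only handles part of the list. Nothing has the full picture until `reduce_step` merges the partial counts.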


Friday, May 4, 2012

What Does Big Data Mean For Us?

In the previous post we looked at the problem of Big Data and the industry's move toward more efficient relational database products (or even non-relational database products) that may remove the need for a data warehouse. Some may be inclined to wonder what this means for those who have skills in data modeling and ETL development. In order to answer this question I'm going to point to another blog that dealt with this question recently. It is the Star Schema Central blog (on my blog list on the right side of the screen) which is maintained by Chris Adamson. The link to that particular post is here. He does a great job of explaining why dimensional modeling (the logical design of a data warehouse) will still be important in the world of Big Data. This is an area that is still being defined, so the possibilities are still being explored.

While so much is still up in the air regarding the ways in which data will be modeled or ETL'd in the world of Big Data, consider that as you move forward in your career your overall mission is not to model data a certain way or even to write ETL. Those are just a few techniques that can be used today and those techniques may become outdated. The techniques may change over time but the main goal is to provide meaningful data to decision makers quickly and efficiently. By focusing on this, no matter what the technique or technology of the day is, you will be turning your data into something of true value.


Friday, April 27, 2012

Big Data

One item that comes up often in discussions regarding Business Intelligence is Big Data. Big Data refers to a data set that is so large that using traditional relational database technology to interact with it is very difficult and perhaps impractical. So, let's say that a star schema is designed in hopes of helping to increase business intelligence in some area and an ETL job is written to populate that star. If that ETL job takes days to run due to the size and/or complexity of the data (and not due to inefficient code within the ETL job, which can be corrected by rewriting it) then that data set may be referred to as Big Data...and it can be quite frustrating.

There are new technologies being developed that help to deal with this "thorn in the side" of data warehousing professionals. Some of these technologies allow an analyst to view data in a source system in the same way that she would in a data warehouse without needing a data warehouse. One of these is an SAP product called HANA. Remember from this blog's first post that the point of a data warehouse is to create a new environment for data (apart from the source system), in which it is restructured so that it is optimal for analysis. A product like HANA, which keeps its data in memory, can process records much faster than traditional disk-based relational database products.

So, revisiting the piggy bank example from this blog's first post, suppose that Rain Man (Dustin Hoffman's character in the movie of the same name) peered inside the piggy bank. Someone with his talent would be able to quickly determine the amount of money inside the piggy bank without the need to place the coins into money rolls. In this case, the same business intelligence can be gained even though the coins remain in the source system (the piggy bank). So, what does this mean for skills in dimensional modeling and ETL development? I would imagine that there is some debate on that. I'll leave that for a future post.

I'll say parenthetically that I'm not taking a position for or against HANA. I'm simply mentioning it as an example of a piece of technology that is designed to deal with Big Data.
