Business intelligence (BI), big data, dashboards, reports, pie charts... These are only some of the fancy buzzwords (ok, maybe not the 'reports' one...) we hear every day. Software vendors show us nice graphs with gauges and needles which exist for the sole purpose of helping us make decisions. This is all well and good, but we currently use these terms interchangeably, mix up concepts and think that if we simply plug a dashboard into a corporate database, we will find the answer to any question we have. And no, the answer is not 42 (at least not for any other reason than pure coincidence...).
These days, data is everywhere. It is cheap to store and is getting increasingly cheap to analyze in order to find hidden patterns or facts that can help us make better decisions. For example, blood sugar level testing instruments do not just give a reading of one's current blood sugar. They map it onto a graph over time and can provide information regarding the trend of one's blood sugar level and make recommendations regarding food choices, etc. Coupled with a cheap smartphone app, you could probably receive automated text messages based on your location telling you something like "Hi! This is bloody here, your blood sugar level reader. Are you sure you want to eat that crap again? You know how my graph will go in the red at the next reading. Why don't you go for something healthier? By the way your health insurance agrees with me and would like to jack up your premium if you decide to ignore my friendly advice". Sounds too imaginative? Perhaps. But the point is, from a technological standpoint, there is nothing preventing this from happening. In fact, we already have dongles for car insurance companies that measure one's driving style to give out discounts to good drivers (or rather, to increase good drivers' premiums less than the bad ones'...).
But let's get back on topic here. Although at a different scale, the examples I gave above still apply. In an organization, we have access to a myriad of data sources, from internal systems (e.g., a CRM, an ERP, etc.) to external systems (e.g., partners, social networking sites, etc.). So data is cheap. But does that mean that we necessarily want more? Should we hoard data in the hope of (maybe) finding the answer to our questions?
There is no right answer to this question. However, there are some questions that we can ask ourselves to help make up our minds on this.
What type of decision are we trying to make?
Are we concerned about our strategy or our operations? This is a crucial question because both the structure of the data and its value, which depends on its "expiration date", are very much affected by the answer. A business unit may need very current data to make a decision on the spot, every day, at the same time, or whenever a given event occurs. On the other hand, a manager may only need the data on a monthly or even quarterly basis. What this entails is that the level of detail and summarization of the data available to support these two types of decisions varies greatly. Now this does not mean that strategic decision-makers do not rely on the same data as operational decision-makers. But the granularity of this data will likely be very different. In essence, it goes back to the eternal question of tailoring the message to the audience we are trying to reach. In my experience, operations need the data as soon as possible because it influences their every move. Management, on the other hand, may say they need the data as soon as possible when it is actually only needed at regular intervals.
How much data do we need?
Just like the question of how soon the data should be available, when you give someone a list and ask which pieces of data they want, they will often pick them all. I have personally lived through this many times, and it can be frustrating when you just know that the information is irrelevant but the user still wants it. It is important that everybody understands that there are costs associated with data. It may be screen real estate for presentation purposes (fields take space after all), or extra computing resources and time for calculations. It may also help to use mock scenarios so that users can describe the actual pieces of data they need to make a decision. For example, if an extra piece of data is needed in 1% of cases but doubles the processing time, it may be unnecessary. Again, the objective here is to determine the exact value of a piece of data. This means knowing its tangible benefits and costs.
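To put rough numbers on the 1% example, here is a minimal back-of-the-envelope sketch in Python. The timings are made up for illustration, and it assumes the extra piece of data could instead be fetched only for the cases that actually need it, which may or may not be possible in a given system.

```python
def expected_time(base_seconds, extra_seconds, share_needing_extra, always_include):
    """Average processing time per decision, in seconds."""
    if always_include:
        # Everyone pays for the extra field, whether they use it or not.
        return base_seconds + extra_seconds
    # Only the cases that need the extra field pay for it.
    return base_seconds + share_needing_extra * extra_seconds


# Hypothetical figures: the extra field doubles a 2-second computation
# and is needed in 1% of cases.
base = 2.0
extra = 2.0
share = 0.01

always = expected_time(base, extra, share, always_include=True)
on_demand = expected_time(base, extra, share, always_include=False)

print(f"always included:   {always:.2f}s per decision")
print(f"fetched on demand: {on_demand:.2f}s per decision")
```

Under these assumptions, always including the field costs about twice as much per decision as fetching it on demand. Whether that is worth it depends on what the 1% of cases are worth to the business.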
Another important point related to this question is whether we need to keep or store all the data we can. Again, it is important to understand the costs and benefits associated with these strategies. There may also be legal requirements to follow. For example, one may be forced by law to keep data for 5 years. But beyond that, is the data still relevant for the business? Do we need to integrate data from 5 different social networking sites or do 2 of them give us enough confidence to make certain decisions? One interesting principle in the realm of big data is the idea that we cannot know everything. However, we can take a sample from everything and determine a level of confidence within which we can make a decision based on this sample. I think it is a fair way to look at data. Maybe out of 10TB of raw data a random sample of 2TB is sufficient to gain 95% confidence. If using 9TB gives us 98%, does it justify the extra costs associated with storing and processing this data? Perhaps, perhaps not. But the question is worth asking.
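The diminishing returns of larger samples can be illustrated with a small Python sketch. It frames the question a little differently from the terabyte example: the confidence level is fixed at 95% and we look at how the margin of error of an estimated proportion shrinks as the number of sampled records grows. It assumes a simple random sample from a very large population and uses the textbook normal approximation. The function name and the sample sizes are only illustrative.

```python
import math
from statistics import NormalDist


def margin_of_error(sample_size, confidence=0.95, proportion=0.5):
    """Approximate margin of error for a proportion estimated from a
    simple random sample (normal approximation, large population).

    proportion=0.5 is the worst case, giving the widest margin.
    """
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records: +/- {margin_of_error(n) * 100:.2f} percentage points")
```

Because the margin shrinks with the square root of the sample size, a sample ten times larger is only about three times more precise. That is the kind of trade-off to weigh against the extra storage and processing costs.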
How do we need to present this data?
Presentation of the relevant data is essential. But we often get caught up with the fancy visual gizmos which look nice but may actually be relatively poor conveyors of information or be effectively unusable on newer platforms such as mobile devices. Also, some people are better with numbers, which they can then manipulate in a spreadsheet for example, while others are better with graphics. There is no universal answer to this question, but it needs to be carefully considered. Here, the goal is to avoid the two extremes where one goes "I want to drill down to the details" or "I want an overview, this is too much detail". There are countless books and techniques to help here, but the works of Edward R. Tufte, regardless of their age, are still highly regarded by many (and with good reason!).
Are we going to even use this data?
This question is rarely asked, but it is essential. Do we want data to justify our choices or to help us make them? If we make up our mind and then pick the data that fits our decision, the data is worthless. Numbers are easy to manipulate; just look at polls and politics. In fact, using hard evidence as a guide for decision-making is not a trivial exercise. One needs to resist the urge to think "I know how this plays out, no need for the data but I will include it because it backs up my argument, this way my a$$ is covered".
But can we trust this data?
The old adage "garbage in, garbage out" comes to mind as I write this. I have the good fortune to be teaching Master's students a class on BI technologies. In this class, we start from a transactional database for an organization and slowly move toward building a data warehouse which we feed via an ETL (Extract-Transform-Load) process before trying out multidimensional analysis using OLAP cubes and data mining. One thing that stands out for students who are not used to all this is how complex this infrastructure is. Not only is it complex, but it requires a very good knowledge of the business, its relevant processes and associated data. Plus, it is layered, so changes in one layer most often have a cascading effect on other layers. One important lesson that comes out of this is the fact that everything we do relies on one key assumption: that the data in the transactional system is reliable. Well of course it should be, people use it every day. Yes, but do they use it the way you believe they do, entering the data where you think it belongs?
A simple example: say you have an assembly with 3 steps, A, B and C. Users are instructed to "punch in" every time they perform a step so that you can measure the productivity of each step. Now, users think this is stupid and decide that it will take less overall time to simply do the 3 steps and then punch in 3 times. The result? A process that may be optimal but whose steps are not accurately represented in the application. The same goes for workarounds for cumbersome or restrictive systems, usage of fields for unintended purposes, etc. In BI we try to go back to the reality of the process using data from systems. If there is a disconnect between the two, we cannot provide the data we want.
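A sanity check for this kind of disconnect can be fairly simple. The Python sketch below uses a made-up punch-in log (the assembly ids, timestamps and the 30-second threshold are all hypothetical) and flags assemblies where every punch happened within seconds of the previous one, which suggests the steps were recorded in a batch after the fact.

```python
from datetime import datetime, timedelta

# Hypothetical punch-in log: (assembly_id, step, timestamp)
punches = [
    ("A-100", "A", datetime(2024, 1, 8, 9, 0, 0)),
    ("A-100", "B", datetime(2024, 1, 8, 9, 12, 0)),
    ("A-100", "C", datetime(2024, 1, 8, 9, 31, 0)),
    ("A-101", "A", datetime(2024, 1, 8, 10, 5, 10)),
    ("A-101", "B", datetime(2024, 1, 8, 10, 5, 14)),
    ("A-101", "C", datetime(2024, 1, 8, 10, 5, 19)),
]


def suspicious_assemblies(punches, min_gap=timedelta(seconds=30)):
    """Return ids of assemblies whose consecutive punches are all closer
    together than min_gap (an assumed, arbitrary threshold)."""
    by_assembly = {}
    for assembly_id, _step, timestamp in punches:
        by_assembly.setdefault(assembly_id, []).append(timestamp)

    flagged = []
    for assembly_id, stamps in by_assembly.items():
        stamps.sort()
        # Time elapsed between each punch and the next one.
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if gaps and all(gap < min_gap for gap in gaps):
            flagged.append(assembly_id)
    return flagged


print(suspicious_assemblies(punches))
```

With this sample log, only assembly A-101 is flagged. A check like this does not recover the real step durations, but it does tell you how much of the data you can trust before you build productivity measures on top of it.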
So it all boils down to this: can you trust your data? Does it reflect the reality of the business and not just the reality of the business as seen through the eyes of the system? This is something which is worth checking before embarking on a big initiative to try and use that data for decision-making purposes. This is even more important when using external sources, as you may not be able to judge their quality. Plus, the provider of this data may not be liable for its accuracy or reliability.
To conclude...
It may seem somewhat naive to put it like this, but before we can make decisions with data, we need to make decisions about data. This is a first step which may lead to one's decision to not use data as much as originally thought, simply because one's processes cannot provide the level of data quality and reliability required to make decisions based on numbers alone.