How to Solve Problems with Simple Predictive Analytics
Predictive analytics applies to a wide range of business problems, and more people are beginning to recognize its value. Businesses and nonprofits are using predictive analytics to answer real business questions like “What segment of potential donors will respond best to our message?” and “Why am I losing customers, and how can I stop them from leaving?”
Even though predictive analytics holds so much value for businesses and nonprofits, the general problem with implementing it is that the knowledge of how to do so is not readily available. Many people struggle when trying to make sense of good analysis practices, choosing appropriate predictive models for a given situation, and understanding the underlying statistics. To fill this knowledge gap and provide an easy way to learn and take advantage of predictive analytics, Vault Analytics will be releasing a new book on August 2.
It contains detailed chapters describing how to do good analysis, how to choose an appropriate predictive model for your situation, and how to make sure the statistics powering the model are set up correctly. All of this is done and explained in the familiar environment of Excel 2007, so that it can benefit those who may not have access to more advanced predictive analytics packages such as SAS and SPSS.
If you’d like to download the first few chapters for free, or pre-order the book, you can do so here.
Otherwise, below I’ve copied a section from the book that I think is extremely valuable for anyone new to data analysis. It describes two of the most important fundamentals: seeing the data in context, and segmentation.
Seeing the Data in Context
Understanding what the data are telling you within the context of the business situation being analyzed is extremely important. It helps you avoid drawing faulty conclusions and keeps your analysis appropriate for the business question being answered. The best way to learn this fundamental is to see it in action, so we will work through an example.
We will look at a type of direct mail campaign analysis. We want to know how many calls are expected to come into our call center after we execute the campaign. First, we take some historical data showing us the percentage of total calls coming in according to the number of days after starting a mail campaign, shown below.
After creating a scatter plot of the data, we try to fit a logarithmic regression line as a model, shown below.
Even though the R² value tells us that the fit is good, the model may not be the best way to explain this data when the context and purpose of this analysis are considered. We want the model to be able to predict what percentage of total calls will come in from a mailing campaign so we can staff the call center. If I were to use the line above as the model, I would be predicting values that are too low for incoming calls between about days 20 and 100, and values that are too high thereafter. Because of this error, we would not be staffing the call center correctly.
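For readers working outside Excel, here is a minimal Python sketch of the same check: fit a logarithmic trend, then look at where it runs low or high instead of relying on the R² value alone. The data is made up (a smooth saturating curve standing in for the cumulative percentage of calls), and the function names `fit_log_trend` and `mean_residual` are my own, so treat it as an illustration and not as the book's spreadsheet.

```python
import math


def fit_log_trend(days, pct):
    """Least-squares fit of pct = a + b * ln(day). Returns (a, b, r_squared)."""
    xs = [math.log(d) for d in days]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, pct))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, pct))
    ss_tot = sum((y - mean_y) ** 2 for y in pct)
    return a, b, 1 - ss_res / ss_tot


def mean_residual(days, pct, a, b, first_day, last_day):
    """Average of (actual - predicted) over the days in [first_day, last_day]."""
    residuals = [
        y - (a + b * math.log(d))
        for d, y in zip(days, pct)
        if first_day <= d <= last_day
    ]
    return sum(residuals) / len(residuals)


# Made-up data: cumulative % of total calls received by each day (assumption).
days = list(range(1, 151))
pct = [100 * (1 - math.exp(-d / 20)) for d in days]

a, b, r2 = fit_log_trend(days, pct)
print(f"R^2 = {r2:.3f}")

# A positive mean residual means the model predicts too low in that range,
# a negative one means it predicts too high.
for first_day, last_day in [(1, 19), (20, 100), (101, 150)]:
    bias = mean_residual(days, pct, a, b, first_day, last_day)
    print(f"days {first_day}-{last_day}: mean residual {bias:+.2f}")
```

Splitting the residuals by day range is what exposes the staffing problem: a single R² figure averages over exactly the ranges where the model is consistently off.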
To create a better model, I would consider the fact that, in this context, it is not necessary to fit a trend model to the entire data set. Consider the following model, which can be used to predict the percentage of total calls coming in between days 4 and 35 after the mailing campaign:
You will notice that this trend model does not contain the same high and low errors as the previous model did. Further, upon doing some calculations on the data in the spreadsheet, we know that anything before day 4 accounts for just 8% of all calls, and anything after day 35 accounts for just 15% of all calls. I have highlighted with a model the time period of the biggest growth in the call percentage, while summarizing the remaining percentages on either side. This gives just the right amount of information needed to staff the call center, while minimizing the errors I would have made trying to fit a single trend model to the data.
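The same idea can be sketched in Python: model only the window that matters and summarize everything outside it as two percentages. I am assuming here that the data is a cumulative percentage of calls by day and that the windowed trend is a straight line; the text above does not specify either. The numbers are invented so that the two tails echo the 8% and 15% quoted above, and `fit_line` and `summarize_campaign` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def summarize_campaign(cum_pct, start_day, end_day):
    """Fit a trend to [start_day, end_day] and summarize the tails.

    cum_pct maps day -> cumulative % of total calls received by that day.
    """
    window_days = [d for d in sorted(cum_pct) if start_day <= d <= end_day]
    intercept, slope = fit_line(window_days, [cum_pct[d] for d in window_days])
    return {
        "intercept": intercept,
        "slope": slope,
        # Everything that arrived before the window opens.
        "share_before": cum_pct.get(start_day - 1, 0.0),
        # Everything that arrives after the window closes.
        "share_after": cum_pct[max(cum_pct)] - cum_pct[end_day],
    }


# Made-up cumulative percentages (assumption, not the book's data).
cum_pct = {1: 2.0, 2: 5.0, 3: 8.0}
cum_pct.update({d: 8 + (d - 3) * 77 / 32 for d in range(4, 36)})
cum_pct.update({d: 85 + (d - 35) * 15 / 25 for d in range(36, 61)})

summary = summarize_campaign(cum_pct, start_day=4, end_day=35)
print(summary)

# Predicted cumulative % of calls by day 20, using the windowed trend only.
print(summary["intercept"] + summary["slope"] * 20)
```

The trend is only meant to be used for days inside the window; outside it, the two tail percentages are the whole model.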
The point here is to look at the data in the context of the purpose of the analysis. What are you going to use the predictive model for? Is it necessary to fit a model to the entire data set? How exact do you need to be with the prediction? What is the most important part of the data set to model? These and other questions are important to consider when performing analysis.
Segmentation
The second fundamental of analysis is the practice of segmenting the data. As with seeing the data in context, this is best described with an example. Consider the analysis presented below, which shows a linear regression model to predict how much someone will likely donate to your cause according to their age.
The fit of the model is extremely weak, and there seems to be no relationship between donation and age. However, this data was taken and aggregated from two different cities, Boston and New York. If we separate out the data according to those two cities (otherwise known as segmenting by them), we get the following when we run a regression analysis:
By segmenting the data first, we notice that there is, in fact, a relationship between donation and age, but that relationship differs depending on which city the donor is in.
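Here is a small Python sketch of that comparison: one regression on the pooled data, then one regression per city. The records are invented, and I have assumed the two cities trend in opposite directions because that is the simplest way to make a pooled fit look flat; the example above only says that the relationships differ. The names `fit_line` and `fit_by_segment` are my own.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot


def fit_by_segment(records):
    """records: iterable of (segment, x, y). Returns {segment: fit_line result}."""
    groups = defaultdict(lambda: ([], []))
    for segment, x, y in records:
        groups[segment][0].append(x)
        groups[segment][1].append(y)
    return {segment: fit_line(xs, ys) for segment, (xs, ys) in groups.items()}


# Made-up donor records: (city, age, donation). Opposite trends are an assumption.
records = []
for i, age in enumerate(range(20, 75, 5)):
    wiggle = 5 if i % 2 == 0 else -5  # small deterministic scatter
    records.append(("Boston", age, 10 + 2 * age + wiggle))
    records.append(("New York", age, 190 - 2 * age + wiggle))

# Pooled fit: ignore the city entirely.
pooled = fit_line([age for _, age, _ in records], [amt for _, _, amt in records])
print(f"pooled: slope {pooled[1]:+.2f}, R^2 {pooled[2]:.2f}")

# Segmented fit: one regression per city.
by_city = fit_by_segment(records)
for city, (intercept, slope, r2) in by_city.items():
    print(f"{city}: slope {slope:+.2f}, R^2 {r2:.2f}")
```

When a pooled fit looks like noise, running the same fit per segment is a cheap way to find out whether two real relationships are cancelling each other out.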