Product Customer Satisfaction Analysis
Supervised Learning: Multiple Regression
Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables.
We analyze the 1980s HBAT dataset, using x6 through x18 as our explanatory variables and x19, the customer satisfaction index, as the response variable: 13 independent variables out of the 23 total. Our objective is to build and test the adequacy of a model designed to predict Customer Satisfaction, x19.
The dataset was originally collected from the purchasing managers of HBAT Industries' customers. (dataset source: Kaggle)
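As a minimal setup sketch (the file name HBAT.csv and the use of read.csv() are assumptions; adjust to however the Kaggle download is stored):

HBAT <- read.csv("HBAT.csv")   # placeholder file name for the Kaggle download
str(HBAT)                      # confirm x6 through x19 are present and numeric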
First Run
Our fitted linear model, fit, is:
fit <- lm(formula = x19 ~ x18 + x17 + x16 + x15 + x14 + x13 + x12 + x11 + x10 + x9 + x8 + x7 + x6, data = HBAT)
Performing a summary of fit, we get our first-order model:
y = -1.336 + 0.377*x6 - 0.456*x7 + 0.035*x8 + 0.154*x9 + ... - 0.249*x18 + ε
We assume the errors ε have a mean of 0 and constant variance, are independent of one another, and are normally distributed.
Right off the bat, we get R-squared = 0.8039 and adjusted R-squared = 0.7742. This is considered good, since an R-squared larger than 0.5 already indicates the model explains most of the variation in the response.
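For reference, these values can be read straight off the summary object; a minimal sketch using the standard lm summary components:

s <- summary(fit)   # full regression summary
s$r.squared         # 0.8039
s$adj.r.squared     # 0.7742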
From the summary of fit, we can also see which variables have the lowest p-values:
x12 (Salesforce Image), x6 (Product Quality), and x7 (E-Commerce Activities / Web Site) are good and useful explanatory variables (highlighted green in the summary) for x19 (Customer Satisfaction).
x15 (New Products) and x17 (Price Flexibility), on the other hand, are very poor indicators, since they have high p-values (highlighted red).
x10 (Advertising) and x8 (Tech Support) are middling indicators. For the variables with high p-values (x15, x17), we can choose to omit them from the model.
Using the coef() function in R, we can also confirm x12 as a very good predictor of Customer Satisfaction, since its coefficient is a positive 0.8274, while x15 has a very small, negative coefficient of -0.0029, highlighting the fact that it is a poor predictor of Customer Satisfaction, x19.
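A quick sketch of that check:

coef(fit)          # full vector of estimated coefficients
coef(fit)["x12"]   #  0.8274 (Salesforce Image, strong positive)
coef(fit)["x15"]   # -0.0029 (New Products, near zero and negative)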
The null and alternative hypotheses state:
H0: β1 = β2 = β3 = ... = βn = 0
Ha: at least one parameter does not equal zero
Test statistic: F = 27.11
p-value: p < 2.2e-16 (at α = 0.01)
R-squared = 0.8039
Adjusted R-squared = 0.7742
The extremely small p-value indicates that the model is statistically adequate for predicting Customer Satisfaction, x19.
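For completeness, a sketch of pulling the F statistic and its p-value from the same summary object (these are the standard lm summary fields):

fs <- summary(fit)$fstatistic                                  # value, numdf, dendf
fs["value"]                                                    # 27.11
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # < 2.2e-16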
Second Run with Stepwise
To start off the second run, we use the qqPlot() function from the car package to identify outliers, then remove the flagged data entries (outliers): 99, 25, 70, and 10.
The HBAT table is now renamed newHBAT3 and the fitted linear model is now called fit_o4. A summary is performed and shown below. The previous R-squared = 0.8039 becomes a new R-squared = 0.8654, while the previous adjusted R-squared = 0.7742 becomes a new adjusted R-squared = 0.8436.
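A sketch of this step, assuming the flagged entries 99, 25, 70, and 10 are row positions in HBAT:

library(car)
qqPlot(fit)                              # flags the most extreme studentized residuals
newHBAT3 <- HBAT[-c(99, 25, 70, 10), ]   # drop the flagged observations
fit_o4 <- lm(x19 ~ x18 + x17 + x16 + x15 + x14 + x13 + x12 +
               x11 + x10 + x9 + x8 + x7 + x6, data = newHBAT3)
summary(fit_o4)                          # R-squared = 0.8654, adjusted = 0.8436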
The test statistic went from F = 27.11 to a new F = 39.57. One can say that the removal of data entries 99, 25, 70, and 10 did more good than harm to the model.
A non-normality diagnostic and an influential-observations check were also performed, as seen below. From this, one can say that the distribution of studentized residuals of the newly improved dataset newHBAT3 follows a normal distribution curve very well.
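A sketch of one common way to draw this diagnostic with base R (the plot title is illustrative):

sresid <- rstudent(fit_o4)   # studentized residuals of the refit model
hist(sresid, freq = FALSE, main = "Distribution of Studentized Residuals")
curve(dnorm(x, mean = mean(sresid), sd = sd(sresid)), add = TRUE)   # normal overlay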
newHBAT3 also passes the global test of model assumptions.
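This kind of omnibus check is typically done with the gvlma package; assuming that is the test behind the result above, the call is:

library(gvlma)           # Global Validation of Linear Model Assumptions
summary(gvlma(fit_o4))   # one line per assumption, each marked acceptable or not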
From the influential-observations plot of newHBAT3, one can tell that the new outliers are 83, 93, 33, and 32. The circles are proportional to Cook's distance.
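A plot with circles scaled by Cook's distance matches car's influencePlot(); a plausible sketch:

library(car)
influencePlot(fit_o4, main = "Influence Plot",
              sub = "Circle size is proportional to Cook's distance")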
The residuals are further broken down by independent (explanatory) variable, as seen below, and variables x12, x11, and x6 have the best-behaved residuals. These 3 variables also have the highest coefficients, as seen in the regression summary (x12 = 0.8423, x11 = 0.3233, x6 = 0.4239).
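A per-predictor residual breakdown like this is what car's residualPlots() produces; a minimal sketch:

library(car)
residualPlots(fit_o4)   # residuals plotted against each explanatory variable in turn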
A non-independence-of-errors check (using the Durbin-Watson test) and a multicollinearity check (using the vif() function) are also performed.
For the Durbin-Watson test, the rule of thumb is that a statistic between 1.5 and 2.5 indicates no worrying autocorrelation in the residuals. Ours is 2.2438, so the errors can be treated as relatively independent.
From the vif() function, one sees that there are a couple of variables with high multicollinearity, such as x18, x17, and x11.
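Both diagnostics are one-liners from the car package:

library(car)
durbinWatsonTest(fit_o4)   # D-W statistic = 2.2438, inside the 1.5-2.5 comfort zone
vif(fit_o4)                # variance inflation factors; x18, x17, and x11 stand out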
Stepwise: the multicollinearity diagnostics help guide the stepwise step. One performs a forward-and-backward stepwise regression and summarizes it. Only variables x17, x16, x12, x11, x7, and x6 are kept in the fitted linear model to predict Customer Satisfaction, x19. A QQ plot is also performed.
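A sketch of the selection using base R's step(), where direction = "both" combines forward and backward moves (the object name fit_step is illustrative):

fit_step <- step(fit_o4, direction = "both")   # stepwise selection by AIC
summary(fit_step)                              # keeps x17, x16, x12, x11, x7, x6
qqPlot(fit_step)                               # QQ plot of the reduced model (car)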
The null and alternative hypotheses state:
H0: β1 = β2 = β3 = ... = βn = 0
Ha: at least one parameter does not equal zero
The larger values of R-squared = 0.86, adjusted R-squared = 0.85, and F-statistic = 90.26 for the dataset with dropped outliers and the stepwise-selected model, together with the same extremely small p-value, indicate that the model is statistically adequate for predicting Customer Satisfaction, x19.