Breast Cancer Analysis

Supervised Learning: Logistic Regression
Logistic regression models the probability of a discrete outcome given one or more input variables. The most common form models a binary outcome: something that can take only two values, such as true/false or yes/no.
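As a minimal sketch of the idea (not the study's actual fit, which was done on the Wisconsin data), the following trains a one-feature binary logistic regression by stochastic gradient descent on made-up toy data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit a binary logistic regression by stochastic gradient descent.
    X: list of feature lists, y: list of 0/1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the linear predictor
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    """Classify as 1 when the modeled probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

# Toy data: the outcome flips from 0 to 1 as the feature crosses ~2
X = [[0.5], [1.0], [1.5], [2.5], [3.0], [3.5]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
```

The fitted model recovers the decision boundary between the two classes, which is all a binary logistic regression is asked to do.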

The data for this model was obtained from the University of Wisconsin-Madison, where it was collected by Dr. William H. Wolberg. The study was conducted between January 1989 and November 1991 across 8 groups, and a total sample of 699 observations was collected.

The goal is to build a model that best predicts whether the cancer is benign or malignant.

In our initial run, using all the variables, only a few variables had p-values less than 0.05. The initial run is not an efficient model, even though its pchisq value is 0. The key variables with poor p-values were Uniformity of Cell Shape and Single Epithelial Cell Size (0.95 and 0.72, respectively).

We also did a final check for multicollinearity, and all the final variables check out with variance inflation factor (VIF) values less than 2.
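The multicollinearity check above can be sketched as follows. This is an illustrative VIF computation on random (therefore independent) columns, not the study's actual predictors; VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        yj = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + other columns
        beta, *_ = np.linalg.lstsq(A, yj, rcond=None)  # OLS of column j on the rest
        resid = yj - A @ beta
        r2 = 1.0 - float(resid @ resid) / float(((yj - yj.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # independent columns, so every VIF should be near 1
vifs = vif(X)
```

Values below 2, as reported for the final model, indicate that no predictor is well explained by the others.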

Hypothesis Testing:
Null Hypothesis, Ho: The model is NOT significant.

Alternate Hypothesis, Ha: The model is significant.


Our pchisq value = 0; hence, the null hypothesis is rejected and our model is significant.
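The pchisq value comes from a likelihood-ratio (deviance) test: twice the gap between the full model's and the null model's log-likelihoods is compared against a chi-square distribution. The sketch below uses made-up log-likelihoods and degrees of freedom (not the study's), with a closed-form chi-square survival function that holds for even degrees of freedom:

```python
import math

def chi2_sf(x, df):
    """Survival function P(Chi2_df > x), closed form valid for even df."""
    assert df % 2 == 0, "closed form used here requires even df"
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def lr_test(ll_null, ll_full, df):
    """Likelihood-ratio test comparing the full model to the null model."""
    deviance = 2.0 * (ll_full - ll_null)
    return deviance, chi2_sf(deviance, df)

# Illustrative log-likelihoods (hypothetical, not from the study)
deviance, p = lr_test(ll_null=-450.0, ll_full=-60.0, df=6)
```

A deviance this large makes the p-value indistinguishable from 0 in floating point, which is how a "pchisq = 0" result arises.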


We also performed the Wald test, which, in the context of logistic regression, is used to determine whether a given predictor variable X is significant.

The p-value from the Wald test is significantly smaller than 0.05, further supporting the conclusion that our model is significant.
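The Wald test for a single coefficient divides the estimate by its standard error and compares the result to a standard normal distribution. The numbers below are illustrative (the standard error is invented, not taken from the study's output):

```python
import math

def wald_test(beta, se):
    """Wald z statistic and two-sided p-value for one coefficient."""
    z = beta / se
    # two-sided normal p-value, computed via the error function
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# Hypothetical coefficient and standard error: |z| ~ 4.8 gives p far below 0.05
z, p = wald_test(beta=-0.5774, se=0.12)
```

A coefficient several standard errors from zero yields a p-value well under 0.05, which is the pattern reported for the final model's predictors.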


Final Model:

E(Y) = exp(1.130e+01 - 5.774e-01*x1 - 4.394e+00*x2 - 1.279e-04*x3 - 4.108e-01*x4 - 2.180e+00*x5) / [1 + exp(1.130e+01 - 5.774e-01*x1 - 4.394e+00*x2 - 1.279e-04*x3 - 4.108e-01*x4 - 2.180e+00*x5)]
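The final model can be evaluated directly by plugging predictor values into the logistic formula. The coefficients below are the ones reported above (the sign between the x2 and x3 terms is unclear in the text; a negative x3 coefficient, matching the other predictors, is assumed here), and the input values are hypothetical:

```python
import math

# Coefficients of the final model as reported above
# (the sign of the x3 term is assumed negative, matching the other terms)
b0, b1, b2, b3, b4, b5 = 1.130e+01, -5.774e-01, -4.394e+00, -1.279e-04, -4.108e-01, -2.180e+00

def predicted_probability(x1, x2, x3, x4, x5):
    """E(Y) = exp(eta) / (1 + exp(eta)) for the fitted linear predictor eta."""
    eta = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical input: every predictor at its lowest scale value of 1
p = predicted_probability(1, 1, 1, 1, 1)
```

The output is a probability between 0 and 1, which is then thresholded (typically at 0.5) to classify an observation into one of the two diagnosis classes.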

We calculated the 95% confidence intervals and obtained the exponentiated coefficients (and their 95% CIs) for our final model.

We also produced an ANOVA table with a chi-squared test, and obtained a McFadden pseudo-R² of 0.8736044.
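McFadden's pseudo-R² compares the full model's log-likelihood to that of an intercept-only null model. The log-likelihood values below are illustrative stand-ins chosen to land near the reported figure, not the study's actual output:

```python
def mcfadden_r2(ll_full, ll_null):
    """McFadden pseudo-R^2 = 1 - ll_full / ll_null (both log-likelihoods are negative)."""
    return 1.0 - ll_full / ll_null

# Hypothetical log-likelihoods (not from the study) giving a value near 0.8736
r2 = mcfadden_r2(ll_full=-56.0, ll_null=-443.0)
```

Values above roughly 0.4 are usually read as an excellent fit for this statistic, so 0.87 indicates the model explains the outcome far better than the intercept-only baseline.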

We further divided the dataset into two subsets: a training dataset (70%) and a validation dataset (30%). This was done to guard against overfitting, i.e., to check for apparent relationships in the training data that do not hold in general.
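A 70/30 split of this kind can be sketched as a seeded random shuffle followed by a cut. The rows here are just placeholder indices sized to the study's sample of 699; the seed value is an arbitrary choice for reproducibility:

```python
import random

def train_validation_split(rows, train_frac=0.70, seed=42):
    """Randomly split rows into a training set and a validation set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)       # seeded for a reproducible split
    cut = int(round(train_frac * len(rows)))
    train = [rows[i] for i in idx[:cut]]
    valid = [rows[i] for i in idx[cut:]]
    return train, valid

rows = list(range(699))                    # placeholder rows; 699 = study sample size
train, valid = train_validation_split(rows)
```

The model is then fit on the training rows only, and its performance on the held-out validation rows estimates how well it generalizes.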