Breast Cancer Analysis

Supervised Learning: Logistic Regression
Logistic regression is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome; something that can take two values such as true/false, yes/no, and so on.

The data for this model was obtained from the University of Wisconsin-Madison by Dr. William H. Wolberg. The study was conducted between January 1989 and November 1991 within 8 groups. A total sample size of 699 was collected.

The goal is to build a model that best predicts if the cancer is Benign or Malignant.

In our initial run, using all the variables, we obtained p-values less than 0.05 for few variables. The initial run is not efficient model, even though our pchisq is 0. The key variables that show poor p-values were Uniformity of Cell shape Size and Single Epithelial cell size (0.95 and 0.72).

We also did a final check for multicollinearity and all the final variables check out with values less than 2.

Hypothesis Testing:
Null Hypothesis: Ho: Model is NOT Significant.

Alternate Hypothesis: Ha: Model is Significant.

Our phchisq value = 0, hence, the null hypothesis is ruled out since our model is significant.

Also, by performing the Wald test, (in the context of logistic regression it is used to determine whether a certain predictor variable X is significant or not.)

The p-value in the wald test is significantly smaller than 0.05, further solidifying that our model is significant.

Final Model:

E(Y) = [exp(1.130e+01-5.774e-01*x1-4.394e+00*x21.279e-04*x3-4.108e-01*x4-2.180e+00*x5)] ____________________________________________________________________________

[1+exp(1.130e+01-5.774e-01*x1-4.394e+00 *x21.279e-04*x3-4.108e-01*x4-2.180e+00*x5)]

We calculated the 95% confidence interval and obtained our Exponential Coeffcients (as well as for 95% CI) for our final model.

Along with an ANOVA Table with chi-squared test and a McFadden R 2 of 0.8736044

We further divided our datasets into two subsets: Training dataset (70%) and Validation dataset (30%). This was done in order to prevent overfitting,i.e., to check apparent relationships in the training data that do not hold in general.

A successful significant data model was built that showed an overall accuracy rate of 96.9%. When put through training and validation datasets, the model retained an accuracy rate above 95%.
The error rate predicted was marginal. The overall error rate showed up to 3%. When put through training and validation datasets, the model retained an error rate below 5%.
McFadden R 2 points out to 0.87, which points out to the fact that the data is a good-fit for the model.
This model can be utilized for further use in Medical research and studies and to predict if cancer cells are benign and malignant.