Steak & Cheating
Supervised Learning: Logistic Regression
Logistic regression models the probability of a discrete outcome given one or more input variables. The most common form models a binary outcome: something that can take only two values, such as true/false or yes/no.
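Under the hood, the model passes a linear combination of the features through the logistic (sigmoid) function to squash it into a probability between 0 and 1. A minimal sketch of that idea (the intercept and weight below are made-up numbers, purely for illustration):
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# hypothetical linear combination: intercept + weight * feature value
z = -0.5 + 1.2 * 1   # e.g. a 1/0 answer to a risk question
print(sigmoid(z))    # probability of the positive class, ~0.67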
Is your Significant Other more likely to cheat on you if they order their steak RARE?
Should you call an Uber & run out the door the moment that raw meat lands on your table during your date?
Using FiveThirtyEight's Steak Survey dataset (source: GitHub), I restructured the dataset so that every column holds only values of 1 and 0, and I dropped any rows with null values (a sketch of this preprocessing follows the feature list below).
The features used for this logistic regression model are:
"Risk Taker? (Lottery A or B)", "Do you ever smoke cigarettes?", "Do you ever drink alcohol?", "Do you ever gamble?", "Have you ever been skydiving?", "Do you ever drive above the speed limit?", "No Steak", "Rare", "Medium Rare", "Medium", "Medium Well", "Well", "Male", "Female", "60+", "18-29", "30-44", "45-60", "$0 - $24,999", "$100,000 - $149,999", "$150,000+", "$25,000 - $49,999", "$50,000 - $99,999", "Bachelor degree", "Graduate degree", "High school degree", "Less than high school degree", "Some college or Associate degree"
with the target (y) being "Have you ever cheated on your significant other?"
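A minimal sketch of that preprocessing, assuming pandas and the raw CSV from FiveThirtyEight's GitHub repo. The filename and exact cleaning steps here are assumptions (the original column names also differ slightly, e.g. "Risk Taker? (Lottery A or B)" is a renamed column); the core idea is dropping nulls, mapping yes/no answers to 1/0, and one-hot encoding the categorical answers:
import pandas as pd

# load the raw survey responses (filename is an assumption)
df = pd.read_csv('steak-risk-survey.csv')

# drop any rows that contain null values
df = df.dropna()

# map yes/no answers to 1/0 and one-hot encode the categorical columns
df = df.replace({'Yes': 1, 'No': 0})
df = pd.get_dummies(df)

# split features (x) and target (y)
y = df['Have you ever cheated on your significant other?']
x = df.drop(columns=['Have you ever cheated on your significant other?'])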
Using the Python libraries sklearn & pandas, I ran the model several times with test sizes of 0.1, 0.2, 0.25, 0.3, and 0.5.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn import metrics

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
# instantiate the model (using the default parameters)
logreg = LogisticRegression()
# fit the model with the training data
logreg.fit(X_train, y_train)
# predict on the held-out test set
y_pred = logreg.predict(X_test)
# confusion matrix and summary metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
Accuracy ranged from a low of 75.9% to a high of 89.5% across those runs, BUT.....
Using sklearn & matplotlib, the AUC score for the same cases ranges from a low of 0.47 to barely 0.50. An AUC of 1 represents a perfect classifier, while 0.5 represents a worthless one.
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# ROC graph
# note: the AUC here is computed from the hard 0/1 predictions,
# while the curve itself uses the predicted probabilities
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')  # diagonal = random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC: Can this model predict if your SigFig is unfaithful')
plt.legend(loc="lower right")
plt.savefig('Log_ROC_2')
plt.show()
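The AUC in the plot label above is computed from the hard 0/1 predictions; a short sketch of computing it from the predicted probabilities instead, which is the more common convention (not what the original numbers were based on):
# AUC from predicted probabilities rather than hard class labels
proba_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])
print("Probability-based AUC:", round(proba_auc, 3))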
I used the statsmodels library for the feature-level stats below (code at the end of this section).
The final model reports a p-value for each feature, an overall (LLR) p-value of 0.299 with a model DF of 27, and a pseudo R-squared of 0.062 (taking the square root gives a pseudo R of about 0.249).
Looking at the p-values for each individual feature, almost all of them except "No Steak" sit above 0.05, which is not a good sign.
So, in conclusion: even though the 75.9% to 89.5% accuracy rate looks good, the poor AUC range of 0.47 to 0.5, the high feature p-values, the high overall p-value, and the low pseudo R-squared mean this model is okay but leans toward the crappy side. It does not predict "infidelity" that well. I do like the survey questions, though, since they share a running theme of risk-averseness on top of the typical demographic tiers.
import statsmodels.api as sm
# fit the same features with statsmodels to get per-feature stats
logit_model = sm.Logit(y, x)
result = logit_model.fit()
print(result.summary2())
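To pull the individual feature p-values mentioned above out of that fit directly, rather than reading them off the summary table, a short sketch:
# p-value for each feature, returned as a pandas Series
pvals = result.pvalues
# features below the usual 0.05 threshold (in this run, essentially just "No Steak")
print(pvals[pvals < 0.05])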