Diabetes Type II Diagnosis Analysis

Unsupervised Learning: PCA & K-means Clustering
Principal component analysis (PCA) is a technique for reducing the dimensionality of a data set, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. K-means clustering is a centroid-based, or distance-based, algorithm: each cluster is associated with a centroid, and each point is assigned to the cluster whose centroid is nearest.
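To make the centroid idea concrete, here is a minimal sketch of the usual k-means loop (Lloyd's algorithm) in plain Python. The points and starting centroids are made up for illustration and are not taken from the diabetes data.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(labels)
print(centroids)
```

In practice the starting centroids are usually chosen at random, and the result can depend on that choice.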

In this analysis, we investigate whether physical features associated with obesity are linked to diabetes type II, a lifelong disease in which the body produces the hormone insulin but the cells do not use it properly, and so fail to regulate a type of sugar called glucose (WebMD, n.d.). We also ask whether obesity is a vital indicator of diabetes type II in addition to glycosylated hemoglobin (A1C level).

The data set we use is from Dr. John Schorling of the Department of Medicine at the University of Virginia (Schorling, 1997). It contains 404 observations and 19 variables. We omitted rows with missing entries and dropped two variables whose missing values would have severely reduced our sample size; our final data set therefore has 366 observations and 17 variables.
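The trade-off behind dropping sparse variables before dropping incomplete rows can be sketched as follows. The records, the helper names, and the 0.5 threshold are all hypothetical; this is not the code used for the study.

```python
# Hypothetical toy records; None marks a missing entry.
rows = [
    {"weight": 180, "hip": 40, "bp.2s": None},
    {"weight": 150, "hip": 38, "bp.2s": 130},
    {"weight": None, "hip": 42, "bp.2s": None},
    {"weight": 200, "hip": 45, "bp.2s": None},
]

def drop_sparse_columns(rows, max_missing_frac):
    """Remove columns whose share of missing entries exceeds the threshold."""
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(r[col] is None for r in rows) / n <= max_missing_frac
    ]
    return [{col: r[col] for col in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Keep only rows with no missing entries (listwise deletion)."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Listwise deletion alone keeps very few rows when one column is mostly empty.
rows_only = drop_incomplete_rows(rows)

# Dropping the sparse column first preserves more observations.
cols_then_rows = drop_incomplete_rows(drop_sparse_columns(rows, max_missing_frac=0.5))

print(len(rows_only), len(cols_then_rows))
```

The price of the second route is that the dropped variables can no longer contribute to the analysis.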

We then use principal component analysis (PCA) to seek underlying factors in the data set while preserving the variation within the variables. The top four principal components, which together account for 64.5% of the total variance, were chosen.
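As a rough illustration of what PCA computes, the sketch below standardizes a few variables, builds their correlation matrix, and finds the first principal component by power iteration. The measurements are made up, and power iteration is used here only to keep the example dependency-free; a statistics package would normally do this step.

```python
import math

def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

def correlation_matrix(columns):
    """Correlation matrix, computed as the covariance of standardized columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def leading_component(matrix, n_iter=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    v = [1.0] * len(matrix)
    for _ in range(n_iter):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v

# Made-up measurements: the first two columns move together, the third does not.
weight = [150, 160, 170, 180, 190, 200]
waist = [32, 34, 35, 38, 39, 42]
pulse = [70, 64, 75, 66, 72, 68]

corr = correlation_matrix([weight, waist, pulse])
eigenvalue, direction = leading_component(corr)

# The trace of a correlation matrix equals the number of variables,
# so this ratio is the share of total variance carried by the first component.
share = eigenvalue / len(corr)
print(round(share, 3), [round(x, 3) for x in direction])
```

Variables that are strongly correlated with each other end up with the largest weights in the same component, which is why the component can be given a shared interpretation.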

The first component, PC1, was mostly composed of weight, waist size, and hip size, and can be attributed to physical features. PC1 alone accounts for 27.3% of the total variation. The other three components can be attributed to sugar blood work, vital signs, and lipid blood work. We therefore conclude that physical features related to obesity play an important role in the prevalence of diabetes type II and can be considered the most significant indicator, followed by the level of glycosylated hemoglobin.

Pre-analysis diagnostics are performed. We check for non-normality, nonlinearity, multicollinearity, and non-independence of errors. We also draw a correlation plot (see Appendix).
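One simple way to screen for multicollinearity is to flag pairs of variables whose Pearson correlation is large in absolute value. The sketch below assumes made-up values and a hypothetical cutoff of 0.8; it is not the diagnostic reported in the Appendix.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values; only the variable names are borrowed from the data set.
data = {
    "hip": [38, 40, 41, 43, 46, 48],
    "waist": [32, 34, 35, 38, 39, 42],
    "hdl": [50, 62, 45, 58, 41, 55],
}

threshold = 0.8  # hypothetical cutoff for "highly correlated"
flagged = []
for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    if abs(r) > threshold:
        flagged.append((a, b, round(r, 3)))

print(flagged)
```

Highly correlated pairs are not a problem for PCA itself; they are exactly the structure PCA summarizes into a single component.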

The diabetes type II data set of 404 observations and 19 variables is examined. Rows with missing entries are omitted, and two variables, the second systolic and diastolic blood pressures (bp.2s, bp.2d), are removed because keeping them would have severely reduced the number of observations: had bp.2s and bp.2d been retained, the analysis would have lost 262 observations. We run PCA and draw a scree plot to see where the “knee” is, along with the cumulative proportion of explained variance. As a result, the top four principal components, which together account for 64.5% of the total variance, were selected.
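The cumulative proportion of explained variance follows directly from the eigenvalues that a scree plot displays. The sketch below uses hypothetical eigenvalues and a hypothetical 75% target, not the values from this study.

```python
from itertools import accumulate

# Hypothetical eigenvalues of a correlation matrix, largest first.
# These are NOT the eigenvalues from the diabetes analysis.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]  # heights of the scree plot, rescaled
cumulative = list(accumulate(proportions))        # running share of total variance

def components_needed(cumulative, target):
    """Smallest number of leading components whose cumulative share reaches target."""
    for k, share in enumerate(cumulative, start=1):
        if share >= target:
            return k
    return len(cumulative)

k = components_needed(cumulative, target=0.75)
print([round(c, 3) for c in cumulative], k)
```

A fixed target is only one heuristic; the "knee" of the scree plot and the interpretability of the components are usually weighed alongside it.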

PC1 corresponds to physical features that relate to obesity, as the variables for weight, hip, and waist contribute to PC1 the most.

PC1 = 0.934*weight + 0.926*hip + ... - 0.161*hdl

PC2 corresponds to sugar blood work, as the variables stab.glu and glyhb contribute to PC2 the most.

PC2 = 0.872*glyhb + 0.816*stab.glu + ... + 0.148*time.ppn

PC3 corresponds to blood pressure vital signs, as the variables bp.1s and bp.1d contribute to PC3 the most.

PC3 = 0.839*bp.1s + 0.804*bp.1d + ... - 0.267*time.ppn

PC4 corresponds to lipid blood work, as the variables hdl and ratio contribute to PC4 the most, negatively and positively, respectively.

PC4 = 0.829*ratio - 0.878*hdl + ... + 0.154*waist
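Each equation above is a weighted sum of the variables. The sketch below evaluates such a sum for one subject, assuming the variables have been standardized first (an assumption; the text does not say). It uses only the three PC1 terms printed above, so the result is a partial sum rather than a full PC1 score, and the subject's values are hypothetical.

```python
# Only the PC1 terms printed in the text; the elided terms ("...") are left out,
# so the result below is a partial sum, not the full PC1 score.
pc1_shown_terms = {"weight": 0.934, "hip": 0.926, "hdl": -0.161}

# Hypothetical standardized (z-score) values for one subject.
subject_z = {"weight": 1.2, "hip": 0.8, "hdl": -0.5}

def linear_score(coefficients, z_values):
    """Weighted sum of standardized variables."""
    return sum(coef * z_values[name] for name, coef in coefficients.items())

partial_pc1 = linear_score(pc1_shown_terms, subject_z)
print(round(partial_pc1, 4))
```

A subject with above-average weight and hip size and below-average hdl gets a positive contribution from all three terms, which matches the reading of PC1 as an obesity-related component.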

 

The highest and lowest scores are also worth noting. The highest and lowest PC scores are: PC1 with 7.54 and -3.362, PC2 with 4.40 and -4.34, PC3 with 4.17 and -3.58, and PC4 with 3.49 and -3.24. It can be seen in the Individual - PCA graph above that observation number 56 has the highest PC1 score.

Regarding the limitations of this research, the data were collected within the African American population in central Virginia; this study is therefore specific to that population. The study has a small sample size of 404 to begin with, which was cut down to 366 because of missing entries. The 19 variables were also reduced to 17, as two variables, the second systolic and diastolic blood pressures, had missing values in 262 of the 404 observations. For future work, one should consult a subject matter expert before removing any variables. Another way to improve this study is to collect data from a more random and ethnically diverse area so that the results can be applied to a more general population.

In conclusion, principal component analysis has allowed us to identify obesity as a leading indicator of diabetes type II, followed by the level of glycosylated hemoglobin. Physical features closely related to obesity, such as weight, waist size, and hip size, can be telltale signs of diabetes type II and can help to diagnose it.

Appendix