Engineering

2024

Diabetes Risk Prediction — ML Study

Academic project · Python · scikit-learn

Python · scikit-learn · pandas · KNN · SVM · Random Forest · K-Means

768

patients in dataset

8

physiological features

3

classifiers trained

75.3%

SVM best accuracy

The Pima Indians Diabetes Database contains 768 patient records with 8 physiological measurements — Glucose, BMI, Insulin, Age, Blood Pressure, Skin Thickness, Diabetes Pedigree Function, and number of pregnancies — and a binary outcome: diabetic or not.

The goal was not just to reach the highest accuracy. It was to understand why the models behave the way they do, and to check those patterns with an unsupervised method that never sees the labels.

The data has a hidden problem

Before any model was trained, the dataset needed cleaning. Several physiological variables contain zeros that are biologically impossible, such as Glucose = 0 or BMI = 0. These are missing values recorded as zero, not real measurements. They were imputed with the median of each outcome class, which preserves the class-specific distributions.
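The imputation step can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual code: the helper name `impute_zeros_by_class`, the `Outcome` column name, and the toy values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def impute_zeros_by_class(df, cols, label="Outcome"):
    """Treat zeros in `cols` as missing and fill them with the per-class median."""
    out = df.copy()
    # Zeros in these columns are placeholders for missing values.
    out[cols] = out[cols].replace(0, np.nan)
    # transform("median") broadcasts each class's median back onto its rows;
    # NaN entries are skipped when the median is computed.
    medians = out.groupby(label)[cols].transform("median")
    out[cols] = out[cols].fillna(medians)
    return out


# Tiny made-up frame for illustration only, not real patient data.
toy = pd.DataFrame({
    "Glucose": [90, 0, 100, 150, 0, 170],
    "BMI": [22.0, 24.0, 0.0, 31.0, 35.0, 0.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
clean = impute_zeros_by_class(toy, ["Glucose", "BMI"])
print(clean)
```

Because the fill values depend on the outcome label, a held-out evaluation would normally compute the medians on the training split only and reuse them on the test split.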

Feature correlation matrix
Correlation matrix — Glucose and BMI show the strongest correlation with the diabetes outcome. These two features drive most of the predictive signal.
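A ranking like the one in the figure can be read directly off `DataFrame.corr()`. The sketch below uses synthetic data, so the numbers it prints say nothing about the real dataset; the column names and effect sizes are assumptions chosen only to make the example run.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data (assumed values, not the Pima records).
rng = np.random.default_rng(0)
n = 500
outcome = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "Glucose": 100 + 30 * outcome + rng.normal(0, 20, n),
    "BMI": 28 + 4 * outcome + rng.normal(0, 6, n),
    "Noise": rng.normal(0, 1, n),  # a feature unrelated to the outcome
    "Outcome": outcome,
})

# Pearson correlation between every pair of columns.
corr = df.corr()

# Rank features by the absolute correlation with the outcome column.
ranking = corr["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranking)
```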

Three classifiers, three perspectives

Tested

KNN — k=11

73.2% accuracy. Optimal k found by cross-validation. Sensitive to feature scale — StandardScaler applied. Interpretable but limited.
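One common way to combine scaling with a cross-validated search for k is a scikit-learn pipeline, sketched below. The synthetic data from `make_classification`, the range of k values, and the 5-fold setting are assumptions for the example, not the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 8-feature binary problem standing in for the real data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Scaling sits inside the pipeline so it is re-fit on each training fold,
# which keeps validation folds out of the scaler's statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Odd values of k avoid ties in a two-class vote.
param_grid = {"knn__n_neighbors": list(range(1, 31, 2))}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```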

Tested

Random Forest — 100 trees

74.0% accuracy. Feature importance: Glucose > BMI > Age. Robust to outliers, but only marginal gain over KNN here.
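A feature-importance ranking of this kind comes from the fitted forest's `feature_importances_` attribute. The sketch below is illustrative only: the data is synthetic and the feature names are hypothetical, so the ordering it prints is unrelated to the Glucose > BMI > Age result.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative columns come first, so the names
# below are accurate for this synthetic data (hypothetical names).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = ["informative_0", "informative_1", "noise_0", "noise_1", "noise_2"]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1.
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favour high-cardinality features, so `sklearn.inspection.permutation_importance` is a reasonable cross-check.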

Deployed

SVM — RBF kernel

75.3% accuracy. Best AUC on ROC curve. Grid search over C and γ. The decision boundary generalises best on this dataset.
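A grid search over C and γ for an RBF SVM might look like the following. The grid values, the 75/25 split, and the synthetic data are assumptions for the sketch; the values actually searched in the project are not stated.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 8-feature binary problem standing in for the real data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],         # assumed grid
    "svc__gamma": [0.001, 0.01, 0.1, 1],  # assumed grid
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

# Accuracy of the refit best model on data the search never saw.
test_acc = search.score(X_test, y_test)
print(search.best_params_, round(test_acc, 3))
```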

ROC curves for all three classifiers
ROC curves, all three classifiers. SVM achieves the highest AUC, so its lead holds across decision thresholds and not only at the one used to report accuracy.
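AUC is computed from continuous scores rather than hard predictions. The sketch below compares the three model families on synthetic data. The hyperparameters mirror the ones named above where the text gives them, and everything else, including the data and the split, is an assumption, so the printed AUCs do not reproduce the figure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each entry pairs a model with the method that yields a continuous score.
models = {
    "KNN": (make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11)),
            "predict_proba"),
    "Random Forest": (RandomForestClassifier(n_estimators=100, random_state=0),
                      "predict_proba"),
    "SVM": (make_pipeline(StandardScaler(), SVC(kernel="rbf")),
            "decision_function"),
}

aucs, curves = {}, {}
for name, (model, method) in models.items():
    model.fit(X_train, y_train)
    scores = getattr(model, method)(X_test)
    if method == "predict_proba":
        scores = scores[:, 1]  # probability of the positive class
    aucs[name] = roc_auc_score(y_test, scores)
    fpr, tpr, _ = roc_curve(y_test, scores)
    curves[name] = (fpr, tpr)  # points to plot for this model's ROC curve

for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.3f}")
```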

Unsupervised validation: K-Means finds the same truth

The most striking result wasn't from a supervised model. K-Means (k=2) was run on Glucose and BMI only — without using the labels at all.

K-Means found the diabetic cluster without looking at a single label. The structure was real, not learned.

K-Means clustering on Glucose and BMI
K-Means (k=2) on Glucose × BMI — Cluster 1 (low glucose, low BMI) maps to non-diabetic patients. Cluster 2 (high glucose, high BMI) maps to diabetic patients. The cluster boundaries align with the SVM decision boundary.

The two clusters align almost exactly with the SVM decision boundary. This means the two-class structure is genuinely present in the geometry of the data — the supervised models are not inventing a separation that isn't there.
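The agreement between clusters and labels can be quantified once the clustering is done. In this sketch the two blobs are synthetic and only loosely shaped like a Glucose × BMI plane. Standardising before K-Means is an assumption too, since the text does not say whether the features were scaled. The labels `y` are used only to score the result, never to fit it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic groups in a hypothetical (Glucose, BMI) plane.
rng = np.random.default_rng(0)
low = rng.normal([100, 26], [15, 4], size=(150, 2))
high = rng.normal([160, 36], [15, 4], size=(100, 2))
X = np.vstack([low, high])
y = np.array([0] * 150 + [1] * 100)  # held back for evaluation only

# Standardise so both features contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Cluster ids are arbitrary, so score both possible id-to-class mappings.
agreement = max(np.mean(labels == y), np.mean(labels != y))
print(f"cluster/label agreement: {agreement:.3f}")
```

On real data, `sklearn.metrics.adjusted_rand_score` gives a chance-corrected version of the same comparison.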

What this demonstrates

This project covers the full ML pipeline, not just the modelling step: data cleaning with biological reasoning, EDA before any model is fit, supervised learning tuned by cross-validation, and unsupervised validation to check that the supervised boundary reflects real structure. The K-Means result is the one worth remembering: it shows that the two-class separation is present in the data itself, not an artefact of the training procedure.