A machine learning classification study comparing six algorithms across two diabetes datasets to identify the most effective model for early diagnosis. Research findings published at IVSP 2024 and indexed in the ACM Digital Library.
Published Paper: Machine Learning Algorithms for Diabetes Diagnosis Prediction — ACM, IVSP 2024
Diabetes affects over 37 million U.S. adults, with 1 in 5 cases going undiagnosed. Early and accurate prediction is critical for timely treatment. This project evaluates six supervised ML classifiers across two distinct patient datasets to determine which algorithm provides the most robust diagnostic performance.
- Source: National Institute of Diabetes and Digestive and Kidney Diseases (the Pima Indians Diabetes dataset)
- Population: Female patients of Pima Indian heritage, age 21+
- Features (8): Number of pregnancies, glucose level, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age
- Target: Binary — diabetic / non-diabetic
- Source: BIT Mesra, Dept. of Computer Science and Engineering
- Population: 952 instances, mixed gender, no demographic restrictions
- Features (17): Age, gender, family history, high blood pressure, physical activity, BMI, smoking, alcohol consumption, sleep history, sleep quality, medication use, junk food intake, stress levels, blood pressure level, pregnancies, prediabetes status, urination frequency
- Target: Binary — diabetic / non-diabetic
| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Gaussian Naïve Bayes | 79.22% | 67.44% | 61.70% |
| Decision Tree | 75.97% | 66.67% | 76.59% |
| Random Forest | 81.82% | 73.17% | 63.83% |
| K-Nearest Neighbor | 78.57% | 65.91% | 61.70% |
| Linear SVC | 82.47% | 76.32% | 61.70% |
| Ridge Classifier | 83.12% | 78.38% | 61.70% |
Best model by accuracy: Ridge Classifier (83.12%). Decision Tree reaches the highest recall (76.59%), but at lower accuracy (75.97%).
| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Gaussian Naïve Bayes | 80.00% | 74.29% | 89.66% |
| Decision Tree | 95.00% | 100.00% | 89.66% |
| Random Forest | 95.00% | 96.43% | 93.10% |
| K-Nearest Neighbor | 81.67% | 82.14% | 79.31% |
| Linear SVC | 78.33% | 75.00% | 82.76% |
| Ridge Classifier | 80.00% | 75.76% | 86.21% |
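Best models by accuracy: Decision Tree and Random Forest (95.00% each). Decision Tree has the higher precision (100.00%); Random Forest has the higher recall (93.10%).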
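The three columns in the tables above are all derived from a confusion matrix, which is why two models can tie on accuracy yet differ on precision and recall. The sketch below shows the arithmetic in plain Python. The counts are hypothetical: they were picked only because they reproduce the Decision Tree and Random Forest percentages in the second table, and the `Confusion` class is an illustrative helper, not part of the project. The actual confusion matrices and test-set size are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts (positive class = diabetic)."""
    tp: int  # diabetic, predicted diabetic
    fp: int  # non-diabetic, predicted diabetic
    fn: int  # diabetic, predicted non-diabetic
    tn: int  # non-diabetic, predicted non-diabetic

    @property
    def accuracy(self) -> float:
        # Share of all predictions that are correct.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def precision(self) -> float:
        # Share of predicted-diabetic cases that really are diabetic.
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        # Share of truly diabetic cases that the model catches.
        return self.tp / (self.tp + self.fn)


# Hypothetical counts, chosen to match the reported percentages only.
decision_tree = Confusion(tp=26, fp=0, fn=3, tn=31)
random_forest = Confusion(tp=27, fp=1, fn=2, tn=30)

for name, cm in [("Decision Tree", decision_tree), ("Random Forest", random_forest)]:
    print(f"{name}: accuracy={cm.accuracy:.2%} "
          f"precision={cm.precision:.2%} recall={cm.recall:.2%}")
```

Under these assumed counts, both models make three errors in total, but the Decision Tree's errors are all missed cases (false negatives), while the Random Forest trades one false positive for one fewer missed case.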
- Data Collection — Two Kaggle datasets selected for their overlapping features (age, BMI, blood pressure, pregnancies) to enable direct algorithm comparison across different feature dimensionalities
- Preprocessing — Missing value imputation, outlier removal, RobustScaler feature normalization
- Feature Extraction — Correlation matrix analysis to detect and remove multicollinear features before model training
- Model Training — Six scikit-learn classifiers trained and evaluated on each dataset
- Evaluation — Models compared on accuracy, precision, and recall
- Analysis — Investigated whether the number of features (8 vs. 17) influences algorithm performance
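The preprocessing and feature-filtering steps above can be sketched as follows. Only `RobustScaler` and the use of a correlation matrix come from the description. Median imputation, the 1.5 × IQR outlier rule, the 0.9 correlation threshold, the helper function names, and the synthetic `demo` frame are all assumptions made for illustration, not the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler


def impute_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with each column's median (assumed strategy)."""
    imputer = SimpleImputer(strategy="median")
    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)


def drop_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose every feature lies within k * IQR of the quartiles (assumed rule)."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    inside = ((df >= q1 - k * iqr) & (df <= q3 + k * iqr)).all(axis=1)
    return df[inside]


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later feature of any pair with |correlation| above the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Synthetic stand-in data; column names are illustrative only.
rng = np.random.default_rng(0)
n = 200
bmi = rng.normal(30, 5, n)
demo = pd.DataFrame({
    "glucose": rng.normal(120, 25, n),
    "bmi": bmi,
    "bmi_copy": bmi * 1.01 + rng.normal(0, 0.01, n),  # near-duplicate feature
    "age": rng.integers(21, 80, n).astype(float),
})
demo.loc[0, "glucose"] = np.nan   # a missing value to impute
demo.loc[1, "glucose"] = 900.0    # an implausible value to remove

clean = drop_correlated(drop_iqr_outliers(impute_median(demo)))
scaled = RobustScaler().fit_transform(clean)  # centers on median, scales by IQR
print(clean.columns.tolist(), scaled.shape)
```

In a real train/test workflow, the imputer and scaler would normally be fitted on the training split only, to avoid leaking test-set statistics into the model.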
DiabetesPredictionProject/
├── data/raw/ # Raw dataset files
├── notebooks/ # Jupyter notebooks with analysis
├── src/ # Python scripts
└── ML_AlgorithmsDiabetesPrediction.pdf # Full project report
- Language: Python
- Libraries: scikit-learn, Pandas, NumPy, Matplotlib, Seaborn
- Algorithms: Ridge Classifier, Random Forest, Decision Tree, KNN, Linear SVC, Gaussian Naïve Bayes
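The six algorithms listed above correspond to standard scikit-learn estimators. The sketch below trains each one and reports the three metrics used in this study. It runs on synthetic data from `make_classification`, and the default hyperparameters, the 80/20 stratified split, the fixed seed, and the `evaluate` helper are assumptions for illustration rather than the configuration used in the paper.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are assumptions; the paper's settings are not given here.
MODELS = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Linear SVC": LinearSVC(max_iter=10000),
    "Ridge Classifier": RidgeClassifier(),
}


def evaluate(X, y, test_size=0.2, seed=0):
    """Fit every model on one split and return accuracy, precision, and recall."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    results = {}
    for name, model in MODELS.items():
        # Scaling lives inside the pipeline so it is fitted on training data only.
        pipe = make_pipeline(RobustScaler(), clone(model))
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
        }
    return results


# Synthetic stand-in for a real dataset with 8 features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
results = evaluate(X, y)
for name, scores in results.items():
    print(f"{name:<22}" + "  ".join(f"{k}={v:.2%}" for k, v in scores.items()))
```

Scores from a single split depend on the random seed, so comparisons between closely ranked models are more reliable when repeated over several splits, for example with cross-validation.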
Amanda Zambrana, Loreen Fanek, Pablo S. Carrera, Md L. Ali, and Mourya R. Narasareddygari. 2024. Machine Learning Algorithms for Diabetes Diagnosis Prediction. In 2024 6th International Conference on Image, Video and Signal Processing (IVSP 2024), March 14–16, 2024, Ikuta, Japan. ACM. https://doi.org/10.1145/3655755.3655781