psalarc/DiabetesPredictionProject

ML-Based Diabetes Risk Prediction

A machine learning classification study comparing six algorithms across two diabetes datasets to identify the most effective model for early diagnosis. Research findings published at IVSP 2024 and indexed in the ACM Digital Library.

Published Paper: Machine Learning Algorithms for Diabetes Diagnosis Prediction — ACM, IVSP 2024


Overview

Diabetes affects over 37 million U.S. adults, with 1 in 5 cases going undiagnosed. Early and accurate prediction is critical for timely treatment. This project evaluates six supervised ML classifiers across two distinct patient datasets to determine which algorithm provides the most robust diagnostic performance.


Datasets

Dataset 1 — Pima Indians Diabetes (Kaggle)

  • Source: National Institute of Diabetes and Digestive and Kidney Diseases
  • Population: Female patients of Pima Indian heritage, age 21+
  • Features (8): Number of pregnancies, glucose level, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age
  • Target: Binary — diabetic / non-diabetic

Dataset 2 — Diabetes 2019 (Kaggle)

  • Source: BIT Mesra, Dept. of Computer Science and Engineering
  • Population: 952 instances, mixed gender, no demographic restrictions
  • Features (17): Age, gender, family history, high blood pressure, physical activity, BMI, smoking, alcohol consumption, sleep history, sleep quality, medication use, junk food intake, stress levels, blood pressure level, pregnancies, prediabetes status, urination frequency
  • Target: Binary — diabetic / non-diabetic

Results

Table 1 — Pima Indians Diabetes Dataset

| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Gaussian Naïve Bayes | 79.22% | 67.44% | 61.70% |
| Decision Tree | 75.97% | 66.67% | 76.59% |
| Random Forest | 81.82% | 73.17% | 63.83% |
| K-Nearest Neighbor | 78.57% | 65.91% | 61.70% |
| Linear SVC | 82.47% | 76.32% | 61.70% |
| Ridge Classifier | 83.12% | 78.38% | 61.70% |

Best model: Ridge Classifier — 83.12% accuracy
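For readers less familiar with the three reported metrics, the sketch below shows how accuracy, precision, and recall follow from confusion-matrix counts. The counts used here (`tp=29`, `fp=8`, `fn=18`, `tn=99`) are an assumption: they were chosen only because they reproduce the Ridge Classifier row, and they are not taken from the paper or the repository.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) as fractions from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total   # share of all predictions that are correct
    precision = tp / (tp + fp)     # share of predicted positives that are truly positive
    recall = tp / (tp + fn)        # share of true positives that the model finds
    return accuracy, precision, recall


# Hypothetical counts (an assumption, not reported by the authors),
# picked so that the result matches the Ridge Classifier row of Table 1.
accuracy, precision, recall = classification_metrics(tp=29, fp=8, fn=18, tn=99)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
# prints: accuracy=83.12% precision=78.38% recall=61.70%
```

With counts like these, a model can reach the highest accuracy in the table while still missing more than a third of the diabetic cases, which is why recall is worth reading alongside accuracy in a screening context.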

Table 2 — Diabetes 2019 Dataset

| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Gaussian Naïve Bayes | 80.00% | 74.29% | 89.66% |
| Decision Tree | 95.00% | 100% | 89.66% |
| Random Forest | 95.00% | 96.43% | 93.10% |
| K-Nearest Neighbor | 81.67% | 82.14% | 79.31% |
| Linear SVC | 78.33% | 75.00% | 82.76% |
| Ridge Classifier | 80.00% | 75.76% | 86.21% |

Best models: Decision Tree & Random Forest — both 95.00% accuracy
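Because two models tie on accuracy here, it can help to look at each metric separately. The snippet below is only an illustrative reading of the Table 2 numbers, not an analysis taken from the paper; the helper name `best_by` is hypothetical.

```python
# Table 2 values in percent, copied from the table: (accuracy, precision, recall).
table2 = {
    "Gaussian Naïve Bayes": (80.00, 74.29, 89.66),
    "Decision Tree": (95.00, 100.00, 89.66),
    "Random Forest": (95.00, 96.43, 93.10),
    "K-Nearest Neighbor": (81.67, 82.14, 79.31),
    "Linear SVC": (78.33, 75.00, 82.76),
    "Ridge Classifier": (80.00, 75.76, 86.21),
}


def best_by(metric_index):
    """Return the alphabetically sorted names of all models tied for the top score."""
    top = max(scores[metric_index] for scores in table2.values())
    return sorted(name for name, scores in table2.items() if scores[metric_index] == top)


best_accuracy = best_by(0)   # ['Decision Tree', 'Random Forest']
best_precision = best_by(1)  # ['Decision Tree']
best_recall = best_by(2)     # ['Random Forest']
```

Which of the two tied models is preferable depends on whether false positives or missed diagnoses are the greater concern.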


Methodology

  1. Data Collection: Two Kaggle datasets, chosen because they share core features (age, BMI, blood pressure, pregnancies) while differing in feature count, so the same algorithms can be compared across feature dimensionalities
  2. Preprocessing: Missing-value imputation, outlier removal, and feature scaling with RobustScaler
  3. Feature Selection: Correlation matrix analysis to detect and remove highly correlated (multicollinear) features before model training
  4. Model Training: Six scikit-learn classifiers trained and evaluated on each dataset
  5. Evaluation: Models compared on accuracy, precision, and recall
  6. Analysis: Investigated whether the number of features (8 vs. 17) influences algorithm performance
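The preprocessing and correlation-filtering steps above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from `src/` or `notebooks/`: the data is synthetic, the column names are placeholders, and the median imputation strategy, the 0.9 correlation threshold, and the helper `drop_correlated` are all illustrative choices. Outlier removal is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler


def drop_correlated(df, threshold=0.9):
    """Drop the later feature of each pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so every pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic stand-in data (an assumption; not either Kaggle dataset).
rng = np.random.default_rng(0)
n = 200
bmi = rng.normal(30, 5, n)
features = pd.DataFrame({
    "glucose": rng.normal(120, 30, n),
    "bmi": bmi,
    "bmi_copy": bmi * 1.01 + rng.normal(0, 0.01, n),  # deliberately redundant column
    "age": rng.integers(21, 80, n).astype(float),
})
features.loc[::25, "glucose"] = np.nan  # simulate missing measurements

# 1) Impute missing values with the column median.
imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(features),
    columns=features.columns,
)

# 2) Remove one feature from each highly correlated pair.
reduced, dropped = drop_correlated(imputed, threshold=0.9)

# 3) Scale with RobustScaler, which centers on the median and divides by the IQR.
scaled = pd.DataFrame(RobustScaler().fit_transform(reduced), columns=reduced.columns)

print("dropped:", dropped)
print(scaled.describe().loc[["50%"]])
```

In a real evaluation the imputer and scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test-set statistics into training.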

Repository Structure

DiabetesPredictionProject/
├── data/raw/                          # Raw dataset files
├── notebooks/                         # Jupyter notebooks with analysis
├── src/                               # Python scripts
└── ML_AlgorithmsDiabetesPrediction.pdf  # Full project report

Technologies

  • Language: Python
  • Libraries: scikit-learn, Pandas, NumPy, Matplotlib, Seaborn
  • Algorithms: Ridge Classifier, Random Forest, Decision Tree, KNN, Linear SVC, Gaussian Naïve Bayes
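As a rough illustration of how the six listed algorithms can be benchmarked side by side with scikit-learn, here is a sketch on synthetic data. Everything specific in it is an assumption rather than the project's configuration: the `make_classification` data, the 80/20 stratified split, the hyperparameters (mostly library defaults), and the choice to wrap each model in a pipeline with `RobustScaler`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a diabetes dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The six classifier families named in this README, with illustrative settings.
models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Linear SVC": LinearSVC(max_iter=10000, random_state=0),
    "Ridge Classifier": RidgeClassifier(),
}

results = {}
for name, model in models.items():
    # Fitting the scaler inside the pipeline keeps test data out of the scaling statistics.
    pipeline = make_pipeline(RobustScaler(), model)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, zero_division=0),
        "recall": recall_score(y_test, predictions, zero_division=0),
    }

for name, scores in results.items():
    print(f"{name:22s} acc={scores['accuracy']:.2%} "
          f"prec={scores['precision']:.2%} rec={scores['recall']:.2%}")
```

The scores this prints describe the synthetic data only and are not comparable to the tables above.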

Citation

Amanda Zambrana, Loreen Fanek, Pablo S. Carrera, Md L. Ali, and Mourya R. Narasareddygari. 2024. Machine Learning Algorithms for Diabetes Diagnosis Prediction. In 2024 6th International Conference on Image, Video and Signal Processing (IVSP 2024), March 14–16, 2024, Ikuta, Japan. ACM. https://doi.org/10.1145/3655755.3655781
