Fraud Detection in Healthcare

Stack: Python, scikit-learn, TensorFlow. License: MIT.

Machine learning system for detecting fraudulent healthcare claims. It trains Random Forest, XGBoost, and neural network classifiers and combines them in an ensemble.

ML Pipeline

```mermaid
flowchart TD
    A[Raw Claims Data] --> B[Data Ingestion]
    B --> C[Data Cleaning]
    C --> D[Feature Engineering]
    D --> E[Feature Selection]
    E --> F[Model Training]
    F --> G[Model Evaluation]
    G --> H[Model Deployment]
    H --> I[Prediction API]
    I --> J[Dashboard]

    C --> C1[Handle Missing Values]
    C --> C2[Remove Duplicates]
    C --> C3[Data Validation]

    D --> D1[Claim Amount Features]
    D --> D2[Provider Features]
    D --> D3[Patient Features]
    D --> D4[Temporal Features]

    E --> E1[Correlation Analysis]
    E --> E2[Feature Importance]
    E --> E3[Dimensionality Reduction]

    F --> F1[Random Forest]
    F --> F2[XGBoost]
    F --> F3[Neural Network]
    F --> F4[Ensemble]
```
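The three cleaning sub-steps in the pipeline (handle missing values, remove duplicates, data validation) could look roughly like the sketch below. It uses only the standard library and plain dictionaries. The `clean_claims` function, the `claim_id` key, the median fill for `patient_age`, and the positive-amount rule are all illustrative assumptions, not the behavior of `src/data/preprocessor.py`.

```python
from statistics import median


def clean_claims(claims):
    """Deduplicate, impute, and validate a list of claim dicts (illustrative rules)."""
    # 1. Remove duplicates: keep the first record seen for each claim_id (hypothetical key).
    seen, unique = set(), []
    for claim in claims:
        if claim["claim_id"] not in seen:
            seen.add(claim["claim_id"])
            unique.append(dict(claim))  # copy so the input is not mutated

    # 2. Handle missing values: fill a missing patient_age with the median known age.
    known_ages = [c["patient_age"] for c in unique if c.get("patient_age") is not None]
    fallback_age = median(known_ages) if known_ages else None
    for claim in unique:
        if claim.get("patient_age") is None:
            claim["patient_age"] = fallback_age

    # 3. Data validation: drop records with a missing or non-positive claim_amount.
    return [c for c in unique if (c.get("claim_amount") or 0) > 0]


raw = [
    {"claim_id": 1, "claim_amount": 5000, "patient_age": 45},
    {"claim_id": 1, "claim_amount": 5000, "patient_age": 45},    # duplicate
    {"claim_id": 2, "claim_amount": 1200, "patient_age": None},  # missing age
    {"claim_id": 3, "claim_amount": -50, "patient_age": 30},     # invalid amount
]
cleaned = clean_claims(raw)
print(cleaned)
```

In this sketch the median is computed before validation, so the age from the later-dropped record still contributes to the fill value. Whether that ordering is desirable depends on the data.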

System Architecture

```mermaid
graph TB
    subgraph "Data Sources"
        Claims[Claims Database]
        Providers[Provider Data]
        Patients[Patient Data]
    end

    subgraph "Processing Layer"
        ETL[ETL Pipeline]
        FeatureStore[Feature Store]
    end

    subgraph "ML Layer"
        Training[Model Training]
        Inference[Model Inference]
        Registry[Model Registry]
    end

    subgraph "Application Layer"
        API[REST API]
        Dashboard[Analytics Dashboard]
        Alerts[Alert System]
    end

    Claims --> ETL
    Providers --> ETL
    Patients --> ETL
    ETL --> FeatureStore
    FeatureStore --> Training
    Training --> Registry
    Registry --> Inference
    Inference --> API
    API --> Dashboard
    API --> Alerts
```

Fraud Detection Flow

```mermaid
flowchart TD
    A[New Claim] --> B[Preprocessing]
    B --> C[Feature Extraction]
    C --> D{Model Ensemble}

    D --> E[Random Forest]
    D --> F[XGBoost]
    D --> G[Neural Network]

    E --> H[Vote Aggregation]
    F --> H
    G --> H

    H --> I{Fraud Probability}
    I -->|High > 0.8| J[Block Claim]
    I -->|Medium 0.5-0.8| K[Manual Review]
    I -->|Low < 0.5| L[Approve Claim]

    J --> M[Alert Investigation]
    K --> N[Review Queue]
    L --> O[Process Payment]

    M --> P[Update Database]
    N --> P
    O --> P
```
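The aggregation and routing steps of this flow can be sketched in a few lines. The diagram does not say how votes are aggregated, so the example assumes soft voting (a plain average of the three model probabilities). The function names are hypothetical, and sending scores of exactly 0.5 or 0.8 to manual review is one reading of the thresholds shown above.

```python
def aggregate_votes(probabilities):
    """Soft voting (assumed): average the fraud probabilities from the models."""
    if not probabilities:
        raise ValueError("need at least one model probability")
    return sum(probabilities) / len(probabilities)


def route_claim(fraud_probability):
    """Map a fraud probability to the action shown in the flow diagram."""
    if fraud_probability > 0.8:
        return "block"          # High: block claim, alert investigation
    if fraud_probability >= 0.5:
        return "manual_review"  # Medium: send to the review queue
    return "approve"            # Low: process payment


# Hypothetical outputs from Random Forest, XGBoost, and the neural network.
model_probs = [0.91, 0.86, 0.78]
score = aggregate_votes(model_probs)
print(round(score, 2), route_claim(score))
```

Keeping the thresholds in one function makes it easy to tune them later, for example to trade review-queue volume against missed fraud.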

Project Structure

ML-Project/
│
├── data/
│   ├── raw/
│   │   ├── claims.csv                  # Raw claims data
│   │   ├── providers.csv               # Provider information
│   │   └── patients.csv                # Patient data
│   ├── processed/
│   │   ├── features.csv                # Engineered features
│   │   └── cleaned.csv                 # Cleaned data
│   └── data_dictionary.md
│
├── notebooks/
│   ├── 01_EDA.ipynb                    # Exploratory Data Analysis
│   ├── 02_Feature_Engineering.ipynb    # Feature engineering
│   ├── 03_Model_Training.ipynb         # Model training
│   ├── 04_Evaluation.ipynb             # Model evaluation
│   └── 05_Deployment.ipynb             # Deployment prep
│
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── data_loader.py              # Data loading
│   │   ├── preprocessor.py             # Data preprocessing
│   │   └── feature_engine.py           # Feature engineering
│   │
│   ├── models/
│   │   ├── __init__.py
│   │   ├── random_forest.py            # Random Forest model
│   │   ├── xgboost_model.py            # XGBoost model
│   │   ├── neural_network.py           # Neural network model
│   │   ├── ensemble.py                 # Ensemble model
│   │   ├── trainer.py                  # Training pipeline
│   │   └── saved/
│   │       ├── best_model.pkl
│   │       └── scaler.pkl
│   │
│   ├── evaluation/
│   │   ├── __init__.py
│   │   ├── metrics.py                  # Evaluation metrics
│   │   └── visualizer.py               # Visualization
│   │
│   ├── api/
│   │   ├── __init__.py
│   │   ├── app.py                      # Flask API
│   │   ├── routes.py                   # API routes
│   │   └── schemas.py                  # Data schemas
│   │
│   └── utils/
│       ├── __init__.py
│       ├── logger.py                   # Logging
│       └── helpers.py                  # Utility functions
│
├── models/
│   ├── random_forest/
│   ├── xgboost/
│   ├── neural_network/
│   └── ensemble/
│
├── api/
│   ├── app.py
│   └── requirements.txt
│
├── tests/
│   ├── test_data.py
│   ├── test_models.py
│   └── test_api.py
│
├── configs/
│   ├── model_config.yaml
│   └── training_config.yaml
│
├── docs/
│   ├── METHODOLOGY.md
│   ├── FEATURES.md
│   └── API.md
│
├── requirements.txt
├── setup.py
├── Dockerfile
└── README.md

Features Used

| Category | Feature | Description |
|----------|---------|-------------|
| Claim | claim_amount | Total claim amount |
| Claim | procedure_code | Medical procedure code |
| Claim | diagnosis_code | Diagnosis code |
| Provider | provider_specialty | Provider specialty |
| Provider | provider_location | Geographic location |
| Patient | patient_age | Patient age |
| Patient | patient_gender | Patient gender |
| Temporal | claim_date | Date of claim |
| Temporal | days_since_last | Days since last claim |
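
As an illustration of how a temporal feature such as `days_since_last` might be derived from `claim_date`, here is a minimal standard-library sketch. The table does not say whether the gap is measured per patient or per provider, so the example assumes per patient. The `patient_id` key and the function name are hypothetical, and a patient's first claim gets `None`.

```python
from datetime import date


def add_days_since_last(claims):
    """Return claims sorted by patient and date, each with a days_since_last field."""
    last_seen = {}  # patient_id -> date of that patient's previous claim
    enriched = []
    for claim in sorted(claims, key=lambda c: (c["patient_id"], c["claim_date"])):
        previous = last_seen.get(claim["patient_id"])
        days = (claim["claim_date"] - previous).days if previous else None
        enriched.append({**claim, "days_since_last": days})
        last_seen[claim["patient_id"]] = claim["claim_date"]
    return enriched


claims = [
    {"patient_id": "P1", "claim_date": date(2024, 3, 10)},
    {"patient_id": "P1", "claim_date": date(2024, 1, 5)},
    {"patient_id": "P2", "claim_date": date(2024, 2, 1)},
]
rows = add_days_since_last(claims)
for row in rows:
    print(row["patient_id"], row["claim_date"], row["days_since_last"])
```

Unusually short gaps between claims are a common fraud signal, which is one reason to compute this per entity rather than across the whole dataset.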

Model Performance

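| Model | Accuracy | Precision | Recall | F1-Score |
|-------|----------|-----------|--------|----------|
| Random Forest | 94.2% | 92.5% | 95.1% | 93.8% |
| XGBoost | 95.8% | 94.2% | 96.5% | 95.3% |
| Neural Network | 93.5% | 91.8% | 94.8% | 93.3% |
| Ensemble | 96.2% | 95.1% | 97.0% | 96.0% |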
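
For readers who want to see how these columns relate, the sketch below computes the four metrics from confusion-matrix counts and recomputes F1 from the precision and recall reported above. The function names and the confusion counts are hypothetical and are not taken from `src/evaluation/metrics.py`; only the precision, recall, and F1 percentages come from the table.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# (precision %, recall %, reported F1 %) copied from the table above.
reported = {
    "Random Forest": (92.5, 95.1, 93.8),
    "XGBoost": (94.2, 96.5, 95.3),
    "Neural Network": (91.8, 94.8, 93.3),
    "Ensemble": (95.1, 97.0, 96.0),
}
for name, (precision, recall, f1) in reported.items():
    print(name, round(f1_score(precision, recall), 1), f1)

# Hypothetical confusion counts, for illustration only.
print(classification_metrics(tp=97, fp=5, fn=3, tn=895))
```

Because fraud is usually a small minority class, precision and recall tend to be more informative than accuracy when comparing these models.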

Installation

```bash
git clone https://github.com/Jashwanth33/ML-Project.git
cd ML-Project

pip install -r requirements.txt

# Train models
python src/models/trainer.py

# Run API
python src/api/app.py
```

API Usage

```python
import requests

# Predict fraud
response = requests.post(
    "http://localhost:5000/predict",
    json={
        "claim_amount": 5000,
        "procedure_code": "99213",
        "diagnosis_code": "J06.9",
        "provider_specialty": "Internal Medicine",
        "patient_age": 45,
        "patient_gender": "M",
    },
)

print(response.json())
# Example JSON response:
# {"fraud_probability": 0.12, "is_fraud": false}
```
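Before sending a request, a client may want to check that the payload is well-formed. The sketch below is one way to do that with the standard library. The actual schema in `src/api/schemas.py` is not shown here, so the required fields and types are assumptions inferred from the example request, and `validate_claim_payload` is a hypothetical helper.

```python
# Field names taken from the example request; the expected types are assumptions.
REQUIRED_FIELDS = {
    "claim_amount": (int, float),
    "procedure_code": str,
    "diagnosis_code": str,
    "provider_specialty": str,
    "patient_age": int,
    "patient_gender": str,
}


def validate_claim_payload(payload):
    """Return a list of problems; an empty list means the payload looks well-formed."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(payload[field], bool) or not isinstance(payload[field], expected):
            errors.append(f"wrong type for {field}")
    if not errors and payload["claim_amount"] <= 0:
        errors.append("claim_amount must be positive")
    return errors


good = {
    "claim_amount": 5000,
    "procedure_code": "99213",
    "diagnosis_code": "J06.9",
    "provider_specialty": "Internal Medicine",
    "patient_age": 45,
    "patient_gender": "M",
}
bad = {"claim_amount": "5000", "patient_age": 45}

print(validate_claim_payload(good))
print(validate_claim_payload(bad))
```

Returning a list of problems rather than raising on the first one lets a caller report everything wrong with a claim in a single pass.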

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

MIT License

Author

Jashwanth - GitHub
