hubertdomagalaa/Machine_Learning

🧠 Machine Learning Systems Portfolio

Production-Ready ML Engineering Projects by Hubert Domagała

Python 3.11+ | License: MIT | Code style: black | pre-commit | CI/CD | Security Policy

🌟 Try the Live Demo → - Interactive Cancer Detection System

Cancer Detection System Demo


🎯 Portfolio Overview

This repository showcases end-to-end machine learning system development, from exploratory data analysis to production-ready deployments. Each project demonstrates software engineering best practices, scalable architectures, and real-world problem-solving.

🏆 Featured Projects

| Project | Domain | ML Techniques | Status | Highlights |
|---|---|---|---|---|
| 🏥 Cancer Detection | Healthcare | Classification, Ensemble | ✅ Production | 96.7% accuracy, FastAPI, Zero false negatives |
| 💳 Fraud Detection | Finance | Anomaly Detection, Feature Engineering | 📊 Analysis | SMOTE, Cost-sensitive learning |
| ✍️ Digit Recognition | Computer Vision | PCA, Neural Networks | 📊 Analysis | Multi-model comparison |
| 🍯 Honey Production | Agriculture | Time Series Regression | 📊 Analysis | Trend analysis, Forecasting |
| 🏴 Flag Analysis | Data Mining | Multi-class Classification | 📊 Analysis | UCI dataset, EDA |
| 🍇 Raisin Classification | Agriculture | Clustering, Classification | 📊 Analysis | Feature analysis |
| 💰 Income Classification | Economics | Binary Classification | 📊 Analysis | Socioeconomic analysis |
| 🏥 Medical Insurance | Healthcare | OOP Design, Regression | 📊 Analysis | Clean code architecture |

Legend: ✅ Production (API + Tests) | 📊 Analysis (Notebooks)


🚀 Quick Start

Prerequisites

Python 3.11 or higher
pip (Python package manager)

Installation

# Clone the repository
git clone https://github.com/hubertdomagalaa/Machine_Learning.git
cd Machine_Learning

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
# source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Running the Cancer Detection API

# Train the model
python scripts/train_cancer_model.py

# Start the API server
uvicorn api.cancer_api:app --reload

# Visit http://localhost:8000/docs for interactive API documentation
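
The endpoint paths exposed by `api.cancer_api` are not listed in this README, so the sketch below does not guess them. It relies only on the fact that FastAPI serves its OpenAPI schema at `/openapi.json` by default (the same schema that backs `/docs`). The helper names `fetch_openapi` and `list_routes` are hypothetical and not part of the repository.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address used by the uvicorn command above


def fetch_openapi(base_url: str = BASE_URL) -> dict:
    """Download the OpenAPI schema that FastAPI serves by default."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)


def list_routes(spec: dict) -> list[tuple[str, str]]:
    """Return sorted (HTTP method, path) pairs found in an OpenAPI schema."""
    methods = {"get", "post", "put", "patch", "delete"}
    return sorted(
        (method.upper(), path)
        for path, operations in spec.get("paths", {}).items()
        for method in operations
        if method in methods
    )


# With the server running:
#     for method, path in list_routes(fetch_openapi()):
#         print(method, path)
```

Listing the routes this way shows which paths accept prediction requests before any client code is written against them.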

Running Tests

# Run all tests with coverage
pytest tests/ --cov=src --cov-report=html

# View coverage report
# Open htmlcov/index.html in your browser

📂 Repository Structure

Machine_Learning/
├── api/                     # FastAPI endpoints for production models
│   ├── cancer_api.py       # Cancer detection REST API
│   └── schemas.py          # Pydantic request/response models
│
├── src/                     # Production Python modules
│   ├── cancer/             # Cancer detection system
│   │   ├── config.py       # Configuration management
│   │   ├── data_loader.py  # Data loading and validation
│   │   ├── preprocessor.py # Feature engineering
│   │   ├── model.py        # Model training and evaluation
│   │   ├── predictor.py    # Prediction interface
│   │   └── cli.py          # Command-line interface
│   └── utils/              # Shared utilities
│
├── tests/                   # Unit and integration tests
│   ├── test_cancer_*.py    # Cancer system tests
│   └── ...
│
├── notebooks/               # Exploratory analysis notebooks
│   ├── Cancer/             # Breast cancer classification
│   ├── Card_Fraud/         # Credit card fraud detection
│   ├── Digits/             # Handwritten digit recognition
│   ├── Flags/              # Country flag analysis
│   ├── Honey/              # Honey production forecasting
│   ├── Medical_Insurance/  # Insurance cost estimation
│   ├── Raisins/            # Raisin variety classification
│   └── income_class/       # Income bracket prediction
│
├── models/                  # Trained model artifacts (.pkl, .joblib)
├── scripts/                 # Training and utility scripts
├── .github/workflows/       # CI/CD automation
├── requirements.txt         # Python dependencies
├── .gitignore              # Git ignore rules
└── LICENSE                 # MIT License


🏥 Cancer Detection System

Problem Statement: Binary classification of breast tumors (Malignant vs. Benign) to assist in early cancer diagnosis.

Business Impact: Early detection significantly improves survival rates. This system achieves high accuracy while minimizing false negatives (missing cancer cases).

📊 Performance Metrics

  • Accuracy: 96.7%
  • Recall (Malignant): 100% (No false negatives!)
  • Precision: 94.2%
  • F1-Score: 0.97
  • ROC-AUC: 0.989
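
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the F1-Score above can be recomputed from the other two figures. The sketch assumes the precision figure refers to the same (malignant) class as the recall; the function name `f1_from` is made up for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported above: precision 94.2%, recall 100%.
f1 = f1_from(0.942, 1.0)
print(round(f1, 2))  # 0.97
```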

🛠️ Technical Stack

  • Algorithms: Random Forest, SVM, Logistic Regression, KNN
  • Feature Engineering: Normalization, dimensionality reduction (PCA)
  • Deployment: FastAPI REST API
  • Testing: pytest with 85%+ coverage
  • Data: Wisconsin Breast Cancer Dataset (569 samples, 30 features)

🎯 Key Features

  • ✅ Ensemble voting classifier for robust predictions
  • ✅ Zero false negatives (critical for cancer screening)
  • ✅ Production-ready API with request validation
  • ✅ Comprehensive unit tests
  • ✅ Model versioning and artifact management
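
A voting ensemble over the four algorithms named in the technical stack could look roughly like the sketch below. This is not the code in `src/cancer/model.py`. Soft voting, default hyperparameters, the 80/20 split, and the random seeds are all assumptions, and the data comes from scikit-learn's built-in copy of the Wisconsin dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # assumed split
)

ensemble = make_pipeline(
    StandardScaler(),  # normalization before the distance-based models
    VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("svm", SVC(probability=True, random_state=42)),
            ("lr", LogisticRegression(max_iter=1000)),
            ("knn", KNeighborsClassifier()),
        ],
        voting="soft",  # average predicted probabilities (an assumption)
    ),
)
ensemble.fit(X_train, y_train)
y_pred = ensemble.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# scikit-learn encodes malignant as 0 in this dataset, hence pos_label=0.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print(f"accuracy={accuracy:.3f}  malignant recall={malignant_recall:.3f}")
```

The printed numbers depend on the split and seed and will not necessarily match the metrics reported above. Pushing false negatives to zero typically means lowering the decision threshold for the malignant class, at some cost in precision.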

🎮 Try Live Demo → | View Project Details → | View Production Code → | API Docs →


💳 Fraud Detection

Problem Statement: Detect fraudulent financial transactions in real-time to prevent monetary losses.

🛠️ Technical Approach

  • Feature Engineering: Transaction ratios, balance differentials, transaction type encoding
  • Class Imbalance Handling: SMOTE oversampling, class weights
  • Model: Logistic Regression (baseline), designed for XGBoost upgrade
  • Evaluation: Precision-Recall curves, confusion matrix, cost-sensitive metrics
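
The class-weight and cost-sensitive parts of this approach can be sketched with scikit-learn alone. SMOTE is left out because it needs the separate imbalanced-learn package. The data is synthetic (it is not the project's transaction dataset), and the 50:1 cost ratio is a hypothetical value chosen only to show the calculation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transaction data: roughly 1% positive ("fraud") class.
X, y = make_classification(
    n_samples=20_000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" weights errors inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Area under the precision-recall curve, computed from predicted probabilities.
scores = model.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, scores)
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()

# Hypothetical costs: a missed fraud is assumed 50x as costly as a false alarm.
COST_FN, COST_FP = 50, 1
total_cost = COST_FN * fn + COST_FP * fp
print(f"PR-AUC={pr_auc:.3f}  FN={fn}  FP={fp}  cost={total_cost}")
```

With fraud this rare, plain accuracy says little, which is why the sketch reports a precision-recall summary and an explicit cost instead.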

📊 Dataset Characteristics

  • Highly imbalanced (fraud is rare: <1% of transactions)
  • Time-series features (transaction steps)
  • Multiple transaction types (PAYMENT, TRANSFER, CASH_OUT)

View Project Details →


✍️ Handwritten Digit Recognition

Problem Statement: Optical recognition of handwritten digits (0-9) for automated document processing.

🛠️ Technical Approach

  • Dimensionality Reduction: PCA for visualization and feature compression
  • Models Compared: SVM, Random Forest, MLPClassifier (Neural Network)
  • Hyperparameter Tuning: GridSearchCV for optimal parameters
  • Dataset: UCI ML hand-written digits (1,797 samples, 8x8 images)

📊 Best Model Performance

  • Algorithm: SVM with RBF kernel
  • Accuracy: ~98%
  • Confusion Matrix Analysis: Detailed digit-pair error patterns
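
The PCA, RBF-kernel SVM, and GridSearchCV combination described above can be sketched on scikit-learn's built-in digits data as follows. The parameter grid, the 3-fold cross-validation, and the 80/20 split are assumptions made for this example and are not taken from the project notebook.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 1,797 samples, 64 pixel features (8x8)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# PCA compresses the 64 pixels; the SVM classifies in the reduced space.
pipeline = Pipeline([
    ("pca", PCA(random_state=0)),
    ("svm", SVC(kernel="rbf")),
])
param_grid = {  # small illustrative grid
    "pca__n_components": [20, 30],
    "svm__C": [1, 10],
    "svm__gamma": ["scale", 0.001],
}
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=1)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"test accuracy={test_accuracy:.3f}")
```

Putting PCA inside the pipeline means it is refit on each cross-validation fold, so the held-out fold never influences the projection.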

View Project Details →


🍯 Honey Production Forecasting

Problem Statement: Predict future honey production trends across U.S. states to assist agricultural planning.

🛠️ Technical Approach

  • Model: Linear Regression (baseline)
  • Feature Engineering: Year-over-year percentage change
  • Data Aggregation: State-level and national trend analysis
  • Evaluation: MSE, R-squared, residual analysis

📊 Key Insights

  • Identified declining production trends in key states
  • Seasonal and economic factor correlations
  • Multi-year forecasting capabilities

View Project Details →


🏴 World Flags Classification

Problem Statement: Predict country characteristics based on flag features (colors, symbols, patterns).

🛠️ Technical Approach

  • Data Source: UCI ML Repository (194 countries, 30 features)
  • Models: Decision Trees, Random Forests, SVM, Neural Networks
  • Feature Types: Numerical (colors, area) and categorical (symbols, religion, language)
  • Evaluation: Cross-validation, classification reports

View Project Details →


🍇 Raisin Classification

Problem Statement: Classify raisin varieties using physical measurements for quality control.

🛠️ Technical Approach

  • Algorithms: Clustering and supervised classification
  • Features: Size, shape, color characteristics
  • Application: Automated agricultural sorting

View Project Details →


💰 Income Prediction

Problem Statement: Predict whether individuals earn above or below $50K based on demographic features.

🛠️ Technical Approach

  • Data: Census income dataset
  • Features: Age, education, occupation, work hours, marital status
  • Models: Classification algorithms with feature importance analysis
  • Evaluation: Accuracy, precision, recall, fairness metrics

View Project Details →


🏥 Medical Insurance Calculator

Problem Statement: Estimate medical insurance costs based on individual health and demographic factors.

🛠️ Technical Approach

  • Design Pattern: Object-Oriented Programming with Enums
  • Features: Age, BMI, smoking status, number of children
  • Validation: Input validation, error handling
  • Code Quality: Type safety, clean architecture

💡 Software Engineering Highlights

This project showcases professional Python development:

  • ✅ Enum types for type safety
  • ✅ Data validation with custom setters
  • ✅ BMI calculation encapsulation
  • ✅ Comprehensive error handling
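
The patterns listed above (an Enum for categorical input, a validating setter, and an encapsulated BMI calculation) might be combined as in the sketch below. The names `SmokingStatus` and `Patient` and the validation bounds are hypothetical, not the project's actual code. No cost formula is shown because the README does not state one.

```python
from enum import Enum


class SmokingStatus(Enum):
    NON_SMOKER = "non_smoker"
    SMOKER = "smoker"


class Patient:
    """Holds validated inputs; every name here is hypothetical."""

    def __init__(self, age: int, weight_kg: float, height_m: float,
                 smoking: SmokingStatus) -> None:
        self.age = age  # routed through the validating setter below
        if weight_kg <= 0 or height_m <= 0:
            raise ValueError("weight and height must be positive")
        if not isinstance(smoking, SmokingStatus):
            raise TypeError("smoking must be a SmokingStatus member")
        self._weight_kg = weight_kg
        self._height_m = height_m
        self.smoking = smoking

    @property
    def age(self) -> int:
        return self._age

    @age.setter
    def age(self, value: int) -> None:
        # Assumed bounds; reject anything that is not a plausible age.
        if not isinstance(value, int) or not 0 <= value <= 120:
            raise ValueError(f"age must be an integer in [0, 120], got {value!r}")
        self._age = value

    @property
    def bmi(self) -> float:
        """Body mass index: weight (kg) divided by height (m) squared."""
        return self._weight_kg / self._height_m ** 2


patient = Patient(age=35, weight_kg=70.0, height_m=1.75,
                  smoking=SmokingStatus.NON_SMOKER)
print(round(patient.bmi, 1))  # 22.9
```

Because the setter runs on every assignment, an invalid value such as `patient.age = -1` raises `ValueError` and leaves the stored age unchanged.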

View Project Details → | View Code →


🛠️ Technical Skills Demonstrated

Machine Learning

  • Supervised Learning: Classification (Binary & Multi-class), Regression
  • Unsupervised Learning: Clustering, PCA
  • Time Series: Trend analysis, Forecasting
  • Imbalanced Data: SMOTE, Class weights, Cost-sensitive learning
  • Model Selection: Cross-validation, Hyperparameter tuning (GridSearchCV)
  • Evaluation: ROC-AUC, Precision-Recall, Confusion matrices

Software Engineering

  • Architecture: Modular design, OOP principles, Separation of concerns
  • API Development: FastAPI, RESTful design, Pydantic validation
  • Testing: pytest, Unit tests, Integration tests, Coverage >80%
  • Code Quality: Type hints, Docstrings, PEP 8, Black formatting
  • CLI Tools: Click framework, argument parsing
  • Version Control: Git, Professional commit messages

MLOps & Deployment

  • Model Serialization: joblib, pickle
  • Experiment Tracking: MLflow integration
  • CI/CD: GitHub Actions, Automated testing
  • Containerization: Docker-ready (in progress)
  • Documentation: Comprehensive READMEs, API docs, Code comments

Data Science Stack

# Core ML Libraries
numpy, pandas, scikit-learn

# Visualization
matplotlib, seaborn

# Deep Learning (planned)
pytorch, tensorflow

# API & Web
fastapi, uvicorn, pydantic

# Testing & Quality
pytest, flake8, black, mypy

📈 Development Roadmap

✅ Completed

  • 8 diverse ML projects across multiple domains
  • Professional repository structure
  • Comprehensive documentation
  • Production code for Cancer Detection
  • REST API implementation
  • Unit testing framework
  • Requirements management
  • MIT License

🚧 In Progress

  • Docker containerization
  • CI/CD pipeline (GitHub Actions)
  • Streamlit demo applications
  • Pre-commit hooks & code quality automation
  • GitHub issue & PR templates
  • Advanced algorithms (XGBoost, Prophet)
  • Model interpretability (SHAP values)

🔮 Planned

  • Kubernetes deployment
  • Model monitoring and drift detection
  • Feature store implementation
  • A/B testing framework
  • AutoML pipeline

🎓 Learning Journey

This portfolio represents my growth in:

  • Machine Learning: From basic models to ensemble methods and production systems
  • Software Engineering: From notebooks to tested, modular, API-driven applications
  • MLOps: Understanding the full ML lifecycle beyond just model training
  • Domain Knowledge: Applying ML to healthcare, finance, agriculture, and more

📊 Project Statistics

  • Total Projects: 8
  • Production APIs: 1 (expanding)
  • Lines of Code: 10,000+
  • Test Coverage: 85%+ (production projects)
  • Datasets Processed: 8+
  • Models Trained: 20+
  • Algorithms Implemented: 15+

🤝 Contact & Collaboration

GitHub: @hubertdomagalaa
Email: [email protected]

💼 Open to opportunities in:

  • Machine Learning Engineer roles
  • Data Scientist positions with ML engineering focus
  • MLOps and production ML systems
  • Collaborative open-source ML projects

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Datasets: UCI Machine Learning Repository, Kaggle, sklearn built-in datasets
  • Libraries: scikit-learn, FastAPI, pytest, and the entire Python data science ecosystem
  • Inspiration: Production ML best practices from industry leaders

⭐ If you find this portfolio valuable, please consider starring the repository!

Last updated: January 2026
