yoismail/portfolio

👋 Yomi Ismail

Data Engineer | Product Operations Specialist

🚀 I design and build scalable data pipelines, optimize operational workflows, and turn raw data into actionable business insights.


🌐 Live Portfolio

👉 Yomi-Ismail-Portfolio


📌 About This Project

This repository contains my personal portfolio website, built to showcase real-world, production-oriented work across Data Engineering and Product Operations.

The goal is not just to display projects, but to demonstrate:

  • System thinking
  • Data architecture design
  • Operational impact
  • Clean, modern user experience

🧠 What This Portfolio Demonstrates

This portfolio reflects how I approach problems as a Data Engineer:

  • 🐳 Containerized data platforms with Docker Compose and orchestrated execution
  • πŸ” Airflow DAGs with parameterized scheduling for stock vs scaled execution
  • 🌊 Medallion architecture (Bronze → Silver → Gold) for raw-to-warehouse data flows
  • 🔄 End-to-end ETL pipeline design (Extract → Transform → Load) with idempotent re-runs
  • ⚡ Distributed processing with PySpark on millions of rows (broadcast joins, explicit caching, year/month partitioning)
  • 🏗️ Three-table dimensional Gold layers and star schema warehouses
  • 🔌 Asynchronous, high-concurrency API ingestion with retry and rate-limit discipline
  • 📊 Analytics-ready data modeling with composite primary keys, DECIMAL precision, and indexed query paths
  • 🔒 Idempotent loads via deterministic surrogate keys, dedup-merge patterns, and row-count validation
  • 📈 Translating data into business decisions through warehouse-side feature engineering

🚀 Featured Work

🏢 Nova Retail: Dockerized Data Platform with Airflow-Orchestrated Medallion ETL

A containerized on-premise data platform built for a multinational retail scenario, with parameterized stock-vs-scaled execution from the same DAG:

  • Architected a 5-service Docker Compose stack (PostgreSQL, custom Spark image, Airflow init/webserver/scheduler) brought up by one command, with health-checked service dependencies and persistent named volumes
  • Built a PySpark medallion pipeline (Bronze partitioned Parquet → Silver joined frame → three-table dimensional Gold) with year/month partitioning on fact tables, explicit broadcast hints on small dimensions, and a cache strategy on the join chain
  • Implemented Apache Airflow orchestration with a Param-based DAG that supports both scheduled stock runs (analytical truth) and on-demand scaled runs (architecture validation), using the sidecar Spark pattern via docker exec
  • Designed a three-table dimensional Gold layer (fact_sales at transaction grain, sales_summary by state-category, sales_by_month_state by year-month) with composite primary keys, DECIMAL precision, and read-optimized indexes
  • Validated every PostgreSQL load with read-back row-count comparison; zero silent data loss across all loads
  • End-to-end runtime: 3m 11s stock (555K source rows, 118K Silver records, 3 Gold tables); validated at 2x scale processing 7.5M Silver records in 13m 14s
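
The read-back row-count check in the list above can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and the helper name `load_with_readback` and the `order_id`/`amount` columns are hypothetical.

```python
import sqlite3


def load_with_readback(conn, table, rows):
    """Insert rows, then read the table count back and compare it to what was sent."""
    before = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    conn.executemany(
        f"INSERT INTO {table} (order_id, amount) VALUES (?, ?)", rows
    )
    conn.commit()
    after = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    loaded = after - before
    if loaded != len(rows):
        # Fail loudly: a mismatch means rows were dropped or duplicated silently.
        raise RuntimeError(f"{table}: sent {len(rows)} rows, read back {loaded}")
    return loaded


# In-memory database standing in for the warehouse (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_id TEXT PRIMARY KEY, amount NUMERIC)")
loaded = load_with_readback(conn, "fact_sales", [("o-1", 10.5), ("o-2", 20.0)])
```

Comparing the count delta rather than the absolute count lets the same check work for appends as well as first loads.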

👉 View project on GitHub 👉 Read the full case study


🏦 FibbieBanks: 1M-Row PySpark ETL & Star Schema Warehouse

A distributed ETL pipeline processing 1 million synthetic banking transactions through five disciplined stages into a validated PostgreSQL star schema:

  • Designed a 5-stage modular pipeline (Extract → Explore → Clean → Transform → Load) in PySpark
  • Built a 4-dimension + 1-fact star schema with deterministic SHA-256 surrogate keys
  • Engineered type discipline end-to-end: DecimalType(18,2) flows from CSV through to NUMERIC(18,2) in PostgreSQL
  • Made loads idempotent via temp-table + LEFT JOIN merge pattern; re-runs verify zero duplicate rows
  • Validated schema at every transformation stage to catch source drift loudly, not silently
  • End-to-end runtime: ~25 minutes on first load, ~9 minutes on idempotent re-run
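
A minimal sketch of how deterministic SHA-256 surrogate keys and a temp-table + LEFT JOIN merge can combine to make a load idempotent. It is an illustration under assumptions, not the project's code: `sqlite3` stands in for PostgreSQL, and the `dim_customer` table, its columns, and the key normalisation (trim and lowercase) are hypothetical.

```python
import hashlib
import sqlite3


def surrogate_key(*natural_key_parts):
    """Hash the natural key so the same input always yields the same key."""
    # Normalising with strip/lower is an assumption made for this sketch.
    joined = "|".join(str(p).strip().lower() for p in natural_key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def merge_customers(conn, rows):
    """Stage rows in a temp table, then insert only keys missing from the target."""
    before = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    conn.execute("DROP TABLE IF EXISTS tmp_dim_customer")
    conn.execute("CREATE TEMP TABLE tmp_dim_customer (customer_key TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO tmp_dim_customer VALUES (?, ?)",
        [(surrogate_key(cid), name) for cid, name in rows],
    )
    # Anti-join: keep staged rows whose key has no match in the target table.
    conn.execute(
        "INSERT INTO dim_customer (customer_key, name) "
        "SELECT t.customer_key, t.name "
        "FROM tmp_dim_customer AS t "
        "LEFT JOIN dim_customer AS d ON d.customer_key = t.customer_key "
        "WHERE d.customer_key IS NULL"
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key TEXT PRIMARY KEY, name TEXT)")
batch = [("c1", "Ada"), ("c2", "Bo")]
first_run = merge_customers(conn, batch)   # inserts both rows
second_run = merge_customers(conn, batch)  # re-run inserts nothing
```

Because the key is derived from the data rather than from a sequence, a re-run produces the same keys and the anti-join filters every row out.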

👉 View project on GitHub 👉 Read the full case study


🔬 XTD Research Labs: Async Ingestion & PySpark Medallion Pipeline

A three-stage data engineering pipeline built for a UK grid decarbonization research scenario, processing three years of regional carbon intensity data from a live government API:

  • Designed asynchronous aiohttp ingestion with semaphore-bounded concurrency and exponential-backoff retry, pulling 1,095 days of regional data from the UK Carbon Intensity API in under 12 minutes with zero rate-limit hits
  • Built a medallion architecture (Bronze raw JSON → Silver Parquet → Gold CSV) with three distinct idempotency models, one per layer, matched to each layer's rebuild cost
  • Used PySpark to explode deeply nested JSON and pivot 9 fuel types into typed columns: 53,594 raw records expand to 8.7M intermediate rows, then collapse to 945,092 silver records
  • Aggregated to 19,728 daily research metrics and loaded to PostgreSQL via a two-stage dedup-merge with composite-key idempotency
  • Recovered 32 transient HTTP 500 errors via retry logic during the actual run, with zero permanent failures and one empty-payload edge case caught and skipped

👉 View project on GitHub 👉 Read the full case study


💳 PayFlow: End-to-End ETL & Data Warehouse

A production-style ETL pipeline built to simulate a fintech transaction system using real Brazilian e-commerce data:

  • Designed modular pipeline architecture (extract, stage, transform, load)
  • Built normalized staging + analytics-ready star schema in PostgreSQL
  • Implemented data cleaning, validation, and transformation logic across 9 source CSVs
  • Optimized data loading with idempotent inserts and structured logging

Note: PayFlow and Nova Retail both use the public Olist Brazilian e-commerce dataset, approached from different engineering angles. PayFlow demonstrates classical normalized warehousing; Nova Retail demonstrates production-grade infrastructure (Docker, Airflow, medallion architecture) on the same data.

👉 View project on GitHub 👉 Read the full case study


🍫 ChocoDelight: Layered Data Platform

A three-schema PostgreSQL warehouse on chocolate sales data, with raw, operational, and analytics layers and a full dimensional model.

👉 View project on GitHub 👉 Read the full case study


🛒 AliExpress Laptop ETL: Live Web Scraping Pipeline

A production-style scraping pipeline that walks 60 pages of AliExpress laptop listings, enriches with discount metrics and price bands, and appends only new records to PostgreSQL.
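
As a rough illustration of the enrichment and append-only steps, the sketch below computes a discount percentage, assigns a price band, and keeps only records whose id has not been stored yet. The field names, the band thresholds, and the helper names are all hypothetical; the project's actual rules are not stated here.

```python
def price_band(price):
    """Bucket a price into a coarse band (thresholds are illustrative only)."""
    if price < 300:
        return "budget"
    if price < 800:
        return "mid"
    return "premium"


def enrich(listing):
    """Return a copy of the listing with discount_pct and price_band added."""
    original = listing["original_price"]
    current = listing["price"]
    # Guard against a zero or missing list price to avoid division by zero.
    discount_pct = round((original - current) / original * 100, 1) if original else 0.0
    return {**listing, "discount_pct": discount_pct, "price_band": price_band(current)}


def only_new(listings, existing_ids):
    """Keep listings whose id is not already stored (append-only load)."""
    return [item for item in listings if item["id"] not in existing_ids]


scraped = [
    {"id": "a1", "price": 450.0, "original_price": 600.0},
    {"id": "b2", "price": 950.0, "original_price": 950.0},
]
to_insert = [enrich(item) for item in only_new(scraped, existing_ids={"b2"})]
```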

👉 View project on GitHub 👉 Read the full case study


βš™οΈ Tech Stack

Distributed & Big Data

  • PySpark (DataFrames, SQL, JDBC)
  • Apache Spark (transformation engine)
  • Parquet (columnar storage, partitioned writes)

Orchestration & Containerization

  • Apache Airflow (Param-based DAGs, BashOperator, sidecar Spark pattern)
  • Docker, Docker Compose (multi-service containerized data platforms)

Data Engineering

  • Python (pandas, SQLAlchemy, psycopg2, python-dotenv)
  • Asynchronous Python (aiohttp, asyncio) for high-concurrency ingestion
  • PostgreSQL (star schema, dimensional Gold layers, FK referential integrity, composite keys, indexes)
  • SQL (DDL, complex joins, window functions, CTEs)
  • JDBC (cross-system data movement, parallel-write candidates)
  • ETL Pipelines (modular, idempotent, transactional)
  • Dimensional Modeling (Kimball star schema, three-table Gold, conformed dimensions, surrogate keys)
  • Medallion Architecture (Bronze, Silver, Gold layered data lakes)

Data Engineering Patterns

  • Deterministic SHA-256 surrogate keys
  • Idempotent loads via temp-table + LEFT JOIN merge, two-stage dedup-merge, and three-layer overwrite semantics
  • Layered idempotency (file-existence, tracker-file, partition-overwrite, wipe-and-rebuild)
  • Year/month partitioning on fact tables for read pruning
  • Broadcast joins on bounded-size dimensions
  • Explicit caching strategy on join chains (cache-before-action)
  • Row-count validation on every Postgres load (read-back verification)
  • Asynchronous ingestion with semaphore-bounded concurrency and backoff retry
  • Schema validation at every transformation stage
  • Type-aware cleaning (DECIMAL preservation for money and metrics)
  • Custom observability (timed decorators, structured logs, rotating file handlers, UTF-8 detection)
  • Environment-driven configuration with dual-host detection (local vs container)
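
One of the observability patterns above, a timed decorator that emits a structured log line, can be sketched with the standard library alone. The logger name `pipeline`, the `key=value` log format, and the `transform` stage are assumptions made for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("pipeline")


def timed(stage):
    """Log how long the wrapped function took, whether it succeeded or raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "failed"
                raise
            finally:
                # Runs on both success and failure, so every stage is accounted for.
                elapsed = time.perf_counter() - start
                logger.info("stage=%s status=%s seconds=%.3f", stage, status, elapsed)
        return wrapper
    return decorator


@timed("transform")
def transform(rows):
    return [value * 2 for value in rows]
```

Attaching a `logging.handlers.RotatingFileHandler` to the same logger would add file rotation without changing the decorated functions.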

Web Scraping & Automation

  • Selenium WebDriver
  • BeautifulSoup

Tools & Workflow

  • Git & GitHub (GitHub Actions, GitHub Pages, CI/CD)
  • Jenkins (build pipelines)
  • Tableau, Power BI (analytics dashboards)
  • Zoho CRM / Mixpanel (product operations)
  • AWS S3 (object storage)
  • Grafana (observability dashboards)
  • VS Code, PyCharm

Currently Learning

  • dbt (analyst-authored transformations)
  • Amazon Redshift, Google BigQuery, Snowflake
  • Apache Kafka (event streaming)
  • Terraform (infrastructure-as-code)
  • GCP, Azure (cloud platforms)
  • Hadoop (HDFS, distributed file system)

🎨 Key Features of the Website

  • Responsive and mobile-optimized layout
  • Smooth scrolling and interaction design
  • Clean, recruiter-focused UI
  • Structured project showcase with 6 in-depth case studies (3 featured, 3 secondary)
  • 7 engineering principles backed by working code references
  • Performance-optimized frontend (no frameworks, pure HTML/CSS/JS)

📈 What I'm Currently Building

  • Spark cluster deployment (Databricks, EMR, Dataproc) for jobs beyond single-host limits
  • dbt-layered analytics models on top of warehouse outputs
  • SCD Type 2 patterns for slowly changing dimensions
  • Parallel JDBC writes for high-throughput warehouse loads
  • Migration tooling (Alembic, Flyway) for safe DDL evolution

🤝 Connect With Me


🧠 Philosophy

"Data is only valuable when it is structured, reliable, and actionable."

The seven engineering principles that guide my work:

  1. Schema is the source of truth, not Python
  2. Idempotency is a contract, not a feature
  3. Write logs assuming you'll be debugging at 2am
  4. Validate before you load, not after
  5. Configuration belongs in environment variables, not code
  6. Modular ETL beats monolithic scripts, always
  7. Engineer features in the warehouse, not in dashboards
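
Principle 5 can be illustrated with a small loader that reads database settings from environment variables and picks a default host depending on whether the code runs locally or in a container. This is a sketch under assumptions: the variable names (`DB_HOST`, `DB_NAME`, `RUNNING_IN_CONTAINER`, and so on) and the `postgres` service hostname are hypothetical, not taken from the projects.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DbConfig:
    host: str
    port: int
    name: str
    user: str
    password: str


def load_db_config(env=os.environ):
    """Build a DbConfig from environment variables, failing fast on missing ones."""
    # Hypothetical flag: set inside the container image, absent on a laptop.
    in_container = env.get("RUNNING_IN_CONTAINER", "").lower() in {"1", "true", "yes"}
    default_host = "postgres" if in_container else "localhost"
    missing = [key for key in ("DB_NAME", "DB_USER", "DB_PASSWORD") if not env.get(key)]
    if missing:
        raise KeyError("missing required environment variables: " + ", ".join(missing))
    return DbConfig(
        host=env.get("DB_HOST", default_host),
        port=int(env.get("DB_PORT", "5432")),
        name=env["DB_NAME"],
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
    )


# Usage with an explicit mapping, so the example does not depend on the real environment.
example_env = {"DB_NAME": "warehouse", "DB_USER": "etl", "DB_PASSWORD": "secret"}
local_cfg = load_db_config(example_env)
container_cfg = load_db_config({**example_env, "RUNNING_IN_CONTAINER": "true"})
```

Passing the environment as a parameter keeps the loader testable and keeps credentials out of the code itself.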

👉 Read the full Engineering Principles page


⭐ Final Note

This portfolio is continuously evolving as I build real-world, production-grade data engineering solutions.

If you're hiring for Data Engineering, Analytics, or Technical Operations roles, this repository reflects the level of thinking and execution I bring.

About

Data Engineer focused on ETL pipelines, data modeling, and transforming raw data into scalable, analytics-ready systems.
