👋 Yomi Ismail

Data Engineer | Product Operations Specialist

🚀 I design and build scalable data pipelines, optimize operational workflows, and turn raw data into actionable business insights.

🌐 Live Portfolio

👉 Yomi-Ismail-Portfolio

📌 About This Project

This repository contains my personal portfolio website, built to showcase real-world, production-oriented work across Data Engineering and Product Operations.

The goal is not just to display projects, but to demonstrate:

System thinking
Data architecture design
Operational impact
Clean, modern user experience

🧠 What This Portfolio Demonstrates

This portfolio reflects how I approach problems as a Data Engineer:

🐳 Containerized data platforms with Docker Compose and orchestrated execution
🔁 Airflow DAGs with parameterized scheduling for stock vs scaled execution
🌊 Medallion architecture (Bronze → Silver → Gold) for raw-to-warehouse data flows
🔄 End-to-end ETL pipeline design (Extract → Transform → Load) with idempotent re-runs
⚡ Distributed processing with PySpark on millions of rows (broadcast joins, explicit caching, year/month partitioning)
🏗️ Three-table dimensional Gold layers and star schema warehouses
🔌 Asynchronous, high-concurrency API ingestion with retry and rate-limit discipline
📊 Analytics-ready data modeling with composite primary keys, DECIMAL precision, and indexed query paths
🔒 Idempotent loads via deterministic surrogate keys, dedup-merge patterns, and row-count validation
📈 Translating data into business decisions through warehouse-side feature engineering

🚀 Featured Work

🏢 Nova Retail: Dockerized Data Platform with Airflow-Orchestrated Medallion ETL

A containerized on-premise data platform built for a multinational retail scenario, with parameterized stock-vs-scaled execution from the same DAG:

Architected a 5-service Docker Compose stack (PostgreSQL, custom Spark image, Airflow init/webserver/scheduler) brought up by one command, with health-checked service dependencies and persistent named volumes
Built a PySpark medallion pipeline (Bronze partitioned Parquet → Silver joined frame → three-table dimensional Gold) with year/month partitioning on fact tables, explicit broadcast hints on small dimensions, and cache strategy on the join chain
Implemented Apache Airflow orchestration with a Param-based DAG that supports both scheduled stock runs (analytical truth) and on-demand scaled runs (architecture validation), using the sidecar Spark pattern via docker exec
Designed a three-table dimensional Gold layer (fact_sales at transaction grain, sales_summary by state-category, sales_by_month_state by year-month) with composite primary keys, DECIMAL precision, and read-optimized indexes
Validated every PostgreSQL load with read-back row-count comparison; zero silent data loss across all loads
End-to-end runtime: 3m 11s stock (555K source rows, 118K Silver records, 3 Gold tables); validated at 2x scale processing 7.5M Silver records in 13m 14s
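
The read-back row-count validation described above can be sketched as follows. This is an illustration only, not the project's code: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the function name `load_with_row_count_check` and the column names are hypothetical.

```python
import sqlite3


def load_with_row_count_check(conn, table, rows):
    """Insert rows, read the table back, and fail loudly on a count mismatch."""
    # NOTE: `table` is interpolated into SQL, so it must be a trusted identifier.
    before = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    after = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    loaded = after - before
    if loaded != len(rows):
        # Undo the partial load instead of leaving a silently short table.
        conn.rollback()
        raise RuntimeError(
            f"{table}: expected {len(rows)} rows, read back {loaded}"
        )
    conn.commit()
    return loaded


# Assumed shape: a summary table keyed by (state, category).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_summary ("
    "state TEXT, category TEXT, order_count INTEGER, "
    "PRIMARY KEY (state, category))"
)
rows = [("SP", "toys", 120), ("RJ", "toys", 80)]
print(load_with_row_count_check(conn, "sales_summary", rows))
```

Comparing the count before and after the insert keeps the check valid for append-style loads as well as loads into an empty table.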

👉 View project on GitHub 👉 Read the full case study

🏦 FibbieBanks: 1M-Row PySpark ETL & Star Schema Warehouse

A distributed ETL pipeline processing 1 million synthetic banking transactions through five disciplined stages into a validated PostgreSQL star schema:

Designed a 5-stage modular pipeline (Extract → Explore → Clean → Transform → Load) in PySpark
Built a 4-dimension + 1-fact star schema with deterministic SHA-256 surrogate keys
Engineered type discipline end-to-end: DecimalType(18,2) flows from CSV through to NUMERIC(18,2) in PostgreSQL
Made loads idempotent via temp-table + LEFT JOIN merge pattern; re-runs verify zero duplicate rows
Validated schema at every transformation stage to catch source drift loudly, not silently
End-to-end runtime: ~25 minutes on first load, ~9 minutes on idempotent re-run
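
As a rough illustration of deterministic SHA-256 surrogate keys, the sketch below hashes normalized natural-key parts with Python's hashlib. The function name `surrogate_key`, the `|` separator, and the trim-and-lowercase normalization are assumptions made for this example; the pipeline itself builds its keys in PySpark and may normalize differently.

```python
import hashlib


def surrogate_key(*natural_key_parts):
    """Hash normalized natural-key parts into a stable 64-character hex key."""
    # Trim and lowercase each part so cosmetic differences do not change the key.
    normalized = "|".join(str(part).strip().lower() for part in natural_key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


key_a = surrogate_key("CUST-001", "Lagos")
key_b = surrogate_key(" cust-001 ", "LAGOS")
print(key_a == key_b)  # True: same natural key after normalization
print(len(key_a))      # 64
```

Because the key depends only on the input values, re-running the load produces the same keys, which is what lets a later merge step recognize rows it has already written.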

👉 View project on GitHub 👉 Read the full case study

🔬 XTD Research Labs: Async Ingestion & PySpark Medallion Pipeline

A three-stage data engineering pipeline built for a UK grid decarbonization research scenario, processing three years of regional carbon intensity data from a live government API:

Designed asynchronous aiohttp ingestion with semaphore-bounded concurrency and exponential-backoff retry, pulling 1,095 days of regional data from the UK Carbon Intensity API in under 12 minutes with zero rate-limit hits
Built a medallion architecture (Bronze raw JSON → Silver Parquet → Gold CSV) with three distinct idempotency models, one per layer, matched to each layer's rebuild cost
Used PySpark to explode deeply nested JSON and pivot 9 fuel types into typed columns: 53,594 raw records expand to 8.7M intermediate rows, then collapse to 945,092 silver records
Aggregated to 19,728 daily research metrics and loaded to PostgreSQL via a two-stage dedup-merge with composite-key idempotency
Recovered 32 transient HTTP 500 errors via retry logic during the actual run, with zero permanent failures and one empty-payload edge case caught and skipped
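
The ingestion pattern above (semaphore-bounded concurrency with exponential-backoff retry) can be sketched with asyncio alone. This is a minimal sketch under stated assumptions: the real pipeline uses aiohttp against the UK Carbon Intensity API, whereas here the HTTP call is replaced by a hypothetical stub coroutine so the example runs offline. `TransientError`, `fetch_with_retry`, `ingest`, and `make_flaky_fetch` are illustrative names.

```python
import asyncio
import random


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 500."""


async def fetch_with_retry(fetch, key, semaphore, max_attempts=4, base_delay=0.01):
    """Call fetch(key) under the semaphore, retrying with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            async with semaphore:  # at most N requests in flight at once
                return await fetch(key)
        except TransientError:
            if attempt == max_attempts:
                raise
            # Sleep outside the semaphore: base, 2x base, 4x base, plus jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def ingest(fetch, keys, max_concurrency=5):
    """Fetch every key concurrently, bounded by max_concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_retry(fetch, key, semaphore) for key in keys]
    return await asyncio.gather(*tasks)  # results keep the order of `keys`


def make_flaky_fetch(fail_first_for):
    """Hypothetical stub: fails once for the given keys, then succeeds."""
    seen = set()

    async def fetch(key):
        if key in fail_first_for and key not in seen:
            seen.add(key)
            raise TransientError(key)
        await asyncio.sleep(0)
        return {"day": key, "records": []}

    return fetch


days = ["2023-01-01", "2023-01-02", "2023-01-03"]
results = asyncio.run(ingest(make_flaky_fetch({"2023-01-02"}), days))
print([r["day"] for r in results])
```

Releasing the semaphore before the backoff sleep means a retrying request does not hold a concurrency slot while it waits.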

👉 View project on GitHub 👉 Read the full case study

💳 PayFlow: End-to-End ETL & Data Warehouse

A production-style ETL pipeline built to simulate a fintech transaction system using real Brazilian e-commerce data:

Designed modular pipeline architecture (extract, stage, transform, load)
Built normalized staging + analytics-ready star schema in PostgreSQL
Implemented data cleaning, validation, and transformation logic across 9 source CSVs
Optimized data loading with idempotent inserts and structured logging

Note: PayFlow and Nova Retail both use the public Olist Brazilian e-commerce dataset, approached from different engineering angles. PayFlow demonstrates classical normalized warehousing; Nova Retail demonstrates production-grade infrastructure (Docker, Airflow, medallion architecture) on the same data.

👉 View project on GitHub 👉 Read the full case study

🍫 ChocoDelight: Layered Data Platform

A three-schema PostgreSQL warehouse on chocolate sales data, with raw, operational, and analytics layers and a full dimensional model.

👉 View project on GitHub 👉 Read the full case study

🛒 AliExpress Laptop ETL: Live Web Scraping Pipeline

A production-style scraping pipeline that walks 60 pages of AliExpress laptop listings, enriches with discount metrics and price bands, and appends only new records to PostgreSQL.
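
The enrichment and append-only steps might look roughly like the sketch below. Everything specific here is an assumption for illustration: the field names (`product_id`, `price`, `original_price`), the price-band thresholds, and the helper names `enrich` and `select_new` are not taken from the project.

```python
from decimal import Decimal


def enrich(listing):
    """Add a discount percentage and a price band to one scraped listing."""
    original = Decimal(listing["original_price"])
    current = Decimal(listing["price"])
    if original > 0:
        discount_pct = ((original - current) / original * 100).quantize(Decimal("0.01"))
    else:
        discount_pct = Decimal("0.00")
    # Hypothetical band thresholds, chosen only for this example.
    if current < 300:
        band = "budget"
    elif current < 800:
        band = "mid"
    else:
        band = "premium"
    return {**listing, "discount_pct": discount_pct, "price_band": band}


def select_new(listings, existing_ids):
    """Keep only listings whose product_id is not already stored."""
    return [item for item in listings if item["product_id"] not in existing_ids]


scraped = [
    {"product_id": "A1", "price": "400.00", "original_price": "500.00"},
    {"product_id": "B2", "price": "950.00", "original_price": "950.00"},
]
already_loaded = {"B2"}
to_append = [enrich(item) for item in select_new(scraped, already_loaded)]
print(to_append[0]["discount_pct"], to_append[0]["price_band"])  # 20.00 mid
```

Filtering against the set of stored IDs before inserting is one simple way to make repeated scrapes append only unseen records.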

👉 View project on GitHub 👉 Read the full case study

⚙️ Tech Stack

Distributed & Big Data

PySpark (DataFrames, SQL, JDBC)
Apache Spark (transformation engine)
Parquet (columnar storage, partitioned writes)

Orchestration & Containerization

Apache Airflow (Param-based DAGs, BashOperator, sidecar Spark pattern)
Docker, Docker Compose (multi-service containerized data platforms)

Data Engineering

Python (pandas, SQLAlchemy, psycopg2, python-dotenv)
Asynchronous Python (aiohttp, asyncio) for high-concurrency ingestion
PostgreSQL (star schema, dimensional Gold layers, FK referential integrity, composite keys, indexes)
SQL (DDL, complex joins, window functions, CTEs)
JDBC (cross-system data movement, parallel-write candidates)
ETL Pipelines (modular, idempotent, transactional)
Dimensional Modeling (Kimball star schema, three-table Gold, conformed dimensions, surrogate keys)
Medallion Architecture (Bronze, Silver, Gold layered data lakes)

Data Engineering Patterns

Deterministic SHA-256 surrogate keys
Idempotent loads via temp-table + LEFT JOIN merge, two-stage dedup-merge, and three-layer overwrite semantics
Layered idempotency (file-existence, tracker-file, partition-overwrite, wipe-and-rebuild)
Year/month partitioning on fact tables for read pruning
Broadcast joins on bounded-size dimensions
Explicit caching strategy on join chains (cache-before-action)
Row-count validation on every Postgres load (read-back verification)
Asynchronous ingestion with semaphore-bounded concurrency and backoff retry
Schema validation at every transformation stage
Type-aware cleaning (DECIMAL preservation for money and metrics)
Custom observability (timed decorators, structured logs, rotating file handlers, UTF-8 detection)
Environment-driven configuration with dual-host detection (local vs container)
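
The temp-table + LEFT JOIN merge listed above can be illustrated with a small, self-contained sketch. It uses sqlite3 in place of PostgreSQL so it runs anywhere, and the table names (`dim_customer`, `stg_dim_customer`) and the function `merge_new_rows` are hypothetical. It also assumes each batch carries unique keys.

```python
import sqlite3


def merge_new_rows(conn, rows):
    """Stage rows in a temp table, then insert only keys missing from the target."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS stg_dim_customer "
        "(customer_key TEXT, name TEXT)"
    )
    conn.execute("DELETE FROM stg_dim_customer")  # start each batch clean
    conn.executemany("INSERT INTO stg_dim_customer VALUES (?, ?)", rows)

    before = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    # Anti-join: staged rows with no match in the target are the new ones.
    conn.execute(
        """
        INSERT INTO dim_customer (customer_key, name)
        SELECT s.customer_key, s.name
        FROM stg_dim_customer AS s
        LEFT JOIN dim_customer AS d ON d.customer_key = s.customer_key
        WHERE d.customer_key IS NULL
        """
    )
    after = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    conn.commit()
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key TEXT PRIMARY KEY, name TEXT)")
batch = [("k1", "Ada"), ("k2", "Bola")]
print(merge_new_rows(conn, batch))  # 2 on first load
print(merge_new_rows(conn, batch))  # 0 on re-run: nothing duplicated
```

Running the same batch twice inserts nothing the second time, which is the idempotency property the pattern is meant to provide.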

Web Scraping & Automation

Selenium WebDriver
BeautifulSoup

Tools & Workflow

Git & GitHub (GitHub Actions, GitHub Pages, CI/CD)
Jenkins (build pipelines)
Tableau, Power BI (analytics dashboards)
Zoho CRM / Mixpanel (product operations)
AWS S3 (object storage)
Grafana (observability dashboards)
VS Code, PyCharm

Currently Learning

dbt (analyst-authored transformations)
Amazon Redshift, Google BigQuery, Snowflake
Apache Kafka (event streaming)
Terraform (infrastructure-as-code)
GCP, Azure (cloud platforms)
Hadoop (HDFS distributed file system)

🎨 Key Features of the Website

Responsive and mobile-optimized layout
Smooth scrolling and interaction design
Clean, recruiter-focused UI
Structured project showcase with 6 in-depth case studies (3 featured, 3 secondary)
7 engineering principles backed by working code references
Performance-optimized frontend (no frameworks, pure HTML/CSS/JS)

📈 What I'm Currently Building

Spark cluster deployment (Databricks, EMR, Dataproc) for jobs beyond single-host limits
dbt-layered analytics models on top of warehouse outputs
SCD Type 2 patterns for slowly changing dimensions
Parallel JDBC writes for high-throughput warehouse loads
Migration tooling (Alembic, Flyway) for safe DDL evolution

🤝 Connect With Me

💼 GitHub: https://github.com/yoismail
🌐 Portfolio: https://yoismail.github.io/portfolio/
📧 Email: [email protected]
🔗 LinkedIn: https://www.linkedin.com/in/yomi-ismail

🧠 Philosophy

"Data is only valuable when it is structured, reliable, and actionable."

The seven engineering principles that guide my work:

Schema is the source of truth, not Python
Idempotency is a contract, not a feature
Write logs assuming you'll be debugging at 2am
Validate before you load, not after
Configuration belongs in environment variables, not code
Modular ETL beats monolithic scripts, always
Engineer features in the warehouse, not in dashboards

👉 Read the full Engineering Principles page

⭐ Final Note

This portfolio is continuously evolving as I build real-world, production-grade data engineering solutions.

If you're hiring for Data Engineering, Analytics, or Technical Operations roles, this repository reflects the level of thinking and execution I bring.
