I design and build scalable data pipelines, optimize operational workflows, and turn raw data into actionable business insights.
This repository contains my personal portfolio website, built to showcase real-world, production-oriented work across Data Engineering and Product Operations.
The goal is not just to display projects, but to demonstrate:
- System thinking
- Data architecture design
- Operational impact
- Clean, modern user experience
This portfolio reflects how I approach problems as a Data Engineer:
- Containerized data platforms with Docker Compose and orchestrated execution
- Airflow DAGs with parameterized scheduling for stock vs scaled execution
- Medallion architecture (Bronze → Silver → Gold) for raw-to-warehouse data flows
- End-to-end ETL pipeline design (Extract → Transform → Load) with idempotent re-runs
- Distributed processing with PySpark on millions of rows (broadcast joins, explicit caching, year/month partitioning)
- Three-table dimensional Gold layers and star schema warehouses
- Asynchronous, high-concurrency API ingestion with retry and rate-limit discipline
- Analytics-ready data modeling with composite primary keys, DECIMAL precision, and indexed query paths
- Idempotent loads via deterministic surrogate keys, dedup-merge patterns, and row-count validation
- Translating data into business decisions through warehouse-side feature engineering
A containerized on-premise data platform built for a multinational retail scenario, with parameterized stock-vs-scaled execution from the same DAG:
- Architected a 5-service Docker Compose stack (PostgreSQL, custom Spark image, Airflow init/webserver/scheduler) brought up by one command, with health-checked service dependencies and persistent named volumes
- Built a PySpark medallion pipeline (Bronze partitioned Parquet → Silver joined frame → three-table dimensional Gold) with year/month partitioning on fact tables, explicit broadcast hints on small dimensions, and cache strategy on the join chain
- Implemented Apache Airflow orchestration with `Param`-based DAG supporting both scheduled stock runs (analytical truth) and on-demand scaled runs (architecture validation), using the sidecar Spark pattern via `docker exec`
- Designed a three-table dimensional Gold layer (`fact_sales` at transaction grain, `sales_summary` by state-category, `sales_by_month_state` by year-month) with composite primary keys, DECIMAL precision, and read-optimized indexes
- Validated every PostgreSQL load with read-back row-count comparison; zero silent data loss across all loads
- End-to-end runtime: 3m 11s stock (555K source rows, 118K Silver records, 3 Gold tables); validated at 2x scale processing 7.5M Silver records in 13m 14s
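The read-back row-count check in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's code: SQLite from the standard library stands in for PostgreSQL so the snippet runs anywhere, and the `verify_load` helper and the sample rows are hypothetical.

```python
import sqlite3


def verify_load(conn, table, expected_rows):
    """Read the table back and fail loudly if the row count drifted."""
    # Hypothetical helper: the table name is trusted, never user input.
    actual = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    if actual != expected_rows:
        raise ValueError(
            f"{table}: expected {expected_rows} rows, read back {actual}"
        )
    return actual


# SQLite stands in for PostgreSQL here (assumption, for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_summary ("
    " state TEXT, category TEXT, revenue TEXT,"  # PostgreSQL would use NUMERIC
    " PRIMARY KEY (state, category))"  # composite primary key
)
rows = [("SP", "toys", "10.50"), ("RJ", "toys", "3.00"), ("SP", "books", "7.25")]
conn.executemany("INSERT INTO sales_summary VALUES (?, ?, ?)", rows)
conn.commit()

# Compare what the source produced with what the warehouse actually holds.
verified = verify_load(conn, "sales_summary", expected_rows=len(rows))
```

The point of reading back from the database, rather than trusting the writer's own count, is that a mismatch surfaces as an exception instead of a silently short table.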
View project on GitHub | Read the full case study
A distributed ETL pipeline processing 1 million synthetic banking transactions through five disciplined stages into a validated PostgreSQL star schema:
- Designed a 5-stage modular pipeline (Extract → Explore → Clean → Transform → Load) in PySpark
- Built a 4-dimension + 1-fact star schema with deterministic SHA-256 surrogate keys
- Engineered type discipline end-to-end: `DecimalType(18,2)` flows from CSV through to `NUMERIC(18,2)` in PostgreSQL
- Made loads idempotent via temp-table + LEFT JOIN merge pattern; re-runs verify zero duplicate rows
- Validated schema at every transformation stage to catch source drift loudly, not silently
- End-to-end runtime: ~25 minutes on first load, ~9 minutes on idempotent re-run
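Two of the ideas above, deterministic SHA-256 surrogate keys and the temp-table + LEFT JOIN merge, can be sketched together. This is an illustrative sketch under stated assumptions: SQLite stands in for PostgreSQL, and the `surrogate_key` normalization (trim and lowercase), the `merge_new_rows` helper, and the `dim_customer` / `stg_customer` tables are hypothetical names, not the project's actual schema.

```python
import hashlib
import sqlite3


def surrogate_key(*natural_key_parts):
    """Hash the natural key so the same input always yields the same key."""
    # Assumed normalization: trim and lowercase each part before hashing.
    joined = "|".join(str(p).strip().lower() for p in natural_key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def merge_new_rows(conn, rows):
    """Stage rows in a temp table, then insert only keys missing from the target."""
    # Assumes the incoming batch is already deduplicated on customer_key.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS stg_customer (customer_key TEXT, name TEXT)"
    )
    conn.execute("DELETE FROM stg_customer")
    conn.executemany("INSERT INTO stg_customer VALUES (?, ?)", rows)
    before = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    conn.execute(
        """
        INSERT INTO dim_customer (customer_key, name)
        SELECT s.customer_key, s.name
        FROM stg_customer AS s
        LEFT JOIN dim_customer AS d ON d.customer_key = s.customer_key
        WHERE d.customer_key IS NULL
        """
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    return after - before  # number of rows actually inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key TEXT PRIMARY KEY, name TEXT)")
batch = [(surrogate_key("C001"), "Ada"), (surrogate_key("C002"), "Grace")]

first_run = merge_new_rows(conn, batch)   # both rows are new
second_run = merge_new_rows(conn, batch)  # re-run finds nothing to insert
```

Because the key is derived from the data rather than from a sequence, a re-run produces the same keys and the anti-join filters every row out, which is what makes the load safe to repeat.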
View project on GitHub | Read the full case study
A three-stage data engineering pipeline built for a UK grid decarbonization research scenario, processing three years of regional carbon intensity data from a live government API:
- Designed asynchronous `aiohttp` ingestion with semaphore-bounded concurrency and exponential-backoff retry, pulling 1,095 days of regional data from the UK Carbon Intensity API in under 12 minutes with zero rate-limit hits
- Built a medallion architecture (Bronze raw JSON → Silver Parquet → Gold CSV) with three distinct idempotency models, one per layer, matched to each layer's rebuild cost
- Used PySpark to explode deeply nested JSON and pivot 9 fuel types into typed columns: 53,594 raw records expand to 8.7M intermediate rows, then collapse to 945,092 silver records
- Aggregated to 19,728 daily research metrics and loaded to PostgreSQL via a two-stage dedup-merge with composite-key idempotency
- Recovered 32 transient HTTP 500 errors via retry logic during the actual run, with zero permanent failures and one empty-payload edge case caught and skipped
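The ingestion pattern above (semaphore-bounded concurrency plus exponential-backoff retry) can be sketched with the standard library alone. This is a simplified illustration, not the project's implementation: a stub coroutine replaces the real `aiohttp` request so the snippet runs offline, and `TransientError`, `fetch_with_retry`, `ingest`, the delays, and the concurrency limit are all assumed names and values.

```python
import asyncio


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 500."""


async def fetch_with_retry(fetch, day, sem, max_attempts=5, base_delay=0.01):
    """Run one fetch under the semaphore, retrying with exponential backoff."""
    async with sem:  # at most `max_concurrency` fetches are in flight
        for attempt in range(1, max_attempts + 1):
            try:
                return await fetch(day)
            except TransientError:
                if attempt == max_attempts:
                    raise  # give up: surface the failure to the caller
                # Delay doubles on each retry: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def ingest(fetch, days, max_concurrency=10):
    """Fetch every day concurrently, bounded by a shared semaphore."""
    sem = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_retry(fetch, day, sem) for day in days]
    return await asyncio.gather(*tasks)  # results keep the order of `days`


def make_flaky_fetch():
    """Stub fetcher (assumption): fails once per day, then succeeds."""
    seen = set()

    async def fetch(day):
        if day not in seen:
            seen.add(day)
            raise TransientError(f"simulated 500 for {day}")
        return {"day": day, "status": "ok"}

    return fetch


days = ["2023-01-01", "2023-01-02", "2023-01-03"]
results = asyncio.run(ingest(make_flaky_fetch(), days, max_concurrency=2))
```

In this sketch a task keeps its semaphore slot while it backs off, which also throttles the overall request rate during an error burst; releasing the slot during the sleep is a reasonable alternative when throughput matters more.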
View project on GitHub | Read the full case study
A production-style ETL pipeline built to simulate a fintech transaction system using real Brazilian e-commerce data:
- Designed modular pipeline architecture (extract, stage, transform, load)
- Built normalized staging + analytics-ready star schema in PostgreSQL
- Implemented data cleaning, validation, and transformation logic across 9 source CSVs
- Optimized data loading with idempotent inserts and structured logging
Note: PayFlow and Nova Retail both use the public Olist Brazilian e-commerce dataset, taken from different engineering angles. PayFlow demonstrates classical normalized warehousing on the dataset; Nova Retail demonstrates production-grade infrastructure (Docker, Airflow, medallion architecture) on the same data.
View project on GitHub | Read the full case study
A three-schema PostgreSQL warehouse on chocolate sales data, with raw, operational, and analytics layers and a full dimensional model.
View project on GitHub | Read the full case study
A production-style scraping pipeline that walks 60 pages of AliExpress laptop listings, enriches with discount metrics and price bands, and appends only new records to PostgreSQL.
View project on GitHub | Read the full case study
- PySpark (DataFrames, SQL, JDBC)
- Apache Spark (transformation engine)
- Parquet (columnar storage, partitioned writes)
- Apache Airflow (Param-based DAGs, BashOperator, sidecar Spark pattern)
- Docker, Docker Compose (multi-service containerized data platforms)
- Python (pandas, SQLAlchemy, psycopg2, python-dotenv)
- Asynchronous Python (aiohttp, asyncio) for high-concurrency ingestion
- PostgreSQL (star schema, dimensional Gold layers, FK referential integrity, composite keys, indexes)
- SQL (DDL, complex joins, window functions, CTEs)
- JDBC (cross-system data movement, parallel-write candidates)
- ETL Pipelines (modular, idempotent, transactional)
- Dimensional Modeling (Kimball star schema, three-table Gold, conformed dimensions, surrogate keys)
- Medallion Architecture (Bronze, Silver, Gold layered data lakes)
- Deterministic SHA-256 surrogate keys
- Idempotent loads via temp-table + LEFT JOIN merge, two-stage dedup-merge, and three-layer overwrite semantics
- Layered idempotency (file-existence, tracker-file, partition-overwrite, wipe-and-rebuild)
- Year/month partitioning on fact tables for read pruning
- Broadcast joins on bounded-size dimensions
- Explicit caching strategy on join chains (cache-before-action)
- Row-count validation on every Postgres load (read-back verification)
- Asynchronous ingestion with semaphore-bounded concurrency and backoff retry
- Schema validation at every transformation stage
- Type-aware cleaning (DECIMAL preservation for money and metrics)
- Custom observability (timed decorators, structured logs, rotating file handlers, UTF-8 detection)
- Environment-driven configuration with dual-host detection (local vs container)
- Selenium WebDriver
- BeautifulSoup
- Git & GitHub (GitHub Actions, GitHub Pages, CI/CD)
- Jenkins (build pipelines)
- Tableau, Power BI (analytics dashboards)
- Zoho CRM / Mixpanel (product operations)
- AWS S3 (object storage)
- Grafana (observability dashboards)
- VS Code, PyCharm
- dbt (analyst-authored transformations)
- Amazon Redshift, Google BigQuery, Snowflake
- Apache Kafka (event streaming)
- Terraform (infrastructure-as-code)
- GCP, Azure (cloud platforms)
- Hadoop (HDFS distributed file system)
- Responsive and mobile-optimized layout
- Smooth scrolling and interaction design
- Clean, recruiter-focused UI
- Structured project showcase with 6 in-depth case studies (3 featured, 3 secondary)
- 7 engineering principles backed by working code references
- Performance-optimized frontend (no frameworks, pure HTML/CSS/JS)
- Spark cluster deployment (Databricks, EMR, Dataproc) for jobs beyond single-host limits
- dbt-layered analytics models on top of warehouse outputs
- SCD Type 2 patterns for slowly changing dimensions
- Parallel JDBC writes for high-throughput warehouse loads
- Migration tooling (Alembic, Flyway) for safe DDL evolution
- GitHub: https://github.com/yoismail
- Portfolio: https://yoismail.github.io/portfolio/
- LinkedIn: https://www.linkedin.com/in/yomi-ismail
"Data is only valuable when it is structured, reliable, and actionable."
The seven engineering principles that guide my work:
- Schema is the source of truth, not Python
- Idempotency is a contract, not a feature
- Write logs assuming you'll be debugging at 2am
- Validate before you load, not after
- Configuration belongs in environment variables, not code
- Modular ETL beats monolithic scripts, always
- Engineer features in the warehouse, not in dashboards
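As one concrete reading of "Configuration belongs in environment variables, not code", here is a minimal sketch of an environment-driven database config loader. The variable names (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`), the defaults, and the `load_db_config` helper are assumptions for illustration, not the names used in the projects.

```python
import os

# Hypothetical setting names; required ones have no safe default.
REQUIRED = ("DB_USER", "DB_PASSWORD", "DB_NAME")


def load_db_config(env=None):
    """Build connection settings from the environment, failing on gaps."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return {
        "host": env.get("DB_HOST", "localhost"),  # assumed default
        "port": int(env.get("DB_PORT", "5432")),  # PostgreSQL's default port
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "dbname": env["DB_NAME"],
    }


# Passing a plain dict keeps the example independent of the real environment.
config = load_db_config(
    {"DB_USER": "etl", "DB_PASSWORD": "secret", "DB_NAME": "warehouse"}
)
```

Accepting the environment as a parameter keeps the loader testable, and raising on missing values means a misconfigured run stops before it touches the database.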
Read the full Engineering Principles page
This portfolio is continuously evolving as I build real-world, production-grade data engineering solutions.
If you're hiring for Data Engineering, Analytics, or Technical Operations roles, this repository reflects the level of thinking and execution I bring.