DiazSk/README.md

Zaid Shaikh

Architecting resilient data ecosystems and scalable software systems.

MS Computer Science student at Northeastern University (4.0 GPA) graduating in December 2026. I specialize in designing scalable distributed systems, cloud-native lakehouses, and production-grade pipelines. Beyond simply connecting modern tools, I am deeply committed to building and understanding the foundational architecture of the systems I engineer.

🎯 Actively seeking Fall 2026 Internships/Co-ops and Full-Time opportunities in Data Engineering, SWE/SDE, Analytics Engineering, and BI.

LinkedIn Email Portfolio


📊 GitHub Stats

Zaid's GitHub Stats Top Languages


🛠️ Tech Stack

  • Languages: Python, SQL, Java, Bash
  • Big Data & Streaming: Apache Airflow, Apache Kafka, Apache Flink, Apache Spark (PySpark), dbt, Great Expectations, Delta Lake, Parquet
  • Databases: PostgreSQL, MySQL, Snowflake, Redis, TimescaleDB, DuckDB, Cassandra, DynamoDB
  • Cloud & DevOps: AWS (S3, EC2, Glue, IAM, Redshift, CloudWatch), Azure (Databricks, Data Factory, Data Lake), Terraform, Docker, Kubernetes, Git, GitHub Actions, CI/CD

💼 Experience

Graduate Teaching Assistant — Machine Learning (CS6140) — Northeastern University, Khoury College of Computer Sciences

May 2026 - Present

  • Held weekly office hours debugging student Python implementations of PCA, multiple linear regression, and Ridge/Lasso, working through algorithm internals and scikit-learn pipelines with a graduate cohort.
  • Graded course assignments on a 10 to 12 day turnaround, reviewing model code, train/test splitting logic, and written analyses against the course rubric.

NLP Research Assistant — Northeastern University, Khoury College of Computer Sciences

Sept 2025 — Present

  • Engineered a Composite Semantic Drift Score for a co-authored COLM 2026 paper on LLM paraphrasing by integrating SBERT, METEOR, and ROUGE-L signals into a single weighted index; the score showed cumulative meaning loss exceeding 331% of safety thresholds across 36,827 records.
  • Architected automated evaluation pipelines in Python to process 4,817 complex records, engineering batched scoring mechanisms that replaced manual analysis workflows.
  • Eliminated missing-field errors and achieved 100% data completeness by enforcing strict Pydantic schema contracts and staged quality gates across multi-modal data ingestion pipelines.
  • Optimized embedding throughput and reduced pipeline runtime to under 75 seconds by implementing all-mpnet-base-v2 batch encoding and fingerprint-based cache reuse.
  • Quantified non-linear semantic drift across multi-hop text generation chains by running paired t-test and Wilcoxon signed-rank validation scripts over the full evaluation set, yielding a Hop A to Hop B t-statistic of 213.15.
  • Engineered a scalable multi-model ingestion matrix utilizing directory-driven loaders and metadata injection, automating data processing across 7 distinct domains without manual intervention.
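
To make the weighted-index idea in the first bullet concrete, here is a minimal sketch that folds three similarity scores into one drift value. The weights, the names `Similarity` and `composite_drift`, and the "1 minus weighted similarity" form are assumptions for illustration; they are not the formula from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Similarity:
    """Per-record similarity scores, each assumed to lie in [0, 1]."""
    sbert: float
    meteor: float
    rouge_l: float


# Hypothetical weights; the real weighting is not stated in this README.
WEIGHTS = {"sbert": 0.5, "meteor": 0.25, "rouge_l": 0.25}


def composite_drift(s: Similarity, weights: dict = WEIGHTS) -> float:
    """Return drift as 1 minus the weighted mean similarity (0 means no drift)."""
    total = sum(weights.values())
    weighted = (
        weights["sbert"] * s.sbert
        + weights["meteor"] * s.meteor
        + weights["rouge_l"] * s.rouge_l
    )
    return 1.0 - weighted / total
```

With these assumed weights, `composite_drift(Similarity(0.9, 0.6, 0.5))` comes out at about 0.275, and identical texts (all scores 1.0) give a drift of 0.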

📂 Featured Projects

  • Tech: Java, RabbitMQ, Redis, MySQL, WebSockets, AWS EC2
  • Impact: Engineered a write-behind persistence pipeline sustaining throughput of 21,091 msg/s with zero data loss. Architected CQRS-style read/write separation and optimized read-path latency to 13ms at 1M-row scale.
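
The write-behind pattern mentioned above can be sketched in a few lines. This is an in-process Python illustration, not the project's Java/RabbitMQ code: `WriteBehindBuffer` and the `persist_batch` callback are hypothetical names, and a production version would also flush on a timer rather than only when a batch fills or the buffer closes.

```python
import queue
import threading


class WriteBehindBuffer:
    """Accept writes immediately and persist them later in batches."""

    def __init__(self, persist_batch, batch_size=100):
        self._queue = queue.Queue()
        # Hypothetical callback, e.g. a function that runs a bulk INSERT.
        self._persist_batch = persist_batch
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def write(self, message):
        """Enqueue and return at once; the caller never waits on storage."""
        self._queue.put(message)

    def close(self):
        """Signal shutdown and wait until every queued message is persisted."""
        self._queue.put(None)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._persist_batch(batch)
                batch = []
        if batch:
            self._persist_batch(batch)  # flush the remainder on shutdown
```

The design trade-off is that callers see low write latency while durability depends on the queue surviving until the flush, which is why the real system puts a message broker in front of the database.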

  • Tech: Terraform, AWS S3, Glue, Airflow, PySpark, dbt, Docker
  • Impact: Processed 100GB+ NYC taxi trip records (2.8M rows) through PySpark ETL on AWS Glue. Provisioned infrastructure using Terraform IaC and automated daily batch pipelines via Airflow DAGs.

  • Tech: Kafka, Flink (Java), Redis, PostgreSQL, Docker
  • Impact: Achieved 99% polling reduction via Kafka key-based partitioning and Flink exactly-once processing. Architected a hybrid Redis/TimescaleDB dual-storage system serving 20+ concurrent users with sub-second response times.
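
Key-based partitioning boils down to hashing the record key and taking it modulo the partition count, so every record for one key lands on the same partition and stays ordered. The sketch below uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2), and `partition_for` is a hypothetical helper, not part of any Kafka client.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions).

    CRC32 is an illustrative stand-in; Kafka's default partitioner
    hashes the serialized key with murmur2 instead.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Calling `partition_for("BTC-USD", 6)` repeatedly always returns the same index, which is what lets a downstream Flink operator keep per-symbol state on a single task.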

  • Tech: Python, PostgreSQL, Snowflake, Airflow, Docker, marimo
  • Impact: Designed a Medallion-architecture warehouse (Bronze → Silver → Gold) centralizing 14 sources for 1.6M+ records. Reduced SQL query latency by 90% via query tuning and data normalization.
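
As a small illustration of the Bronze → Silver → Gold flow, the sketch below cleans raw rows into a Silver layer and aggregates them into a Gold layer. The field names (`id`, `region`, `amount`) and both functions are assumptions made up for the example, not the warehouse's actual schema.

```python
from collections import defaultdict


def to_silver(bronze):
    """Clean raw rows: drop incomplete rows, dedupe by id, normalize types."""
    seen = set()
    silver = []
    for row in bronze:
        if row.get("id") is None or row.get("amount") is None:
            continue  # reject rows missing required fields
        if row["id"] in seen:
            continue  # keep only the first occurrence of each id
        seen.add(row["id"])
        silver.append({
            "id": row["id"],
            "region": str(row.get("region", "")).strip().upper(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver):
    """Aggregate cleaned rows into a reporting-ready total per region."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["region"]] += row["amount"]
    return dict(totals)
```

Bronze stays untouched as the raw record of what arrived, so Silver and Gold can be rebuilt whenever the cleaning rules change.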

  • Tech: Apache Airflow, dbt, PostgreSQL, AWS, Terraform, Docker
  • Architecture: S3 Data Lake → dbt transformations → Analytics Mart with SCD Type 2 dimensions
  • Impact: Reduced query time from 4.2s to 1.1s (74% improvement) across 3 data sources and 50K+ events.
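
For readers unfamiliar with SCD Type 2, the idea is to close out the current dimension row and append a new version whenever a tracked attribute changes, instead of overwriting it. The sketch below shows that logic in plain Python; `DimRow`, `apply_scd2`, and the `customer_id`/`city` columns are hypothetical, and the project itself implements this in dbt rather than in application code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[DimRow], customer_id: int,
               new_city: str, as_of: date) -> List[DimRow]:
    """Close the current row and append a new version if the city changed."""
    current = next(
        (r for r in history
         if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == new_city:
            return history  # unchanged attribute, nothing to version
        current.valid_to = as_of  # expire the old version
    history.append(DimRow(customer_id, new_city, as_of))
    return history
```

Because old versions are kept with their validity ranges, a fact row can be joined to the dimension as it looked on the event date.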

📫 Contact


I believe in building things the right way — production-grade code, proper documentation, and solutions that actually work.

Pinned

  1. chatflow-messaging-system

    Scalable CQRS WebSocket messaging system built with Java, RabbitMQ, and Redis. Features a write-behind persistence pipeline sustaining 21,000+ msg/sec.

    Java

  2. Real-Time-Cryptocurrency-Market-Analyzer

    Real-time crypto market analyzer with sub-100ms latency. Apache Kafka → Flink → Redis → TimescaleDB pipeline processing live market data through parallel time windows (1-min/5-min/15-min). Implemen…

    Python

  3. nyc-taxi-data-lakehouse

    A production-ready data engineering solution featuring cloud-based batch processing, infrastructure as code, and analytics-ready data transformations using the NYC TLC Trip Record dataset.

    Python

  4. healthcare-lakehouse-azure

    Azure Medallion lakehouse on 9.66M CMS Medicare provider-service records — PySpark Bronze→Silver→Gold on ADLS Gen2, with Power BI + marimo dashboards surfacing 5 hero billing insights.

    Jupyter Notebook

  5. sql-data-warehouse-project

    Building a modern data warehouse with PostgreSQL Server, including ETL processes, data modeling, and analytics.

    Python

  6. Modern-E-commerce-Analytics-Platform

    Create a scalable analytics infrastructure that processes e-commerce transactions, product catalogs, and user behavior data to enable business intelligence and ML feature engineering.

    Python