Data Engineer · Azure Databricks · Lakehouse Architecture · MLflow · PySpark
I build data pipelines that survive production.
Currently at Axtria (Bengaluru), designing Spark-based Lakehouse systems for pharmaceutical clients, serving 300+ sales reps across 10 markets with ML-driven HCP engagement recommendations via Veeva CRM.
- Specialise in medallion Lakehouse architecture (Bronze → Silver → Gold) with Unity Catalog governance across DEV / UAT / PROD
- Reduced Spark ETL execution time by 60% through partition strategy refinement and join optimisation
- Build ML-integrated data workflows, from raw ingestion to model scoring pipelines consumed by downstream CRM systems
- Automate everything: Databricks Asset Bundles + GitHub Actions = zero-touch, repeatable deployments
Production-grade data product: CSV ingestion → Medallion Lakehouse → ML segmentation → AI-assisted dashboard → automated report delivery → live REST API. Fully automated. No manual steps.
Live API: https://databricks-asset-bundle-deployment.onrender.com
Swagger UI: https://databricks-asset-bundle-deployment.onrender.com/docs
- Architected a fully automated, end-to-end Databricks data product covering CSV ingestion, Bronze → Silver → Gold medallion transformation, ML training, dashboard analytics, and live API serving, deployed with zero manual steps via GitHub Actions CI/CD.
- Implemented idempotent Delta MERGE-based ETL using PySpark alongside a parallel Delta Live Tables (DLT) pipeline with `@dlt.expect` constraints for declarative data quality enforcement and pipeline lineage tracking.
- Trained and registered a scikit-learn KMeans customer segmentation model in Unity Catalog Model Registry via MLflow, evaluating cluster quality with silhouette score and the elbow method across k=2–6.
- Built a Databricks SQL Dashboard ("Customer Intelligence Dashboard") with 4 visualisations (top customers, revenue by city, recency distribution, and ML segment breakdown), integrated with Databricks Genie for natural language querying; configured automated hourly report delivery to subscribed stakeholders after pipeline completion.
- Deployed a FastAPI service on Render.com backed by Databricks Serverless (Spark Connect) for live query execution, with API key authentication and Swagger UI. Automated full deployment via Databricks Asset Bundles (DAB) and GitHub Actions across DEV/PROD with approval gates, ruff linting, and pytest smoke tests.
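
For readers unfamiliar with the idempotent MERGE pattern mentioned above, here is a minimal sketch of how such an upsert might look with the Delta Lake Python API. It is an illustration under assumptions, not the project's actual code: the function names, the `t`/`s` aliases, and the table and key names passed in are all hypothetical.

```python
from typing import Sequence


def build_merge_condition(keys: Sequence[str], target: str = "t", source: str = "s") -> str:
    """Build the ON clause that matches target and source rows on the key columns."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert_to_delta(spark, updates_df, table_name: str, keys: Sequence[str]) -> None:
    """Merge updates_df into an existing Delta table, keyed on `keys` (hypothetical helper)."""
    # Imported lazily so this module can be loaded without delta-spark installed.
    from delta.tables import DeltaTable

    # Keep one row per key (an arbitrary one): MERGE raises an error when
    # several source rows try to update the same target row.
    deduped = updates_df.dropDuplicates(list(keys))

    target = DeltaTable.forName(spark, table_name)
    (
        target.alias("t")
        .merge(deduped.alias("s"), build_merge_condition(keys))
        .whenMatchedUpdateAll()      # existing keys are overwritten in place
        .whenNotMatchedInsertAll()   # new keys are appended
        .execute()
    )
```

Because matched keys are updated rather than appended, re-running the same batch leaves the target table unchanged, which is what makes a retried job safe.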
| Component | Stack |
|---|---|
| Medallion ETL (Bronze → Silver → Gold) | PySpark · Delta Lake · Delta MERGE |
| Declarative pipeline with data quality | Delta Live Tables · `@dlt.expect` |
| KMeans customer segmentation | scikit-learn · MLflow · Unity Catalog Model Registry |
| Customer Intelligence Dashboard | Databricks SQL · Genie AI (natural language queries) |
| Automated report delivery to subscribers | Databricks Dashboard Subscriptions (hourly) |
| Live REST API with authentication | FastAPI · Spark Connect · Render.com |
| Zero-touch CI/CD across DEV & PROD | GitHub Actions · Databricks Asset Bundles |
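
As an illustration of the KMeans segmentation row, the sketch below shows one way to compare cluster counts from k=2 to 6 by silhouette score with scikit-learn. It assumes synthetic two-dimensional data as a stand-in for real customer features, the helper name `pick_k_by_silhouette` is hypothetical, and MLflow logging and model registration are left out to keep it self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(features, k_values=range(2, 7), seed=0):
    """Fit KMeans for each k and return (best_k, {k: silhouette score})."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
        # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
        scores[k] = silhouette_score(features, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for customer features: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

best_k, scores = pick_k_by_silhouette(features)
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

On real data the silhouette curve is usually flatter than on this toy set, which is one reason to read it alongside the elbow method rather than rely on a single metric.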
Data Engineering
ML & Experimentation
CI/CD & Orchestration
API & Serving
Languages
Data Engineer · M.Tech Artificial Intelligence · Delhi Technological University · IEEE Published · Axtria Bravo Award 2025



