This repository documents an end-to-end Databricks Lakehouse built with the Medallion Architecture. It ingests raw CRM and ERP files, cleans and standardizes them, models them into a star schema, and prepares the workflow for orchestration in Databricks.
The project is organized into three data layers and one automation layer. The final goal is to produce reliable analytics tables for reporting and BI:
- Bronze: Raw ingestion with no transformations.
- Silver: Cleansing, standardization, and validation.
- Gold: Business-ready dimensional modeling.
- Pipeline: Orchestration with Databricks Workflows.
```
script/
├── init_lakehouse.ipynb
├── utils/
│   └── config.ipynb                      # Centralized config: schemas, table registry
├── bronze/
│   └── bronze_layer.ipynb
├── silver/
│   ├── silver_orchestration.ipynb
│   ├── crm/
│   │   ├── silver_crm_cust_info.ipynb
│   │   ├── silver_crm_prd_info.ipynb
│   │   └── silver_crm_sales_details.ipynb
│   └── erp/
│       ├── silver_erp_cust_az12.ipynb
│       ├── silver_erp_loc_a101.ipynb
│       └── silver_erp_px_cat_g1v2.ipynb
└── gold/
    ├── gold_orchestration.ipynb
    ├── gold_dim_customers.ipynb
    ├── gold_dim_products.ipynb
    └── gold_fact_sales.ipynb
```
1. Prepare the workspace

Use a Databricks workspace with Unity Catalog enabled. Create or confirm the following schemas:

- `bronze`
- `silver`
- `gold`
- `monitoring`
2. Create the raw file volume

Create a volume in the Bronze schema to act as the landing zone:

```
workspace.bronze.raw_sources
```
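If the schema or volume does not exist yet, both can be created with Unity Catalog SQL. A minimal sketch (the init notebook in step 4 may already handle this):

```python
# Minimal sketch: create the landing-zone volume via Unity Catalog SQL.
# Idempotent; skips creation if the objects already exist.
spark.sql("CREATE SCHEMA IF NOT EXISTS workspace.bronze")
spark.sql("CREATE VOLUME IF NOT EXISTS workspace.bronze.raw_sources")
```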
3. Upload source data

Place the six source CSV files into the `raw_sources` volume.
4. Initialize the lakehouse

Run the initialization script. This notebook prepares the schemas, storage, and audit log table needed for the project.

```
script/init_lakehouse.ipynb
```

5. Ingest raw data into Bronze

Run the ingestion notebook:

```
script/bronze/bronze_layer.ipynb
```

This notebook reads each source CSV and writes it as a Delta table in `workspace.bronze` using a consistent naming convention.
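The exact file list lives in the notebook; as a hedged sketch of the ingestion pattern (file and table names below are hypothetical):

```python
# Hedged sketch of the Bronze pattern: land each CSV as-is in a Delta table.
RAW_PATH = "/Volumes/workspace/bronze/raw_sources"

SOURCES = {  # source file -> bronze table (names are illustrative)
    "cust_info.csv": "workspace.bronze.crm_cust_info",
    "sales_details.csv": "workspace.bronze.crm_sales_details",
}

for file_name, table_name in SOURCES.items():
    df = spark.read.option("header", True).csv(f"{RAW_PATH}/{file_name}")
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
```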
6. Run the Silver layer

Run the orchestration notebook:

```
script/silver/silver_orchestration.ipynb
```

This notebook triggers the Silver transformation notebooks in sequence. The Silver layer cleans, standardizes, and validates data before writing to `workspace.silver`.
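A minimal sketch of how such sequencing can be done with `dbutils.notebook.run` (notebook paths relative to the orchestrator are assumptions):

```python
# Hedged sketch: run each Silver notebook in sequence; a failure in any
# child notebook raises an exception and stops the run.
SILVER_NOTEBOOKS = [
    "crm/silver_crm_cust_info",
    "crm/silver_crm_prd_info",
    "crm/silver_crm_sales_details",
    "erp/silver_erp_cust_az12",
    "erp/silver_erp_loc_a101",
    "erp/silver_erp_px_cat_g1v2",
]

for nb in SILVER_NOTEBOOKS:
    dbutils.notebook.run(nb, 600)  # 600-second timeout per notebook
```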
7. Build the Gold layer

Run the dimensional modeling notebook:

```
script/gold/gold_orchestration.ipynb
```

This notebook builds the dimensional model in `workspace.gold`, including customer, product, and sales tables for analytics.
Silver Outputs

The Silver layer produces cleaned, standardized tables such as:

- `workspace.silver.crm_customers`
- `workspace.silver.crm_products`
- `workspace.silver.crm_sales`
- `workspace.silver.erp_customers`
- `workspace.silver.erp_customer_location`
- `workspace.silver.erp_product_category`
Gold Outputs

The Gold layer creates the final star schema:

- `workspace.gold.dim_customers`
- `workspace.gold.dim_products`
- `workspace.gold.fact_sales`
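With the star schema in place, BI queries join the fact table to its dimensions on surrogate keys. A hedged example (column names such as `product_key` and `sales_amount` are assumptions, not confirmed by the model):

```python
# Hedged example: revenue by product category over the Gold star schema.
spark.sql("""
    SELECT p.category, SUM(f.sales_amount) AS revenue
    FROM workspace.gold.fact_sales f
    JOIN workspace.gold.dim_products p
      ON f.product_key = p.product_key
    GROUP BY p.category
    ORDER BY revenue DESC
""").display()
```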
The pipeline is designed to run as a Databricks Workflow named `loading_bike_data_lakehouse`.
Recommended Task Order:
- Bronze ingestion
- Silver orchestration
- Gold orchestration
Beyond the base Lakehouse, four production engineering patterns were implemented:
1. Data Quality Validation at Silver

Every Silver notebook includes a validation block that runs after each write. Checks enforced:
| Check | Rule |
|---|---|
| Row count | `total_rows > 0` → table must not be empty |
| Null PKs | `null_pk == 0` → primary key must be fully populated |
| Duplicate PKs | `duplicate_pk == 0` → no duplicate primary keys allowed |
| Business rules | e.g. `sales_amount > 0` on `crm_sales` |
QC is intentionally placed at the Silver layer, not Gold. Gold is a transformation layer: if corrupt data reaches Gold, failures are silent. Silver is the last point where the shape of the raw data can be enforced.
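A minimal sketch of such a validation block, assuming placeholder table and primary-key names:

```python
# Hedged sketch of a Silver QC block; TARGET_TABLE and PK_COL are placeholders.
from pyspark.sql import functions as F

df = spark.table(TARGET_TABLE)

total_rows = df.count()
null_pk = df.filter(F.col(PK_COL).isNull()).count()
duplicate_pk = df.groupBy(PK_COL).count().filter("count > 1").count()

assert total_rows > 0, "QC FAIL: table is empty"
assert null_pk == 0, f"QC FAIL: {null_pk} null primary keys"
assert duplicate_pk == 0, f"QC FAIL: {duplicate_pk} duplicate primary keys"
```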
2. Incremental MERGE Writes

Silver writes use MERGE INTO (upsert) instead of a full overwrite. This means each pipeline run only touches rows that are new or changed, leaving unaffected rows untouched.
```python
# Pattern used across all 6 Silver notebooks
from delta.tables import DeltaTable

if not spark.catalog.tableExists(TARGET_TABLE):
    # First run: create the Delta table
    df.write.format("delta").saveAsTable(TARGET_TABLE)
else:
    # Subsequent runs: upsert on the primary key
    delta_target = DeltaTable.forName(spark, TARGET_TABLE)
    delta_target.alias("target").merge(
        df.alias("source"),
        f"target.{PK_COL} = source.{PK_COL}"
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```

`crm_sales` uses a composite key (`order_number`, `product_number`) because no single column uniquely identifies a row in that table.
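For that composite key, the merge condition simply ANDs both columns; a hedged sketch, continuing from the pattern above:

```python
# Hedged sketch: composite-key MERGE condition for crm_sales.
merge_condition = (
    "target.order_number = source.order_number "
    "AND target.product_number = source.product_number"
)
delta_target.alias("target").merge(
    df.alias("source"), merge_condition
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```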
3. Centralized Configuration

All Silver and Gold notebooks source table names and schema references from a single config notebook (`utils/config`), loaded via `%run`:
```python
CATALOG = "workspace"

SILVER_SCHEMA = "silver"
GOLD_SCHEMA = "gold"
MONITORING_SCHEMA = "monitoring"

TABLES = {
    "crm_cust":  f"{CATALOG}.{SILVER_SCHEMA}.crm_customers",
    "crm_prd":   f"{CATALOG}.{SILVER_SCHEMA}.crm_products",
    "crm_sales": f"{CATALOG}.{SILVER_SCHEMA}.crm_sales",
    "erp_cust":  f"{CATALOG}.{SILVER_SCHEMA}.erp_customers",
    "erp_loc":   f"{CATALOG}.{SILVER_SCHEMA}.erp_customer_location",
    "erp_cat":   f"{CATALOG}.{SILVER_SCHEMA}.erp_product_category",
    "audit_log": f"{CATALOG}.{MONITORING_SCHEMA}.pipeline_audit_log",
}
```

Centralizing table references means environment promotion (dev → prod) requires changing one file, not touching every pipeline notebook.
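A hedged sketch of how a pipeline notebook consumes the config (the relative path is an assumption):

```python
# Cell 1 -- Databricks %run magic; the relative path is an assumption:
%run ../utils/config

# Cell 2 -- names defined by the config notebook are now in scope:
TARGET_TABLE = TABLES["crm_cust"]  # workspace.silver.crm_customers
```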
4. Pipeline Audit Logging

Every Silver notebook writes one audit row to a dedicated monitoring table after each pipeline run. This makes pipeline health visible and queryable.

Audit log table: `workspace.monitoring.pipeline_audit_log`
| Column | Type | Description |
|---|---|---|
| `notebook_name` | STRING | Which notebook ran |
| `target_table` | STRING | Which Silver table was written |
| `run_timestamp` | TIMESTAMP | When the run completed |
| `rows_inserted` | LONG | Rows added by MERGE |
| `rows_updated` | LONG | Rows modified by MERGE |
| `rows_deleted` | LONG | Rows removed by MERGE |
| `qc_status` | STRING | PASS or FAIL |
| `qc_message` | STRING | Failure reason, or "All checks passed" |
QC asserts are wrapped in `try`/`except AssertionError` so the audit log is always written, whether the pipeline passes or fails. A pipeline that crashes silently is unobservable; a pipeline that logs its failure is debuggable.
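A hedged sketch of that wrapper (the notebook name is a placeholder, and the zeroed merge metrics would in practice come from the table's Delta history):

```python
# Hedged sketch: QC asserts wrapped so the audit row is written on pass or fail.
from datetime import datetime, timezone

qc_status, qc_message = "PASS", "All checks passed"
try:
    assert total_rows > 0, "table is empty"
    assert null_pk == 0, "null primary keys found"
    assert duplicate_pk == 0, "duplicate primary keys found"
except AssertionError as e:
    qc_status, qc_message = "FAIL", str(e)

# Merge metrics are zeroed placeholders here.
spark.createDataFrame(
    [("silver_crm_cust_info", TARGET_TABLE, datetime.now(timezone.utc),
      0, 0, 0, qc_status, qc_message)],
    "notebook_name STRING, target_table STRING, run_timestamp TIMESTAMP, "
    "rows_inserted LONG, rows_updated LONG, rows_deleted LONG, "
    "qc_status STRING, qc_message STRING",
).write.mode("append").saveAsTable(TABLES["audit_log"])
```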
Each Silver notebook includes a final sanity check cell that displays the audit log, most recent runs first:

```python
spark.sql(f"SELECT * FROM {TABLES['audit_log']} ORDER BY run_timestamp DESC").display()
```

CI/CD: Automated Testing and Quality Gates

A continuous integration pipeline and automated test suite ensure configuration integrity and code quality across all pipeline notebooks.
Why This Matters
In a production lakehouse, configuration drift is a silent killer. A typo in a schema name, a missing table key in `TABLES`, or an accidental hardcoded catalog reference can break an entire downstream workflow, often discovered only during a late-night production run. Automated testing catches these issues before they reach production.
Additionally, notebook code is notoriously difficult to lint and test compared to plain Python files. The CI/CD setup bridges this gap by treating notebooks as testable, lintable artifacts.
What Was Implemented
Three components work together to enforce quality gates on every push:
1. GitHub Actions Workflow (.github/workflows/pipeline_ci.yml)
Triggers automatically on every push or pull request to main. The workflow performs:
- Code Quality Check: Runs `nbqa ruff` to lint all notebooks using Ruff (a fast Python linter). This catches PEP 8 violations, unused imports, undefined variables, and other code smells directly inside `.ipynb` files.
- Automated Tests: Executes the pytest test suite against the centralized config notebook to validate configuration integrity.
```yaml
name: PySpark CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  pyspark-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install nbqa ruff pytest nbformat
        run: pip install nbqa ruff pytest nbformat

      - name: Run the linter on notebooks
        run: nbqa ruff .

      - name: Run tests
        run: pytest tests/ -v
```

2. Test Suite (tests/test_config.py)
Validates the utils/config.ipynb notebook programmatically by extracting and executing its code cells, then asserting on critical properties:
| Test | Purpose |
|---|---|
| `test_catalog_equals_workspace` | Ensures the `CATALOG` variable is always set to `"workspace"` → prevents accidental references to dev or staging catalogs |
| `test_tables_keys` | Verifies all seven expected table keys exist in the `TABLES` dictionary (`crm_cust`, `crm_prd`, `crm_sales`, `erp_cust`, `erp_loc`, `erp_cat`, `audit_log`) → catches missing or renamed keys before pipelines break |
| `test_tables_values_fully_qualified` | Confirms every table reference is a fully qualified three-part name (`catalog.schema.table`) → prevents ambiguous references that fail in Unity Catalog environments |
The test suite uses nbformat to parse notebook JSON, extract code cells, and execute them in a sandboxed namespace. This allows testing notebook configuration as if it were a standard Python module.
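A hedged sketch of that approach (the notebook path and exact assertions are assumptions; the real suite may differ):

```python
# tests/test_config.py -- hedged sketch of notebook-as-module testing.
import nbformat


def load_config_namespace(path="script/utils/config.ipynb"):
    """Execute the config notebook's code cells in an isolated namespace."""
    nb = nbformat.read(path, as_version=4)
    ns = {}
    for cell in nb.cells:
        if cell.cell_type == "code":
            exec(cell.source, ns)  # no Spark needed; config is plain Python
    return ns


def test_catalog_equals_workspace():
    ns = load_config_namespace()
    assert ns["CATALOG"] == "workspace"


def test_tables_values_fully_qualified():
    ns = load_config_namespace()
    for name in ns["TABLES"].values():
        assert len(name.split(".")) == 3  # catalog.schema.table
```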
3. Development Dependencies (requirements-dev.txt)
Pins exact versions of testing and linting tools to ensure reproducible CI runs:
```
nbqa==1.9.1       # Enables running quality tools (ruff, black, mypy) on Jupyter notebooks
ruff==0.4.4       # Fast Python linter; replaces flake8, isort, and pylint
pytest==8.2.0     # Testing framework
nbformat==5.10.4  # Parses and validates Jupyter notebook format
```
Purpose & Impact
This CI/CD setup provides four critical safeguards:
- Early Detection: Configuration errors are caught in seconds during development, not hours later during a workflow run.
- Enforced Standards: Code quality is non-negotiable; linting failures block merges, ensuring every notebook meets baseline quality standards.
- Regression Prevention: Tests act as regression guards. If a refactor accidentally breaks the config structure, the test suite fails before code is merged.
- Team Scalability: New contributors can safely modify notebooks knowing the CI pipeline will catch breaking changes. This lowers the barrier to collaboration.
In production, this means fewer pipeline failures, faster debugging when issues do occur, and higher confidence in promoting code across environments.
This project demonstrates a practical Databricks Lakehouse implementation using:
- Medallion Architecture (Bronze / Silver / Gold)
- Unity Catalog governance
- Delta tables with ACID guarantees
- Incremental MERGE pipelines
- SCD Type 2 with active-record filtering at the Gold layer
- Data Quality validation at the Silver layer
- Config-driven, environment-portable notebook design
- Dimensional modeling (Star Schema)
- Workflow orchestration via Databricks Jobs
- Pipeline observability via a Delta audit log (`workspace.monitoring`)
- CI/CD automation with GitHub Actions
- Automated testing for configuration integrity


