
Commit 454ccbc

Merge pull request #52959 from weslbo/images-implement-manage-data-quality-constraints-in-unity-catalog
Images implement manage data quality constraints in unity catalog
2 parents dcc3204 + 1c2d6c8 commit 454ccbc

11 files changed: 37 additions & 39 deletions

learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/4-detect-manage-schema-drift.yml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+### YamlMime:ModuleUnit
+uid: learn.wwl.implement-manage-data-quality-constraints-unity-catalog.detect-manage-schema-drift
+title: Detect and manage schema drift
+metadata:
+  title: Detect and Manage Schema Drift
+  description: Learn how to detect and manage schema drift in Azure Databricks data pipelines using Delta Lake, Auto Loader, schema evolution, and error handling strategies.
+  ms.date: 12/07/2025
+  author: weslbo
+  ms.author: wedebols
+  ms.topic: unit
+  ai-usage: ai-generated
+durationInMinutes: 8
+content: |
+  [!include[](includes/4-detect-manage-schema-drift.md)]

learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/4-implement-schema-enforcement-manage-drift.yml

Lines changed: 0 additions & 14 deletions
This file was deleted.

learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/includes/2-implement-validation-checks.md

Lines changed: 9 additions & 5 deletions
@@ -6,6 +6,8 @@ In this unit, you learn how to implement validation checks for nullability, data
 
 Azure Databricks provides two primary mechanisms for implementing validation checks: pipeline expectations and table constraints. Each approach serves different scenarios and offers distinct capabilities.
 
+:::image type="content" source="../media/2-understand-validation-approaches.png" alt-text="Diagram explaining validation approaches in Azure Databricks." border="false" lightbox="../media/2-understand-validation-approaches.png":::
+
 **Pipeline expectations** apply validation during data transformations in Lakeflow Spark Declarative Pipelines. Expectations let you warn, drop invalid records, or fail the pipeline when data violates your rules. This approach works well for streaming tables and materialized views where you need real-time quality control.
 
 **Table constraints** enforce rules directly on Delta Lake tables. Constraints reject invalid data at write time, preventing bad records from ever entering your tables. This approach suits batch processing and scenarios requiring strict data integrity guarantees.
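
To make the second approach concrete, here is a minimal sketch of a table constraint (illustrative only, not part of this commit's diff; the catalog, table, column, and constraint names are hypothetical):

```python
# Hypothetical sketch: a CHECK constraint on a Delta table, added via Spark SQL.
# After this runs, any write with a non-positive amount is rejected at write time.
spark.sql("""
    ALTER TABLE main.sales.orders
    ADD CONSTRAINT positive_amount CHECK (amount > 0)
""")
```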
@@ -64,13 +66,15 @@ Cardinality validation ensures that columns expected to contain unique values ac
 Pipeline expectations can validate cardinality by checking for conditions that indicate uniqueness issues. For example, you can verify that a Social Security Number appears only once per person:
 
 ```python
+from pyspark.sql.window import Window
+from pyspark.sql.functions import count
+
 @dp.table()
-@dp.expect("unique_ssn_per_person", """
-    ssn IS NOT NULL
-    AND LENGTH(ssn) = 9
-""")
+@dp.expect("unique_ssn_per_person", "ssn_count = 1")
 def employees():
-    return spark.readStream.table("raw.employees")
+    df = spark.table("raw.employees")
+    w = Window.partitionBy("ssn")
+    return df.withColumn("ssn_count", count("*").over(w))
 ```
 
 For more comprehensive cardinality checks, combine expectations with aggregation logic in your transformation:
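
A hedged sketch of what that aggregation logic might look like (not part of this commit; it reuses the `dp` pipeline context and `raw.employees` source from the example above):

```python
from pyspark.sql.functions import count

# Hypothetical sketch: materialize per-SSN counts so an expectation can flag
# any SSN that appears more than once across the whole dataset.
@dp.table()
@dp.expect("unique_ssn", "occurrences = 1")
def ssn_cardinality():
    return (
        spark.table("raw.employees")
        .groupBy("ssn")
        .agg(count("*").alias("occurrences"))
    )
```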

learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/includes/3-implement-data-type-checks.md

Lines changed: 2 additions & 2 deletions
@@ -25,14 +25,14 @@ When you insert data where `quantity` is a string that represents a number, Delt
 
 ```sql
 -- This succeeds because '100' can be cast to INT
-INSERT INTO inventory VALUES (1, '100', '2024-01-15');
+INSERT INTO inventory VALUES (1, '100', '2026-01-15');
 ```
 
 However, inserting a string that can't be converted to an integer causes the operation to fail:
 
 ```sql
 -- This fails because 'fifty' cannot be cast to INT
-INSERT INTO inventory VALUES (2, 'fifty', '2024-01-15');
+INSERT INTO inventory VALUES (2, 'fifty', '2026-01-15');
 ```
 
 Schema enforcement provides a first line of defense against type mismatches. For more control over how mismatches are handled, you can use explicit casting.
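
As a hedged illustration of that explicit-casting option (not part of this commit; `staging_inventory` and the `quantity` column are hypothetical), Spark's `try_cast` turns unconvertible values into NULL instead of failing the write:

```python
from pyspark.sql.functions import expr

# Hypothetical sketch: cast quantity explicitly before appending, so values
# like 'fifty' become NULL rather than aborting the entire insert.
(
    spark.table("staging_inventory")
    .withColumn("quantity", expr("try_cast(quantity AS INT)"))
    .write.mode("append")
    .saveAsTable("inventory")
)
```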

learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/includes/4-implement-schema-enforcement-manage-drift.md renamed to learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/includes/4-detect-manage-schema-drift.md

Lines changed: 7 additions & 17 deletions
@@ -1,36 +1,26 @@
 Data pipelines often receive data from sources that evolve over time. New columns appear, others disappear, and the structure of incoming data changes as business requirements shift. Without proper controls, these changes can silently corrupt your data or break your pipelines entirely.
 
-In this unit, you learn how to enforce schema constraints in Azure Databricks and implement strategies for detecting and managing schema drift in your data engineering workflows.
+In this unit, you learn how to detect and manage schema drift—the structural changes that occur when source systems add, remove, or rename columns over time.
 
-## Understand schema enforcement
+## Recognize schema drift challenges
 
-Schema enforcement is the process of validating that incoming data matches the expected structure of your target table. Delta Lake enforces schema on write by default, which means every write operation validates the data structure before committing changes.
+While data type validation ensures values match expected types (as covered in the previous unit), schema drift addresses a different challenge: the structure of your data changes over time. A source system adds a new `phone_number` column, removes a deprecated `legacy_id` field, or renames `customer_email` to `email_address`. These structural changes happen independently of type validation.
 
-When you insert data into a Delta table, Azure Databricks enforces these rules:
+Delta Lake's schema enforcement blocks structural mismatches by default. When incoming data contains columns not present in the target table, or when required columns are missing, the write operation fails. This fail-fast behavior protects your tables from unexpected structural changes, but you need strategies to handle legitimate schema evolution.
 
-- All columns in the incoming data must exist in the target table
-- The source data must include all columns present in the target table
-- Column names must match (schema enforcement is case-sensitive by default)
+:::image type="content" source="../media/4-recognize-schema-drift-challenges.png" alt-text="Diagram helping you recognize schema drift challenges." border="false" lightbox="../media/4-recognize-schema-drift-challenges.png":::
 
-Consider what happens when you attempt to write data that doesn't match the expected schema:
+Consider a streaming pipeline that processes customer data:
 
 ```sql
--- Target table expects columns: customer_id, name, email
 CREATE TABLE customers (
     customer_id INT,
     name STRING,
     email STRING
 );
-
--- This insert fails because 'phone' column doesn't exist in target
-INSERT INTO customers
-SELECT customer_id, name, email, phone FROM source_data;
 ```
 
-The operation fails with an error indicating that the column `phone` doesn't exist in the target table. This fail-fast behavior prevents unexpected data from entering your tables.
-
-> [!NOTE]
-> Schema enforcement applies to Delta Lake tables by default. Tables backed by external data sources don't enforce schema automatically.
+When your source system adds a `phone_number` column and starts sending it in the data feed, writes fail because the target table doesn't include this column. Your pipeline stops until you decide how to handle the new field—either by rejecting it, adding it to the table schema, or preserving it for later analysis.
 
 ## Detect and respond to schema drift
 
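As a hedged sketch of one way to accept such a legitimate new column (illustrative only, not part of this commit; `updates_df` is a hypothetical DataFrame that already carries `phone_number`), Delta's `mergeSchema` write option evolves the table instead of failing:

```python
# Hypothetical sketch: opt in to schema evolution for this single write so the
# new phone_number column is added to the customers table rather than rejected.
(
    updates_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("customers")
)
```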

learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/includes/5-manage-data-quality-pipeline-expectations.md

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,8 @@ With expectations, you specify what valid data looks like using SQL constraints.
 
 Every expectation consists of three parts: a name, a constraint, and an action. Understanding these components helps you design effective data quality checks.
 
+:::image type="content" source="../media/5-define-expectations.png" alt-text="Diagram of the three components of an expectation: name, constraint, and action." border="false" lightbox="../media/5-define-expectations.png":::
+
 The **name** identifies the expectation and appears in monitoring dashboards. Choose names that clearly describe what you're validating. For example, `valid_customer_age` communicates the rule's purpose better than `check_1`.
 
 The **constraint** is a SQL Boolean expression that evaluates to true or false for each record. When a record fails the constraint, the expectation triggers. You can use any valid SQL syntax except custom Python functions, external service calls, or subqueries.
@@ -128,6 +130,8 @@ To view expectation metrics:
 3. Select a dataset that has expectations defined.
 4. Open the **Data quality** tab in the right sidebar.
 
+:::image type="content" source="../media/5-monitor-expectation-results.png" alt-text="Screenshot of the declarative pipeline editor, highlighting expectations." border="false" lightbox="../media/5-monitor-expectation-results.png":::
+
 The metrics show you how many records passed or failed each expectation during pipeline runs. For `warn` and `drop` actions, you see counts of violations. For `fail` actions, the pipeline stops before metrics are recorded, but error messages include details about the violating record.
 
 When a `fail` expectation triggers, the error message provides context to help you investigate:
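
To ground the name/constraint/action model, a minimal sketch (illustrative only, not part of this commit; it assumes the `dp` decorators used earlier in this module also expose drop and fail variants, as the Lakeflow/DLT expectation API does):

```python
# Hypothetical sketch: the same decorator pattern with all three actions.
@dp.table()
@dp.expect("valid_customer_age", "age BETWEEN 0 AND 120")            # warn: keep record, count violation
@dp.expect_or_drop("non_null_email", "email IS NOT NULL")            # drop: discard violating records
@dp.expect_or_fail("valid_customer_id", "customer_id IS NOT NULL")   # fail: stop the pipeline
def customers_clean():
    return spark.table("raw.customers")
```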

learn-pr/wwl-databricks/implement-manage-data-quality-constraints-unity-catalog/index.yml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ units:
 - learn.wwl.implement-manage-data-quality-constraints-unity-catalog.introduction
 - learn.wwl.implement-manage-data-quality-constraints-unity-catalog.implement-validation-checks
 - learn.wwl.implement-manage-data-quality-constraints-unity-catalog.implement-data-type-checks
-- learn.wwl.implement-manage-data-quality-constraints-unity-catalog.implement-schema-enforcement-manage-drift
+- learn.wwl.implement-manage-data-quality-constraints-unity-catalog.detect-manage-schema-drift
 - learn.wwl.implement-manage-data-quality-constraints-unity-catalog.manage-data-quality-pipeline-expectations
 - learn.wwl.implement-manage-data-quality-constraints-unity-catalog.knowledge-check
 - learn.wwl.implement-manage-data-quality-constraints-unity-catalog.summary
[Binary media files added: 509 KB, 2.61 MB, 396 KB — images not rendered in this view]
