Data pipelines often receive data from sources that evolve over time. New columns appear, others disappear, and the structure of incoming data changes as business requirements shift. Without proper controls, these changes can silently corrupt your data or break your pipelines entirely.

In this unit, you learn how to detect and manage schema drift: the structural changes that occur when source systems add, remove, or rename columns over time.

## Recognize schema drift challenges

While data type validation ensures values match expected types (as covered in the previous unit), schema drift addresses a different challenge: the structure of your data changes over time. A source system adds a new `phone_number` column, removes a deprecated `legacy_id` field, or renames `customer_email` to `email_address`. These structural changes happen independently of type validation.

Delta Lake's schema enforcement blocks structural mismatches by default. When incoming data contains columns not present in the target table, or when required columns are missing, the write operation fails. This fail-fast behavior protects your tables from unexpected structural changes, but you need strategies to handle legitimate schema evolution.

:::image type="content" source="../media/4-recognize-schema-drift-challenges.png" alt-text="Diagram helping you recognize schema drift challenges." border="false" lightbox="../media/4-recognize-schema-drift-challenges.png":::

Consider a streaming pipeline that processes customer data:

```sql
CREATE TABLE customers (
    customer_id INT,
    name STRING,
    email STRING
);
```

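As a sketch of the fail-fast behavior, suppose the source later starts sending an extra column. A plain insert into this table then fails under schema enforcement (`customer_feed` is a hypothetical source view that now includes `phone_number`):

```sql
-- Fails: 'phone_number' doesn't exist in the customers table,
-- so Delta Lake rejects the entire write before committing anything
INSERT INTO customers
SELECT customer_id, name, email, phone_number
FROM customer_feed;  -- hypothetical source that added a column
```
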
When your source system adds a `phone_number` column and starts sending it in the data feed, writes fail because the target table doesn't include this column. Your pipeline stops until you decide how to handle the new field: reject it, add it to the table schema, or preserve it for later analysis.

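To accept a legitimate new field, one option is to evolve the table schema explicitly before the next write. As a sketch of the "add it to the table schema" choice, using the hypothetical `phone_number` field:

```sql
-- Evolve the schema explicitly so future writes that include
-- phone_number succeed; existing rows read the new column as NULL
ALTER TABLE customers ADD COLUMNS (phone_number STRING);
```

Alternatively, Delta Lake DataFrame writes can opt into automatic schema evolution with the `mergeSchema` write option, which appends new columns to the target table instead of failing the write.
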
## Detect and respond to schema drift