Commit 8b62dd4

Merge pull request #52961 from weslbo/images-ingest-data-into-unity-catalog
Images ingest data into unity catalog
2 parents: 43e4190 + 1187c44

9 files changed

Lines changed: 10 additions & 0 deletions

learn-pr/wwl-databricks/ingest-data-into-unity-catalog/includes/2-ingest-data-lakeflow-connect.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@ Data engineers face a common challenge: getting data from diverse sources into a

 Lakeflow Connect is a collection of managed connectors in Azure Databricks that simplify data ingestion from external sources. Rather than writing custom extraction code, you configure pipelines through either a graphical interface or declarative definitions.

+:::image type="content" source="../media/2-understand-lakeflow-connect-pipelines.png" alt-text="Diagram explaining Lakeflow Connect pipelines." border="false" lightbox="../media/2-understand-lakeflow-connect-pipelines.png":::
+
 Each pipeline consists of three core components:

 - **Connection**: Stores credentials and endpoint information for the source system. You create connections once and reuse them across multiple pipelines.
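To make the reusable-connection idea concrete, here is a rough, hypothetical sketch of creating a managed ingestion pipeline with the Databricks Python SDK. The connection, catalog, and table names are placeholders, and the definition classes and fields are assumptions based on the public Pipelines API, not anything in this commit:

```python
# Hypothetical sketch only: creates a Lakeflow Connect ingestion pipeline that
# reuses an existing connection. All names are placeholders, and the
# IngestionPipelineDefinition / IngestionConfig / TableSpec fields are
# assumptions -- verify against the Databricks SDK documentation.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()  # picks up auth from the environment or a config profile

created = w.pipelines.create(
    name="salesforce-accounts-ingest",
    ingestion_definition=pipelines.IngestionPipelineDefinition(
        connection_name="salesforce_conn",  # created once, reused across pipelines
        objects=[
            pipelines.IngestionConfig(
                table=pipelines.TableSpec(
                    source_schema="objects",
                    source_table="Account",
                    destination_catalog="main",
                    destination_schema="ingest",
                )
            )
        ],
    ),
)
print(created.pipeline_id)
```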

learn-pr/wwl-databricks/ingest-data-into-unity-catalog/includes/5-ingest-data-change-data-capture-feed.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,8 @@ When source systems generate millions of transactions daily, reloading entire ta

 Change data capture ingestion involves reading a stream of change records from a source system and applying those changes to a target table. Unlike full table reloads, CDC processes only what has changed since the last sync, reducing both processing time and resource consumption.

+:::image type="content" source="../media/5-understand-change-data-capture-ingestion-patterns.png" alt-text="Diagram explaining CDC ingestion patterns." border="false" lightbox="../media/5-understand-change-data-capture-ingestion-patterns.png":::
+
 A typical CDC feed contains records with the following structure:

 - **Data columns**: The actual values for each field in the source table
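As a toy illustration of that record structure, the Python snippet below shows what a few change events for one key might look like. The field names (`op`, `seq`) are assumptions for the example, not something this module defines:

```python
# Toy CDC feed for a single key (id=42). Field names are illustrative only:
# data columns (id, email), an operation marker, and a sequence value.
cdc_events = [
    {"id": 42, "email": "old@example.com", "op": "INSERT", "seq": 1},
    {"id": 42, "email": "new@example.com", "op": "UPDATE", "seq": 2},
    {"id": 42, "email": None,              "op": "DELETE", "seq": 3},
]

# Applying the events in sequence order yields the final state: row deleted.
for event in sorted(cdc_events, key=lambda e: e["seq"]):
    print(event["op"], event)
```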
@@ -107,4 +109,6 @@ Distributed systems often deliver events out of order. A network delay might cau

 When two updates arrive for the same key, the API compares their sequence values. If the newer update (higher sequence number) has already been applied and an older update arrives late, the API ignores the late-arriving record. This ensures your destination table always reflects the correct final state.

+:::image type="content" source="../media/5-handle-out-of-order-events.png" alt-text="Diagram explaining handling out-of-order events." border="false" lightbox="../media/5-handle-out-of-order-events.png":::
+
 Without this built-in handling, you would need to write complex merge logic that checks timestamps, compares versions, and conditionally applies updates. The AUTO CDC API encapsulates all this logic, letting you focus on your business requirements rather than edge case handling.
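A minimal sketch of this sequence-based handling in a pipeline follows, written with `dlt.apply_changes`, the long-standing Python entry point for the same CDC merge functionality that the AUTO CDC naming refers to. The source table and column names are assumptions:

```python
import dlt
from pyspark.sql.functions import col, expr

# Hypothetical raw CDC feed; table and column names are placeholders.
# `spark` is provided by the pipeline runtime.
@dlt.view
def orders_changes():
    return spark.readStream.table("raw.orders_cdc")

dlt.create_streaming_table("orders")

# sequence_by is what resolves out-of-order delivery: if a record with a
# higher sequence value was already applied, a late-arriving older record
# for the same key is ignored.
dlt.apply_changes(
    target="orders",
    source="orders_changes",
    keys=["order_id"],
    sequence_by=col("event_seq"),
    apply_as_deletes=expr("op = 'DELETE'"),  # assumed operation column
)
```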

learn-pr/wwl-databricks/ingest-data-into-unity-catalog/includes/7-ingest-data-auto-loader.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ Auto Loader is a Structured Streaming source that monitors a cloud storage locat

 When you start an Auto Loader stream, it can process existing files in the directory and then continuously watch for new arrivals. The stream stores progress information in a checkpoint location, which allows it to resume from exactly where it stopped if interrupted.

+:::image type="content" source="../media/7-understand-how-auto-loader-works.png" alt-text="Diagram showing how Auto Loader works." border="false" lightbox="../media/7-understand-how-auto-loader-works.png":::
+
 Auto Loader detects new files using one of two modes:

 - **Directory listing mode**: Auto Loader periodically lists the input directory to discover new files. This approach requires no additional configuration beyond storage access.
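A minimal, self-contained sketch of such a stream is below; the volume paths, file format, and table name are assumptions for illustration:

```python
# Minimal Auto Loader sketch. Paths, format, and table name are placeholders;
# run inside a Databricks notebook or job where `spark` is defined.
stream = (
    spark.readStream.format("cloudFiles")          # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/ingest/_schemas/orders")
    .load("/Volumes/main/ingest/landing/orders")
)

(
    stream.writeStream
    # The checkpoint is what lets the stream resume exactly where it stopped.
    .option("checkpointLocation", "/Volumes/main/ingest/_checkpoints/orders")
    .trigger(availableNow=True)                    # process backlog, then stop
    .toTable("main.ingest.orders_raw")
)
```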

learn-pr/wwl-databricks/ingest-data-into-unity-catalog/includes/8-ingest-data-lakeflow-declarative-pipelines.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@ When you need to ingest data from multiple sources with automated orchestration,

 Lakeflow Declarative Pipelines is a framework for building batch and streaming data pipelines using SQL or Python. You declare the structure of your data transformations, and the pipeline automatically handles orchestration, retries, and incremental processing.

+:::image type="content" source="../media/8-understand-lakeflow-declarative-pipelines-ingestion.png" alt-text="Diagram explaining Lakeflow Declarative Pipelines for ingestion." border="false" lightbox="../media/8-understand-lakeflow-declarative-pipelines-ingestion.png":::
+
 For data ingestion, you typically create **streaming tables** as targets. A streaming table is a Delta table with built-in support for streaming data. Each row from the source is processed exactly once, making streaming tables ideal for append-only ingestion workloads where data continuously arrives.

 The pipeline framework provides several benefits for ingestion:
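Illustrating the streaming-table pattern from the paragraph above, here is a hedged `dlt` Python sketch; the path, format, and table name are assumptions:

```python
import dlt

# Hypothetical ingestion target: because the source is a stream, the result
# is a streaming table, and each source row is processed exactly once.
@dlt.table(comment="Raw orders ingested from cloud storage (placeholder path)")
def orders_raw():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader as the source
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/ingest/landing/orders")
    )
```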
learn-pr/wwl-databricks/ingest-data-into-unity-catalog/media/ — 5 binary image files added (285 KB, 5.52 MB, 6.65 MB, 442 KB, 588 KB), the PNG diagrams referenced above
