Commit 4ef23fb

updated modules
1 parent 2ceb985 commit 4ef23fb

24 files changed

Lines changed: 166 additions & 25 deletions

learn-pr/wwl-data-ai/explore-azure-databricks/includes/01-introduction.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,5 @@
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-01]
+
 Azure Databricks is a cloud-based data platform that brings together the best of **data engineering, data science, and machine learning** in a single, unified workspace. Built on top of **Apache Spark**, it allows organizations to easily process, analyze, and visualize massive amounts of data in real time.
 
 ![Diagram showing an Overview of Azure Databricks.](../media/databricks-overview.png)
@@ -16,6 +18,8 @@ At its core, Azure Databricks helps organizations:
 
 ## Data Lakehouse
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-02]
+
 A **data lakehouse** is a data management approach that blends the strengths of both data lakes and data warehouses. It offers scalable storage and processing, allowing organizations to handle diverse workloads—such as machine learning and business intelligence—without relying on separate, disconnected systems. By centralizing data, a lakehouse supports a single source of truth, reduces duplicate costs, and ensures that information stays up to date.
 
 Many lakehouses follow a layered design pattern where data is gradually improved, enriched, and refined as it moves through different stages of processing. This layered approach—commonly called the **medallion architecture**—organizes data into stages that build on one another, making it easier to manage and use effectively.

learn-pr/wwl-data-ai/explore-azure-databricks/includes/02-azure-databricks.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,5 @@
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-03]
+
 To use Azure Databricks, you must create an Azure Databricks workspace in your Azure subscription. A workspace is an Azure Databricks deployment in a cloud service account. It provides a unified environment for working with Azure Databricks assets for a specified set of users.
 
 You can create an Azure Databricks workspace by:
@@ -60,6 +62,8 @@ The workspace is available in **multiple languages.** To change the workspace la
 
 ## Get help from Databricks Assistant
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-04]
+
 **Databricks Assistant** is an AI-powered pair programmer and support tool that helps you work more efficiently in Databricks by generating, explaining, and fixing code or queries directly in notebooks, dashboards, and files.
 
 ![Screenshot of the Azure Databricks Assistant.](../media/databricks-assistant.png)

learn-pr/wwl-data-ai/explore-azure-databricks/includes/03-workloads.md

Lines changed: 6 additions & 0 deletions
@@ -2,18 +2,24 @@ Azure Databricks offers capabilities for various workloads including Machine Lea
 
 ## Data Engineering
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-05]
+
 Azure Databricks provides capabilities for data scientists and engineers who need to collaborate on complex data processing tasks. It offers an integrated environment with Apache Spark for big data processing in a data lakehouse, and supports multiple languages including Python, R, Scala, and SQL. The platform facilitates data exploration, visualization, and the development of data pipelines.
 
 :::image type="content" source="../media/03-azure-databricks-data-science-engineering.png" alt-text="Diagram of Databricks data ingestion & data sources screen." lightbox="../media/03-azure-databricks-data-science-engineering.png":::
 
 ## Machine Learning
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-06]
+
 Azure Databricks supports building, training, and deploying machine learning models at scale. It includes MLflow, an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. It also supports various ML frameworks such as TensorFlow, PyTorch, and Scikit-learn, making it versatile for different ML tasks.
 
 :::image type="content" source="../media/04-azure-databricks-machine-learning.png" alt-text="Diagram of Databricks Machine Learning screen." lightbox="../media/04-azure-databricks-machine-learning.png":::
 
 ## SQL
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-07]
+
 Data analysts who primarily interact with data through SQL can use SQL warehouses in Azure Databricks. The Azure Databricks Workspace UI provides a familiar SQL editor, dashboards, and automatic visualization tools to analyze and visualize data directly within Azure Databricks. This workload is ideal for running quick ad-hoc queries and creating reports from large datasets.
 
 :::image type="content" source="../media/05-azure-databricks-sql.png" alt-text="Diagram of Databricks SQL Editor screen." lightbox="../media/05-azure-databricks-sql.png":::

learn-pr/wwl-data-ai/explore-azure-databricks/includes/04-key-concepts.md

Lines changed: 16 additions & 0 deletions
@@ -2,6 +2,8 @@ Azure Databricks is a single service platform with multiple technologies that en
 
 ## Workspaces
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-08]
+
 A **workspace** in Azure Databricks is a secure, collaborative environment where you can access and organize all Databricks assets, such as notebooks, clusters, jobs, libraries, dashboards, and experiments.
 
 You can open an Azure Databricks Workspace from the Azure portal, by selecting **Launch Workspace**.
@@ -14,6 +16,8 @@ In addition, workspaces are tied to **Unity Catalog** (when enabled) for central
 
 ## Notebooks
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-09]
+
 **Databricks notebooks** are interactive, web-based documents that combine **runnable code, visualizations, and narrative text** in a single environment. They support multiple languages—such as Python, R, Scala, and SQL—and allow users to switch between languages within the same notebook using *magic commands*. This flexibility makes notebooks well-suited for **exploratory data analysis, data visualization, machine learning experiments, and building complex data pipelines**.
 
 Notebooks are also designed for **collaboration**: multiple users can edit and run cells simultaneously, add comments, and share insights in real time. They integrate tightly with Databricks clusters, enabling users to process large datasets efficiently, and can connect to external data sources through **Unity Catalog** for governed data access. In addition, notebooks can be version-controlled, scheduled as jobs, or exported for sharing outside the platform, making them central to both **ad-hoc exploration** and **production-grade workflows**.
@@ -24,6 +28,8 @@ Notebooks contain a collection of two types of cells: **code cells** and **Markd
 
 ## Clusters
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-10]
+
 Azure Databricks leverages a two-layer architecture:
 
 - **Control Plane**: this internal layer, managed by Microsoft, handles backend services specific to your Azure Databricks account.
@@ -43,6 +49,8 @@ This allows you to tailor compute to specific needs—from exploratory analysis
 
 ## Databricks Runtime
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-11]
+
 The **Databricks Runtime** is a set of customized builds of **Apache Spark** that include performance improvements and additional libraries. These runtimes make it easier to handle tasks such as **machine learning**, **graph processing**, and **genomics**, while still supporting general data processing and analytics.
 
 Databricks provides multiple runtime versions, including **long-term support (LTS)** releases. Each release specifies the underlying Apache Spark version, its release date, and when support will end. Over time, older runtime versions follow a lifecycle:
@@ -56,6 +64,8 @@ If a maintenance update is released for a runtime version you're using, you can
 
 ## Lakeflow Jobs
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-12]
+
 **Lakeflow Jobs** provide workflow automation and orchestration in Azure Databricks, making it possible to reliably schedule, coordinate, and run data processing tasks. Instead of running code manually, you can use jobs to automate repetitive or production-grade workloads such as ETL pipelines, machine learning training, or dashboard refreshes.
 
 :::image type="content" source="../media/jobs.png" alt-text="Screenshot of an Azure Databricks Jobs landing page." lightbox="../media/jobs.png":::
@@ -72,6 +82,8 @@ Because they're repeatable and managed, jobs are critical for **production workl
 
 ## Delta Lake
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-13]
+
 **Delta Lake** is an open-source storage framework that improves the reliability and scalability of data lakes by adding transactional features on top of cloud object storage, such as **Azure Data Lake Storage**. Traditional data lakes can suffer from issues like inconsistent data, partial writes, or difficulties managing concurrent access. Delta Lake addresses these problems by supporting:
 
 - **ACID transactions** (atomicity, consistency, isolation, durability) for reliable reads and writes.
@@ -83,6 +95,8 @@ On top of this foundation, **Delta tables** provide a familiar table abstraction
 
 ## Databricks SQL
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-14]
+
 **Databricks SQL** brings **data warehousing capabilities** to the Databricks Lakehouse, allowing analysts and business users to query and visualize data stored in open formats directly in the data lake. It supports **ANSI SQL**, so anyone familiar with SQL can run queries, build reports, and create dashboards without needing to learn new languages or tools.
 
 Databricks SQL is available only in the **Premium tier** of Azure Databricks. It includes:
@@ -93,6 +107,8 @@ Databricks SQL is available only in the **Premium tier** of Azure Databricks. It
 
 ## SQL Warehouses
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-15]
+
 All Databricks SQL queries run on **SQL warehouses** (formerly called SQL endpoints), which are scalable compute resources decoupled from storage. Different warehouse types are available depending on performance, cost, and management needs:
 
 - **Serverless SQL Warehouses**

learn-pr/wwl-data-ai/explore-azure-databricks/includes/05-data-governance-using-unity-catalog-and-microsoft-purview.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,5 @@
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-16]
+
 Data governance is critical for ensuring that data within an organization is managed securely, efficiently, and in compliance with regulations.
 
 In many organizations, data is distributed across databases, data warehouses, data lakes, and even multiple catalogs. It also exists in diverse formats like Parquet, CSV, and Delta Lake. Beyond structured data in tables, there’s also unstructured data in files, along with other assets such as machine learning models, notebooks, and dashboards that require management and governance. This fragmentation creates silos across sources, formats, and asset types.
@@ -14,6 +16,8 @@ Azure Databricks, combined with Unity Catalog and Microsoft Purview, provides a
 
 ## Unity Catalog
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-17]
+
 Unity Catalog provides a centralized way to manage access, discovery, lineage, audit logs, and quality monitoring across data and AI assets within Azure Databricks. It applies consistently across all workspaces in a region.
 
 ![Diagram of the Unity Catalog components.](../media/06-azure-databricks-with-unity-catalog.png)
@@ -46,6 +50,8 @@ In most accounts, Unity Catalog is enabled by default when you create a workspac
 
 ## Microsoft Purview
 
+>[!VIDEO https://learn-video.azurefd.net/vod/player?id=22222222-2222-2222-8888-8888888888-18]
+
 Microsoft Purview is a data governance service that lets you manage and oversee data across on-premises systems, multiple clouds, and SaaS platforms. It includes features such as data discovery, classification, lineage tracking, and access governance.
 
 When integrated with Azure Databricks and Unity Catalog, Purview can discover Lakehouse data and ingest its metadata into the Data Map. This allows you to apply consistent governance across your entire data environment, while acting as a central catalog that brings together metadata from different sources.

learn-pr/wwl-databricks/cleanse-transform-load-data-into-unity-catalog/5-transform-data-filter-group-aggregate.yml

Lines changed: 2 additions & 2 deletions
@@ -4,11 +4,11 @@ title: Transform data with filters and aggregations
 metadata:
   title: Transform Data With Filters and Aggregations
   description: Learn how to filter, group, and aggregate data in Azure Databricks using PySpark and SQL to transform raw data into meaningful summaries.
-  ms.date: 12/07/2025
+  ms.date: 01/15/2026
   author: weslbo
   ms.author: wedebols
   ms.topic: unit
   ai-usage: ai-generated
-durationInMinutes: 6
+durationInMinutes: 7
 content: |
   [!include[](includes/5-transform-data-filter-group-aggregate.md)]

learn-pr/wwl-databricks/cleanse-transform-load-data-into-unity-catalog/includes/5-transform-data-filter-group-aggregate.md

Lines changed: 34 additions & 0 deletions
@@ -57,6 +57,40 @@ df_filtered = spark.sql("""
 """)
 ```
 
+### Filter null values
+
+Filtering null values requires special handling in Spark DataFrames. Use the `isNull()` and `isNotNull()` functions to identify or exclude null values:
+
+```python
+# Filter rows where order_amount is not null
+df_valid_orders = df.filter(col("order_amount").isNotNull())
+
+# Filter rows where order_amount is null
+df_null_orders = df.filter(col("order_amount").isNull())
+
+# Alternative syntax using column object directly
+df_valid_orders = df.filter(df.order_amount.isNotNull())
+```
+
+> [!IMPORTANT]
+> Using Python's `None` with inequality operators like `!= None` doesn't reliably filter null values in Spark DataFrames. Null comparisons in SQL semantics don't evaluate to true or false—they return null. Always use `isNull()` or `isNotNull()` for correct null handling.
+
+In SQL, use the `IS NULL` or `IS NOT NULL` operators:
+
+```sql
+-- Filter orders with non-null amounts
+SELECT *
+FROM orders
+WHERE order_amount IS NOT NULL;
+```
+
+For comprehensive null handling that removes entire rows containing null values, use the `dropna()` method covered in the unit on resolving duplicate and missing values:
+
+```python
+# Remove rows where order_amount is null
+df_clean = df.dropna(subset=["order_amount"])
+```
+
 ## Group data to organize records
 
 Grouping organizes rows that share common values into categories. This prepares data for aggregation—once grouped, you can calculate statistics for each category.

learn-pr/wwl-databricks/cleanse-transform-load-data-into-unity-catalog/index.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ uid: learn.wwl.cleanse-transform-load-data-into-unity-catalog
 metadata:
   title: Cleanse, Transform, and Load Data into Unity Catalog
   description: Learn how to cleanse, transform, and load data into Unity Catalog tables in Azure Databricks by profiling data, handling duplicates and nulls, applying transformations, and using various loading strategies.
-  ms.date: 12/07/2025
+  ms.date: 01/15/2026
   author: weslbo
   ms.author: wedebols
   ms.topic: module

learn-pr/wwl-databricks/design-implement-data-modeling-unity-catalog/3-choose-data-ingestion-tool.yml

Lines changed: 2 additions & 2 deletions
@@ -4,11 +4,11 @@ title: Choose a data ingestion tool
 metadata:
   title: Choose a Data Ingestion Tool
   description: Learn how to select the appropriate data ingestion tool in Azure Databricks, including Lakeflow Connect, Auto Loader, COPY INTO, Spark Structured Streaming, JDBC/ODBC, and Azure Data Factory.
-  ms.date: 12/07/2025
+  ms.date: 01/15/2026
   author: weslbo
   ms.author: wedebols
   ms.topic: unit
   ai-usage: ai-generated
-durationInMinutes: 10
+durationInMinutes: 11
 content: |
   [!include[](includes/3-choose-data-ingestion-tool.md)]

learn-pr/wwl-databricks/design-implement-data-modeling-unity-catalog/5-design-data-partitioning-scheme.yml

Lines changed: 2 additions & 2 deletions
@@ -4,11 +4,11 @@ title: Design and implement a data partitioning scheme
 metadata:
   title: Design and Implement a Data Partitioning Scheme
   description: Learn how to design and implement effective data partitioning schemes in Azure Databricks to optimize query performance and manage large-scale datasets.
-  ms.date: 12/07/2025
+  ms.date: 01/15/2026
   author: weslbo
   ms.author: wedebols
   ms.topic: unit
   ai-usage: ai-generated
-durationInMinutes: 8
+durationInMinutes: 10
 content: |
   [!include[](includes/5-design-data-partitioning-scheme.md)]
