Azure Databricks is a cloud-based data platform that brings together the best of **data engineering, data science, and machine learning** in a single, unified workspace. Built on top of **Apache Spark**, it allows organizations to easily process, analyze, and visualize massive amounts of data in real time.

A **data lakehouse** is a data management approach that blends the strengths of both data lakes and data warehouses. It offers scalable storage and processing, allowing organizations to handle diverse workloads—such as machine learning and business intelligence—without relying on separate, disconnected systems. By centralizing data, a lakehouse supports a single source of truth, reduces duplicate costs, and ensures that information stays up to date.
Many lakehouses follow a layered design pattern where data is gradually improved, enriched, and refined as it moves through different stages of processing. This layered approach—commonly called the **medallion architecture**—organizes data into stages that build on one another, making it easier to manage and use effectively.
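As an illustrative sketch (the paths, table names, and cleansing rules here are assumptions, not a prescribed layout), promoting data from a bronze layer to a silver layer in PySpark might look like this:

```python
# Hypothetical medallion flow: raw JSON lands in bronze, cleansed data moves to silver
bronze = spark.read.format("json").load("/landing/raw_events")
bronze.write.format("delta").mode("append").save("/lakehouse/bronze/events")

silver = (
    spark.read.format("delta").load("/lakehouse/bronze/events")
    .dropDuplicates()                      # remove duplicate records
    .filter("event_type IS NOT NULL")      # drop rows failing a basic quality rule
)
silver.write.format("delta").mode("append").save("/lakehouse/silver/events")
```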
To use Azure Databricks, you must create an Azure Databricks workspace in your Azure subscription. A workspace is an Azure Databricks deployment in a cloud service account. It provides a unified environment for working with Azure Databricks assets for a specified set of users.
You can create an Azure Databricks workspace by:

- Using the Azure portal.
- Using an Azure Resource Manager (ARM) or Bicep template.
- Using the Azure CLI or Azure PowerShell.
The workspace is available in **multiple languages**. You can change the workspace language from your user settings.
**Databricks Assistant** is an AI-powered pair programmer and support tool that helps you work more efficiently in Databricks by generating, explaining, and fixing code or queries directly in notebooks, dashboards, and files.

Azure Databricks provides capabilities for data scientists and engineers who need to collaborate on complex data processing tasks. It provides an integrated environment with Apache Spark for big data processing in a data lakehouse, and supports multiple languages including Python, R, Scala, and SQL. The platform facilitates data exploration, visualization, and the development of data pipelines.
:::image type="content" source="../media/03-azure-databricks-data-science-engineering.png" alt-text="Diagram of Databricks data ingestion & data sources screen." lightbox="../media/03-azure-databricks-data-science-engineering.png":::
Azure Databricks supports building, training, and deploying machine learning models at scale. It includes MLflow, an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. It also supports various ML frameworks such as TensorFlow, PyTorch, and Scikit-learn, making it versatile for different ML tasks.
:::image type="content" source="../media/04-azure-databricks-machine-learning.png" alt-text="Diagram of Databricks Machine Learning screen." lightbox="../media/04-azure-databricks-machine-learning.png":::
Data analysts who primarily interact with data through SQL can use SQL warehouses in Azure Databricks. The Azure Databricks Workspace UI provides a familiar SQL editor, dashboards, and automatic visualization tools to analyze and visualize data directly within Azure Databricks. This workload is ideal for running quick ad-hoc queries and creating reports from large datasets.
:::image type="content" source="../media/05-azure-databricks-sql.png" alt-text="Diagram of DatabricksSQL Editor screen." lightbox="../media/05-azure-databricks-sql.png":::
A **workspace** in Azure Databricks is a secure, collaborative environment where you can access and organize all Databricks assets, such as notebooks, clusters, jobs, libraries, dashboards, and experiments.
You can open an Azure Databricks workspace from the Azure portal by selecting **Launch Workspace**.
In addition, workspaces are tied to **Unity Catalog** (when enabled) for centralized data governance.
**Databricks notebooks** are interactive, web-based documents that combine **runnable code, visualizations, and narrative text** in a single environment. They support multiple languages—such as Python, R, Scala, and SQL—and allow users to switch between languages within the same notebook using *magic commands*. This flexibility makes notebooks well-suited for **exploratory data analysis, data visualization, machine learning experiments, and building complex data pipelines**.
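For instance, a Python notebook cell can be switched to SQL with the `%sql` magic command. A minimal sketch (the table shown is a Databricks sample dataset; substitute your own):

```python
# Cell 1: the notebook's default language (Python)
df = spark.table("samples.nyctaxi.trips")

# Cell 2: starting a cell with a magic command switches that cell's language, for example:
# %sql
# SELECT COUNT(*) FROM samples.nyctaxi.trips
```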
Notebooks are also designed for **collaboration**: multiple users can edit and run cells simultaneously, add comments, and share insights in real time. They integrate tightly with Databricks clusters, enabling users to process large datasets efficiently, and can connect to external data sources through **Unity Catalog** for governed data access. In addition, notebooks can be version-controlled, scheduled as jobs, or exported for sharing outside the platform, making them central to both **ad-hoc exploration** and **production-grade workflows**.
Notebooks contain a collection of two types of cells: **code cells** and **Markdown cells**.
The **Databricks Runtime** is a set of customized builds of **Apache Spark** that include performance improvements and additional libraries. These runtimes make it easier to handle tasks such as **machine learning**, **graph processing**, and **genomics**, while still supporting general data processing and analytics.
Databricks provides multiple runtime versions, including **long-term support (LTS)** releases. Each release specifies the underlying Apache Spark version, its release date, and when support will end. Over time, older runtime versions move through a support lifecycle: they're fully supported at release, reach end of support on a published date, and no longer receive maintenance updates after that.
**Lakeflow Jobs** provide workflow automation and orchestration in Azure Databricks, making it possible to reliably schedule, coordinate, and run data processing tasks. Instead of running code manually, you can use jobs to automate repetitive or production-grade workloads such as ETL pipelines, machine learning training, or dashboard refreshes.
:::image type="content" source="../media/jobs.png" alt-text="Screenshot of an Azure Databricks Jobs landing page." lightbox="../media/jobs.png":::
Because they're repeatable and managed, jobs are critical for **production workloads**.
**Delta Lake** is an open-source storage framework that improves the reliability and scalability of data lakes by adding transactional features on top of cloud object storage, such as **Azure Data Lake Storage**. Traditional data lakes can suffer from issues like inconsistent data, partial writes, or difficulties managing concurrent access. Delta Lake addresses these problems by supporting:
- **ACID transactions** (atomicity, consistency, isolation, durability) for reliable reads and writes.
- **Schema enforcement and evolution** to keep table data consistent as it changes.
- **Time travel**, which lets you query or restore earlier versions of a table.
On top of this foundation, **Delta tables** provide a familiar table abstraction for reading and writing data.
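As a brief sketch (the table name is illustrative), writing a Delta table and querying an earlier version looks like this:

```python
# Save a DataFrame as a Delta table, then use time travel to read its first version
df.write.format("delta").mode("overwrite").saveAsTable("sales.orders")
previous = spark.sql("SELECT * FROM sales.orders VERSION AS OF 0")
```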
**Databricks SQL** brings **data warehousing capabilities** to the Databricks Lakehouse, allowing analysts and business users to query and visualize data stored in open formats directly in the data lake. It supports **ANSI SQL**, so anyone familiar with SQL can run queries, build reports, and create dashboards without needing to learn new languages or tools.
Databricks SQL is available only in the **Premium tier** of Azure Databricks. It includes:

- A familiar **SQL editor** for writing and running queries.
- **Dashboards** and automatic visualization tools for sharing insights.
- **SQL warehouses** that provide the compute for running queries.
All Databricks SQL queries run on **SQL warehouses** (formerly called SQL endpoints), which are scalable compute resources decoupled from storage. Different warehouse types (serverless, pro, and classic) are available depending on performance, cost, and management needs.
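Applications can also connect to a SQL warehouse directly. A minimal sketch using the `databricks-sql-connector` Python package (the hostname, HTTP path, and token are placeholders from a warehouse's connection details):

```python
from databricks import sql

# Connection values come from the SQL warehouse's "Connection details" tab (placeholders here)
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())
```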
Data governance is critical for ensuring that data within an organization is managed securely, efficiently, and in compliance with regulations.
In many organizations, data is distributed across databases, data warehouses, data lakes, and even multiple catalogs. It also exists in diverse formats like Parquet, CSV, and Delta Lake. Beyond structured data in tables, there’s also unstructured data in files, along with other assets such as machine learning models, notebooks, and dashboards that require management and governance. This fragmentation creates silos across sources, formats, and asset types.
Azure Databricks, combined with Unity Catalog and Microsoft Purview, provides a unified approach to governing data across these silos.
Unity Catalog provides a centralized way to manage access, discovery, lineage, audit logs, and quality monitoring across data and AI assets within Azure Databricks. It applies consistently across all workspaces in a region.
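For example (the catalog, schema, and principal names here are hypothetical), Unity Catalog objects are addressed with a three-level namespace, and access can be granted in SQL:

```python
# Read a table through Unity Catalog's three-level namespace: catalog.schema.table
df = spark.table("main.sales.orders")

# Grant a group read access to the table (group name is hypothetical)
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")
```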

In most accounts, Unity Catalog is enabled by default when you create a workspace.
Microsoft Purview is a data governance service that lets you manage and oversee data across on-premises systems, multiple clouds, and SaaS platforms. It includes features such as data discovery, classification, lineage tracking, and access governance.
When integrated with Azure Databricks and Unity Catalog, Purview can discover Lakehouse data and ingest its metadata into the Data Map. This allows you to apply consistent governance across your entire data environment, while acting as a central catalog that brings together metadata from different sources.
title: Transform data with filters and aggregations
metadata:
  title: Transform Data With Filters and Aggregations
  description: Learn how to filter, group, and aggregate data in Azure Databricks using PySpark and SQL to transform raw data into meaningful summaries.
### Filter null values
Filtering null values requires special handling in Spark DataFrames. Use the `isNull()` and `isNotNull()` functions to identify or exclude null values:
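```python
from pyspark.sql.functions import col

# Filter orders with non-null amounts (column name mirrors the SQL example below)
df_filtered = df.filter(col("order_amount").isNotNull())

# Or select only the rows where the amount is missing, for inspection
df_nulls = df.filter(col("order_amount").isNull())
```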
> [!NOTE]
> Using Python's `None` with inequality operators like `!= None` doesn't reliably filter null values in Spark DataFrames. Null comparisons in SQL semantics don't evaluate to true or false—they return null. Always use `isNull()` or `isNotNull()` for correct null handling.
In SQL, use the `IS NULL` or `IS NOT NULL` operators:

```sql
-- Filter orders with non-null amounts
SELECT *
FROM orders
WHERE order_amount IS NOT NULL;
```
For comprehensive null handling that removes entire rows containing null values, use the `dropna()` method covered in the unit on resolving duplicate and missing values:

```python
# Remove rows where order_amount is null
df_clean = df.dropna(subset=["order_amount"])
```
## Group data to organize records
Grouping organizes rows that share common values into categories. This prepares data for aggregation—once grouped, you can calculate statistics for each category.
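For instance, a minimal sketch (reusing the hypothetical orders columns from the earlier examples) groups orders by customer and totals their amounts:

```python
from pyspark.sql import functions as F

# Group rows by customer, then aggregate each group's order amounts
df_totals = (
    df.groupBy("customer_id")
      .agg(F.sum("order_amount").alias("total_amount"))
)
```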
title: Cleanse, Transform, and Load Data into Unity Catalog
description: Learn how to cleanse, transform, and load data into Unity Catalog tables in Azure Databricks by profiling data, handling duplicates and nulls, applying transformations, and using various loading strategies.
title: Choose a data ingestion tool
metadata:
  title: Choose a Data Ingestion Tool
  description: Learn how to select the appropriate data ingestion tool in Azure Databricks, including Lakeflow Connect, Auto Loader, COPY INTO, Spark Structured Streaming, JDBC/ODBC, and Azure Data Factory.
title: Design and implement a data partitioning scheme
metadata:
  title: Design and Implement a Data Partitioning Scheme
  description: Learn how to design and implement effective data partitioning schemes in Azure Databricks to optimize query performance and manage large-scale datasets.