learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/4-troubleshoot-repair-spark-jobs-notebooks.yml
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/2-monitor-manage-cluster-consumption.md
Unmonitored clusters present real financial risks.
At the same time, **under-provisioned clusters** create their own problems. Jobs take longer to complete, users experience delays, and critical workflows miss their deadlines. The goal isn't simply to minimize costs—it's to **match resources precisely to workload requirements**.
:::image type="content" source="../media/2-understand-impact-cluster-consumption.png" alt-text="Diagram explaining the impact of cluster consumption." border="false" lightbox="../media/2-understand-impact-cluster-consumption.png":::
Monitoring cluster consumption helps you **identify waste** before it impacts budgets and **spot performance bottlenecks** before they affect business operations. Regular monitoring also establishes **baselines** that help you plan capacity and justify infrastructure decisions.
## Monitor compute metrics
The metrics interface displays three categories of data:
**GPU metrics** (available on Databricks Runtime ML 13.3 and later) track specialized compute utilization when running machine learning workloads.
:::image type="content" source="../media/2-cluster-metrics.png" alt-text="Screenshot of Azure Databricks cluster metrics." lightbox="../media/2-cluster-metrics.png":::
You can filter metrics by time range using the date picker, viewing data from the past 30 days. Select individual nodes from the Compute dropdown to investigate specific worker performance, or view aggregated metrics across all nodes to understand overall cluster behavior.
> [!NOTE]
SQL warehouses have their own monitoring interface optimized for query analytics. Select a SQL warehouse and then select the **Monitoring** tab to view performance data.
:::image type="content" source="../media/2-warehouse-monitoring-tab.png" alt-text="Screenshot of the Azure Databricks SQL warehouse monitoring tab." lightbox="../media/2-warehouse-monitoring-tab.png":::
**Live statistics** at the top of the page show warehouse status, running queries, queued queries, and current cluster count. These metrics update in real-time and help you quickly assess whether the warehouse is keeping up with demand.
The **peak query count** chart displays the maximum number of concurrent queries—both running and queued—during your selected time frame. Spikes in this chart often indicate periods where the warehouse struggled to keep up with demand.
Monitoring reveals patterns; configuration changes act on those patterns.
**Auto-termination** shuts down idle clusters after a specified period of inactivity. For development environments, **30-60 minutes** is typically appropriate. The cluster terminates when no commands have run for the specified duration, preventing costs from accumulating overnight or over weekends.
To configure auto-termination, enable the setting during cluster creation or edit an existing cluster. Enter the number of minutes of inactivity before termination. Keep in mind that a cluster is considered inactive only when all commands—including Spark jobs, Structured Streaming, and JDBC calls—have finished executing.
**Autoscaling** dynamically adjusts the number of worker nodes based on workload demand. Configure **minimum and maximum node counts** based on your workload analysis. The cluster adds workers during intensive processing and removes them during lighter periods, reducing costs by **20-40%** compared to fixed-size clusters.
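If you manage clusters as code rather than through the UI, the same two settings can be applied programmatically. The following is a minimal sketch using the Databricks SDK for Python (`databricks-sdk`); the cluster ID, cluster name, runtime version, and node type are placeholders, and `clusters.edit` replaces the whole cluster specification, so restate any other settings you want to keep.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()  # resolves host and credentials from the environment or a config profile

w.clusters.edit(
    cluster_id="0123-456789-abcdefgh",   # placeholder cluster ID
    cluster_name="dev-analytics",        # placeholder name
    spark_version="15.4.x-scala2.12",    # placeholder runtime version
    node_type_id="Standard_DS3_v2",      # placeholder VM size
    autotermination_minutes=45,          # terminate after 45 idle minutes
    autoscale=AutoScale(min_workers=2, max_workers=8),
)
```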
Beyond real-time monitoring, you need visibility into actual spending.
**Budgets** let you set financial targets and track spending across your account. Configure **email notifications** when spending approaches or exceeds your budget limits. You can apply filters to track spending by team, project, or workspace.
**System tables**, specifically `system.billing.usage`, provide detailed usage data you can query directly. Join this table with `compute.clusters` to identify which cluster owners consume the most **Databricks Units (DBUs)**. Use **custom tags** to attribute costs to specific business units or projects.
**Tags** propagate from clusters and workspaces to billing records, enabling accurate **chargeback**. Apply tags consistently from the start—you can't add tags retroactively to historical usage. Common tags include business unit, project, and environment (development, staging, production).
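For example, a cost-attribution query along these lines can run in a notebook. This is a sketch only: the `usage_metadata.cluster_id`, `usage_quantity`, `owned_by`, and `custom_tags` columns reflect the current system table schemas, and the `project` tag key is a placeholder, so verify the names in your workspace.

```python
# Attribute DBU consumption to cluster owners (and an optional project tag)
# over the last 30 days. Assumes read access to the billing and compute
# system tables. Note that system.compute.clusters keeps one row per
# configuration change, so consider deduplicating by cluster_id for exact numbers.
dbu_by_owner = spark.sql("""
    SELECT
        c.owned_by,
        u.custom_tags['project'] AS project,
        SUM(u.usage_quantity)    AS total_dbus
    FROM system.billing.usage AS u
    JOIN system.compute.clusters AS c
      ON u.usage_metadata.cluster_id = c.cluster_id
    WHERE u.usage_unit = 'DBU'
      AND u.usage_date >= date_sub(current_date(), 30)
    GROUP BY c.owned_by, u.custom_tags['project']
    ORDER BY total_dbus DESC
""")
display(dbu_by_owner)
```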
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/3-troubleshoot-repair-lakeflow-jobs.md
To investigate a failed job:
3. In the **Runs** tab, hover over a failed task (shown in red) to see metadata including start time, end time, status, duration, and error messages.
4. Select the failed task to open the **Task run details** page with complete output and logs.
:::image type="content" source="../media/3-identify-cause-failure.png" alt-text="Screenshot showing a failed task." lightbox="../media/3-identify-cause-failure.png":::
The matrix view helps you identify patterns. If the same task fails repeatedly, the issue likely relates to that task's code or configuration. If failures appear random across different tasks, you might have a cluster or resource problem.
> [!TIP]
To repair a failed run:
4. Optionally, modify task parameters in the dialog. These values override the original settings for this repair run only.
5. Select **Repair run** to start.
:::image type="content" source="../media/3-repair-failed-task.png" alt-text="Screenshot of the failed task." lightbox="../media/3-repair-failed-task.png":::
After the repair completes, the matrix view adds a new column showing the repaired run results. Tasks that were red (failed) should now appear green (successful).
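If you prefer to trigger the same repair from a script, the Jobs API exposes it as well. The following is a minimal sketch using the Databricks SDK for Python; the run ID and task key are placeholders taken from your own failed run.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Rerun only the named failed task (and anything downstream of it) within the
# existing job run. Run ID and task key are placeholders from your failed run.
repair = w.jobs.repair_run(
    run_id=123456789,
    rerun_tasks=["ingest_orders"],
)
repair.result()  # wait for the repair run to finish
```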
> [!NOTE]
Sometimes you need to halt a running job or restart one that's stuck.
**To restart continuous jobs**: Continuous jobs that fail repeatedly enter an exponential backoff state, where Azure Databricks waits progressively longer between retry attempts. The **Job details** panel shows the number of consecutive failures and the time until the next retry. Select **Restart run** to cancel the active run, reset the retry period, and immediately start a new run.
:::image type="content" source="../media/3-stop-run.png" alt-text="Screenshot showing how to stop an active run." lightbox="../media/3-stop-run.png":::
Use the stop function when a job consumes excessive resources, processes incorrect data, or needs immediate intervention. Use restart when you've fixed an underlying issue and want to bypass the exponential backoff waiting period.
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/4-troubleshoot-repair-spark-jobs-notebooks.md
When a Spark job fails or runs slower than expected, your ability to quickly diagnose the cause matters.
Spark jobs can fail for various reasons, and understanding these patterns helps you focus your investigation. The most frequent causes fall into three categories: code errors, resource constraints, and environmental issues.
:::image type="content" source="../media/4-understand-common-causes-failures.png" alt-text="Diagram explaining the common causes of Spark job failures." border="false" lightbox="../media/4-understand-common-causes-failures.png":::
**Code-related failures** include syntax errors in notebooks, incorrect transformations, or data quality issues like schema mismatches. These failures typically produce error messages that point directly to the problematic code.
**Resource bottlenecks** occur when jobs consume more CPU, memory, or disk than available. You might see out-of-memory (OOM) errors, slow shuffle operations, or tasks that fail repeatedly. These issues often require adjusting cluster configuration or optimizing your code.
The Spark UI provides detailed visibility into job execution and is your primary diagnostic tool. To access it, navigate to your cluster's page and select the **Spark UI** tab.
:::image type="content" source="../media/4-use-spark-user-interface.png" alt-text="Screenshot of the Spark user interface." lightbox="../media/4-use-spark-user-interface.png":::
Start your investigation with the **Jobs Timeline**, which shows the execution sequence of all Spark jobs. Look for three key patterns:
**Failing jobs** appear with a red status indicator. Select any failed job to view the failed stage and specific failure reason. The error description often contains links to more detailed information about task-level failures.
After identifying a problematic job, drill into its longest stage to examine task-level details.
Resource bottlenecks manifest differently depending on which resource is constrained. The compute metrics interface helps you identify these patterns by showing CPU, memory, and network utilization across nodes.
:::image type="content" source="../media/4-identify-resolve-resource-bottlenecks.png" alt-text="Diagram explaining how to identify and resolve resource bottlenecks." border="false" lightbox="../media/4-identify-resolve-resource-bottlenecks.png":::
**Memory pressure** appears as high memory utilization across workers or the driver. In the Spark UI, look for spill indicators showing data being written to disk because memory is insufficient. You can address memory issues by increasing worker instance sizes, reducing partition counts, or optimizing transformations to minimize data held in memory.
**CPU constraints** show as high CPU utilization with long task execution times despite adequate I/O throughput. Consider enabling Photon acceleration for compatible workloads or scaling out with additional worker nodes.
To access compute metrics, select your cluster from the **Compute** page and select the **Metrics** tab. The **Server load distribution** visualization uses color coding—red indicates heavily loaded nodes, while blue shows idle resources. If the driver node appears red while workers are blue, the driver is overloaded and may need a larger instance type.
:::image type="content" source="../media/4-server-metrics.png" alt-text="Screenshot of the compute metrics tab." lightbox="../media/4-server-metrics.png":::
## Restart clusters to resolve environmental issues
Sometimes a cluster encounters problems that require a restart to resolve. Resource exhaustion, malfunctioning executors, or stale container images can all necessitate a fresh cluster start.
Before restarting, determine whether a restart is appropriate. Check the **Event log** tab on the cluster details page for lifecycle events that might explain the problem. Look for messages about instance acquisition failures, spot instance reclamation, or executor terminations.
To restart a cluster using the UI, select your cluster from the **Compute** page and select **Restart**.
:::image type="content" source="../media/4-restart-cluster.png" alt-text="Screenshot showing how to restart a cluster." lightbox="../media/4-restart-cluster.png":::
You can also restart programmatically using the Databricks CLI:
```bash
databricks clusters restart CLUSTER_ID
```

Replace `CLUSTER_ID` with your cluster's identifier, which you can find on the cluster details page.
> [!IMPORTANT]
> Restarting a cluster terminates any running jobs and resets the Spark UI history. Save any diagnostic information you need before restarting. For long-running clusters processing streaming data, consider scheduling regular restarts during maintenance windows to ensure the cluster runs on current images.
## Repair failed job runs

When a job with multiple tasks fails, you don't need to rerun the entire job. The repair run feature lets you execute only the failed tasks and their dependents, saving time and resources. Note that repair is supported only for jobs that orchestrate two or more tasks.

To repair a job run:

1. Navigate to **Job Runs** in the sidebar.
2. Select the failed job from the list.
3. Select **Repair run** to see all tasks that will be reexecuted.
4. Optionally modify task parameters before repair.
5. Select **Repair run** to start the recovery.

For jobs that fail repeatedly, Databricks Assistant can help diagnose errors. Open the failed job and select **Diagnose Error** to receive suggestions for resolving the issue.
After making changes—whether adjusting cluster configuration, fixing code, or resolving external dependencies—validate your fix by monitoring the next job run. Check that execution times return to expected levels and that no new errors appear in the Spark UI or job output.
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/5-resolve-cache-skew-spill-shuffle.md
When investigating caching problems, consider these scenarios:
:::image type="content" source="../media/5-investigate-cache-issues.png" alt-text="Diagram showing how to investigate caching issues." border="false" lightbox="../media/5-investigate-cache-issues.png":::
**Under-caching** means data is read repeatedly from remote storage when it could be served from cache. The Spark UI shows high **Input** values for stages that read the same data multiple times. Enable disk cache and use worker nodes with SSD storage for better performance.
**Over-caching** consumes memory that Spark needs for processing. If you see memory pressure or out-of-memory errors, review whether cached data is actually being reused. Spark cache (using `.cache()` or `.persist()`) requires explicit management, unlike automatic disk caching.
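Explicit management of the Spark cache looks roughly like the following minimal PySpark sketch, in which the table name is a placeholder:

```python
df = spark.table("sales.orders")  # placeholder table name

df.cache()       # mark the DataFrame for caching
df.count()       # run an action so the cache is actually materialized
# ... reuse df in several subsequent queries ...
df.unpersist()   # release the cached blocks once the reuse is over
```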
Data skew occurs when some partitions contain significantly more data than others. This imbalance causes a few tasks to run much longer than the rest, leaving most cluster resources idle while waiting for slow tasks to complete.
:::image type="content" source="../media/5-investigate-data-skew.png" alt-text="Diagram showing how to investigate data skew." border="false" lightbox="../media/5-investigate-data-skew.png":::
To identify skew in the Spark UI, navigate to the stage's page and scroll to **Summary Metrics**. Compare the **Max** duration to the **75th percentile**. If the Max is more than 50% higher than the 75th percentile, you likely have skew.
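If the skew surfaces in joins, one common mitigation (a general Spark 3.x capability rather than anything specific to this module) is adaptive query execution's skew-join handling, sketched below:

```python
# Adaptive query execution (Spark 3.x) detects skewed partitions at runtime
# and splits them during sort-merge joins, so a handful of oversized tasks
# no longer dominate the stage.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

For severe skew on a known hot key, salting the key before the join is another option.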
Spill happens when Spark runs out of memory during processing and writes intermediate data to disk. This disk I/O significantly slows down operations. Spill commonly occurs during shuffle operations, aggregations, or when partitions are too large.
The Spark UI shows spill metrics at the top of each stage's page. Look for **Shuffle Spill (Memory)** and **Shuffle Spill (Disk)** values. Any non-zero spill indicates memory pressure.
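Because spill usually means individual shuffle partitions are too large for executor memory, one hedged remedy is to raise the shuffle partition count so each task handles less data; the value below is a placeholder to tune for your data volume.

```python
# More shuffle partitions mean smaller partitions, so each task is less
# likely to exceed executor memory and spill to disk. 400 is a placeholder;
# aim for partitions of roughly 100-200 MB for your workload.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```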
Shuffle moves data between nodes during operations like joins, aggregations, and repartitioning. While sometimes necessary, excessive shuffle is expensive because it involves serializing data, writing to disk, transferring across the network, and deserializing.
:::image type="content" source="../media/5-investigate-shuffle-issues.png" alt-text="Diagram explaining how to investigate shuffle issues." border="false" lightbox="../media/5-investigate-shuffle-issues.png":::
In the Spark UI, check the **Shuffle Read** and **Shuffle Write** columns for each stage. Large shuffle values indicate significant data movement. The DAG shows where shuffle operations occur as exchange nodes.
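When one side of a join is small, a standard way to avoid the exchange entirely is a broadcast join; a minimal PySpark sketch with placeholder table and column names:

```python
from pyspark.sql import functions as F

orders = spark.table("sales.orders")    # large fact table (placeholder name)
regions = spark.table("sales.regions")  # small dimension table (placeholder name)

# Broadcasting the small side ships it to every executor once, so the large
# table is joined in place and no shuffle exchange is needed for it.
joined = orders.join(F.broadcast(regions), on="region_id", how="left")
```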
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/6-implement-log-streaming-azure-analytics.md
The data flow works as follows:
3. Log Analytics ingests the events into **service-specific tables**.
4. You query, visualize, and alert on this data using **Kusto Query Language (KQL)**.
Platform administrators typically configure the diagnostic settings through the Azure portal. As a data engineer, you focus on using the logs for monitoring and troubleshooting.
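Beyond the portal, the same KQL can run from a script. The following is a minimal sketch using the `azure-monitor-query` package; the workspace ID, the `DatabricksJobs` table, and the `runFailed` action name are assumptions to verify against the service-specific tables in your own Log Analytics workspace.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Count failed job runs per hour from the Databricks diagnostic logs.
# Adjust the table and column names to match the tables in your workspace.
query = """
DatabricksJobs
| where ActionName == "runFailed"
| summarize failures = count() by bin(TimeGenerated, 1h)
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```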