learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/4-troubleshoot-repair-spark-jobs-notebooks.yml
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/2-monitor-manage-cluster-consumption.md
Unmonitored clusters present real financial risks.
At the same time, **under-provisioned clusters** create their own problems. Jobs take longer to complete, users experience delays, and critical workflows miss their deadlines. The goal isn't simply to minimize costs—it's to **match resources precisely to workload requirements**.
:::image type="content" source="../media/2-understand-impact-cluster-consumption.png" alt-text="Diagram explaining the impact of cluster consumption." border="false" lightbox="../media/2-understand-impact-cluster-consumption.png":::
Monitoring cluster consumption helps you **identify waste** before it impacts budgets and **spot performance bottlenecks** before they affect business operations. Regular monitoring also establishes **baselines** that help you plan capacity and justify infrastructure decisions.
## Monitor compute metrics
The metrics interface displays three categories of data:
**GPU metrics** (available on Databricks Runtime ML 13.3 and later) track specialized compute utilization when running machine learning workloads.
:::image type="content" source="../media/2-cluster-metrics.png" alt-text="Screenshot of Azure Databricks cluster metrics." lightbox="../media/2-cluster-metrics.png":::
You can filter metrics by time range using the date picker, viewing data from the past 30 days. Select individual nodes from the Compute dropdown to investigate specific worker performance, or view aggregated metrics across all nodes to understand overall cluster behavior.
> [!NOTE]
SQL warehouses have their own monitoring interface optimized for query analytics. Select a SQL warehouse and then select the **Monitoring** tab to view performance data.
:::image type="content" source="../media/2-warehouse-monitoring-tab.png" alt-text="Screenshot of the Azure Databricks SQL warehouse monitoring tab." lightbox="../media/2-warehouse-monitoring-tab.png":::
**Live statistics** at the top of the page show warehouse status, running queries, queued queries, and current cluster count. These metrics update in real-time and help you quickly assess whether the warehouse is keeping up with demand.
The **peak query count** chart displays the maximum number of concurrent queries—both running and queued—during your selected time frame. Spikes in this chart often indicate periods where the warehouse struggled to keep up with demand.
Monitoring reveals patterns; configuration changes act on those patterns.
**Auto-termination** shuts down idle clusters after a specified period of inactivity. For development environments, **30-60 minutes** is typically appropriate. The cluster terminates when no commands have run for the specified duration, preventing costs from accumulating overnight or over weekends.
To configure auto-termination, enable the setting during cluster creation or edit an existing cluster. Enter the number of minutes of inactivity before termination. Keep in mind that a cluster is considered inactive only when all commands—including Spark jobs, Structured Streaming, and JDBC calls—have finished executing.
**Autoscaling** dynamically adjusts the number of worker nodes based on workload demand. Configure **minimum and maximum node counts** based on your workload analysis. The cluster adds workers during intensive processing and removes them during lighter periods, reducing costs by **20-40%** compared to fixed-size clusters.
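If you manage clusters as code rather than through the UI, the same two settings can be applied programmatically. The following is a minimal sketch using the Databricks SDK for Python (`databricks-sdk`); the cluster ID, cluster name, runtime version, and node type are placeholders, and `clusters.edit` replaces the whole cluster specification, so restate any other settings you want to keep.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()  # resolves host and credentials from the environment or a config profile

w.clusters.edit(
    cluster_id="0123-456789-abcdefgh",   # placeholder cluster ID
    cluster_name="dev-analytics",        # placeholder name
    spark_version="15.4.x-scala2.12",    # placeholder runtime version
    node_type_id="Standard_DS3_v2",      # placeholder VM size
    autotermination_minutes=45,          # terminate after 45 idle minutes
    autoscale=AutoScale(min_workers=2, max_workers=8),
)
```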
Beyond real-time monitoring, you need visibility into actual spending.
**Budgets** let you set financial targets and track spending across your account. Configure **email notifications** when spending approaches or exceeds your budget limits. You can apply filters to track spending by team, project, or workspace.
**System tables**, specifically `system.billing.usage`, provide detailed usage data you can query directly. Join this table with `compute.clusters` to identify which cluster owners consume the most **Databricks Units (DBUs)**. Use **custom tags** to attribute costs to specific business units or projects.
**Tags** propagate from clusters and workspaces to billing records, enabling accurate **chargeback**. Apply tags consistently from the start—you can't add tags retroactively to historical usage. Common tags include business unit, project, and environment (development, staging, production).
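For example, a cost-attribution query along these lines can run in a notebook. This is a sketch only: the `usage_metadata.cluster_id`, `usage_quantity`, `owned_by`, and `custom_tags` columns reflect the current system table schemas, and the `project` tag key is a placeholder, so verify the names in your workspace.

```python
# Attribute DBU consumption to cluster owners (and an optional project tag)
# over the last 30 days. Assumes read access to the billing and compute
# system tables. Note that system.compute.clusters keeps one row per
# configuration change, so consider deduplicating by cluster_id for exact numbers.
dbu_by_owner = spark.sql("""
    SELECT
        c.owned_by,
        u.custom_tags['project'] AS project,
        SUM(u.usage_quantity)    AS total_dbus
    FROM system.billing.usage AS u
    JOIN system.compute.clusters AS c
      ON u.usage_metadata.cluster_id = c.cluster_id
    WHERE u.usage_unit = 'DBU'
      AND u.usage_date >= date_sub(current_date(), 30)
    GROUP BY c.owned_by, u.custom_tags['project']
    ORDER BY total_dbus DESC
""")
display(dbu_by_owner)
```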
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/3-troubleshoot-repair-lakeflow-jobs.md
To investigate a failed job:
3. In the **Runs** tab, hover over a failed task (shown in red) to see metadata including start time, end time, status, duration, and error messages.
4. Select the failed task to open the **Task run details** page with complete output and logs.
:::image type="content" source="../media/3-identify-cause-failure.png" alt-text="Screenshot showing a failed task." lightbox="../media/3-identify-cause-failure.png":::
The matrix view helps you identify patterns. If the same task fails repeatedly, the issue likely relates to that task's code or configuration. If failures appear random across different tasks, you might have a cluster or resource problem.
> [!TIP]
To repair a failed run:
4. Optionally, modify task parameters in the dialog. These values override the original settings for this repair run only.
5. Select **Repair run** to start.
:::image type="content" source="../media/3-repair-failed-task.png" alt-text="Screenshot of the failed task." lightbox="../media/3-repair-failed-task.png":::
After the repair completes, the matrix view adds a new column showing the repaired run results. Tasks that were red (failed) should now appear green (successful).
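If you prefer to trigger the same repair from a script, the Jobs API exposes it as well. The following is a minimal sketch using the Databricks SDK for Python; the run ID and task key are placeholders taken from your own failed run.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Rerun only the named failed task (and anything downstream of it) within the
# existing job run. Run ID and task key are placeholders from your failed run.
repair = w.jobs.repair_run(
    run_id=123456789,
    rerun_tasks=["ingest_orders"],
)
repair.result()  # wait for the repair run to finish
```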
> [!NOTE]
Sometimes you need to halt a running job or restart one that's stuck.
**To restart continuous jobs**: Continuous jobs that fail repeatedly enter an exponential backoff state, where Azure Databricks waits progressively longer between retry attempts. The **Job details** panel shows the number of consecutive failures and the time until the next retry. Select **Restart run** to cancel the active run, reset the retry period, and immediately start a new run.
:::image type="content" source="../media/3-stop-run.png" alt-text="Screenshot showing how to stop an active run." lightbox="../media/3-stop-run.png":::
Use the stop function when a job consumes excessive resources, processes incorrect data, or needs immediate intervention. Use restart when you've fixed an underlying issue and want to bypass the exponential backoff waiting period.
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/4-troubleshoot-repair-spark-jobs-notebooks.md
When a Spark job fails or runs slower than expected, your ability to quickly diagnose the cause matters.
Spark jobs can fail for various reasons, and understanding these patterns helps you focus your investigation. The most frequent causes fall into three categories: code errors, resource constraints, and environmental issues.
:::image type="content" source="../media/4-understand-common-causes-failures.png" alt-text="Diagram explaining the common causes of Spark job failures." border="false" lightbox="../media/4-understand-common-causes-failures.png":::
**Code-related failures** include syntax errors in notebooks, incorrect transformations, or data quality issues like schema mismatches. These failures typically produce error messages that point directly to the problematic code.
**Resource bottlenecks** occur when jobs consume more CPU, memory, or disk than available. You might see out-of-memory (OOM) errors, slow shuffle operations, or tasks that fail repeatedly. These issues often require adjusting cluster configuration or optimizing your code.
The Spark UI provides detailed visibility into job execution and is your primary diagnostic tool. To access it, navigate to your cluster's page and select the **Spark UI** tab.
:::image type="content" source="../media/4-use-spark-user-interface.png" alt-text="Screenshot of the Spark user interface." lightbox="../media/4-use-spark-user-interface.png":::
Start your investigation with the **Jobs Timeline**, which shows the execution sequence of all Spark jobs. Look for three key patterns:
**Failing jobs** appear with a red status indicator. Select any failed job to view the failed stage and specific failure reason. The error description often contains links to more detailed information about task-level failures.
After identifying a problematic job, drill into its longest stage to examine task-level details.
Resource bottlenecks manifest differently depending on which resource is constrained. The compute metrics interface helps you identify these patterns by showing CPU, memory, and network utilization across nodes.
:::image type="content" source="../media/4-identify-resolve-resource-bottlenecks.png" alt-text="Diagram explaining how to identify and resolve resource bottlenecks." border="false" lightbox="../media/4-identify-resolve-resource-bottlenecks.png":::
**Memory pressure** appears as high memory utilization across workers or the driver. In the Spark UI, look for spill indicators showing data being written to disk because memory is insufficient. You can address memory issues by increasing worker instance sizes, reducing partition counts, or optimizing transformations to minimize data held in memory.
**CPU constraints** show as high CPU utilization with long task execution times despite adequate I/O throughput. Consider enabling Photon acceleration for compatible workloads or scaling out with additional worker nodes.
To access compute metrics, select your cluster from the **Compute** page and select the **Metrics** tab. The **Server load distribution** visualization uses color coding—red indicates heavily loaded nodes, while blue shows idle resources. If the driver node appears red while workers are blue, the driver is overloaded and may need a larger instance type.
:::image type="content" source="../media/4-server-metrics.png" alt-text="Screenshot of the compute metrics tab." lightbox="../media/4-server-metrics.png":::
## Restart clusters to resolve environmental issues
Sometimes a cluster encounters problems that require a restart to resolve. Resource exhaustion, malfunctioning executors, or stale container images can all necessitate a fresh cluster start.
Before restarting, determine whether a restart is appropriate. Check the **Event log** tab on the cluster details page for lifecycle events that might explain the problem. Look for messages about instance acquisition failures, spot instance reclamation, or executor terminations.
To restart a cluster using the UI, select your cluster from the **Compute** page and select **Restart**.
:::image type="content" source="../media/4-restart-cluster.png" alt-text="Screenshot showing how to restart a cluster." lightbox="../media/4-restart-cluster.png":::
You can also restart programmatically using the Databricks CLI:
```bash
databricks clusters restart CLUSTER_ID
```

Replace `CLUSTER_ID` with your cluster's identifier, which you can find on the cluster details page.
> [!IMPORTANT]
> Restarting a cluster terminates any running jobs and resets the Spark UI history. Save any diagnostic information you need before restarting. For long-running clusters processing streaming data, consider scheduling regular restarts during maintenance windows to ensure the cluster runs on current images.
## Repair failed job runs

When a job with multiple tasks fails, you don't need to rerun the entire job. The repair run feature lets you execute only the failed tasks and their dependents, saving time and resources. Note that repair is supported only for jobs that orchestrate two or more tasks.

To repair a job run:

1. Navigate to **Job Runs** in the sidebar.
2. Select the failed job from the list.
3. Select **Repair run** to see all tasks that will be reexecuted.
4. Optionally modify task parameters before repair.
5. Select **Repair run** to start the recovery.

For jobs that fail repeatedly, Databricks Assistant can help diagnose errors. Open the failed job and select **Diagnose Error** to receive suggestions for resolving the issue.
After making changes—whether adjusting cluster configuration, fixing code, or resolving external dependencies—validate your fix by monitoring the next job run. Check that execution times return to expected levels and that no new errors appear in the Spark UI or job output.
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/5-resolve-cache-skew-spill-shuffle.md
When investigating caching problems, consider these scenarios:
:::image type="content" source="../media/5-investigate-cache-issues.png" alt-text="Diagram showing how to investigate caching issues." border="false" lightbox="../media/5-investigate-cache-issues.png":::
**Under-caching** means data is read repeatedly from remote storage when it could be served from cache. The Spark UI shows high **Input** values for stages that read the same data multiple times. Enable disk cache and use worker nodes with SSD storage for better performance.
**Over-caching** consumes memory that Spark needs for processing. If you see memory pressure or out-of-memory errors, review whether cached data is actually being reused. Spark cache (using `.cache()` or `.persist()`) requires explicit management, unlike automatic disk caching.
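Explicit management of the Spark cache looks roughly like the following minimal PySpark sketch, in which the table name is a placeholder:

```python
df = spark.table("sales.orders")  # placeholder table name

df.cache()       # mark the DataFrame for caching
df.count()       # run an action so the cache is actually materialized
# ... reuse df in several subsequent queries ...
df.unpersist()   # release the cached blocks once the reuse is over
```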
Data skew occurs when some partitions contain significantly more data than others. This imbalance causes a few tasks to run much longer than the rest, leaving most cluster resources idle while waiting for slow tasks to complete.
:::image type="content" source="../media/5-investigate-data-skew.png" alt-text="Diagram showing how to investigate data skew." border="false" lightbox="../media/5-investigate-data-skew.png":::
To identify skew in the Spark UI, navigate to the stage's page and scroll to **Summary Metrics**. Compare the **Max** duration to the **75th percentile**. If the Max is more than 50% higher than the 75th percentile, you likely have skew.
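If the skew surfaces in joins, one common mitigation (a general Spark 3.x capability rather than anything specific to this module) is adaptive query execution's skew-join handling, sketched below:

```python
# Adaptive query execution (Spark 3.x) detects skewed partitions at runtime
# and splits them during sort-merge joins, so a handful of oversized tasks
# no longer dominate the stage.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

For severe skew on a known hot key, salting the key before the join is another option.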
Spill happens when Spark runs out of memory during processing and writes intermediate data to disk. This disk I/O significantly slows down operations. Spill commonly occurs during shuffle operations, aggregations, or when partitions are too large.
The Spark UI shows spill metrics at the top of each stage's page. Look for **Shuffle Spill (Memory)** and **Shuffle Spill (Disk)** values. Any non-zero spill indicates memory pressure.
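Because spill usually means individual shuffle partitions are too large for executor memory, one hedged remedy is to raise the shuffle partition count so each task handles less data; the value below is a placeholder to tune for your data volume.

```python
# More shuffle partitions mean smaller partitions, so each task is less
# likely to exceed executor memory and spill to disk. 400 is a placeholder;
# aim for partitions of roughly 100-200 MB for your workload.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```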
Shuffle moves data between nodes during operations like joins, aggregations, and repartitioning. While sometimes necessary, excessive shuffle is expensive because it involves serializing data, writing to disk, transferring across the network, and deserializing.
:::image type="content" source="../media/5-investigate-shuffle-issues.png" alt-text="Diagram explaining how to investigate shuffle issues." border="false" lightbox="../media/5-investigate-shuffle-issues.png":::
In the Spark UI, check the **Shuffle Read** and **Shuffle Write** columns for each stage. Large shuffle values indicate significant data movement. The DAG shows where shuffle operations occur as exchange nodes.
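When one side of a join is small, a standard way to avoid the exchange entirely is a broadcast join; a minimal PySpark sketch with placeholder table and column names:

```python
from pyspark.sql import functions as F

orders = spark.table("sales.orders")    # large fact table (placeholder name)
regions = spark.table("sales.regions")  # small dimension table (placeholder name)

# Broadcasting the small side ships it to every executor once, so the large
# table is joined in place and no shuffle exchange is needed for it.
joined = orders.join(F.broadcast(regions), on="region_id", how="left")
```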
learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/6-implement-log-streaming-azure-analytics.md
The data flow works as follows:
3. Log Analytics ingests the events into **service-specific tables**.
4. You query, visualize, and alert on this data using **Kusto Query Language (KQL)**.
Platform administrators typically configure the diagnostic settings through the Azure portal. As a data engineer, you focus on using the logs for monitoring and troubleshooting.
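Beyond the portal, the same KQL can run from a script. The following is a minimal sketch using the `azure-monitor-query` package; the workspace ID, the `DatabricksJobs` table, and the `runFailed` action name are assumptions to verify against the service-specific tables in your own Log Analytics workspace.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Count failed job runs per hour from the Databricks diagnostic logs.
# Adjust the table and column names to match the tables in your workspace.
query = """
DatabricksJobs
| where ActionName == "runFailed"
| summarize failures = count() by bin(TimeGenerated, 1h)
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```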