
Commit 0233c45

Commit message: added images

1 parent e01c114 commit 0233c45

18 files changed

Lines changed: 30 additions & 16 deletions

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/4-troubleshoot-repair-spark-jobs-notebooks.yml

Lines changed: 1 addition & 1 deletion
@@ -9,6 +9,6 @@ metadata:
 ms.author: wedebols
 ms.topic: unit
 ai-usage: ai-generated
-durationInMinutes: 7
+durationInMinutes: 6
 content: |
 [!include[](includes/4-troubleshoot-repair-spark-jobs-notebooks.md)]

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/3-troubleshoot-repair-lakeflow-jobs.md

Lines changed: 6 additions & 0 deletions
@@ -25,6 +25,8 @@ To investigate a failed job:
 3. In the **Runs** tab, hover over a failed task (shown in red) to see metadata including start time, end time, status, duration, and error messages.
 4. Select the failed task to open the **Task run details** page with complete output and logs.

+:::image type="content" source="../media/3-identify-cause-failure.png" alt-text="Screenshot showing a failed task." lightbox="../media/3-identify-cause-failure.png":::
+
 The matrix view helps you identify patterns. If the same task fails repeatedly, the issue likely relates to that task's code or configuration. If failures appear random across different tasks, you might have a cluster or resource problem.

 > [!TIP]
@@ -54,6 +56,8 @@ To repair a failed run:
 4. Optionally, modify task parameters in the dialog. These values override the original settings for this repair run only.
 5. Select **Repair run** to start.

+:::image type="content" source="../media/3-repair-failed-task.png" alt-text="Screenshot of the failed task." lightbox="../media/3-repair-failed-task.png":::
+
 After the repair completes, the matrix view adds a new column showing the repaired run results. Tasks that were red (failed) should now appear green (successful).
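The steps in this hunk describe the UI flow; a repair can also be triggered through the Jobs REST API. The sketch below is an editorial illustration rather than part of the committed file, assuming the workspace URL and a personal access token are exported as `DATABRICKS_HOST` and `DATABRICKS_TOKEN` (names chosen for this example), and using hypothetical run ID and task key values:

```python
import os
import requests

# Minimal sketch: rerun only the failed tasks of a job run via the Jobs API 2.1.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

response = requests.post(
    f"{host}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "run_id": 1120953335,              # hypothetical ID of the failed job run
        "rerun_tasks": ["ingest_orders"],  # hypothetical task key(s) to rerun
    },
)
response.raise_for_status()
print(response.json())  # includes a repair_id identifying this repair attempt
```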

 > [!NOTE]
@@ -67,6 +71,8 @@ Sometimes you need to halt a running job or restart one that's stuck. The Jobs U

 **To restart continuous jobs**: Continuous jobs that fail repeatedly enter an exponential backoff state, where Azure Databricks waits progressively longer between retry attempts. The **Job details** panel shows the number of consecutive failures and the time until the next retry. Select **Restart run** to cancel the active run, reset the retry period, and immediately start a new run.

+:::image type="content" source="../media/3-stop-run.png" alt-text="Screenshot showing how to stop an active run." lightbox="../media/3-stop-run.png":::
+
 Use the stop function when a job consumes excessive resources, processes incorrect data, or needs immediate intervention. Use restart when you've fixed an underlying issue and want to bypass the exponential backoff waiting period.
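Stopping an active run can be scripted in the same way. A minimal sketch, again illustrative rather than part of this commit, reusing the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` assumptions from the repair example and a hypothetical run ID:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Request cancellation of one active run; the run is stopped asynchronously.
response = requests.post(
    f"{host}/api/2.1/jobs/runs/cancel",
    headers={"Authorization": f"Bearer {token}"},
    json={"run_id": 1120953335},  # hypothetical run ID
)
response.raise_for_status()
```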

 ## Monitor job health proactively

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/4-troubleshoot-repair-spark-jobs-notebooks.md

Lines changed: 13 additions & 15 deletions
@@ -4,6 +4,8 @@ When a Spark job fails or runs slower than expected, your ability to quickly dia

 Spark jobs can fail for various reasons, and understanding these patterns helps you focus your investigation. The most frequent causes fall into three categories: code errors, resource constraints, and environmental issues.

+:::image type="content" source="../media/4-understand-common-causes-failures.png" alt-text="Diagram explaining the common causes of Spark job failures." border="false" lightbox="../media/4-understand-common-causes-failures.png":::
+
 **Code-related failures** include syntax errors in notebooks, incorrect transformations, or data quality issues like schema mismatches. These failures typically produce error messages that point directly to the problematic code.

 **Resource bottlenecks** occur when jobs consume more CPU, memory, or disk than available. You might see out-of-memory (OOM) errors, slow shuffle operations, or tasks that fail repeatedly. These issues often require adjusting cluster configuration or optimizing your code.
@@ -14,6 +16,8 @@ Spark jobs can fail for various reasons, and understanding these patterns helps

 The Spark UI provides detailed visibility into job execution and is your primary diagnostic tool. To access it, navigate to your cluster's page and select the **Spark UI** tab.

+:::image type="content" source="../media/4-use-spark-user-interface.png" alt-text="Screenshot of the Spark user interface." lightbox="../media/4-use-spark-user-interface.png":::
+
 Start your investigation with the **Jobs Timeline**, which shows the execution sequence of all Spark jobs. Look for three key patterns:

 **Failing jobs** appear with a red status indicator. Select any failed job to view the failed stage and specific failure reason. The error description often contains links to more detailed information about task-level failures.
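One practical aid when reading the Jobs Timeline, not covered by this diff, is labeling the Spark jobs your notebook triggers so they are easy to pick out. A minimal PySpark sketch, assuming a Databricks notebook where `spark` is predefined and using hypothetical paths and table names:

```python
# Label the Spark jobs launched by each phase so they stand out in the Jobs Timeline.
spark.sparkContext.setJobDescription("Daily load: parse raw events")
raw = spark.read.json("/mnt/raw/events/")   # hypothetical input path
raw.count()                                 # triggers a labeled Spark job

spark.sparkContext.setJobDescription("Daily load: write curated table")
raw.write.mode("overwrite").saveAsTable("curated.events")  # hypothetical table
```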
@@ -31,6 +35,8 @@ After identifying a problematic job, drill into its longest stage to examine tas

 Resource bottlenecks manifest differently depending on which resource is constrained. The compute metrics interface helps you identify these patterns by showing CPU, memory, and network utilization across nodes.

+:::image type="content" source="../media/4-identify-resolve-resource-bottlenecks.png" alt-text="Diagram explaining how to identify and resolve resource bottlenecks." border="false" lightbox="../media/4-identify-resolve-resource-bottlenecks.png":::
+
 **Memory pressure** appears as high memory utilization across workers or the driver. In the Spark UI, look for spill indicators showing data being written to disk because memory is insufficient. You can address memory issues by increasing worker instance sizes, reducing partition counts, or optimizing transformations to minimize data held in memory.

 **CPU constraints** show as high CPU utilization with long task execution times despite adequate I/O throughput. Consider enabling Photon acceleration for compatible workloads or scaling out with additional worker nodes.
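To make the "reducing partition counts or optimizing transformations" advice concrete, here is a minimal PySpark sketch, illustrative only and not part of the committed file, assuming a Databricks notebook where `spark` is predefined and using hypothetical table and column names:

```python
from pyspark.sql import functions as F

orders = spark.read.table("sales.orders")  # hypothetical table

# Project only the columns the aggregation needs so executors hold less data in memory.
slim = orders.select("customer_id", "order_total")

# Partition count is a tuning knob: fewer, larger partitions cut task overhead,
# while more, smaller partitions relieve per-task memory pressure. 200 is a placeholder.
repartitioned = slim.repartition(200, "customer_id")

totals = (
    repartitioned.groupBy("customer_id")
    .agg(F.sum("order_total").alias("lifetime_value"))
)
totals.write.mode("overwrite").saveAsTable("sales.customer_lifetime_value")
```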
@@ -39,13 +45,19 @@ Resource bottlenecks manifest differently depending on which resource is constra

 To access compute metrics, select your cluster from the **Compute** page and select the **Metrics** tab. The **Server load distribution** visualization uses color coding—red indicates heavily loaded nodes, while blue shows idle resources. If the driver node appears red while workers are blue, the driver is overloaded and may need a larger instance type.

+:::image type="content" source="../media/4-server-metrics.png" alt-text="Screenshot of the compute metrics tab." lightbox="../media/4-server-metrics.png":::
+
 ## Restart clusters to resolve environmental issues

 Sometimes a cluster encounters problems that require a restart to resolve. Resource exhaustion, malfunctioning executors, or stale container images can all necessitate a fresh cluster start.

 Before restarting, determine whether a restart is appropriate. Check the **Event log** tab on the cluster details page for lifecycle events that might explain the problem. Look for messages about instance acquisition failures, spot instance reclamation, or executor terminations.

-To restart a cluster using the UI, select your cluster from the **Compute** page and select **Restart**. You can also restart programmatically using the Databricks CLI:
+To restart a cluster using the UI, select your cluster from the **Compute** page and select **Restart**.
+
+:::image type="content" source="../media/4-restart-cluster.png" alt-text="Screenshot showing how to restart a cluster." lightbox="../media/4-restart-cluster.png":::
+
+You can also restart programmatically using the Databricks CLI:

 ```bash
 databricks clusters restart CLUSTER_ID
 ```
@@ -56,18 +68,4 @@ Replace `CLUSTER_ID` with your cluster's identifier, which you can find on the c
 > [!IMPORTANT]
 > Restarting a cluster terminates any running jobs and resets the Spark UI history. Save any diagnostic information you need before restarting. For long-running clusters processing streaming data, consider scheduling regular restarts during maintenance windows to ensure the cluster runs on current images.

-## Repair failed job runs
-
-When a job with multiple tasks fails, you don't need to rerun the entire job. The repair run feature lets you execute only the failed tasks and their dependents, saving time and resources. Note that repair is supported only for jobs that orchestrate two or more tasks.
-
-To repair a job run:
-
-1. Navigate to **Job Runs** in the sidebar.
-2. Select the failed job from the list.
-3. Select **Repair run** to see all tasks that will be reexecuted.
-4. Optionally modify task parameters before repair.
-5. Select **Repair run** to start the recovery.
-
-For jobs that fail repeatedly, Databricks Assistant can help diagnose errors. Open the failed job and select **Diagnose Error** to receive suggestions for resolving the issue.
-
 After making changes—whether adjusting cluster configuration, fixing code, or resolving external dependencies—validate your fix by monitoring the next job run. Check that execution times return to expected levels and that no new errors appear in the Spark UI or job output.

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/5-resolve-cache-skew-spill-shuffle.md

Lines changed: 8 additions & 0 deletions
@@ -25,6 +25,8 @@ spark.conf.get("spark.databricks.io.cache.enabled")

 When investigating caching problems, consider these scenarios:

+:::image type="content" source="../media/5-investigate-cache-issues.png" alt-text="Diagram showing how to investigate caching issues." border="false" lightbox="../media/5-investigate-cache-issues.png":::
+
 **Under-caching** means data is read repeatedly from remote storage when it could be served from cache. The Spark UI shows high **Input** values for stages that read the same data multiple times. Enable disk cache and use worker nodes with SSD storage for better performance.

 **Over-caching** consumes memory that Spark needs for processing. If you see memory pressure or out-of-memory errors, review whether cached data is actually being reused. Spark cache (using `.cache()` or `.persist()`) requires explicit management, unlike automatic disk caching.
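As an illustrative aside (not part of the committed file), the disk cache setting referenced in this hunk and the explicit Spark cache lifecycle look like this in a Databricks notebook, where `spark` is predefined and the table name is hypothetical:

```python
# Turn on the Databricks disk (IO) cache for this session and confirm the setting.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
print(spark.conf.get("spark.databricks.io.cache.enabled"))

# Explicit Spark caching needs matching cleanup, unlike the automatic disk cache.
events = spark.read.table("telemetry.events")  # hypothetical table
events.cache()      # only worthwhile if the DataFrame is reused several times
events.count()      # materializes the cache
# ... reuse `events` in later transformations ...
events.unpersist()  # release executor memory when finished
```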
@@ -39,6 +41,8 @@ df.unpersist()

 Data skew occurs when some partitions contain significantly more data than others. This imbalance causes a few tasks to run much longer than the rest, leaving most cluster resources idle while waiting for slow tasks to complete.

+:::image type="content" source="../media/5-investigate-data-skew.png" alt-text="Diagram showing how to investigate data skew." border="false" lightbox="../media/5-investigate-data-skew.png":::
+
 To identify skew in the Spark UI, navigate to the stage's page and scroll to **Summary Metrics**. Compare the **Max** duration to the **75th percentile**. If the Max is more than 50% higher than the 75th percentile, you likely have skew.
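As an illustrative aside, adaptive query execution (referenced elsewhere in this file) can mitigate skew automatically. A minimal sketch using standard Spark 3 settings, assuming a Databricks notebook where `spark` is predefined; the filter shown is a hypothetical way of handling a dominant key:

```python
# AQE can split oversized partitions during joins; these Spark 3 settings are
# usually enabled by default on Databricks.
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# If a single key dominates (for example null customer IDs), handling it
# separately before the join is another common mitigation. Hypothetical names.
orders = spark.read.table("sales.orders")
orders_with_customer = orders.where("customer_id IS NOT NULL")
```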

 Common causes of skew include:
@@ -63,6 +67,8 @@ spark.conf.get("spark.databricks.optimizer.adaptive.enabled")

 Spill happens when Spark runs out of memory during processing and writes intermediate data to disk. This disk I/O significantly slows down operations. Spill commonly occurs during shuffle operations, aggregations, or when partitions are too large.

+:::image type="content" source="../media/5-investigate-memory-spill.png" alt-text="Diagram explaining memory spill." border="false" lightbox="../media/5-investigate-memory-spill.png":::
+
 The Spark UI shows spill metrics at the top of each stage's page. Look for **Shuffle Spill (Memory)** and **Shuffle Spill (Disk)** values. Any non-zero spill indicates memory pressure.

 To reduce spill:
@@ -84,6 +90,8 @@ spark.conf.set("spark.sql.shuffle.partitions", "auto")

 Shuffle moves data between nodes during operations like joins, aggregations, and repartitioning. While sometimes necessary, excessive shuffle is expensive because it involves serializing data, writing to disk, transferring across the network, and deserializing.

+:::image type="content" source="../media/5-investigate-shuffle-issues.png" alt-text="Diagram explaining how to investigate shuffle issues" border="false" lightbox="../media/5-investigate-shuffle-issues.png":::
+
 In the Spark UI, check the **Shuffle Read** and **Shuffle Write** columns for each stage. Large shuffle values indicate significant data movement. The DAG shows where shuffle operations occur as exchange nodes.
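One way to remove an exchange node, likely overlapping with the approaches listed in the unchanged part of this file, is broadcasting a small dimension table so the large side never shuffles. A minimal sketch with hypothetical table names, assuming a Databricks notebook where `spark` is predefined:

```python
from pyspark.sql import functions as F

orders = spark.read.table("sales.orders")        # large fact table (hypothetical)
regions = spark.read.table("reference.regions")  # small dimension table (hypothetical)

# Broadcasting ships the small table to every executor, so the join needs no shuffle.
joined = orders.join(F.broadcast(regions), on="region_id", how="left")
joined.explain()  # plan should show BroadcastHashJoin rather than a shuffle exchange
```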

 Reduce unnecessary shuffle with these approaches:

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/6-implement-log-streaming-azure-analytics.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ The data flow works as follows:
 3. Log Analytics ingests the events into **service-specific tables**.
 4. You query, visualize, and alert on this data using **Kusto Query Language (KQL)**.

+:::image type="content" source="../media/6-understand-log-streaming.png" alt-text="Diagram explaining log streaming architecture." border="false" lightbox="../media/6-understand-log-streaming.png":::
+
 Platform administrators typically configure the diagnostic settings through the Azure portal. As a data engineer, you focus on using the logs for monitoring and troubleshooting.
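For orientation, querying the ingested tables from Python might look like the sketch below. It is illustrative only: it uses the `azure-monitor-query` and `azure-identity` packages, a placeholder workspace ID, and a `DatabricksJobs` table name that depends on which diagnostic categories your administrator enabled:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Table and column names are illustrative; check your workspace for the actual schema.
kql = """
DatabricksJobs
| where TimeGenerated > ago(1d)
| summarize count() by ActionName
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=kql,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```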

 > [!NOTE]
The remaining changed files are new screenshot and diagram images added under the module's media folder (binary content not shown).
