
Commit 0233c45

Commit message: added images

1 parent e01c114 commit 0233c45

18 files changed

Lines changed: 30 additions & 16 deletions

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/4-troubleshoot-repair-spark-jobs-notebooks.yml

Lines changed: 1 addition & 1 deletion
@@ -9,6 +9,6 @@ metadata:
 ms.author: wedebols
 ms.topic: unit
 ai-usage: ai-generated
-durationInMinutes: 7
+durationInMinutes: 6
 content: |
 [!include[](includes/4-troubleshoot-repair-spark-jobs-notebooks.md)]

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/3-troubleshoot-repair-lakeflow-jobs.md

Lines changed: 6 additions & 0 deletions
@@ -25,6 +25,8 @@ To investigate a failed job:
 3. In the **Runs** tab, hover over a failed task (shown in red) to see metadata including start time, end time, status, duration, and error messages.
 4. Select the failed task to open the **Task run details** page with complete output and logs.

+:::image type="content" source="../media/3-identify-cause-failure.png" alt-text="Screenshot showing a failed task." lightbox="../media/3-identify-cause-failure.png":::
+
 The matrix view helps you identify patterns. If the same task fails repeatedly, the issue likely relates to that task's code or configuration. If failures appear random across different tasks, you might have a cluster or resource problem.

 > [!TIP]
@@ -54,6 +56,8 @@ To repair a failed run:
 4. Optionally, modify task parameters in the dialog. These values override the original settings for this repair run only.
 5. Select **Repair run** to start.

+:::image type="content" source="../media/3-repair-failed-task.png" alt-text="Screenshot of the failed task." lightbox="../media/3-repair-failed-task.png":::
+
 After the repair completes, the matrix view adds a new column showing the repaired run results. Tasks that were red (failed) should now appear green (successful).
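The steps in this hunk describe the UI flow; a repair can also be triggered through the Jobs REST API. The sketch below is an editorial illustration rather than part of the committed file, assuming the workspace URL and a personal access token are exported as `DATABRICKS_HOST` and `DATABRICKS_TOKEN` (names chosen for this example), and using hypothetical run ID and task key values:

```python
import os
import requests

# Minimal sketch: rerun only the failed tasks of a job run via the Jobs API 2.1.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

response = requests.post(
    f"{host}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "run_id": 1120953335,              # hypothetical ID of the failed job run
        "rerun_tasks": ["ingest_orders"],  # hypothetical task key(s) to rerun
    },
)
response.raise_for_status()
print(response.json())  # includes a repair_id identifying this repair attempt
```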

 > [!NOTE]
@@ -67,6 +71,8 @@ Sometimes you need to halt a running job or restart one that's stuck. The Jobs U

 **To restart continuous jobs**: Continuous jobs that fail repeatedly enter an exponential backoff state, where Azure Databricks waits progressively longer between retry attempts. The **Job details** panel shows the number of consecutive failures and the time until the next retry. Select **Restart run** to cancel the active run, reset the retry period, and immediately start a new run.

+:::image type="content" source="../media/3-stop-run.png" alt-text="Screenshot showing how to stop an active run." lightbox="../media/3-stop-run.png":::
+
 Use the stop function when a job consumes excessive resources, processes incorrect data, or needs immediate intervention. Use restart when you've fixed an underlying issue and want to bypass the exponential backoff waiting period.
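Stopping an active run can be scripted in the same way. A minimal sketch, again illustrative rather than part of this commit, reusing the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` assumptions from the repair example and a hypothetical run ID:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Request cancellation of one active run; the run is stopped asynchronously.
response = requests.post(
    f"{host}/api/2.1/jobs/runs/cancel",
    headers={"Authorization": f"Bearer {token}"},
    json={"run_id": 1120953335},  # hypothetical run ID
)
response.raise_for_status()
```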

 ## Monitor job health proactively

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/4-troubleshoot-repair-spark-jobs-notebooks.md

Lines changed: 13 additions & 15 deletions
@@ -4,6 +4,8 @@ When a Spark job fails or runs slower than expected, your ability to quickly dia

 Spark jobs can fail for various reasons, and understanding these patterns helps you focus your investigation. The most frequent causes fall into three categories: code errors, resource constraints, and environmental issues.

+:::image type="content" source="../media/4-understand-common-causes-failures.png" alt-text="Diagram explaining the common causes of Spark job failures." border="false" lightbox="../media/4-understand-common-causes-failures.png":::
+
 **Code-related failures** include syntax errors in notebooks, incorrect transformations, or data quality issues like schema mismatches. These failures typically produce error messages that point directly to the problematic code.

 **Resource bottlenecks** occur when jobs consume more CPU, memory, or disk than available. You might see out-of-memory (OOM) errors, slow shuffle operations, or tasks that fail repeatedly. These issues often require adjusting cluster configuration or optimizing your code.
@@ -14,6 +16,8 @@ Spark jobs can fail for various reasons, and understanding these patterns helps

 The Spark UI provides detailed visibility into job execution and is your primary diagnostic tool. To access it, navigate to your cluster's page and select the **Spark UI** tab.

+:::image type="content" source="../media/4-use-spark-user-interface.png" alt-text="Screenshot of the Spark user interface." lightbox="../media/4-use-spark-user-interface.png":::
+
 Start your investigation with the **Jobs Timeline**, which shows the execution sequence of all Spark jobs. Look for three key patterns:

 **Failing jobs** appear with a red status indicator. Select any failed job to view the failed stage and specific failure reason. The error description often contains links to more detailed information about task-level failures.
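One practical aid when reading the Jobs Timeline, not covered by this diff, is labeling the Spark jobs your notebook triggers so they are easy to pick out. A minimal PySpark sketch, assuming a Databricks notebook where `spark` is predefined and using hypothetical paths and table names:

```python
# Label the Spark jobs launched by each phase so they stand out in the Jobs Timeline.
spark.sparkContext.setJobDescription("Daily load: parse raw events")
raw = spark.read.json("/mnt/raw/events/")   # hypothetical input path
raw.count()                                 # triggers a labeled Spark job

spark.sparkContext.setJobDescription("Daily load: write curated table")
raw.write.mode("overwrite").saveAsTable("curated.events")  # hypothetical table
```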
@@ -31,6 +35,8 @@ After identifying a problematic job, drill into its longest stage to examine tas

 Resource bottlenecks manifest differently depending on which resource is constrained. The compute metrics interface helps you identify these patterns by showing CPU, memory, and network utilization across nodes.

+:::image type="content" source="../media/4-identify-resolve-resource-bottlenecks.png" alt-text="Diagram explaining how to identify and resolve resource bottlenecks." border="false" lightbox="../media/4-identify-resolve-resource-bottlenecks.png":::
+
 **Memory pressure** appears as high memory utilization across workers or the driver. In the Spark UI, look for spill indicators showing data being written to disk because memory is insufficient. You can address memory issues by increasing worker instance sizes, reducing partition counts, or optimizing transformations to minimize data held in memory.

 **CPU constraints** show as high CPU utilization with long task execution times despite adequate I/O throughput. Consider enabling Photon acceleration for compatible workloads or scaling out with additional worker nodes.
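To make the "reducing partition counts or optimizing transformations" advice concrete, here is a minimal PySpark sketch, illustrative only and not part of the committed file, assuming a Databricks notebook where `spark` is predefined and using hypothetical table and column names:

```python
from pyspark.sql import functions as F

orders = spark.read.table("sales.orders")  # hypothetical table

# Project only the columns the aggregation needs so executors hold less data in memory.
slim = orders.select("customer_id", "order_total")

# Partition count is a tuning knob: fewer, larger partitions cut task overhead,
# while more, smaller partitions relieve per-task memory pressure. 200 is a placeholder.
repartitioned = slim.repartition(200, "customer_id")

totals = (
    repartitioned.groupBy("customer_id")
    .agg(F.sum("order_total").alias("lifetime_value"))
)
totals.write.mode("overwrite").saveAsTable("sales.customer_lifetime_value")
```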
@@ -39,13 +45,19 @@ Resource bottlenecks manifest differently depending on which resource is constra

 To access compute metrics, select your cluster from the **Compute** page and select the **Metrics** tab. The **Server load distribution** visualization uses color coding—red indicates heavily loaded nodes, while blue shows idle resources. If the driver node appears red while workers are blue, the driver is overloaded and may need a larger instance type.

+:::image type="content" source="../media/4-server-metrics.png" alt-text="Screenshot of the compute metrics tab." lightbox="../media/4-server-metrics.png":::
+
 ## Restart clusters to resolve environmental issues

 Sometimes a cluster encounters problems that require a restart to resolve. Resource exhaustion, malfunctioning executors, or stale container images can all necessitate a fresh cluster start.

 Before restarting, determine whether a restart is appropriate. Check the **Event log** tab on the cluster details page for lifecycle events that might explain the problem. Look for messages about instance acquisition failures, spot instance reclamation, or executor terminations.

-To restart a cluster using the UI, select your cluster from the **Compute** page and select **Restart**. You can also restart programmatically using the Databricks CLI:
+To restart a cluster using the UI, select your cluster from the **Compute** page and select **Restart**.
+
+:::image type="content" source="../media/4-restart-cluster.png" alt-text="Screenshot showing how to restart a cluster." lightbox="../media/4-restart-cluster.png":::
+
+You can also restart programmatically using the Databricks CLI:

 ```bash
 databricks clusters restart CLUSTER_ID
 ```
@@ -56,18 +68,4 @@ Replace `CLUSTER_ID` with your cluster's identifier, which you can find on the c
 > [!IMPORTANT]
 > Restarting a cluster terminates any running jobs and resets the Spark UI history. Save any diagnostic information you need before restarting. For long-running clusters processing streaming data, consider scheduling regular restarts during maintenance windows to ensure the cluster runs on current images.

-## Repair failed job runs
-
-When a job with multiple tasks fails, you don't need to rerun the entire job. The repair run feature lets you execute only the failed tasks and their dependents, saving time and resources. Note that repair is supported only for jobs that orchestrate two or more tasks.
-
-To repair a job run:
-
-1. Navigate to **Job Runs** in the sidebar.
-2. Select the failed job from the list.
-3. Select **Repair run** to see all tasks that will be reexecuted.
-4. Optionally modify task parameters before repair.
-5. Select **Repair run** to start the recovery.
-
-For jobs that fail repeatedly, Databricks Assistant can help diagnose errors. Open the failed job and select **Diagnose Error** to receive suggestions for resolving the issue.
-
 After making changes—whether adjusting cluster configuration, fixing code, or resolving external dependencies—validate your fix by monitoring the next job run. Check that execution times return to expected levels and that no new errors appear in the Spark UI or job output.

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/5-resolve-cache-skew-spill-shuffle.md

Lines changed: 8 additions & 0 deletions
@@ -25,6 +25,8 @@ spark.conf.get("spark.databricks.io.cache.enabled")

 When investigating caching problems, consider these scenarios:

+:::image type="content" source="../media/5-investigate-cache-issues.png" alt-text="Diagram showing how to investigate caching issues." border="false" lightbox="../media/5-investigate-cache-issues.png":::
+
 **Under-caching** means data is read repeatedly from remote storage when it could be served from cache. The Spark UI shows high **Input** values for stages that read the same data multiple times. Enable disk cache and use worker nodes with SSD storage for better performance.

 **Over-caching** consumes memory that Spark needs for processing. If you see memory pressure or out-of-memory errors, review whether cached data is actually being reused. Spark cache (using `.cache()` or `.persist()`) requires explicit management, unlike automatic disk caching.
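As an illustrative aside (not part of the committed file), the disk cache setting referenced in this hunk and the explicit Spark cache lifecycle look like this in a Databricks notebook, where `spark` is predefined and the table name is hypothetical:

```python
# Turn on the Databricks disk (IO) cache for this session and confirm the setting.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
print(spark.conf.get("spark.databricks.io.cache.enabled"))

# Explicit Spark caching needs matching cleanup, unlike the automatic disk cache.
events = spark.read.table("telemetry.events")  # hypothetical table
events.cache()      # only worthwhile if the DataFrame is reused several times
events.count()      # materializes the cache
# ... reuse `events` in later transformations ...
events.unpersist()  # release executor memory when finished
```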
@@ -39,6 +41,8 @@ df.unpersist()

 Data skew occurs when some partitions contain significantly more data than others. This imbalance causes a few tasks to run much longer than the rest, leaving most cluster resources idle while waiting for slow tasks to complete.

+:::image type="content" source="../media/5-investigate-data-skew.png" alt-text="Diagram showing how to investigate data skew." border="false" lightbox="../media/5-investigate-data-skew.png":::
+
 To identify skew in the Spark UI, navigate to the stage's page and scroll to **Summary Metrics**. Compare the **Max** duration to the **75th percentile**. If the Max is more than 50% higher than the 75th percentile, you likely have skew.
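As an illustrative aside, adaptive query execution (referenced elsewhere in this file) can mitigate skew automatically. A minimal sketch using standard Spark 3 settings, assuming a Databricks notebook where `spark` is predefined; the filter shown is a hypothetical way of handling a dominant key:

```python
# AQE can split oversized partitions during joins; these Spark 3 settings are
# usually enabled by default on Databricks.
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# If a single key dominates (for example null customer IDs), handling it
# separately before the join is another common mitigation. Hypothetical names.
orders = spark.read.table("sales.orders")
orders_with_customer = orders.where("customer_id IS NOT NULL")
```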

 Common causes of skew include:
@@ -63,6 +67,8 @@ spark.conf.get("spark.databricks.optimizer.adaptive.enabled")

 Spill happens when Spark runs out of memory during processing and writes intermediate data to disk. This disk I/O significantly slows down operations. Spill commonly occurs during shuffle operations, aggregations, or when partitions are too large.

+:::image type="content" source="../media/5-investigate-memory-spill.png" alt-text="Diagram explaining memory spill." border="false" lightbox="../media/5-investigate-memory-spill.png":::
+
 The Spark UI shows spill metrics at the top of each stage's page. Look for **Shuffle Spill (Memory)** and **Shuffle Spill (Disk)** values. Any non-zero spill indicates memory pressure.

 To reduce spill:
@@ -84,6 +90,8 @@ spark.conf.set("spark.sql.shuffle.partitions", "auto")

 Shuffle moves data between nodes during operations like joins, aggregations, and repartitioning. While sometimes necessary, excessive shuffle is expensive because it involves serializing data, writing to disk, transferring across the network, and deserializing.

+:::image type="content" source="../media/5-investigate-shuffle-issues.png" alt-text="Diagram explaining how to investigate shuffle issues" border="false" lightbox="../media/5-investigate-shuffle-issues.png":::
+
 In the Spark UI, check the **Shuffle Read** and **Shuffle Write** columns for each stage. Large shuffle values indicate significant data movement. The DAG shows where shuffle operations occur as exchange nodes.
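One way to remove an exchange node, likely overlapping with the approaches listed in the unchanged part of this file, is broadcasting a small dimension table so the large side never shuffles. A minimal sketch with hypothetical table names, assuming a Databricks notebook where `spark` is predefined:

```python
from pyspark.sql import functions as F

orders = spark.read.table("sales.orders")        # large fact table (hypothetical)
regions = spark.read.table("reference.regions")  # small dimension table (hypothetical)

# Broadcasting ships the small table to every executor, so the join needs no shuffle.
joined = orders.join(F.broadcast(regions), on="region_id", how="left")
joined.explain()  # plan should show BroadcastHashJoin rather than a shuffle exchange
```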

 Reduce unnecessary shuffle with these approaches:

learn-pr/wwl-databricks/monitor-troubleshoot-optimize-workloads-azure-databricks/includes/6-implement-log-streaming-azure-analytics.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ The data flow works as follows:
 3. Log Analytics ingests the events into **service-specific tables**.
 4. You query, visualize, and alert on this data using **Kusto Query Language (KQL)**.

+:::image type="content" source="../media/6-understand-log-streaming.png" alt-text="Diagram explaining log streaming architecture." border="false" lightbox="../media/6-understand-log-streaming.png":::
+
 Platform administrators typically configure the diagnostic settings through the Azure portal. As a data engineer, you focus on using the logs for monitoring and troubleshooting.
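For orientation, querying the ingested tables from Python might look like the sketch below. It is illustrative only: it uses the `azure-monitor-query` and `azure-identity` packages, a placeholder workspace ID, and a `DatabricksJobs` table name that depends on which diagnostic categories your administrator enabled:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Table and column names are illustrative; check your workspace for the actual schema.
kql = """
DatabricksJobs
| where TimeGenerated > ago(1d)
| summarize count() by ActionName
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=kql,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```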

 > [!NOTE]
The remaining changed files are new screenshot and diagram images added under the module's media folder (binary content not shown).
