|
| 1 | +--- |
| 2 | +title: "Tutorial: Automate Incident Response in Azure SRE Agent" |
| 3 | +description: Connect Azure Monitor, create response plans, and let your agent investigate and resolve incidents autonomously from detection to fix. |
| 4 | +ms.topic: tutorial |
| 5 | +ms.date: 03/16/2026 |
| 6 | +author: craigshoemaker |
| 7 | +ms.author: cshoe |
| 8 | +ms.service: azure-sre-agent |
| 9 | +ms.ai-usage: ai-assisted |
| 10 | +#customer intent: As a site reliability engineer, I want to connect my incident platform and create response plans so that my agent automatically investigates and resolves incidents end-to-end. |
| 11 | +--- |
| 12 | + |
| 13 | +# Tutorial: Automate incident response in Azure SRE Agent |
| 14 | + |
| 15 | +**Estimated time**: 10 minutes |
| 16 | + |
| 17 | +Connect your incident platform so your agent handles alerts automatically, from detection through diagnosis to fix, without you typing a single message. |
| 18 | + |
| 19 | +## What you accomplish |
| 20 | + |
| 21 | +By the end of this step, your agent: |
| 22 | + |
| 23 | +- Connects to Azure Monitor as your incident platform |
| 24 | +- Receives incidents filtered by severity through a response plan |
| 25 | +- Investigates matching alerts end-to-end, including code fixes and pull requests |
| 26 | + |
| 27 | +## Prerequisites |
| 28 | + |
| 29 | +| Requirement | Details | |
| 30 | +|---|---| |
| 31 | +| **Completed Steps 1–3** | [Create agent](create-agent.md), [Add knowledge](first-value.md), [Connect source code](connect-source-code.md). | |
| 32 | +| **Azure resources connected** | At least one Azure subscription with resources the agent can monitor. | |
| 33 | + |
| 34 | +## Connect Azure Monitor |
| 35 | + |
| 36 | +Link Azure Monitor as your incident platform so the agent automatically receives alerts. |
| 37 | + |
| 38 | +1. In the left sidebar, go to **Builder** > **Incident platform**. |
| 39 | +1. Select the **Incident platform** dropdown and choose **Azure Monitor**. |
| 40 | +1. The **Quickstart response plan** toggle is on by default. Turn it off, because you create your own response plan in the next section. |
| 41 | +1. Select **Save**. |
| 42 | + |
| 43 | +Wait for the connection to complete. The status changes to **"Azure Monitor connected. Your next step is to set up incident response plans."** |
| 44 | + |
| 45 | +:::image type="content" source="media/automate-incidents/response-plan-saved.png" alt-text="Screenshot of Azure Monitor connected with a green checkmark status." lightbox="media/automate-incidents/response-plan-saved.png"::: |
| 46 | + |
| 47 | +**Checkpoint:** The incident platform page shows a green checkmark with **Azure Monitor connected**. |
| 48 | + |
| 49 | +> [!TIP] |
| 50 | +> You can also connect [PagerDuty](pagerduty-incidents.md) or [ServiceNow](servicenow-incidents.md) from the same dropdown. |
| 51 | + |
| 52 | +## Create an incident response plan |
| 53 | + |
| 54 | +An incident response plan tells the agent which incidents to pick up and how much autonomy it has. The following steps are for Azure Monitor. PagerDuty and ServiceNow response plans use different filter fields based on their own incident metadata, such as priority, category, and assignment group. |
| 55 | + |
| 56 | +1. Go to **Builder** > **Incident response plans** in the left sidebar. |
| 57 | + |
| 58 | +1. Select **New incident response plan**. |
| 59 | + |
| 60 | +1. **Step 1: Set up incident filters:** |
| 61 | + |
| 62 | + - Enter a name, such as `all-incidents`. |
| 63 | + - Select severity levels. Choose **All severity** to catch everything during setup. |
| 64 | + - Optionally, add a title filter to narrow scope. |
| 65 | + |
| 66 | +1. Select **Next**. |
| 67 | + |
| 68 | + :::image type="content" source="media/automate-incidents/response-plan-step-1.png" alt-text="Screenshot of the response plan creation form with name and severity fields." lightbox="media/automate-incidents/response-plan-step-1.png"::: |
| 69 | + |
| 70 | +1. **Step 2: Preview filter results:** Review matching past incidents from your incident platform (empty if no incidents exist yet). Select **Next**. |
| 71 | + |
| 72 | +1. **Step 3: Save response plan:** |
| 73 | + - Choose how much control the agent has: |
| 74 | + - **Autonomous (Default)**: The agent investigates and acts independently, including code fixes and container restarts. |
| 75 | + - **Review**: The agent diagnoses but waits for your approval before acting. |
| 76 | + - Select **Save**. |
| 77 | + |
| 78 | +:::image type="content" source="media/automate-incidents/response-plan-step-3-save.png" alt-text="Screenshot of the response plan autonomy options showing Review and Autonomous modes." lightbox="media/automate-incidents/response-plan-step-3-save.png"::: |
| 79 | + |
| 80 | +**Checkpoint:** Your response plan appears in the list with status **On** and the autonomy level you selected. |
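
The way the Step 1 filters narrow incoming incidents can be sketched in Python. This is an illustration only, not the agent's implementation: the `Incident` and `ResponsePlan` types, their field names, and the case-insensitive substring match on the title are all assumptions. The only detail taken from Azure Monitor is that alert severities run from Sev0 to Sev4.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    title: str
    severity: int  # Azure Monitor alert severity, 0 (Sev0) through 4 (Sev4)


@dataclass(frozen=True)
class ResponsePlan:
    """Hypothetical model of the filters set in Step 1."""
    name: str
    severities: FrozenSet[int] = frozenset()  # empty set stands for "All severity"
    title_contains: Optional[str] = None      # optional title filter

    def matches(self, incident: Incident) -> bool:
        # Severity filter: an empty set accepts every severity.
        if self.severities and incident.severity not in self.severities:
            return False
        # Title filter: assumed here to be a case-insensitive substring match.
        if self.title_contains is None:
            return True
        return self.title_contains.lower() in incident.title.lower()


catch_all = ResponsePlan(name="all-incidents")
narrow = ResponsePlan(name="critical-http", severities=frozenset({0, 1}), title_contains="http")

alert = Incident(title="HTTP 5xx on container app", severity=3)
print(catch_all.matches(alert))  # the catch-all plan accepts any incident
print(narrow.matches(alert))     # Sev3 is outside the narrow plan's severities
```

A catch-all plan like `all-incidents` is useful during setup; a narrower plan reduces how many incidents the agent picks up.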
| 81 | + |
| 82 | +## What happens when an alert fires |
| 83 | + |
| 84 | +When Azure Monitor fires an alert that matches your response plan, the agent investigates automatically. What the agent does depends on the context you gave it. Runbooks, code repositories, Azure resources, and prior investigations all shape the depth and actions of the investigation. |
| 85 | + |
| 86 | +### Example: HTTP 500 errors on a container app |
| 87 | + |
| 88 | +In this example, the agent has a runbook for handling HTTP 500 errors, a connected code repository, and Azure resource access. |
| 89 | + |
| 90 | +:::image type="content" source="media/automate-incidents/incident-completed.png" alt-text="Screenshot of the incidents page showing one completed Sev3 alert with green Completed status." lightbox="media/automate-incidents/incident-completed.png"::: |
| 91 | + |
| 92 | +**The agent builds a plan from your runbook.** Rather than following a generic troubleshooting sequence, the agent reads the HTTP 500 runbook you uploaded during onboarding and follows your team's procedures. The agent checks upstream dependencies first, then the connection pool, then recent deployments. |
| 93 | + |
| 94 | +:::image type="content" source="media/automate-incidents/incident-full-page-top.png" alt-text="Screenshot of the agent showing investigation plan for HTTP 5xx alert with six numbered steps." lightbox="media/automate-incidents/incident-full-page-top.png"::: |
| 95 | + |
| 96 | +**The agent recalls prior knowledge.** If the agent investigated a similar issue before, it recognizes the pattern and skips discovery. It combines your runbook procedures with what it learned from previous investigations. |
| 97 | + |
| 98 | +**The agent takes action.** In **Review** mode, the agent asks for your approval before each action. In **Autonomous** mode, it acts independently. In this example, the agent: |
| 99 | + |
| 100 | +- Reads the source code and identifies the root cause |
| 101 | +- Edits the code to fix the bug |
| 102 | +- Restarts the container to mitigate the alert |
| 103 | +- Commits the fix and pushes it to a new branch |
| 104 | +- Creates a GitHub issue for tracking |
| 105 | +- Verifies the service is healthy after the fix |
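
The difference between the two autonomy levels can be sketched as an approval gate in front of each action. The following Python is a conceptual sketch only: the `Autonomy` enum, the `run_actions` function, and the `approve` callback are hypothetical names, not part of the product.

```python
from enum import Enum
from typing import Callable, List


class Autonomy(Enum):
    AUTONOMOUS = "autonomous"  # agent acts independently
    REVIEW = "review"          # agent waits for approval before each action


def run_actions(
    actions: List[str],
    mode: Autonomy,
    approve: Callable[[str], bool],
) -> List[str]:
    """Return the actions that would run under the given autonomy level."""
    executed = []
    for action in actions:
        # In Review mode, every action needs explicit approval first.
        if mode is Autonomy.REVIEW and not approve(action):
            continue
        executed.append(action)
    return executed


planned = ["restart container", "push fix to new branch", "create GitHub issue"]

# Autonomous mode never consults the approver.
auto_run = run_actions(planned, Autonomy.AUTONOMOUS, approve=lambda a: False)

# Review mode runs only what the (hypothetical) approver accepts.
reviewed_run = run_actions(planned, Autonomy.REVIEW, approve=lambda a: a.startswith("restart"))

print(auto_run)
print(reviewed_run)
```

The sketch simply skips an unapproved action and moves on. How the agent behaves when an action is declined is not described here, so treat that choice as an assumption.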
| 106 | + |
| 107 | +**The agent delivers a remediation summary.** The agent produces a structured report with everything the team needs to follow up: |
| 108 | + |
| 109 | +:::image type="content" source="media/automate-incidents/incident-full-page-code-fix.png" alt-text="Screenshot of the remediation summary table showing alert, mitigation, permanent fix, root cause, status, and tracking." lightbox="media/automate-incidents/incident-full-page-code-fix.png"::: |
| 110 | + |
| 111 | +| Item | What the agent reports | |
| 112 | +|---|---| |
| 113 | +| **Alert** | Which alert fired, severity, affected resource | |
| 114 | +| **Immediate mitigation** | What was done to restore service right now | |
| 115 | +| **Permanent fix** | Code changes made and branch pushed | |
| 116 | +| **Root cause** | Specific code bug or configuration issue with file references | |
| 117 | +| **Status** | Current health of the affected resource | |
| 118 | +| **Tracking** | GitHub issue number | |
| 119 | +| **Next steps** | Merge pull request and redeploy | |
| 120 | + |
| 121 | +> [!NOTE] |
| 122 | +> Your results vary based on the context your agent has. An agent with more runbooks, connected repositories, and prior investigations produces deeper, more targeted responses. |
| 123 | + |
| 124 | +## Next step |
| 125 | + |
| 126 | +> [!div class="nextstepaction"] |
| 127 | +> [Step 5: Automate actions](automate-actions.md) |
| 128 | + |
| 129 | +## Related content |
| 130 | + |
| 131 | +- [Incident response plans](incident-response-plans.md) |
| 132 | +- [PagerDuty incidents](pagerduty-incidents.md) |
| 133 | +- [ServiceNow incidents](servicenow-incidents.md) |
| 134 | +- [Memory and knowledge](memory.md) |
| 135 | +- [Monitor agent usage](monitor-agent-usage.md) |