learn-pr/advocates/improve-reliability-monitoring/5-tools.yml
Lines changed: 1 addition & 1 deletion
@@ -27,4 +27,4 @@ quiz:
 explanation: 'Correct, Azure Monitor emphasizes metrics, logs, and distributed traces as core observability data types.'
 - content: 'Counters, gauges, and alerts'
 isCorrect: false
-explanation: 'Counters and gauges can produce metrics, and alerts act on data, but these are not the observability data types emphasized by Azure Monitor.'
+explanation: "Counters and gauges can produce metrics, and alerts act on data, but these aren't the observability data types emphasized by Azure Monitor."
learn-pr/advocates/improve-reliability-monitoring/includes/2-operational-awareness.md
Lines changed: 4 additions & 4 deletions
@@ -36,7 +36,7 @@ One tool that can reduce that effort—plus give us information about the applic
 
 Here's an example:
 
-:::image type="content" source="../media/application-map.png" alt-text="Screenshot of the Application map panel in Azure portal displaying several components and the stats for traffic between them.":::
+[](../media/application-map.png#lightbox)
 
 In the preceding picture, you can see not only the components of the application, but also the communication between those components. If you zoom into one of the connections between components, you can see the number of calls made between components and the average latency for those calls. You can also see a representation of the number of successful and the number of failed calls. If you select any of these map elements, Application Insights lets you drill into the information to see detailed statistics on performance and success/failure metrics for those calls. This can be a great way to get a good sense of the larger picture of the application's components and how they function as a baseline. As a reminder, be sure to explore your application map and all that Application Insights can offer *before* you have an outage.
 
@@ -46,11 +46,11 @@ Application Insights is a great way to gain some operational awareness for an ap
 
 Azure Resource Graph Explorer provides an interactive query environment right from the Azure portal for the data you need. It lets you run queries against near-current inventory data for the resources in your subscriptions. For example, if you want to see all of the VMs you're currently running, you could run the following query:
 
-:::image type="content" source="../media/resource-graph-explorer.png" alt-text="Resource graph panel in Azure portal with the query of where type == microsoft.compute/virtualmachines":::
+[](../media/resource-graph-explorer.png#lightbox)
 
 and you'd get back a complete detailed list of the VMs being used in our subscription:
 
-:::image type="content" source="../media/resource-graph-explorer-results.png" alt-text="Resource graph panel in the Azure portal with results of query showing table of results.":::
+[](../media/resource-graph-explorer-results.png#lightbox)
 
 The query language used in this environment is based on Kusto Query Language (KQL). Azure Resource Graph supports a useful subset of KQL rather than every KQL feature. We'll be discussing KQL in more depth later in this module when we talk about Azure Monitor Log Analytics.
 
@@ -72,7 +72,7 @@ Instead, let's look at a powerful idea: **dashboards as code**. Azure portal das
 
 If you need to show a colleague the view you used during an outage, it's usually better to share a link to the dashboard or workbook with the relevant filters and time range than to treat exported JSON as a snapshot of the data.
 
-:::image type="content" source="../media/dashboard.png" alt-text="Screenshot of an Azure portal dashboard showing export and sharing options.":::
+[](../media/dashboard.png#lightbox)
learn-pr/advocates/improve-reliability-monitoring/includes/3-expanding-understanding.md
Lines changed: 9 additions & 9 deletions
@@ -2,55 +2,55 @@ For us to be able to effectively set up monitoring to improve our reliability, w
 
 Let's look at some aspects of reliability now:
 
-:::image type="content" source="../media/diagram-empty.png" alt-text="Diagram with the word reliability in a circle in the middle connected to empty circles at the end of each spoke.":::
+[](../media/diagram-empty.png#lightbox)
 
 ## Availability
 
-:::image type="content" source="../media/diagram-availability.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word availability added to one circle.":::
+[](../media/diagram-availability.png#lightbox)
 
 When people talk about reliability, they tend to start with availability. Is the system "up" or is it "down?" Can others reach your website or your service? Can they use the product when they expect to? It's important from the perspective of both external customers and internal users who depend on your service. Availability is probably the aspect of reliability with which you'll spend the most time working. It's a good starting point for discussing reliability, but it’s only one aspect.
 
 ## Latency
 
-:::image type="content" source="../media/diagram-latency.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word latency added to previous diagram in a different circle.":::
+[](../media/diagram-latency.png#lightbox)
 
 Latency refers to the amount of delay between a request and a response. You might have heard the catchphrase "slow is the new down." People demand fast performance, and they lose patience with a site or service that leaves them waiting. We have good research that shows that if a website doesn't meet their expectations for response time, customers are likely to go to a competitor.
 
 ## Throughput
 
-:::image type="content" source="../media/diagram-throughput.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word throughput added to previous diagram in a different circle.":::
+[](../media/diagram-throughput.png#lightbox)
 
 Throughput is a measure of the rate at which something is processed, or the number of transactions that a website, application, or service successfully handles over a specified period of time. This is particularly important when running pipelines or batch-processing systems. If a pipeline or a batch-processing system isn't processing things fast enough, that's not meeting our expectations and it isn't considered reliable.
 
 ## Coverage
 
-:::image type="content" source="../media/diagram-coverage.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word coverage added to previous diagram in a different circle.":::
+[](../media/diagram-coverage.png#lightbox)
 
 Coverage refers to how much of the data that you expected to process was actually processed. Again, we come back to the idea of measuring how well we're meeting expectations as part of determining if something is reliable.
 
 ## Correctness
 
-:::image type="content" source="../media/diagram-correctness.png" alt-text="Hub and spoke diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word correctness added to previous diagram in a different circle.":::
+[](../media/diagram-correctness.png#lightbox)
 
 Correctness is an aspect of reliability that's often overlooked. Did the process that you ran on the data yield the correct or expected result? This is an important factor to include in monitoring for reliability. No matter how fast or "always available" your service or site is, if it returns incorrect results, it’s not reliable in the eyes of your customers. Monitoring for correctness of results is an important part of monitoring for reliability.
 
 ## Fidelity
 
-:::image type="content" source="../media/diagram-fidelity.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word fidelity added to previous diagram in a different circle.":::
+[](../media/diagram-fidelity.png#lightbox)
 
 Fidelity in this context is best understood through an example. Let's say you visit the home page of a video-streaming site. That page is made up of separate sections: new releases, personalized recommendations, top 10 movies watched, and so on. Each of those sections is likely generated by a separate back-end service. If one of those services goes down—for example, the personalization engine—visitors to the site don't get a "Sorry this site is down" message or a blank page. Instead, they see a home page with that section either removed or replaced with static content. In technical terms, we'd say they received a "degraded" experience instead of the complete intended page.
 
 If we were to measure fidelity, we'd be measuring how often the user of a service received a "degraded" experience versus the full experience (complete fidelity). This measurement is useful for any fault-tolerant service that has the ability to continue running in a degraded mode when something goes wrong.
 
 ## Freshness
 
-:::image type="content" source="../media/diagram-freshness.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word freshness added to previous diagram in a different circle.":::
+[](../media/diagram-freshness.png#lightbox)
 
 Freshness refers to how up to date the information is in situations where timeliness matters to the customer (for example, services that provide sports scores or election results). Those services are considered reliable if the data they provide is kept current.
 
 ## Durability
 
-:::image type="content" source="../media/diagram-whole.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word durability added to previous diagram in a different circle filling in the entire diagram.":::
+[](../media/diagram-whole.png#lightbox)
 
 Durability is another slightly more niche aspect of reliability. If you're running a service that provides storage, you know just how important it is that data a customer writes to your service can be read later. This is a durability expectation.
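An illustrative aside that is not part of this change: several of the reliability aspects described in `3-expanding-understanding.md` reduce to simple ratios or rates once requests are counted. The following is a minimal Python sketch of how availability, fidelity, throughput, and a latency percentile might be derived. The `Request` type, its field names, the sample data, and the nearest-rank percentile method are all assumptions made for illustration; none of them come from the module or from any Azure Monitor schema.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    """Hypothetical record of one request (illustrative, not an Azure schema)."""
    ok: bool           # the request succeeded
    degraded: bool     # served, but with a section removed or replaced
    latency_ms: float  # delay between request and response


def availability(requests):
    """Fraction of all requests that succeeded (assumes a non-empty list)."""
    return sum(r.ok for r in requests) / len(requests)


def fidelity(requests):
    """Fraction of successful requests that got the full, non-degraded experience."""
    served = [r for r in requests if r.ok]
    return sum(not r.degraded for r in served) / len(served)


def throughput(requests, window_seconds):
    """Successful requests handled per second over the observation window."""
    return sum(r.ok for r in requests) / window_seconds


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency across successful requests."""
    latencies = sorted(r.latency_ms for r in requests if r.ok)
    rank = max(1, math.ceil(pct / 100 * len(latencies)))
    return latencies[rank - 1]


# Made-up sample data: three successes (one degraded) and one failure.
sample = [
    Request(ok=True, degraded=False, latency_ms=120.0),
    Request(ok=True, degraded=True, latency_ms=300.0),
    Request(ok=True, degraded=False, latency_ms=90.0),
    Request(ok=False, degraded=False, latency_ms=0.0),
]
```

With this sample, `availability(sample)` is 0.75 (three of four requests succeeded) and `fidelity(sample)` is two thirds (one of the three successful requests was degraded). Coverage could be expressed the same way, as records processed divided by records expected.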