Commit 18676a1

Minor updates
1 parent aeaa63c commit 18676a1

11 files changed

Lines changed: 43 additions & 72 deletions

learn-pr/advocates/improve-reliability-monitoring/2-operational-awareness.yml

Lines changed: 5 additions & 5 deletions
@@ -19,17 +19,17 @@ quiz:
 
 - content: 'What is operational awareness?'
 choices:
-- content: 'An understanding of the systems we have in production and how they are functioning'
+- content: "An understanding of the systems we have in production and how they're functioning"
 isCorrect: true
 explanation: "Correct, we can't begin to work on reliability if we don\\'t know what is actually running in production."
-- content: 'a comprehensive accounting for how much we are paying for our infrastructure'
+- content: "A comprehensive accounting for how much we're paying for our infrastructure"
 isCorrect: false
 explanation: 'Cost management is definitely important, but first we need to know what is running in production.'
 - content: 'The process of training your staff to be able to operate your systems'
 isCorrect: false
-explanation: 'Your staff definitely needs to understand your systems, but awareness is not just about the training process.'
+explanation: "Your staff definitely needs to understand your systems, but awareness isn't just about the training process."
 
-- content: 'When we are gaining operational awareness for an application, which of these questions do we need to ask?'
+- content: "When we're gaining operational awareness for an application, which of these questions do we need to ask?"
 choices:
 - content: 'What are the component parts?'
 isCorrect: false
@@ -60,7 +60,7 @@ quiz:
 explanation: 'Dashboards can help display information that assists with gaining operational awareness.'
 - content: 'Azure Cosmos DB'
 isCorrect: true
-explanation: 'Correct, Cosmos DB is an excellent place to store data, but it is not directly useful for operational awareness.'
+explanation: "Correct, Cosmos DB is an excellent place to store data, but it isn't directly useful for operational awareness."
 - content: 'Azure Resource Graph Explorer'
 isCorrect: false
 explanation: 'Resource Graph Explorer provides a query environment that can show us exactly what resources we have in play.'

learn-pr/advocates/improve-reliability-monitoring/5-tools.yml

Lines changed: 1 addition & 1 deletion
@@ -27,4 +27,4 @@ quiz:
 explanation: 'Correct, Azure Monitor emphasizes metrics, logs, and distributed traces as core observability data types.'
 - content: 'Counters, gauges, and alerts'
 isCorrect: false
-explanation: 'Counters and gauges can produce metrics, and alerts act on data, but these are not the observability data types emphasized by Azure Monitor.'
+explanation: "Counters and gauges can produce metrics, and alerts act on data, but these aren't the observability data types emphasized by Azure Monitor."

learn-pr/advocates/improve-reliability-monitoring/7-sli-slo.yml

Lines changed: 0 additions & 30 deletions
@@ -13,33 +13,3 @@ metadata:
 durationInMinutes: 9
 content: |
 [!include[](includes/7-sli-slo.md)]
-quiz:
-title: Check your knowledge
-questions:
-
-- content: 'What does SLO stand for?'
-choices:
-- content: 'Service Level Observation'
-isCorrect: false
-explanation: 'SLO stands for "Service Level Objective".'
-- content: 'System Level Objective'
-isCorrect: false
-explanation: 'SLO stands for "Service Level Objective".'
-- content: 'Service Level Objective'
-isCorrect: true
-explanation: 'Correct, SLO stands for "Service Level Objective".'
-
-- content: 'Which of these could be a workable SLO for a web service?'
-choices:
-- content: '90% of HTTP requests succeeded in the last 30-day window.'
-isCorrect: false
-explanation: 'This SLO is missing where the data will be measured.'
-- content: '90% of HTTP requests as reported by the load balancer succeeded.'
-isCorrect: false
-explanation: 'This SLO is missing the time horizon for the objective.'
-- content: 'The webserver functioned well in the last 30-day window as reported by the load balancer'
-isCorrect: false
-explanation: 'This SLO is missing the actual service level indicator for the objective.'
-- content: '90% of HTTP requests as reported by the load balancer succeeded in the last 30-day window.'
-isCorrect: true
-explanation: 'Correct, this is a workable SLO with all of the necessary parts.'
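
Note on the removed quiz: its explanations name three parts of a workable SLO, namely the service level indicator with its target, where the data is measured, and the time horizon. The sketch below is not part of this commit; it is a minimal Python illustration of checking such an objective. `Slo`, `success_ratio`, `is_met`, and the request counts are hypothetical names and made-up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical container for the three parts the quiz explanations name."""
    target_ratio: float  # e.g. 0.90 for "90% of HTTP requests succeeded"
    source: str          # where the indicator is measured
    window_days: int     # time horizon for the objective


def success_ratio(succeeded: int, total: int) -> float:
    """Service level indicator: fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return succeeded / total


def is_met(slo: Slo, succeeded: int, total: int) -> bool:
    """True when the measured indicator meets or exceeds the target."""
    return success_ratio(succeeded, total) >= slo.target_ratio


slo = Slo(target_ratio=0.90, source="load balancer", window_days=30)
# Illustrative counts only, not real measurements.
print(is_met(slo, succeeded=9_120, total=10_000))
```

Keeping the source and window on the objective itself makes it harder to state an SLO that omits one of the parts the quiz options were missing.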

learn-pr/advocates/improve-reliability-monitoring/includes/2-operational-awareness.md

Lines changed: 4 additions & 4 deletions
@@ -36,7 +36,7 @@ One tool that can reduce that effort—plus give us information about the applic
 
 Here's an example:
 
-:::image type="content" source="../media/application-map.png" alt-text="Screenshot of the Application map panel in Azure portal displaying several components and the stats for traffic between them.":::
+[![Screenshot of the Application map panel in Azure portal displaying several components and the stats for traffic between them.](../media/application-map.png)](../media/application-map.png#lightbox)
 
 In the preceding picture, you can see not only the components of the application, but also the communication between those components. If you zoom into one of the connections between components, you can see the number of calls made between components and the average latency for those calls. You can also see a representation of the number of successful and the number of failed calls. If you select any of these map elements, Application Insights lets you drill into the information to see detailed statistics on performance and success/failure metrics for those calls. This can be a great way to get a good sense of the larger picture of the application's components and how they function as a baseline. As a reminder, be sure to explore your application map and all that Application Insights can offer *before* you have an outage.
 
@@ -46,11 +46,11 @@ Application Insights is a great way to gain some operational awareness for an ap
 
 Azure Resource Graph Explorer provides an interactive query environment right from the Azure portal for the data you need. It lets you run queries against near-current inventory data for the resources in your subscriptions. For example, if you want to see all of the VMs you're currently running, you could run the following query:
 
-:::image type="content" source="../media/resource-graph-explorer.png" alt-text="Resource graph panel in Azure portal with the query of where type == microsoft.compute/virtualmachines":::
+[![Resource graph panel in Azure portal with the query of where type == microsoft.compute/virtualmachines.](../media/resource-graph-explorer.png)](../media/resource-graph-explorer.png#lightbox)
 
 and you'd get back a complete detailed list of the VMs being used in our subscription:
 
-:::image type="content" source="../media/resource-graph-explorer-results.png" alt-text="Resource graph panel in the Azure portal with results of query showing table of results.":::
+[![Resource graph panel in the Azure portal with results of query showing table of results.](../media/resource-graph-explorer-results.png)](../media/resource-graph-explorer-results.png#lightbox)
 
 The query language used in this environment is based on Kusto Query Language (KQL). Azure Resource Graph supports a useful subset of KQL rather than every KQL feature. We'll be discussing KQL in more depth later in this module when we talk about Azure Monitor Log Analytics.
 
@@ -72,7 +72,7 @@ Instead, let's look at a powerful idea: **dashboards as code**. Azure portal das
 
 If you need to show a colleague the view you used during an outage, it's usually better to share a link to the dashboard or workbook with the relevant filters and time range than to treat exported JSON as a snapshot of the data.
 
-:::image type="content" source="../media/dashboard.png" alt-text="Screenshot of an Azure portal dashboard showing export and sharing options.":::
+[![Screenshot of an Azure portal dashboard showing export and sharing options.](../media/dashboard.png)](../media/dashboard.png#lightbox)
 
 ### Grafana
 

learn-pr/advocates/improve-reliability-monitoring/includes/3-expanding-understanding.md

Lines changed: 9 additions & 9 deletions
@@ -2,55 +2,55 @@ For us to be able to effectively set up monitoring to improve our reliability, w
 
 Let's look at some aspects of reliability now:
 
-:::image type="content" source="../media/diagram-empty.png" alt-text="Diagram with the word reliability in a circle in the middle connected to empty circles at the end of each spoke.":::
+[![Diagram with the word reliability in a circle in the middle connected to empty circles at the end of each spoke.](../media/diagram-empty.png)](../media/diagram-empty.png#lightbox)
 
 ## Availability
 
-:::image type="content" source="../media/diagram-availability.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word availability added to one circle.":::
+[![Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word availability added to one circle.](../media/diagram-availability.png)](../media/diagram-availability.png#lightbox)
 
 When people talk about reliability, they tend to start with availability. Is the system "up" or is it "down?" Can others reach your website or your service? Can they use the product when they expect to? It's important from the perspective of both external customers and internal users who depend on your service. Availability is probably the aspect of reliability with which you'll spend the most time working. It's a good starting point for discussing reliability, but it’s only one aspect.
 
 ## Latency
 
-:::image type="content" source="../media/diagram-latency.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word latency added to previous diagram in a different circle.":::
+[![Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word latency added to previous diagram in a different circle.](../media/diagram-latency.png)](../media/diagram-latency.png#lightbox)
 
 Latency refers to the amount of delay between a request and a response. You might have heard the catchphrase "slow is the new down." People demand fast performance, and they lose patience with a site or service that leaves them waiting. We have good research that shows that if a website doesn't meet their expectations for response time, customers are likely to go to a competitor.
 
 ## Throughput
 
-:::image type="content" source="../media/diagram-throughput.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word throughput added to previous diagram in a different circle.":::
+[![Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word throughput added to previous diagram in a different circle.](../media/diagram-throughput.png)](../media/diagram-throughput.png#lightbox)
 
 Throughput is a measure of the rate at which something is processed, or the number of transactions that a website, application, or service successfully handles over a specified period of time. This is particularly important when running pipelines or batch-processing systems. If a pipeline or a batch-processing system isn't processing things fast enough, that's not meeting our expectations and it isn't considered reliable.
 
 ## Coverage
 
-:::image type="content" source="../media/diagram-coverage.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word coverage added to previous diagram in a different circle.":::
+[![Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word coverage added to previous diagram in a different circle.](../media/diagram-coverage.png)](../media/diagram-coverage.png#lightbox)
 
 Coverage refers to how much of the data that you expected to process was actually processed. Again, we come back to the idea of measuring how well we're meeting expectations as part of determining if something is reliable.
 
 ## Correctness
 
-:::image type="content" source="../media/diagram-correctness.png" alt-text="Hub and spoke diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word correctness added to previous diagram in a different circle.":::
+[![Hub and spoke diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, with the word correctness added to previous diagram in a different circle.](../media/diagram-correctness.png)](../media/diagram-correctness.png#lightbox)
 
 Correctness is an aspect of reliability that's often overlooked. Did the process that you ran on the data yield the correct or expected result? This is an important factor to include in monitoring for reliability. No matter how fast or "always available" your service or site is, if it returns incorrect results, it’s not reliable in the eyes of your customers. Monitoring for correctness of results is an important part of monitoring for reliability.
 
 ## Fidelity
 
-:::image type="content" source="../media/diagram-fidelity.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word fidelity added to previous diagram in a different circle.":::
+[![Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word fidelity added to previous diagram in a different circle.](../media/diagram-fidelity.png)](../media/diagram-fidelity.png#lightbox)
 
 Fidelity in this context is best understood through an example. Let's say you visit the home page of a video-streaming site. That page is made up of separate sections: new releases, personalized recommendations, top 10 movies watched, and so on. Each of those sections is likely generated by a separate back-end service. If one of those services goes down—for example, the personalization engine—visitors to the site don't get a "Sorry this site is down" message or a blank page. Instead, they see a home page with that section either removed or replaced with static content. In technical terms, we'd say they received a "degraded" experience instead of the complete intended page.
 
 If we were to measure fidelity, we'd be measuring how often the user of a service received a "degraded" experience versus the full experience (complete fidelity). This measurement is useful for any fault-tolerant service that has the ability to continue running in a degraded mode when something goes wrong.
 
 ## Freshness
 
-:::image type="content" source="../media/diagram-freshness.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word freshness added to previous diagram in a different circle.":::
+[![Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word freshness added to previous diagram in a different circle.](../media/diagram-freshness.png)](../media/diagram-freshness.png#lightbox)
 
 Freshness refers to how up to date the information is in situations where timeliness matters to the customer (for example, services that provide sports scores or election results). Those services are considered reliable if the data they provide is kept current.
 
 ## Durability
 
-:::image type="content" source="../media/diagram-whole.png" alt-text="Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word durability added to previous diagram in a different circle filling in the entire diagram.":::
+[![Diagram with the word reliability in a circle in the middle connected to circles at the end of each spoke, the word durability added to previous diagram in a different circle filling in the entire diagram.](../media/diagram-whole.png)](../media/diagram-whole.png#lightbox)
 
 Durability is another slightly more niche aspect of reliability. If you're running a service that provides storage, you know just how important it is that data a customer writes to your service can be read later. This is a durability expectation.
 
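
Note on the Fidelity section in this file: it describes measuring how often users received a "degraded" experience versus the full one. The sketch below is not part of this commit; it is a minimal Python illustration that assumes each response can be counted as either full or degraded. The function name `fidelity` and the counts are hypothetical.

```python
def fidelity(full_responses: int, degraded_responses: int) -> float:
    """Fraction of responses served with the full (non-degraded) experience."""
    if full_responses < 0 or degraded_responses < 0:
        raise ValueError("counts must be non-negative")
    total = full_responses + degraded_responses
    if total == 0:
        raise ValueError("no responses recorded")
    return full_responses / total


# Illustrative counts only: 9,700 full pages and 300 degraded pages.
print(f"{fidelity(9_700, 300):.1%}")
```

How a response is classified as degraded (for example, a home page rendered without its personalization section) is left to the service, and that classification is the hard part in practice.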
