Commit 782112c

Merge pull request #53377 from wwlpublish/LP156462-4
Created GitHub repo
2 parents 7833a59 + 644e405 commit 782112c

29 files changed

Lines changed: 393 additions & 0 deletions
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.implement-resilient-ai-ready-infrastructure.introduction
title: "Introduction"
metadata:
  title: "Introduction"
  description: "Introduction."
  ms.date: 02/03/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 5
content: |
  [!include[](includes/1-introduction.md)]
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.implement-resilient-ai-ready-infrastructure.configure-microsoft-foundry-hubs-region
title: "Configure Microsoft Foundry hubs for multi-region resilience"
metadata:
  title: "Configure Microsoft Foundry hubs for multi-region resilience"
  description: "Configure Microsoft Foundry hubs for multi-region resilience."
  ms.date: 02/03/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 12
content: |
  [!include[](includes/2-configure-microsoft-foundry-hubs-region.md)]
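This unit's include file isn't part of the diff shown, but the pattern named in its title — routing users to their nearest regional hub — is typically wired up with a latency-based traffic routing layer. A minimal Azure CLI sketch of Performance routing across two hypothetical hub endpoints; every resource group, profile name, DNS label, and FQDN below is a placeholder, not something taken from the module:

```shell
# Placeholder names throughout; adjust to your environment.
RG=rg-ai-resilience

# Performance routing sends each client to the endpoint with the
# lowest measured network latency.
az network traffic-manager profile create \
  --resource-group $RG --name tm-foundry-hubs \
  --routing-method Performance --unique-dns-name contoso-foundry-hubs \
  --ttl 30 --protocol HTTPS --port 443 --path "/health"

# Register each regional hub as an external endpoint by FQDN.
az network traffic-manager endpoint create \
  --resource-group $RG --profile-name tm-foundry-hubs \
  --name eastus-hub --type externalEndpoints \
  --target hub-eastus.example.com --endpoint-location eastus

az network traffic-manager endpoint create \
  --resource-group $RG --profile-name tm-foundry-hubs \
  --name westeurope-hub --type externalEndpoints \
  --target hub-westeurope.example.com --endpoint-location westeurope
```

With both endpoints healthy, clients in Europe resolve to the West Europe endpoint and clients in North America to East US; if one region's health probe fails, Traffic Manager answers DNS queries with the surviving endpoint.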
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.implement-resilient-ai-ready-infrastructure.implement-geo-redundant-storage-data
title: "Implement geo-redundant storage for AI data protection"
metadata:
  title: "Implement geo-redundant storage for AI data protection"
  description: "Implement geo-redundant storage for AI data protection."
  ms.date: 02/03/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 13
content: |
  [!include[](includes/3-implement-geo-redundant-storage-data.md)]
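The protection pattern this unit covers — geo-redundant replication plus soft delete plus a delete lock — can be sketched in three Azure CLI steps. The resource group and account names are placeholders; retention values are illustrative, not prescribed by the module:

```shell
RG=rg-ai-resilience            # placeholder resource group
SA=staidatasets01              # placeholder storage account name

# 1. GRS replicates data asynchronously to the paired secondary region.
az storage account create \
  --resource-group $RG --name $SA \
  --location eastus --sku Standard_GRS --kind StorageV2

# 2. Soft delete retains deleted blobs for recovery (30 days here).
az storage account blob-service-properties update \
  --resource-group $RG --account-name $SA \
  --enable-delete-retention true --delete-retention-days 30

# 3. A CanNotDelete lock blocks accidental deletion of the account itself.
az lock create \
  --resource-group $RG --name protect-training-data \
  --lock-type CanNotDelete \
  --resource $SA --resource-type Microsoft.Storage/storageAccounts
```

Note the three controls cover different failure modes: GRS handles regional outage, soft delete handles blob-level mistakes, and the lock handles infrastructure-level mistakes — none substitutes for the others.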
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.implement-resilient-ai-ready-infrastructure.deploy-azure-container-registry-replication
title: "Deploy Azure Container Registry with geo-replication"
metadata:
  title: "Deploy Azure Container Registry with geo-replication"
  description: "Deploy Azure Container Registry with geo-replication."
  ms.date: 02/03/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 11
content: |
  [!include[](includes/4-deploy-azure-container-registry-replication.md)]
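Geo-replication, the subject of this unit, requires the Premium tier of Azure Container Registry; replicas are then added per region and images pushed to the home region replicate automatically. A minimal sketch — the registry name is a placeholder:

```shell
RG=rg-ai-resilience            # placeholder resource group
ACR=contosoaimodels            # placeholder registry name (must be globally unique)

# Geo-replication is available only on the Premium SKU.
az acr create \
  --resource-group $RG --name $ACR \
  --sku Premium --location eastus

# Add a replica; images pushed to the home region replicate to it
# asynchronously.
az acr replication create --registry $ACR --location westus

# Inspect replica regions and their status.
az acr replication list --registry $ACR --output table
```

Because replication is asynchronous, a freshly pushed tag can briefly exist in the home region but not yet in a replica — the timing issue the knowledge check in this module explores.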
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.implement-resilient-ai-ready-infrastructure.exercise-configure-resilient-infrastructure
title: "Configure resilient infrastructure"
metadata:
  title: "Configure resilient infrastructure"
  description: "Configure resilient infrastructure."
  ms.date: 02/03/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 45
content: |
  [!include[](includes/5-exercise-configure-resilient-infrastructure.md)]
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
### YamlMime:ModuleUnit
uid: learn.wwl.implement-resilient-ai-ready-infrastructure.knowledge-check
title: "Module assessment"
metadata:
  title: "Knowledge check"
  description: "Your understanding of resilient AI infrastructure design helps you make critical architectural decisions that balance availability, cost, and operational complexity. Test your ability to apply these concepts by analyzing scenarios where you must select appropriate deployment patterns, storage configurations, and container registry strategies based on specific business requirements."
  ms.date: 02/03/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
  module_assessment: false
durationInMinutes: 3
content: "Choose the best response for each of the following questions."
quiz:
  questions:
  - content: "Contoso's fraud detection AI system must maintain high availability to meet regulatory requirements for real-time transaction processing across North America and Europe. The system serves 50,000 inference requests per second during peak hours with model updates deployed weekly. Network latency must stay under 50 milliseconds for users in both regions. Which Microsoft Foundry hub deployment pattern meets these requirements most cost-effectively?"
    choices:
    - content: "Deploy a single hub in East US with compute clusters scaled to handle global traffic, using Azure Front Door to route requests efficiently across continents"
      isCorrect: false
      explanation: "Active-active deployment with hubs in East US and West Europe meets the high availability target by eliminating single points of regional failure and provides the required sub-50ms latency by routing users to their nearest hub. The weekly model update frequency justifies the doubled infrastructure cost because it minimizes deployment risks through gradual regional rollout. A single hub in East US fails the latency requirement because cross-Atlantic network requests from Europe typically exceed 100ms, and Azure Front Door can't overcome physical distance limitations for compute-intensive inference operations. Active-passive deployment achieves lower cost but can't guarantee high availability because failover provisioning of compute resources in the secondary region takes 15-30 minutes, creating extended downtime during primary region failures that violates the availability SLA and impacts real-time fraud detection capabilities."
    - content: "Deploy active-active hubs in East US and West Europe with full compute capacity in both regions, configuring Azure Traffic Manager to route requests to the nearest hub based on user location"
      isCorrect: true
      explanation: "Active-active deployment with hubs in East US and West Europe meets the high availability target by eliminating single points of regional failure and provides the required sub-50ms latency by routing users to their nearest hub. The weekly model update frequency justifies the doubled infrastructure cost because it minimizes deployment risks through gradual regional rollout. A single hub in East US fails the latency requirement because cross-Atlantic network requests from Europe typically exceed 100ms, and Azure Front Door can't overcome physical distance limitations for compute-intensive inference operations. Active-passive deployment achieves lower cost but can't guarantee high availability because failover provisioning of compute resources in the secondary region takes 15-30 minutes, creating extended downtime during primary region failures that violates the availability SLA and impacts real-time fraud detection capabilities."
    - content: "Deploy a primary hub in East US with an active-passive secondary hub in West Europe, keeping the secondary hub without compute resources until failover occurs to reduce costs"
      isCorrect: false
      explanation: "Active-active deployment with hubs in East US and West Europe meets the high availability target by eliminating single points of regional failure and provides the required sub-50ms latency by routing users to their nearest hub. The weekly model update frequency justifies the doubled infrastructure cost because it minimizes deployment risks through gradual regional rollout. A single hub in East US fails the latency requirement because cross-Atlantic network requests from Europe typically exceed 100ms, and Azure Front Door can't overcome physical distance limitations for compute-intensive inference operations. Active-passive deployment achieves lower cost but can't guarantee high availability because failover provisioning of compute resources in the secondary region takes 15-30 minutes, creating extended downtime during primary region failures that violates the availability SLA and impacts real-time fraud detection capabilities."
  - content: "Your AI team stores 50 TB of training datasets in Azure Blob Storage that took eight months to collect and label from customer transactions. The datasets undergo incremental updates monthly with approximately 2 TB of new data. Compliance regulations require the ability to restore accidentally deleted data within 48 hours. The monthly training jobs can tolerate up to 30 minutes of lost progress if the storage region fails. Which storage configuration provides appropriate protection at the lowest cost?"
    choices:
    - content: "LRS with 365-day soft delete retention and daily backups to a separate storage account in a different region using AzCopy scheduled tasks"
      isCorrect: false
      explanation: "GRS with 30-day soft delete and resource lock meets all requirements cost-effectively: geo-redundancy protects against complete regional failure with typical replication lag under 15 minutes (within the 30-minute RPO tolerance), 30-day soft delete exceeds the 48-hour restore requirement for accidental deletions, and the resource lock prevents infrastructure-level deletion mistakes. LRS with manual backups creates operational complexity and potential gaps if backup jobs fail, provides slower restore times (hours to download and restore 50 TB), and ultimately costs more when accounting for backup storage, egress charges, and engineering time to maintain automation."
    - content: "GRS with 30-day soft delete retention and a CanNotDelete resource lock, accepting the 15-minute asynchronous replication lag as within acceptable data loss tolerance"
      isCorrect: true
      explanation: "GRS with 30-day soft delete and resource lock meets all requirements cost-effectively: geo-redundancy protects against complete regional failure with typical replication lag under 15 minutes (within the 30-minute RPO tolerance), 30-day soft delete exceeds the 48-hour restore requirement for accidental deletions, and the resource lock prevents infrastructure-level deletion mistakes. LRS with manual backups creates operational complexity and potential gaps if backup jobs fail, provides slower restore times (hours to download and restore 50 TB), and ultimately costs more when accounting for backup storage, egress charges, and engineering time to maintain automation."
    - content: "GZRS with 90-day soft delete retention to protect against both zone-level and region-level failures, ensuring maximum durability for irreplaceable training datasets"
      isCorrect: false
      explanation: "GRS with 30-day soft delete and resource lock meets all requirements cost-effectively: geo-redundancy protects against complete regional failure with typical replication lag under 15 minutes (within the 30-minute RPO tolerance), 30-day soft delete exceeds the 48-hour restore requirement for accidental deletions, and the resource lock prevents infrastructure-level deletion mistakes. LRS with manual backups creates operational complexity and potential gaps if backup jobs fail, provides slower restore times (hours to download and restore 50 TB), and ultimately costs more when accounting for backup storage, egress charges, and engineering time to maintain automation."
  - content: "Contoso deploys sentiment analysis models packaged as 8 GB Docker containers that update twice daily as new training data becomes available. The containers deploy to Microsoft Foundry compute clusters in East US (primary) and West US (failover). During the last deployment, West US clusters failed to pull the updated container for 12 minutes after the East US deployment succeeded, causing version inconsistency during that window. How should you optimize the container registry configuration to minimize version drift between regions?"
    choices:
    - content: "Switch from Azure Container Registry Premium to Standard tier in both regions with manual image push to each regional registry immediately after building to ensure simultaneous availability"
      isCorrect: false
      explanation: "Implementing a deployment pipeline that polls the West US replica for image availability before updating that region's compute clusters directly addresses the version drift problem by ensuring the container is fully replicated before deployment proceeds. The 12-minute replication window for an 8 GB image is normal for cross-country replication, and the polling approach adds minimal delay (typically 1-2 polling cycles after replication completes) while guaranteeing consistency. This approach maintains the cost-effectiveness of a single Premium registry with automatic geo-replication while preventing the version inconsistency issues. Switching to Standard tier eliminates geo-replication entirely and requires manual image pushes to separate registries in each region, increasing operational complexity, introducing potential human errors during manual pushes, and actually increasing costs because you pay for two separate Standard registries plus cross-region egress bandwidth for pushing images. Adding zone-redundant storage and a third replica region doesn't solve the replication timing issue because asynchronous replication still requires propagation time, and Central US doesn't provide a geographical advantage as an intermediary between East and West US (images replicate directly between regions regardless of additional replicas)."
    - content: "Continue using ACR Premium with geo-replication but implement a deployment pipeline that polls the West US replica API every 30 seconds for image availability before updating the West US hub's compute clusters"
      isCorrect: true
      explanation: "Implementing a deployment pipeline that polls the West US replica for image availability before updating that region's compute clusters directly addresses the version drift problem by ensuring the container is fully replicated before deployment proceeds. The 12-minute replication window for an 8 GB image is normal for cross-country replication, and the polling approach adds minimal delay (typically 1-2 polling cycles after replication completes) while guaranteeing consistency. This approach maintains the cost-effectiveness of a single Premium registry with automatic geo-replication while preventing the version inconsistency issues. Switching to Standard tier eliminates geo-replication entirely and requires manual image pushes to separate registries in each region, increasing operational complexity, introducing potential human errors during manual pushes, and actually increasing costs because you pay for two separate Standard registries plus cross-region egress bandwidth for pushing images. Adding zone-redundant storage and a third replica region doesn't solve the replication timing issue because asynchronous replication still requires propagation time, and Central US doesn't provide a geographical advantage as an intermediary between East and West US (images replicate directly between regions regardless of additional replicas)."
    - content: "Upgrade to ACR Premium with zone-redundant storage in the primary region and increase the number of replicas to three regions including Central US as an intermediary replication hop"
      isCorrect: false
      explanation: "Implementing a deployment pipeline that polls the West US replica for image availability before updating that region's compute clusters directly addresses the version drift problem by ensuring the container is fully replicated before deployment proceeds. The 12-minute replication window for an 8 GB image is normal for cross-country replication, and the polling approach adds minimal delay (typically 1-2 polling cycles after replication completes) while guaranteeing consistency. This approach maintains the cost-effectiveness of a single Premium registry with automatic geo-replication while preventing the version inconsistency issues. Switching to Standard tier eliminates geo-replication entirely and requires manual image pushes to separate registries in each region, increasing operational complexity, introducing potential human errors during manual pushes, and actually increasing costs because you pay for two separate Standard registries plus cross-region egress bandwidth for pushing images. Adding zone-redundant storage and a third replica region doesn't solve the replication timing issue because asynchronous replication still requires propagation time, and Central US doesn't provide a geographical advantage as an intermediary between East and West US (images replicate directly between regions regardless of additional replicas)."
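The correct answer to the last question has the pipeline poll the replica for image availability before deploying. The registry-specific check aside, the control flow is a generic bounded polling loop. A minimal sketch: `probe` is a hypothetical stand-in for whatever availability check your pipeline makes against the replica's API, and the interval/timeout values mirror the question's 30-second cadence:

```python
import time

def wait_for_replica(probe, interval_s=30, timeout_s=1800):
    """Call `probe()` (which returns True once the image is available
    in the replica) every `interval_s` seconds, giving up after
    `timeout_s`. Returns True if the image became available, False on
    timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        # Sleep for the polling interval, but never past the deadline.
        time.sleep(min(interval_s, max(0.0, deadline - time.monotonic())))
    return False
```

Gating the West US rollout on `wait_for_replica(...)` means a slow 8 GB replication delays that region's deployment instead of letting it pull a stale tag, which is exactly the version-drift fix the question describes.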
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
### YamlMime:ModuleUnit
uid: learn.wwl.implement-resilient-ai-ready-infrastructure.summary
title: "Summary"
metadata:
  title: "Summary"
  description: "Summary."
  ms.date: 02/03/2026
  author: wwlpublish
  ms.author: bradj
  ms.topic: unit
durationInMinutes: 2
content: |
  [!include[](includes/7-summary.md)]
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
This module shows you how to configure Microsoft Foundry hubs across multiple regions, implement geo-redundant storage with soft delete protection for training datasets, and deploy Azure Container Registry with automatic image replication. By the end, you'll architect production-ready AI infrastructure that maintains availability during regional failures and protects critical data assets.

## Learning objectives

By the end of this module, you'll be able to:

- Configure Microsoft Foundry hubs for multi-region AI workload distribution
- Implement geo-redundant Azure Blob Storage with soft delete and resource locks for AI data protection
- Deploy Azure Container Registry Premium with geo-replication for AI model distribution
- Evaluate infrastructure resilience strategies for enterprise AI workloads

## Prerequisites

Before starting this module, you should have:

- Experience managing Azure resources through the Azure portal or Azure CLI
- Familiarity with Azure resource groups, regions, and basic networking concepts
- Understanding of the AI/ML lifecycle, including model training, deployment, and inference

## More resources

- [Microsoft Foundry documentation](/azure/ai-studio/) - Official reference for Microsoft Foundry hub configuration and project management
- [Azure Storage redundancy options](/azure/storage/common/storage-redundancy) - Comprehensive guide to storage replication strategies and durability guarantees
