Commit ef363df

Merge pull request #309914 from Padmalathas/Muti-Region-Deployment

[NEW] Multi-Region HPC Cluster Deployment Guidance

2 parents e2acba9 + f088505

7 files changed: 353 additions & 1 deletion

articles/cyclecloud/concepts/multi-region-cluster-deployment.md (350 additions & 0 deletions)
---
title: Deploy a multi‑region HPC cluster
description: Guidance to plan, deploy, operate, and recover a multi‑region HPC cluster on Azure, including architecture options, prerequisites, step‑by‑step deployment, day‑2 operations, DR strategy, and caveats.
author: jemorey
ms.author: padmalathas
ms.reviewer: xpillons
ms.date: 12/31/2025
---
# Deploy a multi‑region HPC cluster

High-performance computing (HPC) clusters are typically deployed within a single Azure region to minimize network latency and keep compute resources co-located with data. However, there are scenarios where deploying a multi-region HPC cluster becomes necessary or advantageous, whether for increased capacity, access to specialized hardware, data locality requirements, or disaster recovery (DR) planning.

This guide provides end-to-end guidance for HPC users to understand, plan, deploy, and operate multi-region HPC clusters on Azure, with a focus on Azure CycleCloud as the orchestration platform.

> [!NOTE]
> Multi-region HPC deployments introduce additional complexity and cost. Not every workload benefits from this approach, so evaluate your specific needs before proceeding.
## Why consider a multi‑region HPC cluster?

Multi-region HPC designs allow your workloads to span or switch between Azure regions, improving availability, resiliency, or scalability beyond what a single region can offer. The following table summarizes common drivers for a multi-region design.

| Use case | Description |
|----------|-------------|
| **Capacity and scale** | A single region might lack sufficient cores or specific VM sizes during peak demand. Splitting loosely coupled workloads across regions provides access to more total cores or alternative VM SKUs. |
| **Specialized hardware availability** | Not all Azure regions offer the same HPC VM types (for example, NDv4 GPUs, HB-series InfiniBand VMs). A secondary region can provide specialized compute resources unavailable in your primary region. |
| **Data localization/proximity** | Large datasets might be tied to specific regions (for example, Azure Open Datasets in West US 2). Deploying compute in that region reduces data movement costs and latency. |
| **Disaster recovery (DR)** | For mission-critical workloads, multi-region deployment provides higher availability and a DR path in case of regional outages. |
### When to avoid a multi-region deployment

Tightly coupled parallel jobs (for example, MPI applications with frequent inter-node communication) might suffer significant performance degradation if spread across distant regions due to added network latency. Consider multi-region only when your workloads can tolerate increased latency and management overhead.
## Architecture comparison

There are multiple architectural approaches for designing an HPC solution across regions. The best approach depends on your goals for load distribution and disaster recovery.

:::image type="content" source="../images/multi-region-hpc-architecture.png" alt-text="Diagram of a multi-region HPC architecture." lightbox="../images/multi-region-hpc-architecture.png":::

### Option 1: Active/Active clusters (independent clusters per region)

In this option, you deploy two or more full HPC clusters in different regions, each actively running jobs. The work is divided between regions by project or workload type.

:::image type="content" source="../images/active-active-clusters.png" alt-text="Diagram of the Active/Active cluster option." lightbox="../images/active-active-clusters.png":::

- **Pros:** Maximum capacity, redundancy, and high availability.
- **Cons:** Higher cost and operational overhead, separate management, and complex data synchronization.
- **Best for:** Organizations needing maximum capacity and continuous availability.
### Option 2: Active/Passive (primary with disaster recovery failover)

In this option, an HPC cluster runs in a primary region with a standby environment in a secondary region for disaster recovery. The secondary region is pre-provisioned but doesn't run jobs during normal operations.

:::image type="content" source="../images/active-passive-disaster-recovery.png" alt-text="Diagram of the Active/Passive cluster option." lightbox="../images/active-passive-disaster-recovery.png":::

- **Pros:** Lower cost; focuses on business continuity with reduced overhead.
- **Cons:** Requires ongoing data replication and regular DR drills; non-zero failover time.
- **Best for:** Mission-critical workloads with a defined recovery time objective (RTO).
### Option 3: Single HPC control plane across regions

In this option, one scheduler/head node manages compute in multiple regions.

:::image type="content" source="../images/single-control-plane-multi-region.png" alt-text="Diagram of the single control plane across regions option." lightbox="../images/single-control-plane-multi-region.png":::

- **Pros:** Unified cluster partitions, a single endpoint for users, and no duplicate control systems.
- **Cons:** Requires advanced network setup and new partitions in the remote regions; the single head node is a single point of failure, which complicates reliability.
- **Best for:** Organizations wanting a unified management experience with regional compute pools.
### Architecture selection guide

| Requirement or goal | Recommended multi-region model |
|---------------------|--------------------------------|
| Maximum capacity | Active/Active |
| Cost efficiency | Active/Passive |
| Simplified management | Single control plane |
| Fastest failover | Active/Active |
| Lowest RTO | Active/Active |
| Lowest operational cost | Active/Passive |
## Reference deployment: multi-region HPC cluster with CycleCloud

This section outlines the process for setting up a multi-region HPC Slurm cluster using CycleCloud. In this example, the head node runs in Region A (primary), and additional compute nodes can be provisioned in Region B (secondary).

:::image type="content" source="../images/cyclecloud-multi-region-deployment.png" alt-text="Diagram of a multi-region cluster deployment with CycleCloud." lightbox="../images/cyclecloud-multi-region-deployment.png":::

### Prerequisites

- Working CycleCloud installation in Region A
- Adequate quota for chosen VM sizes in both regions
- VNet peering or connectivity between regions
### Step-by-step deployment

#### Step 1: Prepare cloud resources in both regions

```bash
# Create resource groups (if they don't exist)
az group create --name rg-hpc-regionA --location eastus
az group create --name rg-hpc-regionB --location westus2

# Check quota availability
az vm list-usage --location eastus --output table
az vm list-usage --location westus2 --output table
```

- Create or identify resource groups and virtual networks in each region
- Deploy ancillary services (for example, a replica Active Directory domain controller in Region B if you use Active Directory)
#### Step 2: Deploy CycleCloud in Region A with a region-scoped identity

Follow the standard Azure CycleCloud installation for your primary region, using a managed identity or service principal limited to Region A's resources.
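For illustration, the following sketch scopes a role assignment to Region A's resource group only. It assumes a CycleCloud VM named `cyclecloud-vm` with a system-assigned managed identity; adapt the names to your environment:

```bash
# Hypothetical example: grant the CycleCloud VM's system-assigned managed
# identity Contributor rights on Region A's resource group only.
CC_PRINCIPAL_ID=$(az vm show -g rg-hpc-regionA -n cyclecloud-vm \
  --query identity.principalId -o tsv)

az role assignment create \
  --assignee-object-id "$CC_PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role Contributor \
  --scope "/subscriptions/<subscription-id>/resourceGroups/rg-hpc-regionA"
```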
#### Step 3: Add a credential for Region B in CycleCloud

```bash
# Create a service principal for Region B
az ad sp create-for-rbac --name "cyclecloud-regionB-sp" \
  --role contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/rg-hpc-regionB

# In the CycleCloud CLI, add the new credential
cyclecloud account create regionB-account
```

Use the same CycleCloud locker (Blob Storage container) in both regions for cluster data. Alternatively, if you set up a CycleCloud locker in Region B, you must create a private endpoint in Region B to access the locker in Region A.
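As a sketch of that private endpoint, the following assumes a locker storage account named `sacyclecloudlocker` in Region A and reuses the compute subnet in Region B; both names are illustrative:

```bash
# Hypothetical example: create a private endpoint in Region B's VNet so
# nodes there can reach the locker storage account in Region A.
LOCKER_ID=$(az storage account show -g rg-hpc-regionA \
  -n sacyclecloudlocker --query id -o tsv)

az network private-endpoint create -g rg-hpc-regionB -n pe-lockerA \
  --vnet-name vnet-hpc-regionB --subnet compute-subnet \
  --private-connection-resource-id "$LOCKER_ID" \
  --group-id blob --connection-name lockerA-connection
```

In practice, you'd also link a `privatelink.blob.core.windows.net` private DNS zone to Region B's VNet so the locker's hostname resolves to the private endpoint.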
#### Step 4: Connect networks and DNS across regions

```bash
# Look up VNet resource IDs (full IDs are required because the remote
# VNet lives in a different resource group)
VNETA_ID=$(az network vnet show -g rg-hpc-regionA -n vnet-hpc-regionA --query id -o tsv)
VNETB_ID=$(az network vnet show -g rg-hpc-regionB -n vnet-hpc-regionB --query id -o tsv)

# Create VNet peering from VNET-1 to VNET-2
az network vnet peering create -g rg-hpc-regionA -n VNET1ToVNET2 \
  --vnet-name vnet-hpc-regionA --remote-vnet "$VNETB_ID" --allow-vnet-access

# Create VNet peering from VNET-2 to VNET-1
az network vnet peering create -g rg-hpc-regionB -n VNET2ToVNET1 \
  --vnet-name vnet-hpc-regionB --remote-vnet "$VNETA_ID" --allow-vnet-access

# Create a private DNS zone and link it to both VNets
az network private-dns zone create -g rg-hpc-regionA -n hpc.internal
az network private-dns link vnet create -g rg-hpc-regionA \
  -n regionA-link -z hpc.internal -v "$VNETA_ID" -e true
az network private-dns link vnet create -g rg-hpc-regionA \
  -n regionB-link -z hpc.internal -v "$VNETB_ID" -e true
```
> [!NOTE]
> In certain configurations, including Open OnDemand deployments, node-level DNS resolution might require updating `resolv.conf` so that the private DNS zone resolver is queried before the Azure resolver, to support short-name resolution. A persistent way to apply this change isn't currently established, so reapply it as part of node configuration.
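A minimal sketch of the kind of change involved (illustrative only; many images regenerate `/etc/resolv.conf`, so persist this through your image or cluster-init tooling rather than a one-off edit):

```bash
# Illustrative only: add the private zone to the DNS search list so short
# node names resolve through hpc.internal. Not persistent across reboots
# on images that regenerate /etc/resolv.conf.
sudo sed -i 's/^search /search hpc.internal /' /etc/resolv.conf
```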
#### Step 5: Customize the CycleCloud cluster template for multi-region

Modify your CycleCloud cluster template to define node pools for each region. Key parameters include the following; a nodearray sketch follows the list.

- `Region`
- `VpcId` (VNet)
- `Subnets`
- `AvailabilityZone`
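For orientation, an added secondary-region nodearray might look like this sketch. Attribute names follow common CycleCloud Slurm template conventions, and the partition name is illustrative; verify everything against your actual template before importing:

```ini
# Sketch of an additional nodearray for Region B (illustrative; verify
# attribute names against your own Slurm template)
    [[nodearray hpc-regionB]]
    MachineType = $HPCMachineType
    Region = $SecondaryRegion
    SubnetId = $SecondarySubnet
    MaxCoreCount = $MaxHPCExecuteCoreCount

        [[[configuration]]]
        slurm.partition = hpc-regionB
        slurm.autoscale = true
```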
Example parameters file (`slurm-multiregion-params.json`):
```json
{
  "Credentials": "your-cyclecloud-credential",
  "PrimarySubnet": "rg-hpc-regionA/vnet-hpc-regionA/compute-subnet",
  "PrimaryRegion": "eastus",
  "SecondarySubnet": "rg-hpc-regionB/vnet-hpc-regionB/compute-subnet",
  "SecondaryRegion": "westus2",
  "HPCMachineType": "Standard_HB120rs_v3",
  "MaxHPCExecuteCoreCount": 1200,
  "HTCMachineType": "Standard_D16s_v5",
  "MaxHTCExecuteCoreCount": 400
}
```
#### Step 6: Import the template and deploy the cluster

```bash
# Import the cluster template
cyclecloud import_cluster multi-region-slurm -c Slurm \
  -f slurm-multiregion-template.txt \
  -p slurm-multiregion-params.json

# Start the cluster
cyclecloud start_cluster multi-region-slurm
```
#### Step 7: Verify nodes in both regions

```bash
# Check node status from the scheduler
sinfo

# View CycleCloud node status
cyclecloud show_nodes multi-region-slurm
```
#### Step 8: Test job submission and data access

```bash
#!/bin/bash
#SBATCH --job-name=mpiMultiRegion
#SBATCH --partition=hpc-regionB
#SBATCH -N 2
#SBATCH -n 120
#SBATCH --chdir /tmp
#SBATCH --exclusive

set -x
source /etc/profile.d/modules.sh
module load mpi/hpcx

echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
NPROCS=$SLURM_NTASKS

mpirun -n $NPROCS --report-bindings echo "hello world!"
mv slurm-${SLURM_JOB_ID}.out "$HOME"
```

> [!TIP]
> Set an explicit working directory local to Region B (`--chdir /tmp`) since the home directory is typically in Region A.
## Day-2 operations and management

### Reliability and job resiliency

> [!IMPORTANT]
> CycleCloud and Slurm do not provide an SLA for individual job success. Implement application-level checkpoint/restart to recover from interruptions.
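A minimal sketch of this pattern, assuming an application that writes its own checkpoint files (`./solver` and its flags are hypothetical):

```bash
#!/bin/bash
# Illustrative checkpoint/restart pattern: requeue the job on failure and
# resume from the newest checkpoint. Checkpoint mechanics are
# application-specific; ./solver and its flags are hypothetical.
#SBATCH --job-name=resilient-job
#SBATCH --requeue
#SBATCH --open-mode=append

LATEST=$(ls -t checkpoints/*.chk 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    srun ./solver --restart "$LATEST"
else
    srun ./solver
fi
```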
### Capacity management

- Monitor resource utilization continuously
- Distribute work to the alternate region when approaching quota limits (see the check sketched after this list)
- Use multiple VM instance types across regions
- Review Azure updates for new HPC VM availability
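For example, a lightweight check like the following sketch can flag VM families nearing quota in each region; the 80% threshold and region list are illustrative:

```bash
# Illustrative quota check: print VM families above 80% of regional quota
# as a trigger to shift new work to the alternate region.
for region in eastus westus2; do
  az vm list-usage --location "$region" --output tsv \
    --query '[].[name.value, currentValue, limit]' |
  awk -v r="$region" '$3 > 0 && $2 / $3 > 0.8 { print r, $1, $2 "/" $3 }'
done
```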
### Security and access control

- Use region-specific managed identities
- Restrict network access with NSGs, even with VNet peering
- Maintain separate Key Vault instances per region
- Ensure data replication complies with residency requirements
## Disaster recovery strategy

### DR configuration options

| Configuration | Description | Recovery speed |
|---------------|-------------|----------------|
| **Active/Active** | Continuous availability; duplicate all systems | Immediate |
| **Active/Passive (warm standby)** | Small head node running; compute spun up on failover | Minutes to hours |
| **Passive (cold standby)** | Manual deployment from backups | Hours to days |
### Data protection requirements

- Replicate critical data (input datasets, home directories, checkpoints, results)
- Use geo-redundant storage or custom replication with versioning
- Match replication frequency to your recovery point objective (RPO); for example, a 4-hour RPO means replicating at least every 4 hours (a replication sketch follows this list)
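As an illustration, a scheduled `azcopy sync` job can replicate results between paired storage accounts. The account and container names are examples; authenticate with SAS tokens or `azcopy login`:

```bash
# Illustrative replication job: run on a schedule that matches your RPO.
# azcopy sync copies only changed blobs between the two accounts.
azcopy sync \
  "https://sahpcregiona.blob.core.windows.net/results" \
  "https://sahpcregionb.blob.core.windows.net/results" \
  --recursive
```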
## Validation and testing

### Test categories

#### Job submission tests

- Submit test jobs targeting each region's resources
- Measure job start time and network throughput
- Run small MPI jobs across nodes in different regions
- Note performance impact from cross-region latency

#### Data consistency checks

- Test that replicated data in Region B is usable
- Disconnect primary storage and attempt reads from the secondary
- Verify all data and metadata (permissions) are intact

#### End-to-end DR test

- Assume Region A is unavailable
- Bring up the HPC environment in Region B using DR procedures
- Measure time to restore critical functionality
- Verify RPO and RTO compliance
- Fail back to Region A and synchronize changes
## Caveats and considerations

### Performance caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **Network bandwidth limits** | Large data transfers might bottleneck | Pre-stage data and use compression |
| **Working directory location** | Jobs in Region B might have slow access to Region A home directories | Use a local working directory and mirror user home directories when required |

### Operational caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **No automatic job completion SLA** | Jobs might fail without automatic recovery | Implement checkpointing and retry logic |
| **Double management overhead** | Active/Active requires managing two clusters | Use automation and infrastructure-as-code |
| **CycleCloud UI limitations** | The UI restricts configuration to a single region | Use the CLI with custom templates and parameters files |
| **Name resolution complexity** | Nodes must resolve across regions | Configure private DNS zones linked to both VNets |

### Cost caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **Egress charges** | Cross-region traffic incurs additional costs | Process data in-region and use local storage |
| **Idle secondary resources** | The DR region incurs costs even when idle | Rely on autoscaling to deallocate idle compute nodes |
| **Quota management** | Both regions need sufficient quota | Request increases early and use capacity reservations |

### Security caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **Credential scope** | Cross-region credentials increase the blast radius | Use region-specific managed identities |
| **Data residency** | Replicating data might violate compliance | Verify regulatory requirements before replication |
| **Network exposure** | VNet peering opens cross-region paths | Apply strict NSG rules |
## Frequently asked questions (FAQs)

* Do I need a separate CycleCloud server in each region?

  Not necessarily. CycleCloud can manage multiple regions from one instance using multiple credentials and lockers. However, for higher availability, some organizations run distinct CycleCloud installations in each region with identical configurations.

* How can I minimize data egress charges?

  - Keep shared storage synchronized using Azure backbone replication (GRS, ZRS)
  - Use region-specific CycleCloud lockers instead of a global store
  - Compress data before transfer
  - Transmit only incremental changes
  - Run jobs in the same region as their data

* What RPO/RTO should HPC workloads have?

  Typical targets:

  - **RPO**: A few hours (or the checkpoint interval length)
  - **RTO**: 4-24 hours

  For time-sensitive workloads (for example, weather forecasting with strict deadlines), near-zero RTO might require an Active/Active setup.

* Are there SLA guarantees for multi-region HPC job completion?

  No. There is no Microsoft SLA guaranteeing individual HPC job completion, whether single-region or multi-region. Azure infrastructure services (VMs, Virtual Machine Scale Sets, Storage) have availability SLAs, but job-level recovery is your responsibility.

* How do I check if a region supports my HPC needs?

  - Consult the [Azure products by region documentation](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/)
  - Check the Azure portal or CLI for VM sizes and quotas per region (see the example after this list)
  - Engage Azure support for capacity planning on large deployments
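For example, you can confirm that a specific VM size is offered in a candidate region with a query like this:

```bash
# Check whether a specific HPC VM size is available in a candidate region
az vm list-skus --location westus2 --size Standard_HB120rs_v3 --output table
```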
5 image files added (169 KB, 84.3 KB, 159 KB, 167 KB, 77.2 KB): the architecture and deployment diagrams referenced from the article above.

articles/cyclecloud/toc.yml

Lines changed: 3 additions & 1 deletion

```diff
@@ -23,7 +23,9 @@
     - name: Clusters & Nodes
       href: ./concepts/clusters.md
     - name: Scheduling
-      href: ./concepts/scheduling.md
+      href: ./concepts/scheduling.md
+    - name: Multi-region Cluster Deployment
+      href: ./concepts/multi-region-cluster-deployment.md
     - name: Clusters Operations
       items:
       - name: User Management
```
