---
title: Deploy a multi‑region HPC cluster
description: Guidance to plan, deploy, operate, and recover a multi‑region HPC cluster on Azure, including architecture options, prerequisites, step‑by‑step deployment, day‑2 operations, DR strategy, and caveats.
author: jemorey
ms.author: padmalathas
ms.reviewer: xpillons
ms.date: 12/31/2025
---

# Deploy a multi‑region HPC cluster

High-performance computing (HPC) clusters are typically deployed within a single Azure region to minimize network latency and keep compute resources co-located with data. However, there are scenarios where a multi-region HPC cluster becomes necessary or advantageous: increased capacity, access to specialized hardware, data locality requirements, or disaster recovery (DR) planning.

This guide provides end-to-end guidance for HPC users to understand, plan, deploy, and operate multi-region HPC clusters on Azure, with a focus on Azure CycleCloud as the orchestration platform.

> [!NOTE]
> Multi-region HPC deployments introduce additional complexity and cost. Not every workload benefits from this approach, so evaluate your specific needs before proceeding.

## Why consider a multi‑region HPC cluster?

Multi-region HPC designs allow your workloads to span or switch between Azure regions, improving availability, resiliency, or scalability beyond what a single region can offer. The following table summarizes common use cases.

| Use Case | Description |
|----------|-------------|
| **Capacity and Scale** | A single region might lack sufficient cores or specific VM sizes during peak demand. Splitting loosely coupled workloads across regions provides access to more total cores or alternative VM SKUs. |
| **Specialized Hardware Availability** | Not all Azure regions offer the same HPC VM types (for example, NDv4 GPUs, HB-series InfiniBand VMs). A secondary region can provide specialized compute resources unavailable in your primary region. |
| **Data Localization/Proximity** | Large datasets might be tied to specific regions (for example, Azure Open Datasets in West US 2). Deploying compute in that region reduces data movement costs and latency. |
| **Disaster Recovery (DR)** | For mission-critical workloads, multi-region deployment provides higher availability and a DR path in case of regional outages. |

### When to avoid a multi-region deployment

Tightly coupled parallel jobs (for example, MPI applications with frequent inter-node communication) might suffer significant performance degradation if spread across distant regions due to added network latency. Consider multi-region only when your workloads can tolerate increased latency and management overhead.

## Architecture comparison

There are multiple architectural approaches for designing an HPC solution across regions. The best approach depends on your goals for load distribution and disaster recovery.

:::image type="content" source="../images/multi-region-hpc-architecture.png" alt-text="Diagram of a multi-region HPC architecture." lightbox="../images/multi-region-hpc-architecture.png":::

### Option 1: Active/Active clusters (independent clusters per region)

In this option, you deploy two or more full HPC clusters in different regions, each actively running jobs. The work is divided between regions by project or workload type.

:::image type="content" source="../images/active-active-clusters.png" alt-text="Diagram of the Active/Active cluster option." lightbox="../images/active-active-clusters.png":::

- **Pros:** Maximum capacity, redundancy, and high availability.
- **Cons:** Higher cost and operations overhead, separate management, and complex data synchronization.
- **Best for:** Organizations needing maximum capacity and continuous availability.

### Option 2: Active/Passive (primary with disaster recovery failover)

In this option, an HPC cluster runs in a primary region with a standby environment in a secondary region for disaster recovery. The secondary region is pre-provisioned but doesn't run jobs during normal operations.

:::image type="content" source="../images/active-passive-disaster-recovery.png" alt-text="Diagram of the Active/Passive cluster option." lightbox="../images/active-passive-disaster-recovery.png":::

- **Pros:** Lower cost; delivers business continuity with reduced overhead.
- **Cons:** Requires ongoing data replication and regular DR drills; failover time is non‑zero.
- **Best for:** Mission‑critical workloads with a defined recovery time objective (RTO).

### Option 3: Single HPC control plane across regions

In this option, one scheduler/head node manages compute in multiple regions.

:::image type="content" source="../images/single-control-plane-multi-region.png" alt-text="Diagram of the single control plane across regions option." lightbox="../images/single-control-plane-multi-region.png":::

- **Pros:** Unified cluster partitions, a single endpoint for users, and no duplicate control systems.
- **Cons:** Requires advanced network setup and new partitions in the remote regions; the single head node is a potential single point of failure, which complicates reliability.
- **Best for:** Organizations wanting a unified management experience with regional compute pools.
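
With a single control plane, the remote region's compute typically appears as an additional scheduler partition. A minimal sketch in Slurm terms (node names, counts, and partition names here are hypothetical):

```ini
# Hypothetical slurm.conf fragment: one controller in Region A,
# with a separate partition for each region's node pool
PartitionName=hpc-regionA Nodes=ccnode-a-[1-100] Default=YES State=UP
PartitionName=hpc-regionB Nodes=ccnode-b-[1-50] State=UP
```

Users then target a region by selecting a partition at submission time, so a single endpoint serves both pools.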

### Architecture selection guide

| Requirement or goal | Recommended multi-region model |
|-------------|-------------------|
| Maximum capacity | Active/Active |
| Cost efficiency | Active/Passive |
| Simplified management | Single Control Plane |
| Fastest failover | Active/Active |
| Lowest RTO | Active/Active |
| Lowest operational cost | Active/Passive |

## Reference deployment: multi-region HPC cluster with CycleCloud

This section outlines the process for setting up a multi-region HPC Slurm cluster using CycleCloud. In this example, the head node runs in Region A (primary), and additional compute nodes can be provisioned in Region B (secondary).

:::image type="content" source="../images/cyclecloud-multi-region-deployment.png" alt-text="Diagram of a multi-region cluster deployment." lightbox="../images/cyclecloud-multi-region-deployment.png":::

### Prerequisites

- A working CycleCloud installation in Region A
- Adequate quota for the chosen VM sizes in both regions
- VNet peering or other connectivity between regions

### Step-by-step deployment

#### Step 1: Prepare cloud resources in both regions

```bash
# Create resource groups (if they don't already exist)
az group create --name rg-hpc-regionA --location eastus
az group create --name rg-hpc-regionB --location westus2

# Check quota availability
az vm list-usage --location eastus --output table
az vm list-usage --location westus2 --output table
```

- Create or identify resource groups and virtual networks in each region
- Deploy ancillary services (for example, a replica Active Directory domain controller in Region B if you use Active Directory)

#### Step 2: Deploy CycleCloud in Region A with a region-scoped identity

Follow the standard Azure CycleCloud installation for your primary region, using a managed identity or service principal limited to Region A's resources.

#### Step 3: Add a credential for Region B in CycleCloud

```bash
# Create a service principal scoped to Region B
az ad sp create-for-rbac --name "cyclecloud-regionB-sp" \
  --role contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/rg-hpc-regionB

# In the CycleCloud CLI, add the new credential
cyclecloud account create regionB-account
```

You can use the same CycleCloud locker (a Blob Storage container) for cluster data in both regions. Alternatively, if you set up a CycleCloud locker in Region B, you must create a private endpoint in Region B to access the locker in Region A.

#### Step 4: Connect networks and DNS across regions

```bash
# Create VNet peering from VNET-1 to VNET-2 (VNets in another resource
# group must be referenced by full resource ID)
az network vnet peering create -g rg-hpc-regionA -n VNET1ToVNET2 \
  --vnet-name vnet-hpc-regionA --allow-vnet-access \
  --remote-vnet "/subscriptions/<subscription-id>/resourceGroups/rg-hpc-regionB/providers/Microsoft.Network/virtualNetworks/vnet-hpc-regionB"

# Create VNet peering from VNET-2 to VNET-1
az network vnet peering create -g rg-hpc-regionB -n VNET2ToVNET1 \
  --vnet-name vnet-hpc-regionB --allow-vnet-access \
  --remote-vnet "/subscriptions/<subscription-id>/resourceGroups/rg-hpc-regionA/providers/Microsoft.Network/virtualNetworks/vnet-hpc-regionA"

# Create a private DNS zone and link it to both VNets
az network private-dns zone create -g rg-hpc-regionA -n hpc.internal
az network private-dns link vnet create -g rg-hpc-regionA \
  -n regionA-link -z hpc.internal -v vnet-hpc-regionA -e true
az network private-dns link vnet create -g rg-hpc-regionA \
  -n regionB-link -z hpc.internal -e true \
  -v "/subscriptions/<subscription-id>/resourceGroups/rg-hpc-regionB/providers/Microsoft.Network/virtualNetworks/vnet-hpc-regionB"
```

> [!NOTE]
> In certain configurations, including Open OnDemand deployments, node‑level DNS resolution might require updating `resolv.conf` to query the private DNS zone resolver before the Azure resolver to support short‑name resolution. There's currently no documented persistent solution, so you might need to reapply the change after DHCP renewals or node reimaging.

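One way to apply that workaround is a small helper that prepends a `search` entry for the zone created in this step. This is a sketch, not an officially documented fix, and as noted the change doesn't persist:

```shell
# Prepend a search domain to a resolv.conf-style file so short names such as
# "scheduler" resolve as "scheduler.hpc.internal". Idempotent: skips the edit
# if a matching search entry already exists.
add_search_domain() {
  local file="$1" zone="$2"
  grep -q "^search $zone" "$file" || sed -i "1i search $zone" "$file"
}

# On a cluster node: add_search_domain /etc/resolv.conf hpc.internal
```
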
#### Step 5: Customize the CycleCloud cluster template for multi-region

Modify your CycleCloud cluster template to define node pools for each region. Key parameters include:

- `Region`
- `VpcId` (VNET)
- `Subnets`
- `AvailabilityZone`

Example parameters file (`slurm-multiregion-params.json`):

```json
{
  "Credentials": "your-cyclecloud-credential",
  "PrimarySubnet": "rg-hpc-regionA/vnet-hpc-regionA/compute-subnet",
  "PrimaryRegion": "eastus",
  "SecondarySubnet": "rg-hpc-regionB/vnet-hpc-regionB/compute-subnet",
  "SecondaryRegion": "westus2",
  "HPCMachineType": "Standard_HB120rs_v3",
  "MaxHPCExecuteCoreCount": 1200,
  "HTCMachineType": "Standard_D16s_v5",
  "MaxHTCExecuteCoreCount": 400
}
```
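
In the template itself, these parameters typically feed per-region nodearrays. The following fragment is a sketch that assumes a base `hpc` nodearray already exists; exact attribute names can vary by template version:

```ini
[[nodearray hpc-regionB]]
    Extends = hpc
    Region = $SecondaryRegion
    SubnetId = $SecondarySubnet
    MachineType = $HPCMachineType
```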

#### Step 6: Import the template and deploy the cluster

```bash
# Import the cluster template
cyclecloud import_cluster multi-region-slurm -c Slurm \
  -f slurm-multiregion-template.txt \
  -p slurm-multiregion-params.json

# Start the cluster
cyclecloud start_cluster multi-region-slurm
```

#### Step 7: Verify nodes in both regions

```bash
# Check node status from the scheduler
sinfo

# View CycleCloud node status
cyclecloud show_nodes multi-region-slurm
```
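
Beyond eyeballing `sinfo`, a small helper can confirm that the secondary-region partition reports capacity. The partition name `hpc-regionB` matches the job script in the next step; the helper itself is a sketch around standard `sinfo` output:

```shell
# Count nodes of a partition in a given state, reading the output of
# `sinfo -h -o "%P %t %D"` (partition, state, node count; no header).
count_nodes() {
  local part="$1" state="$2"
  awk -v p="$part" -v s="$state" '$1 == p && $2 == s { n += $3 } END { print n + 0 }'
}

# On the scheduler node (guarded so the snippet is harmless elsewhere):
if command -v sinfo >/dev/null 2>&1; then
  echo "Idle nodes in hpc-regionB: $(sinfo -h -o '%P %t %D' | count_nodes hpc-regionB idle)"
fi
```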

#### Step 8: Test job submission and data access

```bash
#!/bin/bash
#SBATCH --job-name=mpiMultiRegion
#SBATCH --partition=hpc-regionB
#SBATCH -N 2
#SBATCH -n 120
#SBATCH --chdir /tmp
#SBATCH --exclusive

set -x
source /etc/profile.d/modules.sh
module load mpi/hpcx

echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
NPROCS=$SLURM_NTASKS

mpirun -n $NPROCS --report-bindings echo "hello world!"
mv slurm-${SLURM_JOB_ID}.out $HOME
```

> [!TIP]
> Set an explicit working directory local to Region B (`--chdir /tmp`) because the home directory is typically in Region A.

## Day-2 operations and management

### Reliability and job resiliency

> [!IMPORTANT]
> CycleCloud and Slurm do not provide an SLA for individual job success. Implement application-level checkpoint/restart to recover from interruptions.

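A minimal sketch of what application-level checkpointing can look like in a batch script; the step count, checkpoint file, and signal timing are illustrative placeholders, not CycleCloud or Slurm requirements:

```shell
#!/bin/bash
#SBATCH --requeue                # allow Slurm to requeue the job after a node failure
#SBATCH --signal=B:USR1@120      # signal the batch script 120 seconds before the time limit

# Resume from the last recorded step instead of restarting from zero.
CKPT="${CKPT:-$HOME/ckpt.state}"
STEP=0
[ -f "$CKPT" ] && STEP=$(cat "$CKPT")

while [ "$STEP" -lt 10 ]; do
  # ... one unit of real work goes here ...
  STEP=$((STEP + 1))
  echo "$STEP" > "$CKPT"         # record progress after each completed unit
done
```

On requeue or failover, the script resumes at the recorded step, so an interruption costs at most one unit of work.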
### Capacity management

- Monitor resource utilization continuously
- Distribute work to the alternate region when approaching quota limits
- Use multiple VM instance types across regions
- Review Azure updates for new HPC VM availability

### Security and access control

- Use region-specific managed identities
- Restrict network access with NSGs, even with VNet peering
- Maintain separate Key Vault instances per region
- Ensure data replication complies with residency requirements

## Disaster recovery strategy

### DR configuration options

| Configuration | Description | Recovery Speed |
|---------------|-------------|----------------|
| **Active/Active** | Continuous availability, duplicate all systems | Immediate |
| **Active/Passive (Warm Standby)** | Small head node running, compute spun up on failover | Minutes to hours |
| **Passive/Cold** | Manual deployment from backups | Hours to days |

### Data protection requirements

- Replicate critical data (input datasets, home directories, checkpoints, results)
- Use geo-redundant storage or custom replication with versioning
- Match replication frequency to your recovery point objective (RPO); for example, a 4-hour RPO means replicating at least every 4 hours
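
For example, a 4-hour RPO can be enforced with a scheduled `azcopy sync` of the checkpoint container to the secondary region. The storage account and container names below are hypothetical, and authentication (a SAS token or `azcopy login`) is omitted for brevity:

```shell
# Crontab entry: mirror checkpoints to the secondary region every 4 hours
0 */4 * * * azcopy sync "https://hpcstorprimary.blob.core.windows.net/checkpoints" "https://hpcstorsecondary.blob.core.windows.net/checkpoints"
```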

## Validation and testing

### Test categories

#### Job submission tests

- Submit test jobs targeting each region's resources
- Measure job start time and network throughput
- Run small MPI jobs across nodes in different regions
- Note the performance impact from cross-region latency

#### Data consistency checks

- Test that replicated data in Region B is usable
- Disconnect primary storage and attempt reads from the secondary
- Verify all data and metadata (permissions) are intact

#### End-to-end DR test

- Assume Region A is unavailable
- Bring up the HPC environment in Region B using DR procedures
- Measure time to restore critical functionality
- Verify RPO and RTO compliance
- Fail back to Region A and synchronize changes

## Caveats and considerations

### Performance caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **Network bandwidth limits** | Large data transfers might bottleneck | Pre-stage data and use compression |
| **Working directory location** | Jobs in Region B might have slow access to Region A home directories | Use a local working directory and mirror user home directories when required |

### Operational caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **No automatic job completion SLA** | Jobs might fail without automatic recovery | Implement checkpointing and retry logic |
| **Double management overhead** | Active/Active requires managing two clusters | Use automation and infrastructure-as-code |
| **CycleCloud UI limitations** | The UI supports only single-region cluster configuration | Use the CLI with custom templates and parameters files |
| **Name resolution complexity** | Nodes must resolve names across regions | Configure private DNS zones linked to both VNets |

### Cost caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **Egress charges** | Cross-region traffic incurs additional costs | Process data in-region and use local storage |
| **Idle secondary resources** | The DR region incurs costs even when idle | Rely on autoscaling to deallocate idle compute nodes |
| **Quota management** | Both regions need sufficient quota | Request increases early and use capacity reservations |

### Security caveats

| Caveat | Impact | Mitigation |
|--------|--------|------------|
| **Credential scope** | Cross-region credentials increase the blast radius | Use region-specific managed identities |
| **Data residency** | Replicating data might violate compliance requirements | Verify regulatory requirements before replication |
| **Network exposure** | VNet peering opens cross-region paths | Apply strict NSG rules |

## Frequently asked questions (FAQs)

* Do I need a separate CycleCloud server in each region?

  Not necessarily. CycleCloud can manage multiple regions from one instance using multiple credentials and lockers. However, for higher availability, some organizations run distinct CycleCloud installations in each region with identical configurations.

* How can I minimize data egress charges?
  - Keep shared storage synchronized using Azure backbone replication (GRS, ZRS)
  - Use region-specific CycleCloud lockers instead of a global store
  - Compress data before transfer
  - Transmit only incremental changes
  - Run jobs in the same region as their data
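
  The compression step can be as simple as archiving results before they leave the region; the paths here are illustrative, and the archive is then handed to your transfer tool:

  ```shell
  # Compress a results directory before cross-region transfer to cut egress volume.
  SRC="${SRC:-results}"
  OUT="${OUT:-results.tar.gz}"
  if [ -d "$SRC" ]; then
    tar -czf "$OUT" "$SRC"
  fi
  ```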

* What RPO/RTO should HPC workloads have?

  Typical targets:
  - **RPO**: A few hours (or the checkpoint interval length)
  - **RTO**: 4-24 hours

  For time-sensitive workloads (for example, weather forecasting with strict deadlines), near-zero RTO might require an active/active setup.

* Are there SLA guarantees for multi-region HPC job completion?

  No. There is no Microsoft SLA guaranteeing individual HPC job completion, single-region or multi-region. Azure infrastructure services (VMs, Virtual Machine Scale Sets, Storage) have availability SLAs, but job-level recovery is your responsibility.

* How do I check if a region supports my HPC needs?
  - Consult the [Azure regional services documentation](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/)
  - Check the Azure portal for VM sizes and quotas per region
  - Engage Azure support for capacity planning on large deployments