---
title: Submit job on CycleCloud with Slurm
description: How to submit your first job on CycleCloud with Slurm
author: xpillons
ms.date: 12/01/2025
ms.author: padmalathas
---

# Submit your first job on CycleCloud with Slurm

This guide walks you through submitting and managing jobs on an Azure CycleCloud cluster running Slurm. Whether you're new to HPC or just new to Azure, you learn how to connect to your cluster, submit jobs, and monitor their progress.

## Prerequisites

Before starting, you need:

- An Azure CycleCloud Workspace for Slurm environment already deployed
- An SSH key pair configured during deployment
- Access to the CycleCloud VM or Azure Bastion

For instructions on how to deploy, see the [deployment quickstart](../../qs-deploy-ccws.md) to set up your environment.

## Understanding the cluster components

Your CycleCloud Slurm cluster has three main node types you interact with:

| Node Type | What It Does | When You Use It |
|-----------|--------------|-----------------|
| **Authentication Node** | Entry point for users | Connect here to submit and manage jobs |
| **Scheduler Node** | Manages job queue and resources | Runs automatically in the background |
| **Compute Nodes** | Execute your workloads | Created on demand when jobs need resources |

## Understanding Slurm partitions

Your cluster comes with preconfigured partitions (resource pools) designed for different workload types:

| Partition | Best For | Typical VM Types |
|-----------|----------|------------------|
| **HTC** | Independent tasks that don't need to communicate (data processing, rendering) | General-purpose VMs (D-series) |
| **HPC** | Tightly coupled parallel jobs using MPI (simulations, modeling) | HPC-optimized VMs with InfiniBand (HBv3, HBv4, HBv5) |
| **GPU** | Machine learning, deep learning, GPU-accelerated computing | GPU VMs (NC-series, ND-series) |

For partition configuration options, see [Slurm Scheduler Integration](../../slurm.md).

---

## Step 1: Connect to the authentication node

You have two options for connecting to your cluster.

### Option A: Connect via the CycleCloud VM

If you have access to the CycleCloud VM, use the CycleCloud CLI:

```bash
# First, SSH to the CycleCloud VM
ssh -i ~/.ssh/your_private_key username@cyclecloud-vm-ip

# Initialize CycleCloud (first time only)
cyclecloud initialize

# Connect to the login node
cyclecloud connect login-1 -c your-cluster-name
```

### Option B: Direct SSH connection

If you have the authentication node's IP address:

```bash
ssh -i ~/.ssh/your_private_key username@login-node-ip
```

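To avoid retyping the key path and address, you can optionally add an entry to `~/.ssh/config`. This is a sketch; the host alias `ccws-login` and the placeholder values are illustrative:

```
Host ccws-login
    HostName <login-node-ip>
    User <username>
    IdentityFile ~/.ssh/your_private_key
```

After that, you can connect with just `ssh ccws-login`.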
> [!NOTE]
> If your cluster doesn't allow public IPs, connect through Azure Bastion. See [Azure Bastion documentation](/azure/bastion/bastion-overview) for setup instructions.

---

## Step 2: Check cluster status

When you connect to the authentication node, check that the cluster is ready:

```bash
# View available partitions and their status
sinfo
```

Example output:

```
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
htc*         up  infinite    10 idle~ htc-[1-10]
hpc          up  infinite    10 idle~ hpc-[1-10]
gpu          up  infinite     4 idle~ gpu-[1-4]
```

**Reading the output:**

- `idle~` means nodes are available but not yet running (CycleCloud starts them when needed)
- `idle` means nodes are running and ready
- `alloc` means nodes are currently running jobs

For more detail:

```bash
# Detailed partition and node information
sinfo -l
```

| 109 | +--- |
| 110 | + |
| 111 | +## Step 3: Create a job script |
| 112 | + |
| 113 | +Job scripts tell Slurm what resources you need and what commands to run. Create a simple test script: |
| 114 | + |
| 115 | +```bash |
| 116 | +# Create the script file |
| 117 | +nano hello-world.sh |
| 118 | +``` |
| 119 | + |
| 120 | +Paste this content: |
| 121 | + |
| 122 | +```bash |
| 123 | +#!/bin/bash |
| 124 | +#SBATCH --job-name=hello_world # Name to identify your job |
| 125 | +#SBATCH --output=hello_world.out # File for standard output |
| 126 | +#SBATCH --partition=hpc # Which partition to use |
| 127 | +#SBATCH --nodes=1 # Number of nodes needed |
| 128 | +#SBATCH --ntasks=1 # Number of tasks to run |
| 129 | +#SBATCH --time=00:05:00 # Maximum run time (HH:MM:SS) |
| 130 | + |
| 131 | +echo "Hello from $(hostname)" |
| 132 | +echo "Job started at $(date)" |
| 133 | +echo "Running on partition: $SLURM_JOB_PARTITION" |
| 134 | +sleep 10 |
| 135 | +echo "Job completed at $(date)" |
| 136 | +``` |
| 137 | + |
| 138 | +Save and exit (Ctrl+X, then Y, then Enter). |
| 139 | + |
| 140 | +### Common SBATCH options |
| 141 | + |
| 142 | +| Option | Description | Example | |
| 143 | +|--------|-------------|---------| |
| 144 | +| `--job-name` | Identifier for your job | `--job-name=my_analysis` | |
| 145 | +| `--output` | Where to write output | `--output=results_%j.out` (%j = job ID) | |
| 146 | +| `--partition` | Resource pool to use | `--partition=gpu` | |
| 147 | +| `--nodes` | Number of compute nodes | `--nodes=4` | |
| 148 | +| `--ntasks` | Total tasks across all nodes | `--ntasks=16` | |
| 149 | +| `--cpus-per-task` | CPUs for each task | `--cpus-per-task=4` | |
| 150 | +| `--mem` | Memory per node | `--mem=32G` | |
| 151 | +| `--time` | Maximum runtime | `--time=02:00:00` | |
| 152 | +| `--gres` | Generic resources (like GPUs) | `--gres=gpu:1` | |
| 153 | + |
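If you submit many similar jobs, you can generate the script from a heredoc instead of editing it by hand, which keeps the resource requests consistent. A minimal sketch; the file name `sweep-job.sh` and the directive values are illustrative:

```shell
# Generate a job script from a template, then sanity-check it
# before handing it to sbatch.
cat > sweep-job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --output=sweep_%j.out
#SBATCH --partition=htc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

echo "Hello from $(hostname)"
EOF

# The shebang must be the first line, and each request is a #SBATCH directive.
head -n 1 sweep-job.sh | grep -q '^#!/bin/bash' && echo "shebang ok"
echo "directives: $(grep -c '^#SBATCH' sweep-job.sh)"
```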
---

## Step 4: Submit your job

Submit the job to the queue:

```bash
sbatch hello-world.sh
```

You see a confirmation with your job ID:

```
Submitted batch job 1
```

### What happens behind the scenes

1. **Job queued** — Slurm adds your job to the queue.
1. **Resources requested** — Slurm tells CycleCloud it needs compute nodes.
1. **Nodes provisioned** — CycleCloud creates VMs in Azure (takes 2-5 minutes for new nodes).
1. **Job runs** — Your script executes on the allocated nodes.
1. **Nodes released** — After the job completes, idle nodes are eventually terminated to save costs.

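If you want to reuse the job ID in later monitoring or cancellation commands, you can capture it from the confirmation line. This sketch parses the sample line shown above; `submit_msg` is an illustrative variable name, and on the cluster you would set it with `submit_msg=$(sbatch hello-world.sh)`:

```shell
# Extract the job ID (the last word) from sbatch's confirmation line.
submit_msg="Submitted batch job 1"
job_id=${submit_msg##* }   # strip everything up to the last space
echo "captured job ID: $job_id"
```

Alternatively, `sbatch --parsable hello-world.sh` prints only the job ID, which avoids the parsing step entirely.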
---

## Step 5: Monitor your job

### Check job status

```bash
# View all jobs in the queue
squeue
```

Example output:

```
JOBID PARTITION        NAME    USER ST TIME NODES NODELIST(REASON)
    1       hpc hello_world hpcuser  R 0:05     1 hpc-1
```

**Job states:**

- `PD` (Pending) — Waiting for resources
- `CF` (Configuring) — Nodes being set up
- `R` (Running) — Job is executing
- `CG` (Completing) — Job finishing up
- `CD` (Completed) — Job finished successfully
- `F` (Failed) — Job encountered an error

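To pick out the state of one specific job from this output, filter on the JOBID column. The sketch below parses the sample output above so it runs anywhere; on the cluster, pipe real `squeue` output through the same `awk` program:

```shell
# Print the ST column (column 5) for the row whose JOBID matches.
state=$(awk -v id=1 'NR>1 && $1==id {print $5}' <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 hpc hello_world hpcuser R 0:05 1 hpc-1
EOF
)
echo "job 1 state: $state"
```

On the cluster, `squeue -j 1 -h -o %T` does the same thing directly, printing the long state name (for example, `RUNNING`) without a header.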
### Check node allocation

```bash
# See which nodes are allocated
sinfo

# View detailed job information
scontrol show job 1
```

### Monitor from the CycleCloud web UI

You can also monitor your cluster visually:

1. Open your browser and go to `https://your-cyclecloud-vm-ip`
1. Sign in with your CycleCloud credentials
1. Select your cluster to see node status and provisioning progress

---

## Step 6: View job results

When your job finishes, check the output file:

```bash
# View the output
cat hello_world.out
```

Example output:

```
Hello from hpc-1
Job started at Mon Dec 1 14:23:45 UTC 2025
Running on partition: hpc
Job completed at Mon Dec 1 14:23:55 UTC 2025
```

### Output file location

By default, the system creates output files in the directory where you submit the job. This directory is usually on a shared filesystem, such as Azure NetApp Files, so any node can access the outputs.

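A quick way to confirm which filesystem backs your working directory before submitting is to ask `df`. This runs anywhere; the mount point and filesystem names you see vary by deployment:

```shell
# Show the filesystem backing the current directory. On a shared
# mount, every compute node sees the same files, so job outputs
# written here are visible cluster-wide.
df -h . | tail -n 1
```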
---

## Practical examples

### Example 1: Simple Python script

```bash
#!/bin/bash
#SBATCH --job-name=python_analysis
#SBATCH --output=analysis_%j.out
#SBATCH --partition=htc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

module load python/3.9   # Load Python if your cluster uses environment modules
python my_analysis.py
```

### Example 2: Multinode MPI job

```bash
#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --output=sim_%j.out
#SBATCH --partition=hpc
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=120
#SBATCH --time=04:00:00

module load mpi/openmpi
srun ./my_mpi_program
```

### Example 3: GPU job

```bash
#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --output=training_%j.out
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=08:00:00

module load cuda/11.8
python train_model.py
```

---

## Common Slurm commands reference

| Command | Description |
|---------|-------------|
| `sbatch script.sh` | Submit a batch job |
| `squeue` | View the job queue |
| `squeue -u $USER` | View only your jobs |
| `sinfo` | View partition and node status |
| `scancel JOB_ID` | Cancel a job |
| `scancel -u $USER` | Cancel all your jobs |
| `scontrol show job JOB_ID` | Detailed job information |
| `sacct -j JOB_ID` | Job accounting information (after completion) |
| `srun command` | Run an interactive command on allocated resources |
| `salloc` | Request an interactive resource allocation |

---

## Troubleshooting

### Job stuck in the pending (PD) state

Check the reason:

```bash
squeue -j JOB_ID -o "%j %T %r"
```

Common reasons:

- `(Resources)` — Waiting for nodes to be provisioned (normal; wait 2-5 minutes)
- `(Priority)` — Other jobs have higher priority
- `(QOSMaxJobsPerUserLimit)` — You reached your job limit

### Nodes not starting

Check CycleCloud for provisioning errors:

```bash
# From the CycleCloud VM
cyclecloud show_cluster your-cluster-name
```

### Job failed immediately

Check the output file and Slurm logs:

```bash
# View your output file
cat your_output_file.out

# Check Slurm's record
sacct -j JOB_ID --format=JobID,State,ExitCode,Reason
```

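Slurm reports `ExitCode` as `returncode:signal`, so splitting the field tells you whether the script failed on its own or was killed by a signal. The sketch below parses a hypothetical sample `sacct` line; on the cluster, feed real `sacct` output through the same pipeline:

```shell
# Split Slurm's ExitCode field (column 3, format "returncode:signal").
exit_field=$(awk 'NR==2 {print $3}' <<'EOF'
JobID State ExitCode
1 FAILED 1:0
EOF
)
rc=${exit_field%%:*}    # return code of the job script
sig=${exit_field##*:}   # signal number, nonzero if the job was signaled
echo "return code: $rc, signal: $sig"
```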
---

## Next steps

- **Scale up:** Submit jobs that use multiple nodes
- **Use containers:** CycleCloud Workspace for Slurm includes Pyxis and Enroot for containerized workloads
- **Set up job accounting:** Enable MySQL integration to track resource usage over time
- **Explore Open OnDemand:** Access your cluster through a web interface