
Commit 5bcc95b

Merge pull request #308651 from Padmalathas/CycleCloud-Doc-Cleanup
CC Doc Cleanup
2 parents 833a3b9 + 7a8661e commit 5bcc95b

4 files changed

Lines changed: 362 additions & 10 deletions

File tree

articles/cyclecloud/download-cluster-templates.md

Lines changed: 1 addition & 2 deletions
```diff
@@ -2,7 +2,7 @@
 title: Download Cluster Projects and Templates
 description: Azure CycleCloud has built-in templates you can configure and edit to make your own custom templates.
 author: adriankjohnson
-ms.date: 06/10/2025
+ms.date: 11/21/2025
 ms.author: adjohnso
 ---
 
@@ -20,7 +20,6 @@ cyclecloud import_template -f templates/template-name.template.txt
 
 | Project/template type | CycleCloud repo | Description |
 | --------------------- | ---------------- | ------------ | --- |
-| [![BeeGFS Logo](~/articles/cyclecloud/media/index/beegfs.png)](https://www.beegfs.io/content/) | [BeeGFS](https://github.com/Azure/cyclecloud-beegfs) | CycleCloud project to enable configuration, orchestration, and management of BeeGFS file systems in Azure CycleCloud HPC clusters. |
 | [![Grid Engine Logo](~/articles/cyclecloud/media/index/grid-engine.png)](http://gridscheduler.sourceforge.net/) | [Grid Engine](https://github.com/Azure/cyclecloud-gridengine) | Azure CycleCloud GridEngine cluster template. |
 | [![HPCPack logo](~/articles/cyclecloud/media/index/hpcpack.png)](/powershell/high-performance-computing/overview?view=hpc16-ps&preserve-view=true) | [HPC Pack](https://github.com/Azure/cyclecloud-hpcpack) | CycleCloud project that enables use of Microsoft HPC Pack job scheduler. |
 | [![HTCondor Logo](~/articles/cyclecloud/media/index/htcondor.png)](https://research.cs.wisc.edu/htcondor/) | [HTCondor](https://github.com/Azure/cyclecloud-htcondor) | Azure CycleCloud HTCondor cluster template. |
```
Lines changed: 352 additions & 0 deletions
@@ -0,0 +1,352 @@

---
title: Submit a job on CycleCloud with Slurm
description: Learn how to submit your first job on CycleCloud with Slurm.
author: xpillons
ms.date: 12/01/2025
ms.author: padmalathas
---
# Submit your first job on CycleCloud with Slurm

This guide walks you through submitting and managing jobs on an Azure CycleCloud cluster running Slurm. Whether you're new to HPC or just new to Azure, you learn how to connect to your cluster, submit jobs, and monitor their progress.

## Prerequisites

Before starting, you need:

- An Azure CycleCloud Workspace for Slurm environment already deployed
- An SSH key pair configured during deployment
- Access to the CycleCloud VM or Azure Bastion

To set up your environment, see the [deployment quickstart](../../qs-deploy-ccws.md).
## Understanding the cluster components

Your CycleCloud Slurm cluster has three main node types you interact with:

| Node type | What it does | When you use it |
|-----------|--------------|-----------------|
| **Authentication node** | Entry point for users | Connect here to submit and manage jobs |
| **Scheduler node** | Manages the job queue and resources | Runs automatically in the background |
| **Compute nodes** | Execute your workloads | Created on demand when jobs need resources |

## Understanding Slurm partitions

Your cluster comes with preconfigured partitions (resource pools) designed for different workload types:

| Partition | Best for | Typical VM types |
|-----------|----------|------------------|
| **HTC** | Independent tasks that don't need to communicate (data processing, rendering) | General-purpose VMs (D-series) |
| **HPC** | Tightly coupled parallel jobs using MPI (simulations, modeling) | HPC-optimized VMs with InfiniBand (HBv3, HBv4, HBv5) |
| **GPU** | Machine learning, deep learning, GPU-accelerated computing | GPU VMs (NC-series, ND-series) |

For partition configuration options, see [Slurm Scheduler Integration](../../slurm.md).

---
## Step 1: Connect to the authentication node

You have two options for connecting to your cluster.

### Option A: Connect via the CycleCloud VM

If you have access to the CycleCloud VM, use the CycleCloud CLI:

```bash
# First, SSH to the CycleCloud VM
ssh -i ~/.ssh/your_private_key username@cyclecloud-vm-ip

# Initialize CycleCloud (first time only)
cyclecloud initialize

# Connect to the login node
cyclecloud connect login-1 -c your-cluster-name
```

### Option B: Direct SSH connection

If you have the authentication node's IP address:

```bash
ssh -i ~/.ssh/your_private_key username@login-node-ip
```

> [!NOTE]
> If your cluster doesn't allow public IPs, connect through Azure Bastion. See [Azure Bastion documentation](/azure/bastion/bastion-overview) for setup instructions.
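If your Bastion host has native client support enabled, one way to connect is through the Azure CLI's `bastion` extension. This is a sketch, not a verified recipe for this deployment; the resource names, subscription ID, and key path are placeholders:

```bash
# Requires the Azure CLI 'bastion' extension and a Bastion SKU with
# native client support enabled (all names below are placeholders)
az network bastion ssh \
  --name my-bastion \
  --resource-group my-resource-group \
  --target-resource-id "/subscriptions/<sub-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/login-1" \
  --auth-type ssh-key \
  --username username \
  --ssh-key ~/.ssh/your_private_key
```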
---

## Step 2: Check cluster status

When you connect to the authentication node, check that the cluster is ready:

```bash
# View available partitions and their status
sinfo
```

Example output:

```
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
htc*      up    infinite  10    idle~ htc-[1-10]
hpc       up    infinite  10    idle~ hpc-[1-10]
gpu       up    infinite  4     idle~ gpu-[1-4]
```

**Reading the output:**

- `idle~` means nodes are available but not yet running (CycleCloud starts them when needed)
- `idle` means nodes are running and ready
- `alloc` means nodes are currently running jobs

For more detail:

```bash
# Detailed partition and node information
sinfo -l
```

---
## Step 3: Create a job script

Job scripts tell Slurm what resources you need and what commands to run. Create a simple test script:

```bash
# Create the script file
nano hello-world.sh
```

Paste this content:

```bash
#!/bin/bash
#SBATCH --job-name=hello_world      # Name to identify your job
#SBATCH --output=hello_world.out    # File for standard output
#SBATCH --partition=hpc             # Which partition to use
#SBATCH --nodes=1                   # Number of nodes needed
#SBATCH --ntasks=1                  # Number of tasks to run
#SBATCH --time=00:05:00             # Maximum run time (HH:MM:SS)

echo "Hello from $(hostname)"
echo "Job started at $(date)"
echo "Running on partition: $SLURM_JOB_PARTITION"
sleep 10
echo "Job completed at $(date)"
```

Save and exit (Ctrl+X, then Y, then Enter).
### Common SBATCH options

| Option | Description | Example |
|--------|-------------|---------|
| `--job-name` | Identifier for your job | `--job-name=my_analysis` |
| `--output` | Where to write output | `--output=results_%j.out` (`%j` = job ID) |
| `--partition` | Resource pool to use | `--partition=gpu` |
| `--nodes` | Number of compute nodes | `--nodes=4` |
| `--ntasks` | Total tasks across all nodes | `--ntasks=16` |
| `--cpus-per-task` | CPUs for each task | `--cpus-per-task=4` |
| `--mem` | Memory per node | `--mem=32G` |
| `--time` | Maximum runtime | `--time=02:00:00` |
| `--gres` | Generic resources (like GPUs) | `--gres=gpu:1` |
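These options combine freely in one script header. A minimal sketch that requests several CPUs and memory on a single node; the program name and all values are illustrative:

```bash
#!/bin/bash
#SBATCH --job-name=combined_example   # Illustrative job name
#SBATCH --output=combined_%j.out      # %j expands to the job ID
#SBATCH --partition=htc               # Single-node work fits the HTC pool
#SBATCH --nodes=1
#SBATCH --ntasks=4                    # Four tasks on the node
#SBATCH --cpus-per-task=2             # Two CPUs per task
#SBATCH --mem=16G                     # Memory for the whole node
#SBATCH --time=02:00:00

srun ./my_program   # my_program is a placeholder for your executable
```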
---
## Step 4: Submit your job

Submit the job to the queue:

```bash
sbatch hello-world.sh
```

You see a confirmation with your job ID:

```
Submitted batch job 1
```

### What happens behind the scenes

1. **Job queued** — Slurm adds your job to the queue.
1. **Resources requested** — Slurm tells CycleCloud it needs compute nodes.
1. **Nodes provisioned** — CycleCloud creates VMs in Azure (takes 2-5 minutes for new nodes).
1. **Job runs** — Your script executes on the allocated nodes.
1. **Nodes released** — After the job completes, idle nodes are eventually terminated to save costs.
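One way to watch this sequence is to poll the queue while the job starts; the job moves from `PD` (pending, while CycleCloud provisions a node) to `R` (running):

```bash
# Refresh your job list every 10 seconds; press Ctrl+C to stop
watch -n 10 squeue -u $USER
```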
---
## Step 5: Monitor your job

### Check job status

```bash
# View all jobs in the queue
squeue
```

Example output:

```
JOBID PARTITION NAME        USER    ST TIME NODES NODELIST(REASON)
1     hpc       hello_world hpcuser R  0:05 1     hpc-1
```

**Job states:**

- `PD` (Pending) — Waiting for resources
- `CF` (Configuring) — Nodes being set up
- `R` (Running) — Job is executing
- `CG` (Completing) — Job finishing up
- `CD` (Completed) — Job finished successfully
- `F` (Failed) — Job encountered an error
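To see only your jobs with explicit state and reason columns, you can pass `squeue` a format string; the column widths here are arbitrary:

```bash
# %i=job ID, %P=partition, %j=name, %t=compact state, %M=elapsed time, %R=reason/nodelist
squeue -u $USER -o "%.8i %.9P %.20j %.3t %.10M %R"
```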
### Check node allocation

```bash
# See which nodes are allocated
sinfo

# View detailed job information
scontrol show job 1
```

### Monitor from the CycleCloud web UI

You can also monitor your cluster visually:

1. Open your browser and go to `https://your-cyclecloud-vm-ip`
1. Sign in with your CycleCloud credentials
1. Select your cluster to see node status and provisioning progress

---
## Step 6: View job results

When your job finishes, check the output file:

```bash
# View the output
cat hello_world.out
```

Example output:

```
Hello from hpc-1
Job started at Mon Dec 1 14:23:45 UTC 2025
Running on partition: hpc
Job completed at Mon Dec 1 14:23:55 UTC 2025
```

### Output file location

By default, Slurm creates output files in the directory where you submit the job. This directory is usually on a shared filesystem, such as Azure NetApp Files, so any node can access the outputs.
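Slurm's filename patterns let you route output elsewhere on the shared filesystem; `%x` expands to the job name and `%j` to the job ID. A small sketch (the log directory is illustrative):

```bash
# Collect job logs in one place instead of the submission directory
mkdir -p ~/jobs/logs
sbatch --output="$HOME/jobs/logs/%x_%j.out" hello-world.sh
```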
---
## Practical examples

### Example 1: Simple Python script

```bash
#!/bin/bash
#SBATCH --job-name=python_analysis
#SBATCH --output=analysis_%j.out
#SBATCH --partition=htc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

module load python/3.9   # Load Python if using modules
python my_analysis.py
```

### Example 2: Multinode MPI job

```bash
#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --output=sim_%j.out
#SBATCH --partition=hpc
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=120
#SBATCH --time=04:00:00

module load mpi/openmpi
srun ./my_mpi_program
```

### Example 3: GPU job

```bash
#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --output=training_%j.out
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=08:00:00

module load cuda/11.8
python train_model.py
```
---
## Common Slurm commands reference

| Command | Description |
|---------|-------------|
| `sbatch script.sh` | Submit a batch job |
| `squeue` | View the job queue |
| `squeue -u $USER` | View only your jobs |
| `sinfo` | View partition and node status |
| `scancel JOB_ID` | Cancel a job |
| `scancel -u $USER` | Cancel all your jobs |
| `scontrol show job JOB_ID` | Detailed job information |
| `sacct -j JOB_ID` | Job accounting information (after completion) |
| `srun command` | Run an interactive command on allocated resources |
| `salloc` | Request an interactive resource allocation |
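The last two commands pair naturally: `salloc` grants an allocation, and `srun` launches tasks inside it. A minimal interactive session might look like this (partition and time limit are illustrative):

```bash
# Request one HTC node for 30 minutes and drop into a shell
salloc --partition=htc --nodes=1 --time=00:30:00

# Inside the allocation, run a command on the allocated node
srun hostname

# Leave the shell to release the allocation
exit
```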
---

## Troubleshooting

### Job stuck in pending (PD) state

Check the reason:

```bash
squeue -j JOB_ID -o "%j %T %r"
```

Common reasons:

- `(Resources)` — Waiting for nodes to be provisioned (normal; wait 2-5 minutes)
- `(Priority)` — Other jobs have higher priority
- `(QOSMaxJobsPerUserLimit)` — You reached your job limit

### Nodes not starting

Check CycleCloud for provisioning errors:

```bash
# From the CycleCloud VM
cyclecloud show_cluster your-cluster-name
```

### Job failed immediately

Check the output file and Slurm logs:

```bash
# View your output file
cat your_output_file.out

# Check Slurm's record
sacct -j JOB_ID --format=JobID,State,ExitCode,Reason
```
---
## Next steps

- **Scale up:** Submit jobs that use multiple nodes
- **Use containers:** CycleCloud Workspace for Slurm includes Pyxis and Enroot for containerized workloads (see the sketch after this list)
- **Set up job accounting:** Enable MySQL integration to track resource usage over time
- **Explore Open OnDemand:** Access your cluster through a web interface
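With Pyxis installed, `srun` gains container flags backed by Enroot. A minimal sketch; the image and command are illustrative, and the exact flags depend on the Pyxis version deployed on your cluster:

```bash
# Pull a container image and run a command in it on a compute node
srun --partition=htc --container-image=ubuntu:22.04 grep PRETTY_NAME /etc/os-release
```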
