---
title: Running your first benchmark using STREAM
description: Learn how to run the STREAM benchmark on Azure HPC virtual machines.
author: padmalathas
ms.author: padmalathas
ms.topic: how-to
ms.date: 02/24/2026
---

# Running your first benchmark: STREAM

STREAM is a simple, synthetic benchmark that measures sustainable memory bandwidth, which is critical for memory-bound workloads such as computational fluid dynamics (CFD), finite element analysis, and data analytics. It reports bandwidth for four vector operations:

| Operation | Description | Formula |
|-----------|-------------|---------|
| Copy | Measures transfer rates | a(i) = b(i) |
| Scale | Adds simple arithmetic | a(i) = q × b(i) |
| Add | Multiple load/store operations | a(i) = b(i) + c(i) |
| Triad | Most representative of real workloads | a(i) = b(i) + q × c(i) |

The **Triad** result is the standard metric for comparing memory bandwidth across systems.
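
The reported rate follows directly from the bytes each kernel moves. As a rough sketch (the array size and timing below are assumed values for illustration), Triad touches three arrays of 8-byte doubles per pass, and the best rate is bytes moved divided by the fastest (minimum) time:

```bash
# Triad moves 3 arrays x 8 bytes x STREAM_ARRAY_SIZE bytes per pass:
# two reads (b, c) and one write (a). STREAM reports decimal MB/s.
ARRAY_SIZE=800000000
MIN_TIME=0.025290   # assumed best Triad time from a run, in seconds
BYTES=$((3 * 8 * ARRAY_SIZE))
awk -v b="$BYTES" -v t="$MIN_TIME" \
    'BEGIN { printf "Triad: %.1f MB/s\n", b / (1e6 * t) }'
# -> Triad: 759193.4 MB/s
```

This is the same arithmetic `stream.c` performs internally, so it's a useful cross-check when comparing runs.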
| 23 | + |
| 24 | +**Time to complete**: 15-20 minutes |
| 25 | + |
| 26 | +## Prerequisites |
| 27 | + |
| 28 | +- An Azure HPC VM (HBv3, HBv4, HBv5, or HX-series recommended) |
| 29 | +- SSH access to the VM |
| 30 | +- Root or sudo privileges |
| 31 | + |
| 32 | +> [!TIP] |
| 33 | +> For best results, use Azure HPC marketplace images (AlmaLinux-HPC or Ubuntu-HPC) which include optimized compilers and libraries. |
| 34 | +
|
## Expected results by VM family

Use these values to validate your results:

| VM Series | STREAM Triad (GB/s) | Notes |
|-----------|---------------------|-------|
| HBv5 (with HBM) | ~7,000 | Uses HBM memory |
| HBv4 | ~650-780 | DDR5 memory |
| HBv3 | ~330-350 | DDR4 memory |
| HBv2 | ~260 | DDR4 memory |

If your result is more than 10% below these values, check your configuration.
| 47 | + |
## Step 1: Connect to your VM

Connect via SSH to your HPC VM:

```bash
ssh azureuser@<vm-public-ip>
```

Or connect through your Slurm login node if using a cluster.

## Step 2: Install dependencies

### Option A: Using Azure HPC images (recommended)

Azure HPC images include the necessary compilers. Verify GCC is available:

```bash
gcc --version
```

### Option B: Manual installation

If using a standard image, install build tools:

```bash
# AlmaLinux/RHEL
sudo dnf groupinstall "Development Tools" -y

# Ubuntu
sudo apt update && sudo apt install build-essential -y
```
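
Before compiling in the next step, you can confirm that your GCC accepts the Zen-specific `-march` flag used there; older toolchains (before roughly GCC 10.3) don't recognize `znver3`, and `-march=native` is a safe fallback. A quick probe, as a sketch:

```bash
# Probe whether gcc accepts -march=znver3 by compiling an empty program
# from stdin; fall back to -march=native on older compilers.
if echo 'int main(void){return 0;}' | gcc -march=znver3 -x c - -o /dev/null 2>/dev/null; then
  MARCH=znver3
else
  MARCH=native
fi
echo "Using -march=$MARCH"
```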

## Step 3: Download and compile STREAM

Clone the Azure benchmarking repository, which includes optimized STREAM configurations:

```bash
# Create working directory
mkdir -p ~/benchmarks && cd ~/benchmarks

# Clone Azure benchmarking repository
git clone https://github.com/Azure/woc-benchmarking.git
cd woc-benchmarking/apps/hpc/stream
```

Alternatively, download STREAM directly:

```bash
mkdir -p ~/benchmarks/stream && cd ~/benchmarks/stream
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
```

Compile with optimizations for the AMD EPYC processors used in HB-series VMs:

```bash
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
    -DNTIMES=20 stream.c -o stream
```

**Compiler flags explained**:

| Flag | Purpose |
|------|---------|
| `-O3` | Maximum optimization level |
| `-march=znver3` | Optimize for AMD Zen 3 (newer compilers also accept `-march=znver4` for Zen 4) |
| `-fopenmp` | Enable OpenMP for multi-threading |
| `-DSTREAM_ARRAY_SIZE=800000000` | Array size (~6 GB per array, ~18 GB total) |
| `-DNTIMES=20` | Number of iterations |

> [!IMPORTANT]
> The array size must be large enough that the data doesn't fit in cache. For HBv4/HBv5 with 1.5 GB of L3 cache, use at least 800M elements.

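
The sizing rule above reduces to quick arithmetic. `stream.c`'s own guideline is that each array should be at least four times the size of the last-level cache; for the ~1.5 GB L3 on HBv4 (value assumed here), that gives:

```bash
# Each array should be >= 4x the total last-level cache (stream.c guideline).
# HBv4/HBv5 L3 is ~1.5 GB; each double-precision element is 8 bytes.
L3_BYTES=1500000000
MIN_ELEMENTS=$((4 * L3_BYTES / 8))
echo "Minimum STREAM_ARRAY_SIZE: $MIN_ELEMENTS"
# -> Minimum STREAM_ARRAY_SIZE: 750000000
```

This is why the guide rounds up to 800M elements for HBv4/HBv5.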
## Step 4: Configure thread affinity

Proper thread pinning is critical for accurate results. Set OpenMP environment variables:

```bash
# Get number of physical cores
NCORES=$(lscpu | grep "^Core(s) per socket:" | awk '{print $4}')
NSOCKETS=$(lscpu | grep "^Socket(s):" | awk '{print $2}')
TOTAL_CORES=$((NCORES * NSOCKETS))

echo "Total physical cores: $TOTAL_CORES"

# Set OpenMP configuration
export OMP_NUM_THREADS=$TOTAL_CORES
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
```

For example, on HBv4 or a standard HBv5 configuration (176 physical cores):

```bash
export OMP_NUM_THREADS=176
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
```
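
A common mistake is sizing the thread count from `nproc`, which counts SMT siblings rather than physical cores on VMs with hyper-threading enabled (Azure HB-series VMs typically expose one thread per core, but it's worth confirming). A small sanity check, as a sketch:

```bash
# Compare OMP_NUM_THREADS against the physical core count from lscpu.
CORES=$(lscpu | awk -F: '/^Core\(s\) per socket/ {print $2}')
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/ {print $2}')
PHYS=$((CORES * SOCKETS))
if [ "${OMP_NUM_THREADS:-0}" -ne "$PHYS" ]; then
  echo "Warning: OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset} but physical cores=$PHYS"
fi
```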

## Step 5: Run the benchmark

Execute STREAM:

```bash
./stream
```

**Sample output** (HBv4):

```
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 800000000 (elements), Offset = 0 (elements)
Memory per array = 6103.5 MiB (= 5.96 GiB).
Total memory required = 18310.5 MiB (= 17.88 GiB).
Each kernel will be executed 20 times.
-------------------------------------------------------------
Number of Threads requested = 176
Number of Threads counted = 176
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          753284.2     0.017157     0.016966     0.018884
Scale:         707935.3     0.018260     0.018045     0.019629
Add:           756972.9     0.025508     0.025318     0.027311
Triad:         757820.9     0.025464     0.025290     0.027212
-------------------------------------------------------------
```

The **Triad** Best Rate (757,820.9 MB/s, or ~758 GB/s; STREAM reports decimal megabytes) is the key result.
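
To script comparisons across runs, the Triad line can be pulled from saved output with `awk`. A saved sample is used below so the extraction can be followed as-is; in practice, capture a real run with `./stream | tee stream.out`:

```bash
# Extract the Triad best rate (MB/s) from captured STREAM output.
cat > stream.out <<'EOF'
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          753284.2     0.017157     0.016966     0.018884
Triad:         757820.9     0.025464     0.025290     0.027212
EOF
TRIAD=$(awk '/^Triad:/ {print $2}' stream.out)
echo "Triad best rate: $TRIAD MB/s"
# -> Triad best rate: 757820.9 MB/s
```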

## Step 6: Validate results

Compare your Triad result against expected values:

```bash
# Quick validation script
TRIAD_RESULT=757820  # Replace with your result in MB/s
VM_TYPE="HBv4"       # HBv2, HBv3, HBv4, or HBv5

case $VM_TYPE in
  "HBv5") EXPECTED=7000000 ;;
  "HBv4") EXPECTED=700000 ;;
  "HBv3") EXPECTED=330000 ;;
  "HBv2") EXPECTED=260000 ;;
esac

PERCENT=$(echo "scale=1; $TRIAD_RESULT * 100 / $EXPECTED" | bc)
echo "Achieved $PERCENT% of expected bandwidth"
```

**Results interpretation**:

| Achievement | Interpretation |
|-------------|----------------|
| 95-105% | Excellent - VM performing as expected |
| 85-95% | Good - Minor optimization possible |
| 70-85% | Investigate - Check thread affinity, NUMA |
| <70% | Problem - Check configuration |

## Step 7: Run on multiple NUMA domains (advanced)

For detailed NUMA analysis, run STREAM per NUMA domain. Note that the OpenMP variables must be set in the environment before `numactl` runs, not passed as arguments to it:

```bash
# Check NUMA topology
numactl --hardware

# Run on NUMA node 0 only (set OMP_NUM_THREADS to the cores in one domain)
OMP_NUM_THREADS=22 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --cpunodebind=0 --membind=0 ./stream

# Full-node run with memory interleaved across all NUMA domains
OMP_NUM_THREADS=176 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --interleave=all ./stream
```
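
The per-domain thread count depends on your topology. One way to drive a sweep is to read the domain count from `numactl --hardware`; a saved sample is parsed below so the extraction works without the tool installed, and the loop at the end is a sketch assuming `./stream` is built and `CORES_PER_DOMAIN` is set to the total core count divided by the domain count:

```bash
# Count NUMA domains from `numactl --hardware` output
# (first line looks like: "available: 4 nodes (0-3)").
cat > numa.txt <<'EOF'
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
EOF
NODES=$(awk '/^available:/ {print $2}' numa.txt)
echo "NUMA domains: $NODES"
# -> NUMA domains: 4

# Sketch of a per-domain sweep (uncomment once ./stream is built):
# for n in $(seq 0 $((NODES - 1))); do
#   OMP_NUM_THREADS=$CORES_PER_DOMAIN numactl --cpunodebind=$n --membind=$n \
#     ./stream > node$n.out
# done
```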

## Troubleshooting

### Low bandwidth results

**Symptom**: Results significantly below expected values

**Solutions**:

1. **Check thread count**:
   ```bash
   echo $OMP_NUM_THREADS
   # Should match physical core count
   ```

1. **Verify thread binding**:
   ```bash
   export OMP_DISPLAY_ENV=TRUE
   ./stream 2>&1 | head -20
   ```

1. **Check for CPU frequency scaling**:
   ```bash
   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
   # Should be "performance" for benchmarking
   ```

1. **Verify NUMA memory policy**:
   ```bash
   numactl --show
   ```

### Array size too small

**Symptom**: Results higher than expected (measuring cache, not memory)

**Solution**: Increase `STREAM_ARRAY_SIZE` at compile time. Total memory used should be at least 4× the L3 cache size.

```bash
# Recompile with larger array
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=1000000000 \
    -DNTIMES=20 stream.c -o stream
```

### Inconsistent results

**Symptom**: Large variation between runs

**Solutions**:

1. Ensure no other processes are running:
   ```bash
   top -b -n 1 | head -20
   ```

1. Run more iterations:
   ```bash
   # Recompile with more iterations
   gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
       -DNTIMES=50 stream.c -o stream
   ```

## Running STREAM in a Slurm job

If using a Slurm cluster, create a job script:

```bash
cat << 'EOF' > stream-job.sh
#!/bin/bash
#SBATCH --job-name=stream
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=176
#SBATCH --time=00:10:00
#SBATCH --partition=hpc
#SBATCH --exclusive

# Set thread configuration
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

# Run STREAM
cd ~/benchmarks/woc-benchmarking/apps/hpc/stream
./stream
EOF

sbatch stream-job.sh
```

## Automating with the Azure benchmarking scripts

The Azure woc-benchmarking repository includes automation scripts:

```bash
cd ~/benchmarks/woc-benchmarking/apps/hpc/stream

# View available scripts
ls -la

# Run automated benchmark (if available)
./run_stream.sh
```