Commit 810205a: Add guide for running STREAM benchmark on Azure HPC

This document provides a comprehensive guide on running the STREAM benchmark on Azure HPC virtual machines, including prerequisites, installation steps, and expected results.
---
title: Running your first benchmark using STREAM
description: Learn how to run the STREAM benchmark on Azure HPC virtual machines.
author: padmalathas
ms.author: padmalathas
ms.topic: how-to
ms.date: 02/24/2026
---

# Running your first benchmark: STREAM

STREAM is a simple, synthetic benchmark that measures sustainable memory bandwidth, which is critical for memory-bound workloads such as computational fluid dynamics (CFD), finite element analysis, and data analytics. It times four vector operations:

| Operation | Description | Formula |
|-----------|-------------|---------|
| Copy | Measures pure transfer rates | a(i) = b(i) |
| Scale | Adds simple arithmetic | a(i) = q × b(i) |
| Add | Multiple load/store operations per element | a(i) = b(i) + c(i) |
| Triad | Most representative of real application kernels | a(i) = b(i) + q × c(i) |

The **Triad** result is the standard metric for comparing memory bandwidth across systems.

**Time to complete**: 15-20 minutes
## Prerequisites

- An Azure HPC VM (HBv3, HBv4, HBv5, or HX-series recommended)
- SSH access to the VM
- Root or sudo privileges

> [!TIP]
> For best results, use the Azure HPC marketplace images (AlmaLinux-HPC or Ubuntu-HPC), which include optimized compilers and libraries.

## Expected results by VM family

Use these values to validate your results:

| VM Series | STREAM Triad (GB/s) | Notes |
|-----------|---------------------|-------|
| HBv5 (with HBM) | ~7,000 | Uses HBM memory |
| HBv4 | ~650-780 | DDR5 memory |
| HBv3 | ~330-350 | DDR4 memory |
| HBv2 | ~260 | DDR4 memory |

If your Triad result is more than 10% below the expected value for your VM series, check your configuration using the troubleshooting section later in this guide.

## Step 1: Connect to your VM

Connect via SSH to your HPC VM:

```bash
ssh azureuser@<vm-public-ip>
```

Or connect through your Slurm login node if using a cluster.

## Step 2: Install dependencies

### Option A: Using Azure HPC images (recommended)

Azure HPC images include the necessary compilers. Verify GCC is available:

```bash
gcc --version
```

### Option B: Manual installation

If using a standard image, install build tools:

```bash
# AlmaLinux/RHEL
sudo dnf groupinstall "Development Tools" -y

# Ubuntu
sudo apt update && sudo apt install build-essential -y
```
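Whichever option you used, it's worth confirming that the installed GCC accepts the flags used in the next step. This check is a sketch, assuming `-march=native` as the fallback for toolchains that predate znver3 support:

```bash
# Probe gcc with the flags used in Step 3; -march=znver3 requires a
# reasonably recent GCC, and -march=native is a safe fallback otherwise.
if gcc -march=znver3 -fopenmp -E -xc /dev/null >/dev/null 2>&1; then
  MSG="gcc accepts -march=znver3 and -fopenmp"
else
  MSG="older or missing gcc: use -march=native instead of -march=znver3"
fi
echo "$MSG"
```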

## Step 3: Download and compile STREAM

Clone the Azure benchmarking repository, which includes optimized STREAM configurations:

```bash
# Create working directory
mkdir -p ~/benchmarks && cd ~/benchmarks

# Clone Azure benchmarking repository
git clone https://github.com/Azure/woc-benchmarking.git
cd woc-benchmarking/apps/hpc/stream
```

Alternatively, download STREAM directly:

```bash
mkdir -p ~/benchmarks/stream && cd ~/benchmarks/stream
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
```

Compile with optimizations for the AMD EPYC processors used in HB-series VMs:

```bash
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
    -DNTIMES=20 stream.c -o stream
```

**Compiler flags explained**:

| Flag | Purpose |
|------|---------|
| `-O3` | Maximum optimization level |
| `-march=znver3` | Optimize for AMD Zen 3; the code also runs well on Zen 4 (use `-march=znver4` if your GCC supports it) |
| `-fopenmp` | Enable OpenMP for multi-threading |
| `-DSTREAM_ARRAY_SIZE=800000000` | Array size (~6 GiB per array, ~18 GiB total) |
| `-DNTIMES=20` | Number of iterations |

> [!IMPORTANT]
> The array size must be large enough that the data doesn't fit in cache. For HBv4/HBv5 with 1.5 GB of L3 cache, use at least 800 million elements.

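The sizing rule above can be checked mechanically before compiling: the total footprint of the three STREAM arrays should comfortably exceed the L3 cache. A minimal sketch, assuming the 1.5 GB L3 figure quoted above (adjust `L3_BYTES` for your VM):

```bash
# Sanity-check that a chosen STREAM_ARRAY_SIZE clears the L3 cache.
# STREAM allocates three arrays of 8-byte doubles; the total footprint
# should be at least 4x the L3 size so the kernels measure DRAM, not cache.
ARRAY_SIZE=800000000
L3_BYTES=$((1500 * 1024 * 1024))      # assumed 1.5 GB L3; check lscpu on your VM
TOTAL_BYTES=$((3 * 8 * ARRAY_SIZE))   # footprint of the three arrays
if [ "$TOTAL_BYTES" -ge $((4 * L3_BYTES)) ]; then
  echo "OK: $TOTAL_BYTES bytes total (>= 4x L3)"
else
  echo "Too small: increase STREAM_ARRAY_SIZE"
fi
```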
## Step 4: Configure thread affinity

Proper thread pinning is critical for accurate results. Set the OpenMP environment variables:

```bash
# Get number of physical cores
NCORES=$(lscpu | grep "^Core(s) per socket:" | awk '{print $4}')
NSOCKETS=$(lscpu | grep "^Socket(s):" | awk '{print $2}')
TOTAL_CORES=$((NCORES * NSOCKETS))

echo "Total physical cores: $TOTAL_CORES"

# Set OpenMP configuration
export OMP_NUM_THREADS=$TOTAL_CORES
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
```

For example, on HBv4 (176 physical cores):

```bash
export OMP_NUM_THREADS=176
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
```

The same settings apply to HBv5 in its standard 176-core configuration.

## Step 5: Run the benchmark

Execute STREAM:

```bash
./stream
```

**Sample output** (HBv4):

```
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 800000000 (elements), Offset = 0 (elements)
Memory per array = 6103.5 MiB (= 5.96 GiB).
Total memory required = 18310.5 MiB (= 17.88 GiB).
Each kernel will be executed 20 times.
-------------------------------------------------------------
Number of Threads requested = 176
Number of Threads counted = 176
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          753284.2     0.017157     0.016966     0.018884
Scale:         707935.3     0.018260     0.018045     0.019629
Add:           756972.9     0.025508     0.025318     0.027311
Triad:         757820.9     0.025464     0.025290     0.027212
-------------------------------------------------------------
```

The **Triad Best Rate** (757,820.9 MB/s, or about 758 GB/s) is the key result.

## Step 6: Validate results

Compare your Triad result against expected values:

```bash
# Quick validation script
TRIAD_RESULT=757820   # Replace with your result in MB/s
VM_TYPE="HBv4"        # HBv2, HBv3, HBv4, or HBv5

case $VM_TYPE in
  "HBv5") EXPECTED=7000000 ;;
  "HBv4") EXPECTED=700000 ;;
  "HBv3") EXPECTED=330000 ;;
  "HBv2") EXPECTED=260000 ;;
esac

PERCENT=$(echo "scale=1; $TRIAD_RESULT * 100 / $EXPECTED" | bc)
echo "Achieved $PERCENT% of expected bandwidth"
```

**Results interpretation**:

| Achievement | Interpretation |
|-------------|----------------|
| 95-105% | Excellent - VM performing as expected |
| 85-95% | Good - Minor optimization possible |
| 70-85% | Investigate - Check thread affinity, NUMA |
| <70% | Problem - Check configuration |

## Step 7: Run on multiple NUMA domains (advanced)

For detailed NUMA analysis, run STREAM per NUMA domain. Note that the OpenMP variables must be set in the environment before `numactl` launches the binary:

```bash
# Check NUMA topology
numactl --hardware

# Run on NUMA node 0 only (22 cores per NUMA domain on HBv4)
OMP_NUM_THREADS=22 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --cpunodebind=0 --membind=0 ./stream

# Interleave memory across all NUMA domains, for comparison with the
# default full-node run
OMP_NUM_THREADS=176 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --interleave=all ./stream
```

## Troubleshooting

### Low bandwidth results

**Symptom**: Results significantly below expected values

**Solutions**:

1. **Check thread count**:

    ```bash
    echo $OMP_NUM_THREADS
    # Should match physical core count
    ```

1. **Verify thread binding**:

    ```bash
    export OMP_DISPLAY_ENV=TRUE
    ./stream 2>&1 | head -20
    ```

1. **Check for CPU frequency scaling**:

    ```bash
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    # Should be "performance" for benchmarking
    ```

1. **Verify NUMA memory policy**:

    ```bash
    numactl --show
    ```

### Array size too small

**Symptom**: Results higher than expected (measuring cache, not memory)

**Solution**: Increase `STREAM_ARRAY_SIZE` at compile time. Total memory used should be at least 4× the L3 cache size.

```bash
# Recompile with a larger array
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=1000000000 \
    -DNTIMES=20 stream.c -o stream
```

### Inconsistent results

**Symptom**: Large variation between runs

**Solutions**:

1. Ensure no other processes are running:

    ```bash
    top -b -n 1 | head -20
    ```

1. Run more iterations:

    ```bash
    # Recompile with more iterations
    gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
        -DNTIMES=50 stream.c -o stream
    ```

## Running STREAM in a Slurm job

If using a Slurm cluster, create a job script:

```bash
cat << 'EOF' > stream-job.sh
#!/bin/bash
#SBATCH --job-name=stream
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=176
#SBATCH --time=00:10:00
#SBATCH --partition=hpc
#SBATCH --exclusive

# Set thread configuration
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

# Run STREAM
cd ~/benchmarks/woc-benchmarking/apps/hpc/stream
./stream
EOF

sbatch stream-job.sh
```

## Automating with the Azure benchmarking scripts

The Azure woc-benchmarking repository includes automation scripts:

```bash
cd ~/benchmarks/woc-benchmarking/apps/hpc/stream

# View available scripts
ls -la

# Run automated benchmark (if available)
./run_stream.sh
```
