Commit 3437ff4

Revise HPC performance overview with AI benchmarking details
Updated the date and enhanced the content to include AI benchmarking on Azure, key performance metrics, and best practices for benchmarking. Added sections on compute, memory, and network performance metrics, as well as AI-specific metrics.
1 parent 0421a9c commit 3437ff4

1 file changed: 128 additions & 73 deletions

articles/high-performance-computing/performance-benchmarking/overview.md

@@ -3,7 +3,7 @@ title: "High-Performance Computing (HPC) performance and benchmarking overview"
 description: Learn about understanding and measuring the performance concepts and benchmarking methologies.
 author: padmalathas
 ms.author: padmalathas
-ms.date: 01/01/2025
+ms.date: 02/23/2026
 ms.topic: concept-article
 ms.service: azure-virtual-machines
 ms.subservice: hpc
@@ -12,87 +12,142 @@ ms.subservice: hpc
 
 # High-Performance Computing (HPC) Performance and Benchmarking Overview
 
-High-Performance Computing (HPC) systems are designed to process large amounts of data and perform complex calculations at high speeds. Understanding and measuring their performance is crucial for system optimization, procurement decisions, and ensuring applications meet performance requirements. This document provides a comprehensive overview of HPC performance concepts and benchmarking methodologies.
+This article introduces HPC and AI benchmarking on Azure. It is designed for architects, engineers, and decision-makers who need to:
+
+- Evaluate Azure infrastructure for new or existing workloads
+- Establish performance baselines
+- Compare VM families using objective data
+- Optimize performance and cost efficiency
+
+## Why benchmarking matters
+
+Benchmarking provides evidence-based insights that support both technical and business decisions. It serves several critical purposes for HPC and AI workloads:
+
+- Choose the right infrastructure: Match workload characteristics to the most suitable Azure VM family.
+- Validate performance: Confirm that deployed systems meet expected throughput and latency targets.
+- Optimize configurations: Identify bottlenecks across compute, memory, storage, and networking.
+- Analyze cost efficiency: Compare price–performance ratios across VM options.
+- Support procurement decisions: Provide repeatable, defensible performance data to stakeholders.
+
 
 ## Key Performance Metrics
 
-Understanding the fundamental metrics used to measure HPC system performance is essential for meaningful system evaluation and comparison. They provide objective measurements for comparison, identify system bottlenecks thereby enabling the performance tuning and help predict predict application performance. The performance
+Understanding the core metrics used to measure HPC system performance is essential for meaningful system evaluation and comparison. These metrics provide objective measurements for comparison, expose the system bottlenecks that performance tuning should target, and help predict application performance. Metrics vary by workload type, but they generally fall into four categories.
+
+# [Compute performance](#tab/computeperf)
+
+Compute performance metrics describe the raw processing capability of a system and how effectively that capability is realized in practice. FLOPS (floating-point operations per second) is commonly used to quantify computational throughput and is often reported by benchmarks such as HPL (LINPACK). While peak performance represents the theoretical maximum capability of the hardware, sustained performance reflects what applications actually achieve under real workloads and is therefore a more meaningful indicator for most evaluations.
+
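As a quick illustration of the peak-versus-sustained distinction above, theoretical peak FLOPS can be estimated from core count, clock speed, and per-cycle floating-point throughput. The figures below are hypothetical and do not describe any specific Azure SKU:

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock (GHz) x FLOPs per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

def sustained_fraction(sustained_gflops: float, peak: float) -> float:
    """Fraction of the theoretical peak actually achieved (e.g., by HPL)."""
    return sustained_gflops / peak

# Hypothetical 96-core CPU at 2.5 GHz with 16 double-precision FLOPs/cycle:
peak = peak_gflops(96, 2.5, 16)          # 3840 GFLOPS theoretical peak
eff = sustained_fraction(3000.0, peak)   # ~0.78 of peak sustained
```

HPL results well below 1.0 here are normal; the gap between peak and sustained is exactly why sustained performance is the more meaningful evaluation signal.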
+# [Memory performance](#tab/memoryperf)
+
+Memory system efficiency often dominates overall application performance because it determines how quickly data can be accessed and processed. Memory bandwidth measures the rate at which data can be transferred and is especially critical for memory-bound workloads such as computational fluid dynamics. Memory latency reflects the delay between a request and data delivery, influencing scalability and responsiveness, while cache efficiency indicates how effectively applications reuse data to avoid expensive memory accesses.
+
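The memory-bandwidth idea above can be sketched with a STREAM-triad-style kernel: count the bytes moved and divide by elapsed time. This pure-Python version only illustrates the measurement logic; meaningful STREAM numbers come from the compiled C benchmark:

```python
import array
import time

def triad_bandwidth_gbs(n: int = 1_000_000, scalar: float = 3.0) -> float:
    """STREAM-triad-style kernel a[i] = b[i] + scalar * c[i]; returns GB/s.
    Illustrative only -- interpreter overhead dwarfs real memory bandwidth."""
    b = array.array("d", [1.0] * n)
    c = array.array("d", [2.0] * n)
    a = array.array("d", [0.0] * n)
    t0 = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    elapsed = time.perf_counter() - t0
    bytes_moved = 3 * 8 * n  # read b, read c, write a (8-byte doubles)
    return bytes_moved / elapsed / 1e9
```

The bookkeeping (three 8-byte streams per element) mirrors how STREAM itself converts a timed loop into a bandwidth figure.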
 
-# [Processing Performance](#tab/processperf)
-HPC systems' computational capabilities are measured through various metrics that quantify their ability to execute calculations and instructions.
-- FLOPS (Floating-Point Operations Per Second): Measures the raw computational power of a system
-- Peak Performance: Theoretical maximum performance achievable by the system
-- Sustained Performance: Actual performance achieved during real-world operations
-- IPS (Instructions Per Second): Rate at which a processor executes instructions
+# [Network performance](#tab/networkperf)
 
-# [Memory Performance](#tab/memoryperf)
-Memory system efficiency is crucial for overall system performance as it determines how quickly data can be accessed and processed.
-- Bandwidth: Rate of data transfer between memory and processor
-- Latency: Time delay between memory request and data delivery
-- Memory Hierarchy Performance: Cache hit rates and access times across different memory levels
+In distributed computing environments, network performance metrics evaluate how effectively the system communicates between nodes. These metrics are essential for distributed and tightly coupled workloads that span multiple nodes. Network bandwidth defines the maximum data transfer rate between nodes, whereas network latency measures the time required for messages to travel across the interconnect. Message rate, often evaluated with MPI micro-benchmarks, indicates how well the system handles frequent, small communications, which is particularly important for communication-intensive HPC applications.
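A simple way to reason about the bandwidth/latency split described above is the classic postal (alpha-beta) model, in which transfer time is latency plus message size divided by bandwidth. The link numbers below are illustrative, not measured Azure figures:

```python
def transfer_time_us(msg_bytes: int, latency_us: float, bandwidth_gbs: float) -> float:
    """Postal model: time = latency + size / bandwidth, in microseconds."""
    return latency_us + msg_bytes / (bandwidth_gbs * 1e9) * 1e6

# Hypothetical InfiniBand-class link: 2 us latency, 25 GB/s bandwidth.
small = transfer_time_us(8, 2.0, 25.0)            # 8 B: latency-dominated
large = transfer_time_us(8 * 1024**2, 2.0, 25.0)  # 8 MiB: bandwidth-dominated
```

The model makes the qualitative point from the paragraph concrete: small messages are governed almost entirely by latency and message rate, while large transfers are governed by bandwidth.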

-# [Network Performance](#tab/networkperf)
-In a distributed computing environments, network performance metrics help evaluate the system's ability to communicate between nodes effectively.
-- Bandwidth: Maximum data transfer rate between nodes
-- Latency: Time required for a message to travel between nodes
-- Message Rate: Number of messages that can be sent per second
-- Bisection Bandwidth: Worst-case bandwidth when the network is split into two equal parts
+# [AI-specific metrics](#tab/aispecific)
+
+AI workloads introduce additional performance considerations beyond traditional HPC metrics. Throughput measures how many samples or tokens are processed per second during training or inference and is a primary indicator of overall efficiency. Time to first token (TTFT) captures the latency before the first output token is generated and directly affects user experience for large language model inference. Scaling efficiency describes how well performance improves as additional GPUs or nodes are added, providing insight into how effectively the workload utilizes parallel resources.
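Scaling efficiency as described above can be computed directly from measured throughput at one and many devices. The sample numbers below are hypothetical:

```python
def scaling_efficiency(throughput_1: float, throughput_n: float, n: int) -> float:
    """Fraction of ideal linear scaling achieved on n GPUs or nodes."""
    return throughput_n / (n * throughput_1)

# Hypothetical training run: 400 samples/s on 1 GPU, 2880 samples/s on 8 GPUs.
eff = scaling_efficiency(400.0, 2880.0, 8)  # 0.9, i.e., 90% scaling efficiency
```

Values close to 1.0 indicate the workload makes good use of additional parallel resources; a steep drop as n grows usually points at communication or input-pipeline bottlenecks.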

 ---
 
-## Benchmarking Categories
-Different types of benchmarks serve various purposes in evaluating system performance, from testing specific components to assessing real-world application performance.
-
-|Synthetic Benchmarks <br> (Test specific system components or characteristics)|Application Benchmarks <br> (Real-world applications or their proxies)|Kernel Benchmarks <br> (Small, self-contained portions of applications)|
-|----------|-------------|------|
-|STREAM (memory bandwidth)|Weather Research and Forecasting (WRF)|NAS Parallel Benchmarks|
-|Intel MPI Benchmarks (network performance)|GROMACS (molecular dynamics)|DOE CORAL Benchmarks|
-|LINPACK (dense linear algebra)|NAMD (molecular dynamics)|ECP Proxy Applications|
-|HPCG (sparse linear algebra)|MILC (quantum chromodynamics)|
-
-## Performance Analysis Methods
-Various techniques are employed to gather detailed performance data and identify bottlenecks in HPC systems and applications. Most commonly used methods are *profiling* wherein it collects runtime data to understand program behavior and resource utilization patterns, *tracing* method in which it captures details temporal information about program execution and the system behavior for in-depth analysis.
-
-### Profiling
-- Time-based profiling: Sampling program counter at regular intervals
-- Event-based profiling: Collecting hardware counter data
-- Communication profiling: Analyzing message patterns and timing
-- I/O profiling: Measuring file system performance
-
-### Tracing
-- Timeline analysis: Recording temporal behavior of events
-- Message tracing: Analyzing communication patterns
-- Hardware counter tracing: Recording hardware events over time
-
-## Performance Optimization Techniques
-These strategies help maximize system efficiency and application performance across different aspects of HPC systems. The most effective techniques typically combine elements from all three categories, creating a balanced optimization strategy that considers the entire system's performance characteristics. Success often comes from identifying which combination of these techniques best matches your specific application and system architecture.
-
-:::image type="content" source="../media/performance-techniques.jpg" alt-text="A screenshot of the effective techniques with combined elements.":::
-
-## Best Practices for Benchmarking
-Following are established benchmarking practices ensures reliable and reproducible performance measurements.
-
-### Methodology
-- To define clear objectives and metric
-- Select representative benchmarks
-- Ensure consistent testing conditions
-- Document all testing parameters
-- Perform multiple runs for statistical validity
-
-### Common Pitfalls to Avoid
-- Insufficient warm-up periods
-- Inconsistent compiler options
-- Inadequate sample sizes
-- Unrealistic input datasets
-- Ignoring system variability
-
-### Reporting Requirements
-- System configuration details'
-- Software stack information
-- Benchmark parameters
-- Raw results and statistical analysis
-- Environmental conditions
-- Optimization settings
+## Azure VM families for HPC and AI
+Azure provides specialized VM families tuned for different workload patterns.
+
+### CPU-based HPC (HB-series)
+HB-series VMs are optimized for memory bandwidth and low-latency networking, making them well suited for traditional HPC workloads such as:
+
+* Computational fluid dynamics (CFD)
+* Weather and climate modeling
+* Finite element analysis
+
+Key characteristics include:
+
+* High-core-count AMD EPYC processors
+* Large memory bandwidth (including HBM in newer generations)
+* High-speed InfiniBand networking
+
+### GPU-based AI (ND-series)
+ND-series VMs are designed for GPU-accelerated workloads, including:
+
+* Deep learning training
+* Large language model (LLM) inference
+* AI research and experimentation
+
+These VMs feature:
+
+* NVIDIA data center GPUs (H100, H200, Blackwell)
+* Large GPU memory capacity
+* High-bandwidth GPU-to-GPU and GPU-to-network interconnects
+
+## Benchmarking categories
+Different benchmarks answer different questions. Select benchmarks based on the aspect of performance you want to evaluate.
+
+### Synthetic benchmarks
+Synthetic benchmarks isolate specific system components and are useful for baseline validation:
+
+* STREAM – Measures sustainable memory bandwidth
+* HPL (LINPACK) – Measures peak floating-point compute performance
+* HPCG – Evaluates performance for sparse linear algebra, closer to real-world HPC workloads
+* OSU Micro-Benchmarks – Validates MPI latency and bandwidth
+* NCCL tests – Measures GPU collective communication performance
+
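For NCCL tests specifically, results are usually reported as bus bandwidth rather than raw algorithm bandwidth. A sketch of the all-reduce conversion used by the nccl-tests suite (busbw = algbw × 2(n−1)/n), hedged as an illustration of the convention rather than the tool itself:

```python
def allreduce_busbw_gbs(data_bytes: int, seconds: float, n_ranks: int) -> float:
    """Convert an all-reduce timing into NCCL-style bus bandwidth (GB/s).
    algbw = size / time; busbw scales algbw by 2*(n-1)/n, the factor the
    nccl-tests suite uses so results are comparable across rank counts."""
    algbw = data_bytes / seconds / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Hypothetical run: 1 GB all-reduce across 8 GPUs completing in 10 ms.
busbw = allreduce_busbw_gbs(10**9, 0.01, 8)  # algbw 100 GB/s -> busbw 175 GB/s
```

Comparing busbw (not algbw) across VM sizes is what makes interconnect comparisons fair when the GPU count differs.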
+### Application benchmarks
+Application benchmarks reflect real-world behavior and are often more representative:
+
+* ANSYS Fluent – CFD solver performance
+* WRF – Weather and atmospheric modeling
+* GROMACS / NAMD – Molecular dynamics throughput
+* MLPerf Training – End-to-end AI training performance
+* MLPerf Inference – Model serving throughput and latency
+
+
+## Getting started
+
+Follow this recommended path to begin benchmarking on Azure:
+
+```
+1. Set up infrastructure
+   └── Setting Up Your First HPC Cluster (CycleCloud + Slurm)
+
+2. Run baseline benchmarks
+   ├── Running Your First Benchmark: STREAM (CPU/memory)
+   └── Running NCCL Benchmarks (GPU communication)
+
+3. Compare VM options
+   ├── CPU HPC VMs Comparison
+   └── GPU AI VMs Comparison
+
+4. Optimize for your workload
+   └── Optimizing NCCL for Azure (AI training)
+```
+
+## Best practices
+
+The following guidelines help produce reliable and reproducible benchmarks:
+
+### Before you benchmark
+
+- **Use HPC/AI optimized images**: Start with Azure HPC images (AlmaLinux-HPC, Ubuntu-HPC) that include pre-configured drivers and libraries
+- **Verify driver versions**: Ensure GPU drivers, InfiniBand drivers, and NCCL versions are current
+- **Check topology**: Confirm NUMA configuration and GPU-to-NIC affinity
+
+### During benchmarking
+
+- **Warm-up runs**: Discard initial runs to allow caches to stabilize
+- **Multiple iterations**: Run at least 5 iterations and report the median or average
+- **Consistent conditions**: Keep OS, drivers, and configurations identical across comparisons
+- **Document everything**: Record software versions, environment variables, and command-line parameters
+
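The warm-up and multiple-iterations guidance above can be implemented with a small summary helper that drops cold runs and reports a robust central value. The run values below are made up for illustration:

```python
import statistics

def summarize_runs(results: list[float], warmup: int = 1) -> dict[str, float]:
    """Discard warm-up iterations, then summarize the steady-state runs."""
    steady = results[warmup:]
    return {
        "median": statistics.median(steady),
        "mean": statistics.fmean(steady),
        "stdev": statistics.stdev(steady),
    }

# Hypothetical bandwidth results (GB/s); the first, cold run is discarded.
runs = [310.2, 402.1, 399.8, 401.0, 400.5, 398.9]
summary = summarize_runs(runs)  # median 400.5 GB/s over the steady runs
```

Reporting the median alongside the standard deviation makes it obvious when system variability, not the configuration change, explains a difference between two benchmark runs.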
+### Common pitfalls to avoid
+
+- Insufficient warm-up periods
+- Comparing different software versions
+- Ignoring NUMA topology
+- Using default configurations without optimization
+- Inadequate sample sizes
 
 ## Related resources
 