Commit 2f89b7f

bwicaksononv authored and willdeacon committed
perf: add NVIDIA Tegra410 C2C PMU
Adds NVIDIA C2C PMU support in Tegra410 SOC. This PMU is used to measure
memory latency between the SOC and device memory, e.g GPU Memory (GMEM),
CXL Memory, or memory on remote Tegra410 SOC.

Reviewed-by: Ilkka Koskinen <[email protected]>
Signed-off-by: Besar Wicaksono <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
1 parent 429b763 commit 2f89b7f

4 files changed

Lines changed: 1210 additions & 0 deletions

Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst

Lines changed: 151 additions & 0 deletions
@@ -9,6 +9,9 @@ metrics like memory bandwidth, latency, and utilization:
 * PCIE
 * PCIE-TGT
 * CPU Memory (CMEM) Latency
+* NVLink-C2C
+* NV-CLink
+* NV-DLink
 
 PMU Driver
 ----------
@@ -369,3 +372,151 @@ see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.
 Example usage::
 
     perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
+
+NVLink-C2C PMU
+--------------
+
+This PMU monitors latency events of memory read/write requests that pass through
+the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
+in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.
+
+The list of events:
+
+* IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
+* IN_RD_REQ: the number of incoming read requests.
+* IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
+* IN_WR_REQ: the number of incoming write requests.
+* OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
+* OUT_RD_REQ: the number of outgoing read requests.
+* OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
+* OUT_WR_REQ: the number of outgoing write requests.
+* CYCLES: NVLink-C2C interface cycle counts.
+
+The incoming events count the reads/writes from remote device to the SoC.
+The outgoing events count the reads/writes from the SoC to remote device.
+
+The sysfs /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
+contains the information about the connected device.
+
+When the C2C interface is connected to GPU(s), the user can use the
+"gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit represents the GPU
+index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" is for GPU 0 and 1.
+The PMU will monitor all GPUs by default if not specified.
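
As an illustration of the `gpu_mask` encoding described above (bit i selects GPU index i), the following Python sketch builds the mask from a list of GPU indices and formats the matching perf event string. The helper names `gpu_mask()` and `c2c_event()` are hypothetical and not part of the patch; only the PMU name pattern, the event names, and the `gpu_mask=` parameter come from the documentation.

```python
def gpu_mask(gpu_indices):
    """Build a gpu_mask value: bit i set selects GPU index i."""
    mask = 0
    for idx in gpu_indices:
        if idx < 0:
            raise ValueError("GPU index must be non-negative")
        mask |= 1 << idx
    return mask


def c2c_event(socket_id, event, gpu_indices=None):
    """Format a perf event string for the NVLink-C2C PMU (hypothetical helper).

    When gpu_indices is None, gpu_mask is left out entirely so that the
    documented default (monitor all GPUs) applies.
    """
    pmu = f"nvidia_nvlink_c2c_pmu_{socket_id}"
    if gpu_indices is None:
        return f"{pmu}/{event}/"
    return f"{pmu}/{event},gpu_mask={gpu_mask(gpu_indices):#x}/"


print(c2c_event(0, "in_rd_req"))
print(c2c_event(0, "in_rd_cum_outs", [0, 1]))
```

The second call produces an event string with `gpu_mask=0x3`, matching the "GPU 0 and 1" case in the text.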
+
+When connected to another SoC, only the read events are available.
+
+The events can be used to calculate the average latency of the read/write requests::
+
+    C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+    IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
+    IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+    OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+    OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+    OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
+    OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
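
The latency formulas above share one shape (cumulative outstanding cycles divided by request count, then divided by the interface frequency), so they can be sketched as a single Python function. The function name `avg_latency()` and the counter values in the example call are made up for illustration; they are not measurements and not part of the patch.

```python
def avg_latency(cum_outs, req, cycles, elapsed_ns):
    """Apply the documented formulas to one direction/type of traffic.

    cum_outs   -- a *_CUM_OUTS counter value
    req        -- the matching *_REQ counter value
    cycles     -- the CYCLES counter value over the same interval
    elapsed_ns -- wall-clock length of the interval in nanoseconds

    Returns (latency_in_cycles, latency_in_ns), or None when no requests
    or no cycles were counted (the averages are undefined in that case).
    """
    if req == 0 or cycles == 0 or elapsed_ns <= 0:
        return None
    freq_ghz = cycles / elapsed_ns      # C2C_FREQ_IN_GHZ
    lat_cycles = cum_outs / req         # *_AVG_LATENCY_IN_CYCLES
    return lat_cycles, lat_cycles / freq_ghz


# Hypothetical counts: 2e9 cycles in 1 s gives 2 GHz; 600000 / 1000 = 600 cycles.
print(avg_latency(600_000, 1_000, 2_000_000_000, 1_000_000_000))
```

With these made-up inputs the result is 600 cycles, or 300 ns at 2 GHz. Because the three PMUs use the same formula, the same function would apply to NV-CLink and NV-DLink counters as well.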
+
+Example usage:
+
+* Count incoming traffic from all GPUs connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/
+
+* Count incoming traffic from GPU 0 connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/
+
+* Count incoming traffic from GPU 1 connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/
+
+* Count outgoing traffic to all GPUs connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/
+
+* Count outgoing traffic to GPU 0 connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/
+
+* Count outgoing traffic to GPU 1 connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/
+
+NV-CLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-CLINK interface. Bandwidth events are not available in this PMU.
+In Tegra410 SoC, the NV-CLink interface is used to connect to another Tegra410
+SoC and this PMU only counts read traffic.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.
+
+The list of events:
+
+* IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
+* IN_RD_REQ: the number of incoming read requests.
+* OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
+* OUT_RD_REQ: the number of outgoing read requests.
+* CYCLES: NV-CLINK interface cycle counts.
+
+The incoming events count the reads from remote device to the SoC.
+The outgoing events count the reads from the SoC to remote device.
+
+The events can be used to calculate the average latency of the read requests::
+
+    CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+    OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+    OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+Example usage:
+
+* Count incoming read traffic from remote SoC connected via NV-CLINK::
+
+    perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/
+
+* Count outgoing read traffic to remote SoC connected via NV-CLINK::
+
+    perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/
+
+NV-DLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-DLINK interface. Bandwidth events are not available in this PMU.
+In Tegra410 SoC, this PMU only counts CXL memory read traffic.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.
+
+The list of events:
+
+* IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
+* IN_RD_REQ: the number of read requests to CXL memory.
+* CYCLES: NV-DLINK interface cycle counts.
+
+The events can be used to calculate the average latency of the read requests::
+
+    DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+    IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+    IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ
+
+Example usage:
+
+* Count read events to CXL memory::
+
+    perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'
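
To turn the two counters from the NV-DLink example into an average latency in cycles, the counts can be post-processed from machine-readable `perf stat` output. The Python sketch below assumes the command was run with `-x,` and that each record begins with the fields value, unit, event name (the layout described in the perf-stat manual when no interval or per-CPU options are used); that layout should be checked against the perf version in use. The `SAMPLE` text and its numbers are invented for illustration, and `parse_perf_csv()` is a hypothetical helper.

```python
import csv
import io


def parse_perf_csv(text):
    """Map event name -> integer count from `perf stat -x,` style text.

    Assumes the first three fields are: value, unit, event name.
    Rows whose value is not an integer (for example "<not counted>")
    are skipped.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0], row[2]
        try:
            counts[event] = int(value)
        except ValueError:
            continue
    return counts


# Invented sample output; not real measurements.
SAMPLE = """\
120000,,nvidia_nvdlink_pmu_0/in_rd_req/,1000000000,100.00,,
96000000,,nvidia_nvdlink_pmu_0/in_rd_cum_outs/,1000000000,100.00,,
"""

counts = parse_perf_csv(SAMPLE)
req = counts["nvidia_nvdlink_pmu_0/in_rd_req/"]
cum = counts["nvidia_nvdlink_pmu_0/in_rd_cum_outs/"]
if req:
    # IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
    print(cum / req)
```

Note that the example command above collects only `in_rd_req` and `in_rd_cum_outs`, which is enough for the latency in cycles. Converting to nanoseconds per the DLINK_FREQ_IN_GHZ formula would additionally require the `cycles` event from the same PMU over the same interval.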

drivers/perf/Kconfig

Lines changed: 7 additions & 0 deletions
@@ -318,4 +318,11 @@ config NVIDIA_TEGRA410_CMEM_LATENCY_PMU
 	  Enable perf support for CPU memory latency counters monitoring on
 	  NVIDIA Tegra410 SoC.
 
+config NVIDIA_TEGRA410_C2C_PMU
+	tristate "NVIDIA Tegra410 C2C PMU"
+	depends on ARM64 && ACPI
+	help
+	  Enable perf support for counters in NVIDIA C2C interface of NVIDIA
+	  Tegra410 SoC.
+
 endmenu

drivers/perf/Makefile

Lines changed: 1 addition & 0 deletions
@@ -36,3 +36,4 @@ obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
 obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
 obj-$(CONFIG_CXL_PMU) += cxl_pmu.o
 obj-$(CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU) += nvidia_t410_cmem_latency_pmu.o
+obj-$(CONFIG_NVIDIA_TEGRA410_C2C_PMU) += nvidia_t410_c2c_pmu.o
