@@ -9,6 +9,9 @@ metrics like memory bandwidth, latency, and utilization:
 * PCIE
 * PCIE-TGT
 * CPU Memory (CMEM) Latency
+* NVLink-C2C
+* NV-CLink
+* NV-DLink
 
 PMU Driver
 ----------
@@ -369,3 +372,151 @@ see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.
 Example usage::
 
   perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
+
+NVLink-C2C PMU
+--------------
+
+This PMU monitors latency events of memory read/write requests that pass
+through the NVIDIA Chip-to-Chip (C2C) interface. Unlike the C2C PMU in Grace
+(Tegra241 SoC), this PMU does not provide bandwidth events.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.
+
+The list of events:
+
+* IN_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of incoming read requests.
+* IN_RD_REQ: the number of incoming read requests.
+* IN_WR_CUM_OUTS: accumulated outstanding requests (in cycles) of incoming write requests.
+* IN_WR_REQ: the number of incoming write requests.
+* OUT_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of outgoing read requests.
+* OUT_RD_REQ: the number of outgoing read requests.
+* OUT_WR_CUM_OUTS: accumulated outstanding requests (in cycles) of outgoing write requests.
+* OUT_WR_REQ: the number of outgoing write requests.
+* CYCLES: NVLink-C2C interface cycle counts.
+
+The incoming events count the reads/writes from a remote device to the SoC.
+The outgoing events count the reads/writes from the SoC to a remote device.
+
+The sysfs file /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
+describes the connected device.
+
+When the C2C interface is connected to GPU(s), the "gpu_mask" parameter can be
+used to filter traffic to/from specific GPU(s). Each bit represents a GPU index,
+e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" to GPU 0 and GPU 1.
+If "gpu_mask" is not specified, the PMU monitors all GPUs.
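As a quick sanity check on the bit layout above, here is a small Python sketch; the helper name is illustrative, not part of the PMU interface:

```python
def gpu_mask(*gpu_indices):
    """Build a "gpu_mask" value for the NVLink-C2C PMU: bit N selects GPU N."""
    mask = 0
    for idx in gpu_indices:
        mask |= 1 << idx
    return mask

# GPU 0 only -> 0x1; GPUs 0 and 1 -> 0x3, matching the examples in the text.
print(hex(gpu_mask(0)))     # 0x1
print(hex(gpu_mask(0, 1)))  # 0x3
```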
+
+When connected to another SoC, only the read events are available.
+
+The events can be used to calculate the average latency of the read/write
+requests::
+
+  C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+  IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+  IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+  IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
+  IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+  OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+  OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+  OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
+  OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
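The derivation above can be sketched numerically in Python; the counter values below are made-up assumptions for illustration, not measurements:

```python
# Hypothetical counter values read over a 1-second (1e9 ns) window.
elapsed_ns = 1e9
cycles = 2.0e9           # CYCLES: NVLink-C2C interface cycles
in_rd_req = 1.0e6        # IN_RD_REQ: incoming read requests
in_rd_cum_outs = 50.0e6  # IN_RD_CUM_OUTS: outstanding cycles of those requests

# C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
c2c_freq_ghz = cycles / elapsed_ns  # 2.0 GHz

# IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
in_rd_avg_latency_cycles = in_rd_cum_outs / in_rd_req  # 50 cycles

# IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
in_rd_avg_latency_ns = in_rd_avg_latency_cycles / c2c_freq_ghz

print(in_rd_avg_latency_ns)  # 25.0
```

The same arithmetic applies to the write and outgoing counter pairs.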
+
+Example usage:
+
+* Count incoming read traffic from all GPUs connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/
+
+* Count accumulated outstanding incoming read requests from GPU 0 connected
+  via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/
+
+* Count accumulated outstanding incoming read requests from GPU 1 connected
+  via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/
+
+* Count outgoing read traffic to all GPUs connected via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/
+
+* Count accumulated outstanding outgoing read requests to GPU 0 connected
+  via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/
+
+* Count accumulated outstanding outgoing read requests to GPU 1 connected
+  via NVLink-C2C::
+
+    perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/
+
+NV-CLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-CLink interface. Bandwidth events are not available in this PMU.
+In the Tegra410 SoC, the NV-CLink interface connects to another Tegra410
+SoC, and this PMU counts only read traffic.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.
+
+The list of events:
+
+* IN_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of incoming read requests.
+* IN_RD_REQ: the number of incoming read requests.
+* OUT_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of outgoing read requests.
+* OUT_RD_REQ: the number of outgoing read requests.
+* CYCLES: NV-CLink interface cycle counts.
+
+The incoming events count the reads from a remote device to the SoC.
+The outgoing events count the reads from the SoC to a remote device.
+
+The events can be used to calculate the average latency of the read requests::
+
+  CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+  IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+  IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+  OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+  OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+Example usage:
+
+* Count incoming read traffic from the remote SoC connected via NV-CLink::
+
+    perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/
+
+* Count outgoing read traffic to the remote SoC connected via NV-CLink::
+
+    perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/
+
+NV-DLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-DLink interface. Bandwidth events are not available in this PMU.
+In the Tegra410 SoC, this PMU counts only CXL memory read traffic.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.
+
+The list of events:
+
+* IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
+* IN_RD_REQ: the number of read requests to CXL memory.
+* CYCLES: NV-DLink interface cycle counts.
+
+The events can be used to calculate the average latency of the read requests::
+
+  DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+  IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+  IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ
+
+Example usage:
+
+* Count read events to CXL memory::
+
+    perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'
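The latency formula can be automated by reading the counters with `perf stat`'s CSV output (`-x,`) and combining them in a script. This sketch assumes the common `perf stat -x,` field layout (counter value first, event name third); verify against your perf build. The sample text is fabricated for illustration:

```python
def avg_read_latency_ns(csv_text, elapsed_ns):
    """Compute NV-DLink average CXL read latency from `perf stat -x,` lines.

    Assumes the counter value is CSV field 0 and the event name field 2;
    this matches common perf versions but should be checked locally.
    """
    counters = {}
    for line in csv_text.strip().splitlines():
        fields = line.split(",")
        name = fields[2]
        # Reduce "nvidia_nvdlink_pmu_0/in_rd_req/" to "in_rd_req".
        event = name.split("/")[-2] if "/" in name else name
        counters[event] = float(fields[0])
    freq_ghz = counters["cycles"] / elapsed_ns
    latency_cycles = counters["in_rd_cum_outs"] / counters["in_rd_req"]
    return latency_cycles / freq_ghz

# Fabricated sample output for a 1-second run (not real measurements).
sample = """\
1000000,,nvidia_nvdlink_pmu_0/in_rd_req/,1000000000,100.00,,
50000000,,nvidia_nvdlink_pmu_0/in_rd_cum_outs/,1000000000,100.00,,
2000000000,,nvidia_nvdlink_pmu_0/cycles/,1000000000,100.00,,
"""
print(avg_read_latency_ns(sample, elapsed_ns=1e9))  # 25.0
```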