Revert "NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions" #26

Closed
mmhonap wants to merge 1 commit into NVIDIA:nvidia_unstable-11.0 from mmhonap:dma-map-skip-fix-q11

mmhonap commented Jun 23, 2026

This reverts commit d814a45.

The commit added an early return in vfio_container_region_add() for any RAM-device section owned by a VFIO device
(vfio_get_vfio_device(memory_region_owner(section->mr)) != NULL), skipping vfio_container_dma_map() for it. In practice that excludes every VFIO mmap subregion from the IOMMU IOAS (the SMMU Stage-2 page tables): PCI BAR windows, and the CXL.mem coherent device memory of a CXL Type-2 device.
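The effect of that early return can be pictured with a small self-contained model. This is only a sketch: `model_section`, `reverted_commit_skips()` and `section_is_dma_mapped()` are hypothetical names, not QEMU definitions, and the other checks the real listener performs are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of a memory region section that matter here. */
struct model_section {
    bool is_ram_device;      /* region is a RAM-device (mmap of device memory) */
    const void *vfio_owner;  /* non-NULL when the region's owner is a VFIO device */
};

/* Shape of the predicate the reverted commit used to return early. */
static bool reverted_commit_skips(const struct model_section *s)
{
    assert(s != NULL);
    return s->is_ram_device && s->vfio_owner != NULL;
}

/* In this model a section is DMA-mapped unless the skip is enabled and applies to it. */
static bool section_is_dma_mapped(const struct model_section *s, bool skip_enabled)
{
    return !(skip_enabled && reverted_commit_skips(s));
}
```

In this model a BAR window and the CXL.mem region both have `is_ram_device` set and a non-NULL owner, so both lose their mapping while the skip is enabled; ordinary guest RAM is unaffected either way.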

The commit was added because, in earlier testing, I was seeing this warning in the QEMU log during device boot:

qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?

While testing with kernel 6.17.9, I had tested only CXL mode and only the CUDA tests. This was a testing miss on my side: I had not run the UVM ATS tests or verified PCI-E mode with my QEMU patches.

During earlier debugging I misdiagnosed the issue. I concluded that the mapping would always fail because the backing VMAs are VM_IO | VM_PFNMAP, pin_user_pages() refuses VM_IO pages, and IOMMU_IOAS_MAP therefore returns -EFAULT. To get past this failure I added the mapping skip. That was not correct, and the skip seems to break CUDA UVM ATS on CXL Type-2 passthrough.

The bug caused by mapping skip

In accelerated nested SMMUv3 mode the GPU translates shared virtual addresses through the hardware SMMU (Stage-1 is the guest page tables, Stage-2 is the host iommufd tables). When CUDA UVM migrates a managed buffer into the device's coherent memory, that page's guest-physical address lies in the CXL HDM window. The GPU reaches it with an ATS request, and to answer that request the SMMU must complete the Stage-1 and Stage-2 walk. With the HDM region skipped there is no Stage-2 entry, so the translation faults. The GPU posts a replayable fault, UVM services it and replays, the access faults again, and the GPU spins in a fault/service/replay livelock. The guest test hangs, and on cancel it reports "Xid 31 ... FAULT_PTE ACCESS_TYPE_VIRT_WRITE". Non-ATS workloads and PCI-E-mode (nvgrace-gpu) passthrough are unaffected, because they never reach the unmapped path.
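A toy model may help show why replaying cannot make progress. It is a sketch under stated assumptions: `stage2_window`, `stage2_translate()` and `access_with_replay()` are hypothetical names, Stage-1 is left out, and the replay budget exists only so the model terminates. The point it illustrates is that servicing the fault on the guest side never changes the host-owned Stage-2 tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a single Stage-2 window (for example the HDM region). */
struct stage2_window {
    bool mapped;        /* false models the region skipped by the reverted commit */
    uint64_t ipa_base;  /* guest-physical base of the window */
    uint64_t size;      /* window size in bytes */
    uint64_t pa_base;   /* host-physical base (arbitrary in this model) */
};

/* Stage-2 step of a translation: true and *pa filled on success, false on fault. */
static bool stage2_translate(const struct stage2_window *w, uint64_t ipa, uint64_t *pa)
{
    assert(w != NULL && pa != NULL);
    if (!w->mapped || ipa < w->ipa_base || ipa - w->ipa_base >= w->size)
        return false;
    *pa = w->pa_base + (ipa - w->ipa_base);
    return true;
}

/*
 * Fault/service/replay loop. Returns the attempt number that succeeded,
 * or -1 when the replay budget runs out (the model's stand-in for livelock).
 */
static int access_with_replay(const struct stage2_window *w, uint64_t ipa, int max_replays)
{
    uint64_t pa;

    for (int attempt = 1; attempt <= max_replays; attempt++) {
        if (stage2_translate(w, ipa, &pa))
            return attempt;
        /* The fault is "serviced" here, but Stage-2 is host-owned and unchanged. */
    }
    return -1;
}
```

With `mapped` set, the first attempt succeeds. With it cleared, every replay sees the same missing entry, however large the budget.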

IOMMU_IOAS_MAP now succeeds for VM_PFNMAP GPU device memory on kernel 6.17.13 because of the host-side workaround:

68a70c30f8ce ("NVIDIA: SAUCE: WAR: iommufd/pages: Bypass PFNMAP")

That patch teaches iommufd's pfn_reader to handle VM_PFNMAP VMAs. Instead of pin_user_pages() (which does refuse VM_IO), it calls follow_fault_pfn(), which uses follow_pfnmap_start(), faults the lazily inserted PFN in through fixup_user_fault(), and takes the raw PFN, which has no struct page. The premise of d814a45 was therefore true only on a kernel without this workaround. With the workaround present, IOMMU_IOAS_MAP for the HDM region returns success (0), as the QEMU trace shows:

iommufd_backend_map_dma ... iova=0x80000000000 size=0x2330000000 ... readonly=0 (0)
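The difference between the two kernels can be sketched as a userspace model. Everything here is hypothetical (the `model_*` names, the flag values, the `have_workaround` switch); it does not reproduce the kernel's code, only the control flow described above: the pin path refuses VM_IO with -EFAULT, while the fallback faults the PFN in and reads it directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for the model; these are not the kernel's definitions. */
#define MODEL_VM_IO     0x1u
#define MODEL_VM_PFNMAP 0x2u

struct model_vma {
    unsigned int flags;
    bool pfn_present;  /* false until the lazily inserted PFN has been faulted in */
    uint64_t pfn;      /* PFN the mapping resolves to once present */
};

/* Models the pin path: VM_IO mappings are refused with -EFAULT. */
static int model_pin(const struct model_vma *vma, uint64_t *pfn)
{
    if (vma->flags & MODEL_VM_IO)
        return -EFAULT;
    *pfn = vma->pfn;
    return 0;
}

/* Models the fallback: fault the PFN in if needed, then take the raw PFN. */
static int model_follow_fault_pfn(struct model_vma *vma, uint64_t *pfn)
{
    if (!(vma->flags & MODEL_VM_PFNMAP))
        return -EINVAL;
    if (!vma->pfn_present)
        vma->pfn_present = true;  /* stands in for the fixup_user_fault() step */
    *pfn = vma->pfn;
    return 0;
}

/* Models the map request with and without the host-side workaround. */
static int model_ioas_map(struct model_vma *vma, bool have_workaround, uint64_t *pfn)
{
    assert(vma != NULL && pfn != NULL);
    if (have_workaround && (vma->flags & MODEL_VM_PFNMAP))
        return model_follow_fault_pfn(vma, pfn);
    return model_pin(vma, pfn);
}
```

For a VMA carrying both flags, `model_ioas_map()` returns -EFAULT (reported as "Bad address") without the workaround and 0 with it, which mirrors the earlier warning and the trace above.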

In the next vfio-cxl QEMU support series I will check whether any additional fix is required for this.

(cherry picked from commit 65642d3)

Signed-off-by: Manish Honap <[email protected]>
(cherry picked from commit 65642d3)
nvmochs commented Jun 23, 2026

Collaborator

@mmhonap - Thanks for submitting this change.

Since the PR is just a pick from your branch that was used for 10.1, I went ahead and picked this from the merged version in nvidia_unstable-10.1 (it picked clean). I fixed up the reverted SHA in the commit message to match the SHA on the 11.0 branch.

Merged, closing PR.

54a7b9618b79 (HEAD -> nvidia_unstable-11.0, nvidia/nvidia_unstable-11.0) 1:11.0.0+nvidia-unstable4-1
e612c2241f3e Revert "NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions"

nvmochs closed this Jun 23, 2026