Revert "NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions" by mmhonap · Pull Request #26 · NVIDIA/QEMU

mmhonap · 2026-06-23T05:36:55Z

This reverts commit d814a45.

The commit added an early return in vfio_container_region_add() for any RAM-device section owned by a VFIO device
(vfio_get_vfio_device(memory_region_owner(section->mr)) != NULL), skipping vfio_container_dma_map() for it. In practice that excludes every VFIO mmap subregion from the IOMMU IOAS (the SMMU Stage-2 page tables): PCI BAR windows, and the CXL.mem coherent device memory of a CXL Type-2 device.
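
To make the effect of that early return concrete, here is a minimal, self-contained sketch of the decision it introduced. It is a model under stated assumptions, not QEMU code: `struct model_section`, `reverted_commit_skips()` and `gets_dma_mapping()` are hypothetical names invented for illustration, and the model assumes the DMA map call succeeds whenever it is reached.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory region section; not QEMU's real type. */
struct model_section {
    bool is_ram_device;      /* mmap'd device memory, e.g. a BAR or the HDM window */
    const void *vfio_owner;  /* non-NULL when the owner is a VFIO device */
};

/* The condition the reverted commit used for its early return. */
static bool reverted_commit_skips(const struct model_section *s)
{
    return s->is_ram_device && s->vfio_owner != NULL;
}

/* Whether a section ends up with a Stage-2 (IOAS) mapping in this model. */
static bool gets_dma_mapping(const struct model_section *s, bool skip_present)
{
    if (skip_present && reverted_commit_skips(s)) {
        return false;  /* early return: the DMA map call is never reached */
    }
    return true;       /* falls through to the DMA map call (assumed to succeed) */
}
```

In this model a VFIO-owned RAM-device section (a BAR or the HDM window) gets no mapping while the skip is present, and ordinary guest RAM is unaffected either way.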

The commit was added because, in earlier testing, I was seeing these errors in the QEMU log during device boot:

qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?

When testing with kernel 6.17.9, I had tested only CXL mode, and only with CUDA tests. This was a testing miss on my side: I did not run the UVM ATS tests or verify PCI-E mode with my QEMU patches.

Earlier, during debugging, I misdiagnosed the issue. I concluded that the mapping would always fail because the backing VMAs are VM_IO | VM_PFNMAP, pin_user_pages() refuses VM_IO pages, and IOMMU_IOAS_MAP therefore returns -EFAULT. To get past that failure, I added the mapping skip. This was not correct, and the skip seems to break CUDA UVM ATS on CXL Type-2 passthrough.

The bug caused by mapping skip
------------------------------

In accelerated nested SMMUv3 mode the GPU translates shared virtual addresses through the hardware SMMU (Stage-1 is the guest page tables, Stage-2 is the host iommufd tables). When CUDA UVM migrates a managed buffer into the device's coherent memory, that page's guest-physical address lies in the CXL HDM window. The GPU reaches it with an ATS request, and to answer that request the SMMU must complete the Stage-1 and Stage-2 walk. With the HDM region skipped there is no Stage-2 entry, so the translation faults. The GPU posts a replayable fault, UVM services it and replays, the access faults again, and the GPU spins in a fault/service/replay livelock. The guest test hangs, and on cancel it reports "Xid 31 ... FAULT_PTE ACCESS_TYPE_VIRT_WRITE". Non-ATS workloads and PCI-E-mode (nvgrace-gpu) passthrough are unaffected, because they never reach the unmapped path.
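
The livelock described above can be sketched as a small model. This is an illustration under assumptions, not driver or SMMU code: the HDM window constants come from the trace in this PR, while `gpa_in_hdm()`, `stage2_translates()` and `access_with_replay()` are hypothetical helpers. The model treats guest RAM as always mapped at Stage-2 and assumes that servicing a fault can only change guest-side (Stage-1) state.

```c
#include <stdbool.h>
#include <stdint.h>

/* HDM window taken from the trace in this PR; everything else is hypothetical. */
#define HDM_BASE 0x80000000000ULL
#define HDM_SIZE 0x2330000000ULL

static bool gpa_in_hdm(uint64_t gpa)
{
    return gpa >= HDM_BASE && gpa - HDM_BASE < HDM_SIZE;
}

/* Stage-2 lookup: guest RAM is assumed mapped; HDM only if it was not skipped. */
static bool stage2_translates(uint64_t gpa, bool hdm_mapped)
{
    return gpa_in_hdm(gpa) ? hdm_mapped : true;
}

/*
 * Replay loop: servicing a fault only repairs guest-side (Stage-1) state,
 * so it cannot create a missing Stage-2 entry. Returns the number of
 * attempts needed, or -1 if max_replays was exhausted (the livelock case).
 */
static int access_with_replay(uint64_t gpa, bool hdm_mapped, int max_replays)
{
    for (int attempt = 1; attempt <= max_replays; attempt++) {
        if (stage2_translates(gpa, hdm_mapped)) {
            return attempt;
        }
        /* fault posted, serviced, replayed: nothing changed at Stage-2 */
    }
    return -1;
}
```

With `hdm_mapped` false, an access to any address inside the window never completes no matter how large `max_replays` is, which is the hang seen in the guest. Addresses outside the window succeed on the first attempt, matching the observation that workloads which stay off the unmapped path are unaffected.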

IOMMU_IOAS_MAP now succeeds for VM_PFNMAP GPU device memory on kernel 6.17.13 because of the host-side workaround:

68a70c30f8ce ("NVIDIA: SAUCE: WAR: iommufd/pages: Bypass PFNMAP")

That patch teaches iommufd's pfn_reader to handle VM_PFNMAP VMAs. Instead of calling pin_user_pages() (which does refuse VM_IO), it calls follow_fault_pfn(), which uses follow_pfnmap_start(), faults the lazily inserted PFN in through fixup_user_fault(), and takes the raw struct-page-less PFN. The premise of d814a45 was therefore only true on a kernel without this workaround. With it present, IOMMU_IOAS_MAP for the HDM region returns success (0), as the QEMU trace shows:

iommufd_backend_map_dma ... iova=0x80000000000 size=0x2330000000 ... readonly=0 (0)
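
The difference between the two kernels can be summarised with a small model of the map result. This is a sketch under assumptions and not kernel code: `model_ioas_map()` and the `MODEL_VM_*` bits are hypothetical and exist only for this illustration. The only behaviour modelled is what the text above states: the page-pinning path refuses VM_IO, and the workaround resolves VM_PFNMAP VMAs without pinning.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits for this model only; not taken from kernel headers. */
#define MODEL_VM_IO     (1u << 0)
#define MODEL_VM_PFNMAP (1u << 1)

/*
 * Models the result of mapping one VMA into the IOAS.
 * Returns 0 on success or -EFAULT ("Bad address") on refusal.
 */
static int model_ioas_map(unsigned int vma_flags, bool pfnmap_workaround)
{
    if (vma_flags & MODEL_VM_PFNMAP) {
        /* With the workaround, the PFN is resolved without pinning pages. */
        return pfnmap_workaround ? 0 : -EFAULT;
    }
    if (vma_flags & MODEL_VM_IO) {
        return -EFAULT;  /* the page-pinning path refuses VM_IO */
    }
    return 0;            /* ordinary memory: pinning works */
}
```

For a VM_IO | VM_PFNMAP VMA the model returns -EFAULT without the workaround, which corresponds to the "Bad address" warning quoted earlier, and 0 with it, which corresponds to the `(0)` at the end of the trace line.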

In the next revision of the vfio-cxl QEMU support series, I will check whether any additional fix is required for this.

(cherry picked from commit 65642d3)

Signed-off-by: Manish Honap <[email protected]>

nvmochs · 2026-06-23T14:41:27Z

@mmhonap - Thanks for submitting this change.

Since the PR is just a pick from your branch that was used for 10.1, I went ahead and picked this from the merged version in nvidia_unstable-10.1 (it picked clean). I fixed up the reverted SHA in the commit message to match the SHA on the 11.0 branch.

Merged, closing PR.

54a7b9618b79 (HEAD -> nvidia_unstable-11.0, nvidia/nvidia_unstable-11.0) 1:11.0.0+nvidia-unstable4-1
e612c2241f3e Revert "NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions"

nvmochs closed this Jun 23, 2026

mmhonap wants to merge 1 commit into NVIDIA:nvidia_unstable-11.0 from mmhonap:dma-map-skip-fix-q11
