Revert "NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions" #26
Closed
mmhonap wants to merge 1 commit into
…wned RAM-device regions"

This reverts commit d814a45.

The commit added an early return in vfio_container_region_add() for any RAM-device section owned by a VFIO device (vfio_get_vfio_device(memory_region_owner(section->mr)) != NULL), skipping vfio_container_dma_map() for it. In practice that excludes every VFIO mmap subregion from the IOMMU IOAS (the SMMU Stage-2 page tables): PCI BAR windows and the CXL.mem coherent device memory of a CXL Type-2 device.

The commit was added because, during earlier testing, I was seeing these errors in the QEMU log during device boot:

qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?

While testing with kernel 6.17.9, I had tested only CXL mode and only CUDA tests. This was a testing miss on my side: I did not run the UVM ATS tests or verify PCI-E mode with my QEMU patches.

During that debugging I misdiagnosed the issue. I concluded that the mapping always failed because the backing VMAs are VM_IO | VM_PFNMAP, pin_user_pages() refuses VM_IO pages, and IOMMU_IOAS_MAP therefore returns -EFAULT. To work around this failure I added the mapping skip. That was not correct, and the skip seems to break CUDA UVM ATS on CXL Type-2 passthrough.

The bug caused by mapping skip
------------------------------

In accelerated nested SMMUv3 mode the GPU translates shared virtual addresses through the hardware SMMU (Stage-1 is the guest page tables, Stage-2 is the host iommufd tables). When CUDA UVM migrates a managed buffer into the device's coherent memory, that page's guest-physical address lies in the CXL HDM window. The GPU reaches it with an ATS request, and to answer that request the SMMU must complete the Stage-1 and Stage-2 walk. With the HDM region skipped there is no Stage-2 entry, so the translation faults. The GPU posts a replayable fault, UVM services it and replays, the access faults again, and the GPU spins in a fault/service/replay livelock.
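The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the types and functions below (struct region, struct stage2, region_add, stage2_translate, access_with_replay, run_scenario) are hypothetical and are not QEMU, iommufd, or driver APIs. The guest RAM range is made up, and the replay bound exists only so the model terminates. The HDM base and size are taken from the QEMU trace quoted in this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: nothing here is a real QEMU or kernel interface. */
#define MAX_S2_ENTRIES 8

struct region {
    uint64_t gpa;                /* guest-physical base address */
    uint64_t size;               /* length in bytes */
    bool vfio_owned_ram_device;  /* stands in for the RAM-device + VFIO-owner check */
};

struct stage2 {
    struct region map[MAX_S2_ENTRIES];
    size_t count;
};

/* Models the memory listener: with the skip enabled, VFIO-owned
 * RAM-device regions never get a Stage-2 mapping. */
static void region_add(struct stage2 *s2, const struct region *r,
                       bool skip_vfio_owned)
{
    if (skip_vfio_owned && r->vfio_owned_ram_device) {
        return; /* models the early return that this revert removes */
    }
    assert(s2->count < MAX_S2_ENTRIES);
    s2->map[s2->count++] = *r;
}

/* Returns true if the guest-physical address has a Stage-2 entry. */
static bool stage2_translate(const struct stage2 *s2, uint64_t gpa)
{
    for (size_t i = 0; i < s2->count; i++) {
        const struct region *r = &s2->map[i];
        if (gpa >= r->gpa && gpa - r->gpa < r->size) {
            return true;
        }
    }
    return false;
}

/* Models fault/service/replay. Servicing a fault only changes guest-side
 * state, so a missing Stage-2 entry is still missing on every replay.
 * Returns the number of replays needed, or -1 if the bound is reached. */
static int access_with_replay(const struct stage2 *s2, uint64_t gpa,
                              int max_replays)
{
    for (int attempt = 0; attempt <= max_replays; attempt++) {
        if (stage2_translate(s2, gpa)) {
            return attempt;
        }
    }
    return -1;
}

int run_scenario(bool skip_vfio_owned)
{
    struct stage2 s2 = { .count = 0 };
    /* Assumed guest RAM range, chosen only for illustration. */
    struct region guest_ram = { 0x40000000ULL, 0x80000000ULL, false };
    /* HDM window: base and size from the trace in this commit message. */
    struct region hdm = { 0x80000000000ULL, 0x2330000000ULL, true };

    region_add(&s2, &guest_ram, skip_vfio_owned);
    region_add(&s2, &hdm, skip_vfio_owned);
    return access_with_replay(&s2, hdm.gpa + 0x1000, 1000);
}
```

In this model, run_scenario(true) never finds a Stage-2 entry for the HDM address and gives up at the replay bound, while run_scenario(false) translates on the first attempt. The real hardware has no such bound, which is why the guest test hangs.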
The guest test hangs, and on cancel it reports "Xid 31 ... FAULT_PTE ACCESS_TYPE_VIRT_WRITE". Non-ATS workloads and PCI-E-mode (nvgrace-gpu) passthrough are unaffected, because they never reach the unmapped path.

IOMMU_IOAS_MAP now succeeds for VM_PFNMAP GPU device memory on kernel 6.17.13 because of the host-side workaround:

68a70c30f8ce ("NVIDIA: SAUCE: WAR: iommufd/pages: Bypass PFNMAP")

That patch teaches iommufd's pfn_reader to handle VM_PFNMAP VMAs. Instead of pin_user_pages() (which does refuse VM_IO), it calls follow_fault_pfn(), which uses follow_pfnmap_start(), faults the lazily inserted PFN in through fixup_user_fault(), and takes the raw struct-page-less PFN. The premise of d814a45 was therefore only true on a kernel without this workaround. With it present, IOMMU_IOAS_MAP for the HDM region returns success (0), as the QEMU trace shows:

iommufd_backend_map_dma ... iova=0x80000000000 size=0x2330000000 ... readonly=0 (0)

I will check in the next vfio-cxl QEMU support series whether any additional fix is required for this.

Signed-off-by: Manish Honap <[email protected]>
(cherry picked from commit 65642d3)
Collaborator
@mmhonap - Thanks for submitting this change. Since the PR is just a pick from your branch that was used for 10.1, I went ahead and picked this from the merged version in nvidia_unstable-10.1 (it picked clean). I fixed up the reverted SHA in the commit message to match the SHA on the 11.0 branch. Merged, closing PR.