vfio-cxl: Program HDM decoder 0 with the per-device CFMWS base by mmhonap · Pull Request #27 · NVIDIA/QEMU

mmhonap · 2026-07-01T17:45:56Z

Consider a sample QEMU config as this:

-machine virt,accel=kvm,gic-version=3,hmat=on,cxl=on,ras=on,highmem-mmio-size=4T,\ cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=256G,
cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=256G,
cxl-fmw.2.targets.0=cxl.3,cxl-fmw.2.size=256G,
cxl-fmw.3.targets.0=cxl.4,cxl-fmw.3.size=256G \

-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1
-device cxl-rp,port=1,bus=cxl.1,id=rport0.1,chassis=4,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.1,id=smmuv3.0,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:81:00.0,bus=rport0.1,id=dev0,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2
-device cxl-rp,port=1,bus=cxl.2,id=rport0.2,chassis=5,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.2,id=smmuv3.1,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:c1:00.0,bus=rport0.2,id=dev1,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=28,bus=pcie.0,id=cxl.3
-device cxl-rp,port=1,bus=cxl.3,id=rport0.3,chassis=6,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.3,id=smmuv3.2,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:81:00.0,bus=rport0.3,id=dev2,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=36,bus=pcie.0,id=cxl.4
-device cxl-rp,port=1,bus=cxl.4,id=rport0.4,chassis=7,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.4,id=smmuv3.3,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:e1:00.0,bus=rport0.4,id=dev3,iommufd=iommufd0 \

For above config, setup_locked_hdm() programmed every firmware-committed VFIO CXL device's HDM decoder 0 (and mapped its DPA window) at the global cxl_fmws_base, which is the base of the first CXL Fixed Memory Window only. With more than one passthrough GPU -- each behind its own pxb-cxl and its own single-target CFMWS -- every device was placed at window 0's base.

In the guest this makes all endpoint decoders report the same HPA, so the CXL core can assemble only one region: the second (and later) endpoints fail to attach ("failed to find decoder mapping", -ENXIO/-EINVAL) and their device drivers never initialize.

Select the base of the window that actually targets the device instead. Walk the device's ancestor bridges up to its root pxb-cxl host bridge and match it against each window's target. Matching is by topology, not by window index or device-creation order, so an N-GPU / N-window layout places each device in its own window.

The supported scope is one VFIO-CXL device per single-target CFMWS, and selection fails closed for anything outside it -- there is no global first-window fallback. vfio_cxl_find_fmws_base() counts the matching windows and reports the matched window's base, size and target count; vfio_cxl_program_locked_hdm() then requires a unique single-target match, and otherwise reports an error that the machine_done notifier turns into a fatal startup failure (so a misconfigured device cannot boot half set up). It rejects:

no CFMWS targets the device (even on a single-window machine),
the host bridge listed in more than one CFMWS (ambiguous),
an interleaved (multi-target) window (the flat 1:1 DPA mapping cannot honour interleave ways/granularity),
a matched window with no assigned base (it did not fit the PA space),
a CFMWS smaller than the device DPA region (would expose DPA outside the window advertised to the guest via CEDT).

Match against fw->targets[] (the target names, populated when the window object is created) rather than fw->target_hbs[] (the resolved pointers): target_hbs[] is filled in by cxl_fmws_link_targets(), which on the ARM virt machine runs from virt_machine_done() - another machine_done notifier that may execute after setup_locked_hdm(), leaving target_hbs[] NULL at this point. Resolving the target name on demand is independent of machine_done notifier ordering.

Verified with up to four GPUs passed through, each on its own single-target CFMWS: every endpoint lands in its own window with a distinct base:

/sys/bus/cxl/devices/decoder5.0/start = 0x80000000000
/sys/bus/cxl/devices/decoder6.0/start = 0x84000000000
/sys/bus/cxl/devices/decoder7.0/start = 0x88000000000
/sys/bus/cxl/devices/decoder8.0/start = 0x8c000000000

and all guest GPU drivers initialize.

RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl targeted by a passthrough CFMWS. Multiple passthrough endpoints below the same host bridge would need per-endpoint decoder/window allocation and are not supported by this RFC.

Fixes: d19409f ("NVIDIA: VR: SAUCE: hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices")

…vice CFMWS base Consider a sample QEMU config as this: -machine virt,accel=kvm,gic-version=3,hmat=on,cxl=on,ras=on,highmem-mmio-size=4T,\ cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=256G,\ cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=256G,\ cxl-fmw.2.targets.0=cxl.3,cxl-fmw.2.size=256G,\ cxl-fmw.3.targets.0=cxl.4,cxl-fmw.3.size=256G \ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ -device cxl-rp,port=1,bus=cxl.1,id=rport0.1,chassis=4,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.1,id=smmuv3.0,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:81:00.0,bus=rport0.1,id=dev0,iommufd=iommufd0 \ -device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2 \ -device cxl-rp,port=1,bus=cxl.2,id=rport0.2,chassis=5,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.2,id=smmuv3.1,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:c1:00.0,bus=rport0.2,id=dev1,iommufd=iommufd0 \ -device pxb-cxl,bus_nr=28,bus=pcie.0,id=cxl.3 \ -device cxl-rp,port=1,bus=cxl.3,id=rport0.3,chassis=6,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.3,id=smmuv3.2,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:81:00.0,bus=rport0.3,id=dev2,iommufd=iommufd0 \ -device pxb-cxl,bus_nr=36,bus=pcie.0,id=cxl.4 \ -device cxl-rp,port=1,bus=cxl.4,id=rport0.4,chassis=7,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.4,id=smmuv3.3,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:e1:00.0,bus=rport0.4,id=dev3,iommufd=iommufd0 \ For above config, setup_locked_hdm() programmed every firmware-committed VFIO CXL device's HDM decoder 0 (and mapped its DPA window) at the global cxl_fmws_base, which is the base of the *first* CXL Fixed Memory Window only. With more than one passthrough GPU -- each behind its own pxb-cxl and its own single-target CFMWS -- every device was placed at window 0's base. In the guest this makes all endpoint decoders report the same HPA, so the CXL core can assemble only one region: the second (and later) endpoints fail to attach ("failed to find decoder mapping", -ENXIO/-EINVAL) and their device drivers never initialize. Select the base of the window that actually targets the device instead. Walk the device's ancestor bridges up to its root pxb-cxl host bridge and match it against each window's target. Matching is by topology, not by window index or device-creation order, so an N-GPU / N-window layout places each device in its own window. The supported scope is one VFIO-CXL device per single-target CFMWS, and selection fails closed for anything outside it -- there is no global first-window fallback. vfio_cxl_find_fmws_base() counts the matching windows and reports the matched window's base, size and target count; vfio_cxl_program_locked_hdm() then requires a unique single-target match, and otherwise reports an error that the machine_done notifier turns into a fatal startup failure (so a misconfigured device cannot boot half set up). It rejects: - no CFMWS targets the device (even on a single-window machine), - the host bridge listed in more than one CFMWS (ambiguous), - an interleaved (multi-target) window (the flat 1:1 DPA mapping cannot honour interleave ways/granularity), - a matched window with no assigned base (it did not fit the PA space), - a CFMWS smaller than the device DPA region (would expose DPA outside the window advertised to the guest via CEDT). Match against fw->targets[] (the target *names*, populated when the window object is created) rather than fw->target_hbs[] (the resolved pointers): target_hbs[] is filled in by cxl_fmws_link_targets(), which on the ARM virt machine runs from virt_machine_done() -- another machine_done notifier that may execute after setup_locked_hdm(), leaving target_hbs[] NULL at this point. Resolving the target name on demand is independent of machine_done notifier ordering. Verified with up to four GPUs passed through, each on its own single-target CFMWS: every endpoint lands in its own window with a distinct base -- /sys/bus/cxl/devices/decoder5.0/start = 0x80000000000 /sys/bus/cxl/devices/decoder6.0/start = 0x84000000000 /sys/bus/cxl/devices/decoder7.0/start = 0x88000000000 /sys/bus/cxl/devices/decoder8.0/start = 0x8c000000000 -- and all guest GPU drivers initialize. RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl targeted by a passthrough CFMWS. Multiple passthrough endpoints below the same host bridge would need per-endpoint decoder/window allocation and are not supported by this RFC. Fixes: d19409f ("NVIDIA: VR: SAUCE: hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices") Signed-off-by: Manish Honap <[email protected]>

shamiali2008 · 2026-07-02T15:16:14Z

+            break;
+        }
+        bus = pci_get_bus(bridge);
+    }


Do we need this walk here to find the cxl host bridge? Does something like will work?

root_bus = pci_device_root_bus(pdev);
pxb = root_bus->parent_dev;

if (!pxb ||
!object_dynamic_cast(OBJECT(pxb), TYPE_PXB_CXL_DEV)) {
goto out;
}
for (iter = list; iter; iter = iter->next) {
CXLFixedWindow *fw = CXL_FMW(iter->data);
int i;
.....

Agreed, the walk is unnecessary. Since only a pxb-cxl can be a CFMWS target and the RFC scope is a directly-attached endpoint (no CXL switch), there's nothing between the endpoint and its host bridge that the walk needed to handle. Replaced it with your suggested form.

shamiali2008 · 2026-07-02T15:21:08Z

+                                              &nwindows, &ntargets);
+
+        if (matches == 1 && ntargets == 1) {
+            /*


RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl
targeted by a passthrough CFMWS. Multiple passthrough endpoints below the
same host bridge would need per-endpoint decoder/window allocation and are
not supported by this RFC. --> Suppose you have multiple GPUs under same pxb-cxl and points to same CFWMS, that case will enter here right? But the commit log says not supported.

Yes, you are right. Multiple endpoints under the same pxb-cxl is contradicting the commit log. Let me handle this case.

nvmochs · 2026-07-02T20:04:01Z

@mmhonap In addition to the bus walk issue that Shameer pointed out, I also found this while reviewing with Codex:

Regular vfio-pci hotplug of a VFIO-CXL device can terminate an already-running VM if CFMWS validation or HDM programming fails.

The patch moves VFIO-CXL HDM setup into a machine_done notifier path. That is safe during startup because a fatal CXL topology error should stop QEMU before the guest runs. But vfio-pci is still hotpluggable by default, and only vfio-pci-nohotplug explicitly disables hotplug at hw/vfio/pci.c:4581. A cxl-rp secondary bus is also hotpluggable by default through the normal PCIe slot hotplug path.

On post-boot device_add vfio-pci,..., vfio_pci_realize() calls vfio_cxl_setup() at hw/vfio/pci.c:4054. vfio_cxl_setup() registers the machine_done notifier at hw/vfio/pci.c:3791. Since the machine is already in PHASE_MACHINE_READY, qemu_add_machine_init_done_notifier() immediately invokes the callback (hw/core/machine.c:1701). If vfio_cxl_program_locked_hdm() fails, setup_locked_hdm() calls exit(EXIT_FAILURE) at hw/vfio/pci.c:3684.

That means a bad hotplug attempt can kill the whole VM instead of returning a device_add error. The reported launch command uses vfio-pci-nohotplug, so it is not affected directly, but the shared vfio-pci implementation is.

Recommended fix: either reject VFIO-CXL hotplug explicitly in vfio_cxl_setup() when DEVICE(vdev)->hotplugged is true, or avoid notifier registration after machine-ready and instead call the programming function directly with normal errp propagation.

shamiali2008 reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

vfio-cxl: Program HDM decoder 0 with the per-device CFMWS base#27

vfio-cxl: Program HDM decoder 0 with the per-device CFMWS base#27
mmhonap wants to merge 1 commit into
NVIDIA:nvidia_unstable-10.1from
mmhonap:cxl-fmws-fix

mmhonap commented Jul 1, 2026

Uh oh!

shamiali2008 Jul 2, 2026

Uh oh!

mmhonap Jul 2, 2026 •

edited

Loading

Uh oh!

shamiali2008 Jul 2, 2026

Uh oh!

mmhonap Jul 2, 2026

Uh oh!

nvmochs commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

mmhonap commented Jul 1, 2026

Uh oh!

shamiali2008 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

mmhonap Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shamiali2008 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

mmhonap Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

nvmochs commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mmhonap Jul 2, 2026 •

edited

Loading