Skip to content

vfio-cxl: Program HDM decoder 0 with the per-device CFMWS base#27

Open
mmhonap wants to merge 1 commit into
NVIDIA:nvidia_unstable-10.1from
mmhonap:cxl-fmws-fix
Open

vfio-cxl: Program HDM decoder 0 with the per-device CFMWS base#27
mmhonap wants to merge 1 commit into
NVIDIA:nvidia_unstable-10.1from
mmhonap:cxl-fmws-fix

Conversation

@mmhonap

@mmhonap mmhonap commented Jul 1, 2026

Copy link
Copy Markdown

Consider a sample QEMU config as this:

-machine virt,accel=kvm,gic-version=3,hmat=on,cxl=on,ras=on,highmem-mmio-size=4T,\ cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=256G,
cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=256G,
cxl-fmw.2.targets.0=cxl.3,cxl-fmw.2.size=256G,
cxl-fmw.3.targets.0=cxl.4,cxl-fmw.3.size=256G \

-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1
-device cxl-rp,port=1,bus=cxl.1,id=rport0.1,chassis=4,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.1,id=smmuv3.0,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:81:00.0,bus=rport0.1,id=dev0,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2
-device cxl-rp,port=1,bus=cxl.2,id=rport0.2,chassis=5,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.2,id=smmuv3.1,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:c1:00.0,bus=rport0.2,id=dev1,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=28,bus=pcie.0,id=cxl.3
-device cxl-rp,port=1,bus=cxl.3,id=rport0.3,chassis=6,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.3,id=smmuv3.2,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:81:00.0,bus=rport0.3,id=dev2,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=36,bus=pcie.0,id=cxl.4
-device cxl-rp,port=1,bus=cxl.4,id=rport0.4,chassis=7,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.4,id=smmuv3.3,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:e1:00.0,bus=rport0.4,id=dev3,iommufd=iommufd0 \

For above config, setup_locked_hdm() programmed every firmware-committed VFIO CXL device's HDM decoder 0 (and mapped its DPA window) at the global cxl_fmws_base, which is the base of the first CXL Fixed Memory Window only. With more than one passthrough GPU -- each behind its own pxb-cxl and its own single-target CFMWS -- every device was placed at window 0's base.

In the guest this makes all endpoint decoders report the same HPA, so the CXL core can assemble only one region: the second (and later) endpoints fail to attach ("failed to find decoder mapping", -ENXIO/-EINVAL) and their device drivers never initialize.

Select the base of the window that actually targets the device instead. Walk the device's ancestor bridges up to its root pxb-cxl host bridge and match it against each window's target. Matching is by topology, not by window index or device-creation order, so an N-GPU / N-window layout places each device in its own window.

The supported scope is one VFIO-CXL device per single-target CFMWS, and selection fails closed for anything outside it -- there is no global first-window fallback. vfio_cxl_find_fmws_base() counts the matching windows and reports the matched window's base, size and target count; vfio_cxl_program_locked_hdm() then requires a unique single-target match, and otherwise reports an error that the machine_done notifier turns into a fatal startup failure (so a misconfigured device cannot boot half set up). It rejects:

  • no CFMWS targets the device (even on a single-window machine),
  • the host bridge listed in more than one CFMWS (ambiguous),
  • an interleaved (multi-target) window (the flat 1:1 DPA mapping cannot honour interleave ways/granularity),
  • a matched window with no assigned base (it did not fit the PA space),
  • a CFMWS smaller than the device DPA region (would expose DPA outside the window advertised to the guest via CEDT).

Match against fw->targets[] (the target names, populated when the window object is created) rather than fw->target_hbs[] (the resolved pointers): target_hbs[] is filled in by cxl_fmws_link_targets(), which on the ARM virt machine runs from virt_machine_done() - another machine_done notifier that may execute after setup_locked_hdm(), leaving target_hbs[] NULL at this point. Resolving the target name on demand is independent of machine_done notifier ordering.

Verified with up to four GPUs passed through, each on its own single-target CFMWS: every endpoint lands in its own window with a distinct base:

/sys/bus/cxl/devices/decoder5.0/start = 0x80000000000
/sys/bus/cxl/devices/decoder6.0/start = 0x84000000000
/sys/bus/cxl/devices/decoder7.0/start = 0x88000000000
/sys/bus/cxl/devices/decoder8.0/start = 0x8c000000000

and all guest GPU drivers initialize.

RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl targeted by a passthrough CFMWS. Multiple passthrough endpoints below the same host bridge would need per-endpoint decoder/window allocation and are not supported by this RFC.

Fixes: d19409f ("NVIDIA: VR: SAUCE: hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices")

…vice CFMWS base

Consider a sample QEMU config as this:

-machine virt,accel=kvm,gic-version=3,hmat=on,cxl=on,ras=on,highmem-mmio-size=4T,\
cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=256G,\
cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=256G,\
cxl-fmw.2.targets.0=cxl.3,cxl-fmw.2.size=256G,\
cxl-fmw.3.targets.0=cxl.4,cxl-fmw.3.size=256G \

-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=1,bus=cxl.1,id=rport0.1,chassis=4,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.1,id=smmuv3.0,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=0002:81:00.0,bus=rport0.1,id=dev0,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2 \
-device cxl-rp,port=1,bus=cxl.2,id=rport0.2,chassis=5,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.2,id=smmuv3.1,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=0002:c1:00.0,bus=rport0.2,id=dev1,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=28,bus=pcie.0,id=cxl.3 \
-device cxl-rp,port=1,bus=cxl.3,id=rport0.3,chassis=6,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.3,id=smmuv3.2,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=000a:81:00.0,bus=rport0.3,id=dev2,iommufd=iommufd0 \

-device pxb-cxl,bus_nr=36,bus=pcie.0,id=cxl.4 \
-device cxl-rp,port=1,bus=cxl.4,id=rport0.4,chassis=7,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.4,id=smmuv3.3,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=000a:e1:00.0,bus=rport0.4,id=dev3,iommufd=iommufd0 \

For above config, setup_locked_hdm() programmed every firmware-committed VFIO
CXL device's HDM decoder 0 (and mapped its DPA window) at the global
cxl_fmws_base, which is the base of the *first* CXL Fixed Memory Window only.
With more than one passthrough GPU -- each behind its own pxb-cxl and its own
single-target CFMWS -- every device was placed at window 0's base.

In the guest this makes all endpoint decoders report the same HPA, so the
CXL core can assemble only one region: the second (and later) endpoints
fail to attach ("failed to find decoder mapping", -ENXIO/-EINVAL) and
their device drivers never initialize.

Select the base of the window that actually targets the device instead.
Walk the device's ancestor bridges up to its root pxb-cxl host bridge and
match it against each window's target. Matching is by topology, not by
window index or device-creation order, so an N-GPU / N-window layout
places each device in its own window.

The supported scope is one VFIO-CXL device per single-target CFMWS, and
selection fails closed for anything outside it -- there is no global
first-window fallback. vfio_cxl_find_fmws_base() counts the matching
windows and reports the matched window's base, size and target count;
vfio_cxl_program_locked_hdm() then requires a unique single-target match,
and otherwise reports an error that the machine_done notifier turns into a
fatal startup failure (so a misconfigured device cannot boot half set up).
It rejects:
  - no CFMWS targets the device (even on a single-window machine),
  - the host bridge listed in more than one CFMWS (ambiguous),
  - an interleaved (multi-target) window (the flat 1:1 DPA mapping cannot
    honour interleave ways/granularity),
  - a matched window with no assigned base (it did not fit the PA space),
  - a CFMWS smaller than the device DPA region (would expose DPA outside
    the window advertised to the guest via CEDT).

Match against fw->targets[] (the target *names*, populated when the window
object is created) rather than fw->target_hbs[] (the resolved pointers):
target_hbs[] is filled in by cxl_fmws_link_targets(), which on the ARM
virt machine runs from virt_machine_done() -- another machine_done
notifier that may execute after setup_locked_hdm(), leaving target_hbs[]
NULL at this point. Resolving the target name on demand is independent of
machine_done notifier ordering.

Verified with up to four GPUs passed through, each on its own single-target
CFMWS: every endpoint lands in its own window with a distinct base --

  /sys/bus/cxl/devices/decoder5.0/start = 0x80000000000
  /sys/bus/cxl/devices/decoder6.0/start = 0x84000000000
  /sys/bus/cxl/devices/decoder7.0/start = 0x88000000000
  /sys/bus/cxl/devices/decoder8.0/start = 0x8c000000000

-- and all guest GPU drivers initialize.

RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl
targeted by a passthrough CFMWS. Multiple passthrough endpoints below the
same host bridge would need per-endpoint decoder/window allocation and are
not supported by this RFC.

Fixes: d19409f ("NVIDIA: VR: SAUCE: hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices")
Signed-off-by: Manish Honap <[email protected]>
Comment thread hw/vfio/pci.c
break;
}
bus = pci_get_bus(bridge);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this walk here to find the cxl host bridge? Does something like will work?

root_bus = pci_device_root_bus(pdev);
pxb = root_bus->parent_dev;

if (!pxb ||
!object_dynamic_cast(OBJECT(pxb), TYPE_PXB_CXL_DEV)) {
goto out;
}
for (iter = list; iter; iter = iter->next) {
CXLFixedWindow *fw = CXL_FMW(iter->data);
int i;
.....

@mmhonap mmhonap Jul 2, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the walk is unnecessary. Since only a pxb-cxl can be a CFMWS target and the RFC scope is a directly-attached endpoint (no CXL switch), there's nothing between the endpoint and its host bridge that the walk needed to handle. Replaced it with your suggested form.

Comment thread hw/vfio/pci.c
&nwindows, &ntargets);

if (matches == 1 && ntargets == 1) {
/*

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl
targeted by a passthrough CFMWS. Multiple passthrough endpoints below the
same host bridge would need per-endpoint decoder/window allocation and are
not supported by this RFC. --> Suppose you have multiple GPUs under same pxb-cxl and points to same CFWMS, that case will enter here right? But the commit log says not supported.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. Multiple endpoints under the same pxb-cxl is contradicting the commit log. Let me handle this case.

@nvmochs

nvmochs commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

@mmhonap In addition to the bus walk issue that Shameer pointed out, I also found this while reviewing with Codex:

Regular vfio-pci hotplug of a VFIO-CXL device can terminate an already-running VM if CFMWS validation or HDM programming fails.

The patch moves VFIO-CXL HDM setup into a machine_done notifier path. That is safe during startup because a fatal CXL topology error should stop QEMU before the guest runs. But vfio-pci is still hotpluggable by default, and only vfio-pci-nohotplug explicitly disables hotplug at hw/vfio/pci.c:4581. A cxl-rp secondary bus is also hotpluggable by default through the normal PCIe slot hotplug path.

On post-boot device_add vfio-pci,..., vfio_pci_realize() calls vfio_cxl_setup() at hw/vfio/pci.c:4054. vfio_cxl_setup() registers the machine_done notifier at hw/vfio/pci.c:3791. Since the machine is already in PHASE_MACHINE_READY, qemu_add_machine_init_done_notifier() immediately invokes the callback (hw/core/machine.c:1701). If vfio_cxl_program_locked_hdm() fails, setup_locked_hdm() calls exit(EXIT_FAILURE) at hw/vfio/pci.c:3684.

That means a bad hotplug attempt can kill the whole VM instead of returning a device_add error. The reported launch command uses vfio-pci-nohotplug, so it is not affected directly, but the shared vfio-pci implementation is.

Recommended fix: either reject VFIO-CXL hotplug explicitly in vfio_cxl_setup() when DEVICE(vdev)->hotplugged is true, or avoid notifier registration after machine-ready and instead call the programming function directly with normal errp propagation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants