vfio-cxl: Program HDM decoder 0 with the per-device CFMWS base#27
vfio-cxl: Program HDM decoder 0 with the per-device CFMWS base#27mmhonap wants to merge 1 commit into
Conversation
…vice CFMWS base
Consider a sample QEMU config as this:
-machine virt,accel=kvm,gic-version=3,hmat=on,cxl=on,ras=on,highmem-mmio-size=4T,\
cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=256G,\
cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=256G,\
cxl-fmw.2.targets.0=cxl.3,cxl-fmw.2.size=256G,\
cxl-fmw.3.targets.0=cxl.4,cxl-fmw.3.size=256G \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=1,bus=cxl.1,id=rport0.1,chassis=4,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.1,id=smmuv3.0,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=0002:81:00.0,bus=rport0.1,id=dev0,iommufd=iommufd0 \
-device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2 \
-device cxl-rp,port=1,bus=cxl.2,id=rport0.2,chassis=5,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.2,id=smmuv3.1,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=0002:c1:00.0,bus=rport0.2,id=dev1,iommufd=iommufd0 \
-device pxb-cxl,bus_nr=28,bus=pcie.0,id=cxl.3 \
-device cxl-rp,port=1,bus=cxl.3,id=rport0.3,chassis=6,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.3,id=smmuv3.2,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=000a:81:00.0,bus=rport0.3,id=dev2,iommufd=iommufd0 \
-device pxb-cxl,bus_nr=36,bus=pcie.0,id=cxl.4 \
-device cxl-rp,port=1,bus=cxl.4,id=rport0.4,chassis=7,pref64-reserve=2G,mem-reserve=1G \
-device arm-smmuv3,primary-bus=cxl.4,id=smmuv3.3,accel=on,ats=on,ril=on,pasid=on,oas=48 \
-device vfio-pci-nohotplug,host=000a:e1:00.0,bus=rport0.4,id=dev3,iommufd=iommufd0 \
For above config, setup_locked_hdm() programmed every firmware-committed VFIO
CXL device's HDM decoder 0 (and mapped its DPA window) at the global
cxl_fmws_base, which is the base of the *first* CXL Fixed Memory Window only.
With more than one passthrough GPU -- each behind its own pxb-cxl and its own
single-target CFMWS -- every device was placed at window 0's base.
In the guest this makes all endpoint decoders report the same HPA, so the
CXL core can assemble only one region: the second (and later) endpoints
fail to attach ("failed to find decoder mapping", -ENXIO/-EINVAL) and
their device drivers never initialize.
Select the base of the window that actually targets the device instead.
Walk the device's ancestor bridges up to its root pxb-cxl host bridge and
match it against each window's target. Matching is by topology, not by
window index or device-creation order, so an N-GPU / N-window layout
places each device in its own window.
The supported scope is one VFIO-CXL device per single-target CFMWS, and
selection fails closed for anything outside it -- there is no global
first-window fallback. vfio_cxl_find_fmws_base() counts the matching
windows and reports the matched window's base, size and target count;
vfio_cxl_program_locked_hdm() then requires a unique single-target match,
and otherwise reports an error that the machine_done notifier turns into a
fatal startup failure (so a misconfigured device cannot boot half set up).
It rejects:
- no CFMWS targets the device (even on a single-window machine),
- the host bridge listed in more than one CFMWS (ambiguous),
- an interleaved (multi-target) window (the flat 1:1 DPA mapping cannot
honour interleave ways/granularity),
- a matched window with no assigned base (it did not fit the PA space),
- a CFMWS smaller than the device DPA region (would expose DPA outside
the window advertised to the guest via CEDT).
Match against fw->targets[] (the target *names*, populated when the window
object is created) rather than fw->target_hbs[] (the resolved pointers):
target_hbs[] is filled in by cxl_fmws_link_targets(), which on the ARM
virt machine runs from virt_machine_done() -- another machine_done
notifier that may execute after setup_locked_hdm(), leaving target_hbs[]
NULL at this point. Resolving the target name on demand is independent of
machine_done notifier ordering.
Verified with up to four GPUs passed through, each on its own single-target
CFMWS: every endpoint lands in its own window with a distinct base --
/sys/bus/cxl/devices/decoder5.0/start = 0x80000000000
/sys/bus/cxl/devices/decoder6.0/start = 0x84000000000
/sys/bus/cxl/devices/decoder7.0/start = 0x88000000000
/sys/bus/cxl/devices/decoder8.0/start = 0x8c000000000
-- and all guest GPU drivers initialize.
RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl
targeted by a passthrough CFMWS. Multiple passthrough endpoints below the
same host bridge would need per-endpoint decoder/window allocation and are
not supported by this RFC.
Fixes: d19409f ("NVIDIA: VR: SAUCE: hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices")
Signed-off-by: Manish Honap <[email protected]>
| break; | ||
| } | ||
| bus = pci_get_bus(bridge); | ||
| } |
There was a problem hiding this comment.
Do we need this walk here to find the cxl host bridge? Does something like will work?
root_bus = pci_device_root_bus(pdev);
pxb = root_bus->parent_dev;
if (!pxb ||
!object_dynamic_cast(OBJECT(pxb), TYPE_PXB_CXL_DEV)) {
goto out;
}
for (iter = list; iter; iter = iter->next) {
CXLFixedWindow *fw = CXL_FMW(iter->data);
int i;
.....
There was a problem hiding this comment.
Agreed, the walk is unnecessary. Since only a pxb-cxl can be a CFMWS target and the RFC scope is a directly-attached endpoint (no CXL switch), there's nothing between the endpoint and its host bridge that the walk needed to handle. Replaced it with your suggested form.
| &nwindows, &ntargets); | ||
|
|
||
| if (matches == 1 && ntargets == 1) { | ||
| /* |
There was a problem hiding this comment.
RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl
targeted by a passthrough CFMWS. Multiple passthrough endpoints below the
same host bridge would need per-endpoint decoder/window allocation and are
not supported by this RFC. --> Suppose you have multiple GPUs under same pxb-cxl and points to same CFWMS, that case will enter here right? But the commit log says not supported.
There was a problem hiding this comment.
Yes, you are right. Multiple endpoints under the same pxb-cxl is contradicting the commit log. Let me handle this case.
|
@mmhonap In addition to the bus walk issue that Shameer pointed out, I also found this while reviewing with Codex: Regular vfio-pci hotplug of a VFIO-CXL device can terminate an already-running VM if CFMWS validation or HDM programming fails. The patch moves VFIO-CXL HDM setup into a machine_done notifier path. That is safe during startup because a fatal CXL topology error should stop QEMU before the guest runs. But vfio-pci is still hotpluggable by default, and only vfio-pci-nohotplug explicitly disables hotplug at hw/vfio/pci.c:4581. A cxl-rp secondary bus is also hotpluggable by default through the normal PCIe slot hotplug path. On post-boot device_add vfio-pci,..., vfio_pci_realize() calls vfio_cxl_setup() at hw/vfio/pci.c:4054. vfio_cxl_setup() registers the machine_done notifier at hw/vfio/pci.c:3791. Since the machine is already in PHASE_MACHINE_READY, qemu_add_machine_init_done_notifier() immediately invokes the callback (hw/core/machine.c:1701). If vfio_cxl_program_locked_hdm() fails, setup_locked_hdm() calls exit(EXIT_FAILURE) at hw/vfio/pci.c:3684. That means a bad hotplug attempt can kill the whole VM instead of returning a device_add error. The reported launch command uses vfio-pci-nohotplug, so it is not affected directly, but the shared vfio-pci implementation is. Recommended fix: either reject VFIO-CXL hotplug explicitly in vfio_cxl_setup() when DEVICE(vdev)->hotplugged is true, or avoid notifier registration after machine-ready and instead call the programming function directly with normal errp propagation. |
Consider a sample QEMU config as this:
-machine virt,accel=kvm,gic-version=3,hmat=on,cxl=on,ras=on,highmem-mmio-size=4T,\ cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=256G,
cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=256G,
cxl-fmw.2.targets.0=cxl.3,cxl-fmw.2.size=256G,
cxl-fmw.3.targets.0=cxl.4,cxl-fmw.3.size=256G \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1
-device cxl-rp,port=1,bus=cxl.1,id=rport0.1,chassis=4,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.1,id=smmuv3.0,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:81:00.0,bus=rport0.1,id=dev0,iommufd=iommufd0 \
-device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2
-device cxl-rp,port=1,bus=cxl.2,id=rport0.2,chassis=5,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.2,id=smmuv3.1,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=0002:c1:00.0,bus=rport0.2,id=dev1,iommufd=iommufd0 \
-device pxb-cxl,bus_nr=28,bus=pcie.0,id=cxl.3
-device cxl-rp,port=1,bus=cxl.3,id=rport0.3,chassis=6,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.3,id=smmuv3.2,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:81:00.0,bus=rport0.3,id=dev2,iommufd=iommufd0 \
-device pxb-cxl,bus_nr=36,bus=pcie.0,id=cxl.4
-device cxl-rp,port=1,bus=cxl.4,id=rport0.4,chassis=7,pref64-reserve=2G,mem-reserve=1G \ -device arm-smmuv3,primary-bus=cxl.4,id=smmuv3.3,accel=on,ats=on,ril=on,pasid=on,oas=48 \ -device vfio-pci-nohotplug,host=000a:e1:00.0,bus=rport0.4,id=dev3,iommufd=iommufd0 \
For above config, setup_locked_hdm() programmed every firmware-committed VFIO CXL device's HDM decoder 0 (and mapped its DPA window) at the global cxl_fmws_base, which is the base of the first CXL Fixed Memory Window only. With more than one passthrough GPU -- each behind its own pxb-cxl and its own single-target CFMWS -- every device was placed at window 0's base.
In the guest this makes all endpoint decoders report the same HPA, so the CXL core can assemble only one region: the second (and later) endpoints fail to attach ("failed to find decoder mapping", -ENXIO/-EINVAL) and their device drivers never initialize.
Select the base of the window that actually targets the device instead. Walk the device's ancestor bridges up to its root pxb-cxl host bridge and match it against each window's target. Matching is by topology, not by window index or device-creation order, so an N-GPU / N-window layout places each device in its own window.
The supported scope is one VFIO-CXL device per single-target CFMWS, and selection fails closed for anything outside it -- there is no global first-window fallback. vfio_cxl_find_fmws_base() counts the matching windows and reports the matched window's base, size and target count; vfio_cxl_program_locked_hdm() then requires a unique single-target match, and otherwise reports an error that the machine_done notifier turns into a fatal startup failure (so a misconfigured device cannot boot half set up). It rejects:
Match against fw->targets[] (the target names, populated when the window object is created) rather than fw->target_hbs[] (the resolved pointers): target_hbs[] is filled in by cxl_fmws_link_targets(), which on the ARM virt machine runs from virt_machine_done() - another machine_done notifier that may execute after setup_locked_hdm(), leaving target_hbs[] NULL at this point. Resolving the target name on demand is independent of machine_done notifier ordering.
Verified with up to four GPUs passed through, each on its own single-target CFMWS: every endpoint lands in its own window with a distinct base:
/sys/bus/cxl/devices/decoder5.0/start = 0x80000000000
/sys/bus/cxl/devices/decoder6.0/start = 0x84000000000
/sys/bus/cxl/devices/decoder7.0/start = 0x88000000000
/sys/bus/cxl/devices/decoder8.0/start = 0x8c000000000
and all guest GPU drivers initialize.
RFC scope: one firmware-committed VFIO-CXL endpoint below each pxb-cxl targeted by a passthrough CFMWS. Multiple passthrough endpoints below the same host bridge would need per-endpoint decoder/window allocation and are not supported by this RFC.
Fixes: d19409f ("NVIDIA: VR: SAUCE: hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices")