Skip to content

Commit b4d37cd

Browse files
committed
KVM: Don't clobber irqfd routing type when deassigning irqfd
When deassigning a KVM_IRQFD, don't clobber the irqfd's copy of the IRQ's routing entry as doing so breaks kvm_arch_irq_bypass_del_producer() on x86 and arm64, which explicitly look for KVM_IRQ_ROUTING_MSI. Instead, to handle a concurrent routing update, verify that the irqfd is still active before consuming the routing information. As evidenced by the x86 and arm64 bugs, and another bug in kvm_arch_update_irqfd_routing() (see below), clobbering the entry type without notifying arch code is surprising and error prone. As a bonus, checking that the irqfd is active provides a convenient location for documenting _why_ KVM must not consume the routing entry for an irqfd that is in the process of being deassigned: once the irqfd is deleted from the list (which happens *before* the eventfd is detached), it will no longer receive updates via kvm_irq_routing_update(), and so KVM could deliver an event using stale routing information (relative to KVM_SET_GSI_ROUTING returning to userspace). As an even better bonus, explicitly checking for the irqfd being active fixes a similar bug to the one the clobbering is trying to prevent: if an irqfd is deactivated, and then its routing is changed, kvm_irq_routing_update() won't invoke kvm_arch_update_irqfd_routing() (because the irqfd isn't in the list). And so if the irqfd is in bypass mode, IRQs will continue to be posted using the old routing information. As for kvm_arch_irq_bypass_del_producer(), clobbering the routing type results in KVM incorrectly keeping the IRQ in bypass mode, which is especially problematic on AMD as KVM tracks IRQs that are being posted to a vCPU in a list whose lifetime is tied to the irqfd. Without the help of KASAN to detect use-after-free, the most common sympton on AMD is a NULL pointer deref in amd_iommu_update_ga() due to the memory for irqfd structure being re-allocated and zeroed, resulting in irqfd->irq_bypass_data being NULL when read by avic_update_iommu_vcpu_affinity(): BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 40cf2b9067 P4D 40cf2b9067 PUD 408362a067 PMD 0 Oops: Oops: 0000 [#1] SMP CPU: 6 UID: 0 PID: 40383 Comm: vfio_irq_test Tainted: G U W O 6.19.0-smp--5dddc257e6b2-irqfd #31 NONE Tainted: [U]=USER, [W]=WARN, [O]=OOT_MODULE Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.78.2-0 09/05/2025 RIP: 0010:amd_iommu_update_ga+0x19/0xe0 Call Trace: <TASK> avic_update_iommu_vcpu_affinity+0x3d/0x90 [kvm_amd] __avic_vcpu_load+0xf4/0x130 [kvm_amd] kvm_arch_vcpu_load+0x89/0x210 [kvm] vcpu_load+0x30/0x40 [kvm] kvm_arch_vcpu_ioctl_run+0x45/0x620 [kvm] kvm_vcpu_ioctl+0x571/0x6a0 [kvm] __se_sys_ioctl+0x6d/0xb0 do_syscall_64+0x6f/0x9d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x46893b </TASK> ---[ end trace 0000000000000000 ]--- If AVIC is inhibited when the irfd is deassigned, the bug will manifest as list corruption, e.g. on the next irqfd assignment. list_add corruption. next->prev should be prev (ffff8d474d5cd588), but was 0000000000000000. (next=ffff8d8658f86530). ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:31! Oops: invalid opcode: 0000 [#1] SMP CPU: 128 UID: 0 PID: 80818 Comm: vfio_irq_test Tainted: G U W O 6.19.0-smp--f19dc4d680ba-irqfd #28 NONE Tainted: [U]=USER, [W]=WARN, [O]=OOT_MODULE Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.78.2-0 09/05/2025 RIP: 0010:__list_add_valid_or_report+0x97/0xc0 Call Trace: <TASK> avic_pi_update_irte+0x28e/0x2b0 [kvm_amd] kvm_pi_update_irte+0xbf/0x190 [kvm] kvm_arch_irq_bypass_add_producer+0x72/0x90 [kvm] irq_bypass_register_consumer+0xcd/0x170 [irqbypass] kvm_irqfd+0x4c6/0x540 [kvm] kvm_vm_ioctl+0x118/0x5d0 [kvm] __se_sys_ioctl+0x6d/0xb0 do_syscall_64+0x6f/0x9d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> ---[ end trace 0000000000000000 ]--- On Intel and arm64, the bug is less noisy, as the end result is that the device keeps posting IRQs to the vCPU even after it's been deassigned. Note, the worst of the breakage can be traced back to commit cb21073 ("KVM: Pass new routing entries and irqfd when updating IRTEs"), as before that commit KVM would pull the routing information from the per-VM routing table. But as above, similar bugs have existed since support for IRQ bypass was added. E.g. if a routing change finished before irq_shutdown() invoked kvm_arch_irq_bypass_del_producer(), VMX and SVM would see stale routing information and potentially leave the irqfd in bypass mode. Alternatively, x86 could be fixed by explicitly checking irq_bypass_vcpu instead of irq_entry.type in kvm_arch_irq_bypass_del_producer(), and arm64 could be modified to utilize irq_bypass_vcpu in a similar manner. But (a) that wouldn't fix the routing updates bug, and (b) fixing core code doesn't preclude x86 (or arm64) from adding such code as a sanity check (spoiler alert). Fixes: f70c20a ("KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'") Fixes: cb21073 ("KVM: Pass new routing entries and irqfd when updating IRTEs") Fixes: a0d7e2f ("KVM: arm64: vgic-v4: Only attempt vLPI mapping for actual MSIs") Cc: [email protected] Cc: Marc Zyngier <[email protected]> Cc: Oliver Upton <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
1 parent cfbe371 commit b4d37cd

1 file changed

Lines changed: 24 additions & 20 deletions

File tree

virt/kvm/eventfd.c

Lines changed: 24 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -157,21 +157,28 @@ irqfd_shutdown(struct work_struct *work)
157157
}
158158

159159

160-
/* assumes kvm->irqfds.lock is held */
161-
static bool
162-
irqfd_is_active(struct kvm_kernel_irqfd *irqfd)
160+
static bool irqfd_is_active(struct kvm_kernel_irqfd *irqfd)
163161
{
162+
/*
163+
* Assert that either irqfds.lock or SRCU is held, as irqfds.lock must
164+
* be held to prevent false positives (on the irqfd being active), and
165+
* while false negatives are impossible as irqfds are never added back
166+
* to the list once they're deactivated, the caller must at least hold
167+
* SRCU to guard against routing changes if the irqfd is deactivated.
168+
*/
169+
lockdep_assert_once(lockdep_is_held(&irqfd->kvm->irqfds.lock) ||
170+
srcu_read_lock_held(&irqfd->kvm->irq_srcu));
171+
164172
return list_empty(&irqfd->list) ? false : true;
165173
}
166174

167175
/*
168176
* Mark the irqfd as inactive and schedule it for removal
169-
*
170-
* assumes kvm->irqfds.lock is held
171177
*/
172-
static void
173-
irqfd_deactivate(struct kvm_kernel_irqfd *irqfd)
178+
static void irqfd_deactivate(struct kvm_kernel_irqfd *irqfd)
174179
{
180+
lockdep_assert_held(&irqfd->kvm->irqfds.lock);
181+
175182
BUG_ON(!irqfd_is_active(irqfd));
176183

177184
list_del_init(&irqfd->list);
@@ -217,8 +224,15 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
217224
seq = read_seqcount_begin(&irqfd->irq_entry_sc);
218225
irq = irqfd->irq_entry;
219226
} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
220-
/* An event has been signaled, inject an interrupt */
221-
if (kvm_arch_set_irq_inatomic(&irq, kvm,
227+
228+
/*
229+
* An event has been signaled, inject an interrupt unless the
230+
* irqfd is being deassigned (isn't active), in which case the
231+
* routing information may be stale (once the irqfd is removed
232+
* from the list, it will stop receiving routing updates).
233+
*/
234+
if (unlikely(!irqfd_is_active(irqfd)) ||
235+
kvm_arch_set_irq_inatomic(&irq, kvm,
222236
KVM_USERSPACE_IRQ_SOURCE_ID, 1,
223237
false) == -EWOULDBLOCK)
224238
schedule_work(&irqfd->inject);
@@ -585,18 +599,8 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
585599
spin_lock_irq(&kvm->irqfds.lock);
586600

587601
list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
588-
if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi) {
589-
/*
590-
* This clearing of irq_entry.type is needed for when
591-
* another thread calls kvm_irq_routing_update before
592-
* we flush workqueue below (we synchronize with
593-
* kvm_irq_routing_update using irqfds.lock).
594-
*/
595-
write_seqcount_begin(&irqfd->irq_entry_sc);
596-
irqfd->irq_entry.type = 0;
597-
write_seqcount_end(&irqfd->irq_entry_sc);
602+
if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi)
598603
irqfd_deactivate(irqfd);
599-
}
600604
}
601605

602606
spin_unlock_irq(&kvm->irqfds.lock);

0 commit comments

Comments
 (0)