Skip to content

Commit bfe9e31

Browse files
matt-auldrodrigovivi
authored andcommitted
drm/xe: always keep track of remap prev/next
During 3D workload, user is reporting hitting: [ 413.361679] WARNING: drivers/gpu/drm/xe/xe_vm.c:1217 at vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe], CPU#7: vkd3d_queue/9925 [ 413.361944] CPU: 7 UID: 1000 PID: 9925 Comm: vkd3d_queue Kdump: loaded Not tainted 7.0.0-070000rc3-generic #202603090038 PREEMPT(lazy) [ 413.361949] RIP: 0010:vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe] [ 413.362074] RSP: 0018:ffffd4c25c3df930 EFLAGS: 00010282 [ 413.362077] RAX: 0000000000000000 RBX: ffff8f3ee817ed10 RCX: 0000000000000000 [ 413.362078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 413.362079] RBP: ffffd4c25c3df980 R08: 0000000000000000 R09: 0000000000000000 [ 413.362081] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f41fbf99380 [ 413.362082] R13: ffff8f3ee817e968 R14: 00000000ffffffef R15: ffff8f43d00bd380 [ 413.362083] FS: 00000001040ff6c0(0000) GS:ffff8f4696d89000(0000) knlGS:00000000330b0000 [ 413.362085] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 413.362086] CR2: 00007ddfc4747000 CR3: 00000002e6262005 CR4: 0000000000f72ef0 [ 413.362088] PKRU: 55555554 [ 413.362089] Call Trace: [ 413.362092] <TASK> [ 413.362096] xe_vm_bind_ioctl+0xa9a/0xc60 [xe] Which seems to hint that the vma we are re-inserting for the ops unwind is either invalid or overlapping with something already inserted in the vm. It shouldn't be invalid since this is a re-insertion, so must have worked before. Leaving the likely culprit as something already placed where we want to insert the vma. Following from that, for the case where we do something like a rebind in the middle of a vma, and one or both mapped ends are already compatible, we skip doing the rebind of those vma and set next/prev to NULL. As well as then adjust the original unmap va range, to avoid unmapping the ends. However, if we trigger the unwind path, we end up with three va, with the two ends never being removed and the original va range in the middle still being the shrunken size. If this occurs, one failure mode is when another unwind op needs to interact with that range, which can happen with a vector of binds. For example, if we need to re-insert something in place of the original va. In this case the va is still the shrunken version, so when removing it and then doing a re-insert it can overlap with the ends, which were never removed, triggering a warning like above, plus leaving the vm in a bad state. With that, we need two things here: 1) Stop nuking the prev/next tracking for the skip cases. Instead relying on checking for skip prev/next, where needed. That way on the unwind path, we now correctly remove both ends. 2) Undo the unmap va shrinkage, on the unwind path. With the two ends now removed the unmap va should expand back to the original size again, before re-insertion. v2: - Update the explanation in the commit message, based on an actual IGT of triggering this issue, rather than conjecture. - Also undo the unmap shrinkage, for the skip case. With the two ends now removed, the original unmap va range should expand back to the original range. v3: - Track the old start/range separately. vma_size/start() uses the va info directly. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7602 Fixes: 8f33b4f ("drm/xe: Avoid doing rebinds") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Matthew Brost <[email protected]> Link: https://patch.msgid.link/[email protected] (cherry picked from commit aec6969) Signed-off-by: Rodrigo Vivi <[email protected]>
1 parent 56781a4 commit bfe9e31

3 files changed

Lines changed: 28 additions & 10 deletions

File tree

drivers/gpu/drm/xe/xe_pt.c

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1442,9 +1442,9 @@ static int op_check_svm_userptr(struct xe_vm *vm, struct xe_vma_op *op,
14421442
err = vma_check_userptr(vm, op->map.vma, pt_update);
14431443
break;
14441444
case DRM_GPUVA_OP_REMAP:
1445-
if (op->remap.prev)
1445+
if (op->remap.prev && !op->remap.skip_prev)
14461446
err = vma_check_userptr(vm, op->remap.prev, pt_update);
1447-
if (!err && op->remap.next)
1447+
if (!err && op->remap.next && !op->remap.skip_next)
14481448
err = vma_check_userptr(vm, op->remap.next, pt_update);
14491449
break;
14501450
case DRM_GPUVA_OP_UNMAP:
@@ -2198,12 +2198,12 @@ static int op_prepare(struct xe_vm *vm,
21982198

21992199
err = unbind_op_prepare(tile, pt_update_ops, old);
22002200

2201-
if (!err && op->remap.prev) {
2201+
if (!err && op->remap.prev && !op->remap.skip_prev) {
22022202
err = bind_op_prepare(vm, tile, pt_update_ops,
22032203
op->remap.prev, false);
22042204
pt_update_ops->wait_vm_bookkeep = true;
22052205
}
2206-
if (!err && op->remap.next) {
2206+
if (!err && op->remap.next && !op->remap.skip_next) {
22072207
err = bind_op_prepare(vm, tile, pt_update_ops,
22082208
op->remap.next, false);
22092209
pt_update_ops->wait_vm_bookkeep = true;
@@ -2428,10 +2428,10 @@ static void op_commit(struct xe_vm *vm,
24282428

24292429
unbind_op_commit(vm, tile, pt_update_ops, old, fence, fence2);
24302430

2431-
if (op->remap.prev)
2431+
if (op->remap.prev && !op->remap.skip_prev)
24322432
bind_op_commit(vm, tile, pt_update_ops, op->remap.prev,
24332433
fence, fence2, false);
2434-
if (op->remap.next)
2434+
if (op->remap.next && !op->remap.skip_next)
24352435
bind_op_commit(vm, tile, pt_update_ops, op->remap.next,
24362436
fence, fence2, false);
24372437
break;

drivers/gpu/drm/xe/xe_vm.c

Lines changed: 18 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2554,7 +2554,6 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
25542554
if (!err && op->remap.skip_prev) {
25552555
op->remap.prev->tile_present =
25562556
tile_present;
2557-
op->remap.prev = NULL;
25582557
}
25592558
}
25602559
if (op->remap.next) {
@@ -2564,11 +2563,13 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
25642563
if (!err && op->remap.skip_next) {
25652564
op->remap.next->tile_present =
25662565
tile_present;
2567-
op->remap.next = NULL;
25682566
}
25692567
}
25702568

2571-
/* Adjust for partial unbind after removing VMA from VM */
2569+
/*
2570+
* Adjust for partial unbind after removing VMA from VM. In case
2571+
* of unwind we might need to undo this later.
2572+
*/
25722573
if (!err) {
25732574
op->base.remap.unmap->va->va.addr = op->remap.start;
25742575
op->base.remap.unmap->va->va.range = op->remap.range;
@@ -2687,6 +2688,8 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
26872688

26882689
op->remap.start = xe_vma_start(old);
26892690
op->remap.range = xe_vma_size(old);
2691+
op->remap.old_start = op->remap.start;
2692+
op->remap.old_range = op->remap.range;
26902693

26912694
flags |= op->base.remap.unmap->va->flags & XE_VMA_CREATE_MASK;
26922695
if (op->base.remap.prev) {
@@ -2835,8 +2838,19 @@ static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
28352838
xe_svm_notifier_lock(vm);
28362839
vma->gpuva.flags &= ~XE_VMA_DESTROYED;
28372840
xe_svm_notifier_unlock(vm);
2838-
if (post_commit)
2841+
if (post_commit) {
2842+
/*
2843+
* Restore the old va range, in case of the
2844+
* prev/next skip optimisation. Otherwise what
2845+
* we re-insert here could be smaller than the
2846+
* original range.
2847+
*/
2848+
op->base.remap.unmap->va->va.addr =
2849+
op->remap.old_start;
2850+
op->base.remap.unmap->va->va.range =
2851+
op->remap.old_range;
28392852
xe_vm_insert_vma(vm, vma);
2853+
}
28402854
}
28412855
break;
28422856
}

drivers/gpu/drm/xe/xe_vm_types.h

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -373,6 +373,10 @@ struct xe_vma_op_remap {
373373
u64 start;
374374
/** @range: range of the VMA unmap */
375375
u64 range;
376+
/** @old_start: Original start of the VMA we unmap */
377+
u64 old_start;
378+
/** @old_range: Original range of the VMA we unmap */
379+
u64 old_range;
376380
/** @skip_prev: skip prev rebind */
377381
bool skip_prev;
378382
/** @skip_next: skip next rebind */

0 commit comments

Comments
 (0)