Skip to content

Commit 5ddcb50

Browse files
matt-auldgregkh
authored andcommitted
drm/xe/guc_submit: fix race around suspend_pending
commit 87651f3 upstream. Currently in some testcases we can trigger: xe 0000:03:00.0: [drm] Assertion `exec_queue_destroyed(q)` failed! .... WARNING: CPU: 18 PID: 2640 at drivers/gpu/drm/xe/xe_guc_submit.c:1826 xe_guc_sched_done_handler+0xa54/0xef0 [xe] xe 0000:03:00.0: [drm] *ERROR* GT1: DEREGISTER_DONE: Unexpected engine state 0x00a1, guc_id=57 Looking at a snippet of corresponding ftrace for this GuC id we can see: 162.673311: xe_sched_msg_add: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3 162.673317: xe_sched_msg_recv: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3 162.673319: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0 162.674089: xe_exec_queue_kill: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0 162.674108: xe_exec_queue_close: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0 162.674488: xe_exec_queue_scheduling_done: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0 162.678452: xe_exec_queue_deregister: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa1, flags=0x0 It looks like we try to suspend the queue (opcode=3), setting suspend_pending and triggering a disable_scheduling. The user then closes the queue. However the close will also forcefully signal the suspend fence after killing the queue, later when the G2H response for disable_scheduling comes back we have now cleared suspend_pending when signalling the suspend fence, so the disable_scheduling now incorrectly tries to also deregister the queue. This leads to warnings since the queue has yet to even be marked for destruction. We also seem to trigger errors later with trying to double unregister the same queue. To fix this tweak the ordering when handling the response to ensure we don't race with a disable_scheduling that didn't actually intend to perform an unregister. The destruction path should now also correctly wait for any pending_disable before marking as destroyed. Fixes: dd08ebf ("drm/xe: Introduce a new DRM driver for Intel GPUs") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3371 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241122161914.321263-6-matthew.auld@intel.com (cherry picked from commit f161809) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1 parent 1c052c6 commit 5ddcb50

1 file changed

Lines changed: 15 additions & 2 deletions

File tree

drivers/gpu/drm/xe/xe_guc_submit.c

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1846,16 +1846,29 @@ static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
18461846
xe_gt_assert(guc_to_gt(guc), runnable_state == 0);
18471847
xe_gt_assert(guc_to_gt(guc), exec_queue_pending_disable(q));
18481848

1849-
clear_exec_queue_pending_disable(q);
18501849
if (q->guc->suspend_pending) {
18511850
suspend_fence_signal(q);
1851+
clear_exec_queue_pending_disable(q);
18521852
} else {
18531853
if (exec_queue_banned(q) || check_timeout) {
18541854
smp_wmb();
18551855
wake_up_all(&guc->ct.wq);
18561856
}
1857-
if (!check_timeout)
1857+
if (!check_timeout && exec_queue_destroyed(q)) {
1858+
/*
1859+
* Make sure to clear the pending_disable only
1860+
* after sampling the destroyed state. We want
1861+
* to ensure we don't trigger the unregister too
1862+
* early with something intending to only
1863+
* disable scheduling. The caller doing the
1864+
* destroy must wait for an ongoing
1865+
* pending_disable before marking as destroyed.
1866+
*/
1867+
clear_exec_queue_pending_disable(q);
18581868
deregister_exec_queue(guc, q);
1869+
} else {
1870+
clear_exec_queue_pending_disable(q);
1871+
}
18591872
}
18601873
}
18611874
}

0 commit comments

Comments
 (0)