Commit 867347b

sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and instead have callers manually add the flag prior to adding their structure to the queue. Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature of exclusive, priority waiters means that only the first waiter added will ever receive notifications.

Pushing the flawed behavior to callers will allow fixing the problem one hypervisor at a time (KVM added the flawed API, and then KVM's code was copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for adding an API that provides true exclusivity, i.e. that guarantees at most one priority waiter is in the queue.

Opportunistically add a comment in Hyper-V to call out the mess. Xen privcmd's irqfd_wakeup() doesn't actually operate in exclusive mode, i.e. can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE. And KVM is primed to switch to the aforementioned fully exclusive API, i.e. won't be carrying the flawed code for long.

No functional change intended.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
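The flaw described in the commit message can be illustrated with a small userspace model. This is only a sketch, not kernel code: the `model_*` names, the flag values, and the array-backed queue are assumptions made for illustration, and every waiter is assumed to accept the event. It registers two priority waiters on one queue and counts how many are notified by a single wake-up with an exclusive budget of one.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values for this model only; not the kernel's definitions. */
#define MODEL_WQ_EXCLUSIVE 0x1u
#define MODEL_WQ_PRIORITY  0x2u
#define MODEL_MAX_WAITERS  8

struct model_waiter {
	unsigned int flags;
	int notified;	/* how many times this waiter was notified */
};

struct model_queue {
	struct model_waiter *entries[MODEL_MAX_WAITERS];
	size_t count;
};

/* Insert after the last priority entry, i.e. ahead of all ordinary waiters. */
static void model_add_priority(struct model_queue *q, struct model_waiter *w)
{
	size_t pos = 0, i;

	assert(q->count < MODEL_MAX_WAITERS);
	w->flags |= MODEL_WQ_PRIORITY;
	while (pos < q->count && (q->entries[pos]->flags & MODEL_WQ_PRIORITY))
		pos++;
	for (i = q->count; i > pos; i--)
		q->entries[i] = q->entries[i - 1];
	q->entries[pos] = w;
	q->count++;
}

/*
 * Walk the queue in order and notify each waiter. Stop once nr_exclusive
 * exclusive waiters have been notified. Returns the number notified.
 */
static int model_wake_up(struct model_queue *q, int nr_exclusive)
{
	int notified = 0;
	size_t i;

	for (i = 0; i < q->count; i++) {
		struct model_waiter *w = q->entries[i];

		w->notified++;
		notified++;
		if ((w->flags & MODEL_WQ_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
	return notified;
}

/* Register two priority waiters, fire one event, report how many saw it. */
static int model_two_waiters(unsigned int extra_flags)
{
	struct model_queue q = { .count = 0 };
	struct model_waiter first = { .flags = extra_flags };
	struct model_waiter second = { .flags = extra_flags };

	model_add_priority(&q, &first);
	model_add_priority(&q, &second);
	return model_wake_up(&q, 1);
}
```

In this model, `model_two_waiters(MODEL_WQ_EXCLUSIVE)` notifies only the first waiter added, while `model_two_waiters(0)` notifies both, which is why moving the flag out of add_wait_queue_priority() lets each caller decide which behavior it actually needs.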
1 parent 86e00cd commit 867347b

4 files changed

Lines changed: 12 additions & 2 deletions

drivers/hv/mshv_eventfd.c

Lines changed: 8 additions & 0 deletions
@@ -368,6 +368,14 @@ static void mshv_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
 		container_of(polltbl, struct mshv_irqfd, irqfd_polltbl);

 	irqfd->irqfd_wqh = wqh;
+
+	/*
+	 * TODO: Ensure there isn't already an exclusive, priority waiter, e.g.
+	 * that the irqfd isn't already bound to another partition. Only the
+	 * first exclusive waiter encountered will be notified, and
+	 * add_wait_queue_priority() doesn't enforce exclusivity.
+	 */
+	irqfd->irqfd_wait.flags |= WQ_FLAG_EXCLUSIVE;
 	add_wait_queue_priority(wqh, &irqfd->irqfd_wait);
 }
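The check that the TODO asks for could look roughly like the following userspace sketch. It is not the kernel's API: `model_add_priority_exclusive()`, the flag values, and the array-backed queue are hypothetical, and the real fix may take a different shape. The idea shown is simply to refuse the registration when a priority waiter is already queued, so that at most one exclusive, priority waiter exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this model only; not the kernel's definitions. */
#define MODEL_WQ_EXCLUSIVE 0x1u
#define MODEL_WQ_PRIORITY  0x2u
#define MODEL_MAX_WAITERS  8

struct model_waiter {
	unsigned int flags;
};

struct model_queue {
	struct model_waiter *entries[MODEL_MAX_WAITERS];
	size_t count;
};

/*
 * Hypothetical helper: add w at the head as an exclusive, priority waiter,
 * but only if no priority waiter is queued yet. Priority waiters always sit
 * at the front in this model, so checking the first entry is sufficient.
 */
static bool model_add_priority_exclusive(struct model_queue *q,
					 struct model_waiter *w)
{
	size_t i;

	assert(q && w);
	if (q->count == MODEL_MAX_WAITERS)
		return false;
	if (q->count && (q->entries[0]->flags & MODEL_WQ_PRIORITY))
		return false;	/* already bound to another consumer */

	w->flags |= MODEL_WQ_EXCLUSIVE | MODEL_WQ_PRIORITY;
	for (i = q->count; i > 0; i--)
		q->entries[i] = q->entries[i - 1];
	q->entries[0] = w;
	q->count++;
	return true;
}

/* Try to bind the same queue twice; only the first attempt may succeed. */
static int model_count_successful_binds(void)
{
	struct model_queue q = { .count = 0 };
	struct model_waiter a = { 0 }, b = { 0 };
	int ok = 0;

	ok += model_add_priority_exclusive(&q, &a);
	ok += model_add_priority_exclusive(&q, &b);
	return ok;
}
```

Failing the second registration, rather than silently queueing a waiter that would never be notified, makes the misconfiguration visible to whoever tries to bind the same irqfd twice.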

drivers/xen/privcmd.c

Lines changed: 1 addition & 0 deletions
@@ -957,6 +957,7 @@ irqfd_poll_func(struct file *file, wait_queue_head_t *wqh, poll_table *pt)
 	struct privcmd_kernel_irqfd *kirqfd =
 		container_of(pt, struct privcmd_kernel_irqfd, pt);

+	kirqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;
 	add_wait_queue_priority(wqh, &kirqfd->wait);
 }

kernel/sched/wait.c

Lines changed: 2 additions & 2 deletions
@@ -40,7 +40,7 @@ void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_
 {
 	unsigned long flags;

-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	wq_entry->flags |= WQ_FLAG_PRIORITY;
 	spin_lock_irqsave(&wq_head->lock, flags);
 	__add_wait_queue(wq_head, wq_entry);
 	spin_unlock_irqrestore(&wq_head->lock, flags);
@@ -64,7 +64,7 @@ EXPORT_SYMBOL(remove_wait_queue);
  * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
  * the list and any non-exclusive tasks will be woken first. A priority task
  * may be at the head of the list, and can consume the event without any other
- * tasks being woken.
+ * tasks being woken if it's also an exclusive task.
  *
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
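The updated comment can be made concrete with a short userspace model of the wake-up walk. This is a sketch under stated assumptions, not the kernel implementation: the `model_*` names, the flag values, and the `consumes` field (standing in for a wake callback that reports success) are invented for illustration, and only the wake-one case is covered. It places one priority waiter ahead of two ordinary waiters and counts how many ordinary waiters are still woken.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values for this model only; not the kernel's definitions. */
#define MODEL_WQ_EXCLUSIVE 0x1u
#define MODEL_WQ_PRIORITY  0x2u

struct model_waiter {
	unsigned int flags;
	int consumes;	/* nonzero: the callback reports it handled the event */
	int woken;
};

/*
 * Walk waiters in list order. The walk stops early only when a waiter is
 * exclusive AND its callback reported success AND the exclusive budget
 * (nr_exclusive) is used up.
 */
static void model_wake_up(struct model_waiter *list, size_t n, int nr_exclusive)
{
	size_t i;

	assert(nr_exclusive > 0);	/* this model covers wake-one style calls only */
	for (i = 0; i < n; i++) {
		list[i].woken = 1;
		if (list[i].consumes && (list[i].flags & MODEL_WQ_EXCLUSIVE) &&
		    !--nr_exclusive)
			break;
	}
}

/* One priority waiter at the head, two ordinary waiters behind it. */
static int model_ordinary_woken(unsigned int head_flags, int head_consumes)
{
	struct model_waiter list[3] = {
		{ .flags = MODEL_WQ_PRIORITY | head_flags, .consumes = head_consumes },
		{ .flags = 0, .consumes = 1 },
		{ .flags = 0, .consumes = 1 },
	};

	model_wake_up(list, 3, 1);
	return list[1].woken + list[2].woken;
}
```

Here `model_ordinary_woken(MODEL_WQ_EXCLUSIVE, 1)` leaves both ordinary waiters asleep, whereas a priority waiter without the exclusive flag, or one that does not consume the event, lets the walk continue to them. Priority alone only determines position in the list; it is the exclusive flag that lets the head waiter swallow the event.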

virt/kvm/eventfd.c

Lines changed: 1 addition & 0 deletions
@@ -316,6 +316,7 @@ static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);

 	spin_release(&kvm->irqfds.lock.dep_map, _RET_IP_);
+	irqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;
 	add_wait_queue_priority(wqh, &irqfd->wait);
 	spin_acquire(&kvm->irqfds.lock.dep_map, 0, 0, _RET_IP_);
