Fix deadlock in beginJob when compiled debug #269

Open
Copilot wants to merge 2 commits into main from copilot/eventdisplay-hang-beginjob-debug

Copilot AI commented May 29, 2026

Debug-compiled EventDisplay deadlocks in beginJob() at cv_.wait(lock). GDB shows only 2 threads with the app thread apparently never reaching its event loop.

Root cause: ABBA lock ordering between m_ (held by beginJob) and gSystemMutex (acquired inside XThreadTimer constructor via R__LOCKGUARD2). The app thread holds gSystemMutex in gSystem->ProcessEvents() while timer callbacks attempt to acquire m_. Debug timing makes this race deterministic. Additionally, bare cv_.wait() without a predicate loses notifications if the signal arrives before the wait begins.

Changes:

  • Restructured beginJob() to only hold m_ during cv_.wait(), not during XThreadTimer construction — eliminates lock ordering cycle
  • Added appStarted_ and eveSetupDone_ predicate flags for the two synchronization points
  • Switched to cv_.wait(lock, predicate) — handles both missed notifications and spurious wakeups
  • signalAppStart() and setup_eve() set their respective flags under lock before notifying

```cpp
// Before: holds m_ across timer creation → lock ordering deadlock
std::unique_lock lock{m_};
appThread_ = std::thread{[this] { run_application(); }};
XThreadTimer sut([this]{ signalAppStart(); });  // acquires gSystemMutex
cv_.wait(lock);

// After: m_ only held for the wait
appThread_ = std::thread{[this] { run_application(); }};
XThreadTimer sut([this]{ signalAppStart(); });
{
    std::unique_lock lock{m_};
    cv_.wait(lock, [this]{ return appStarted_; });
}
```
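
The missed-notification half of the description can be checked in isolation. The following is a minimal sketch, not the module's code: `StartGate`, `waitForAppStart()`, and `notifyBeforeWaitCompletes()` are hypothetical names, and only `m_`, `cv_`, `appStarted_`, and `signalAppStart()` are taken from the description. It sends the notification before the wait begins, which is the ordering that would leave a bare `cv_.wait(lock)` blocked.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the module's synchronization state.
struct StartGate {
  std::mutex m_;
  std::condition_variable cv_;
  bool appStarted_ = false;

  // Signaling side: set the flag under the lock, then notify.
  void signalAppStart() {
    {
      std::lock_guard<std::mutex> lock{m_};
      appStarted_ = true;
    }
    cv_.notify_all();
  }

  // Waiting side: the predicate is checked before blocking and again
  // after every wakeup, so an early signal or a spurious wakeup is safe.
  void waitForAppStart() {
    std::unique_lock<std::mutex> lock{m_};
    cv_.wait(lock, [this] { return appStarted_; });
    assert(appStarted_);  // wait() only returns once the predicate holds
  }
};

// Signal first, wait second.
bool notifyBeforeWaitCompletes() {
  StartGate gate;
  std::thread app{[&gate] { gate.signalAppStart(); }};
  app.join();              // the notification has already been sent
  gate.waitForAppStart();  // returns at once because the flag is set
  return gate.appStarted_;
}
```

Calling `notifyBeforeWaitCompletes()` returns `true` rather than hanging, because the waiter consults the flag instead of depending on catching the notification itself.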

The deadlock was caused by two issues:
1. Lock ordering: beginJob() held m_ while creating XThreadTimer (which
   acquires gSystemMutex internally), while the app thread could hold
   gSystemMutex (inside ProcessEvents) and try to acquire m_ via a timer
   callback - classic ABBA deadlock.
2. Bare cv_.wait() without predicate: susceptible to missed notifications
   if the signal was sent before the wait began.

Fix:
- Add appStarted_ and eveSetupDone_ boolean state flags
- Use predicate-based cv_.wait() to handle missed notifications
- Don't hold m_ during XThreadTimer creation to avoid lock ordering issue
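
The lock-ordering rule in the commit message can be sketched without ROOT. This is an illustration under stated assumptions, not the EventDisplay implementation: `systemMutex` is a hypothetical stand-in for `gSystemMutex`, `createTimer()` stands in for constructing an `XThreadTimer`, and `appThreadBody()` mimics an app thread that holds the system lock while running a callback that needs `m`. It shows only the ordering rule; it makes no claim about the behavior of the real module.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex systemMutex;  // hypothetical stand-in for gSystemMutex
std::mutex m;            // plays the role of m_
std::condition_variable cv;
bool appStarted = false;

// Stand-in for timer construction: takes the system lock briefly.
void createTimer() {
  std::lock_guard<std::mutex> sys{systemMutex};
}

// Stand-in for the app thread: holds the system lock, then takes m
// to set the flag, as a timer callback would.
void appThreadBody() {
  std::lock_guard<std::mutex> sys{systemMutex};
  {
    std::lock_guard<std::mutex> lock{m};
    appStarted = true;
  }
  cv.notify_all();
}

bool startWithoutNestedLocks() {
  std::thread app{appThreadBody};
  createTimer();  // m is not held here, so the order m -> systemMutex never occurs
  {
    std::unique_lock<std::mutex> lock{m};
    cv.wait(lock, [] { return appStarted; });  // m is released while blocked
  }
  app.join();
  return appStarted;
}
```

In this sketch the only nested acquisition is `systemMutex` then `m`, on the app thread. The starting thread never acquires `systemMutex` while holding `m`, so the two threads cannot form a cycle.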
Copilot AI changed the title from "[WIP] Fix EventDisplay hang in beginJob when compiled debug" to "Fix deadlock in beginJob when compiled debug" May 29, 2026
Copilot AI requested a review from oksuzian May 29, 2026 18:21
@brownd1978 brownd1978 marked this pull request as ready for review May 29, 2026 20:03
Contributor

@brownd1978 brownd1978 left a comment

This PR does not change the behavior: Mu2eEventDisplay compiled debug still hangs in beginJob at line 273.

Development

Successfully merging this pull request may close these issues.

EventDisplay hang in beginjob when compiled debug

3 participants