Guard root relaxation task exceptions#1404
Conversation
Signed-off-by: Rohithmatham12 <rohithmatham@gmail.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughConcurrent root LP and B&B tasks now capture exceptions into an std::exception_ptr and set the concurrent halt flag; the stored exception is rethrown after each taskwait/taskgroup synchronization so failures surface to the caller. A test for concurrent-root infeasibility was added. ChangesException propagation for concurrent root LP solve
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Hi maintainers, this PR is ready for review but the required Label Checker is currently blocked by missing labels. Could someone with repo permissions please add the category/breaking-status labels? I believe the appropriate labels are It also still needs the NVIDIA runner validation from copy-pr-bot before CI can start. Thanks! |
|
Thanks for you contribution, @Rohithmatham12! I do not know if this will solve the #1396 completely as we are also not catching the exception on the enclosing task. We should probably investigate why it is throwing an exception in the first place. In any case, I think this is a good addition. |
|
/ok to test fbc6ad9 |
Signed-off-by: Rohithmatham12 <rohithmatham@gmail.com>
|
Thanks for the review. I added a follow-up commit that also catches exceptions from the enclosing B&B OpenMP task in I also added a regression in |
|
Thanks for the contribution. Could you confirm that #1396 is fixed by following the repro instructions? |
|
Yes, this should address #1396: the crash path was caused by exceptions escaping OpenMP tasks in the opportunistic concurrent root solve. This PR now catches both the root-relaxation task exception and the enclosing B&B task exception, stops the companion heuristic path, and rethrows safely after task synchronization. I also added a regression covering the opportunistic infeasible-MIP case so CI validates that it returns |
|
It should address or it does address? We can run the repro script to confirm. |
|
It does address the identified crash path: exceptions from both the concurrent root-relaxation task and the enclosing B&B OpenMP task are now captured, the companion heuristic path is stopped, and the exception is rethrown only after task synchronization. Yes, running the repro script against the latest commit would be the right end-to-end confirmation. The added C++ regression is intended to cover the same opportunistic infeasible-MIP path in CI and verify that it returns |
|
Did you run the end-to-end confirmation? |
|
No, I have not run the exact Python repro end-to-end yet. I verified the code path by adding the C++ regression for the same opportunistic infeasible-MIP case, but the Python repro should still be run against the latest commit for final confirmation. If you can run it on the NVIDIA CI/runtime, I will follow up immediately on any failure. |
| root_vstatus_, | ||
| edge_norms_, | ||
| nullptr); | ||
| } catch (...) { |
There was a problem hiding this comment.
What is the underlying exception we are catching here? It might be best to address the root cause of that exception.
There was a problem hiding this comment.
Agreed that the root cause is the ideal fix if the repro exposes a specific solver exception. From the current public repro, the observable failure is process abort from an exception escaping OpenMP task execution in the opportunistic concurrent root path.
I added the exact Python repro as a subprocess-based pytest so NVIDIA CI can now validate the end-to-end path and expose whether there is still a deeper solver exception after task synchronization is fixed. If that test fails with a concrete exception/status, I will follow up by addressing that root cause rather than leaving this as only exception containment.
Signed-off-by: Rohithmatham12 <rohithmatham@gmail.com>
|
I added the #1396 Python repro as an end-to-end pytest in This should let NVIDIA GPU CI provide the requested confirmation on the latest PR commit ( |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py (1)
107-164: 💤 Low valueConsider adding a docstring to document the regression test.
This test addresses issue
#1396(opportunistic mode crashing on infeasible MILPs). A docstring would help future maintainers understand the specific crash scenario and why subprocess isolation is necessary.📝 Example docstring
def test_mip_opportunistic_infeasible_repro_returns_status(): + """Regression test for issue `#1396`: infeasible MILP should return Infeasible status. + + Verifies that an infeasible MILP returns status Infeasible in both deterministic + and opportunistic modes, without crashing (issue `#1396` caused SIGABRT in opportunistic + mode). Uses subprocess isolation to catch process-level crashes. + """ worker = textwrap.dedent(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py` around lines 107 - 164, Add a concise docstring at the top of the test_mip_opportunistic_infeasible_repro_returns_status function explaining that this is a regression test for issue `#1396` (opportunistic mode crashing on infeasible MILPs), why the test uses a subprocess (isolation to reproduce crash deterministically via the worker script and modes "deterministic"/"opportunistic"), and what the expected behavior is (process exits 0 and outputs "STATUS=Infeasible"). Reference the worker variable and the subprocess.run invocation in the docstring so future maintainers know where to look.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py`:
- Around line 107-164: Add a concise docstring at the top of the
test_mip_opportunistic_infeasible_repro_returns_status function explaining that
this is a regression test for issue `#1396` (opportunistic mode crashing on
infeasible MILPs), why the test uses a subprocess (isolation to reproduce crash
deterministically via the worker script and modes
"deterministic"/"opportunistic"), and what the expected behavior is (process
exits 0 and outputs "STATUS=Infeasible"). Reference the worker variable and the
subprocess.run invocation in the docstring so future maintainers know where to
look.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 62e1187d-b851-42a7-8bfe-6889d4358cd9
📒 Files selected for processing (1)
python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py
Signed-off-by: Rohithmatham12 <rohithmatham@gmail.com>
|
Follow-up: I addressed the regression-test documentation nit in the latest commit ( The PR now includes the #1396 Python repro as a subprocess-based pytest, so the end-to-end failure mode can be validated directly by CI: both deterministic and opportunistic runs must exit 0 and print |
|
We have plans to fix this issue by addressing the root cause as discussed at #1404 (comment). I don't think it's an effective use of our CI resources or developer bandwidth for you to debug only based on CI. |
|
Thanks for the review and for clarifying the preferred direction. Understood on not spending more CI or reviewer time on this exception-guarding PR if the team plans to address #1396 at the root cause. I will stop iterating on this closed PR. For context, the latest branch did include the enclosing-task guard plus C++/Python regression coverage for the reported opportunistic infeasible-MIP abort path, but I agree that a root-cause fix is the better long-term resolution. If external help would be useful once the intended root-cause approach is clear, I would be happy to follow up in #1396 or open a fresh, narrower PR aligned with that direction. |
Summary
Fixes #1396.
Testing
/opt/homebrew/opt/llvm/bin/clang-format -i cpp/src/branch_and_bound/branch_and_bound.cpp cpp/src/mip_heuristics/solver.cu cpp/tests/mip/termination_test.cupython3 -m py_compile python/cuopt/cuopt/tests/linear_programming/test_lp_solver.pygit diff --check