Skip to content

Guard root relaxation task exceptions#1404

Closed
Rohithmatham12 wants to merge 5 commits into
NVIDIA:mainfrom
Rohithmatham12:fix-root-relaxation-task-exceptions
Closed

Guard root relaxation task exceptions#1404
Rohithmatham12 wants to merge 5 commits into
NVIDIA:mainfrom
Rohithmatham12:fix-root-relaxation-task-exceptions

Conversation

@Rohithmatham12

@Rohithmatham12 Rohithmatham12 commented Jun 8, 2026

Copy link
Copy Markdown

Summary

  • Capture exceptions thrown by the concurrent root-relaxation dual-simplex OpenMP task.
  • Signal the root concurrent halt flag when the root task fails so the companion wait loop exits.
  • Capture exceptions from the enclosing B&B OpenMP task, preempt the concurrent heuristic solver, and rethrow after the taskgroup joins.
  • Add C++ and Python regression coverage for the opportunistic-mode infeasible MIP crash path in [BUG] std::terminate (process abort) on an infeasible MILP in opportunistic (default) mode #1396.

Fixes #1396.

Testing

  • /opt/homebrew/opt/llvm/bin/clang-format -i cpp/src/branch_and_bound/branch_and_bound.cpp cpp/src/mip_heuristics/solver.cu cpp/tests/mip/termination_test.cu
  • python3 -m py_compile python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py
  • git diff --check

Signed-off-by: Rohithmatham12 <rohithmatham@gmail.com>
@Rohithmatham12 Rohithmatham12 requested a review from a team as a code owner June 8, 2026 16:45
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3b5165ec-7dfe-40a9-8b21-fa84fa35b2ac

📥 Commits

Reviewing files that changed from the base of the PR and between 4985e2b and 177dcb3.

📒 Files selected for processing (1)
  • python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py

📝 Walkthrough

Walkthrough

Concurrent root LP and B&B tasks now capture exceptions into an std::exception_ptr and set the concurrent halt flag; the stored exception is rethrown after each taskwait/taskgroup synchronization so failures surface to the caller. A test for concurrent-root infeasibility was added.

Changes

Exception propagation for concurrent root LP solve

Layer / File(s) Summary
Branch-and-bound: add header and capture exceptions in root LP task
cpp/src/branch_and_bound/branch_and_bound.cpp
Adds <exception> and wraps the OpenMP root LP worker in try/catch to store exceptions in std::exception_ptr and set the root concurrent halt flag.
Branch-and-bound: rethrow stored exception at taskwait points
cpp/src/branch_and_bound/branch_and_bound.cpp
After each #pragma omp taskwait depend(in : root_status), the stored std::exception_ptr is checked and rethrown in the crossover, subsequent-branch, and non-crossover paths.
Heuristic solver: capture/rethrow around branch_and_bound->solve
cpp/src/mip_heuristics/solver.cu
Adds <exception>, wraps branch_and_bound->solve inside the OpenMP task with try/catch to store exceptions and preempt heuristics, then rethrows after the taskgroup barrier.
Tests: add concurrent-root-infeasible problem and test
cpp/tests/mip/termination_test.cu
Adds helper to build a CSR infeasible MILP and a GoogleTest concurrent_root_infeasible_returns_status asserting solver returns mip_termination_status_t::Infeasible.
Python: regression test running solver in subprocesses
python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py
Adds a subprocess-based regression test that runs an infeasible MILP in both determinism modes and asserts the printed STATUS is Infeasible.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title directly and accurately summarizes the main change: guarding (capturing and handling) exceptions from the root relaxation OpenMP task.
Description check ✅ Passed The description is detailed and clearly explains the changes made to fix the issue: exception capture at task level, halt flag signaling, B&B task exception handling, and regression test coverage.
Linked Issues check ✅ Passed The PR addresses the core requirements of issue #1396: capturing exceptions from concurrent root-relaxation tasks, preventing silent failures, and returning Infeasible status instead of aborting [#1396]. Both C++ and Python regression tests exercise the opportunistic infeasible-MIP path.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the root relaxation exception handling issue and adding appropriate regression tests; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@Rohithmatham12

Copy link
Copy Markdown
Author

Hi maintainers, this PR is ready for review but the required Label Checker is currently blocked by missing labels. Could someone with repo permissions please add the category/breaking-status labels? I believe the appropriate labels are bug and non-breaking.

It also still needs the NVIDIA runner validation from copy-pr-bot before CI can start. Thanks!

@nguidotti nguidotti added bug Something isn't working non-breaking Introduces a non-breaking change mip labels Jun 9, 2026
@nguidotti nguidotti added this to the 26.08 milestone Jun 9, 2026
@nguidotti

Copy link
Copy Markdown
Contributor

Thanks for you contribution, @Rohithmatham12! I do not know if this will solve the #1396 completely as we are also not catching the exception on the enclosing task. We should probably investigate why it is throwing an exception in the first place. In any case, I think this is a good addition.

@nguidotti

Copy link
Copy Markdown
Contributor

/ok to test fbc6ad9

@Rohithmatham12

Copy link
Copy Markdown
Author

Thanks for the review. I added a follow-up commit that also catches exceptions from the enclosing B&B OpenMP task in mip_heuristics/solver.cu, preempts the concurrent heuristic solver before the taskgroup joins, and rethrows on the dispatching thread.

I also added a regression in termination_test.cu for an opportunistic-mode infeasible MILP shaped after #1396, so CI should now cover the expected Infeasible status instead of an abort.

@mlubin

mlubin commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution. Could you confirm that #1396 is fixed by following the repro instructions?

@Rohithmatham12

Copy link
Copy Markdown
Author

Yes, this should address #1396: the crash path was caused by exceptions escaping OpenMP tasks in the opportunistic concurrent root solve. This PR now catches both the root-relaxation task exception and the enclosing B&B task exception, stops the companion heuristic path, and rethrows safely after task synchronization. I also added a regression covering the opportunistic infeasible-MIP case so CI validates that it returns Infeasible instead of aborting.

@mlubin

mlubin commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

It should address or it does address? We can run the repro script to confirm.

@Rohithmatham12

Copy link
Copy Markdown
Author

It does address the identified crash path: exceptions from both the concurrent root-relaxation task and the enclosing B&B OpenMP task are now captured, the companion heuristic path is stopped, and the exception is rethrown only after task synchronization.

Yes, running the repro script against the latest commit would be the right end-to-end confirmation. The added C++ regression is intended to cover the same opportunistic infeasible-MIP path in CI and verify that it returns Infeasible instead of aborting.

@mlubin

mlubin commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Did you run the end-to-end confirmation?

@Rohithmatham12

Copy link
Copy Markdown
Author

No, I have not run the exact Python repro end-to-end yet. I verified the code path by adding the C++ regression for the same opportunistic infeasible-MIP case, but the Python repro should still be run against the latest commit for final confirmation.

If you can run it on the NVIDIA CI/runtime, I will follow up immediately on any failure.

@nguidotti nguidotti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm if this actually fixes the issue before we can merge this

root_vstatus_,
edge_norms_,
nullptr);
} catch (...) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the underlying exception we are catching here? It might be best to address the root cause of that exception.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that the root cause is the ideal fix if the repro exposes a specific solver exception. From the current public repro, the observable failure is process abort from an exception escaping OpenMP task execution in the opportunistic concurrent root path.

I added the exact Python repro as a subprocess-based pytest so NVIDIA CI can now validate the end-to-end path and expose whether there is still a deeper solver exception after task synchronization is fixed. If that test fails with a concrete exception/status, I will follow up by addressing that root cause rather than leaving this as only exception containment.

Signed-off-by: Rohithmatham12 <rohithmatham@gmail.com>
@Rohithmatham12 Rohithmatham12 requested a review from a team as a code owner June 9, 2026 16:23
@Rohithmatham12 Rohithmatham12 requested a review from tmckayus June 9, 2026 16:23
@Rohithmatham12

Copy link
Copy Markdown
Author

I added the #1396 Python repro as an end-to-end pytest in python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py. It runs the deterministic and opportunistic variants in separate subprocesses, matching the issue repro pattern, and asserts both return STATUS=Infeasible with exit code 0.

This should let NVIDIA GPU CI provide the requested confirmation on the latest PR commit (4985e2bfca2cbeb3add0d7a9acc0d4a0ad11fe9e).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py (1)

107-164: 💤 Low value

Consider adding a docstring to document the regression test.

This test addresses issue #1396 (opportunistic mode crashing on infeasible MILPs). A docstring would help future maintainers understand the specific crash scenario and why subprocess isolation is necessary.

📝 Example docstring
 def test_mip_opportunistic_infeasible_repro_returns_status():
+    """Regression test for issue `#1396`: infeasible MILP should return Infeasible status.
+    
+    Verifies that an infeasible MILP returns status Infeasible in both deterministic
+    and opportunistic modes, without crashing (issue `#1396` caused SIGABRT in opportunistic
+    mode). Uses subprocess isolation to catch process-level crashes.
+    """
     worker = textwrap.dedent(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py` around lines
107 - 164, Add a concise docstring at the top of the
test_mip_opportunistic_infeasible_repro_returns_status function explaining that
this is a regression test for issue `#1396` (opportunistic mode crashing on
infeasible MILPs), why the test uses a subprocess (isolation to reproduce crash
deterministically via the worker script and modes
"deterministic"/"opportunistic"), and what the expected behavior is (process
exits 0 and outputs "STATUS=Infeasible"). Reference the worker variable and the
subprocess.run invocation in the docstring so future maintainers know where to
look.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py`:
- Around line 107-164: Add a concise docstring at the top of the
test_mip_opportunistic_infeasible_repro_returns_status function explaining that
this is a regression test for issue `#1396` (opportunistic mode crashing on
infeasible MILPs), why the test uses a subprocess (isolation to reproduce crash
deterministically via the worker script and modes
"deterministic"/"opportunistic"), and what the expected behavior is (process
exits 0 and outputs "STATUS=Infeasible"). Reference the worker variable and the
subprocess.run invocation in the docstring so future maintainers know where to
look.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 62e1187d-b851-42a7-8bfe-6889d4358cd9

📥 Commits

Reviewing files that changed from the base of the PR and between c51ff79 and 4985e2b.

📒 Files selected for processing (1)
  • python/cuopt/cuopt/tests/linear_programming/test_lp_solver.py

Signed-off-by: Rohithmatham12 <rohithmatham@gmail.com>
@Rohithmatham12

Copy link
Copy Markdown
Author

Follow-up: I addressed the regression-test documentation nit in the latest commit (177dcb3e7fafa738d98ccd6a6d4a0d4967f3beb1) and replied inline on the root-cause question.

The PR now includes the #1396 Python repro as a subprocess-based pytest, so the end-to-end failure mode can be validated directly by CI: both deterministic and opportunistic runs must exit 0 and print STATUS=Infeasible rather than aborting. Could a maintainer please run validation on the latest SHA? That should provide the confirmation needed for the requested change.

@mlubin

mlubin commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

We have plans to fix this issue by addressing the root cause as discussed at #1404 (comment). I don't think it's an effective use of our CI resources or developer bandwidth for you to debug only based on CI.

@mlubin mlubin closed this Jun 9, 2026
@Rohithmatham12

Copy link
Copy Markdown
Author

Thanks for the review and for clarifying the preferred direction. Understood on not spending more CI or reviewer time on this exception-guarding PR if the team plans to address #1396 at the root cause.

I will stop iterating on this closed PR. For context, the latest branch did include the enclosing-task guard plus C++/Python regression coverage for the reported opportunistic infeasible-MIP abort path, but I agree that a root-cause fix is the better long-term resolution.

If external help would be useful once the intended root-cause approach is clear, I would be happy to follow up in #1396 or open a fresh, narrower PR aligned with that direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working mip non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] std::terminate (process abort) on an infeasible MILP in opportunistic (default) mode

4 participants