ci: enable --stepmgr for Frontier jobs to relieve slurmctld pressure by sbryngelson · Pull Request #1611 · MFlowCode/MFC

sbryngelson · 2026-06-18T21:26:02Z

What

Add #SBATCH --stepmgr to Frontier and Frontier-AMD CI job submissions, via the existing extra_sbatch hook in .github/scripts/submit-slurm-job.sh. Phoenix is unaffected.
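As a rough illustration of how a per-cluster hook like this can work, the sketch below emits the extra directive only for the two Frontier targets. The function name `extra_sbatch_for` is hypothetical and the real `submit-slurm-job.sh` may be structured differently; only the cluster names and the `#SBATCH --stepmgr` directive come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: extra_sbatch_for is an illustrative name, not a
# function taken from the MFC repository.
extra_sbatch_for() {
    local cluster="$1"
    case "$cluster" in
        frontier|frontier_amd)
            # Delegate step management to the job's own slurmstepd.
            printf '%s\n' '#SBATCH --stepmgr'
            ;;
        *)
            # Phoenix and any other cluster get no extra directives.
            :
            ;;
    esac
}

# Show what each cluster would receive in its generated batch header.
for cluster in frontier frontier_amd phoenix; do
    printf '%s: [%s]\n' "$cluster" "$(extra_sbatch_for "$cluster")"
done
```

Keeping the directive behind a cluster switch is what leaves Phoenix submissions byte-for-byte unchanged.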

Why

Each Frontier CI test/bench job is a single-node (-N 1) allocation that runs the full regression suite inside the allocation via ./mfc.sh test -a -j 32. Each of the ~560 test cases launches a separate srun per target (pre_process / simulation / post_process), with restart-roundtrip cases adding more — on the order of 1,700+ srun step-creates per job, up to 32 concurrent, all brokered by slurmctld. Across the test + bench + AMD matrix this reaches ~3,000 step-creates and congests the Frontier Slurm controller.
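The per-job estimate can be checked with a line of arithmetic. This is only a sketch using the rounded figures quoted above (about 560 cases, three targets per case); the variable names are illustrative, and restart-roundtrip cases are what push the real count past this baseline.

```shell
#!/usr/bin/env bash
# Baseline step-create estimate from the approximate figures in the description.
cases=560              # approximate number of test cases
targets_per_case=3     # pre_process, simulation, post_process
base_steps=$(( cases * targets_per_case ))
echo "baseline srun step-creates per job: $base_steps"
```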

OLCF flagged this and temporarily limited the maintainer account to one running job. --stepmgr is the mechanism OLCF recommended: it delegates step management to each job's own slurmstepd instead of routing every srun through the central controller, which is the appropriate model for many-step single-allocation workloads.

Scope

One file changed; behavior identical except for the added SBATCH directive on Frontier.
No change to the number of srun calls or to test logic — this only changes who brokers the steps.

Validation

Confirm enable_stepmgr is active on the Frontier controller (coordinating with OLCF).
Run the Frontier test + bench jobs on this branch and confirm reduced controller load.

If --stepmgr proves insufficient, follow-ups in reserve: lower in-job concurrency (-j), or coalesce the three executables into a single srun per case.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Enables Slurm step management delegation (#SBATCH --stepmgr) for Frontier and Frontier-AMD CI job submissions to reduce slurmctld load from high-volume srun step creation.

Changes:

Add #SBATCH --stepmgr via extra_sbatch for Frontier CI jobs.
Document rationale for --stepmgr in the Frontier cluster block.

codecov · 2026-06-19T00:53:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.51%. Comparing base (b4be438) to head (a116bda).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1611   +/-   ##
=======================================
  Coverage   60.51%   60.51%           
=======================================
  Files          83       83           
  Lines       19905    19905           
  Branches     2950     2950           
=======================================
  Hits        12046    12046           
  Misses       5866     5866           
  Partials     1993     1993


sbryngelson · 2026-06-19T01:17:24Z

Validated on Frontier ✅

Ran live tests on Frontier (Slurm 25.11.5) to confirm --stepmgr is accepted, actually engaged, and works end-to-end with the real test suite.

Prerequisites confirmed present controller-side:

PrologFlags = Alloc,Contain,... — Contain is the hard requirement for per-job --stepmgr.
SlurmctldParameters = ...,rl_enable,rl_bucket_size=350,rl_refill_rate=64 — the controller already RPC-rate-limits, which is what was throttling our srun bursts. --stepmgr sidesteps it by having each job's slurmstepd broker its own steps.
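To see why a startup burst runs into these limits, here is a simplified token-bucket model. It is a sketch under loud assumptions: one controller RPC per step-create, a refill period of one second, and no other traffic from the same user consuming tokens. The function name `seconds_to_admit` is illustrative and does not correspond to anything in Slurm.

```shell
#!/usr/bin/env bash
# seconds_to_admit PENDING BUCKET REFILL
# Prints how many whole seconds pass before PENDING requests have all been
# admitted by a token bucket that starts full at BUCKET tokens and regains
# REFILL tokens per second (capped at BUCKET).
seconds_to_admit() {
    local pending="$1" bucket="$2" refill="$3"
    local tokens="$bucket" seconds=0 admit
    while :; do
        # Admit as many pending requests as there are tokens available.
        admit=$(( pending < tokens ? pending : tokens ))
        pending=$(( pending - admit ))
        tokens=$(( tokens - admit ))
        if (( pending == 0 )); then
            break
        fi
        # Advance one second and refill, never exceeding the bucket size.
        seconds=$(( seconds + 1 ))
        tokens=$(( tokens + refill ))
        if (( tokens > bucket )); then
            tokens="$bucket"
        fi
    done
    echo "$seconds"
}

# A burst of 1000 requests against rl_bucket_size=350, rl_refill_rate=64.
echo "seconds to admit 1000 requests: $(seconds_to_admit 1000 350 64)"
```

Under this model the first 350 requests pass immediately and the rest drain at 64 per second, so any burst larger than the bucket is throttled. With `--stepmgr` the step-creates do not go through the controller, so they are not counted against this bucket.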

Test 1 — stepmgr engages, step bursts don't hit the controller (job `4866187`)

StepMgrEnabled=Yes
SLURM_STEPMGR env = 'frontier01755'
80 sequential srun steps done in 6s, failures=0   (rate-limit/errors: none)
16 concurrent overlapping srun steps             (errors: none)

sacct: 4866187 COMPLETED 00:00:09 0:0
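A burst check of this kind could look roughly like the sketch below. The script actually used for job 4866187 is not shown in this thread, so the helper names and the exact `srun` arguments in the comments are assumptions; the dry run at the end uses `true` as a stand-in so the sketch works outside a Slurm allocation.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of a step-burst check, not the original script.

# run_sequential_steps COUNT CMD...
# Runs CMD COUNT times back to back and prints how many invocations failed.
run_sequential_steps() {
    local count="$1"; shift
    local failures=0 i
    for (( i = 0; i < count; i++ )); do
        "$@" >/dev/null 2>&1 || failures=$(( failures + 1 ))
    done
    echo "$failures"
}

# run_concurrent_steps COUNT CMD...
# Launches COUNT copies of CMD in the background, waits for each, and prints
# how many exited with a non-zero status.
run_concurrent_steps() {
    local count="$1"; shift
    local failures=0 pids=() pid i
    for (( i = 0; i < count; i++ )); do
        "$@" >/dev/null 2>&1 &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failures=$(( failures + 1 ))
    done
    echo "$failures"
}

# Inside an allocation submitted with "#SBATCH --stepmgr", the calls might be:
#   run_sequential_steps 80 srun -N1 -n1 true
#   run_concurrent_steps 16 srun -N1 -n1 --overlap sleep 2
# Dry run with a stand-in command:
echo "sequential failures: $(run_sequential_steps 5 true)"
echo "concurrent failures: $(run_concurrent_steps 4 true)"
```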

Test 2 — real MFC suite under `--stepmgr` (job `4866242`)

Command: ./mfc.sh test -j 8 --gpu acc --no-build --only 1D -- -c frontier, run inside an allocation carrying #SBATCH --stepmgr.

StepMgrEnabled=Yes
157 passed
  7 failed

sacct: 4866242 COMPLETED 00:02:12 0:0

157/164 1D tests passed, with every srun step (~470+: syscheck + pre_process + simulation per case) brokered by slurmstepd instead of slurmctld. The 7 failures are unrelated to stepmgr — they are gpu-acc-chem/syscheck: No such file or directory, i.e. the chemistry-enabled build variant simply wasn't compiled in my local test build. Zero stepmgr / step-creation / rate-limit errors throughout.

Conclusion

#SBATCH --stepmgr is honored on Frontier and the suite runs normally with controller step-RPC pressure removed — exactly the mitigation OLCF requested. Safe to merge for the Frontier/Frontier-AMD CI paths.

sbryngelson · 2026-06-19T01:22:34Z

Production incident data (Frontier accounting)

OLCF pointed at job 4816911 as a specific offender. Accounting confirms it and shows this is the normal CI fan-out, not a one-off:

The flagged job

4816911  MFC-test-gpu-omp-1-of-2  1 node  hackathon  COMPLETED  14m22s
steps recorded: 2103

A single-node test shard that ran ~half the suite in-process and issued 2,103 srun step-creates, the bulk bursting at startup when all in-job workers launch at once — exactly the "~1000 steps in a few seconds" OLCF observed.

It runs alongside many sibling shards. In one ~35-min window (Jun 16 ~10:00–10:35), these single-node test shards were live concurrently, each hammering slurmctld:

Job	Name	Steps
4816813	MFC-test-cpu-none	4170
4816895	MFC-test-cpu-none	4170
4816911	MFC-test-gpu-omp-1-of-2	2103
4816909	MFC-test-gpu-acc-1-of-2	2100
4816880	MFC-test-gpu-acc-2-of-2	2093
4816898	MFC-test-gpu-omp-2-of-2	2093
4816858	MFC-test-cpu-none-2-of-2	2086
4816882	MFC-test-cpu-none-2-of-2	2086
4817028	MFC-test-gpu-omp-2-of-2	2086

(Build/prebuild/run jobs are only ~2 steps each — negligible. The step pressure is entirely the in-allocation test/bench runs.)
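For a sense of the aggregate, the step counts in the table can be summed directly. The sketch below just adds the nine values listed above; it makes no claim about how those steps were distributed in time within the window.

```shell
#!/usr/bin/env bash
# Step counts copied from the table above, in row order.
steps=(4170 4170 2103 2100 2093 2093 2086 2086 2086)
total=0
for s in "${steps[@]}"; do
    total=$(( total + s ))
done
echo "jobs=${#steps[@]} total_steps=$total"
```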

Why this PR covers all of it: --stepmgr is added in the shared submit-slurm-job.sh wrapper, so every one of these shards (test + bench, cpu + gpu, acc + omp) gets slurmstepd-managed steps. The live validation above (jobs 4866187, 4866242) confirms the mechanism engages and the suite still passes. This removes the controller step-RPC load that these ~2,000-step shards were generating in aggregate.

Each CI test/bench job is a single-node allocation that runs the full regression suite in-process via ./mfc.sh test, launching one srun per target (pre_process/simulation/post_process) for each of ~560 cases, which comes to roughly 1,700+ srun step-creates per job with up to 32 concurrent. This congests the Frontier Slurm controller. --stepmgr delegates step management to each job's slurmstepd instead of routing every srun through slurmctld, which is the appropriate mechanism for many-step single-allocation workloads. It is added via the existing extra_sbatch hook for both frontier and frontier_amd; Phoenix is unaffected.

Copilot AI review requested due to automatic review settings · June 18, 2026 21:26

Copilot AI reviewed · Jun 18, 2026 (comment thread on .github/scripts/submit-slurm-job.sh)

Copilot started reviewing on behalf of sbryngelson · June 18, 2026 22:01

sbryngelson force-pushed the ci/frontier-stepmgr branch from 59fe0c7 to a116bda · June 19, 2026 01:30

sbryngelson merged commit 22d6e1f into master Jun 20, 2026
94 of 98 checks passed

sbryngelson deleted the ci/frontier-stepmgr branch June 20, 2026 14:31
